Introduction: Personal Dataspace Management
The explosion of digital information has made Personal Dataspace Management (PDSM) a hot research topic. PDSM focuses on the management of personal digital information. A great amount of new data is created on the web every year, most of it unstructured and in a variety of formats such as email, images, HTML, XML, audio, and video. People can easily share this data over the Internet, so the amount of personal data is growing rapidly. In contrast, the time and capacity people have for managing information are fixed and limited, so improving the efficiency of PDSM has become an important problem. Promising and interesting research problems include the PDS model, the user attention model, data integration, data query and indexing, and privacy and security.
Motivation

We are now experiencing an information explosion from web pages, emails, files, contacts, blogs, wikis, mobile SMS, and more, but we have limited time and energy to manage them. How to manage diverse, heterogeneous, personalized information efficiently, so that the right information can be found in time, has become a challenging research field. For example, it often takes a person a long time to locate a specific data item saved earlier. Personal data items have the following characteristics: large volume, distributed storage, and evolution over time, all of which make it hard for a person to manage a personal dataspace efficiently.

Research work
  • Personal Dataspace
  • CoreSpace Framework on Personal Dataspace
    The explosion of digital information makes Personal Dataspace Management (PDSM) an important research area, and the characteristics of personal data, such as distribution and heterogeneity, make it very challenging. Although some existing work addresses the problem, most of it ignores the importance of the owner of the PDS. The relation between the owner and other objects is the root characteristic of a PDS and may play an important role in data operations on it. Based on this assumption, we propose a new concept, CoreSpace. CoreSpace is a subspace of the PDS composed of the objects that have a close relation to the owner during a given period. The CoreSpace framework makes it more efficient for users to locate a certain object or back up specific files from the PDS. The framework also opens up many promising research topics in Personal Dataspace Management.
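
The idea of a subspace of objects closely related to the owner during a period can be illustrated with a minimal sketch. It assumes, purely for illustration, that "close relation" is approximated by how often the owner accessed an object inside a time window; the access-log format, the function name `core_space`, and both thresholds are hypothetical and are not taken from the CoreSpace framework itself.

```python
from collections import Counter
from datetime import datetime, timedelta


def core_space(access_log, now, window_days=30, min_accesses=3):
    """Return the ids of objects the owner accessed often in the recent window.

    access_log: iterable of (object_id, timestamp) pairs (hypothetical format).
    Closeness to the owner is approximated by access count, an assumption.
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter(obj for obj, ts in access_log if cutoff <= ts <= now)
    return {obj for obj, n in counts.items() if n >= min_accesses}


now = datetime(2008, 10, 29)
log = [
    ("report.doc", now - timedelta(days=1)),
    ("report.doc", now - timedelta(days=2)),
    ("report.doc", now - timedelta(days=5)),
    ("photo.jpg", now - timedelta(days=3)),
    ("old_notes.txt", now - timedelta(days=200)),
    ("old_notes.txt", now - timedelta(days=201)),
    ("old_notes.txt", now - timedelta(days=202)),
]
print(core_space(log, now))
```

Here only `report.doc` qualifies: `photo.jpg` was touched once, and `old_notes.txt` was popular outside the window. Restricting search or backup to such a subspace is what would make those operations cheaper than scanning the whole PDS.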

  • Pay-As-You-Go Evolution
    Pay-as-you-go evolution is a major feature that distinguishes dataspace systems from other data management systems. The ability to evolve measures how well the system improves its service quality as users invest more attention and experience in it. To some extent, this capability indicates the quality of a dataspace system.

    As one of the first groups to follow dataspace research, we have been focusing on the topic of evolution. Our approach is to improve the system through automatic techniques combined with user attention and feedback. This is an incremental process that may involve several kinds of user interaction.
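
One way to picture such an incremental process is the sketch below. It is only an assumption about what combining automatic techniques with feedback could look like: an automatic step proposes candidate links with a confidence score, and each piece of user feedback nudges that score. The class `AssociationStore`, its methods, and the update rule are hypothetical.

```python
class AssociationStore:
    """Candidate links between objects, each with a confidence in [0, 1]."""

    def __init__(self, step=0.2):
        self.step = step          # how strongly one feedback moves the score
        self.confidence = {}      # frozenset({a, b}) -> confidence

    def propose(self, a, b, score):
        """Automatic step: record a machine-generated candidate link."""
        self.confidence[frozenset((a, b))] = score

    def feedback(self, a, b, correct):
        """User step: move confidence toward 1 (confirm) or 0 (reject)."""
        key = frozenset((a, b))
        old = self.confidence.get(key, 0.5)
        target = 1.0 if correct else 0.0
        self.confidence[key] = old + self.step * (target - old)

    def trusted(self, threshold=0.65):
        """Links whose confidence has reached the threshold."""
        return {tuple(sorted(k)) for k, c in self.confidence.items()
                if c >= threshold}


store = AssociationStore()
store.propose("alice@example.org", "Alice Smith", 0.5)
store.propose("alice@example.org", "Bob Jones", 0.5)
store.feedback("alice@example.org", "Alice Smith", correct=True)
store.feedback("alice@example.org", "Alice Smith", correct=True)
store.feedback("alice@example.org", "Bob Jones", correct=False)
print(store.trusted())
```

The point of the sketch is that no up-front integration effort is required: the system starts with rough automatic guesses and its answers improve only as much as the user chooses to invest.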

  • Efficient Approximate String Matching
    Text data is ubiquitous. The management of string data in databases and information systems has recently taken on particular importance. Approximate search and matching are especially important because different data sources are likely to differ in data quality, and we cannot guarantee that strings referring to the same real-world object are always identical.

    The approximate matching problem arises in several important applications, such as extracting named entities (e.g., people, locations, product names) from web pages, identifying biological concepts in biomedical literature, cleaning data in databases, and answering user queries on the web (for example, in Google). In these scenarios, exact matching will not catch all the answers we need, because web pages, database records, and user input may all contain errors. In addition, these applications require each query to be answered with low latency, especially those that adopt a Web-based service model. It is therefore important to study efficient approximate matching.
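
To make the difference between exact and approximate matching concrete, the sketch below compares strings by the overlap of their q-grams (substrings of length q). Jaccard similarity over q-gram sets is one common similarity measure, chosen here for illustration; the function names and the padding characters are assumptions.

```python
def qgrams(s, q=2):
    """Return the set of overlapping q-grams of s, padded so the ends count."""
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard similarity of the q-gram sets of a and b, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    if not union:                 # both strings empty and q == 1
        return 1.0
    return len(ga & gb) / len(union)


print(jaccard("google", "gogle"))   # misspelled query, high similarity
print(jaccard("google", "yahoo"))   # unrelated string, low similarity
```

An exact comparison treats `gogle` and `yahoo` as equally wrong for `google`; the gram-based score separates the likely typo from the unrelated string.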

    Recently, we have been studying the problem of approximate dictionary lookup: given an input text string (a document) consisting of a sequence of tokens, identify all substrings that approximately match some string from a potentially large dictionary.
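
The problem statement can be made concrete with a deliberately naive baseline. The sketch assumes edit distance as the similarity measure and simply compares every token span against every dictionary entry; it is not the algorithm developed in this project, and an efficient solution would replace the inner loops with an index and filtering. The names `lookup` and `max_dist` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute or match
        prev = cur
    return prev[-1]


def lookup(tokens, dictionary, max_dist=1):
    """Return (start, end, entry) for token spans within max_dist of an entry."""
    max_len = max(len(entry.split()) for entry in dictionary)
    matches = []
    for start in range(len(tokens)):
        last = min(start + max_len, len(tokens))
        for end in range(start + 1, last + 1):
            span = " ".join(tokens[start:end]).lower()
            for entry in dictionary:
                if edit_distance(span, entry.lower()) <= max_dist:
                    matches.append((start, end, entry))
    return matches


text = "we visited new yok and san francisco last year".split()
places = ["new york", "san francisco"]
print(lookup(text, places))
```

The misspelled span `new yok` is still reported as a match for `new york`. The cost of this baseline grows with the product of document length and dictionary size, which is why efficient approximate lookup is a research problem in its own right.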

System
  • OrientSpace
    Based on our understanding and research results, we developed a personal dataspace prototype system, OrientSpace. The system is developed in Java and implements the basic functionality for personal data management, including the following features:

    Flexible Schema:

    Users are free to create and modify schemas as they like, e.g., Contact and Event. They are also free to create and modify instances of each schema. Schemas and instances can be modified at any time. Because we use RDF for storage, schema modifications are lightweight.
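
Why triple-based storage keeps schema changes lightweight can be seen in a small sketch. It uses a plain in-memory set of (subject, predicate, object) triples rather than a real RDF library, and the class `TripleStore` and the predicate names are hypothetical; OrientSpace's actual storage layer is not shown here.

```python
class TripleStore:
    """A tiny in-memory set of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

    def remove(self, s=None, p=None, o=None):
        """Delete every triple matching the pattern."""
        self.triples -= set(self.match(s, p, o))


store = TripleStore()
# The schema is itself data: a Contact schema with two properties.
store.add("Contact", "hasProperty", "name")
store.add("Contact", "hasProperty", "email")
store.add("contact:1", "type", "Contact")
store.add("contact:1", "name", "Alice")
# Later the user extends the schema; existing instances are left untouched.
store.add("Contact", "hasProperty", "phone")
store.add("contact:1", "phone", "555-0100")
print(sorted(o for _, _, o in store.match("Contact", "hasProperty")))
```

Adding the `phone` property is a single new triple. There is no table to alter and no instance to migrate, which is the sense in which the modification is lightweight.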

    Content-based Association Construction and Utilization:

    Association information is indispensable in dataspace systems, so constructing and using it is a crucial problem. In OrientSpace, we produce several kinds of associations by analyzing the contents of resources (including text content and metadata). We leverage this association information to support graph-based browsing and querying.
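
A minimal sketch of this idea follows. It assumes that each resource has already been reduced to a set of extracted features (for example an email address from metadata or a keyword from the text) and links two resources whenever they share one. The feature sets, function names, and linking rule are illustrative assumptions, not the association types OrientSpace actually produces.

```python
from collections import defaultdict
from itertools import combinations


def build_associations(resources):
    """Link resources that share at least one extracted feature.

    resources: dict of resource id -> set of features (hypothetical input).
    Returns an adjacency dict of resource id -> set of neighbor ids.
    """
    by_feature = defaultdict(set)
    for rid, features in resources.items():
        for feature in features:
            by_feature[feature].add(rid)
    graph = defaultdict(set)
    for rids in by_feature.values():
        for a, b in combinations(sorted(rids), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def neighbors_within(graph, start, hops):
    """Graph-based browsing: resources reachable from start in <= hops steps."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        frontier = {n for f in frontier for n in graph.get(f, ())} - seen
        seen |= frontier
    return seen - {start}


resources = {
    "mail:1": {"alice@example.org", "budget"},
    "contact:alice": {"alice@example.org"},
    "doc:budget.xls": {"budget"},
    "photo:1": {"holiday"},
}
graph = build_associations(resources)
print(neighbors_within(graph, "contact:alice", 2))
```

Starting from the contact, one hop reaches the email that mentions her address, and a second hop reaches the spreadsheet that shares a keyword with that email, even though the contact and the spreadsheet have nothing in common directly.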

    Selective Text Indexing:

    Instead of full-text indexing, we use a selective indexing technique to index contents, for the following reasons:

    • Full-text indexing takes up too much space, as it does in desktop search tools.

    • Users normally want closely related items rather than everything that is related, as in typical Web search.

    • A selective approach may actually be even faster.
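
The space argument can be illustrated with a small sketch. The selection rule used here, keeping only each document's k most frequent non-stopword terms, is an assumption made for the example; the criterion OrientSpace actually uses is not described above. The stopword list and function name are likewise hypothetical.

```python
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}


def selective_index(docs, k=2):
    """Build an inverted index over each document's k most frequent terms.

    docs: dict of document id -> text. Terms outside the top k are not
    indexed at all, which is what keeps the index small.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        terms = [t for t in text.lower().split() if t not in STOPWORDS]
        for term, _count in Counter(terms).most_common(k):
            index[term].add(doc_id)
    return index


docs = {
    "d1": "dataspace index for dataspace query and dataspace index model",
    "d2": "the budget of the project budget",
}
index = selective_index(docs, k=2)
print(sorted(index))
```

A full-text index over these two documents would hold six distinct non-stopword terms; the selective index holds four. The trade-off is that a query for a dropped term such as `model` finds nothing, which is acceptable when users want closely related items rather than every mention.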

    Pay-As-You-Go Evolution:

    Our approach is to improve the system through automatic techniques combined with user attention and feedback. This is an incremental process that may involve several kinds of user interaction. We use a query-bridge-based approach to realize pay-as-you-go evolution.

Grant
    • 2007.7-2009.12 Techniques on Model, Query and Index of Dataspace
      Supported by the National High-Tech Research and Development Plan of China under grant number 2007AA01Z155
    • 2003 Semantic Grid Project
      Supported by the National Basic Research Program of China under grant number 2003CB317000
Patent
    • Efficient Merging and Filtering Algorithms for Approximate String
      United States Patent with patent no. 61/043,325
Publication
    • Alex Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, IEEE International Conference on Data Engineering (ICDE) 2009 (full paper)
    • Y. Li, X. Meng, X. Zhang: Research on Dataspace. Journal of Software, 2008, 19(8):2018-2031. 10.3724/SP.J.1001.2008.02018.
    • Y. Li, X. Meng: Research on Personal Dataspace Management, The Second SIGMOD PhD Workshop on Innovative Database Research (IDAR2008), Vancouver, BC, Canada, June 10-12, 2008.
    • X. Zhang, J. Chen, Y. Li, X. Meng: TEXEM: An Entity-based Task Extraction Approach for Emails, Journal of Computer Research and Development, Vol. 45 Suppl (NDBC2008 GuiLin)
    • Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches, IEEE International Conference on Data Engineering (ICDE) 2008 (full paper)
Maintained by Zhongyuan Wang. Copyright © 2007-2008 WAMDM. All rights reserved.