XML Data Management
Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879), originally designed to meet the challenges of large-scale electronic publishing. As a de facto standard for web data representation and exchange, XML underlies many applications, for example stock exchanges, news subscription services and sensor networks. XML has had a great impact on Internet information management and integration, and this has raised many challenges for managing semi-structured XML data.
  • Motivation
  • An XML document can be modeled as a tree with labeled nodes, which differs from the relational data model, so managing XML data in its native form is a new challenge. Existing approaches to XML data management fall into two categories: storing XML data in relational databases, and designing native XML databases to store it. Native XML database systems have attracted much research attention because they preserve the tree structure of XML data and use nodes or subtrees as the storage unit. The challenges include how to store XML data so that queries can be processed efficiently, how to update XML data, how to build XML indexes, and how to find the matching results for a given query.
  • XML Data Storage
  • After analyzing existing native XML storage technologies, we proposed storing XML data according to its schema information and implemented four storage strategies, which automatically cluster XML data into the same storage unit according to the features of the given XML document. Further, elements of the same type may be organized together or kept in document order, depending on query requirements. Such a storage mechanism greatly facilitates query processing. Our prototype, OrientStore, was published in VLDB2003 and has been cited twice. Athena Vakali et al. commented on our method that “The advantages of this approach are evident in several new applications” in their paper published in IEEE Internet Computing 9(2): 62-69 (2005).

    We have applied for a patent based on this method (No: 200410073869.5).

  • XML Numbering Scheme
  • We focus on range-based numbering (range codes) and on updating XML numbering. The update strategy is to reserve numbering space in advance. The problems are how to reserve the space and how to reallocate it when necessary; we propose a space-reserving algorithm and a renumbering algorithm, respectively, to address them. Our experimental results show that the method is effective and efficient: it greatly reduces the cost of updates, and more than 85% of the data does not cause renumbering.

    This work was published in the World Wide Web journal (WWWJ 2005) and in the Journal of Software, 2005(5).
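
    A minimal Python sketch of the space-reserving idea, under assumptions that are not taken from the papers: a fixed `gap` of unused codes is left around every range boundary, and a new last child takes the middle third of the free interval. The names `Node`, `number`, `is_ancestor` and `insert_last_child` are hypothetical. When the free interval is too small, the function returns `None` to signal that renumbering would be required; the renumbering algorithm itself is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    start: int = 0
    end: int = 0


def number(node: Node, counter: int = 0, gap: int = 10) -> int:
    """Assign (start, end) range codes depth-first, skipping `gap` codes per boundary."""
    counter += gap
    node.start = counter
    for child in node.children:
        counter = number(child, counter, gap)
    counter += gap
    node.end = counter
    return counter


def is_ancestor(a: Node, d: Node) -> bool:
    """With range codes, a is an ancestor of d iff a's range strictly contains d's."""
    return a.start < d.start and d.end < a.end


def insert_last_child(parent: Node, tag: str) -> Optional[Node]:
    """Place a new leaf in the reserved space after the last child.

    Returns None when fewer than two free codes remain, i.e. when a
    renumbering step (not shown here) would be needed.
    """
    low = parent.children[-1].end if parent.children else parent.start
    high = parent.end
    if high - low < 3:
        return None
    third = (high - low) // 3
    child = Node(tag, start=low + third, end=low + 2 * third)
    parent.children.append(child)
    return child
```

    In this sketch an insertion touches only the new node, so existing codes stay valid until the reserved interval is exhausted. How much space to reserve, and where, is exactly the design question the space-reserving algorithm addresses.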

  • Indexing XML Data
  • We proposed sequence-based XML indexing, which aims to avoid expensive join operations in query processing. It transforms structured XML data into sequences so that a structured query can be answered holistically through subsequence matching. With query equivalence, XML queries can be answered through subsequence matching without join operations, post-processing, or other special handling for problems such as false alarms. We identified a class of sequencing methods for this purpose and presented a novel subsequence matching algorithm that observes query equivalence. Still, query equivalence is only a prerequisite for sequence-based XML indexing. Our goal is to find the best sequencing strategy with regard to the time and space complexity of indexing and querying XML data. To this end, we introduced a performance-oriented principle to guide the sequencing of tree structures. For any given XML dataset, the principle finds an optimal sequencing strategy according to its schema and its data distribution.

    This work was published in ICDE2005, and the corresponding demo was published in SIGMOD2004. It was cited by two VLDB2005 papers and is considered one of the representative works on sequence-based methods. To date, it has been cited 16 times. K. Hima Prasad et al. pointed out that “Recently sequence based query processing is gaining importance because of its holistic query processing feature” in their paper: K. Hima Prasad, Ch. Rajesh, P. Sreenivasa Kumar: Handling Updates in Sequence Based XML Query Processing. In Proceedings of the International Conference on Management of Data (COMAD 2005), Hyderabad, India, December 20-22, 2005.

    Based on this work, we have applied for two patents (No: 200810056098.7, 200810056100.0).
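
    The basic idea of sequencing a tree and answering a query by subsequence matching can be sketched in Python as follows. This is a simplified illustration, not the published algorithm: trees are assumed to be `(label, children)` tuples, each node is encoded as a `(label, ancestor path)` pair in depth-first order, and the helper names `sequence` and `is_subsequence` are hypothetical. The naive matcher below does not enforce query equivalence, so on some inputs (for example, branching queries over repeated sibling labels) it can report false alarms.

```python
from typing import List, Tuple

Tree = Tuple[str, list]              # (label, list of child trees)
Item = Tuple[str, Tuple[str, ...]]   # (label, labels of its ancestors)


def sequence(tree: Tree, prefix: Tuple[str, ...] = ()) -> List[Item]:
    """Flatten a tree into (label, ancestor path) pairs in depth-first order."""
    label, children = tree
    items: List[Item] = [(label, prefix)]
    for child in children:
        items.extend(sequence(child, prefix + (label,)))
    return items


def is_subsequence(query: List[Item], data: List[Item]) -> bool:
    """Naive check: do the query items occur in the data sequence, in order?"""
    remaining = iter(data)
    # `item in remaining` advances the iterator past the first match,
    # so later query items can only match later data items.
    return all(item in remaining for item in query)


doc = ("book", [("title", []), ("author", [("name", [])])])
query = ("book", [("author", [("name", [])])])
found = is_subsequence(sequence(query), sequence(doc))
```

    No join between separate element lists is needed here: the whole query is matched against one sequence in a single pass, which is the property the sequence-based approach builds on.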

  • Query Algebra
  • XQuery is the recommended standard for XML querying. XQuery processing strategies can be classified into two categories: core-syntax-based strategies (node-oriented) and algebra-based strategies (set-oriented). Neither handles XQuery well: the syntax-based strategy is inefficient and hard to optimize, while current algebra-based strategies cannot accommodate the flexible programming characteristics of XQuery. After summarizing the state and unsolved problems of earlier algebra-based work, we proposed an effective XQuery algebra system, OrientXA, which embodies ideas from both strategies. OrientXA introduces the notion of the Construct Pattern Tree for the first time, and its Construct operator captures the flexible characteristics of XQuery. With its expressive operators, it can express all the queries in the W3C use cases and the XMark benchmark.

    This work was published in WAIM2004, the Journal of Software, and the Journal of Computer Research and Development.

  • Native XML DBMS: OrientX
  • OrientX is a schema-based, integrated native XML database system built by the WAMDM Lab, Renmin University of China, under NSFC grant 60273018. It includes the following functional modules: native storage, schema manager, index manager and query engines. Schema information plays a vital role in the system: it affects storage granularity, indexing structure and query optimization, all of which are combined to support efficient XML query processing. OrientX was accepted into the XQuery implementation list of W3C.

    The Chair of “Dagstuhl Seminar on XQuery Implementation Paradigms, 2006” has pointed out that “Your native XML database system OrientX is clearly recognized as a highly significant contribution in this research area and the seminar organizers are looking forward to your attendance”.

    Ontology Data Management
    An Ontology is a formal specification of a shared conceptualization in a certain domain. Because it is explicit and shared, an Ontology can serve as the semantic foundation for communication between different agents. Further, Ontologies can help machines understand the semantics of documents. The Semantic Web is an important application scenario for Ontologies. The Internet holds an abundance of information, but continual and rapid data growth makes it hard to maintain and access the required resources. The Semantic Web is a recently proposed web concept that aims to let machines search web data automatically, making access as convenient as possible for users. In the Semantic Web framework, Ontologies are used to describe the semantics of web resources and to enable machines to manage web information automatically.

    The very large volume of data is a significant problem for Ontology data management in the Semantic Web environment. In addition, the continual growth of web resources causes frequent updates of Ontology data, so supporting efficient updates is another important problem.

    Most existing work tries to solve these problems with methods based on relational databases (RDBs). However, RDBs are not designed for the features of Ontology data. There is a great difference between the complex graph model of Ontology and the simple flat model of relational data. RDB-based Ontology data management needs to divide the Ontology graph into simple relations and transform a graph-based query into a set of join operations on relational tables. This mismatch between the two models limits RDB-based methods in managing large-scale Ontology data. In addition, RDB-based methods typically pre-compute the implicit inferred data and materialize it in storage. Although this guarantees query efficiency, it greatly increases the cost of updates: when explicit data is updated, maintaining the materialized data is expensive. In fact, most existing Ontology management systems cannot support updates effectively.

    To manage large volumes of Ontology data efficiently, [6] proposed a novel storage method that designs a native storage structure according to the characteristics of Ontology data and breaks through the restrictions of the RDB model. Its most remarkable characteristic is that it leverages the XML data model and adopts a tree structure (see Figure 1) to store the class and property hierarchies in an Ontology. Tree structures preserve the original hierarchies in the Ontology data, so the implicit inferred data implied by the class and property hierarchies does not need to be materialized, which reduces the cost of updates. The storage structure also supports updates: it maintains the consistency of data through simple operations at little cost. Based on this storage, [6] proposes a query processing method that supports Ontology queries in SPARQL. In addition to query processing, inference ability is an important aspect of Ontology data management. [1] studies this problem and proposes initial inference algorithms and incremental inference algorithms to guarantee the completeness and efficiency of the inference procedure.
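
    Why keeping the hierarchy as a tree avoids materialization can be illustrated with a small Python sketch. It is not the storage structure of [6]: the class `ClassHierarchy` and its methods are hypothetical, the hierarchy is assumed to be a strict tree held in memory, and only class membership is modeled. Each individual is stored once, under its most specific class, and membership in superclasses is derived at query time by walking the subtree.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ClassHierarchy:
    """Toy class tree that stores only explicit instance assertions."""

    def __init__(self) -> None:
        self.children: Dict[str, List[str]] = defaultdict(list)
        self.instances: Dict[str, Set[str]] = defaultdict(set)

    def add_subclass(self, parent: str, child: str) -> None:
        self.children[parent].append(child)

    def add_instance(self, cls: str, individual: str) -> None:
        # A single write: nothing is copied to the superclasses.
        self.instances[cls].add(individual)

    def instances_of(self, cls: str) -> Set[str]:
        """Collect explicit instances of cls and of every class below it."""
        result: Set[str] = set()
        stack = [cls]
        while stack:
            current = stack.pop()
            result |= self.instances.get(current, set())
            stack.extend(self.children.get(current, []))
        return result


h = ClassHierarchy()
h.add_subclass("Person", "Student")
h.add_subclass("Student", "GradStudent")
h.add_instance("GradStudent", "alice")
people = h.instances_of("Person")
```

    A materializing design would instead write "alice" under all three classes and would have to repair those copies on every update; here an update changes one entry, at the price of a subtree traversal at query time.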

    OrientX is a native XML data management system. Based on it, we developed an extended version called OrientX/Ontology, which can be viewed as a special version for Ontology data. It implements the novel storage method and supports querying and inference, which is of significance to both researchers and engineers. At present, it can load more than 200M documents; further work to improve the query engine and support larger volumes of documents is under way.
    XML Keyword Search

    Structured query languages such as XQuery and XPath are too restrictive for users who want to retrieve information from an XML document. XML keyword search relieves users of the burden of understanding the underlying schema and query languages, and has therefore been studied extensively in the past few years. However, some problems have not yet been addressed, and they form the focus of our research.

    Our research focuses on the following problems:

  • User’s search intention
  • Inspired by the great success of information retrieval (IR) style keyword search on the web, keyword search on XML has emerged recently. The effectiveness of keyword search on XML depends on the ability to identify the user's search intention and to measure the relevance of results with respect to the query. Further, a keyword can appear both as a tag name and as the text value of some node, and a keyword can appear as the text value of different XML node types and carry different meanings. Although finding the SLCA (Smallest Lowest Common Ancestor) of all keywords, as existing approaches do, is a reasonable way to answer an XML keyword query, retrieving SLCA results without addressing search intention and the above ambiguities leads to low result quality in terms of query relevance.
  • Node categories
  • The critical issue in XML keyword search is how to find meaningful query results. In both the tree model and the graph model, the main idea of existing approaches is to find a set of Connected Networks (CNs), where each CN is an acyclic subgraph T of the XML data D such that T contains all the given keywords while no proper subgraph of T does. In particular, in the tree data model, the Lowest Common Ancestor (LCA) semantics was proposed first, followed by SLCA (Smallest LCA) and MLCA, which apply additional constraints on LCA. In the graph data model, existing methods focus on finding matched CNs in which IDRefs are considered. In practice, however, most existing approaches only take into account the structural information among the nodes in XML data and neglect the node categories. They therefore suffer from limited expressiveness and fail to provide an effective mechanism for describing how the parts of the returned data fragments are connected in a meaningful way.
  • Identifying relevant XML fragments
  • Several search semantics have been proposed: LCA (Lowest Common Ancestor), SLCA (Smallest LCA), Interconnection and VLCA (Valuable LCA). These works only consider the case in which queries and relevant results match exactly, meaning that each keyword in the query matches at least one node in the XML document. Because of ill-formed queries, caused for example by spelling errors and synonymy in natural language, term mismatch between queries and relevant XML fragments is ubiquitous. This may cause many relevant results to be missed and irrelevant results to be returned. We propose query refinement for XML keyword search to address term mismatch.
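
    The SLCA semantics mentioned in the problems above can be made concrete with a brute-force Python sketch. It assumes that every node carries a Dewey label (a tuple giving the child positions on the path from the root) and that the nodes matching each keyword are already known; the function names `lca` and `slca` are hypothetical. The sketch enumerates all keyword combinations, so it only illustrates the definition and is not an efficient algorithm.

```python
from itertools import product
from typing import List, Sequence, Tuple

Dewey = Tuple[int, ...]


def lca(labels: Sequence[Dewey]) -> Dewey:
    """Lowest common ancestor of Dewey labels = their longest common prefix."""
    prefix = []
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def slca(matches: List[List[Dewey]]) -> List[Dewey]:
    """Return the LCAs that have no other LCA in their subtree.

    matches[i] holds the labels of the nodes containing keyword i.
    """
    candidates = {lca(combo) for combo in product(*matches)}
    return sorted(
        c for c in candidates
        if not any(o != c and o[:len(c)] == c for o in candidates)
    )


# Keyword 1 occurs under both children of the root, keyword 2 under the first only.
result = slca([[(0, 0, 0), (0, 1, 0)], [(0, 0, 1)]])
```

    In this example the root `(0,)` also covers both keywords, but it is discarded because its descendant `(0, 0)` already does, which is the "smallest" condition in SLCA.
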
    • 2006-2007 Supporting Context in XML Data Management Systems (Principal Investigator)
      Granted by China-Greece international cooperation project
    • 2005 Ontology based Data Management (Principal Investigator)
      Granted by IBM University project
    • 2004-2007 XML Data Management (Principal Investigator)
      Granted by Program for New Century Excellent Talents in University (NCET)
    • J. Zhou, X. Meng, T. Ling: Efficient Processing of Partially Specified Twig Pattern Queries, Accepted by Science in China Series E: Information Sciences.
    • J. Zhu, W. Wang, X. Meng: Efficient Processing of Complex XML Twig Query. In Proceedings of 9th International Conference on Web-Age Information Management (WAIM 2008), Zhangjiajie, China
    • J. Huang, J. Xu, J. Zhou, X. Meng: MLCEA: An Entity Based Semantics for XML Keyword Search, 2008.10 (NDBC2008, Guilin) (in Chinese)
    • J. Zhu, W. Wang, J. Zhou, X. Meng: Efficient Processing of XML Twig Pattern Based on Related Semantics. Journal of Computer Research and Development (Suppl.). 2008.10 (NDBC2008, Guilin) (in Chinese)
    • X. Zhang, X. Meng, J. Zhu, W. Wang, J. Huang: OrientStore+: A Native XML Storage Strategy for Efficient Update. Journal of Computer Research and Development, Vol. 44 Suppl.: 368-373, 2007.10 (NDBC2007, Haikou, Best Paper Award) (in Chinese)
    • J. Zhou, X. Meng, X. Zhang, J. Huang: Keyword Based Multiple Query Processing over XML Streams. Journal of Computer Research and Development, Vol. 44 Suppl.: 392-397, 2007.10 (NDBC2007, Haikou) (in Chinese)
    • J. Zhou, M. Xie, X. Meng: TwigStack+: Holistic Twig Join Pruning Using Extended Solution Extension. Wuhan University Journal of Natural Sciences, Vol. 12, No. 5: 855-860, 2007.9 (4th Web Information System and Application(WISA2007), Beijing, Best Paper Award)
    • J. Zhou, X. Meng, Y. Jiang, M. Xie: F-Index: A Flattened Structural Index for Speeding up Twig Query Processing. Journal of Software, Vol.18(6):1429-1442, June, 2007.
    • Xiaofeng Meng, Xiaofeng Wang, Min Xie et al.: OrientX: An Integrated, Schema-Based Native XML Database System. Wuhan University Journal of Natural Sciences, 11(5):1192-1196, Nov., 2006. (The Third Web Information System and Application (WISA2006), Nanjing, Nov 3-5, 2006.)
    • Xiaofeng Wang, Xin Zhang, Min Xie, Xiaofeng Meng, Junfeng Zhou: Keyword Search on XML Streams. Journal of Computer Research and Development, Volume 43 (Supplement), 2006.10, NDBC2006
    • Min Xie, Xiaofeng Wang, Xin Zhang, Xiaofeng Meng, Junfeng Zhou: Ordered XPath Query Processing on XML Stream. Journal of Computer Research and Development, Volume 43 (Supplement), 2006.10, NDBC2006
    • X. Wang, J. Ou, X. Meng, and Y. Chen: Abox Inference for Large Scale OWL-Lite Data. To appear in Proceedings of the 2nd International Conference on Semantics, Knowledge, and Grids (SKG2006), Guilin, China, Oct. 31 - Nov. 3, 2006. (Regular paper 18%)
    • Y. Chen, J. Ou, Y. Jiang, X. Meng: HStar-a Semantic Repository for Large Scale OWL Documents. In Proceedings of the First Asian Semantic Web Conference (ASWC2006), page 415-428, Beijing, China, September 3-7, 2006. Lecture Notes in Computer Science 4185, Springer. (Full Paper 36/208=18%)
    • H. Wang, X. Meng: On the Sequencing of Tree Structures for XML Indexing. In Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), pages 372-373, Tokyo, Japan, April 2005.
    • J. X. Yu, D. Luo, X. Meng, H. Lu: Dynamically Updating XML Data: Numbering Scheme Revisited. World Wide Web, Vol. 8(1):5-26, March, 2005.
    • X. Meng, D. Luo, J. Ou: Extended Role Based Access Control Method for XML Documents. Wuhan University Journal of Natural Sciences, Vol.9(5):740-744, Sept., 2004.
    • Y. Wang, H. Wang, X. Meng, S. Wang: Estimating the Selectivity of XML Path Expression with Predicates by Histograms. In Proceedings of the 5th International Conference on Web-Age Information Management (WAIM 2004), pages 409-418, Dalian, China, July 15-17, 2004. Lecture Notes in Computer Science 3129, Springer 2004.
    • X. Meng, Y. Jiang, Y. Chen, H. Wang: XSeq: An Index Infrastructure for Tree Pattern (Demo). In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2004), pages 941-942, Paris, France, June 13-18, 2004.
    • D. Luo, T. Chen, T. W. Ling, X. Meng: On View Transformation Support for a Native XML Database. In Proceedings of the 9th International Conference on Database Systems for Advanced Applications (DASFAA 2004), pages 226-231, Jeju Island, Korea, March 17-19, 2004. Lecture Notes in Computer Science 2973, Springer.
    • X. Meng, D. Luo, M.L. Lee, J. An: OrientStore: A Schema Based Native XML Storage System (Demo). In Proceedings of 29th International Conference on Very Large Data Bases (VLDB2003), pages 1057-1060, Berlin, Germany, September 9-12, 2003.
    • J. Wang, X. Meng, S. Wang: Integrating Path Index with Value Index for XML Data. In Proceedings of the Fifth Asia Pacific Web Conference (APWeb2003), pages 95-100, Xi'an, China, 27-29 September 2003. LNCS 2642.
    Maintained by Zhongyuan Wang. Copyright © 2007-2009 WAMDM. All rights reserved.