Introduction: Web Data Integration
The rapid development of the World Wide Web has dramatically changed the way information is managed and accessed. The amount of information on the Web is growing at a striking speed: at present there are more than 7,500 terabytes (roughly 4 billion web pages) of information on the Web, covering all domains of human activity. This opens up the opportunity for users to benefit from the available data, and the Web is consequently attracting more and more attention.

The Web can be divided into the Surface Web and the Deep Web. Traditional search engines create their indices by crawling Surface Web pages. The Surface Web consists of static Web pages linked to other pages, while the Deep Web refers to Web pages created dynamically as the result of a specific search. Traditional search engines cannot "see" or retrieve content in the Deep Web. On average, the Deep Web receives fifty percent greater monthly traffic than the Surface Web. According to a survey released by UIUC in 2004, more than 300,000 Web databases and 450,000 query interfaces were available at that time, and both figures are still increasing quickly. Beyond its sheer scale, the contents of Web databases span virtually all topics. The Deep Web is the fastest growing category of new information on the Internet, and its study will be one of the hottest areas in research.

Motivation
Although the Web holds such abundant content, it is not easy to obtain because of the scale, heterogeneity, and lack of structure of Web data. A feasible way to address this problem is to extract data from Web sources to populate databases for further processing. More and more researchers are therefore devoting their efforts to integrating Web data sources automatically and utilizing them effectively.

In the context of the Deep Web, users often have difficulty first finding the right Web databases (WebDBs) and then querying over them, due to the large number of Web databases in the Deep Web. To enable effective access to databases on the Web, researchers are exploring feasible solutions to integrate WebDBs, which can provide users with unified access and retrieve information automatically.

Research work
  • Deep Web Integration
    More and more accessible databases are available on the Web. In order to provide users with unified access to these Web databases and to retrieve information from them automatically, we propose a comprehensive solution for Web database integration. The solution consists of three primary modules: the integrated interface generation module, the query processing module, and the result processing module; a short code sketch of how these modules fit together follows the component descriptions below.

    Integrated interface generation module: produces an integrated interface over the query interfaces of the Web databases to be integrated. Its four components function as follows:

    • Web database discovery: Locate Web sites that are backed by Web databases, and identify the query interfaces among the pages of these sites.
    • Query interface schema extraction: Extract the attributes in query interfaces, together with the meta-information about each attribute.
    • Web database clustering by topic: Cluster all discovered Web databases into groups such that the databases in each group belong to the same topic.
    • Interface integration: Given the Web databases of the same topic, merge semantically equivalent attributes of their query interfaces into global attributes, and finally form an integrated interface.

    Query processing module: processes a user's query filled in on the integrated interface, and submits the query to each Web database. Its three components function as follows:

    • Web database selection: Select appropriate Web databases for a user's query in order to obtain satisfactory results at minimal cost.
    • Query translation: Translate the query on the integrated interface into an equivalent set of local queries on the query interfaces of the Web databases.
    • Query submission: Analyze the submission mechanisms of the local query interfaces, and submit each local query automatically.

    Result processing module: extracts the query results returned by the Web databases, and merges them under a global schema. Its three components function as follows:

    • Result extraction: Identify and extract the actual results from the response pages returned by Web databases.
    • Result annotation: Attach the proper semantics to the extracted results.
    • Result merging: Merge the results extracted from different Web databases under a global schema.
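
    As a concrete illustration, the Python sketch below shows one way the three modules could be composed. All class and function names are illustrative assumptions of this sketch, not the actual system's API; query submission, result extraction, and annotation are elided.

        from dataclasses import dataclass

        @dataclass
        class Query:
            # A user's query as attribute/value pairs on the integrated interface.
            conditions: dict

        @dataclass
        class WebDB:
            name: str
            topic: str
            # Mapping from global attribute names to this database's local names.
            attribute_map: dict

        def select_databases(query, dbs):
            # Web database selection: keep databases that cover the queried attributes.
            return [db for db in dbs
                    if all(attr in db.attribute_map for attr in query.conditions)]

        def translate_query(query, db):
            # Query translation: rewrite global attributes into the local schema.
            return {db.attribute_map[attr]: value
                    for attr, value in query.conditions.items()}

        def process_query(query, dbs):
            results = []
            for db in select_databases(query, dbs):
                # Query submission and result extraction would happen here;
                # this sketch only records the translated local query.
                results.append((db.name, translate_query(query, db)))
            # Result merging would unify extracted records under the global schema.
            return results

        dbs = [WebDB("JobsA", "job", {"title": "job_title", "city": "location"}),
               WebDB("JobsB", "job", {"title": "position"})]
        print(process_query(Query({"title": "DBA", "city": "Beijing"}), dbs))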

  • Web Data Extraction
    On the Web, information is mainly presented in web pages, so the primary task is to recognize the data of interest among the many uninteresting pieces of text in a page. Since web pages are semi-structured documents, it is challenging to accomplish this task with high accuracy. Web data extraction tools can be classified into three approaches: manual, semi-automatic, and automatic. We have implemented several tools, introduced briefly below.

    ViDRE: Vision-based Web data records extraction
    This tool focuses on the problem of extracting data records from the response pages returned by Web databases or search engines. Existing solutions to this problem are based primarily on analyzing the HTML DOM trees and tags of the response pages. Though these solutions can achieve good results, they depend too heavily on the specifics of HTML and may have to be changed should the response pages be written in a totally different markup language. We propose a novel, language-independent technique to solve the data extraction problem. We analyze several types of visual features that exist in all response pages, including position features, layout features, appearance features, and content features. Based on these visual features, we implemented ViDRE, which performs the extraction using only the visual information of the response pages as they are rendered in a web browser.
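
    To make the idea concrete, here is a much-simplified Python sketch: given only the rendered bounding boxes of page blocks, it groups blocks that share a left edge and a similar height, and takes the largest group as the data-record region. The Box type and this particular feature choice are assumptions of the sketch; ViDRE itself combines far richer position, layout, appearance, and content features.

        from collections import defaultdict

        class Box:
            def __init__(self, x, y, w, h):
                self.x, self.y, self.w, self.h = x, y, w, h

        def find_record_region(boxes, height_tolerance=10):
            groups = defaultdict(list)
            for b in boxes:
                # Visual signature: same left edge, roughly equal height.
                groups[(b.x, b.h // height_tolerance)].append(b)
            # Data records usually form the largest run of visually similar blocks.
            return max(groups.values(), key=len)

        boxes = [Box(10, 0, 500, 40),    # page header
                 Box(20, 60, 480, 95),   # record 1
                 Box(20, 170, 480, 98),  # record 2
                 Box(20, 280, 480, 92),  # record 3
                 Box(10, 400, 500, 30)]  # page footer
        print(len(find_record_region(boxes)))  # -> 3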

    TSReC: A hybrid method for automated news content extraction from the Web
    Web news search poses some non-trivial problems for traditional information retrieval techniques. One of them is how to differentiate Web news content from everything else on a Web page. Web news content extraction is vital to improving news indexing and searching in today's search engines, especially for news search services. We study the Web news content extraction problem and propose an automated extraction algorithm for it. Our method is a hybrid that takes advantage of both sequence matching and tree matching techniques. We propose TSReC, a variant of tag sequence representation suitable for both sequence matching and tree matching, together with an associated algorithm for automated Web news content extraction.
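
    The following Python sketch is an illustrative reduction of the tag-sequence idea, not the TSReC algorithm itself: each page is serialized into a sequence of tags and text tokens, two pages from the same template are aligned by sequence matching, and the unmatched text spans are treated as page-specific news content while the matched spans are template boilerplate.

        from difflib import SequenceMatcher
        from html.parser import HTMLParser

        class TagSequencer(HTMLParser):
            def __init__(self):
                super().__init__()
                self.seq = []
            def handle_starttag(self, tag, attrs):
                self.seq.append("<" + tag + ">")
            def handle_data(self, data):
                if data.strip():
                    self.seq.append("TEXT:" + data.strip())

        def tag_sequence(html):
            parser = TagSequencer()
            parser.feed(html)
            return parser.seq

        def page_specific(html_a, html_b):
            a, b = tag_sequence(html_a), tag_sequence(html_b)
            matched = set()
            for block in SequenceMatcher(None, a, b).get_matching_blocks():
                matched.update(range(block.a, block.a + block.size))
            # Tokens of page A with no counterpart in page B: candidate content.
            return [tok for i, tok in enumerate(a)
                    if i not in matched and tok.startswith("TEXT:")]

        a = "<html><div>SiteNews</div><p>Quake hits city X today.</p></html>"
        b = "<html><div>SiteNews</div><p>Markets rally on new data.</p></html>"
        print(page_specific(a, b))  # -> ['TEXT:Quake hits city X today.']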

    RecipeCrawler: Collecting Recipe Data from WWW
    The WWW is the largest data repository ever available in the history of humankind, so utilizing the Internet as a data source is natural, and many efforts have been made toward it. We focus on establishing a robust system that collects structured recipe data from the Web incrementally, which we believe is a critical step toward practical, continuous, reliable Web data extraction systems, and therefore toward utilizing the WWW as a data source for various database applications. The reasons for advocating such an incremental approach are two-fold: (1) it is impractical to crawl all recipe pages from the relevant web sites at once, as the Web is highly dynamic; (2) it is almost impossible to induce a general wrapper for future extraction from the initial batch of recipe web pages. Our system, RecipeCrawler, targets incremental collection of recipe data from the WWW. We consider the general issues in establishing an incremental data extraction system and apply the resulting techniques to recipe data collection from the Web.

    SG-WRAP: A Schema-Guided Wrapper Generator
    Web wrapper technology transforms unstructured or semi-structured data into semi-structured or structured data, which can then be queried and analyzed using mature techniques developed in databases and other fields. SG-WRAP adopts a novel, schema-guided approach to wrapper generation. With this approach, a user defines the schema of the data to be extracted from an HTML page in terms of an XML Document Type Definition (DTD). The user also provides example mappings by associating data in the HTML page with elements of the DTD. The system then induces the mapping rules and generates a wrapper that extracts data from the HTML page and produces an XML document conforming to the specified DTD.
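
    A toy Python rendition of the schema-guided idea: from a single user-supplied example (a value in the page mapped to a schema element), an extraction rule is induced from the value's enclosing tag and reused across the page. Real SG-WRAP induces rules over the HTML structure against a full user-defined DTD; the regex-based rule here is an assumption made to keep the sketch short.

        import re

        def induce_rule(html, example_value):
            # Generalize the example: keep its enclosing start tag, wildcard the value.
            pos = html.find(example_value)
            start_tag = html[:pos].rsplit(">", 1)[0].rsplit("<", 1)[1]
            return re.compile("<" + re.escape(start_tag) + ">([^<]+)</")

        page = "<ul><li class='t'>DB Concepts</li><li class='t'>Web Mining</li></ul>"
        rule = induce_rule(page, "DB Concepts")   # user maps this value to a schema element
        print(rule.findall(page))                 # -> ['DB Concepts', 'Web Mining']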

    SG-WRAM: Schema Guided Wrapper Maintenance
    SG-WRAM is a novel schema-guided approach to wrapper maintenance, built on our previous work, the schema-guided wrapper generator SG-WRAP. It is based on the observation that despite various page changes, many important features of the pages are preserved, such as the syntactic patterns, annotations, and hyperlinks of the extracted data items; it is therefore feasible to recognize data items in the changed pages using these features. In addition, the schema of the extracted data does not change. Specifically, we maintain wrappers in four steps. First, features are obtained from the user-defined schema, the previous extraction rule, and the extracted results. Second, we recognize the data items in the changed page using these features. Third, we group the items according to the schema; each group, called a semantic block, is a possible instance of the given schema. Finally, representative instances are selected to re-induce the extraction rule for the new page.
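
    The Python sketch below compresses the four steps under strong simplifying assumptions: features are reduced to syntactic patterns of previously extracted values, recognition is a regex search, grouping is by enclosing tag, and "re-induction" picks the majority tag as the new rule. All data and patterns are illustrative.

        import re

        def value_pattern(value):
            # Step 1: derive a syntactic feature, e.g. prices like "$12.99".
            if re.fullmatch(r"\$\d+\.\d{2}", value):
                return r"\$\d+\.\d{2}"
            return re.escape(value)

        def recognize(changed_html, old_values):
            # Step 2: locate data items in the changed page via their features.
            hits = []
            for pat in {value_pattern(v) for v in old_values}:
                hits += re.findall(r"<(\w+)[^>]*>(" + pat + r")</", changed_html)
            return hits

        def reinduce(hits):
            # Steps 3-4: group items by enclosing tag (semantic blocks) and
            # re-induce the extraction rule from the majority tag.
            tags = [tag for tag, _ in hits]
            best = max(set(tags), key=tags.count)
            return re.compile("<" + best + "[^>]*>([^<]+)</")

        old_values = ["$12.99", "$8.50"]   # extracted by the previous wrapper
        changed = "<div><b>$15.00</b></div><div><b>$7.25</b></div>"
        rule = reinduce(recognize(changed, old_values))
        print(rule.findall(changed))       # -> ['$15.00', '$7.25']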

    A Data Driven Approach for Automatic Wrapper Generation and Maintenance
    Wrapper generation and maintenance is a crucial research topic in Deep Web data integration. Existing methods usually induce wrappers by analyzing the structure or features of a website; these methods rely heavily on website templates and may be ineffective on some websites. Moreover, previous research has paid little attention to wrapper maintenance. These two problems hinder the implementation of large-scale Deep Web data integration. We propose a novel data-driven approach that handles both tasks automatically. The approach matches data items between source pages and target pages via records with the same semantics from different websites in one domain, or from different templates within one site, and then generates or maintains wrappers from these mapped data items. It relies on no template and sets no thresholds. Experimental results show that the accuracy and applicability of the method improve greatly over previous methods.
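
    An illustrative Python sketch of the data-driven idea: when two pages describe the same record, the values they share pinpoint where the data lives, and the tags enclosing those shared values can serve as a wrapper, with no template assumptions and no tuned thresholds. The flat regex view of a page is a simplification made by this sketch.

        import re

        def text_with_tags(html):
            # Pairs of (enclosing tag, text value) in document order.
            return re.findall(r"<(\w+)[^>]*>([^<]+)</", html)

        def derive_wrapper(page_a, page_b):
            values_b = {v.strip() for _, v in text_with_tags(page_b)}
            # Tags in page A whose text also occurs in page B mark data positions.
            return {tag for tag, v in text_with_tags(page_a) if v.strip() in values_b}

        site_a = "<h2>iPhone 12</h2><span>Apple</span><em>ad: buy now!</em>"
        site_b = "<td>iPhone 12</td><td>Apple</td><td>in stock</td>"
        print(sorted(derive_wrapper(site_a, site_b)))  # -> ['h2', 'span']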

  • WDB Selection & Query Translation
    Query processing in Deep Web data integration mainly focuses on Web database selection (i.e., given a user query, the integration system selects the most relevant Web databases) and query translation (i.e., translating the user query from the integrated interface to the Web database interfaces). Both are challenging because Web databases are autonomous, heterogeneous, and dynamic. Our recent work on this topic is introduced briefly below.

    A Graph-Based Approach for Web Database Sampling
    A flood of information is hidden behind Web query interfaces with specific query capabilities, which makes it difficult to capture the characteristics of a Web database, such as its topic and its frequency of updates. This poses a great challenge for Deep Web data integration. To address this problem, we propose WDB-Sampler, a graph-based approach for Web database sampling that incrementally obtains sample records from a Web database through its query interface: a number of samples are obtained for the current query, and one of them is transformed into the next query. An important characteristic of this approach is that it can adapt to different kinds of attributes on the query interfaces.
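
    A bare-bones version of the sampling loop in Python. Here query_webdb is a placeholder standing in for submitting a term through the database's query interface and parsing the returned records; it and the toy data are assumptions of this sketch.

        import random

        def query_webdb(term):
            # Placeholder: would submit `term` to the query interface and parse results.
            fake_db = {"data": ["web data extraction", "data integration"],
                       "web": ["deep web survey", "web data extraction"]}
            return fake_db.get(term, [])

        def sample_webdb(seed_term, n_queries):
            samples, query = set(), seed_term
            for _ in range(n_queries):
                records = query_webdb(query)
                if not records:
                    break
                samples.update(records)
                # Transform one sampled record into the next query: pick one of its words.
                query = random.choice(random.choice(records).split())
            return samples

        print(sample_webdb("data", 5))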

    An Attributes Correlation Based Approach for Estimating Size of Web Databases
    We propose an approach based on word frequency to estimate the size of a Web database. It obtains a random sample on a certain attribute by analyzing the attribute correlations among all the textual attributes in the query interface. The size of a Web database can then be estimated by submitting probing queries, generated from the top-k frequent words, to the database's query interface.
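
    The actual estimator in this work is built on attribute correlations and top-k frequent words; as a simpler stand-in for the same probing idea, the classic capture-recapture estimate below (a substitution, not the paper's method) infers database size from the overlap between the result sets of two probe queries.

        def estimate_size(results_q1, results_q2):
            s1, s2 = set(results_q1), set(results_q2)
            overlap = len(s1 & s2)
            if overlap == 0:
                return None  # probes too small or too disjoint to estimate
            # Lincoln-Petersen estimator: N ~= n1 * n2 / m
            return len(s1) * len(s2) // overlap

        q1 = ["r%d" % i for i in range(0, 400)]    # records matching probe word 1
        q2 = ["r%d" % i for i in range(300, 600)]  # records matching probe word 2
        print(estimate_size(q1, q2))               # -> 1200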

    Uncertain Schema Matching in Deep Web Integration Service
    With the growth of the Deep Web, providing high-quality data from autonomous, heterogeneous, and dynamic Web databases to users has become a hot topic in Deep Web integration research. It is essential to generate reasonable schema matchings both between the keywords of the user request and the schema of the integrated interface, and between the schema of the integrated interface and those of the Web database interfaces. Related work on schema matching generates only the single best matching, glossing over its uncertainty. We analyze the uncertainty of schema matching and propose a series of similarity measures. To reduce execution cost, we also propose a type-based optimization method and a matching-pruning method for numeric data.
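
    A small Python sketch of what keeping the uncertainty means in practice: candidate attribute correspondences are scored by a similarity measure and normalized into a probability distribution, rather than collapsed to the single argmax. The token-overlap similarity here is only a stand-in for the paper's series of measures.

        def similarity(a, b):
            ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
            return len(ta & tb) / len(ta | tb)

        def match_distribution(source_attr, target_attrs):
            scores = {t: similarity(source_attr, t) for t in target_attrs}
            total = sum(scores.values())
            # Keep every plausible candidate with its probability, not just the best.
            return {t: s / total for t, s in scores.items() if s > 0}

        print(match_distribution("departure_city", ["city", "departure_date", "price"]))
        # -> city ~0.6, departure_date ~0.4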

    EasyQuerier: A Keyword Query Interface for Web Database Integration Systems
    A lot of recent work integrates the search interfaces of multiple Web databases of the same domain into an integrated interface, which enables users to search multiple Web databases with one query. However, there are two potential problems when using such integrated interfaces in practice. First, if the number of domains is large, it may be difficult for users to find the correct domain. Second, the integrated interfaces can become too complicated for ordinary users. We propose EasyQuerier to tackle these problems: it lets users access Web databases with keyword queries by first mapping a keyword query to a suitable domain and then translating it into a well-formatted query on that domain's integrated interface.
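
    A toy two-stage version of the idea in Python: first score each domain by how many query keywords its vocabulary covers, then assign the keywords to attributes of the chosen domain's integrated interface. The vocabularies and attribute lexicons below are illustrative stand-ins for the mappings the real system would learn.

        domains = {
            "job":  {"vocabulary": {"salary", "engineer", "resume", "beijing"},
                     "attributes": {"title": {"engineer"}, "city": {"beijing"}}},
            "book": {"vocabulary": {"author", "isbn", "novel"},
                     "attributes": {"title": {"novel"}, "author": {"author"}}},
        }

        def easy_query(keywords):
            words = {w.lower() for w in keywords}
            # Stage 1: map the keyword query to the best-covering domain.
            domain = max(domains, key=lambda d: len(words & domains[d]["vocabulary"]))
            # Stage 2: translate keywords into a structured query on that domain.
            query = {attr: words & lexicon
                     for attr, lexicon in domains[domain]["attributes"].items()
                     if words & lexicon}
            return domain, query

        print(easy_query(["Engineer", "Beijing"]))
        # -> ('job', {'title': {'engineer'}, 'city': {'beijing'}})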

  • Information Credibility on the Web
    Motivation:
    With the rapid development of the World Wide Web, the amount of information on the Web has been growing at an incredible rate. Meanwhile, the spread of the Internet and the development of electronic commerce have changed the way people retrieve information and consume. Every day, people retrieve all kinds of information from the Web, which has become their most important information source. Unfortunately, it is a double-edged sword: along with great convenience, it has also brought a series of problems. Is the World Wide Web always trustworthy? Unfortunately, the answer is no. How to discriminate trustworthy information from the mass of available information has become an increasingly prominent problem; doubt about the Web's trustworthiness has troubled people for a long time and has hampered the development of the Web. The study of information credibility is therefore required for the further development of the Web.

    The problem of information credibility on the Web is prevalent across Web applications. Erroneous, out-of-date, false, and biased information reduces the usability of these applications and can also lead to huge losses. Once untrustworthy information spreads on the Web, it takes a long time to clear away; as a result, the Web is flooded with a great deal of untrustworthy information.

    With the increasing maturity of network technology, the World Wide Web has become a huge and complex information source. By the depth of its data sources, the Web can be divided into two parts, the Surface Web and the Deep Web, and the problem of information credibility divides accordingly: credibility on the Surface Web and credibility on the Deep Web. We propose an estimation framework organized by the depth of the data sources, and study the underlying theories and methods, from data models to algorithms and evaluation criteria, to provide technical support for applications.

    Research Work
    C-Rank: an evaluation method for data record credibility on the Deep Web
    After analyzing data records on the Deep Web, we arrive at four intuitions:

    • A data record is more credible if it is provided by a more credible data source.
    • Different data sources have contributions in various degrees to the credibility of data records.
    • Credibility can be propagated between data sources through the links that connect them.
    • A data record is more credible if it is provided by more data sources.

    According to the relationships between data sources and data records on the Deep Web, we construct an S-R credibility network for each data record. This network has two types of nodes, data record nodes and data source nodes, and three kinds of edges: inner link edges, outer link edges, and edges associating data records of the same entity.

    Based on the idea of credibility propagation, we use the out-degree of a record node to calculate a local confidence score for each node. From the in-degree of a record node and the confidence scores of the adjacent data source nodes, we compute the weight of the record node. The confidence score of the record's S-R network then serves as the global confidence score of the record node.
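
    A simplified propagation loop in Python, in the spirit of C-Rank but not its exact formulas: source credibility flows to the records each source provides, and source scores are refreshed from the corroboration their records receive. The graph data and iteration count are illustrative.

        def c_rank(provides, n_iters=10):
            # provides: source -> set of record ids (edges of the S-R network).
            sources = {s: 1.0 for s in provides}
            records = {}
            for _ in range(n_iters):
                # A record is more credible if provided by more, and more credible, sources.
                records = {}
                for s, recs in provides.items():
                    for r in recs:
                        records[r] = records.get(r, 0.0) + sources[s] / len(recs)
                # A source is more credible if its records are widely corroborated.
                sources = {s: sum(records[r] for r in recs) / len(recs)
                           for s, recs in provides.items()}
            return records, sources

        provides = {"siteA": {"r1", "r2"}, "siteB": {"r1"}, "siteC": {"r1", "r3"}}
        rec_scores, src_scores = c_rank(provides)
        print(max(rec_scores, key=rec_scores.get))  # -> 'r1' (most corroborated)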

    Systems
    ScholarSpace (http://www.c-dblp.cn) is a Chinese computer science bibliography.

    OrientPrivacy is a privacy server for mobile environments.

    JobTong (http://www.jobtong.cn) is a Deep Web data integration system for job search.
    Grant
    • 2003-2005 Web Data Extraction and Integration (Principal Investigator)
      Granted by the Natural Science Foundation of China (NSFC) under grant number 60273018
    • 2002-2005 Web Service based Data Integration (Principal Investigator)
      Granted by the 863 High Technology Foundation of China under grant number 2002AA116030
    Patent
    • 2008.1.11 Vision-based Web Data Extraction System and Method (pending)
      Filed with the State Intellectual Property Office of the People's Republic of China (SIPO) under application number
      200810056103.4
    • 2008.1.11 A Web Entity Identification Method Used in the Entity Identification System (pending)
      Filed with the State Intellectual Property Office of the People's Republic of China (SIPO) under application number
      200810056102.X
    • 2008.1.11 Domain-level Web Data Integration System and Method (pending)
      Filed with the State Intellectual Property Office of the People's Republic of China (SIPO) under application number
      200810056101.5
    • 2008.1.11 An Intelligent Web Query Interface System and Method (pending)
      Filed with the State Intellectual Property Office of the People's Republic of China (SIPO) under application number
      200810056104.9
    • 2007.9.19 Wrapper Maintenance
      Granted by the State Intellectual Property Office of the People's Republic of China (SIPO) under patent number
      ZL 2004 1 0074546.8
    • 2007.7.11 Wrapper Generator
      Granted by the State Intellectual Property Office of the People's Republic of China (SIPO) under patent number
      ZL 2004 1 0074547.2
    Publication
    • X. Zhang, X. Meng, R. Chen: Differentially Private Set-Valued Data Release against Incremental Updates. Accepted for publication in the 18th International Conference on Database Systems for Advanced Applications (DASFAA 2013), April 22-25, 2013, Wuhan, China. (Regular paper)
    • M. Wang, X. Zhang, X. Meng: Inferring Sensitive Link in Large-Scale Social Networks. Journal of Frontiers of Computer Science and Technology.
    • W. Tong, W. Chen, X. Meng: EDM: An Efficient Algorithm for Event Detection in Microblogs. Journal of Frontiers of Computer Science and Technology, Vol. 6(12): 1076-1086, 2012.
    • R. Ma, X. Meng, Z. Wang: Preserving Privacy on the Searchable Internet. International Journal of Web Information Systems, Vol. 8(3): 322-344, 2012.
    • J. Wen, X. Meng, X. Hao and J. Xu: An Efficient Approach for Continuous Density Queries. Frontiers of Computer Science, Vol. 6(5): 581-595, 2012.
    • R. Ma, X. Meng, Z. Wang: Preserving Privacy on the Searchable Internet. In Proceedings of the 13th International Conference on Information Integration and Web-based Applications & Services (iiWAS2011): 238-245, December 5-7, 2011, Ho Chi Minh City, Vietnam.
    • W. Chen, Z. Wang, S. Yang, P. Zhang, X. Meng: ScholarSpace: An Academic Space for Computer Science Researchers (Demonstration). Journal of Computer Research and Development, Vol. 48(suppl.): 395-399, 2011. (NDBC2011, Shanghai)
    • W. Liu, X. Meng, W. Meng: ViDE: A Vision-Based Approach for Deep Web Data Extraction. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(3): 447-460, 2010.
    • W. Liu, X. Meng: A Holistic Solution for Duplicate Entity Identification in Deep Web Data Integration. In Proceedings of the 6th International Conference on Semantics, Knowledge & Grids (SKG2010): 267-274, Nov. 1-3, 2010, Ningbo, China.
    • Y. Li, X. Zhang, X. Meng: Exploring desktop resources based on user activity analysis. In Proceedings of the 33rd Annual International ACM Special Interest Group on Information Retrieval Conference on Research and Development in Information Retrieval (SIGIR2010): 700, July 19-23, 2010, Geneva, Switzerland.
    • Y. Li, D. Elsweiler, X. Meng: Towards Task-Organised Desktop Collections. In Proceedings of the ACM SIGIR Workshop on Desktop Search: Understanding, Supporting, and Evaluating Personal Data Search (DS2010): 21-24, July 23, 2010, Geneva, Switzerland.
    • W. Liu, X. Meng, J. Yang, J. Xiao: Duplicate Identification in Deep Web Data Integration. In Proceedings of the 11th International Conference on Web-Age Information Management (WAIM2010): 5-17, July 15-17, 2010, Jiuzhaigou, China.
    • Y. Kou, Y. Li, X. Meng: DSI: A Method for Indexing Large Graphs Using Distance Set. In Proceedings of the 11th International Conference on Web-Age Information Management (WAIM2010): 297-308, July 15-17, 2010, Jiuzhaigou, China.
    • Y. Li, X. Meng: Supporting Context-based Query in Personal DataSpace. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM2009): 1437-1440, November 2-6, 2009, Hong Kong, China. (short paper)
    • J. Ai, Z. Wang and X. Meng: C-Rank: A Credibility Evaluation Method for Deep Web Records. In Proceedings of the 26th National Database Conference of China (NDBC2009): 257-264, October 2009, Nanchang, China (in Chinese).
    • Y. Kou, Y. Li, X. Zhang, J. Zhao and X. Meng: A Strategy for Task Mining in Personal Dataspace Management. Journal of Computer Research and Development, Vol. 46 Suppl.: 446-452, 2009 (in Chinese).
    • Y. Li, X. Meng: Exploring Personal CoreSpace for DataSpace Management. In Proceedings of the Fifth International Conference on Semantics, Knowledge and Grid (SKG2009): 168-175, October 12-14, 2009, Zhuhai, China.
    • Y. Li, X. Meng, Y. Kou: An Efficient Method for Constructing Personal DataSpace. In Proceedings of the 6th Web Information Systems and Applications Conference (WISA2009): 3-8, September 18-20, 2009, Xuzhou, China. (WISA 2009 Best Paper Award)
    • F. Jiang, W. Meng, X. Meng: Selectivity Estimation for Exclusive Query Translation in Deep Web Data Integration. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA 2009): 595-600, April 21-23, 2009, Brisbane, Australia. (Short paper)
    • Z. Wang, J. Ai, X. Meng: A Data Driven Approach for Automatic Wrapper Generation and Maintenance. Journal of Computer Research and Development, Vol. 43 Suppl., 2008 (NDBC2008, Guilin) (in Chinese).
    • F. Jiang, X. Meng, L. Jia: Uncertain Schema Matching in Deep Web Integration Service. Chinese Journal of Computers, Vol. 31, No. 8, 2008.
    • W. Liu, X. Meng and Y. Lin: A Graph-based Approach for Web Database Sampling. Journal of Software, Vol. 19, 2008.
    • Y. Lin, X. Meng and W. Liu: An Attributes Correlation Based Approach for Estimating Size of Web Databases. Journal of Software, Vol. 19, 2008.
    • F. Jiang, L. Jia, X. Meng: Minimal-Superset-Based Query Translation in Deep Web Data Integration. Journal of Computer Research and Development, Vol. 44 Suppl.: 23-28, October 2007 (NDBC2007, Haikou) (in Chinese).
    • F. Jiang, L. Jia, X. Meng: Query Translation on the Fly in Deep Web Integration. Wuhan University Journal of Natural Sciences, Vol. 12, No. 5: 819-824, September 2007 (4th Web Information System and Application Conference (WISA2007), Beijing).
    • W. Liu, X. Li, X. Meng, et al: A Deep Web Integration System for Job Search. Wuhan University Journal of Natural Sciences, 11(5): 1197-1201, Nov. 2006 (3rd Web Information System and Application Conference (WISA2006), Nanjing, Nov 3-5, 2006).
    • W. Liu, C. Lin and X. Meng: Web Database Query Interface Annotation Based on User Collaboration. Wuhan University Journal of Natural Sciences, 11(5): 1403-1406, Nov. 2006 (3rd Web Information System and Application Conference (WISA2006), Nanjing, Nov 3-5, 2006).
    • Y. Li, X. Meng, Q. Li, L. Wang: Hybrid Method for Automated News Content Extraction from the Web. In Proceedings of the 7th International Conference on Web Information Systems Engineering (WISE2006), pages 327-338, Wuhan, China, October 2006.
    • W. Liu, X. Meng: Web Database Integration. In Proceedings of the Ph.D. Workshop in conjunction with VLDB 06 (VLDB-PhD2006), Seoul, Korea, September 11, 2006.
    • W. Liu, X. Meng, W. Meng: Vision-based Web Data Records Extraction. In Proceedings of the 9th SIGMOD International Workshop on Web and Databases (SIGMOD-WebDB2006), Chicago, Illinois, June 30, 2006. (12/48=25%)
    • Y. Ling, X. Meng, W. Meng: Automated Extraction of Hit Numbers from Search Result Pages. In Proceedings of the Seventh International Conference on Web-Age Information Management (WAIM2006), pages 73-84, Hong Kong, China, 17-19 June 2006. Lecture Notes in Computer Science 4016, Springer 2006.
    • Y. Li, X. Meng, L. Wang, Q. Li: RecipeCrawler: Collecting Recipe Data from WWW Incrementally. In Proceedings of the Seventh International Conference on Web-Age Information Management (WAIM2006), pages 263-274, Hong Kong, China, 17-19 June 2006. Lecture Notes in Computer Science 4016, Springer 2006.
    • D. Hu and X. Meng: Automatically Extracting Data from Data-Rich Web Pages. In Proceedings of the 10th International Conference on Database Systems for Advanced Applications (DASFAA 2005), pages 828-839, Beijing, China, April 17-20, 2005. Lecture Notes in Computer Science 3453, Springer. (Full paper)
    • C. Lin, Q. Zhang, X. Meng, W. Liu: Postal Address Detection from Web Documents. In Proceedings of the ICDE International Workshop on Challenges in Web Information Retrieval and Integration (ICDE-WIRI2005), pages 40-45, Tokyo, Japan, April 8-9, 2005.
    • X. Meng, D. Hu, C. Li: Schema-Guided Wrapper Maintenance for Web-Data Extraction. In Proceedings of the ACM Fifth International Workshop on Web Information and Data Management (WIDM 2003), pages 1-8, New Orleans, Louisiana, USA, November 7-8, 2003.
    • X. Meng, H. Wang, D. Hu, M. Gu: SG-WRAM Schema Guided Wrapper Maintenance: A Demonstration. In Proceedings of the 19th International Conference on Data Engineering (ICDE2003), pages 750-752, Bangalore, India, March 5-8, 2003.
    • X. Meng, H. Lu, et al.: Data Extraction from the Web Based on Pre-defined Schema. JCST, Vol. 17(4): 377-388, 2002.
    • X. Meng, H. Lu, et al.: SG-WRAP: A Schema-Guided Wrapper Generator. In Proceedings of the 18th International Conference on Data Engineering (ICDE2002), pages 331-332, San Jose, CA, 26 February - 1 March 2002.
    Maintained by Zhongyuan Wang. Copyright © 2007-2009 WAMDM. All rights reserved.