A Large-scale Dataset for Web Data Extraction
1. Why (do we need the dataset)
More and more information is being published in the form of Web pages. As a result, Web data extraction (how to extract structured data from Web pages) has long been a challenging problem in the field of Web data management, and a large body of work has been published on it.

However, there are few common datasets that can serve as an experimental testbed for work in this field. Because different works use different datasets, it is difficult to compare their performance. For this reason, we provide a large-scale dataset consisting of thousands of Web pages gathered from Completeplanet. We contribute this dataset in the hope that researchers in this field will adopt it as a common testbed.

2. What (is the dataset)
This dataset is intended for Web data extraction. All the Web pages in the dataset were gathered from Completeplanet (http://www.completeplanet.com), the largest Deep Web repository. Completeplanet covers many real-world domains and provides domain categories.

The contents of the dataset are listed as follows:

  • More than 4,000 Deep Web sites (Web databases)
  • More than 20,000 Web pages (the query result pages of the Deep Web sites)
  • More than 300,000 structured data records contained in these Web pages

The whole dataset is compressed into 10 zip files, each containing 400 Deep Web sites. The ten zip files can be downloaded from the following hyperlinks:

You can also download the entire dataset via the following link:

We believe that a dataset of this scale is sufficient to cover the wide range of situations that arise in Web data extraction.
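
Once the zip files have been downloaded, they can be unpacked in one pass. The following is a minimal Python sketch, not part of the dataset itself. It assumes the archives have been saved into a single local folder; the function name extract_archives and the folder names are hypothetical, and the internal layout of the archives (for example, one top-level folder per Deep Web site) is an assumption that should be checked against the actual files.

```python
import zipfile
from pathlib import Path


def extract_archives(archive_dir, target_dir):
    """Extract every .zip file in archive_dir into its own subfolder of target_dir.

    Returns a dict mapping each archive file name to the number of
    top-level entries it produced after extraction.
    """
    archive_dir = Path(archive_dir)
    target_dir = Path(target_dir)
    counts = {}
    for archive in sorted(archive_dir.glob("*.zip")):
        # Each archive gets its own output folder, named after the zip file.
        out_dir = target_dir / archive.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out_dir)
        # Assumption: each top-level entry corresponds to one Deep Web site.
        counts[archive.name] = sum(1 for _ in out_dir.iterdir())
    return counts
```

Calling extract_archives("downloads", "dataset") would, under these assumptions, place the contents of each archive in its own subfolder of "dataset" and report how many top-level entries each one contained, which can be compared against the 400 sites per zip file stated above.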

Additional information
This dataset was created at the end of 2005 by Liu Wei, Lin Can, Jia Linlin, Hu Haoyu and Hu Zhong. The pages were gathered manually, which made this very labor-intensive work.

The original dataset was actually larger than the current version. A small part of it was removed by antivirus software because some Web pages were infected with viruses. As a result, the current version is safe.

If you encounter any problems or extend the dataset, please contact us: zhywang (at) ruc (dot) edu (dot) cn

The dataset is owned by the WAMDM lab (http://idke.ruc.edu.cn/). Please include a copyright acknowledgment in your papers when using it.

Maintained by WAMDM Administrator. Copyright © 2007-2008 WAMDM. All rights reserved.