Application for NTCIR WEB Document Data for Research Purposes

Outline of data

The data comprises a vast number of HTML and plain-text files gathered from the Web, mainly in Japanese and English, with some in other languages.

  1. NW100G-01 (used at NTCIR-3 WEB and NTCIR-4 WEB)
    Web document data set of about 11 million pages (100 GB), gathered in 2001. It consists of list files containing collected site names, document URLs and links, and document files containing raw data, EUC code data and text data. For more details, please refer to "NTCIR Project NTCIR-4 WEB (Web Retrieval Test Collection) Research Purpose Use of Test Collection".
  2. NW1000G-04 (used at NTCIR-5 WEB)
    Web document data set of about 100 million pages (1.4 TB), gathered from 2004 to 2005. It consists of list files containing collected site names, document URLs, links and anchor texts, and document files containing raw data, EUC code data, text data and morphological analysis result data. For more details, please refer to "NTCIR Project NTCIR-5 WEB (Web Retrieval Test Collection) Research Purpose Use of Test Collection".

Conditions to use data

The use of the data is restricted to research purposes.

For more details, please read the "Agreement (sample)" below.

Application procedure

Please follow the procedure below. The data is available free of charge.

  1. Read the "Agreement (sample)" for the corresponding document data carefully. Then fill out the Application for Usage of "NTCIR-WEB Document Data" and send it to idr [at] nii.ac.jp by e-mail.

  2. After checking the application, we will inform you whether the data is available (this may take a few days). Please understand that, depending on your intended use, we may be unable to provide the data.

  3. NII and the user conclude an agreement.

    1. We will send you the Agreement Form (PDF format) filled out with the information from your application form. Print two double-sided copies, sign (and seal, if available) both copies, and send them by postal mail or courier to the following address:
      IDR office
      National Institute of Informatics
      2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo
      101-8430, JAPAN
      Phone: +81-3-4212-2009
    2. After NII has countersigned, one copy of the agreement will be sent back to you. Please keep it in a safe place.
  4. We will provide the data to you after the agreement is concluded.

Data provision

The data will be provided as a download from the IDR Web server. If you cannot download the data for technical reasons, please contact us.

Documents

NTCIR mailing list

The NTCIR Project distributes announcements via a mailing list. If you are interested, please visit the "NTCIR Project Mailing lists" page and register.