Application for NTCIR WEB Document Data for Research Purposes
Outline of data
The data comprise a vast number of HTML and plain text files gathered from the Web, mainly in Japanese and English, but occasionally in other languages.
- NW100G-01 (used at NTCIR-3 WEB and NTCIR-4 WEB)
Web document data set of about 11 million pages (100 GB), gathered in 2001. It consists of list files containing collected site names, document URLs and links, and of document files containing raw data, EUC code data and text data. For more details, please refer to "NTCIR Project NTCIR-4 WEB (Web Retrieval Test Collection) Research Purpose Use of Test Collection".
- NW1000G-04 (used at NTCIR-5 WEB)
Web document data set of about 100 million pages (1.4 TB), gathered from 2004 to 2005. It consists of list files containing collected site names, document URLs, links and anchor texts, and of document files containing raw data, EUC code data, text data and morphological analysis result data. For more details, please refer to "NTCIR Project NTCIR-5 WEB (Web Retrieval Test Collection) Research Purpose Use of Test Collection".
Conditions to use data
The use of the data is restricted to research purposes.
For more details, please read "Agreement (sample)" below.
Application procedure
Please follow the procedure below. The data is available free of charge.
- Read the "Agreement (sample)" for the corresponding document data carefully. Then fill out the Application for Usage of "NTCIR-WEB Document Data" and send it to idr [at] nii.ac.jp by e-mail.
- After checking the application, we will inform you whether the data can be made available (this may take a few days). Please understand that, depending on your intended use, we may be unable to provide the data.
- NII and the user conclude an agreement.
- We will send you the Agreement Form (PDF format), filled out with the information from your application form. Print two double-sided copies, sign both (and affix your seal, if you have one), and send them by postal mail or courier to the following address:
IDR office
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo
101-8430, JAPAN
Phone: +81-3-4212-2009
- After NII has countersigned, one copy of the agreement will be sent back to you. Please keep it in a safe place.
- We will provide you with the data after the agreement is concluded.
Data provision
The data will be provided for download from the IDR Web server. If you cannot download the data for technical reasons, please contact us.
Documents
- Application for Usage of "NTCIR-WEB Document Data"
- Agreement (sample)
NTCIR mailing list
The NTCIR Project distributes announcements via a mailing list. If you are interested, please visit the "NTCIR Project Mailing lists" page and register.