Data Set List
This page lists the data sets that NII provides to researchers in informatics-related fields. Some of the data sets are still under preparation.
Updated: 2011-01-06
Yahoo! Data Set
Data sets that NII has received from Yahoo! Japan Corporation and provides to researchers.
- "Yahoo! Chiebukuro" Data (2nd edition)
- "Yahoo! Blog" Data (under preparation)
NTCIR Test Collection
Test collections built by the NTCIR Project, which is organized by NII. IDR provides the test collections listed below. For other test collections provided by the NTCIR secretariat, please refer to "Test Collections".
- NTCIR-8 CQA Test Collection
  - Task Data (search topics and relevance judgements)
- NTCIR WEB Test Collection
  - Document Data
    - NW100G-01 (used at NTCIR-3 WEB and NTCIR-4 WEB)
    - NW1000G-04 (used at NTCIR-5 WEB)
  - Task Data (search topics and relevance judgements)
    - NTCIR-3 WEB
    - NTCIR-4 WEB
    - NTCIR-5 WEB
- Document Data
Speech Corpus
Speech corpora that the Speech Resources Consortium, established at NII, has received from various institutions and groups. For the time being, these are distributed by the Speech Resources Consortium.
- Priority Area Project on "Spoken Language" - Grant-in-Aid for Developmental Scientific Research on "Speech Database" Continuous Speech Corpus (PASL-DSR)
- University of Tsukuba Multilingual Speech Corpus (UT-ML)
- Tohoku University - Matsushita Isolated Word Database (TMW)
- GSR(A) "Regional Difference in Spoken Japanese Dialects" Spoken Japanese Dialect Corpus (GSR-JD)
- Real World Computing Project (RWCP) Speech Corpora
  - RWCP-SP96 Spoken Dialogue Database (1996 edition)
  - RWCP-SP97 Spoken Dialogue Database (1997 edition)
  - RWCP-SP99 News Speech Database for Information Retrieval and Speech Summarization Research
  - RWCP-SP01 Meeting Speech Corpus
  - RWCP Real Environment Speech and Acoustic Database (RWCP-SSD)
- Priority Area "Spoken Dialogue" Spoken Dialogue Corpus (PASD)
- CIAIR Children Voice Speech Corpus (CIAIR-VCV)
- IPSJ SIG-SLP Corpora and Environments for Noisy Speech Recognition (CENSREC)
  - CENSREC-1 (AURORA-2-J) Noisy Speech Recognition Evaluation Environment
  - CENSREC-1-C Noisy Speech Detection Evaluation Environment
  - CENSREC-2 In-car Connected Digit Data and Environment for Noisy Speech Recognition
  - CENSREC-3 In-car Isolated Word Data and Environment for Noisy Speech Recognition
  - CENSREC-4 Reverberant Speech Recognition Evaluation Environment
- Priority Areas "Advanced Utilization of Multimedia to Promote Higher Education Reform" Speech Database (UME)
  - UME-ERJ English Speech Database Read by Japanese Students
  - UME-JRF Japanese Speech Database Read by Foreign Students
- RIKEN Spoken Dialogue Corpus (Word processing task, Japanese) (RIKEN-DLG)
- Japanese Map Task Dialogue Corpus (MapTask)
- Utsunomiya University Spoken Dialogue Database for Paralinguistic Information Studies (UUDB)
- Japanese Phonetically-balanced Word Speech Database (ETL-WD)
- Speech Database for the 1991 Tsuruoka Survey (Tsuruoka91)
- ASJ Japanese Newspaper Article Sentences Read Speech Corpus (JNAS)
- Japanese Newspaper Article Sentences Read Speech Corpus of the Aged (S-JNAS)
- ASJ Continuous Speech Corpus for Research (ASJ-JIPDEC)
- NTT - Tohoku University Familiarity-controlled Word Lists (FW03)
- NTT - Tohoku University Familiarity-controlled Word Lists 2007 (FW07)
- NTT Infant Speech Database (INFANT)
- JEIDA Japanese Common Speech Data Corpus (JEIDA-JCSD)
- JEIDA Noise Database (JEIDA-NOISE)
Video Database
Video databases for the evaluation of video processing, built by VDBWG, SIG-PRMU, IEICE. New applications are currently not being accepted. An announcement will be posted on this site when distribution resumes.

