Data Set List

This page lists the data sets that NII provides to researchers in informatics-related fields. Some of the data sets are still under preparation.

Updated: 2011-01-06

Yahoo! Data Set

Data sets that NII received from Yahoo! Japan Corporation and provides to researchers.

  1. "Yahoo! Chiebukuro" Data (2nd edition)
  2. "Yahoo! Blog" Data (under preparation)

NTCIR Test Collection

Test collections built by the NTCIR Project, which is organized by NII. IDR provides the following test collections. For other test collections provided by the NTCIR secretariat, please refer to "Test Collections".

  1. NTCIR-8 CQA Test Collection
    • Task Data (search topics and relevance judgements)
    Note: "Yahoo! Chiebukuro" data is used as document data.
  2. NTCIR WEB Test Collection
    • Document Data
      1. NW100G-01 (used at NTCIR-3 WEB and NTCIR-4 WEB)
      2. NW1000G-04 (used at NTCIR-5 WEB)
    • Task Data (search topics and relevance judgements)
      1. NTCIR-3 WEB
      2. NTCIR-4 WEB
      3. NTCIR-5 WEB

Speech Corpus

Speech corpora that the Speech Resources Consortium, established at NII, received from various institutions and groups. For the time being, these are provided by the Speech Resources Consortium.

  1. Priority Area Project on "Spoken Language" - Grant-in-Aid for Developmental Scientific Research on "Speech Database" Continuous Speech Corpus (PASL-DSR)
  2. University of Tsukuba Multilingual Speech Corpus (UT-ML)
  3. Tohoku University - Matsushita Isolated Word Database (TMW)
  4. GSR(A) "Regional Difference in Spoken Japanese Dialects" Spoken Japanese Dialect Corpus (GSR-JD)
  5. Real World Computing Project (RWCP) Speech Corpora
    1. RWCP-SP96 Spoken Dialogue Database (1996 edition)
    2. RWCP-SP97 Spoken Dialogue Database (1997 edition)
    3. RWCP-SP99 News Speech Database for Information Retrieval and Speech Summarization Research
    4. RWCP-SP01 Meeting Speech Corpus
  6. RWCP Real Environment Speech and Acoustic Database (RWCP-SSD)
  7. Priority Area "Spoken Dialogue" Spoken Dialogue Corpus (PASD)
  8. CIAIR Children Voice Speech Corpus (CIAIR-VCV)
  9. IPSJ SIG-SLP Corpora and Environments for Noisy Speech Recognition (CENSREC)
    1. CENSREC-1 (AURORA-2-J) Noisy Speech Recognition Evaluation Environment
    2. CENSREC-1-C Noisy Speech Detection Evaluation Environment
    3. CENSREC-2 In-car Connected Digit Data and Environment for Noisy Speech Recognition
    4. CENSREC-3 In-car Isolated Word Data and Environment for Noisy Speech Recognition
    5. CENSREC-4 Reverberant Speech Recognition Evaluation Environment
  10. Priority Areas "Advanced Utilization of Multimedia to Promote Higher Education Reform" Speech Database (UME)
    1. UME-ERJ English Speech Database Read by Japanese Students
    2. UME-JRF Japanese Speech Database Read by Foreign Students
  11. RIKEN Spoken Dialogue Corpus (Word processing task, Japanese) (RIKEN-DLG)
  12. Japanese Map Task Dialogue Corpus (MapTask)
  13. Utsunomiya University Spoken Dialogue Database for Paralinguistic Information Studies (UUDB)
  14. Japanese Phonetically-balanced Word Speech Database (ETL-WD)
  15. Speech Database for the 1991 Tsuruoka Survey (Tsuruoka91)
  16. ASJ Japanese Newspaper Article Sentences Read Speech Corpus (JNAS)
  17. Japanese Newspaper Article Sentences Read Speech Corpus of the Aged (S-JNAS)
  18. ASJ Continuous Speech Corpus for Research (ASJ-JIPDEC)
  19. NTT - Tohoku University Familiarity-controlled Word Lists (FW03)
  20. NTT - Tohoku University Familiarity-controlled Word Lists 2007 (FW07)
  21. NTT Infant Speech Database (INFANT)
  22. JEIDA Japanese Common Speech Data Corpus (JEIDA-JCSD)
  23. JEIDA Noise Database (JEIDA-NOISE)

Video Database

Video databases for evaluating video processing, built by VDBWG, SIG-PRMU, IEICE. New applications are currently not being accepted. When distribution resumes, it will be announced on this site.