The 5th edition of the "Corpus of Spontaneous Japanese" has been released. It is distributed on a single USB flash drive. The data total about 90 GB and comprise the following data and tools.
See: Major changes in the 5th edition
A total of 3302 recordings. Approximately 90% are monologues, with the remaining 10% consisting of conversations, recitations, and re-readings. (Voice samples can be found here)
All speech data has been transcribed. Two types of transcriptions are included: a unified kanji and kana mixed transcription, and katakana transcriptions for showing phonetic details.
All transcriptions are part-of-speech analyzed in terms of two morphological units: "Short Unit Words" (SUW), which correspond roughly to the headwords found in dictionaries, and "Long Unit Words" (LUW), which correspond to compound or composite words.
SUWs extracted from the "Corpus of Spontaneous Japanese" are compiled into an electronic dictionary.
The transcriptions are segmented at clause boundaries, with grammatical labelling and classification of the clauses.
Listeners' subjective impressions of the recorded data.
Labels of consonants and vowels, as well as linguistic-standard encoding of intonation labels (X-JToBI). This data is provided in Xwaves format (for use in Xwaves and Wavesurfer) and TextGrid format (for use in Praat). F0 information used in labelling is also included.
Modifier-modified (dependency) information among small syntactic units (bunsetsu). The maximum domain of analysis is the clause, as delimited by the clause boundary annotation described above.
Free summaries of the contents of the speech, and excerpts of 10% to 50% of the transcribed text.
Discourse segmentation based on the inferred intention of the speaker.
The majority of the above information is integrated into XML format.
Statistical models for use in speech recognition research.
Contains information on speakers (3302 entries in total, 1417 unique speakers), such as gender, date of birth, age at the time of recording (in 5 year increments), place of birth, and residential history.
20 different electronic documents are included.
It is possible to listen to the relevant recordings while browsing the transcriptions. Simple acoustic analysis can also be performed. (Sample image)
In addition to simple full-text search, it is possible to use morphological (POS) information for the search. It is also possible to play back speech. (How to use CSJ in Himawari, in Japanese)
This tool visualizes the dependency structure information. The classifications of clauses and the important sentences are also shown. (Sample image)
Sample image: Dependency structure, clause boundary, and important sentence information viewer
Dependency structure information is displayed based on the internal clause boundaries. Information on important sentences is displayed at the top, along with timing information for the clause. "2_50p" and "3_50p" mean that the 2nd and 3rd workers chose to include this clause in the summary task at the 50% level. The clause classification results are displayed within the text of the dependency structures. The recording can also be played using the "Play (再生)" button. This tool runs in a web browser (IE 6 in the image).
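As a rough illustration of how labels such as "2_50p" could be read programmatically, the following Python sketch splits a label into a worker number and a summary level. The general pattern "<worker>_<level>p" is an assumption inferred only from the two examples above; the actual viewer may use additional label forms, and the names `SummaryLabel` and `parse_summary_label` are hypothetical.

```python
import re
from typing import NamedTuple


class SummaryLabel(NamedTuple):
    """One important-sentence label (hypothetical structure)."""
    worker: int          # which worker selected the clause
    level_percent: int   # summary level at which it was selected


# Assumed pattern: "<worker>_<level>p", e.g. "2_50p".
_LABEL_RE = re.compile(r"^(\d+)_(\d+)p$")


def parse_summary_label(label: str) -> SummaryLabel:
    """Parse a label such as "2_50p" into (worker=2, level_percent=50)."""
    match = _LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized summary label: {label!r}")
    return SummaryLabel(int(match.group(1)), int(match.group(2)))


# The two labels mentioned in the description of the sample image.
labels = ["2_50p", "3_50p"]
parsed = [parse_summary_label(s) for s in labels]

# Workers who included this clause at the 50% summary level.
workers_at_50 = sorted(p.worker for p in parsed if p.level_percent == 50)
print(workers_at_50)  # [2, 3]
```

A label that does not fit the assumed pattern raises `ValueError` rather than being silently skipped, so unexpected label forms in the real data would surface immediately.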