1. Welcome to the CSJ homepage
Welcome to the English homepage of the Corpus of Spontaneous Japanese!

This homepage describes CSJ in detail.

2. What is CSJ?
CSJ, the Corpus of Spontaneous Japanese, is a large-scale annotated corpus of spontaneous Japanese. CSJ is an outcome of a Japanese national priority-area research project, Spontaneous Speech: Corpus and Processing Technology (1999-2003), supported by the Ministry of Education, Culture, Sports, Science and Technology. It is a collaborative work of the National Institute for Japanese Language (NIJLA), the Communications Research Laboratory (CRL), and the Tokyo Institute of Technology (TITech). The project supervisor is Professor Sadaoki Furui of TITech.
3. The size and structure of CSJ
3.1 The whole corpus
The whole CSJ contains about 650 hours of spontaneous speech, corresponding to about 7 million words. All of the speech material was recorded with head-worn close-talking microphones and DAT, then down-sampled to 16 kHz with 16-bit precision. The speech material is transcribed using a two-way transcription scheme designed especially for CSJ. In addition, part-of-speech (POS) analysis based on two different kinds of 'word' is applied to the whole corpus.
3.2 The Core
CSJ has a proper subset, called the Core, which contains about 500k words, or 45 hours of speech. The Core is the part of CSJ on which we concentrate the annotation effort. In addition to the two-way transcription and two-way POS analysis, segmental labels, intonation labels, and other miscellaneous annotations are provided for the Core.
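
For a rough sense of scale, the approximate figures quoted in sections 3.1 and 3.2 can be turned into a words-per-hour rate and an uncompressed audio size. The sketch below is only an illustration: the function names are hypothetical, and the single-channel (mono) setting is an assumption, since the text gives the sampling rate and sample precision but not the channel count.

```python
# Approximate figures quoted in sections 3.1 and 3.2.
WHOLE_HOURS, WHOLE_WORDS = 650, 7_000_000
CORE_HOURS, CORE_WORDS = 45, 500_000

SAMPLE_RATE_HZ = 16_000   # 16 kHz, from section 3.1
BYTES_PER_SAMPLE = 2      # 16-bit samples, from section 3.1
CHANNELS = 1              # ASSUMPTION: mono; not stated in the text


def words_per_hour(words, hours):
    """Average number of words per hour of speech."""
    return words / hours


def pcm_bytes(hours, channels=CHANNELS):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    seconds = hours * 3600
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels


if __name__ == "__main__":
    for name, hours, words in [
        ("Whole CSJ", WHOLE_HOURS, WHOLE_WORDS),
        ("Core", CORE_HOURS, CORE_WORDS),
    ]:
        rate = words_per_hour(words, hours)
        size_gb = pcm_bytes(hours) / 1e9
        print(f"{name}: about {rate:,.0f} words/hour, about {size_gb:.1f} GB of raw audio")
    print(f"Core share of all words: about {CORE_WORDS / WHOLE_WORDS:.1%}")
```

Because every input is a rounded "about" figure, the results are order-of-magnitude estimates, not properties of the distributed corpus files.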