While the CSJ-RDB (version 2.0) is based upon the XML documents in the CSJ, we are adding some supplements and corrections. The major additions and modifications are as follows.
In the 1st - 4th editions of the CSJ, the core contains speech data of various types: presentations and speeches (both at conferences and simulated), dialogues, and re-readings. For academic presentation speech and simulated public speaking, which account for the majority of the data, a large variety of annotations have been manually created. However, only part of this information is available for dialogues and re-readings. In the CSJ-RDB, we have added the following new information for these types of data:
Note that the classification standards for clause units have been extended to cover the clause unit information in dialogues. Please see the following for more details.
Other information corresponds with the standards of the CSJ (1st - 4th edition). Please refer to the following documents for more details.
In X-JToBI, the prosodic labelling scheme which the CSJ employs, accent phrases (AP) and intonation phrases (IP) are not explicitly expressed as units, but are rather indirectly represented by information indicating the strength of the prosodic boundary (BI information). AP and IP boundaries are indicated by BI=2 and BI=3 respectively. This information is represented in the CSJ XML documents. Though it is possible to identify the AP and the IP using BI information, this approach had problems handling disfluency phenomena such as filled pauses or hesitation. Therefore, we have reviewed the standards for classifying AP and IP to improve the handling of these phenomena. In the CSJ-RDB we explicitly indicate units certified as AP or IP under these new standards. Please see the following for more details on the criteria.
See: On the classification standards of prosodic units in the "Corpus of Spontaneous Japanese" (Proceedings of the 3rd Japanese Corpus Linguistics Workshop; in Japanese)
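To illustrate the BI-only derivation described above (not the revised CSJ-RDB standards, which are defined in the cited paper), here is a minimal Python sketch. It assumes each word carries an integer BI value marking the boundary that follows it, and that an IP boundary (BI=3) also closes the current AP. The function name and the sample data are hypothetical, and real X-JToBI labels are richer than plain integers.

```python
from typing import List, Tuple


def group_by_boundary(words: List[Tuple[str, int]], threshold: int) -> List[List[str]]:
    """Split (word, BI) pairs into units, closing a unit after every
    word whose following boundary has BI >= threshold."""
    units: List[List[str]] = []
    current: List[str] = []
    for word, bi in words:
        current.append(word)
        if bi >= threshold:
            units.append(current)
            current = []
    if current:
        # Trailing words with no closing boundary form a final unit.
        units.append(current)
    return units


# Hypothetical sample: (word, BI of the boundary after the word).
sample = [("w1", 1), ("w2", 2), ("w3", 1), ("w4", 3), ("w5", 2)]

accent_phrases = group_by_boundary(sample, threshold=2)      # AP: BI >= 2
intonation_phrases = group_by_boundary(sample, threshold=3)  # IP: BI >= 3
```

Because this grouping looks only at BI values, a filled pause or hesitation is simply absorbed into whichever unit it happens to fall in, which is the kind of case the revised standards were introduced to handle.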
The information on phones, phonemes, and moras in the CSJ XML documents is mainly automatically generated from the provided prosodic and segmental sound labels. However, there were some problems with the phoneme certification rules used up to now, resulting in some phonetically inappropriate unit assignments. Therefore, while retaining the hierarchical structure of the phonetic units (phone < phoneme < mora < word), certain assignment criteria were changed for phonemes (and phones) in order to obtain more phonetically reasonable units. Please see the following for more information on the changes.
The start and end times of IPU elements in the CSJ XML documents (an IPU is defined in the transcriptions as a unit bounded by pauses of more than 200 ms) are created based on the IPU timing information in the transcription texts. However, in some cases the pause units documented during transcription extended beyond the bounds of the actual pause in speech. Therefore, in order to maintain the hierarchy of the prosodic units in the CSJ-RDB (phone < phoneme < mora < word < IPU), the actual start and end times of the IPUs were determined based on the timing information of the surrounding phones.
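One way to picture this correction is the following Python sketch. It assumes that the phones belonging to an IPU are available as (start, end) pairs in seconds, and it takes the IPU span to be the earliest phone start and the latest phone end. The function name and the sample values are hypothetical, and the actual procedure used for the CSJ-RDB may differ in detail.

```python
from typing import List, Tuple


def ipu_span_from_phones(phones: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return the (start, end) of an IPU as the earliest phone start
    and the latest phone end among the phones it contains."""
    if not phones:
        raise ValueError("an IPU must contain at least one phone")
    start = min(s for s, _ in phones)
    end = max(e for _, e in phones)
    return start, end


# Hypothetical IPU: the transcription gave 0.30-0.70 s, but the
# phones inside it only cover 0.35-0.60 s.
phones = [(0.35, 0.41), (0.41, 0.52), (0.52, 0.60)]
start, end = ipu_span_from_phones(phones)
```

Deriving the span this way guarantees that every phone lies inside its IPU, so the containment hierarchy holds by construction.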
In the CSJ, certain characters which can provide identifying information about speakers are redacted and replaced with an "x". Originally, the method used was to simply replace the characters in the transcriptions with an equal number of "x"s. For example, in the case of the name "Kyoko", if it were written in kanji (京子) it would be replaced by "xx", while if written in katakana (キョーコ) it would be replaced with "xxxx". However, because this was based on the number of characters rather than the number of actual moras, it could cause issues with certain calculations, such as average mora length. Therefore, the method has been changed so that redacted text is replaced with a number of "x"s equal to the total number of moras being replaced.
Before the correction: "キョーコ" was replaced with "xxxx". * One "x" per character
After the correction: "キョーコ" is replaced with "xxx". * One "x" per mora
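The difference between the two methods can be sketched in Python as follows. The sketch assumes the redacted text is given as a katakana reading and uses a simplified mora rule: every character counts as one mora except the small kana (such as ャ, ュ, ョ), which merge with the preceding character. The long-vowel mark ー, the geminate ッ, and the moraic nasal ン each count as one mora. The function names are hypothetical, and text written in kanji would first need its reading from a pronunciation field.

```python
# Small kana that combine with the preceding character into one mora.
SMALL_KANA = set("ァィゥェォャュョヮ")


def count_moras(katakana: str) -> int:
    """Count moras in a katakana string under the simplified rule above."""
    return sum(1 for ch in katakana if ch not in SMALL_KANA)


def redact_by_characters(text: str) -> str:
    """Original method: one "x" per character."""
    return "x" * len(text)


def redact_by_moras(katakana: str) -> str:
    """Revised method: one "x" per mora."""
    return "x" * count_moras(katakana)


before = redact_by_characters("キョーコ")  # 4 characters
after = redact_by_moras("キョーコ")        # 3 moras: キョ / ー / コ
```

With the mora-based replacement, the length of the redacted string matches the mora count of the original speech, so statistics such as average mora length are not distorted.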