Supplementary material to our article

Segmentation for Efficient Supervised Language Annotation with an Explicit Cost-Utility Tradeoff

Matthias Sperber, Mirjam Simantzik, Graham Neubig, Satoshi Nakamura, Alex Waibel
Transactions of the Association for Computational Linguistics (TACL); April 2014
Link to the PDF

Data

Transcription Task

Japanese Word Segmentation Task

For the Japanese Word Segmentation Task, we used the Balanced Corpus of Contemporary Written Japanese (BCCWJ) as our dataset. Specifically, we used the internet Q&A subcorpus as in-domain data and the white paper subcorpus as background data. We do not have the rights to redistribute this corpus and therefore cannot provide our experimental data here.

Software