Tools for filtering and cleaning parallel and monolingual corpora, in order to train better (neural) machine translation systems and to support other natural language processing tasks.
These tools are inspired by the Data Filtering and Data Pre-processing sections of Tilde’s WMT17 paper. This repository includes some of the more basic scripts, which help remove most of the junk from parallel corpora.
pip install subword-nmt
pip install langid
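To illustrate the kind of basic junk removal described above, here is a minimal sketch in Python. It is not one of the repository's scripts: the function name `filter_pairs`, the `max_ratio` threshold, and the specific checks (empty sides, untranslated copies, duplicates, length ratio) are assumptions chosen for illustration. The optional `detect` argument is a hypothetical hook for a language identifier; with `langid` installed, `lambda s: langid.classify(s)[0]` could be passed in, since `langid.classify` returns a `(language, score)` tuple.

```python
def filter_pairs(pairs, max_ratio=3.0, detect=None, src_lang=None, tgt_lang=None):
    """Yield (source, target) pairs that pass a few basic sanity checks.

    All checks and defaults here are illustrative assumptions,
    not the behaviour of the scripts in this repository.
    """
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs with an empty side.
        if not src or not tgt:
            continue
        # Drop pairs where the source was copied to the target untranslated.
        if src == tgt:
            continue
        # Drop exact duplicates of a pair that was already kept.
        if (src, tgt) in seen:
            continue
        # Drop pairs whose token counts differ by more than max_ratio.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Optional language check; `detect` maps a string to a language code.
        if detect is not None:
            if src_lang and detect(src) != src_lang:
                continue
            if tgt_lang and detect(tgt) != tgt_lang:
                continue
        seen.add((src, tgt))
        yield src, tgt


pairs = [
    ("Hello world", "Sveika pasaule"),
    ("Hello world", "Sveika pasaule"),  # duplicate
    ("Click here", "Click here"),       # untranslated copy
    ("", "Tukša rinda"),                # empty source
    ("Yes", "Šis ir ļoti garš teikums bez atbilstoša avota"),  # 1 vs 8 tokens
]
kept = list(filter_pairs(pairs))
print(kept)
```

Writing the filter as a generator means a large corpus can be streamed pair by pair; only the set of pairs already kept is held in memory for duplicate detection.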
If you use this tool, please cite the following paper:
Matīss Rikters (2018). “Impact of Corpora Quality on Neural Machine Translation.” In Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018).
@inproceedings{Rikters2018BalticHLT,
author = {Rikters, Matīss},
booktitle={Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018)},
title = {Impact of Corpora Quality on Neural Machine Translation},
address={Tartu, Estonia},
year = {2018}
}