usage: tacl normalise [-h] [-t {cbeta,latin,pagel}] [-v] CORPUS MAPPING OUTPUT
Create a copy of a corpus normalised according to a supplied mapping.
positional arguments:
  CORPUS                Directory containing corpus to be normalised.
  MAPPING               Path to mapping file.
  OUTPUT                Directory to output normalised corpus to.

options:
  -h, --help            show this help message and exit
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)
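
The usage line above could be assembled from a script roughly as follows. This is a minimal sketch: the helper name build_normalise_command and the paths "corpus/", "mapping.csv" and "normalised/" are hypothetical, and only the flags shown in the usage line are used.

```python
import shlex


def build_normalise_command(corpus, mapping, output, tokenizer="cbeta", verbosity=0):
    """Return an argv list for `tacl normalise` (hypothetical helper)."""
    if tokenizer not in ("cbeta", "latin", "pagel"):
        raise ValueError(f"unknown tokenizer: {tokenizer!r}")
    command = ["tacl", "normalise", "-t", tokenizer]
    # Each repeated -v raises the verbosity by one level.
    command.extend(["-v"] * verbosity)
    # Positional arguments follow in the order CORPUS MAPPING OUTPUT.
    command.extend([corpus, mapping, output])
    return command


# Hypothetical paths, for illustration only.
command = build_normalise_command(
    "corpus/", "mapping.csv", "normalised/", tokenizer="latin", verbosity=2
)
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) on a system where tacl is installed.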
This is a generic normalisation process that is constrained only by the
possibilities of the mapping format. Lemmatisation could be performed in the
same way as normalisation of variant characters and words.
LIMITATIONS
Because the normalised forms in the mapping may only consist of a single
token, the normalisation and denormalisation processes are not able to handle
context. For example, it is not possible to reflect "ABA" -> "ACA", where the
surrounding "A"s are themselves able to be normalised.
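
The single-token constraint can be illustrated with a small check. This is a sketch under assumptions: TOKEN_PATTERN is a hypothetical stand-in for a CBETA-style tokenizer (one word character, or a cluster in square brackets), not the pattern tacl itself uses, and is_single_token is an illustrative name.

```python
import re

# Hypothetical stand-in for a CBETA-style tokenizer: a token is either a
# bracketed cluster such as "[xy]" or a single word character.
TOKEN_PATTERN = re.compile(r"\[[^\]]*\]|\w")


def is_single_token(form):
    """Return True if `form` is exactly one token under TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(form) == [form]


print(is_single_token("C"))     # a single character is one token
print(is_single_token("[xy]"))  # a bracketed cluster is one token
print(is_single_token("ACA"))   # three tokens, so not usable as a normalised form
```

Under this assumed tokenizer, "ACA" splits into three tokens, which is why it could not serve as the normalised form for "ABA".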
FILES
The mapping file follows a simple format of comma-separated values, with each
line having at least two values. The first value is the normalised form, and
all subsequent values on the line are the unnormalised forms. During
processing, longer unnormalised forms are converted first.
The normalised form is mostly used internally, and so may be arbitrary. It may
never consist of more than a single token, however.
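
The mapping format and the longest-first ordering can be sketched as follows. This is an illustration only, not tacl's implementation: read_mapping and normalise are hypothetical names, the sample mapping is invented, and plain substring replacement is assumed in place of token-aware processing.

```python
import csv
import io


def read_mapping(handle):
    """Return {unnormalised form: normalised form} from comma-separated rows."""
    mapping = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"row needs at least two values: {row!r}")
        # The first value is the normalised form; the rest are its variants.
        normalised, *variants = row
        for variant in variants:
            mapping[variant] = normalised
    return mapping


def normalise(text, mapping):
    """Replace unnormalised forms in `text`, longest forms first.

    Assumption: simple substring replacement, one form at a time.
    """
    for variant in sorted(mapping, key=len, reverse=True):
        text = text.replace(variant, mapping[variant])
    return text


# Invented sample mapping: "X" normalises "ab" and "a"; "Y" normalises "b".
sample = io.StringIO("X,ab,a\nY,b\n")
mapping = read_mapping(sample)
print(normalise("abab", mapping))
```

Converting "ab" before "a" and "b" is what lets the two-character form match at all; if the shorter forms were replaced first, no "ab" would remain in the text to be found.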