tacl normalise

usage: tacl normalise [-h] [-t {cbeta,latin,pagel}] [-v] CORPUS MAPPING OUTPUT

Create a copy of a corpus normalised according to a supplied mapping.

positional arguments:
  CORPUS                Directory containing corpus to be normalised.
  MAPPING               Path to mapping file.
  OUTPUT                Directory to output normalised corpus to.

options:
  -h, --help            show this help message and exit
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)

This is a generic normalisation process, constrained only by what the mapping
format can express. Lemmatisation, for example, could be performed in the same
way as normalisation of variant characters and words.
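
As an illustration, a corpus in a directory named corpus could be normalised
according to the mapping in mapping.csv, with the result written to a
directory named normalised (the directory and file names here are purely
illustrative):

    tacl normalise -t cbeta corpus mapping.csv normalised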

LIMITATIONS

Because the normalised forms in the mapping may only consist of a single
token, the normalisation and denormalisation processes cannot handle context.
For example, it is not possible to express a mapping such as "ABA" -> "ACA",
in which the surrounding "A"s can themselves be normalised.

FILES

The mapping file follows a simple format of comma-separated values, with each
line having at least two values. The first is the normalised form, and all
subsequent values on the line are the unnormalised forms. During processing,
longer unnormalised forms are converted first.
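
A minimal mapping file might look like the following, using abstract token
names of the same kind as the "ABA" example above (the forms themselves are
purely illustrative):

    A,B,BB
    C,D

Here the first line maps both "B" and "BB" to the normalised form "A";
because longer unnormalised forms are converted first, an occurrence of "BB"
becomes a single "A" rather than "AA". The second line maps "D" to "C".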

The normalised form is mostly used internally, and so may be arbitrary. It
must never consist of more than a single token, however.
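
As a rough sketch of the kind of mapping-driven replacement described above
(this illustrates the general technique only and is not TACL's implementation;
it works on plain strings, ignores tokenisation, and the function names
load_mapping and normalise are invented for the example):

    import csv

    def load_mapping(path):
        """Read a mapping file: the first value on each line is the
        normalised form, all subsequent values are unnormalised forms."""
        mapping = {}
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.reader(fh):
                if len(row) < 2:
                    continue
                normalised, *unnormalised = row
                for form in unnormalised:
                    mapping[form] = normalised
        return mapping

    def normalise(text, mapping):
        """Replace each unnormalised form with its normalised form,
        converting longer unnormalised forms first."""
        for form in sorted(mapping, key=len, reverse=True):
            text = text.replace(form, mapping[form])
        return text

With the example mapping above, normalising the text "ABBD" would give "AAC":
"BB" is converted to "A" before the shorter forms are considered, and "D"
becomes "C".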