tacl highlight

usage: tacl highlight [-h] [-v] [-m NGRAMS] (-n NGRAMS | -r RESULTS)
                      [-l LABEL] [-t {cbeta,latin,pagel}]
                      CORPUS BASE_NAME OUTPUT

Output an HTML report for each witness to a work, showing the text of that
witness with supplied n-grams visually highlighted.

positional arguments:
  CORPUS                Path to corpus.
  BASE_NAME             Name of work to display.
  OUTPUT                Directory to output report to.

options:
  -h, --help            show this help message and exit
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)
  -m NGRAMS, --minus-ngrams NGRAMS
                        Path to file containing n-grams (one per line) to
                        remove highlighting from. This applies only when -n is
                        used. (default: None)
  -n NGRAMS, --ngrams NGRAMS
                        Path to file containing n-grams (one per line) to
                        highlight. This option may be specified multiple
                        times; the n-grams in each file will be displayed in a
                        distinct colour. (default: None)
  -r RESULTS, --results RESULTS
                        Path to CSV results; creates heatmap highlighting.
                        (default: None)
  -l LABEL, --label LABEL
                        Label used to identify the n-grams from a file
                        specified by -n/--ngrams. This option may be specified
                        multiple times, and must be provided as many times as
                        the -n/--ngrams option. (default: None)
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)

There are two possible outputs available, depending on whether the -n or -r
option is specified.

If n-grams are supplied via the -n/--ngrams option, the resulting HTML reports
show the specified work's witness texts with those n-grams highlighted. Any
n-grams that are specified via the -m/--minus-ngrams option will have their
constituent tokens unhighlighted. The -n/--ngrams option may be specified
multiple times; each file's n-grams will be highlighted in a distinct colour.
The -l/--label option can be used with -n/--ngrams in order to provide labels
for groups of n-grams. There must be as many instances of -l/--label as there
are of -n/--ngrams. The order of the labels matches the order of the n-grams
files.

If results are supplied via the -r/--results option, the resulting HTML
reports contain an interactive heatmap of the results, allowing the user to
select which witnesses' matches should be highlighted in the text. Multiple
selections are possible, and the colour of the highlight of a token reflects
how many witnesses have matches containing that token.

examples:

  tacl highlight -r intersect.csv corpus/stripped/ T0001 report_dir

  tacl highlight -n author_markers.csv corpus/stripped/ T0001 report_dir

  tacl highlight -n Dhr_markers.csv -n ZQ_markers.csv corpus/stripped/ -l Dharmaraksa -l "Zhi Qian" T0474 report_dir
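
The -n, -l and -m options can be combined in a single call. The following is a minimal sketch of that, not a tested recipe: the file names, the labels and the placeholder strings in the n-gram files are all hypothetical, and the corpus path, work name and output directory are simply reused from the examples above. It writes one n-gram per line to each file, passes one label per -n file in the same order, and only runs tacl if it is installed and the corpus directory exists.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# One n-gram per line. These strings are placeholders, not real n-grams.
printf '%s\n' 'placeholder-a' 'placeholder-b' > "$workdir/first_markers.txt"
printf '%s\n' 'placeholder-c' > "$workdir/second_markers.txt"

# N-grams listed here have their highlighting removed (applies only with -n).
printf '%s\n' 'placeholder-b' > "$workdir/minus.txt"

# Two -n options, so exactly two -l options, given in the same order.
set -- highlight \
  -n "$workdir/first_markers.txt" -n "$workdir/second_markers.txt" \
  -l "First group" -l "Second group" \
  -m "$workdir/minus.txt" \
  corpus/stripped/ T0474 report_dir

# Run tacl only when it is available; otherwise just show the command.
if command -v tacl >/dev/null 2>&1 && [ -d corpus/stripped/ ]; then
  tacl "$@"
else
  echo "would run: tacl $*"
fi
```

Keeping each label next to its position in the -n list makes it easier to check that the counts match before running the command.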