usage: tacl results [-h] [-v] [-b CORPUS] [--max-be-count COUNT]
[--denormalise MAPPING] [--denormalised-corpus CORPUS]
[-e CORPUS] [--excise NGRAM] [-l LABEL]
[--min-count COUNT] [--max-count COUNT]
[--min-count-work COUNT] [--max-count-work COUNT]
[--min-size SIZE] [--max-size SIZE] [--min-works COUNT]
[--max-works COUNT] [--ngrams NGRAMS] [--reciprocal]
[--reduce] [--relabel CATALOGUE] [--remove LABEL] [--sort]
[-t {cbeta,latin,pagel}] [-z CORPUS] [--add-label-count]
[--add-label-work-count] [--collapse-witnesses]
[--group-by-ngram CATALOGUE] [--group-by-witness]
RESULTS
Modify a query results file by adding, removing or otherwise manipulating
result rows. Outputs the new set of results.
positional arguments:
RESULTS Path to CSV results; use - for stdin.
options:
-h, --help show this help message and exit
-v, --verbose Display debug information; multiple -v options
increase the verbosity. (default: None)
-e CORPUS, --extend CORPUS
Extend the results to list the largest n-grams that
also count as matches, going beyond the maximum size
recorded in the database. This has no effect if the
results contain only 1-grams. (default: None)
--excise NGRAM Remove all results whose n-gram contains the supplied
n-gram within it. (default: None)
-l LABEL, --label LABEL
Label to which the --min/max-count, --min/max-count-work
and --min/max-works requirements are restricted.
(default: None)
--min-count COUNT Minimum total count per n-gram to include. (default:
None)
--max-count COUNT Maximum total count per n-gram to include. (default:
None)
--min-count-work COUNT
Minimum count per n-gram per work to include; if a
single witness meets this criterion for an n-gram, all
instances of that n-gram are kept. (default: None)
--max-count-work COUNT
Maximum count per n-gram per work to include; if a
single witness meets this criterion for an n-gram, all
instances of that n-gram are kept. (default: None)
--min-size SIZE Minimum size of n-grams to include. (default: None)
--max-size SIZE Maximum size of n-grams to include. (default: None)
--min-works COUNT Minimum count of works containing n-gram to include.
(default: None)
--max-works COUNT Maximum count of works containing n-gram to include.
(default: None)
--ngrams NGRAMS Path to file containing n-grams (one per line) to
exclude. (default: None)
--reciprocal Remove n-grams that are not attested by at least one
work in each labelled set of works. This can be useful
after reducing a set of intersection results.
(default: False)
--reduce Remove n-grams that are contained in larger n-grams.
(default: False)
--relabel CATALOGUE Relabel results according to the supplied catalogue.
(default: None)
--remove LABEL Remove labelled results. (default: None)
--sort Sort the results. (default: False)
-t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
Type of tokenizer to use. The "cbeta" tokenizer is
suitable for the Chinese CBETA corpus (tokens are
single characters or workaround clusters within square
brackets). The "pagel" tokenizer is for use with the
transliterated Tibetan corpus (tokens are sets of word
characters plus some punctuation used to transliterate
characters). (default: cbeta)
-z CORPUS, --zero-fill CORPUS
Add rows with a count of 0 for each n-gram in each
witness of a work that has at least one witness
bearing that n-gram. (default: None)
bifurcated extend:
-b CORPUS, --bifurcated-extend CORPUS
Extend results to bifurcation points. Generates
results containing those n-grams, derived from the
original n-grams, that have a label count higher than
their containing (n+1)-grams, or that have a label
count of one and the constituent (n-1)-grams have a
higher label count. (default: None)
--max-be-count COUNT Maximum size of n-gram to extend to (default: None)
denormalise:
--denormalise MAPPING
Denormalise result n-grams using mapping at the
supplied path. The unnormalised corpus must also be
specified in the --denormalised-corpus option.
(default: None)
--denormalised-corpus CORPUS
Path to directory containing the original
(unnormalised) corpus. This option must be given along
with --denormalise in order for denormalisation to be
performed. (default: None)
format changing arguments:
These arguments change the format of the results, making them potentially
unsafe to use other operations on, or causing such operations to fail.
--add-label-count Output the supplied results with an additional column,
"label count", giving the total count for each n-gram
within the label. For each work, the maximum count
across all of that work's witnesses is used in the
sum. (default: False)
--add-label-work-count
Output the supplied results with an additional column,
"label work count", giving the total count of works
that contain the n-gram within the label. For each
work, any number of positive counts across all of that
work's witnesses is counted as one in the sum.
(default: False)
--collapse-witnesses Collapse result rows for multiple witnesses having the
same count for an n-gram. Instead of the "siglum"
column, all of the witnesses (per work) with the same
n-gram count are listed, comma separated, in the
"sigla" column. (default: False)
--group-by-ngram CATALOGUE
Group results by n-gram, providing summary information
of the works each n-gram appears in. Results are
sorted by n-gram and then order of occurrence of the
label in the supplied catalogue. (default: None)
--group-by-witness Group results by witness, providing summary
information of which n-grams appear in each witness.
(default: False)
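The behaviour described for --collapse-witnesses can be pictured with a minimal Python sketch. This is not tacl's own code: the field names ("ngram", "work", "siglum", "count", "label"), the helper name collapse_witnesses and the ", " separator are assumptions made for illustration.

```python
from collections import defaultdict


def collapse_witnesses(rows):
    """Merge rows that share n-gram, work, label and count into one row."""
    # Hypothetical helper; field names are assumed, not taken from tacl.
    grouped = defaultdict(list)
    for row in rows:
        key = (row["ngram"], row["work"], row["label"], row["count"])
        grouped[key].append(row["siglum"])
    # The "siglum" column is replaced by a comma separated "sigla" column.
    return [
        {"ngram": ngram, "work": work, "label": label, "count": count,
         "sigla": ", ".join(sigla)}
        for (ngram, work, label, count), sigla in grouped.items()
    ]


rows = [
    {"ngram": "ab", "work": "T0001", "siglum": "base", "count": 2, "label": "A"},
    {"ngram": "ab", "work": "T0001", "siglum": "alt", "count": 2, "label": "A"},
    {"ngram": "ab", "work": "T0001", "siglum": "var", "count": 3, "label": "A"},
]
collapsed = collapse_witnesses(rows)
```

Here the two witnesses with a count of 2 end up in a single row, while the witness with a count of 3 keeps its own row.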
If more than one modifier is specified, they are applied in the following
order: --extend, --bifurcated-extend, --denormalise, --reduce, --reciprocal,
--excise, --zero-fill, --ngrams, --min/max-works, --min/max-size,
--min/max-count, --min/max-count-work, --remove, --relabel, --sort. All of
the options that modify the format are performed at the end, and only one
should be specified. The one exception to this is denormalisation, which adds
a column to the results without disrupting any other operations - but see
below.
--extend applies before --reduce because it may generate results that are also
amenable to reduction.
--extend applies before --remove because it depends on there being at least
two labels in the results in order to give correct results.
The denormalisation options together produce a set of results with all
denormalised forms that occur in each witness presented, along with an extra
column, "normalised ngram", giving the normalised form each was derived from.
Since denormalised intersect results may no longer conform to normal intersect
rules (that each n-gram occurs at least once within each label), running some
further operations (such as extend) is likely to cause unwanted removal of
results.
--denormalise should always be performed (if at all) before --reduce. The
counts of the denormalised n-grams will be the full count of all instances in
a witness, even if a --reduce on the normalised results had reduced counts.
Use --reduce with care. Coupled with filters such as --max-size, --min-count,
etc., many results may be discarded without trace (since the reduce occurs
first). Note too that performing --reduce on a set of results more than once
will make the results inaccurate!
--min-count and --max-count set the range within which the total count of each
n-gram, across all works, must fall. For each work, its count is taken as the
highest count among its witnesses.
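The counting rule for --min-count and --max-count can be illustrated with a minimal Python sketch. It is not tacl's implementation: the field names and the helper filter_by_total_count are assumptions for illustration only.

```python
from collections import defaultdict


def filter_by_total_count(rows, min_count=None, max_count=None):
    """Keep rows whose n-gram's total count lies within the given range."""
    # Hypothetical helper; field names are assumed, not taken from tacl.
    # Step 1: per (n-gram, work), take the highest count among witnesses.
    work_max = defaultdict(int)
    for row in rows:
        key = (row["ngram"], row["work"])
        work_max[key] = max(work_max[key], row["count"])
    # Step 2: the total for an n-gram is the sum of those per-work maxima.
    totals = defaultdict(int)
    for (ngram, _work), count in work_max.items():
        totals[ngram] += count

    def in_range(total):
        return ((min_count is None or total >= min_count)
                and (max_count is None or total <= max_count))

    return [row for row in rows if in_range(totals[row["ngram"]])]


rows = [
    {"ngram": "ab", "work": "T0001", "siglum": "base", "count": 3},
    {"ngram": "ab", "work": "T0001", "siglum": "alt", "count": 5},
    {"ngram": "ab", "work": "T0002", "siglum": "base", "count": 2},
    {"ngram": "cd", "work": "T0001", "siglum": "base", "count": 1},
]
kept = filter_by_total_count(rows, min_count=7)
```

In this sample the total for "ab" is 5 + 2 = 7, because the two witnesses of T0001 contribute only their maximum, not their sum.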
--min-works and --max-works count works rather than witnesses.
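A minimal Python sketch of counting works rather than witnesses follows. The field names and the helper filter_by_work_count are hypothetical, and treating only positive counts as "containing" the n-gram is an assumption, not a documented detail of these options.

```python
def filter_by_work_count(rows, min_works=None, max_works=None):
    """Keep rows whose n-gram occurs in an acceptable number of works."""
    # Hypothetical helper; field names are assumed, not taken from tacl.
    works = {}
    for row in rows:
        if row["count"] > 0:  # assumption: zero-count rows do not count
            works.setdefault(row["ngram"], set()).add(row["work"])

    def in_range(number):
        return ((min_works is None or number >= min_works)
                and (max_works is None or number <= max_works))

    return [row for row in rows
            if in_range(len(works.get(row["ngram"], ())))]


rows = [
    {"ngram": "ab", "work": "T0001", "siglum": "base", "count": 1},
    {"ngram": "ab", "work": "T0001", "siglum": "alt", "count": 1},
    {"ngram": "ab", "work": "T0002", "siglum": "base", "count": 4},
    {"ngram": "cd", "work": "T0001", "siglum": "base", "count": 2},
    {"ngram": "cd", "work": "T0001", "siglum": "alt", "count": 2},
]
kept = filter_by_work_count(rows, min_works=2)
```

The n-gram "cd" has two witnesses but only one work, so it does not satisfy a minimum of two works.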
If both --min-count-work and --max-count-work are specified, only those
n-grams are kept that have at least one witness whose count falls within that
range.
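The per-witness rule can be sketched in Python as below. This is an illustration under assumed field names, not tacl's code, and filter_by_count_work is a hypothetical helper.

```python
def filter_by_count_work(rows, min_count=None, max_count=None):
    """Keep every row of an n-gram if any single witness count is in range."""
    # Hypothetical helper; field names are assumed, not taken from tacl.
    def in_range(count):
        return ((min_count is None or count >= min_count)
                and (max_count is None or count <= max_count))

    # One qualifying witness is enough to keep all rows for that n-gram.
    qualifying = {row["ngram"] for row in rows if in_range(row["count"])}
    return [row for row in rows if row["ngram"] in qualifying]


rows = [
    {"ngram": "ab", "work": "T0001", "siglum": "base", "count": 1},
    {"ngram": "ab", "work": "T0002", "siglum": "base", "count": 6},
    {"ngram": "cd", "work": "T0001", "siglum": "base", "count": 1},
    {"ngram": "cd", "work": "T0002", "siglum": "base", "count": 12},
]
kept = filter_by_count_work(rows, min_count=5, max_count=10)
```

Both "ab" rows survive, including the one with a count of 1, because one witness falls within 5 to 10. No single "cd" witness does, so both "cd" rows are removed.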
-l/--label causes --min/max-count, --min/max-count-work, and --min/max-works
to have their requirements apply within that labelled subset of results. All
n-grams, both within the subset and outside it, that meet the criteria are
kept, while all other n-grams are removed. Note that when applied to diff
results, no n-grams outside those in the labelled subset will be kept.
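One way to picture the effect of -l/--label is the following Python sketch, shown for a minimum work count only. It is a simplified reading of the paragraph above, not tacl's implementation; the field names and the helper filter_min_works_within_label are hypothetical.

```python
def filter_min_works_within_label(rows, label, min_works):
    """Apply a minimum work count within one label; keep matching n-grams."""
    # Hypothetical helper; field names are assumed, not taken from tacl.
    # Only rows carrying the chosen label are used to evaluate the criterion.
    works = {}
    for row in rows:
        if row["label"] == label and row["count"] > 0:
            works.setdefault(row["ngram"], set()).add(row["work"])
    qualifying = {ngram for ngram, found in works.items()
                  if len(found) >= min_works}
    # Rows for qualifying n-grams are kept whatever their own label is.
    return [row for row in rows if row["ngram"] in qualifying]


rows = [
    {"ngram": "ab", "work": "T0001", "label": "A", "count": 1},
    {"ngram": "ab", "work": "T0002", "label": "A", "count": 2},
    {"ngram": "ab", "work": "T0003", "label": "B", "count": 1},
    {"ngram": "cd", "work": "T0001", "label": "A", "count": 3},
    {"ngram": "cd", "work": "T0003", "label": "B", "count": 1},
    {"ngram": "cd", "work": "T0004", "label": "B", "count": 1},
]
kept = filter_min_works_within_label(rows, label="A", min_works=2)
```

The n-gram "cd" occurs in three works overall but in only one work labelled "A", so it is removed. The "ab" row labelled "B" is kept because "ab" meets the criterion within "A".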
--relabel sets the label for each result row to the label for that row's work
as specified in the supplied catalogue. If the work is not labelled in the
catalogue, the label in the results is not changed.
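The relabelling rule amounts to a lookup with a fallback, as in this minimal Python sketch. The catalogue is modelled as a plain dictionary from work to label, which is an assumption about its shape; relabel and the field names are hypothetical.

```python
def relabel(rows, catalogue):
    """Return rows with labels replaced by the catalogue's label, if any."""
    # Hypothetical helper; catalogue is assumed to map work -> label.
    return [
        {**row, "label": catalogue.get(row["work"], row["label"])}
        for row in rows
    ]


rows = [
    {"ngram": "ab", "work": "T0001", "label": "A", "count": 1},
    {"ngram": "ab", "work": "T0002", "label": "A", "count": 2},
]
relabelled = relabel(rows, {"T0001": "X"})
```

T0001 takes the catalogue's label "X", while T0002, which is absent from the catalogue, keeps its existing label "A".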
Since this command outputs a valid results file (except when using one of
those options listed as changing the format), its output can be used as input
for a subsequent tacl results command. To chain commands together without
creating an intermediate file, pipe the commands together and use - instead of
a filename, for example:
tacl results --reciprocal results.csv | tacl results --reduce -
examples:
Extend CBETA results and set a minimum total count.
tacl results -e corpus/cbeta/ --min-count 9 output.csv > mod-output.csv
Zero-fill CBETA results.
tacl results -z corpus/cbeta/ output.csv > mod-output.csv
Reduce Pagel results.
tacl results --reduce -t pagel output.csv > mod-output.csv
Due to encoding issues, you may need to set the environment variable
PYTHONIOENCODING to "utf-8".