tacl results

usage: tacl results [-h] [-v] [-b CORPUS] [--max-be-count COUNT]
                    [--denormalise MAPPING] [--denormalised-corpus CORPUS]
                    [-e CORPUS] [--excise NGRAM] [-l LABEL]
                    [--min-count COUNT] [--max-count COUNT]
                    [--min-count-work COUNT] [--max-count-work COUNT]
                    [--min-size SIZE] [--max-size SIZE] [--min-works COUNT]
                    [--max-works COUNT] [--ngrams NGRAMS] [--reciprocal]
                    [--reduce] [--relabel CATALOGUE] [--remove LABEL] [--sort]
                    [-t {cbeta,latin,pagel}] [-z CORPUS] [--add-label-count]
                    [--add-label-work-count] [--collapse-witnesses]
                    [--group-by-ngram CATALOGUE] [--group-by-witness]
                    RESULTS

Modify a query results file by adding, removing or otherwise manipulating
result rows. Outputs the new set of results.

positional arguments:
  RESULTS               Path to CSV results; use - for stdin.

options:
  -h, --help            show this help message and exit
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)
  -e CORPUS, --extend CORPUS
                        Extend the results to list the largest n-grams that
                        also count as matches, going beyond the maximum size
                        recorded in the database. This has no effect if the
                        results contain only 1-grams. (default: None)
  --excise NGRAM        Remove all results whose n-gram contains the supplied
                        n-gram within it. (default: None)
  -l LABEL, --label LABEL
                        Restrict the --min/max-count, --min/max-count-work
                        and --min/max-works requirements to this label.
                        (default: None)
  --min-count COUNT     Minimum total count per n-gram to include. (default:
                        None)
  --max-count COUNT     Maximum total count per n-gram to include. (default:
                        None)
  --min-count-work COUNT
                        Minimum count per n-gram per work to include; if a
                        single witness meets this criterion for an n-gram, all
                        instances of that n-gram are kept. (default: None)
  --max-count-work COUNT
                        Maximum count per n-gram per work to include; if a
                        single witness meets this criterion for an n-gram, all
                        instances of that n-gram are kept. (default: None)
  --min-size SIZE       Minimum size of n-grams to include. (default: None)
  --max-size SIZE       Maximum size of n-grams to include. (default: None)
  --min-works COUNT     Minimum count of works containing n-gram to include.
                        (default: None)
  --max-works COUNT     Maximum count of works containing n-gram to include.
                        (default: None)
  --ngrams NGRAMS       Path to file containing n-grams (one per line) to
                        exclude. (default: None)
  --reciprocal          Remove n-grams that are not attested by at least one
                        work in each labelled set of works. This can be useful
                        after reducing a set of intersection results.
                        (default: False)
  --reduce              Remove n-grams that are contained in larger n-grams.
                        (default: False)
  --relabel CATALOGUE   Relabel results according to the supplied catalogue.
                        (default: None)
  --remove LABEL        Remove labelled results. (default: None)
  --sort                Sort the results. (default: False)
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)
  -z CORPUS, --zero-fill CORPUS
                        Add rows with a count of 0 for each n-gram in each
                        witness of a work that has at least one witness
                        bearing that n-gram. (default: None)

bifurcated extend:
  -b CORPUS, --bifurcated-extend CORPUS
                        Extend results to bifurcation points. Generates
                        results containing those n-grams, derived from the
                        original n-grams, that have a label count higher than
                        their containing (n+1)-grams, or that have a label
                        count of one and the constituent (n-1)-grams have a
                        higher label count. (default: None)
  --max-be-count COUNT  Maximum size of n-gram to extend to. (default: None)

denormalise:
  --denormalise MAPPING
                        Denormalise result n-grams using the mapping at the
                        supplied path. The unnormalised corpus must also be
                        specified in the --denormalised-corpus option.
                        (default: None)
  --denormalised-corpus CORPUS
                        Path to directory containing the original
                        (unnormalised) corpus. This option must be given along
                        with --denormalise in order for denormalisation to be
                        performed. (default: None)

format changing arguments:
  These arguments change the format of the results, making them potentially
  unsafe to use other operations on, or causing such operations to fail.

  --add-label-count     Output the supplied results with an additional column,
                        "label count", giving the total count for each n-gram
                        within the label. For each work, the maximum count
                        across all of that work's witnesses is used in the
                        sum. (default: False)
  --add-label-work-count
                        Output the supplied results with an additional column,
                        "label work count", giving the total count of works
                        that contain the n-gram within the label. For each
                        work, any number of positive counts across all of that
                        work's witnesses is counted as one in the sum.
                        (default: False)
  --collapse-witnesses  Collapse result rows for multiple witnesses having the
                        same count for an n-gram. Instead of the "siglum"
                        column, all of the witnesses (per work) with the same
                        n-gram count are listed, comma separated, in the
                        "sigla" column. (default: False)
  --group-by-ngram CATALOGUE
                        Group results by n-gram, providing summary information
                        of the works each n-gram appears in. Results are
                        sorted by n-gram and then order of occurrence of the
                        label in the supplied catalogue. (default: None)
  --group-by-witness    Group results by witness, providing summary
                        information of which n-grams appear in each witness.
                        (default: False)
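
The two label columns described above can be illustrated with a small Python
sketch. It is not tacl's implementation: the row layout (ngram, work, siglum,
count, label) and the toy data are assumptions made for the example, and only
the summing rules are taken from the option descriptions.

```python
from collections import defaultdict

# Hypothetical rows: (ngram, work, siglum, count, label).
ROWS = [
    ("ab", "T1", "base", 3, "A"),
    ("ab", "T1", "wit1", 5, "A"),
    ("ab", "T2", "base", 2, "A"),
    ("ab", "T3", "base", 0, "B"),
    ("ab", "T4", "base", 4, "B"),
]


def label_counts(rows):
    """Return (label count, label work count) keyed by (ngram, label)."""
    # Highest count among the witnesses of each work.
    best = defaultdict(int)
    for ngram, work, _siglum, count, label in rows:
        key = (ngram, label, work)
        best[key] = max(best[key], count)
    label_count = defaultdict(int)
    label_work_count = defaultdict(int)
    for (ngram, label, _work), count in best.items():
        # "label count": sum of each work's maximum witness count.
        label_count[(ngram, label)] += count
        # "label work count": a work counts once if any witness has it.
        if count > 0:
            label_work_count[(ngram, label)] += 1
    return dict(label_count), dict(label_work_count)
```

With this data, work T1 contributes 5 (not 8) to the label count for label A,
and work T3 contributes nothing to the label work count for label B.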

If more than one modifier is specified, they are applied in the following
order: --extend, --bifurcated-extend, --denormalise, --reduce, --reciprocal,
--excise, --zero-fill, --ngrams, --min/max-works, --min/max-size,
--min/max-count, --min/max-count-work, --remove, --relabel, --sort. All of the
options that modify the format are performed at the end, and only one should
be specified. The one exception to this is denormalisation, which adds a
column to the results without disrupting any other operations - but see below.
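
As a reading aid, the ordering above can be expressed as a short Python
sketch that sorts a set of requested options into the order in which they
would be applied. The helper names are hypothetical and this is not how tacl
itself is written. Only the options named in the list are handled.

```python
# Order of application as listed in the help text.
ORDER = [
    "--extend", "--bifurcated-extend", "--denormalise", "--reduce",
    "--reciprocal", "--excise", "--zero-fill", "--ngrams",
    "--min/max-works", "--min/max-size", "--min/max-count",
    "--min/max-count-work", "--remove", "--relabel", "--sort",
]


def group(option):
    """Map e.g. --min-count or --max-count onto --min/max-count."""
    for prefix in ("--min-", "--max-"):
        if option.startswith(prefix):
            return "--min/max-" + option[len(prefix):]
    return option


def application_order(options):
    """Return the options sorted into the documented order.

    Raises ValueError for an option that is not in the documented list.
    """
    return sorted(options, key=lambda option: ORDER.index(group(option)))
```

For example, application_order(["--sort", "--min-count", "--reduce",
"--extend"]) puts --extend first and --sort last, whatever order the options
were given in.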

--extend applies before --reduce because it may generate results that are also
amenable to reduction.

--extend applies before --remove because it depends on there being at least
two labels in the results in order to give correct results.

The denormalisation options together produce a set of results with all
denormalised forms that occur in each witness presented, along with an extra
column, "normalised ngram", giving the normalised form each was derived from.
Since denormalised intersect results may no longer conform to normal intersect
rules (that each n-gram occurs at least once within each label), running some
further operations (such as extend) is likely to cause unwanted removal of
results.

--denormalise should always be performed (if at all) before --reduce. The
counts of the denormalised n-grams will be the full count of all instances in
a witness, even if a --reduce on the normalised results had reduced counts.

Take care when using --reduce. When it is combined with filters such as
--max-size or --min-count, many results may be discarded without trace, since
the reduction is applied before the filters. Note too that performing --reduce
on a set of results more than once will make the results inaccurate!

--min-count and --max-count set the range within which the total count of each
n-gram, across all works, must fall. For each work, its count is taken as the
highest count among its witnesses.
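
The counting rule can be made concrete with a minimal Python sketch. It
assumes rows of the form (ngram, work, siglum, count) and uses invented data;
it illustrates the rule as described here rather than reproducing tacl's code.

```python
from collections import defaultdict

# Hypothetical rows: (ngram, work, siglum, count).
ROWS = [
    ("ab", "T0001", "base", 3),
    ("ab", "T0001", "wit1", 5),
    ("ab", "T0002", "base", 2),
    ("cd", "T0001", "base", 1),
    ("cd", "T0002", "base", 1),
]


def total_counts(rows):
    """Sum, per n-gram, the highest witness count of each work."""
    per_work = defaultdict(int)
    for ngram, work, _siglum, count in rows:
        key = (ngram, work)
        per_work[key] = max(per_work[key], count)
    totals = defaultdict(int)
    for (ngram, _work), count in per_work.items():
        totals[ngram] += count
    return dict(totals)


def filter_by_total(rows, minimum=None, maximum=None):
    """Keep every row whose n-gram has a total count within the range."""
    totals = total_counts(rows)

    def keep(ngram):
        total = totals[ngram]
        return ((minimum is None or total >= minimum)
                and (maximum is None or total <= maximum))

    return [row for row in rows if keep(row[0])]
```

Here the total for "ab" is 7 (5 from T0001 plus 2 from T0002), not 10, so a
minimum of 8 would remove it while a minimum of 3 would keep all three of its
rows.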

--min-works and --max-works count works rather than witnesses.
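
A small Python sketch of the distinction, under stated assumptions: rows are
(ngram, work, siglum, count) tuples, the data is invented, and a work is
taken to contain an n-gram when at least one of its witnesses has a positive
count. That last point is an assumption of this sketch, not something the
help text states.

```python
# Hypothetical rows: (ngram, work, siglum, count).
ROWS = [
    ("ab", "T0001", "base", 2),
    ("ab", "T0001", "wit1", 1),
    ("ab", "T0002", "base", 4),
    ("cd", "T0001", "base", 1),
    ("cd", "T0001", "wit1", 1),
]


def works_per_ngram(rows):
    """Count distinct works, not witnesses, that contain each n-gram."""
    works = {}
    for ngram, work, _siglum, count in rows:
        if count > 0:  # assumption: zero-count rows do not count
            works.setdefault(ngram, set()).add(work)
    return {ngram: len(found) for ngram, found in works.items()}


def filter_by_works(rows, minimum=None, maximum=None):
    """Keep rows whose n-gram occurs in an acceptable number of works."""
    counts = works_per_ngram(rows)

    def keep(ngram):
        number = counts.get(ngram, 0)
        return ((minimum is None or number >= minimum)
                and (maximum is None or number <= maximum))

    return [row for row in rows if keep(row[0])]
```

In this data "cd" appears in two witnesses but only one work, so a minimum of
two works would remove it.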

If both --min-count-work and --max-count-work are specified, only those
n-grams are kept that have at least one witness whose count falls within that
range.
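
The following Python sketch illustrates that rule on invented data. The row
layout (ngram, work, siglum, count) is an assumption, and the function is an
illustration of the behaviour described, not tacl's own code.

```python
# Hypothetical rows: (ngram, work, siglum, count).
ROWS = [
    ("ab", "T0001", "base", 1),
    ("ab", "T0001", "wit1", 6),
    ("ab", "T0002", "base", 12),
    ("cd", "T0001", "base", 1),
    ("cd", "T0002", "base", 20),
]


def filter_by_work_count(rows, minimum, maximum):
    """Keep all rows of any n-gram with one witness count in the range."""
    qualifying = {
        ngram
        for ngram, _work, _siglum, count in rows
        if minimum <= count <= maximum
    }
    return [row for row in rows if row[0] in qualifying]
```

With a range of 5 to 10, "ab" qualifies through the witness with a count of
6, and its rows with counts of 1 and 12 are kept as well. No witness of "cd"
falls in the range, so all of its rows are removed.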

-l/--label causes --min/max-count, --min/max-count-work, and --min/max-works
to have their requirements apply within that labelled subset of results. All
n-grams, both within the subset and outside it, that meet the criteria are
kept, while all other n-grams are removed. Note that when applied to diff
results, no n-grams outside those in the labelled subset will be kept.

--relabel sets the label for each result row to the label for that row's work
as specified in the supplied catalogue. If the work is not labelled in the
catalogue, the label in the results is not changed.
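
As an illustration only, the relabelling rule might be sketched in Python as
follows. The sketch assumes that the catalogue has already been loaded into a
dictionary mapping work to label and that each row is a dictionary with
"work" and "label" keys; both are assumptions of the example.

```python
def relabel(rows, catalogue):
    """Return copies of rows, relabelled where the catalogue has a label.

    Rows whose work is missing from the catalogue keep their label.
    """
    return [
        dict(row, label=catalogue.get(row["work"], row["label"]))
        for row in rows
    ]


# Hypothetical data: T0002 is not labelled in the catalogue.
ROWS = [
    {"ngram": "ab", "work": "T0001", "label": "A"},
    {"ngram": "ab", "work": "T0002", "label": "B"},
]
CATALOGUE = {"T0001": "early"}
```

Calling relabel(ROWS, CATALOGUE) changes the label of the T0001 row to
"early" and leaves the T0002 row labelled "B".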

Since this command outputs a valid results file (except when using one of
those options listed as changing the format), its output can be used as input
for a subsequent tacl results command. To chain commands together without
creating an intermediate file, pipe the commands together and use - instead of
a filename, as:

    tacl results --reciprocal results.csv | tacl results --reduce -
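
The same chain can be driven from a script. The Python sketch below uses only
the standard subprocess module; the function name is hypothetical, and it
assumes that tacl is on the PATH and that results.csv exists. It also sets
PYTHONIOENCODING for both commands.

```python
import os
import subprocess

FIRST = ["tacl", "results", "--reciprocal", "results.csv"]
SECOND = ["tacl", "results", "--reduce", "-"]


def run_chain(first=FIRST, second=SECOND):
    """Run `first | second` and return the second command's output."""
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    with subprocess.Popen(first, stdout=subprocess.PIPE, env=env) as producer:
        # The second command reads the first command's output on stdin.
        consumer = subprocess.run(
            second, stdin=producer.stdout, stdout=subprocess.PIPE,
            env=env, check=True)
    if producer.returncode != 0:
        raise subprocess.CalledProcessError(producer.returncode, first)
    return consumer.stdout
```

The output is returned as bytes; decode it as UTF-8 before parsing it as CSV.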

examples:

  Extend CBETA results and set a minimum total count.
    tacl results -e corpus/cbeta/ --min-count 9 output.csv > mod-output.csv

  Zero-fill CBETA results.
    tacl results -z corpus/cbeta/ output.csv > mod-output.csv

  Reduce Pagel results.
    tacl results --reduce -t pagel output.csv > mod-output.csv

Due to encoding issues, you may need to set the environment variable
PYTHONIOENCODING to "utf-8".