Welcome to TACL’s documentation!
TACL is a tool for performing basic text analysis on a corpus of texts. It can, with minor modifications, be used for any texts, though it is designed specifically for the texts available from the Chinese Buddhist Electronic Text Association (CBETA).
The analysis works by dividing the corpus texts into their constituent n-grams, and then allowing queries for the differences and intersections of these n-grams between arbitrary groupings of texts.
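To make the idea of n-gram differences and intersections concrete, here is a minimal sketch in Python. It is not TACL's implementation: it assumes n-grams are simply runs of consecutive characters, and the function names (`char_ngrams`, `group_ngrams`) and the sample texts are hypothetical.

```python
from collections import Counter


def char_ngrams(text, min_size, max_size):
    """Count every character n-gram of length min_size..max_size in text."""
    counts = Counter()
    for size in range(min_size, max_size + 1):
        for start in range(len(text) - size + 1):
            counts[text[start:start + size]] += 1
    return counts


def group_ngrams(texts, min_size, max_size):
    """Return the set of n-grams occurring anywhere in a group of texts."""
    found = set()
    for text in texts:
        found.update(char_ngrams(text, min_size, max_size))
    return found


# Two hypothetical groups of (very short) texts.
group_a = ["abcd", "bcde"]
group_b = ["cdef"]

ngrams_a = group_ngrams(group_a, 2, 3)
ngrams_b = group_ngrams(group_b, 2, 3)

only_in_a = ngrams_a - ngrams_b  # difference: n-grams found only in group A
shared = ngrams_a & ngrams_b     # intersection: n-grams found in both groups
```

The real tool stores n-gram counts in a database and reports per-work results, so this sketch only illustrates the underlying set operations.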
The documentation here concentrates on the specifics of using TACL. Michael Radich has written a user’s guide that focuses on “questions of Buddhological method bearing upon rigorous and effective application of the tool to research questions”.
The TACL suite of tools operates on a corpus of texts via an analysis of their n-grams. There are several steps in the preparation and analysis of the corpus, listed here with example commands:
Preprocess the files in the corpus to remove material that is not relevant to the analysis (the tacl prepare and tacl strip commands). This creates modified files in a separate directory; that directory and those files are treated as the corpus for the remaining steps.
tacl prepare path/XML/dir path/prepared/dir
tacl strip path/prepared/dir path/stripped/dir
Note that the output format is simply plain text. If you already have plain text files, then this step is not necessary. The processing currently expects the style of TEI XML used by the CBETA corpus as per their GitHub repository.
Generate the n-grams that will be used in the analysis (tacl ngrams). This is typically the slowest part of the entire process.
tacl ngrams path/db/file path/stripped/dir 2 10
Categorise some or all of the works in the corpus into two or more groups. These groups (identified by arbitrary, user-chosen labels) are defined in a catalogue file that is initially generated from the corpus (tacl catalogue).
The catalogue file lists each work on its own line, followed optionally by whitespace and the label. If the label contains a space, it must be quoted.
Works that have no label are not used in an analysis.
tacl catalogue -l "base" path/stripped/dir path/catalogue/file
An example catalogue:
T0237 Vaj
T0097 AV
T0667 P-ref
T1461 P-ref
T1559
T2137
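The catalogue rules described above (one work per line, an optional label, quoting for labels that contain a space, and unlabelled works being ignored) can be sketched in Python as follows. This is an illustration under stated assumptions rather than TACL's own parser: `parse_catalogue` is a hypothetical helper, quoting is assumed to follow shell-style rules as handled by `shlex`, and the quoted label "P ref" is invented to show the quoting case.

```python
import shlex


def parse_catalogue(lines):
    """Map each labelled work to its label; unlabelled works are skipped."""
    labels = {}
    for line in lines:
        # shlex.split honours quotes, so a quoted label may contain spaces.
        parts = shlex.split(line)
        if len(parts) == 2:
            work, label = parts
            labels[work] = label
    return labels


# The example catalogue, with one label changed to a hypothetical quoted form.
catalogue_text = '''T0237 Vaj
T0097 AV
T0667 P-ref
T1461 "P ref"
T1559
T2137
'''
labels = parse_catalogue(catalogue_text.splitlines())
```

Here T1559 and T2137 carry no label, so they are absent from `labels` and would play no part in an analysis.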
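Query the n-grams for the differences (tacl diff) or the intersections (tacl intersect) between the labelled groups of works defined in the catalogue file.
tacl diff path/db/file path/stripped/dir path/catalogue/file > diff-results.csv
tacl intersect path/db/file path/stripped/dir path/catalogue/file > intersect-results.csv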
Optionally perform functions on the results of a difference or intersection query, to limit the scope of the results (tacl results).
tacl results --reduce --min-count 5 diff-results.csv > reduced-diff-results.csv
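As a rough illustration of what a minimum-count filter over results might do, the following Python sketch keeps only rows whose n-gram reaches a total count threshold. Several things here are assumptions and not taken from TACL itself: the column names `ngram`, `work` and `count`, the reading of the threshold as a total across all rows for an n-gram, the helper name `filter_min_count`, and the sample data.

```python
import csv
import io
from collections import defaultdict


def filter_min_count(rows, min_count):
    """Keep rows whose n-gram has a total count of at least min_count."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["ngram"]] += int(row["count"])
    return [row for row in rows if totals[row["ngram"]] >= min_count]


# Hypothetical results data; the real column names and values may differ.
sample = io.StringIO(
    "ngram,work,count\n"
    "ab,T0001,3\n"
    "ab,T0002,2\n"
    "cd,T0001,4\n"
)
rows = list(csv.DictReader(sample))
kept = filter_min_count(rows, 5)
```

The --reduce option is not modelled here; consult tacl results -h for the behaviour of each option.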
Display a side-by-side comparison of the matching parts of pairs of texts in a set of intersection query results (tacl align).
tacl align path/stripped/dir path/output/dir intersect-results.csv
Display one text with the option to highlight matches from other texts in a set of intersection query results, producing a heatmap visualisation (tacl highlight).
tacl highlight path/stripped/dir intersect-results.csv text-name witness-siglum
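One way to think about a heatmap of matches is as a per-character count of how many matching n-grams cover each position in the displayed text. The Python sketch below computes such counts. It is a conceptual illustration only, with a hypothetical `coverage` function, and it makes no claim about how tacl highlight actually weights or renders its output.

```python
def coverage(text, ngrams):
    """For each character in text, count the given n-grams that cover it."""
    heat = [0] * len(text)
    for ngram in ngrams:
        start = text.find(ngram)
        while start != -1:
            for i in range(start, start + len(ngram)):
                heat[i] += 1
            # Step forward by one so overlapping occurrences are also found.
            start = text.find(ngram, start + 1)
    return heat


# Hypothetical text and matching n-grams.
heat = coverage("abcde", ["bc", "bcd"])
```

Higher numbers mark characters covered by more matches, which a display could map to a stronger highlight colour.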
Other tacl commands can be found by running tacl -h or by reading the documentation for the tacl script.
Those wishing to do sophisticated operations with catalogues may wish to install tacl-catalogue-manager.