tacl split

usage: tacl split [-h] [-v] [-t {cbeta,latin,pagel}] CORPUS CONF [CONF ...]

Split an existing work into multiple works that are subsets of its content.

positional arguments:
  CORPUS                Path to corpus.
  CONF                  XML configuration file defining the contents of each
                        witness split from the source work.

options:
  -h, --help            show this help message and exit
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)

Each split configuration file must be named after the work whose splits it
defines (e.g., T0278.xml is the configuration file for the work T0278). Its
format is a simple XML structure, as illustrated in the example below:

<splits delete="true">
  <work>
    <name>T0278-paralleled-earlier</name>
    <parts>
      <part>
        <witnesses>大,宋,元,明,聖</witnesses>
        <start>佛在摩竭提國寂滅道場初始得佛普光法</start>
        <end>最勝或稱能度如是等稱佛名號其數一萬</end>
      </part>
      <part>
        <witnesses>宮</witnesses>
        <start>佛在摩竭提國寂滅道場初始得佛普光法</start>
        <end>最勝或稱能度如是稱佛名號其數一萬</end>
      </part>
      <part>
        <witnesses>ALL</witnesses>
        <start>爾時世尊從兩足相輪放百億光明遍照</start>
        <end>百億色究竟天此世界所有一切悉現</end>
      </part>
    </parts>
  </work>
  <work>
    <name>T0278-ex-earlier-parallels</name>
    <parts>
      <part>
        <witnesses>ALL</witnesses>
        <whole>如此見佛坐蓮華藏師子座上有十佛世界塵數菩薩眷屬圍遶百億閻浮提</whole>
      </part>
      <part>
        <witnesses>ALL</witnesses>
        <start>佛子是為菩薩身口意業能得一切勝妙功</start>
        <end>善哉善哉真佛子快說是法我隨喜</end>
      </part>
    </parts>
  </work>
  <work rename="true">
    <name>Renamed T0278</name>
  </work>
</splits>

Each split work is created, under the supplied name, in the corpus directory;
an error is raised if a work with that name already exists. Each of the
original work's witnesses is recreated, using the subset of its content
defined in the parts. The parts are processed in the order listed, and a
witness includes a part only if its siglum is listed in that part's witnesses
element, or if the keyword ALL is given there.
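
The witness rule above can be sketched in Python as follows. This is an
illustration only, not tacl's own code: the function name applies_to is
hypothetical, and the comma-separated format and whitespace stripping are
assumptions based on the example configuration.

```python
def applies_to(witnesses_field: str, siglum: str) -> bool:
    """Return True if a part with this <witnesses> value applies to siglum.

    Assumption: sigla are comma-separated, as in the example configuration.
    """
    sigla = [s.strip() for s in witnesses_field.split(",")]
    # The keyword ALL makes the part apply to every witness.
    return "ALL" in sigla or siglum in sigla
```

Under these assumptions, a part listing only 宮 is skipped for witness 大,
while a part listing ALL is included in both.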

Each part defines either a start and an end piece of text, or a whole piece
of text. In the former case, the first remaining instance of the start text,
and everything following it until the first remaining instance of the end
text, is copied into each applicable witness of the new work. In the latter
case, the first instance of the provided whole text is copied. In both cases,
once the specified text has been copied, it is removed from consideration by
later parts of this split work.
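
A minimal Python sketch of this copy-and-remove behaviour is given below. It
is not tacl's implementation: the function name take_part is hypothetical,
and it assumes that the end text is included in the copied span and that
"removed from consideration" can be modelled by splicing the span out of the
remaining text.

```python
def take_part(remaining: str, *, start: str = "", end: str = "",
              whole: str = "") -> tuple[str, str]:
    """Return (copied_text, text_still_available_to_later_parts)."""
    if whole:
        begin = remaining.find(whole)
        if begin == -1:
            raise ValueError("whole text not found")
        stop = begin + len(whole)
    else:
        begin = remaining.find(start)
        if begin == -1:
            raise ValueError("start text not found")
        # Search for the end text only after the matched start text.
        end_at = remaining.find(end, begin + len(start))
        if end_at == -1:
            raise ValueError("end text not found")
        stop = end_at + len(end)  # assumption: the end text is copied too
    # Splice the copied span out so later parts cannot match it again.
    return remaining[begin:stop], remaining[:begin] + remaining[stop:]
```

Calling take_part repeatedly, each time passing the second return value back
in as remaining, mimics processing the parts in the order listed.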

The source work can be output in its entirety under a new name by adding a
"rename" attribute with the value "true" to a work element; such an element
must contain only a name.

The source work is left unchanged by the splitting process, unless a "delete"
attribute with the value "true" is added to the root splits element, in which
case the source work is deleted.
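
To check a configuration file before running the command, its structure can
be inspected with Python's standard xml.etree.ElementTree module. The sketch
below is illustrative only: the function name summarise_config and the shape
of the returned dictionary are hypothetical, and it performs no validation
beyond reading the elements and attributes described above.

```python
import xml.etree.ElementTree as ET


def summarise_config(xml_text: str) -> dict:
    """Summarise a split configuration given as an XML string."""
    root = ET.fromstring(xml_text)
    works = []
    for work in root.findall("work"):
        works.append({
            "name": work.findtext("name"),
            # A work marked rename="true" copies the whole source work.
            "rename": work.get("rename") == "true",
            "parts": len(work.findall("parts/part")),
        })
    # delete="true" on the root element means the source work is deleted.
    return {"delete_source": root.get("delete") == "true", "works": works}
```

For the example configuration above, this would report that the source work
is to be deleted and list three works, the last of which is a rename with no
parts.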