Command-Line Interface
chunk-torch-spect-data-dir
usage: chunk-torch-spect-data-dir [-h] [--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX]
[--feat-subdir FEAT_SUBDIR]
[--ali-subdir ALI_SUBDIR]
[--ref-subdir REF_SUBDIR]
[--num-workers NUM_WORKERS]
[--mp-chunk-size MP_CHUNK_SIZE]
[--policy {fixed,ali,ref}]
[--lobe-size LOBE_SIZE]
[--window-type {symmetric,causal,future}]
[--pad-mode {constant,reflect,replicate}]
[--pad-constant PAD_CONSTANT]
[--partial-tokens]
[--retain-token-boundaries] [--quiet]
[--format-utt FORMAT_UTT]
in_dir out_dir
Create a new SpectDataSet directory by chunking another
This command breaks SpectDataSet sequences into sub-sequences (chunks), storing the
results in a new directory. New utterances are named according to "--format-utt".
Sequences are sliced according to one of three policies set by the "--policy" flag
(default "fixed"). They are:
- fixed: extract a fixed-sized window at fixed-length intervals along the feature
sequence.
- ali: use per-frame alignments to segment the feature sequence into intervals with
matching labels. Requires per-frame alignments (data in the "ali/" subdirectory).
- ref: use reference token sequence segments as slices. Requires reference sequences
(data in the "ref/" subdirectory) and for them to contain segment boundary
information.
Overlapping chunks may be created by specifying "--lobe-size" (default "0") and
"--window-type" (default "symmetric"). More details on the policies and windowing can
be found in the Python module pydrobert.torch.modules.SliceSpectData.
By default, only valid slices (i.e. those entirely within the boundaries of the input
sequences) are counted. Specifying "--pad-mode" will include slices partially within
boundaries as well as how to pad features and per-frame alignments to fill the
remainder.
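The three "--pad-mode" choices can be illustrated with a small pure-Python sketch. It pads a list of scalar frames rather than real feature tensors, and the "reflect" branch assumes the mirror excludes the boundary frame itself (the convention used by torch.nn.functional.pad); the function name and signature are hypothetical and are not part of this command.

```python
def pad_sequence(seq, left, right, mode="constant", constant=0.0):
    """Pad a list of frames on both sides (illustrative helper only)."""
    if mode == "constant":
        # Fill with a fixed value, as with --pad-constant.
        head = [constant] * left
        tail = [constant] * right
    elif mode == "replicate":
        # Repeat the first and final frames.
        head = [seq[0]] * left
        tail = [seq[-1]] * right
    elif mode == "reflect":
        # Mirror around the boundary frame without repeating it.
        if max(left, right) >= len(seq):
            raise ValueError("reflect padding must be shorter than the sequence")
        head = seq[left:0:-1]
        tail = seq[-2:-2 - right:-1]
    else:
        raise ValueError(f"unknown pad mode: {mode}")
    return head + seq + tail


frames = [1, 2, 3, 4]
print(pad_sequence(frames, 2, 2, "constant", 0))
print(pad_sequence(frames, 2, 2, "replicate"))
print(pad_sequence(frames, 2, 2, "reflect"))
```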
See the command "get-torch-spect-data-dir-info" for more info about SpectDataSet directories.
positional arguments:
in_dir The torch data directory to chunk (input)
out_dir The torch data directory to store chunks (output)
optional arguments:
-h, --help show this help message and exit
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--feat-subdir FEAT_SUBDIR
Subdirectory where features are stored.
--ali-subdir ALI_SUBDIR
Subdirectory where per-frame alignments are stored.
--ref-subdir REF_SUBDIR
Subdirectory where reference token sequences are
stored.
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
--mp-chunk-size MP_CHUNK_SIZE
The number of utterances that a multiprocessing worker
will process at once. Impacts speed and memory
consumption.
--policy {fixed,ali,ref}
The policy for determining slices from the data. See
SliceSpectData.
--lobe-size LOBE_SIZE
Size of a side lobe of a slice. See SliceSpectData.
--window-type {symmetric,causal,future}
Type of window used in slicing. See SliceSpectData.
--pad-mode {constant,reflect,replicate}
If specified, determines how to pad chunks of features and
alignments exceeding the original sequence boundaries.
constant: pad with the value of '--pad-constant'.
reflect: padded values are the reflection around
sequence boundaries. replicate: padded values match
the first and final sequence values.
--pad-constant PAD_CONSTANT
Constant used when padding with '--pad-mode=constant'
--partial-tokens If set, reference token sequences which only partly
overlap with a chunk will still be included with the
chunk.
--retain-token-boundaries
If set, segment boundaries of reference token
sequences will keep their original values rather than
being made relative to the chunk.
--quiet Suppress any warnings.
--format-utt FORMAT_UTT
Format string with which to format utterance ids of
chunks. Available keys are 'utt_id': the old utterance
id, 'start': the start frame of the chunk (inclusive),
'end': the end frame of the chunk (exclusive), and
'idx': the 0-index of the chunk within the utterance
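As a rough illustration of "--format-utt", the sketch below assumes the format string is applied with Python's str.format and the four keys listed above. The format string itself is a made-up example, not the command's default.

```python
# Hypothetical format string; only the key names come from the help text.
format_utt = "{utt_id}_{start:05d}-{end:05d}"

# Keys that the format string does not use (here 'idx') are simply ignored.
chunk_id = format_utt.format(utt_id="utt1", start=0, end=100, idx=0)
print(chunk_id)
```

Zero-padding the frame indices, as in this example, keeps chunk ids in frame order when they are sorted as strings.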
compute-mvn-stats-for-torch-feat-data-dir
usage: compute-mvn-stats-for-torch-feat-data-dir [-h]
[--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX]
[--num-workers NUM_WORKERS]
[--dim DIM] [--id2gid ID2GID]
[--bessel]
dir out
Compute mean and standard deviation over a torch feature directory
A feature directory is of the form
dir/
<file_prefix><id_1><file_suffix>
<file_prefix><id_2><file_suffix>
...
where each file contains a dynamically-sized tensor whose last dimension (by default) is
a feature vector. Letting F be a feature vector, this command computes the mean and
standard deviation of the features in the directory, storing them as a pickled
dictionary of tensors (with keys 'mean' and 'std') to the file 'out'. Those statistics
may be used with a pydrobert.torch.modules.MeanVarianceNormalization layer.
positional arguments:
dir The feature directory
out Output path
optional arguments:
-h, --help show this help message and exit
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
--dim DIM The dimension of the feature vector
--id2gid ID2GID Path to a file mapping feature tensors to groups. See
below for more info
--bessel Apply Bessel's correction
(https://en.wikipedia.org/wiki/Bessel's_correction) to
estimates.
If --id2gid is specified, it points to a file which maps file ids to groups. Each group
gets its own statistics which are estimated using only the feature vectors from the
files assigned to them. With <id_1>, <id_2>, etc. part of the file names in the feature
directory as above and <gid_1>, <gid_2>, etc. strings without spaces representing group
ids, then the argument passed to --id2gid is a file with lines
<id_x> <gid_y>
defining a surjective mapping from file ids to group ids. 'out' will then store a
pickled, nested dictionary
{
<gid_1>: {'mean': ..., 'std': ...},
<gid_2>: {'mean': ..., 'std': ...},
...
}
of the statistics of all groups.
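The grouping logic can be sketched in plain Python. This is a simplified illustration with scalar features held in lists rather than tensors loaded from files; the helper names are hypothetical, and the result maps each group id to a (mean, standard deviation) pair instead of the pickled dictionary the command writes.

```python
import math
from collections import defaultdict


def parse_id2gid(lines):
    """Map each file id to its group id from '<id_x> <gid_y>' lines."""
    id2gid = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        file_id, gid = line.split()
        id2gid[file_id] = gid
    return id2gid


def group_stats(id2feats, id2gid, bessel=False):
    """Return {gid: (mean, std)} over the features pooled within each group."""
    pooled = defaultdict(list)
    for file_id, feats in id2feats.items():
        pooled[id2gid[file_id]].extend(feats)
    stats = {}
    for gid, values in pooled.items():
        n = len(values)
        mean = sum(values) / n
        sq_dev = sum((v - mean) ** 2 for v in values)
        # Bessel's correction divides by n - 1 instead of n.
        denom = n - 1 if bessel else n
        stats[gid] = (mean, math.sqrt(sq_dev / denom))
    return stats


id2gid = parse_id2gid(["a g1", "b g1", "c g2"])
feats = {"a": [1.0, 2.0], "b": [3.0], "c": [4.0, 6.0]}
print(group_stats(feats, id2gid))
print(group_stats(feats, id2gid, bessel=True))
```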
compute-torch-token-data-dir-error-rates
usage: compute-torch-token-data-dir-error-rates [-h] [--id2token ID2TOKEN]
[--replace REPLACE]
[--ignore IGNORE]
[--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX]
[--swap] [--warn-missing]
[--distances] [--per-utt]
[--batch-size BATCH_SIZE]
[--quiet]
[--costs INS DEL SUB | --nist-costs]
dir [hyp] [out]
Compute error rates between reference and hypothesis token data dirs
WARNING!!!!
The error rates reported by this command have changed since version v0.3.0 of
pydrobert-pytorch when the insertion, deletion, and substitution costs do not all equal
1. Consult the documentation of "pydrobert.torch.functional.error_rate" for more
information.
This is a very simple script that computes and prints the error rates between the "ref/"
(reference/gold standard) token sequences and "hyp/" (hypothesis/generated) token
sequences in a SpectDataSet directory. Consult the Wikipedia article on the Levenshtein
distance (https://en.wikipedia.org/wiki/Levenshtein_distance) for more info on error
rates. The error rate for the entire partition will be calculated as the total number of
insertions, deletions, and substitutions made in all transcriptions divided by the sum
of lengths of reference transcriptions.
Error rates are printed as ratios, not as percentages.
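The partition-level error rate described above can be sketched in pure Python. This is a minimal illustration with unit insertion, deletion, and substitution costs; it does not call pydrobert.torch.functional.error_rate and does not reproduce its handling of non-unit costs.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                       # deletion of r
                cur[j - 1] + 1,                    # insertion of h
                prev[j - 1] + (0 if r == h else 1) # match or substitution
            ))
        prev = cur
    return prev[-1]


def corpus_error_rate(refs, hyps):
    """Total errors divided by the total number of reference tokens."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


refs = [["a", "b", "c"], ["d", "e"]]
hyps = [["a", "c"], ["d", "e", "f"]]
print(corpus_error_rate(refs, hyps))  # one deletion + one insertion over 5 tokens
```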
While convenient and accurate, this script has very few features. Consider pairing the
command "torch-token-data-dir-to-trn" with sclite
(http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm) instead.
Many tasks will ignore some tokens (e.g. silences) or collapse others (e.g. phones).
Please consult a standard recipe (such as those in Kaldi http://kaldi-asr.org/) before
performing these computations.
positional arguments:
dir If the 'hyp' argument is not specified, this is the
parent directory of two subdirectories, 'ref/' and
'hyp/', which contain the reference and hypothesis
transcripts, respectively. If the 'hyp' argument is
specified, this is the reference transcript directory
hyp The hypothesis transcript directory
out Where to print the error rate to. Defaults to stdout
optional arguments:
-h, --help show this help message and exit
--id2token ID2TOKEN A file containing mappings from unique IDs to tokens
(e.g. words or phones). Each line has the format "<id>
<token>". The flag "--swap" can be used to swap the
expected ordering (i.e. to "<token> <id>")
--replace REPLACE A file containing pairs of elements per line. The
first is the element to replace, the second what to
replace it with. If '--id2token' is specified, the
file should contain tokens. If '--id2token' is not
specified, the file should contain IDs (integers).
This is processed before '--ignore'
--ignore IGNORE A file containing a whitespace-delimited list of
elements to ignore in both the reference and
hypothesis transcripts. If '--id2token' is specified,
the file should contain tokens. If '--id2token' is not
specified, the file should contain IDs (integers).
This is processed after '--replace'
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--swap If set, swaps the order of the key and value in
token/id mapping
--warn-missing If set, warn and exclude any utterances that are
missing either a reference or hypothesis transcript.
The default is to error
--distances If set, return the average distance per utterance
instead of the total errors over the number of
reference tokens
--per-utt If set, return lines of "<utt_id> <error_rate>"
denoting the per-utterance error rates instead of the
average
--batch-size BATCH_SIZE
The number of error rates to compute at once. Reduce
if you run into memory errors
--quiet Suppress warnings which arise from edit distance
computations
--costs INS DEL SUB The costs of an insertion, deletion, and substitution,
respectively
--nist-costs Use NIST (sclite, score) default costs for insertions,
deletions, and substitutions (3/3/4)
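The order in which "--replace" and "--ignore" are applied matters, since a token may be replaced by one that is then ignored. A minimal sketch, with hypothetical names and with the two files already loaded into a dictionary and a set:

```python
def normalize(tokens, replace, ignore):
    """Apply replacements first, then drop ignored tokens."""
    replaced = [replace.get(tok, tok) for tok in tokens]
    return [tok for tok in replaced if tok not in ignore]


replace = {"spn": "sil", "ah": "aa"}  # pairs as read from the '--replace' file
ignore = {"sil"}                      # elements as read from the '--ignore' file

# "spn" is first mapped to "sil" and is therefore dropped as well.
print(normalize(["sil", "ah", "spn", "b"], replace, ignore))
```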
ctm-to-torch-token-data-dir
usage: ctm-to-torch-token-data-dir [-h] [--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX] [--swap]
[--unk-symbol UNK_SYMBOL]
[--num-workers NUM_WORKERS]
[--mp-chunk-size MP_CHUNK_SIZE]
[--skip-frame-times | --feat-sizing | --frame-shift-ms FRAME_SHIFT_MS]
[--wc2utt WC2UTT | --utt2wc UTT2WC]
ctm token2id dir
Convert a NIST "ctm" file to a SpectDataSet token data dir
A "ctm" file is a transcription file with token alignments (a.k.a. a time-marked
conversation file) used in the sclite
(http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm) toolkit. Here is the
format:
utt_1 A 0.2 0.1 hi
utt_1 A 0.3 1.0 there ;; comment
utt_2 A 0.0 1.0 next
utt_3 A 0.1 0.4 utterance
Where the first number specifies the token start time (in seconds) and the second the
duration.
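To make the format concrete, here is a hedged sketch of reading one "ctm" line and converting its start time and duration into frame indices. The 10 ms frame shift and the rounding to the nearest frame are assumptions made for the example, and the function is hypothetical rather than the command's own parser.

```python
def parse_ctm_line(line, frame_shift_ms=10.0):
    """Return (name, channel, token, start_frame, end_frame), or None if empty."""
    # Anything after ';;' is a comment.
    line = line.split(";;", 1)[0].strip()
    if not line:
        return None
    name, channel, start, dur, token = line.split()[:5]
    start_s, dur_s = float(start), float(dur)

    def to_frame(seconds):
        # Assumed conversion: round to the nearest frame index.
        return int(round(seconds * 1000.0 / frame_shift_ms))

    return name, channel, token, to_frame(start_s), to_frame(start_s + dur_s)


print(parse_ctm_line("utt_1 A 0.2 0.1 hi"))
print(parse_ctm_line("utt_1 A 0.3 1.0 there ;; comment"))
```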
This command reads in a "ctm" file and writes its contents as token sequences compatible
with the "ref/" directory of a SpectDataSet. See the command
"get-torch-spect-data-dir-info" for more info about a SpectDataSet directory.
positional arguments:
ctm The "ctm" file to read token segments from
token2id A file containing mappings from tokens (e.g. words or
phones) to unique IDs. Each line has the format
"<token> <id>". The flag "--swap" can be used to swap
the expected ordering (i.e. to "<id> <token>")
dir The directory to store token sequences to. If the
directory does not exist, it will be created
optional arguments:
-h, --help show this help message and exit
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--swap If set, swaps the order of the key and value in
token/id mapping
--unk-symbol UNK_SYMBOL
If set, will map out-of-vocabulary tokens to this
symbol
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
--mp-chunk-size MP_CHUNK_SIZE
The number of utterances that a multiprocessing worker
will process at once. Impacts speed and memory
consumption.
--skip-frame-times If true, will store token tensors of shape (R,)
instead of (R, 3), foregoing segment start and end
times.
--feat-sizing If true, will store token tensors of shape (R, 1)
instead of (R, 3), foregoing segment start and end
times (which trn does not have). The extra dimension
will allow data in this directory to be loaded as
features in a SpectDataSet.
--frame-shift-ms FRAME_SHIFT_MS
The number of milliseconds that have passed between
consecutive frames. Used to convert between time in
seconds and frame index. If your features are the raw
samples, set this to 1000 / sample_rate_hz
--wc2utt WC2UTT A file mapping wavefile name and channel combinations
(e.g. 'utt_1 A') to utterance IDs. Each line of the
file has the format '<wavefile_name> <channel>
<utt_id>'. If neither '--wc2utt' nor '--utt2wc' has
been specified, the wavefile name will be treated as the
utterance ID
--utt2wc UTT2WC A file mapping utterance IDs to wavefile name and
channel combinations (e.g. 'utt_1 A'). Each line of
the file has the format '<utt_id> <wavefile_name>
<channel>'. If neither '--wc2utt' nor '--utt2wc' has
been specified, the wavefile name will be treated as the
utterance ID
get-torch-spect-data-dir-info
usage: get-torch-spect-data-dir-info [-h] [--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX]
[--feat-subdir FEAT_SUBDIR]
[--ali-subdir ALI_SUBDIR]
[--ref-subdir REF_SUBDIR]
[--strict | --fix [N]]
dir [out_file]
Write info about the specified SpectDataSet data dir
NOTE: additional keys (6, 8-10) have been added since pydrobert-pytorch v0.3.0. In
addition, validation now allows for empty reference segments.
A torch SpectDataSet data dir is of the form
dir/
feat/
<file_prefix><utt1><file_suffix>
<file_prefix><utt2><file_suffix>
...
[ali/
<file_prefix><utt1><file_suffix>
<file_prefix><utt2><file_suffix>
...
]
[ref/
<file_prefix><utt1><file_suffix>
<file_prefix><utt2><file_suffix>
...
]
Where "feat/" contains float tensors of shape (T, F), where T is the number of frames
(variable) and F is the number of filters (fixed). "ali/", if there, contains long
tensors of shape (T,) indicating the appropriate per-frame class labels (likely pdf-ids
for discriminative training in a DNN-HMM). "ref/", if there, contains long tensors of
shape (R, 3) indicating a sequence of reference tokens where element indexed by "[i, 0]"
is a token id, "[i, 1]" is the inclusive start frame of the token (or a negative value
if unknown), and "[i, 2]" is the exclusive end frame of the token. Token sequences may
instead be of shape (R,) if no segment times are available in the corpus.
This command writes the following space-delimited key-value pairs to an output file in
sorted order:
1. "max_ali_class", the maximum inclusive class id found over "ali/"
(if available, -1 if not).
2. "max_ref_class", the maximum inclusive class id found over "ref/"
(if available, -1 if not).
3. "num_utterances", the total number of listed utterances.
4. "num_filts", F.
5. "total_frames", the sum of T over the data dir.
6. "total_tokens", the sum of R over the data dir (if available, -1 if not).
7. "count_<i>", the number of instances of the class "<i>" that appear in "ali/"
(if available).
8. "segs_<i>", the number of segments of the class "<i>" that appear in "ali/"
(if available). A segment of "<i>" is a maximal run of instances of "<i>" which
appear sequentially in an alignment. For example, the alignment "0 1 0 1 1 1" would
have "count_0 = 2" and "count_1 = 4", but "segs_0 = segs_1 = 2".
9. "rcount_<i>", the total number of frames reference tokens with type index "<i>"
occupy according to the segment boundaries listed in the sequences in "ref/" (if
available). If any token sequence containing index "<i>" does not provide segment
boundaries (or "<i>" never occurs), "rcount_<i>" is set to "-1".
10. "rsegs_<i>", the total number of segments (i.e. tokens) with type index "<i>"
that appear in "ref/" (if available).
If "max_ali_class" was found (>= 0), all key/value pairs for "count_0-<max_ali_class>"
and "segs_0-<max_ali_class>" will be specified in the file, even if they aren't found
in the directory. Indices "<i>" will be left-padded with zeros so that keys are sorted
in increasing index. The same holds for "max_ref_class", "rcount_<i>", and "rsegs_<i>".
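The "count_<i>" and "segs_<i>" keys can be sketched in pure Python, with alignments given as lists of integers rather than tensors read from "ali/". The function name is hypothetical, and padding indices to the width of the largest class id is an assumption about how the zero-padding is chosen.

```python
from collections import Counter
from itertools import groupby


def ali_info(alignments):
    """Collect max_ali_class, count_<i>, and segs_<i> over some alignments."""
    counts, segs = Counter(), Counter()
    for ali in alignments:
        counts.update(ali)
        # groupby yields one key per maximal run of equal labels.
        segs.update(label for label, _ in groupby(ali))
    max_class = max(counts) if counts else -1
    width = len(str(max_class))
    info = {"max_ali_class": max_class}
    for i in range(max_class + 1):
        # Classes that never occur still get an entry with value 0.
        info[f"count_{i:0{width}d}"] = counts[i]
        info[f"segs_{i:0{width}d}"] = segs[i]
    return info


print(ali_info([[0, 1, 0, 1, 1, 1]]))
```

On the alignment "0 1 0 1 1 1" from the example above, this gives "count_0 = 2", "count_1 = 4", and "segs_0 = segs_1 = 2".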
In an invalid data directory, the stored key/value pairs are not guaranteed to be
correct. Passing the "--strict" flag will validate the directory first. Passing "--fix"
instead will validate the directory and fix any small issues. See the function
"validate_spect_data_set" in the pydrobert.torch.data Python module for more
information on the validation process.
Note that the output can be parsed as a Kaldi (http://kaldi-asr.org/) text table of
integers.
positional arguments:
dir The torch data directory
out_file The file to write to. If unspecified, stdout
optional arguments:
-h, --help show this help message and exit
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--feat-subdir FEAT_SUBDIR
Subdirectory where features are stored.
--ali-subdir ALI_SUBDIR
Subdirectory where per-frame alignments are stored.
--ref-subdir REF_SUBDIR
Subdirectory where reference token sequences are
stored.
--strict If set, validate the data directory before collecting
info. The process is described in
pydrobert.torch.data.validate_spect_data_set
--fix [N] If set, validate the data directory before collecting
info, potentially fixing small errors in the
directory. An optional integer argument controls the
cropping threshold for ali/ and ref/ (defaults to 1).
The process is described in
pydrobert.torch.data.validate_spect_data_set.
print-torch-ali-data-dir-length-moments
usage: print-torch-ali-data-dir-length-moments [-h] [--precision PRECISION]
[--bessel] [--std]
[--exclude-ids EXCLUDE_IDS [EXCLUDE_IDS ...]]
[--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX]
[--num-workers NUM_WORKERS]
[--mp-chunk-size MP_CHUNK_SIZE]
dir [out]
Compute the mean and variance of segment lengths from an ali data dir
A segment in an "ali/" directory tensor is a maximal sequence of frames with the same
id. This command computes the mean and variance of segment lengths, printing them on one
line as
<mean> (<var>)
The input to this command is the "ali/" subdirectory of the SpectDataSet, not its root.
See the command "get-torch-spect-data-dir-info" for more info about a SpectDataSet
directory.
positional arguments:
dir The ali/ dir (input)
out Where to print statistics. Defaults to stdout
optional arguments:
-h, --help show this help message and exit
--precision PRECISION
Precision with which to print stats
--bessel Perform Bessel correction on the variance estimate
--std Print standard deviation instead of variance
--exclude-ids EXCLUDE_IDS [EXCLUDE_IDS ...]
If specified, segments with ali ids in this list will
be excluded from counts
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
--mp-chunk-size MP_CHUNK_SIZE
The number of utterances that a multiprocessing worker
will process at once. Impacts speed and memory
consumption.
print-torch-ref-data-dir-length-moments
usage: print-torch-ref-data-dir-length-moments [-h] [--strict | --quiet]
[--precision PRECISION]
[--bessel] [--std]
[--exclude-ids EXCLUDE_IDS [EXCLUDE_IDS ...]]
[--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX]
[--num-workers NUM_WORKERS]
[--mp-chunk-size MP_CHUNK_SIZE]
dir [out]
Compute the mean and variance of segment lengths from a ref data dir
A segment in a "ref/" directory tensor is a token with known start and end frames; its
length is the number of frames between those boundaries. This command computes the mean
and variance of segment lengths, printing them on one line as
<mean> (<var>)
The input to this command is the "ref/" subdirectory of the SpectDataSet, not its root.
See the command "get-torch-spect-data-dir-info" for more info about a SpectDataSet
directory.
positional arguments:
dir The ref/ dir (input)
out Where to print statistics. Defaults to stdout
optional arguments:
-h, --help show this help message and exit
--strict Error when boundary info is not available
--quiet Suppress warnings about missing boundary info
--precision PRECISION
Precision with which to print stats
--bessel Perform Bessel correction on the variance estimate
--std Print standard deviation instead of variance
--exclude-ids EXCLUDE_IDS [EXCLUDE_IDS ...]
If specified, segments with token ids in this list
will be excluded from counts
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
--mp-chunk-size MP_CHUNK_SIZE
The number of utterances that a multiprocessing worker
will process at once. Impacts speed and memory
consumption.
subset-torch-spect-data-dir
usage: subset-torch-spect-data-dir [-h] [--copy | --symlink]
(--utt-list UTTID [UTTID ...] | --utt-list-file PATH | --first-n N | --first-ratio R | --last-n N | --last-ratio R | --shortest-n N | --shortest-ratio R | --longest-n N | --longest-ratio R | --rand-n N | --rand-ratio R)
[--only] [--seed SEED]
[--feat-subdir FEAT_SUBDIR]
[--ali-subdir ALI_SUBDIR]
[--ref-subdir REF_SUBDIR]
[--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX]
[--num-workers NUM_WORKERS]
[--mp-chunk-size MP_CHUNK_SIZE]
src dest
Make a new SpectDataSet directory from a subset of utterances of another
This command determines a set of utterances via a flag, then hard links all files in the
"feat/", "ali/", and "ref/" subdirectories of the "src" directory that match those
utterance ids into the "dest" directory.
See the command "get-torch-spect-data-dir-info" for more info about a SpectDataSet
directory.
positional arguments:
src The directory to extract from
dest The directory to extract to
optional arguments:
-h, --help show this help message and exit
--copy Copy extracted files (instead of hard link)
--symlink Symlink extracted files (instead of hard link).
Symlinks will be relative to the destination.
--utt-list UTTID [UTTID ...]
Extract the utterances listed directly after this flag
--utt-list-file PATH Extract the utterances listed in the passed file, one-
per-line
--first-n N Extract this number of utterances listed first by id
--first-ratio R Extract this ratio of utterances (rounding down)
listed first by id
--last-n N Extract this number of utterances listed last by id
--last-ratio R Extract this ratio of utterances (rounding down)
listed last by id
--shortest-n N Extract this number of utterances listed first by
increasing length, then by id
--shortest-ratio R Extract this ratio of utterances listed first by
increasing length, then by id
--longest-n N Extract this number of utterances listed first by
decreasing length, then by id
--longest-ratio R Extract this ratio of utterances listed first by
decreasing length, then by id
--rand-n N Extract this number of utterances listed randomly
--rand-ratio R Extract this ratio of utterances listed randomly
--only If set, extract only the data directly stored in 'src'
--seed SEED Seed used in --rand-* flags for determinism. If
unspecified, non-deterministic
--feat-subdir FEAT_SUBDIR
Subdirectory where features are stored.
--ali-subdir ALI_SUBDIR
Subdirectory where per-frame alignments are stored.
--ref-subdir REF_SUBDIR
Subdirectory where reference token sequences are
stored.
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
--mp-chunk-size MP_CHUNK_SIZE
The number of utterances that a multiprocessing worker
will process at once. Impacts speed and memory
consumption.
Available utterances to extract are determined by the contents of the "feat/"
subdirectory, unless "--only" was specified. Any extra or missing utterances in "ali/"
and "ref/" will be ignored.
If "--utt-list" or "--utt-list-file" is chosen, this command ignores any missing
utterances.
When a criterion involves extracting some number of utterances which exceeds the total
number of utterances, that total is extracted instead.
Ratios are rounded down to the nearest utterance.
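The two rules above (capping at the total and rounding ratios down) can be sketched for the "--first-n" and "--first-ratio" criteria. The helpers are hypothetical and assume ids are ordered with Python's built-in string sort.

```python
import math


def first_n(utt_ids, n):
    """Select the first n ids in sorted order; slicing caps n at the total."""
    return sorted(utt_ids)[:n]


def first_ratio(utt_ids, ratio):
    """Select a ratio of the ids, rounded down to a whole utterance."""
    return first_n(utt_ids, math.floor(ratio * len(utt_ids)))


utt_ids = ["u3", "u1", "u2", "u5", "u4"]
print(first_n(utt_ids, 2))
print(first_n(utt_ids, 10))       # more than available: all five are kept
print(first_ratio(utt_ids, 0.5))  # floor(2.5) = 2 utterances
```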
Sorting by id is performed according to python's sort method, i.e. by locale.
When "--only" is paired with "--shortest-*" or "--longest-*", "src" is assumed to also
be the directory to extract lengths from. Otherwise it's "feat/".
This command has a similar functionality to Kaldi's (https://github.com/kaldi-asr)
subset_data_dir.sh script, but defaults to hard links for cross-compatibility.
textgrids-to-torch-token-data-dir
usage: textgrids-to-torch-token-data-dir [-h] [--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX] [--swap]
[--unk-symbol UNK_SYMBOL]
[--num-workers NUM_WORKERS]
[--mp-chunk-size MP_CHUNK_SIZE]
[--textgrid-suffix TEXTGRID_SUFFIX]
[--fill-symbol FILL_SYMBOL]
[--skip-frame-times | --feat-sizing | --frame-shift-ms FRAME_SHIFT_MS]
[--tier-name TIER_ID | --tier-idx TIER_ID]
tg_dir token2id dir
Convert a directory of TextGrid files into a SpectDataSet ref/ dir
A "TextGrid" file is a transcription file for a single utterance used by the Praat
software (https://www.fon.hum.uva.nl/praat/).
This command accepts a directory of TextGrid files
tg_dir/
<file-prefix>utt_1.<textgrid_suffix>
<file-prefix>utt_2.<textgrid_suffix>
...
and writes each file as a separate token sequence compatible with the "ref/" directory
of a SpectDataSet. If the extracted tier is an IntervalTier, the start and end points
will be saved with each token. If a TextTier (PointTier), the start and end points of
each segment will be identified with the point.
See the command "get-torch-spect-data-dir-info" for more info about a SpectDataSet
directory.
positional arguments:
tg_dir The directory containing the TextGrid files
token2id A file containing mappings from tokens (e.g. words or
phones) to unique IDs. Each line has the format
"<token> <id>". The flag "--swap" can be used to swap
the expected ordering (i.e. to "<id> <token>")
dir The directory to store token sequences to. If the
directory does not exist, it will be created
optional arguments:
-h, --help show this help message and exit
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--swap If set, swaps the order of the key and value in
token/id mapping
--unk-symbol UNK_SYMBOL
If set, will map out-of-vocabulary tokens to this
symbol
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
--mp-chunk-size MP_CHUNK_SIZE
The number of utterances that a multiprocessing worker
will process at once. Impacts speed and memory
consumption.
--textgrid-suffix TEXTGRID_SUFFIX
The file suffix in tg_dir indicating a TextGrid file.
--fill-symbol FILL_SYMBOL
If set, unlabelled intervals in the TextGrid files
will be assigned this symbol. Relevant only if a point
grid.
--skip-frame-times If true, will store token tensors of shape (R,)
instead of (R, 3), foregoing segment start and end
times.
--feat-sizing If true, will store token tensors of shape (R, 1)
instead of (R, 3), foregoing segment start and end
times (which trn does not have). The extra dimension
will allow data in this directory to be loaded as
features in a SpectDataSet.
--frame-shift-ms FRAME_SHIFT_MS
The number of milliseconds that have passed between
consecutive frames. Used to convert between time in
seconds and frame index. If your features are the raw
samples, set this to 1000 / sample_rate_hz
--tier-name TIER_ID The name of the tier to extract.
--tier-idx TIER_ID The index of the tier to extract.
torch-ali-data-dir-to-torch-token-data-dir
usage: torch-ali-data-dir-to-torch-token-data-dir [-h]
[--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX]
[--num-workers NUM_WORKERS]
[--mp-chunk-size MP_CHUNK_SIZE]
ali_dir ref_dir
Convert an ali/ dir to a ref/ dir
This command converts an "ali/" directory from a SpectDataSet to a "ref/" directory.
The former contains frame-wise alignments; the latter contains token sequences. The
frame-wise labels are set to the token ids.
To construct the token sequence, the alignment sequence is partitioned into segments,
each segment corresponding to the longest contiguous span of the same frame-wise label.
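The partitioning step can be sketched with itertools.groupby. This illustration works on a list of integers instead of a tensor and returns (token id, inclusive start frame, exclusive end frame) triples, mirroring the (R, 3) layout of "ref/" sequences; the function name is hypothetical.

```python
from itertools import groupby


def ali_to_ref(ali):
    """Collapse each maximal run of equal labels into one token with boundaries."""
    ref, start = [], 0
    for label, run in groupby(ali):
        end = start + sum(1 for _ in run)
        ref.append((label, start, end))
        start = end
    return ref


print(ali_to_ref([0, 1, 0, 1, 1, 1]))
```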
See the command "get-torch-spect-data-dir-info" for more info about SpectDataSet directories.
positional arguments:
ali_dir The frame alignment data directory (input)
ref_dir The token sequence data directory (output)
optional arguments:
-h, --help show this help message and exit
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
--mp-chunk-size MP_CHUNK_SIZE
The number of utterances that a multiprocessing worker
will process at once. Impacts speed and memory
consumption.
torch-spect-data-dir-to-wds
usage: torch-spect-data-dir-to-wds [-h] [--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX]
[--feat-subdir FEAT_SUBDIR]
[--ali-subdir ALI_SUBDIR]
[--ref-subdir REF_SUBDIR] [--shard]
[--max-samples-per-shard MAX_SAMPLES_PER_SHARD]
[--max-size-per-shard MAX_SIZE_PER_SHARD]
dir tar_path
Convert a SpectDataSet to a WebDataset
A torch SpectDataSet data dir is of the form
dir/
feat/
<file_prefix><utt1><file_suffix>
<file_prefix><utt2><file_suffix>
...
[ali/
<file_prefix><utt1><file_suffix>
<file_prefix><utt2><file_suffix>
...
]
[ref/
<file_prefix><utt1><file_suffix>
<file_prefix><utt2><file_suffix>
...
]
Where "feat/" contains float tensors of shape (N, F), where N is the number of
frames (variable) and F is the number of filters (fixed). "ali/", if there, contains
long tensors of shape (N,) indicating the appropriate class labels (likely pdf-ids
for discriminative training in a DNN-HMM). "ref/", if there, contains long tensors
of shape (R, 3) indicating a sequence of reference tokens where element indexed by
"[i, 0]" is a token id, "[i, 1]" is the inclusive start frame of the token (or a
negative value if unknown), and "[i, 2]" is the exclusive end frame of the token.
This command converts the data directory into a tar file to be used as a
WebDataset (https://github.com/webdataset/webdataset), whose contents are files
<utt1>.feat.pth
[<utt1>.ali.pth]
[<utt1>.ref.pth]
<utt2>.feat.pth
[<utt2>.ali.pth]
[<utt2>.ref.pth]
...
holding tensors with the same interpretation as above.
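The member naming shown above can be illustrated with the standard-library tarfile module. This sketch only demonstrates the archive layout: the payloads are placeholder bytes rather than serialized tensors, the archive is written to memory, and the helper is hypothetical.

```python
import io
import tarfile


def write_wds_tar(fileobj, samples):
    """Write '<utt>.feat.pth', '<utt>.ali.pth', '<utt>.ref.pth' members.

    samples maps an utterance id to a dict with a 'feat' entry and optional
    'ali' and 'ref' entries, each already serialized to bytes.
    """
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for utt_id in sorted(samples):
            for kind in ("feat", "ali", "ref"):
                payload = samples[utt_id].get(kind)
                if payload is None:
                    continue  # ali/ and ref/ entries are optional
                info = tarfile.TarInfo(name=f"{utt_id}.{kind}.pth")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


buffer = io.BytesIO()
write_wds_tar(buffer, {
    "utt1": {"feat": b"placeholder", "ali": b"placeholder"},
    "utt2": {"feat": b"placeholder"},
})
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    names = tar.getnames()
print(names)
```

Keeping all files of one utterance adjacent and sharing a basename is what lets WebDataset treat them as a single sample.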
This command does not require WebDataset to be installed.
positional arguments:
dir The torch data directory
tar_path The path to store files to
optional arguments:
-h, --help show this help message and exit
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--feat-subdir FEAT_SUBDIR
Subdirectory where features are stored.
--ali-subdir ALI_SUBDIR
Subdirectory where per-frame alignments are stored.
--ref-subdir REF_SUBDIR
Subdirectory where reference token sequences are
stored.
--shard Split samples among multiple tar files. 'tar_path'
will be extended with a suffix '.x', where x is the
shard number.
--max-samples-per-shard MAX_SAMPLES_PER_SHARD
If sharding ('--shard' is specified), dictates the
number of samples in each file.
--max-size-per-shard MAX_SIZE_PER_SHARD
If sharding ('--shard' is specified), dictates the
maximum size in bytes of each file.
torch-token-data-dir-to-ctm
usage: torch-token-data-dir-to-ctm [-h] [--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX] [--swap]
[--frame-shift-ms FRAME_SHIFT_MS]
[--wc2utt WC2UTT | --utt2wc UTT2WC | --channel CHANNEL]
dir id2token ctm
Convert a SpectDataSet token data directory to a NIST "ctm" file
A "ctm" file is a transcription file with token alignments (a.k.a. a time-marked
conversation file) used in the sclite
(http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm) toolkit. Here is the
format:
utt_1 A 0.2 0.1 hi
utt_1 A 0.3 1.0 there ;; comment
utt_2 A 0.0 1.0 next
utt_3 A 0.1 0.4 utterance
Where the first number specifies the token start time (in seconds) and the second the
duration.
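Since token boundaries are stored as frame indices, producing the start time and
duration amounts to scaling by the frame shift. A minimal sketch follows; the function
name "frames_to_ctm_line", the 10 ms default frame shift, and the two-decimal formatting
are all assumptions made for illustration.

```python
def frames_to_ctm_line(wavefile, channel, token, start_frame, end_frame,
                       frame_shift_ms=10.0):
    """Format one ctm line from an inclusive start and exclusive end frame."""
    start_sec = start_frame * frame_shift_ms / 1000.0
    # Duration is the number of frames covered times the frame shift
    dur_sec = (end_frame - start_frame) * frame_shift_ms / 1000.0
    return f"{wavefile} {channel} {start_sec:.2f} {dur_sec:.2f} {token}"


ctm_line = frames_to_ctm_line("utt_1", "A", "hi", 20, 30)
```

With a 10 ms shift, frames 20 to 30 correspond to a start of 0.2 seconds and a duration
of 0.1 seconds, as in the first example line above.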
This command scans the contents of a directory like "ref/" in a SpectDataSet and
converts each such file into a transcription. Every token in a given transcription must
have information about its duration. Each such transcription is then written to the
"ctm" file. See the command "get-torch-spect-data-dir-info" for more info about a
SpectDataSet directory.
positional arguments:
dir The directory to read token sequences from
id2token A file containing mappings from unique IDs to tokens
(e.g. words or phones). Each line has the format "<id>
<token>". The flag "--swap" can be used to swap the
expected ordering (i.e. to "<token> <id>")
ctm The "ctm" file to write token segments to
optional arguments:
-h, --help show this help message and exit
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--swap If set, swaps the order of the key and value in
token/id mapping
--frame-shift-ms FRAME_SHIFT_MS
The number of milliseconds that have passed between
consecutive frames. Used to convert between time in
seconds and frame index. If your features are the raw
samples, set this to 1000 / sample_rate_hz
--wc2utt WC2UTT A file mapping wavefile name and channel combinations
(e.g. 'utt_1 A') to utterance IDs. Each line of the
file has the format '<wavefile_name> <channel>
<utt_id>'.
--utt2wc UTT2WC A file mapping utterance IDs to wavefile name and
channel combinations (e.g. 'utt_1 A'). Each line of
the file has the format '<utt_id> <wavefile_name>
<channel>'.
--channel CHANNEL If neither "--wc2utt" nor "--utt2wc" is specified,
utterance IDs are treated as wavefile names and are
given the value of this flag as a channel
torch-token-data-dir-to-textgrids
usage: torch-token-data-dir-to-textgrids [-h] (--feat-dir FEAT_DIR | --infer)
[--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX] [--swap]
[--frame-shift-ms FRAME_SHIFT_MS]
[--num-workers NUM_WORKERS]
[--mp-chunk-size MP_CHUNK_SIZE]
[--textgrid-suffix TEXTGRID_SUFFIX]
[--tier-name TIER_NAME]
[--precision PRECISION] [--quiet]
[--force-method {1,2,3}]
ref_dir id2token tg_dir
Convert a SpectDataSet ref/ dir into a directory of TextGrid files
A "TextGrid" file is a transcription file for a single utterance used by the Praat
software (https://www.fon.hum.uva.nl/praat/).
This command accepts a directory of token sequences compatible with the "ref/"
directory of a SpectDataSet and outputs a directory of TextGrid files
tg_dir/
<file-prefix>utt_1.<textgrid_suffix>
<file-prefix>utt_2.<textgrid_suffix>
...
A token sequence ref is a tensor of shape either (R, 3) or just (R,). The latter has no
segment information and is just the tokens. The former contains triples "tok, start,
end", where "tok" is the token id, "start" is the starting frame inclusive, and "end" is
the ending frame exclusive. A negative value for either boundary means the information
is not available.
By default, this command tries to save the sequence as a tier preserving as much
information in the token sequence as possible in a consistent way. The following methods
are attempted in order:
1. If ref is of shape (R, 3), all segment boundaries are available, and all segments
are of nonzero length, the sequence will be saved as an IntervalTier containing
segment boundaries.
2. If ref is of shape (R, 3) and either the start or end boundary is available for every
token, the sequence will be saved as a TextTier (PointTier) with points set to the
available boundary (with precedence going to the greater).
3. Otherwise, the token sequence is written as an interval tier with a single segment
spanning the recording and containing all tokens.
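The order of attempts above can be sketched in plain Python. This is only an
illustration of the decision logic as described, using lists of "(tok, start, end)"
triples in place of tensors; the names "choose_method" and "point_for" are hypothetical,
and treating an empty sequence as method 3 is an assumption of the sketch.

```python
def choose_method(ref):
    """Return 1, 2, or 3 for a list of token ids or (tok, start, end) triples."""
    is_triples = len(ref) > 0 and all(
        isinstance(r, (tuple, list)) and len(r) == 3 for r in ref
    )
    # Method 1: every boundary known and every segment of nonzero length
    if is_triples and all(s >= 0 and e >= 0 and e > s for _, s, e in ref):
        return 1
    # Method 2: at least one boundary known for every token
    if is_triples and all(s >= 0 or e >= 0 for _, s, e in ref):
        return 2
    # Method 3: fall back to a single interval holding all tokens
    return 3


def point_for(start, end):
    """Pick the point for method 2: unknown boundaries are negative, so max() works."""
    return max(start, end)
```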
In addition, the total length of the features in frames must be determined. Either the
flag "--feat-dir" must be specified in order to get the length directly from the feature
sequences, or "--infer" must be specified. The latter guesses the length to be the
maximum available end boundary in the token sequence, or 0 if none is available (with a
warning unless "--quiet" is set).
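The inference rule can be written in a couple of lines. The sketch below assumes a list
of "(tok, start, end)" triples and a hypothetical function name "infer_num_frames"; it
omits the warning.

```python
def infer_num_frames(ref):
    """Guess the sequence length as the largest known end boundary, else 0."""
    ends = [end for _, _, end in ref if end >= 0]
    return max(ends) if ends else 0
```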
Note that Praat usually works either with point data or with intervals which
collectively partition the audio. It can parse TextGrid files with non-contiguous
intervals, but they are rendered strangely.
See the command "get-torch-spect-data-dir-info" for more info about a SpectDataSet
directory.
positional arguments:
ref_dir The token sequence data directory (input)
id2token A file containing mappings from unique IDs to tokens
(e.g. words or phones). Each line has the format "<id>
<token>". The flag "--swap" can be used to swap the
expected ordering (i.e. to "<token> <id>")
tg_dir The TextGrid directory (output)
optional arguments:
-h, --help show this help message and exit
--feat-dir FEAT_DIR Path to features
--infer Infer lengths based on maximum segment boundaries
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--swap If set, swaps the order of the key and value in
token/id mapping
--frame-shift-ms FRAME_SHIFT_MS
The number of milliseconds that have passed between
consecutive frames. Used to convert between time in
seconds and frame index. If your features are the raw
samples, set this to 1000 / sample_rate_hz
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
--mp-chunk-size MP_CHUNK_SIZE
The number of utterances that a multiprocessing worker
will process at once. Impacts speed and memory
consumption.
--textgrid-suffix TEXTGRID_SUFFIX
The file suffix in tg_dir indicating a TextGrid file.
--tier-name TIER_NAME
The name to save the tier with
--precision PRECISION
Precision with which to save floating point values in
TextGrid files
--quiet If set, suppresses warnings when lengths cannot be
determined
--force-method {1,2,3}
Force a specific method of writing to TextGrid (1-3
above). An error is raised if the token sequence lacks
the information the method requires.
torch-token-data-dir-to-torch-ali-data-dir
usage: torch-token-data-dir-to-torch-ali-data-dir [-h] [--feat-dir FEAT_DIR]
[--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX]
[--num-workers NUM_WORKERS]
[--mp-chunk-size MP_CHUNK_SIZE]
ref_dir ali_dir
Convert a ref/ dir to an ali/ dir
This command converts a "ref/" directory from a SpectDataSet to an "ali/" directory. The
former contains sequences of tokens; the latter contains frame-wise alignments. The
token ids are set to the frame-wise labels.
A reference token sequence "ref" partitions a frame sequence of length T if
1. ref is of shape (R, 3), with R > 1 and all ref[r, 1:] >= 0 (it contains segment
boundaries).
2. ref[0, 1] = 0 (it starts at frame 0).
3. for all 0 <= r < R - 1, ref[r, 2] = ref[r + 1, 1] (boundaries contiguous).
4. ref[R - 1, 2] = T (it ends after T frames).
When ref partitions the frame sequence, it can be converted into a per-frame alignment
tensor "ali" of shape (T,), where ref[r, 1] <= t < ref[r, 2] implies ali[t] = ref[r, 0].
WARNING! This operation is potentially destructive: a per-frame alignment cannot
distinguish between two of the same token next to one another and one larger token.
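The partition conditions and the conversion can be sketched with plain lists of
"(tok, start, end)" triples. This is an illustration rather than the command's code: the
names "partitions" and "ref_to_ali" are hypothetical, the sketch accepts any non-empty
ref (so the single-segment case is an assumption of the sketch), and it adds a guard
against segments whose end precedes their start.

```python
def partitions(ref, num_frames):
    """Check whether ref partitions a frame sequence of length num_frames."""
    if not ref:
        return False
    # All boundaries must be known, and no segment may run backwards
    if any(s < 0 or e < 0 or e < s for _, s, e in ref):
        return False
    if ref[0][1] != 0:
        return False
    # Each segment must end where the next one starts
    if any(ref[r][2] != ref[r + 1][1] for r in range(len(ref) - 1)):
        return False
    return ref[-1][2] == num_frames


def ref_to_ali(ref, num_frames):
    """Expand each segment's token id over the frames it covers."""
    if not partitions(ref, num_frames):
        raise ValueError("ref does not partition the frame sequence")
    ali = []
    for tok, start, end in ref:
        ali.extend([tok] * (end - start))
    return ali


# Two adjacent segments of token 7 and one long segment give the same alignment
ali_two = ref_to_ali([(7, 0, 2), (7, 2, 5)], 5)
ali_one = ref_to_ali([(7, 0, 5)], 5)
```

Because "ali_two" and "ali_one" are identical, the original segmentation cannot be
recovered from the alignment alone.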
See the command "get-torch-spect-data-dir-info" for more info about SpectDataSet
directories.
positional arguments:
ref_dir The token sequence data directory (input)
ali_dir The frame alignment data directory (output)
optional arguments:
-h, --help show this help message and exit
--feat-dir FEAT_DIR The feature data directory. While not necessary for
the conversion, specifying this directory will allow
the total number of frames in each utterance to be
checked by loading the associated feature matrix.
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
--mp-chunk-size MP_CHUNK_SIZE
The number of utterances that a multiprocessing worker
will process at once. Impacts speed and memory
consumption.
torch-token-data-dir-to-trn
usage: torch-token-data-dir-to-trn [-h] [--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX] [--swap]
[--num-workers NUM_WORKERS]
dir id2token trn
Convert a SpectDataSet token data dir to a NIST trn file
A "trn" file is the standard transcription file without alignment information used
in the sclite (http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm)
toolkit. It has the format
here is a transcription (utterance_a)
here is another (utterance_b)
This command scans the contents of a directory like "ref/" in a SpectDataSet and
converts each such file into a transcription. Each such transcription is then
written to a "trn" file. See the command "get-torch-spect-data-dir-info" for more
info about a SpectDataSet directory.
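To make the mapping from token ids to a "trn" line concrete, here is a minimal sketch.
The helper names "read_id2token" and "to_trn_line" are hypothetical, and token ids are
assumed to be given as a plain list of integers rather than a tensor.

```python
def read_id2token(lines, swap=False):
    """Parse '<id> <token>' lines (or '<token> <id>' when swap is True)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        first, second = line.split()
        id_, token = (second, first) if swap else (first, second)
        mapping[int(id_)] = token
    return mapping


def to_trn_line(utt_id, token_ids, id2token):
    """Join the tokens and append the utterance ID in parentheses."""
    words = " ".join(id2token[i] for i in token_ids)
    return f"{words} ({utt_id})"


id2token = read_id2token(["0 here", "1 is", "2 a", "3 transcription"])
trn_line = to_trn_line("utterance_a", [0, 1, 2, 3], id2token)
```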
positional arguments:
dir The directory to read token sequences from
id2token A file containing mappings from unique IDs to tokens
(e.g. words or phones). Each line has the format "<id>
<token>". The flag "--swap" can be used to swap the
expected ordering (i.e. to "<token> <id>")
trn The "trn" file to write transcriptions to
optional arguments:
-h, --help show this help message and exit
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--swap If set, swaps the order of the key and value in
token/id mapping
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
trn-to-torch-token-data-dir
usage: trn-to-torch-token-data-dir [-h] [--alt-handler {error,first}]
[--file-prefix FILE_PREFIX]
[--file-suffix FILE_SUFFIX] [--swap]
[--unk-symbol UNK_SYMBOL]
[--num-workers NUM_WORKERS]
[--mp-chunk-size MP_CHUNK_SIZE]
[--skip-frame-times | --feat-sizing]
trn token2id dir
Convert a NIST "trn" file to the specified SpectDataSet data dir
A "trn" file is the standard transcription file without alignment information used in
the sclite (http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm) toolkit. It
has the format
here is a transcription (utterance_a)
here is another (utterance_b)
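A line in this format can be split into an utterance ID and a list of token ids given a
token-to-id mapping. The sketch below is illustrative only: "parse_trn_line" is a
hypothetical name, transcription alternates are not handled, and filling unknown
boundaries with -1 is an assumption based on the convention that negative boundaries
mean "unavailable".

```python
import re

# Transcription text, then the utterance ID in trailing parentheses
_TRN_LINE = re.compile(r"^(.*?)\s*\(([^()]*)\)\s*$")


def parse_trn_line(line, token2id, unk_symbol=None):
    """Return (utt_id, token_ids) for one trn line without alternates."""
    match = _TRN_LINE.match(line)
    if match is None:
        raise ValueError(f"not a trn line: {line!r}")
    text, utt_id = match.groups()
    ids = []
    for token in text.split():
        if token not in token2id:
            if unk_symbol is None:
                raise KeyError(token)
            # Map out-of-vocabulary tokens to the unknown symbol
            token = unk_symbol
        ids.append(token2id[token])
    return utt_id, ids


token2id = {"here": 0, "is": 1, "<unk>": 2}
utt_id, ids = parse_trn_line("here is another (utterance_b)", token2id, "<unk>")
# Assumption: rows of (token, start, end) with -1 marking unknown boundaries
triples = [(i, -1, -1) for i in ids]
```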
This command reads in a "trn" file and writes its contents as token sequences compatible
with the "ref/" directory of a SpectDataSet. See the command
"get-torch-spect-data-dir-info" for more info about a SpectDataSet directory.
positional arguments:
trn The input trn file
token2id A file containing mappings from tokens (e.g. words or
phones) to unique IDs. Each line has the format
"<token> <id>". The flag "--swap" can be used to swap
the expected ordering (i.e. to "<id> <token>")
dir The directory to store token sequences to. If the
directory does not exist, it will be created
optional arguments:
-h, --help show this help message and exit
--alt-handler {error,first}
How to handle transcription alternates. If "error",
error if the "trn" file contains alternates. If
"first", always treat the first alternate as canon
--file-prefix FILE_PREFIX
The file prefix indicating a torch data file
--file-suffix FILE_SUFFIX
The file suffix indicating a torch data file
--swap If set, swaps the order of the key and value in
token/id mapping
--unk-symbol UNK_SYMBOL
If set, will map out-of-vocabulary tokens to this
symbol
--num-workers NUM_WORKERS
The number of workers to spawn to process the data. 0
is serial. Defaults to the CPU count
--mp-chunk-size MP_CHUNK_SIZE
The number of utterances that a multiprocessing worker
will process at once. Impacts speed and memory
consumption.
--skip-frame-times If set, will store token tensors of shape (R,)
instead of (R, 3), forgoing segment start and end
times.
--feat-sizing If set, will store token tensors of shape (R, 1)
instead of (R, 3), forgoing segment start and end
times (which trn does not have). The extra dimension
will allow data in this directory to be loaded as
features in a SpectDataSet.