data
Classes and functions related to storing/retrieving speech data
- class pydrobert.torch.data.AbstractEpochSampler(data_source, init_epoch=0, on_uneven_distributed='raise')[source]
ABC for sampling based on epoch
- epoch
- class pydrobert.torch.data.BucketBatchSampler(sampler, idx2bucket, bucket2size, drop_incomplete=False)[source]
Batch samples into buckets, yielding as soon as the bucket is full
- Parameters:
sampler (Collection[int]) – Determines the order in which samples are put into buckets.
idx2bucket (Dict[int, TypeVar(H, bound=Hashable)]) – A map specifying which bucket each sample belongs to. The keys are the indices yielded by sampler; the values are the ids of the corresponding buckets.
bucket2size (Dict[TypeVar(H, bound=Hashable), int]) – A map from the bucket ids (the values in idx2bucket) to the corresponding batch size. Values must be positive.
drop_incomplete (bool) – If True, any batches which are incomplete (smaller than the bucket’s batch size) at the end of an epoch will be discarded. Otherwise, the incomplete batches will be yielded in the order of their bucket ids’ hashes.
- Yields:
batch (list of int) – A list of indices from sampler all belonging to the same bucket. The batch is yielded as soon as it is full (or the epoch has ended with drop_incomplete set to False).
Warning
BucketBatchSampler has no __len__() method. Correctly determining the length of the batched sampler requires knowledge of which indices of sampler are being iterated over, which can only be determined by iterating over the sampler.
Examples
>>> N = 14
>>> dataset = torch.utils.data.TensorDataset(torch.rand(N))
>>> ssampler = torch.utils.data.SequentialSampler(dataset)
>>> idx2bucket = dict((n, int(n % 3 == 0)) for n in range(N))
>>> bucket2size = {0: 2, 1: 2}
>>> bsampler = BucketBatchSampler(ssampler, idx2bucket, bucket2size, True)
>>> print(list(bsampler))
[[1, 2], [0, 3], [4, 5], [7, 8], [6, 9], [10, 11]]
>>> bsampler = BucketBatchSampler(ssampler, idx2bucket, bucket2size, False)
>>> print(list(bsampler))
[[1, 2], [0, 3], [4, 5], [7, 8], [6, 9], [10, 11], [13], [12]]
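The yielding rule can also be illustrated without the library. The following is a minimal pure-Python sketch, not the class's actual implementation; the helper name bucket_batches is hypothetical. It reproduces the batches printed in the example above.

```python
from typing import Dict, Hashable, Iterable, Iterator, List


def bucket_batches(
    indices: Iterable[int],
    idx2bucket: Dict[int, Hashable],
    bucket2size: Dict[Hashable, int],
    drop_incomplete: bool = False,
) -> Iterator[List[int]]:
    """Yield a batch as soon as its bucket is full (hypothetical helper)."""
    pending: Dict[Hashable, List[int]] = {}
    for idx in indices:
        bucket = idx2bucket[idx]
        batch = pending.setdefault(bucket, [])
        batch.append(idx)
        if len(batch) == bucket2size[bucket]:
            yield batch
            pending[bucket] = []  # start a fresh batch for this bucket
    if not drop_incomplete:
        # leftovers are emitted in the order of their bucket ids' hashes
        for bucket in sorted(pending, key=hash):
            if pending[bucket]:
                yield pending[bucket]


N = 14
idx2bucket = {n: int(n % 3 == 0) for n in range(N)}
bucket2size = {0: 2, 1: 2}
full_only = list(bucket_batches(range(N), idx2bucket, bucket2size, True))
with_leftovers = list(bucket_batches(range(N), idx2bucket, bucket2size, False))
```

Because batch boundaries depend on the order in which indices arrive, the number of batches cannot be known without iterating, which is the reason given in the warning for the missing __len__().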
- class pydrobert.torch.data.ContextWindowDataLoader(data, params, data_params=None, shuffle=True, init_epoch=0, seed=None, **kwargs)[source]
DataLoader for ContextWindowDataSet
- Parameters:
data (Union[str, ContextWindowDataSet]) – Either a ContextWindowDataSet or a path to the data directory.
params (Union[ContextWindowDataLoaderParams, DataLoaderParams]) – Contains at least the parameters specific to the loader. May also contain data set params; see data_params.
data_params (Optional[ContextWindowDataParams]) – Data set parameters. Relevant only when data is a path. Used to initialize the underlying ContextWindowDataSet. If None, params is assumed to also contain the data set parameters.
shuffle (bool) – Whether utterances are shuffled at every epoch or presented sequentially.
sort_batch – Whether utterances in a batch are sorted by feature length.
init_epoch (int) – The epoch to resume from. When combined with a fixed seed, ensures the same batches are always delivered for a given epoch.
seed (Optional[int]) – The initial seed used for shuffling data. If unset, a random one will be generated.
**kwargs – Additional keyword arguments to initialize ContextWindowDataSet and torch.utils.data.DataLoader. The former is only relevant when data is a path.
- Yields:
batch – A tuple windows, alis[, window_sizes, uttids], with window_sizes and uttids included if suppress_uttids is False. See context_window_seq_to_batch() for more information on the elements.
Warning
This class does not currently support torch.distributed. Each process will return the same batches.
- class pydrobert.torch.data.ContextWindowDataLoaderParams(*, context_left, context_right, reverse, delta_order, do_mvn, eos, sos, subset_ids, batch_size, drop_last, name)[source]
Parameters for a ContextWindowDataLoader
This implements the pydrobert.param.optuna.TunableParameterized interface.
- class pydrobert.torch.data.ContextWindowDataParams(*, context_left, context_right, reverse, delta_order, do_mvn, eos, sos, subset_ids, name)[source]
Parameters for ContextWindowDataSet
This implements the pydrobert.param.optuna.TunableParameterized interface.
- class pydrobert.torch.data.ContextWindowDataSet(data_dir, left=None, right=None, file_prefix='', file_suffix='.pt', warn_on_missing=True, subset_ids=None, feat_subdir='feat', ali_subdir='ali', reverse=None, params=None, feat_mean=None, feat_std=None, suppress_uttids=True)[source]
SpectDataSet, extracting fixed-width windows over the utterance
Like a SpectDataSet, but replaces the feat tensor with window, which runs a sliding window over the frame dimension of feat.
- Parameters:
- Yields:
tup – For a given utterance, a tuple:
window, windowed spectral features of shape (T, 1 + left + right, F), where the T axis indexes the so-called center frame and the 1 + left + right axis contains frame vectors (size F) including the center frame and those to the left and right.
ali, window-level alignments, or None if not available.
uttid (if suppress_uttids is False), the string representing the utterance id.
Examples
>>> # see 'SpectDataSet' to set up data directory
>>> data = ContextWindowDataSet('data')
>>> data[0]  # random access returns (window, ali) pairs
>>> for window, ali in data:
...     pass  # so does the iterator
>>> data.get_utterance_tuple(3)  # gets the original (feat, ali) pair
- class pydrobert.torch.data.DataLoaderParams(*, batch_size, drop_last, name)[source]
General parameters for a DataSet from pydrobert.torch.data
This implements the pydrobert.param.optuna.TunableParameterized interface.
- class pydrobert.torch.data.DynamicLengthDataLoaderParams(*, num_length_buckets, size_batch_by_length, batch_size, drop_last, name)[source]
Parameters for a data loader whose elements have dynamic lengths
- class pydrobert.torch.data.EpochRandomSampler(data_source, init_epoch=0, base_seed=None, on_uneven_distributed='raise')[source]
A deterministic RandomSampler which handles torch.distributed
- Parameters:
data_source (Sized) – The dataset to draw the sample from.
init_epoch (int) – The initial epoch.
base_seed (Optional[int]) – Determines the starting seed of the sampler. Sampling is seeded with (base_seed, epoch). If unset, a seed is randomly generated from the default pytorch generator.
on_uneven_distributed (Literal['raise', 'drop', 'uneven', 'ignore']) – What to do if the sampler detects that it’s in a distributed environment and the number of processes does not evenly divide the number of samples:
'raise': raise a ValueError.
'drop': drop the remainder. The dropped samples will be randomized each epoch.
'uneven': allow some processes to yield fewer samples.
'ignore': ignore the distributed context. Each process will yield all samples.
Warning
The default means of seeding the shuffler changed from version 0.3. Previously the shuffler was seeded on each epoch with the value base_seed + epoch. The change means training a network in this version will yield different results from that trained in version 0.3 even if base_seed is the same.
The change was made because, if repeated experiments were seeded sequentially, then the n-th epoch of the m-th run would see samples in the same order as the m-th epoch of the n-th run. Thus, repeated trials were unintentionally correlated.
Examples
>>> sampler = EpochRandomSampler(
...     torch.utils.data.TensorDataset(torch.arange(100)))
>>> samples_ep0 = tuple(sampler)  # random
>>> samples_ep1 = tuple(sampler)  # random, probably not same as first
>>> assert tuple(sampler.get_samples_for_epoch_ignoring_distributed(0)) == samples_ep0
>>> assert tuple(sampler.get_samples_for_epoch_ignoring_distributed(1)) == samples_ep1
- base_seed
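The correlation described in the warning can be reproduced with the standard library alone. This is a sketch under stated assumptions, not the sampler's implementation: random.Random stands in for the pytorch generator, and encoding the pair (base_seed, epoch) as a string is a choice made only for this illustration. Both function names are hypothetical.

```python
import random


def additive_order(base_seed: int, epoch: int, n: int = 100) -> list:
    """Old scheme: seed each epoch with base_seed + epoch."""
    order = list(range(n))
    random.Random(base_seed + epoch).shuffle(order)
    return order


def paired_order(base_seed: int, epoch: int, n: int = 100) -> list:
    """Stand-in for seeding on the pair (base_seed, epoch).

    Encoding the pair as a string is an assumption of this sketch.
    """
    order = list(range(n))
    random.Random(f"{base_seed},{epoch}").shuffle(order)
    return order


# repeated runs seeded sequentially: run m uses base_seed = m
collides = additive_order(1, 2) == additive_order(2, 1)  # both seed with 3
still_differs = paired_order(1, 2) != paired_order(2, 1)
```

With additive seeding, epoch 2 of run 1 and epoch 1 of run 2 share a seed and therefore an ordering; seeding on the pair keeps the two apart.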
- class pydrobert.torch.data.EpochSequentialSampler(data_source, init_epoch=0, on_uneven_distributed='raise')[source]
A SequentialSampler which handles torch.distributed
Yields samples [1, 2, ...]
- Parameters:
data_source (Sized) – The dataset to draw the sample from.
init_epoch (int) – The initial epoch.
on_uneven_distributed (Literal['raise', 'drop', 'uneven', 'ignore']) – What to do if the sampler detects that it’s in a distributed environment and the number of processes does not evenly divide the number of samples:
'raise': raise a ValueError.
'drop': drop the last few samples.
'uneven': allow some processes to yield fewer samples.
'ignore': ignore the distributed context. Each process will yield all samples.
See the below note for more information.
Notes
The following note regards how the sampler handles torch.distributed.
Sequential sampling in a distributed, parallel environment is not well defined. When on_uneven_distributed is 'ignore', each process sees all data sequentially. As such, every process repeats the same work and returns the same value. Though wasteful, results are likely correct, and hence easiest to adapt to from a non-distributed codebase (e.g. with pydrobert.torch.training.TrainingStateController). Distributed sequential sampling may still be appropriate otherwise when ordering does not matter, such as when an evaluation metric is computed in aggregate.
When in a distributed environment and on_uneven_distributed is not 'ignore', process r of W processes will be responsible for samples [r, r + W, r + 2W, ...] (assuming shifting is False). When the total number of samples N is divisible by W, each process sees the same number of samples and all samples are yielded by exactly one process. Assuming the quantity of interest is an average over all samples, computing the average per process and then averaging those over processes should yield the same result.
When W does not divide N and on_uneven_distributed is 'uneven', all samples will be yielded by exactly one process but not all processes will yield the same number of samples. Averaging must be performed with specialized logic; see torch.distributed.algorithms.Join for one option.
Finally, when W does not divide N and on_uneven_distributed is 'drop', the last N % W samples are dropped to ensure divisibility. Each process will see the same number of samples, but the last few samples will never be yielded. While averaging will almost always yield a different result from the non-distributed case, it may nonetheless be close when N % W is small.
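The striding rule in the note can be sketched in plain Python. The function shard_indices is hypothetical and only mirrors the documented behaviour of the four on_uneven_distributed options; it does not touch torch.distributed.

```python
from typing import List


def shard_indices(n: int, rank: int, world: int, on_uneven: str = "raise") -> List[int]:
    """Indices process `rank` of `world` would see under the striding rule."""
    if on_uneven not in ("raise", "drop", "uneven", "ignore"):
        raise ValueError(f"unknown option {on_uneven!r}")
    if on_uneven == "ignore":
        return list(range(n))  # every process sees everything
    remainder = n % world
    if remainder and on_uneven == "raise":
        raise ValueError(f"{world} processes do not evenly divide {n} samples")
    if on_uneven == "drop":
        n -= remainder  # the last n % world samples are never yielded
    return list(range(rank, n, world))


uneven = [shard_indices(10, r, 4, "uneven") for r in range(4)]
dropped = [shard_indices(10, r, 4, "drop") for r in range(4)]

# when world divides n, the mean of per-process means equals the global mean
values = [float(v) for v in range(12)]
shards = [[values[i] for i in shard_indices(12, r, 4)] for r in range(4)]
mean_of_means = sum(sum(s) / len(s) for s in shards) / len(shards)
global_mean = sum(values) / len(values)
```

In the 'uneven' case the shards have different lengths, so an unweighted mean of per-process means would no longer match the global mean.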
- class pydrobert.torch.data.LangDataLoader(data, params, data_params=None, shuffle=True, batch_first=True, sort_batch=False, init_epoch=0, on_uneven_distributed='raise', seed=None, **kwargs)[source]
DataLoader for a LangDataSet
- Parameters:
data (Union[str, LangDataSet]) – Either a LangDataSet or a path to the data directory.
params (Union[LangDataLoaderParams, DynamicLengthDataLoaderParams]) – Contains at least the parameters specific to the loader. May also contain data set params; see data_params.
data_params (Optional[LangDataParams]) – Data set parameters. Relevant only when data is a path. Used to initialize the underlying LangDataSet. If None, params is assumed to also contain the data set parameters.
shuffle (bool) – Whether utterances are shuffled at every epoch or presented sequentially.
batch_first (bool) – Whether the batch dimension comes before the sequence dimension in refs.
sort_batch (bool) – Whether utterances in a batch are sorted by feature length.
init_epoch (int) – The epoch to resume from. When combined with a fixed seed, ensures the same batches are always delivered for a given epoch.
seed (Optional[int]) – The initial seed used for shuffling data. If unset, a random one will be generated.
on_uneven_distributed (Literal['raise', 'unordered', 'ignore']) – What to do if the sampler detects that it’s in a distributed environment and the number of processes does not evenly divide the number of samples:
'raise': raise a ValueError.
'uneven': allow some processes to yield fewer samples.
'ignore': ignore the distributed context. Each process will yield all samples.
**kwargs – Additional keyword arguments to initialize LangDataSet and torch.utils.data.DataLoader. The former is only relevant when data is a path.
- Yields:
batch (Union[tuple, torch.Tensor]) – A tuple refs, ref_lens[, utt_ids], with utt_ids present only when suppress_uttids is False in the underlying LangDataSet. See lang_seq_to_batch() for more information on the elements.
- class pydrobert.torch.data.LangDataLoaderParams(*, eos, sos, subset_ids, num_length_buckets, size_batch_by_length, batch_size, drop_last, name)[source]
Parameters for a LangDataLoader
This implements the pydrobert.param.optuna.TunableParameterized interface.
- class pydrobert.torch.data.LangDataParams(*, eos, sos, subset_ids, name)[source]
Parameters for LangDataSet
- class pydrobert.torch.data.SpectDataLoader(data, params, data_params=None, shuffle=True, batch_first=True, sort_batch=False, init_epoch=0, on_uneven_distributed='raise', seed=None, **kwargs)[source]
DataLoader for a SpectDataSet
- Parameters:
data (Union[str, SpectDataSet]) – Either a SpectDataSet or a path to the data directory.
params (Union[SpectDataLoaderParams, DynamicLengthDataLoaderParams]) – Contains at least the parameters specific to the loader. May also contain data set params; see data_params.
data_params (Optional[SpectDataParams]) – Data set parameters. Relevant only when data is a path. Used to initialize the underlying SpectDataSet. If None, params is assumed to also contain the data set parameters.
shuffle (bool) – Whether utterances are shuffled at every epoch or presented sequentially.
batch_first (bool) – Whether the batch dimension comes before the sequence dimension in feats and refs.
sort_batch (bool) – Whether utterances in a batch are sorted by feature length.
init_epoch (int) – The epoch to resume from. When combined with a fixed seed, ensures the same batches are always delivered for a given epoch.
seed (Optional[int]) – The initial seed used for shuffling data. If unset, a random one will be generated.
on_uneven_distributed (Literal['raise', 'unordered', 'ignore']) – What to do if the sampler detects that it’s in a distributed environment and the number of processes does not evenly divide the number of samples:
'raise': raise a ValueError.
'uneven': allow some processes to yield fewer samples.
'ignore': ignore the distributed context. Each process will yield all samples.
**kwargs – Additional keyword arguments to initialize SpectDataSet and torch.utils.data.DataLoader. The former is only relevant when data is a path.
Warning
SpectDataLoader uses the default True for suppress_alis and tokens_only while the current, deprecated default used by SpectDataSet is False.
- Yields:
batch – A tuple feats[, alis,] refs, feat_sizes, ref_sizes[, uttids], with alis included if suppress_alis is False and uttids included if suppress_uttids is False. See spect_seq_to_batch() for more information on the elements.
- class pydrobert.torch.data.SpectDataLoaderParams(*, delta_order, do_mvn, eos, sos, subset_ids, num_length_buckets, size_batch_by_length, batch_size, drop_last, name)[source]
Parameters for a SpectDataLoader
This implements the pydrobert.param.optuna.TunableParameterized interface.
- class pydrobert.torch.data.SpectDataParams(*, delta_order, do_mvn, eos, sos, subset_ids, name)[source]
Parameters for SpectDataSet
- class pydrobert.torch.data.SpectDataSet(data_dir, file_prefix='', file_suffix='.pt', warn_on_missing=True, subset_ids=None, sos=None, eos=None, feat_subdir='feat', ali_subdir='ali', ref_subdir='ref', params=None, feat_mean=None, feat_std=None, suppress_alis=None, suppress_uttids=True, tokens_only=None)[source]
Accesses spectrographic filter data stored in a data directory
- Parameters:
data_dir (str) – A path to the data directory.
file_prefix (str) – The prefix that indicates that the file counts toward the data set.
file_suffix (str) – The suffix that indicates that the file counts toward the data set.
warn_on_missing (bool) – If ali/ or ref/ exist, there’s a mismatch between the utterances in the directories, and warn_on_missing is True, a warning will be issued (via warnings) regarding each such mismatch.
subset_ids (Optional[Set[str]]) – Deprecated. Use params.subset_ids.
feat_subdir (str), ali_subdir (Optional[str]), ref_subdir (Optional[str]) – Change the names of the subdirectories under which feats, alignments, and references are stored. If ali_subdir or ref_subdir is None, they will not be searched for.
params (Optional[SpectDataParams]) – Populates the parameters of this class with the instance. If unset, a new SpectDataParams instance is initialized.
feat_mean (Optional[Tensor]) – If specified and params.do_mvn is True, this tensor will be used as the mean in mean-variance normalization.
feat_std (Optional[Tensor]) – If specified and params.do_mvn is True, this tensor will be used as the standard deviation in mean-variance normalization.
suppress_alis (bool) – If True, ali will not be yielded, nor will alignment information be counted towards the list of utterance ids if available.
suppress_uttids (bool) – If True, uttid will not be yielded.
tokens_only (bool) – If True, ref will drop the segment information if present, always yielding tuples of shape (R,).
- Yields:
tup – For a given utterance, a tuple:
Examples
Creating a spectral data directory with random data
>>> import os
>>> import torch
>>> data_dir = 'data'
>>> os.makedirs(data_dir + '/feat', exist_ok=True)
>>> os.makedirs(data_dir + '/ali', exist_ok=True)
>>> os.makedirs(data_dir + '/ref', exist_ok=True)
>>> num_filts, min_frames, max_frames, min_ref, max_ref = 40, 10, 20, 3, 10
>>> num_ali_classes, num_ref_classes = 100, 2000
>>> for utt_idx in range(30):
...     num_frames = torch.randint(
...         min_frames, max_frames + 1, (1,)).long().item()
...     num_tokens = torch.randint(
...         min_ref, max_ref + 1, (1,)).long().item()
...     feats = torch.randn(num_frames, num_filts)
...     torch.save(feats, data_dir + '/feat/{:02d}.pt'.format(utt_idx))
...     ali = torch.randint(num_ali_classes, (num_frames,)).long()
...     torch.save(ali, data_dir + '/ali/{:02d}.pt'.format(utt_idx))
...     # usually these would be sorted by order in utterance. Negative
...     # values represent "unknown" for start and end frames
...     ref_tokens = torch.randint(num_tokens, (num_tokens,))
...     ref_starts = torch.randint(1, num_frames // 2, (num_tokens,))
...     ref_ends = 2 * ref_starts
...     ref = torch.stack([ref_tokens, ref_starts, ref_ends], -1).long()
...     torch.save(ref, data_dir + '/ref/{:02d}.pt'.format(utt_idx))
Accessing individual elements in a spectral data directory
>>> data = SpectDataSet('data')
>>> data[0]  # random access feat, ali, ref
>>> for feat, ali, ref in data:  # iterator
...     pass
Writing evaluation data back to the directory
>>> from pydrobert.torch.config import INDEX_PAD_VALUE
>>> data = SpectDataSet('data')
>>> num_ali_classes, num_ref_classes, min_ref, max_ref = 100, 2000, 3, 10
>>> num_frames = data[3][0].shape[0]
>>> # pdfs (or more accurately, pms) are likelihoods of classes over data
>>> # per frame, used in hybrid models. Usually logits
>>> pdf = torch.randn(num_frames, num_ali_classes)
>>> data.write_pdf(3, pdf)  # will share name with data.utt_ids[3]
>>> # both refs and hyps are sequences of tokens, such as words or phones,
>>> # with optional frame alignments
>>> num_tokens = torch.randint(min_ref, max_ref, (1,)).long().item()
>>> hyp = torch.full((num_tokens, 3), INDEX_PAD_VALUE).long()
>>> hyp[..., 0] = torch.randint(num_ref_classes, (num_tokens,))
>>> data.write_hyp('special', hyp)  # custom name
- pydrobert.torch.data.context_window_seq_to_batch(seq, has_uttids=False)[source]
Convert a sequence of context window elements to a batch
This function is used to collate sequences of elements from a ContextWindowDataSet into batches.
Assume seq is a finite length sequence of pairs of window, ali, where window is of size (T, C, F), where T is some number of windows (which can vary across elements in the sequence), C is the window size, and F is some number of filters, and ali is of size (T,). This method batches all the elements of the sequence into a pair of windows, alis, where windows and alis will have shapes (N, C, F) and (N,) resp., where \(N = \sum T\) is the total number of context windows over the utterances.
If ali is None in any element, alis will also be None.
- Parameters:
- Returns:
batch – A tuple containing the following elements:
windows, a tensor of shape (sum_n T_n, C, F) containing the concatenated set of windows [window_1, window_2, ..., window_N].
alis, either None or a tensor of shape (sum_n T_n,) containing the concatenated alignment ids [ali_1, ali_2, ..., ali_N].
window_sizes (if has_uttids is True), a tensor of shape (N,) containing the sequence [T_1, T_2, ..., T_N].
uttids (if has_uttids is True), an N-tuple of utterance ids.
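The concatenation can be pictured with nested lists standing in for tensors. This is a minimal sketch, not the library's collate function; context_window_batch_sketch is a hypothetical name, and it always returns the window sizes for simplicity.

```python
from typing import List, Optional


def context_window_batch_sketch(seq):
    """Concatenate per-utterance windows along the first axis."""
    windows: list = []
    alis: Optional[list] = []
    window_sizes: List[int] = []
    for window, ali in seq:
        window_sizes.append(len(window))  # T_n for this utterance
        windows.extend(window)
        if ali is None:
            alis = None  # one missing alignment makes the batch's alis None
        elif alis is not None:
            alis.extend(ali)
    return windows, alis, window_sizes


# two utterances with T_1 = 2 and T_2 = 3 windows, C = 1, F = 2
utt_1 = ([[[0.0, 0.1]], [[1.0, 1.1]]], [7, 8])
utt_2 = ([[[2.0, 2.1]], [[3.0, 3.1]], [[4.0, 4.1]]], [9, 9, 5])
windows, alis, window_sizes = context_window_batch_sketch([utt_1, utt_2])
_, no_alis, _ = context_window_batch_sketch([utt_1, (utt_2[0], None)])
```

The first axis of the result has length T_1 + T_2, and window_sizes records how to split it back into utterances.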
- pydrobert.torch.data.extract_window(feat, frame_idx, left, right, reverse=False)[source]
Slice the feature matrix to extract a context window
- Parameters:
feat (Tensor) – Of shape (T, F), where T is the time/frame axis, and F is the frequency axis.
frame_idx (int) – The “center frame” 0 <= frame_idx < T.
left (int) – The number of frames in the window to the left of (before) the center frame. Any frames below zero are edge-padded.
right (int) – The number of frames in the window to the right of (after) the center frame. Any frames above T are edge-padded.
reverse (bool) – If True, flip the window along the time/frame axis.
- Returns:
window (torch.Tensor) – Of shape (1 + left + right, F)
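As an illustration of the slicing, the sketch below uses nested lists in place of a tensor. It assumes that "edge-padded" means repeating the first or last frame, as in the usual edge padding mode; the name extract_window_sketch is hypothetical and this is not the library's code.

```python
from typing import List, Sequence


def extract_window_sketch(
    feat: Sequence[Sequence[float]],
    frame_idx: int,
    left: int,
    right: int,
    reverse: bool = False,
) -> List[List[float]]:
    """Context window around frame_idx with edge padding."""
    T = len(feat)
    if not 0 <= frame_idx < T:
        raise IndexError("frame_idx must satisfy 0 <= frame_idx < T")
    window = []
    for t in range(frame_idx - left, frame_idx + right + 1):
        clamped = min(max(t, 0), T - 1)  # repeat the first/last frame at the edges
        window.append(list(feat[clamped]))
    if reverse:
        window.reverse()  # flip along the time/frame axis
    return window


feat = [[0.0, 0.5], [1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]  # T = 4, F = 2
start = extract_window_sketch(feat, 0, left=2, right=1)
end = extract_window_sketch(feat, 3, left=1, right=2, reverse=True)
```

Every window has 1 + left + right rows regardless of how close the center frame is to either end of the utterance.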
- pydrobert.torch.data.lang_seq_to_batch(seq, batch_first=True, sort=True, has_uttids=False)[source]
Convert a sequence of reference sequences to a batch
This function is used to collate sequences of elements from a LangDataSet into batches.
- Parameters:
seq (Sequence[Union[Tensor, Tuple[Tensor, str]]]) – A finite-length (N) sequence of either just ref_n or tuples ref_n, utt_n, where
ref_n is a tensor of size (R_n[, 3]) representing reference token sequences and optionally their frame shifts. Either all ref_n must contain the frame shift info (the 3 dimension) or none of them.
utt_n (if has_uttids is True) is the utterance id.
batch_first (bool) – If True, the batch dimension N comes before the sequence dimension R in refs.
sort (bool) – If True, the elements of seq are ordered in descending order of R_n before being batched.
has_uttids (bool) – Whether utt_n is part of the input values and uttids is part of the output values.
- Returns:
batch (tuple) – A tuple of refs, ref_sizes[, uttids], where: refs is a tensor of shape (max_n R_n, N[, 3]) containing the right-padded sequences [ref_1, ref_2, ..., ref_N] and padded with pydrobert.torch.config.INDEX_PAD_VALUE; ref_sizes is a tensor of shape (N,) containing the sequence [R_1, R_2, ..., R_N]; and uttids (if has_uttids is True), is an N-tuple of strings matching the utterance ids.
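Right-padding and sorting can be sketched with lists of token ids. The constant PAD below is an assumed stand-in for pydrobert.torch.config.INDEX_PAD_VALUE (its actual value is not stated in this section), and pad_refs is a hypothetical helper, not the library function.

```python
from typing import List, Sequence, Tuple

PAD = -100  # assumed stand-in for pydrobert.torch.config.INDEX_PAD_VALUE


def pad_refs(
    refs: Sequence[Sequence[int]], batch_first: bool = True, sort: bool = True
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad token sequences to a common length."""
    if sort:
        refs = sorted(refs, key=len, reverse=True)  # descending order of R_n
    sizes = [len(r) for r in refs]
    max_len = max(sizes, default=0)
    padded = [list(r) + [PAD] * (max_len - len(r)) for r in refs]
    if not batch_first:
        # transpose (N, max R) into (max R, N)
        padded = [list(col) for col in zip(*padded)]
    return padded, sizes


batch, ref_sizes = pad_refs([[1, 2], [3, 4, 5], [6]])
time_first, _ = pad_refs([[1, 2], [3, 4, 5], [6]], batch_first=False)
```

Keeping ref_sizes alongside the padded batch lets downstream code mask out the padding.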
- pydrobert.torch.data.parse_arpa_lm(file_, token2id=None, to_base_e=None, ftype=<class 'float'>, logger=None)[source]
Parse an ARPA statistical language model
An ARPA language model is an n-gram model with back-off probabilities. It is formatted as:
\data\
ngram 1=<count>
ngram 2=<count>
...
ngram <N>=<count>

\1-grams:
<logp> <token[t]> <logb>
<logp> <token[t]> <logb>
...

\2-grams:
<logp> <token[t-1]> <token[t]> <logb>
...

\<N>-grams:
<logp> <token[t-<N>+1]> ... <token[t]>
...

\end\
- Parameters:
file_ – Either the path or a file pointer to the file.
token2id (Optional[Dict[str, signedinteger]]) – A dictionary whose keys are token strings and values are ids. If set, tokens will be replaced with ids on read.
to_base_e (Optional[bool]) – ARPA files store log-probabilities and log-backoffs in base-10. This
ftype (Type[TypeVar(F, bound=Union[float, floating])]) – The floating-point type to store log-probabilities and backoffs as.
logger (Optional[Logger]) – If specified, progress will be written to this logger at INFO level.
- Returns:
prob_dicts (list) – A list of the same length as there are orders of n-grams in the file (e.g. if the file contains up to tri-gram probabilities then prob_dicts will be of length 3). Each element is a dictionary whose key is the word sequence (earliest word first). For 1-grams, this is just the word. For n > 1, this is a tuple of words. Values are either a tuple of logp, logb of the log-probability and backoff log-probability, or, in the case of the highest-order n-grams that don’t need a backoff, just the log probability.
Warning
Version 0.3.0 and prior do not have the option to_base_e, always returning values in log base 10. While this remains the default, it is deprecated and will be removed in a later version.
This function is not safe for JIT scripting or tracing.
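To make the shape of prob_dicts concrete, here is a minimal parser sketch for the format shown above. It is not the library's implementation: parse_arpa_sketch is a hypothetical name, it ignores the declared counts, and it stores any entry without a trailing backoff as a bare log-probability, which is a simplification. Multiplying by ln(10) is how this sketch converts base-10 values to base e.

```python
import math
from typing import Dict, Iterable, List


def parse_arpa_sketch(lines: Iterable[str], to_base_e: bool = False) -> List[Dict]:
    """Parse ARPA-formatted lines into one dictionary per n-gram order."""
    scale = math.log(10.0) if to_base_e else 1.0
    prob_dicts: List[Dict] = []
    order = 0  # set when an "\n-grams:" header is seen
    for line in lines:
        line = line.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue  # header lines carry only counts
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:-len("-grams:")])
            prob_dicts.append({})
            continue
        fields = line.split()
        logp = float(fields[0]) * scale
        tokens = tuple(fields[1:1 + order])
        key = tokens[0] if order == 1 else tokens
        if len(fields) > 1 + order:  # a trailing backoff weight is present
            prob_dicts[-1][key] = (logp, float(fields[1 + order]) * scale)
        else:
            prob_dicts[-1][key] = logp
    return prob_dicts


example = r"""\data\
ngram 1=2
ngram 2=1

\1-grams:
-1.0 a -0.5
-2.0 b -0.25

\2-grams:
-0.3 a b

\end\
"""
prob_dicts = parse_arpa_sketch(example.splitlines())
prob_dicts_e = parse_arpa_sketch(example.splitlines(), to_base_e=True)
```

Unigram keys are plain strings while higher-order keys are tuples with the earliest word first, matching the description of the return value.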
- pydrobert.torch.data.read_ctm(ctm, wc2utt=None)[source]
Read a NIST sclite “ctm” file into a list of transcriptions
sclite is a commonly used scoring tool for ASR.
This function converts a time-marked conversation file (“ctm” format) into a list of transcripts. Each element is a tuple of utt_id, transcript, where transcript is itself a list of triples token, start, end, token being a string, start being the start time of the token (in seconds), and end being the end time of the token (in seconds).
- Parameters:
ctm (Union[TextIO, str]) – The time-marked conversation file pointer. Will open if ctm is a path.
wc2utt (Optional[dict]) – “ctm” files identify utterances by waveform file name and channel. If specified, wc2utt maps keys wfn, chan (e.g. '940328', 'A') to unique utterance IDs. If wc2utt is unspecified, the waveform file names are treated as the utterance IDs, and the channel is ignored.
- Returns:
transcripts (list) – Each element is a tuple of utt_id, transcript. utt_id is a string identifying the utterance. transcript is a list of triples token, start, end, token being the token (a string), start being a float of the start time of the token (in seconds), and end being the end time of the token.
Notes
“ctm”, like “trn”, has “support” for alternate transcriptions. It is, as of sclite version 2.10, very buggy. For example, it cannot handle multiple alternates in the same utterance. Plus, tools like Kaldi use the Unix command that the sclite documentation recommends to sort a ctm, sort +0 -1 +1 -2 +2nb -3, which does not maintain proper ordering for alternate delimiters. Thus, read_ctm() will error if it comes across those delimiters.
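For orientation, a "ctm" line holds the waveform name, channel, start time, duration, and token, optionally followed by a confidence. The sketch below groups such lines into the documented return structure. It is a simplified illustration with a hypothetical name (read_ctm_sketch): it has no handling for alternate delimiters, and keeping utterances in order of first appearance is an assumption of the sketch.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def read_ctm_sketch(
    lines: Iterable[str], wc2utt: Optional[Dict[Tuple[str, str], str]] = None
) -> List[Tuple[str, List[Tuple[str, float, float]]]]:
    """Group ctm lines into (utt_id, [(token, start, end), ...]) pairs."""
    transcripts: Dict[str, List[Tuple[str, float, float]]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):  # ";;" marks a comment line
            continue
        wfn, chan, start, dur, token = line.split()[:5]
        utt_id = wfn if wc2utt is None else wc2utt[(wfn, chan)]
        start_s = float(start)
        # ctm stores a duration; the end time is start + duration
        transcripts.setdefault(utt_id, []).append(
            (token, start_s, start_s + float(dur))
        )
    return [
        (utt_id, sorted(trans, key=lambda triple: triple[1]))
        for utt_id, trans in transcripts.items()
    ]


ctm_lines = [
    ";; a comment",
    "940328 A 0.5 0.25 world",
    "940328 A 0.0 0.5 hello",
    "940328 B 1.0 0.5 bye",
]
by_wfn = read_ctm_sketch(ctm_lines)
by_chan = read_ctm_sketch(ctm_lines, {("940328", "A"): "utt1", ("940328", "B"): "utt2"})
```

Without wc2utt, both channels collapse into a single utterance keyed by the waveform name, which mirrors the documented fallback.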
- pydrobert.torch.data.read_textgrid(tg, tier_id=0, fill_token=None)[source]
Read TextGrid file as a transcription
TextGrid is the transcription format of Praat.
- Parameters:
tg (Union[TextIO, str]) – The TextGrid file. Will open if tg is a path.
tier_id (Union[str, int]) – Either the name of the tier (first occurrence) or the index of the tier to extract.
fill_token (Optional[str]) – If set, any intervals missing from the tier will be filled with an interval of this token before being returned.
- Returns:
transcript (list) – A list of triples of token, start, end, token being the token (a string), start being a float of the start time of the token (in seconds), and end being the end time of the token. If the tier is a PointTier, the start and end times will be the same.
start_time (float) – The start time of the tier (in seconds)
end_time (float) – The end time of the tier (in seconds)
Notes
This function does not check for whitespace in or around token labels. This may cause issues if writing as another file type, like write_trn().
Start and end times (including any filled intervals) are determined from the tier’s values, not necessarily those of the top-level container. This is most likely a technicality, however: they should not differ normally.
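The effect of fill_token can be pictured on an already-parsed tier. The helper below is hypothetical and only illustrates one plausible reading of "filling missing intervals"; it does not parse TextGrid files and is not taken from the library.

```python
from typing import List, Tuple

Interval = Tuple[str, float, float]


def fill_intervals(
    transcript: List[Interval], start_time: float, end_time: float, fill_token: str
) -> List[Interval]:
    """Insert fill_token intervals wherever the tier has a gap."""
    filled: List[Interval] = []
    cursor = start_time
    for token, start, end in sorted(transcript, key=lambda triple: triple[1]):
        if start > cursor:
            filled.append((fill_token, cursor, start))  # gap before this interval
        filled.append((token, start, end))
        cursor = max(cursor, end)
    if cursor < end_time:
        filled.append((fill_token, cursor, end_time))  # gap at the end of the tier
    return filled


tier = [("a", 0.5, 1.0), ("b", 1.5, 2.0)]
filled = fill_intervals(tier, 0.0, 2.5, "<sil>")
```

After filling, the intervals tile the tier from its start time to its end time without gaps.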
- pydrobert.torch.data.read_trn(trn, warn=True, processes=0, chunk_size=1000)[source]
Read a NIST sclite transcript file into a list of transcripts
sclite is a commonly used scoring tool for ASR.
This function converts a transcript input file (“trn” format) into a list of transcripts, where each element is a tuple of utt_id, transcript. transcript is a list split by spaces.
- Parameters:
trn (Union[TextIO, str]) – The transcript input file. Will open if trn is a path.
warn (bool) – The “trn” format uses curly braces and forward slashes to indicate transcript alterations. This is largely for scoring purposes, such as swapping between filled pauses, not for training. If warn is True, a warning will be issued via the warnings module every time an alteration appears in the “trn” file. Alterations appear in transcripts as elements of ([[alt_1_word_1, alt_1_word_2, ...], [alt_2_word_1, alt_2_word_2, ...], ...], -1, -1) so that transcript_to_token() will not attempt to process alterations as token start and end times.
processes (int) – The number of processes used to parse the lines of the trn file. If 0, parsing will be performed on the main thread. Otherwise, the file will be read on the main thread and parsed using processes many processes.
chunk_size (int) – The number of lines to be processed by a worker process at a time. Applicable when processes > 0.
transcripts (list) – A list of pairs utt_id, transcript where utt_id is a string identifying the utterance and transcript is a list of tokens in the utterance’s transcript.
Notes
Any null words (@) in the “trn” file are encoded verbatim.
- pydrobert.torch.data.read_trn_iter(trn, warn=True, processes=0, chunk_size=1000)[source]
Read a NIST sclite transcript file, yielding individual transcripts
Identical to read_trn(), but yields individual transcript entries rather than a full list. Ideal for large transcript files.
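A "trn" line is a space-separated transcript followed by the utterance id in parentheses. The generator below sketches the list-versus-iterator relationship between read_trn() and read_trn_iter(). It is a simplified illustration with a hypothetical name (read_trn_iter_sketch) and does not handle the curly-brace alterations described above.

```python
from typing import Iterable, Iterator, List, Tuple


def read_trn_iter_sketch(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (utt_id, tokens) one line at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        text, _, utt = line.rpartition("(")  # the id is the last parenthesized group
        if not utt.endswith(")"):
            raise ValueError(f"no utterance id in line: {line!r}")
        yield utt[:-1], text.split()


trn_lines = ["hello world (utt1)", "@ (utt2)", ""]
as_list = list(read_trn_iter_sketch(trn_lines))  # the full-list variant
first = next(read_trn_iter_sketch(trn_lines))    # lazily read one entry
```

Iterating lazily avoids holding every transcript in memory at once, which is why the iterator form suits large files.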
- pydrobert.torch.data.spect_seq_to_batch(seq, batch_first=True, sort=True, has_alis=True, has_uttids=False)[source]
Convert a sequence of spectral data to a batch
This function is used to collate sequences of elements from a SpectDataSet into batches.
- Parameters:
seq (Sequence[Tuple[Union[Tensor, str, None], ...]]) – A finite-length (N) sequence of tuples, each tuple corresponding to an utterance and containing, in order:
feat_n, a tensor of size (T_n, F) representing per-frame spectral features.
ali_n (if has_alis is True), either None or a tensor of shape (T_n) representing per-frame alignment ids.
ref_n, either None or a tensor of size (R_n[, 3]) representing reference token sequences and optionally their frame shifts. Either all ref_n must contain the frame shift info (the 3 dimension) or none of them.
utt_n (if has_uttids is True), the utterance id.
batch_first (bool) – If True, the batch dimension N comes before the sequence dimension T or R in the return values.
sort (bool) – If True, the tuples in seq are first sorted in descending order of T_n before being batched.
has_alis (bool) – Whether ali_n is part of the input values and alis is part of the output values. Note that has_alis should still be True if ali_n is present in seq but is None.
has_uttids (bool) – Whether utt_n is part of the input values and uttids is part of the output values.
- Returns:
batch – A tuple containing the following elements:
feats, a tensor of shape (max_n T_n, N, F) containing the right-padded sequences [feat_1, feat_2, ..., feat_N]. Padded with zeros.
alis (if has_alis is True), either None or a tensor of shape (max_n T_n, N) containing the right-padded sequence [ali_1, ali_2, ..., ali_N]. Padded with pydrobert.torch.config.INDEX_PAD_VALUE.
refs, either None or a tensor of shape (max_n R_n, N[, 3]) containing the right-padded sequences [ref_1, ref_2, ..., ref_N]. Padded with pydrobert.torch.config.INDEX_PAD_VALUE.
feat_sizes, a tensor of shape (N,) containing the sequence [T_1, T_2, ..., T_N].
ref_sizes, a tensor of shape (N,) containing the sequence [R_1, R_2, ..., R_N].
uttids (if has_uttids is True), an N-tuple of the utterance ids.
- pydrobert.torch.data.token_to_transcript(ref, id2token=None, frame_shift_ms=None)[source]
Convert a token sequence to a transcript
The inverse operation of transcript_to_token().
- Parameters:
- Returns:
transcript
Warning
The time interval inferred using frame_shift_ms is unlikely to be perfectly correct. See the note in transcript_to_token() for more details about the ambiguity in converting between seconds and frames.
- pydrobert.torch.data.transcript_to_token(transcript, token2id=None, frame_shift_ms=None, unk=None, skip_frame_times=False)[source]
Convert a transcript to a token sequence
This function converts transcript of length R to a long tensor tok of shape (R, 3), the latter suitable as a reference or hypothesis token sequence for an utterance of SpectDataSet. An element of transcript can either be a token or a 3-tuple of (token, start, end). If token2id is not None, the token id is determined by checking token2id[token]. If the token does not exist in token2id and unk is not None, the token will be replaced with unk. If unk is None, token will be used directly as the id. If token2id is not specified, token will be used directly as the identifier. If frame_shift_ms is specified, start and end are taken as the start and end times, in seconds, of the token, and will be converted to frames for tok. If frame_shift_ms is unspecified, start and end are assumed to already be frame times. If start and end were unspecified, values of -1, representing unknown, will be inserted into tok[r, 1:].
- Parameters:
transcript (Sequence[Union[str, Tuple[str, float, float]]]) –
unk (Union[str, int, None]) – The out-of-vocabulary token, if specified. If unk exists in token2id, token2id[unk] will be used as the out-of-vocabulary identifier. If token2id[unk] does not exist, unk will be assumed to be the identifier already. If token2id is None, unk has no effect.
skip_frame_times (bool) – If True, tok will be of shape (R,) and contain only the token ids. Suitable for BitextDataSet.
- Returns:
tok (torch.Tensor)
Warning
The frame index bounds inferred using frame_shift_ms should not be used directly in evaluation. See the note below.
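The id lookup rules for token2id and unk can be summarized in a short sketch. This is one reading of the description above, written in plain Python; the helper name lookup_id is hypothetical and the library's own code may differ in details such as error handling.

```python
def lookup_id(token, token2id=None, unk=None):
    """Resolve a token to an id following the documented fallback order."""
    if token2id is None:
        return token  # no map: the token is used directly; unk has no effect
    if token in token2id:
        return token2id[token]
    if unk is None:
        return token  # out-of-vocabulary and no unk: use the token directly
    # Out-of-vocabulary: map unk through token2id if possible,
    # otherwise assume unk is already an identifier.
    return token2id.get(unk, unk)


vocab = {"a": 0, "<unk>": 1}
print(lookup_id("a", vocab, "<unk>"))
print(lookup_id("z", vocab, "<unk>"))
```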
Notes
If you are dealing with raw audio, each “frame” is just a sample. The appropriate value for frame_shift_ms is 1000 / sample_rate_hz (since there are sample_rate_hz / 1000 samples per millisecond).
Converting to frame indices from start and end times follows an overly simplistic equation. Let \((s_s, e_s)\) be the start and end times in seconds, \((s_f, e_f)\) be the corresponding start and end frames, \(\Delta\) be the frame shift in milliseconds, and \(I[\cdot]\) be the indicator function. Then
\[\begin{split}s_f = floor\left(\frac{1000s_s}{\Delta}\right) \\ e_f = \max\left(s_s + I[s_s = e_s], round\left(\frac{1000e_s}{\Delta}\right)\right)\end{split}\]
For a given token index r, tok[r, 1] = s_f and tok[r, 2] = e_f.
tok[r, 1] is supposed to be the inclusive start frame of the segment and tok[r, 2] the exclusive end frame. \((s_f, e_f)\) fail to be these on two accounts. First, they do not consider the frame length: while frames may be spaced \(\Delta\) milliseconds apart, they will usually be overlapping. Because of this overlap, the coefficients of frames \(s_f - 1\) and \(e_f\) may be in part dependent on the audio samples within the segment. Second, ignoring frame length, \(e_f = ceil(1000e_s/\Delta)\) would be more appropriate for an exclusive upper bound. However, in pydrobert.speech.compute (and other mainstream feature computation packages), the total number of frames in the utterance is calculated as \(T_f = ceil(1000T_s/\Delta)\), where \(T_s\) is the length of the utterance in seconds. The above equation ensures \(\max(e_f) \leq T_f\), which is a necessary criterion for a valid SpectDataSet (see validate_spect_data_set()).
Accounting for both of these assumptions would involve computing the support of each existing frame in seconds and intersecting that with the provided interval in seconds. As such, the derived frame bounds should not be used for an official evaluation. This function should suffice for most training objectives, however.
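The floor, round, and ceil pieces of the conversion can be tried out with a small sketch. It deliberately leaves out the max(...) guard in the expression for e_f and only shows why a rounded end frame never exceeds the ceil-based frame count. All function names here are hypothetical and not part of pydrobert.torch.

```python
import math


def raw_audio_frame_shift_ms(sample_rate_hz):
    """Frame shift in milliseconds when every audio sample is a frame."""
    return 1000 / sample_rate_hz


def start_frame(start_s, frame_shift_ms):
    """Start frame: floor of the start time expressed in frames."""
    return math.floor(1000 * start_s / frame_shift_ms)


def rounded_end_frame(end_s, frame_shift_ms):
    """Rounded end time in frames (the max(...) guard is omitted here)."""
    return round(1000 * end_s / frame_shift_ms)


def total_frames(length_s, frame_shift_ms):
    """Number of frames in an utterance of the given length in seconds."""
    return math.ceil(1000 * length_s / frame_shift_ms)


shift = 10  # milliseconds
print(start_frame(0.3125, shift), rounded_end_frame(0.3125, shift))
print(total_frames(0.3125, shift))
print(raw_audio_frame_shift_ms(16000))
```

Because rounding never exceeds the ceiling of the same value, a token ending at or before the end of the utterance keeps its rounded end frame within the total frame count.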
- pydrobert.torch.data.validate_spect_data_set(data_set, fix=None)[source]
Validate SpectDataSet data directory
The data directory is valid if the following conditions are observed:
1. All tensors are on the CPU.
2. All features are tensor instances of the same dtype.
3. All features have two dimensions.
4. All features have the same size second dimension.
5. If alignments are present:
   1. All alignments are long tensor instances.
   2. All alignments have one dimension.
   3. Features and alignments have the same size first axes for a given utterance id (same number of frames).
6. If reference sequences are present:
   1. All references are long tensor instances.
   2. All references have the same number of dimensions: either 1 or 2.
   3. If 2-dimensional:
      1. The second dimension has length 3.
      2. For the start and end points of a reference token, r[i, 1:], either both of them are negative (indicating no alignment), or 0 <= r[i, 1] <= r[i, 2] <= T, where T is the number of frames in the utterance. We do not enforce that tokens be non-overlapping.
Raises a ValueError if a condition is violated.
If fix is not None, the following changes to the data will be permitted instead of raising an error. Any of these changes will be warned of using warnings and then written back to disk.
Any CUDA tensors will be converted into CPU tensors.
A reference or alignment of bytes or 32-bit integers can be upcast to long tensors.
A reference token with only a start or end bound (but not both) will have the existing one removed.
A reference token with an exclusive boundary exceeding the number of frames by at most fix will be decreased by that amount. This is only possible if the exclusive end remains above or at the inclusive start.
Alignments exceeding the total number of frames by at most fix will be cropped to that amount.
Notes
The behaviour of condition 6.3.2. has changed slightly since version 0.3.0. We now allow for empty reference token segments (i.e. r[i, 1] can equal r[i, 2]).
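The bounds condition on reference token segments can be expressed as a small check. This is a minimal sketch over plain Python triples rather than tensors; the helper name ref_bounds_ok is hypothetical and it does not reproduce the fix behaviour.

```python
def ref_bounds_ok(ref, num_frames):
    """Check start/end bounds of (token_id, start, end) rows.

    A row passes if both bounds are negative (no alignment) or if
    0 <= start <= end <= num_frames. Empty segments (start == end) pass.
    """
    for _, start, end in ref:
        if start < 0 and end < 0:
            continue  # no alignment information for this token
        if not (0 <= start <= end <= num_frames):
            return False
    return True


print(ref_bounds_ok([(5, -1, -1), (6, 0, 3), (7, 3, 3)], num_frames=3))
print(ref_bounds_ok([(5, 0, 4)], num_frames=3))
```

A row with only one negative bound fails the check, which corresponds to the case where only a start or an end is present.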
- pydrobert.torch.data.write_textgrid(transcript, tg, start_time=None, end_time=None, tier_name='transcript', point_tier=None, precision=3)[source]
Write a transcription as a TextGrid file
TextGrid is the transcription format of Praat.
This function saves transcript as a tier within a TextGrid file.
- Parameters:
transcript (Sequence[Tuple[str, float, float]]) – The transcription to write. Contains triples tok, start, end, where tok is the token, start is its start time, and end is its end time. transcript must be non-empty.
tg (Union[TextIO, str]) – The file to write. Will open if tg is a path.
start_time (Optional[float]) – The start time of the recording (in seconds). If not specified, it will be inferred from the minimum start time of the intervals in transcript.
end_time (Optional[float]) – The end time of the recording (in seconds). If not specified, it will be inferred from the maximum end time of the intervals in transcript.
tier_name (str) – What name to save the tier with.
point_tier (Optional[bool]) – Whether to save as a point tier (True) or an interval tier. If unset, the value is inferred to be a point tier if all segments are length 0 (within precision precision); an interval tier otherwise.
precision (int) – The precision of floating-point values to save times with.
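How the unset arguments might be inferred from transcript can be sketched as follows. The helper name infer_textgrid_defaults is hypothetical, and reading "length 0 within precision" as "rounds to zero at that many decimal places" is an assumption about the library's behaviour.

```python
def infer_textgrid_defaults(
    transcript, start_time=None, end_time=None, point_tier=None, precision=3
):
    """Fill in unset values from a non-empty list of (tok, start, end)."""
    if start_time is None:
        start_time = min(start for _, start, _ in transcript)
    if end_time is None:
        end_time = max(end for _, _, end in transcript)
    if point_tier is None:
        # Assumption: a segment has length 0 if its duration rounds to zero.
        point_tier = all(
            round(end - start, precision) == 0 for _, start, end in transcript
        )
    return start_time, end_time, point_tier


print(infer_textgrid_defaults([("a", 0.0, 0.5), ("b", 0.5, 1.5)]))
print(infer_textgrid_defaults([("a", 0.25, 0.25), ("b", 1.0, 1.0004)]))
```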
- pydrobert.torch.data.write_trn(transcripts, trn)[source]
From an iterable of transcripts, write to a NIST “trn” file
This is largely the inverse operation of read_trn(). In general, elements of a transcript (transcripts contains pairs of utt_id, transcript) could be tokens or tuples of x, start, end (providing the start and end times of tokens, respectively). However, start and end are ignored when writing “trn” files. x could be the token or a list of alternates, as described in read_trn().
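Dropping the start and end times when writing a line can be sketched as below. The sketch assumes the common trn layout of space-separated tokens followed by the utterance id in parentheses, and it does not handle lists of alternates. The helper name format_trn_line is hypothetical and is not the library's implementation.

```python
def format_trn_line(utt_id, transcript):
    """Format one utterance as a trn-style line, ignoring token times."""
    tokens = []
    for elem in transcript:
        if isinstance(elem, tuple):
            elem = elem[0]  # keep the token, drop start and end
        tokens.append(str(elem))  # alternates are not handled in this sketch
    return " ".join(tokens) + " (" + utt_id + ")"


print(format_trn_line("utt1", ["a", ("b", 0.1, 0.2)]))
```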