data

Classes and functions related to storing/retrieving speech data

class pydrobert.torch.data.AbstractEpochSampler(data_source, init_epoch=0, on_uneven_distributed='raise')[source]

ABC for sampling based on epoch

epoch
get_samples_for_epoch(epoch)[source]

Get all samples for the provided epoch

abstract get_samples_for_epoch_ignoring_distributed(epoch)[source]

Get all samples for the provided epoch, ignoring the distributed environment

Ignores the distributed environment. All replicas should return the same value.

class pydrobert.torch.data.BucketBatchSampler(sampler, idx2bucket, bucket2size, drop_incomplete=False)[source]

Batch samples into buckets, yielding as soon as the bucket is full

Parameters:
  • sampler (Collection[int]) – Determines the order in which samples are put into buckets.

  • idx2bucket (Dict[int, TypeVar(H, bound= Hashable)]) – A map specifying which bucket each sample belongs to. The keys are the indices yielded by sampler; the values are the ids of the corresponding buckets.

  • bucket2size (Dict[TypeVar(H, bound= Hashable), int]) – A map from the bucket ids (the values in idx2bucket) to the corresponding batch size. Values must be positive.

  • drop_incomplete (bool) – If True, any batches which are incomplete (smaller than the bucket’s batch size) at the end of an epoch will be discarded. Otherwise, the incomplete batches will be yielded in the order of their bucket ids’ hashes.

Yields:

batch (list of int) – A list of indices from sampler all belonging to the same bucket. The batch is yielded as soon as it is full (or the epoch has ended with drop_incomplete set to False).

Warning

BucketBatchSampler has no __len__() method. Correctly determining the length of the batched sampler requires knowing which indices of sampler are being iterated over, which can only be determined by iterating over the sampler.

Examples

>>> N = 14
>>> dataset = torch.utils.data.TensorDataset(torch.rand(N))
>>> ssampler = torch.utils.data.SequentialSampler(dataset)
>>> idx2bucket = dict((n, int(n % 3 == 0)) for n in range(N))
>>> bucket2size = {0: 2, 1: 2}
>>> bsampler = BucketBatchSampler(ssampler, idx2bucket, bucket2size, True)
>>> print(list(bsampler))
[[1, 2], [0, 3], [4, 5], [7, 8], [6, 9], [10, 11]]
>>> bsampler = BucketBatchSampler(ssampler, idx2bucket, bucket2size, False)
>>> print(list(bsampler))
[[1, 2], [0, 3], [4, 5], [7, 8], [6, 9], [10, 11], [13], [12]]
class pydrobert.torch.data.ContextWindowDataLoader(data, params, data_params=None, shuffle=True, init_epoch=0, seed=None, **kwargs)[source]

DataLoader for ContextWindowDataSet

Parameters:
  • data (Union[str, ContextWindowDataSet]) – Either a ContextWindowDataSet or a path to the data directory.

  • params (Union[ContextWindowDataLoaderParams, DataLoaderParams]) – Contains at least the parameters specific to the loader. May also contain data set params — see data_params.

  • data_params (Optional[ContextWindowDataParams]) – Data set parameters. Relevant only when data is a path. Used to initialize the underlying ContextWindowDataset. If None, params is assumed to also contain the data set parameters.

  • shuffle (bool) – Whether utterances are shuffled at every epoch or presented sequentially.

  • sort_batch – Whether utterances in a batch are sorted by feature length.

  • init_epoch (int) – The epoch to resume from. When combined with a fixed seed, ensures the same batches are always delivered for a given epoch.

  • seed (Optional[int]) – The initial seed used for shuffling data. If unset, a random one will be generated.

  • **kwargs – Additional keyword arguments to initialize ContextWindowDataSet and torch.utils.data.DataLoader. The former is only relevant when data is a path.

Yields:

batch – A tuple windows, alis[, window_sizes, uttids], with window_sizes and uttids included if suppress_uttids is False. See context_window_seq_to_batch() for more information on the elements.

Warning

This class does not currently support torch.distributed. Each process will return the same batches.
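
Examples

A minimal sketch, assuming a data directory 'data' set up as in the SpectDataSet examples below and that the remaining parameters can keep their default values:

>>> params = ContextWindowDataLoaderParams(
...     context_left=2, context_right=2, batch_size=16)
>>> loader = ContextWindowDataLoader('data', params, seed=0)
>>> for windows, alis in loader:
...     pass  # windows: (N, 1 + 2 + 2, F); alis: (N,)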

property epoch

the current epoch

Type:

int

class pydrobert.torch.data.ContextWindowDataLoaderParams(*, context_left, context_right, reverse, delta_order, do_mvn, eos, sos, subset_ids, batch_size, drop_last, name)[source]

Parameters for a ContextWindowDataLoader

This implements the pydrobert.param.optuna.TunableParameterized interface.

class pydrobert.torch.data.ContextWindowDataParams(*, context_left, context_right, reverse, delta_order, do_mvn, eos, sos, subset_ids, name)[source]

Parameters for ContextWindowDataSet

This implements the pydrobert.param.optuna.TunableParameterized interface

classmethod get_tunable()[source]

Returns a set of tunable parameters

classmethod suggest_params(trial, base=None, only=None, prefix='')[source]

Populate a parameterized instance with values from trial

class pydrobert.torch.data.ContextWindowDataSet(data_dir, left=None, right=None, file_prefix='', file_suffix='.pt', warn_on_missing=True, subset_ids=None, feat_subdir='feat', ali_subdir='ali', reverse=None, params=None, feat_mean=None, feat_std=None, suppress_uttids=True)[source]

SpectDataSet, extracting fixed-width windows over the utterance

Like a SpectDataSet, but replaces the feat tensor with window, which runs a sliding window over the frame dimension of feat.

Parameters:
Yields:

tup – For a given utterance, a tuple:

  1. window, windowed spectral features of shape (T, 1 + left + right, F), where the T axis indexes the so-called center frame and the 1 + left + right axis contains frame vectors (size F) including the center frame and those to its left and right.

  2. ali, window-level alignments, or None if not available.

  3. uttid (if suppress_uttids is False), the string representing the utterance id.

Examples

>>> # see 'SpectDataSet' to set up data directory
>>> data = ContextWindowDataSet('data')
>>> data[0]  # random access returns (window, ali) pairs
>>> for window, ali in data:
>>>     pass  # so does the iterator
>>> data.get_utterance_tuple(3)  # gets the original (feat, ali) pair
class pydrobert.torch.data.DataLoaderParams(*, batch_size, drop_last, name)[source]

General parameters for a DataSet from pydrobert.torch.data

This implements the pydrobert.param.optuna.TunableParameterized interface.

classmethod get_tunable()[source]

Returns a set of tunable parameters

classmethod suggest_params(trial, base=None, only=None, prefix='')[source]

Populate a parameterized instance with values from trial

class pydrobert.torch.data.DynamicLengthDataLoaderParams(*, num_length_buckets, size_batch_by_length, batch_size, drop_last, name)[source]

Parameters for a data loader whose elements have dynamic lengths

class pydrobert.torch.data.EpochRandomSampler(data_source, init_epoch=0, base_seed=None, on_uneven_distributed='raise')[source]

A deterministic RandomSampler which handles torch.distributed

Parameters:
  • data_source (Sized) – The dataset to draw the sample from.

  • init_epoch (int) – The initial epoch.

  • base_seed (Optional[int]) – Determines the starting seed of the sampler. Sampling is seeded with (base_seed, epoch). If unset, a seed is randomly generated from the default pytorch generator.

  • on_uneven_distributed (Literal['raise', 'drop', 'uneven', 'ignore']) –

    What to do if the sampler detects that it’s in a distributed environment and the number of processes does not evenly divide the number of samples:

    • 'raise' raise a ValueError.

    • 'drop' drop the remainder. The dropped samples will be randomized each epoch.

    • 'uneven' allow some processes to yield fewer samples.

    • 'ignore' ignore the distributed context. Each process will yield all samples.

Warning

The default means of seeding the shuffler has changed since version 0.3. Previously, the shuffler was seeded on each epoch with the value base_seed + epoch. As a result, a network trained in this version will yield different results from one trained with version 0.3, even if base_seed is the same.

The change was made because, if repeated experiments were seeded sequentially, then the n-th epoch of the m-th run would see samples in the same order as the m-th epoch of the n-th run. Thus, repeated trials were unintentionally correlated.

Examples

>>> sampler = EpochRandomSampler(
...     torch.utils.data.TensorDataset(torch.arange(100)))
>>> samples_ep0 = tuple(sampler)  # random
>>> samples_ep1 = tuple(sampler)  # random, probably not same as first
>>> assert tuple(sampler.get_samples_for_epoch_ignoring_distributed(0)) == samples_ep0
>>> assert tuple(sampler.get_samples_for_epoch_ignoring_distributed(1)) == samples_ep1
base_seed
class pydrobert.torch.data.EpochSequentialSampler(data_source, init_epoch=0, on_uneven_distributed='raise')[source]

A SequentialSampler which handles torch.distributed

Yields samples in order [0, 1, 2, ...]

Parameters:
  • data_source (Sized) – The dataset to draw the sample from.

  • init_epoch (int) – The initial epoch.

  • on_uneven_distributed (Literal['raise', 'drop', 'uneven', 'ignore']) –

    What to do if the sampler detects that it’s in a distributed environment and the number of processes does not evenly divide the number of samples:

    • 'raise' raise a ValueError.

    • 'drop' drop the last few samples.

    • 'uneven' allow some processes to yield fewer samples.

    • 'ignore' ignore the distributed context. Each process will yield all samples.

    See the below note for more information.

Notes

The following note regards how the sampler handles torch.distributed.

Sequential sampling in a distributed, parallel environment is not well defined. When on_uneven_distributed is 'ignore', each process sees all data sequentially. As such, every process repeats the same work and returns the same value. Though wasteful, the results are likely correct, making this the easiest option to adapt to from a non-distributed codebase (e.g. with pydrobert.torch.training.TrainingStateController). Distributed sequential sampling may still be appropriate otherwise when ordering does not matter, such as when an evaluation metric is computed in aggregate.

When in a distributed environment and on_uneven_distributed is not 'ignore', process r of W processes will be responsible for samples [r, r + W, r + 2W, ...] (assuming shifting is False). When the total number of samples N is divisible by W, each process sees the same number of samples and all samples are yielded by exactly one process. Assuming the quantity of interest is an average over all samples, computing the average per process and then averaging that over processes should yield the same results.

When W does not divide N and on_uneven_distributed is 'uneven', all samples will be yielded by exactly one process but not all processes will yield the same number of samples. Averaging must be performed with specialized logic; see torch.distributed.algorithms.Join for one option.

Finally, when W does not divide N and on_uneven_distributed is 'drop', the last N % W samples are dropped to ensure divisibility. Each process will see the same number of samples, but the last few samples will never be yielded. While averaging will almost always yield a different result from the distributed case, it may nonetheless be close when N % W is small.
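
As an illustration of this partitioning (a plain-Python sketch of the policies described above, not the sampler's implementation), the following computes which sequential sample indices process r of W would be responsible for:

>>> N, W = 10, 3  # 10 samples, 3 processes; W does not divide N
>>> def partition(N, W, r, on_uneven):
...     if on_uneven == 'drop':
...         N -= N % W               # drop the last N % W samples
...     elif on_uneven == 'ignore':
...         return list(range(N))    # every process sees everything
...     return list(range(r, N, W))  # process r gets [r, r + W, r + 2W, ...]
>>> [partition(N, W, r, 'uneven') for r in range(W)]
[[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
>>> [partition(N, W, r, 'drop') for r in range(W)]
[[0, 3, 6], [1, 4, 7], [2, 5, 8]]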

class pydrobert.torch.data.LangDataLoader(data, params, data_params=None, shuffle=True, batch_first=True, sort_batch=False, init_epoch=0, on_uneven_distributed='raise', seed=None, **kwargs)[source]

DataLoader for a LangDataSet

Parameters:
  • data (Union[str, LangDataSet]) – Either a LangDataSet or a path to the data directory.

  • params (Union[LangDataLoaderParams, DynamicLengthDataLoaderParams]) – Contains at least the parameters specific to the loader. May also contain data set params – see data_params.

  • data_params (Optional[LangDataParams]) – Data set parameters. Relevant only when data is a path. Used to initialize the underlying LangDataSet. If None, params is assumed to also contain the data set parameters.

  • shuffle (bool) – Whether utterances are shuffled at every epoch or presented sequentially.

  • batch_first (bool) – Whether the batch dimension comes before the sequence dimension in refs.

  • sort_batch (bool) – Whether utterances in a batch are sorted by feature length.

  • init_epoch (int) – The epoch to resume from. When combined with a fixed seed, ensures the same batches are always delivered for a given epoch.

  • seed (Optional[int]) – The initial seed used for shuffling data. If unset, a random one will be generated.

  • on_uneven_distributed (Literal['raise', 'uneven', 'ignore']) –

    What to do if the sampler detects that it’s in a distributed environment and the number of processes does not evenly divide the number of samples:

    • 'raise' raise a ValueError.

    • 'uneven' allow some processes to yield fewer samples.

    • 'ignore' ignore the distributed context. Each process will yield all samples.

  • **kwargs – Additional keyword arguments to initialize LangDataSet and torch.utils.data.DataLoader. The former is only relevant when data is a path.

Yields:

batch (Union[tuple, torch.Tensor]) – A tuple refs, ref_lens[, utt_ids], with utt_ids included if suppress_uttids in the underlying LangDataSet is False. See lang_seq_to_batch() for more information on the elements.

property epoch

the current epoch

Type:

int

class pydrobert.torch.data.LangDataLoaderParams(*, eos, sos, subset_ids, num_length_buckets, size_batch_by_length, batch_size, drop_last, name)[source]

Parameters for a LangDataLoader

This implements the pydrobert.param.optuna.TunableParameterized interface.

class pydrobert.torch.data.LangDataParams(*, eos, sos, subset_ids, name)[source]

Parameters for LangDataSet

class pydrobert.torch.data.SpectDataLoader(data, params, data_params=None, shuffle=True, batch_first=True, sort_batch=False, init_epoch=0, on_uneven_distributed='raise', seed=None, **kwargs)[source]

DataLoader for a SpectDataSet

Parameters:
  • data (Union[str, SpectDataSet]) – Either a SpectDataSet or a path to the data directory.

  • params (Union[SpectDataLoaderParams, DynamicLengthDataLoaderParams]) – Contains at least the parameters specific to the loader. May also contain data set params – see data_params.

  • data_params (Optional[SpectDataParams]) – Data set parameters. Relevant only when data is a path. Used to initialize the underlying SpectDataSet. If None, params is assumed to also contain the data set parameters.

  • shuffle (bool) – Whether utterances are shuffled at every epoch or presented sequentially.

  • batch_first (bool) – Whether the batch dimension comes before the sequence dimension in feats and refs.

  • sort_batch (bool) – Whether utterances in a batch are sorted by feature length.

  • init_epoch (int) – The epoch to resume from. When combined with a fixed seed, ensures the same batches are always delivered for a given epoch.

  • seed (Optional[int]) – The initial seed used for shuffling data. If unset, a random one will be generated.

  • on_uneven_distributed (Literal['raise', 'uneven', 'ignore']) –

    What to do if the sampler detects that it’s in a distributed environment and the number of processes does not evenly divide the number of samples:

    • 'raise' raise a ValueError.

    • 'uneven' allow some processes to yield fewer samples.

    • 'ignore' ignore the distributed context. Each process will yield all samples.

  • **kwargs – Additional keyword arguments to initialize SpectDataSet and torch.utils.data.DataLoader. The former is only relevant when data is a path.

Warning

SpectDataLoader uses the default True for suppress_alis and tokens_only while the current, deprecated default used by SpectDataSet is False.

Yields:

batch – A tuple feats[, alis,] refs, feat_sizes, ref_sizes[, uttids], with alis included if suppress_alis is False and uttids included if suppress_uttids is False. See spect_seq_to_batch() for more information on the elements.
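
Examples

A minimal sketch, assuming the 'data' directory from the SpectDataSet examples below, default parameter values otherwise, and the loader defaults suppress_alis=True and tokens_only=True (so batches contain no alis and refs carry token ids only):

>>> params = SpectDataLoaderParams(batch_size=4)
>>> loader = SpectDataLoader('data', params, seed=0)
>>> for feats, refs, feat_sizes, ref_sizes in loader:
...     pass  # feats: (N, max T, F); refs: (N, max R); N <= 4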

property epoch

the current epoch

Type:

int

class pydrobert.torch.data.SpectDataLoaderParams(*, delta_order, do_mvn, eos, sos, subset_ids, num_length_buckets, size_batch_by_length, batch_size, drop_last, name)[source]

Parameters for a SpectDataLoader

This implements the pydrobert.param.optuna.TunableParameterized interface.

class pydrobert.torch.data.SpectDataParams(*, delta_order, do_mvn, eos, sos, subset_ids, name)[source]

Parameters for SpectDataSet

classmethod suggest_params(trial, base=None, only=None, prefix='')[source]

Populate a parameterized instance with values from trial

class pydrobert.torch.data.SpectDataSet(data_dir, file_prefix='', file_suffix='.pt', warn_on_missing=True, subset_ids=None, sos=None, eos=None, feat_subdir='feat', ali_subdir='ali', ref_subdir='ref', params=None, feat_mean=None, feat_std=None, suppress_alis=None, suppress_uttids=True, tokens_only=None)[source]

Accesses spectrographic filter data stored in a data directory

Parameters:
  • data_dir (str) – A path to the data directory

  • file_prefix (str) – The prefix that indicates that the file counts toward the data set

  • file_suffix (str) – The suffix that indicates that the file counts toward the data set

  • warn_on_missing (bool) – If ali/ or ref/ exist, warn_on_missing is True, and there’s a mismatch between the utterances in the directories, a warning will be issued (via warnings) regarding each such mismatch.

  • subset_ids (Optional[Set[str]]) – Deprecated. Use params.subset_ids.

  • sos (Optional[int]) – Deprecated. Use params.sos.

  • eos (Optional[int]) – Deprecated. Use params.eos.

  • feat_subdir (str) – Change the names of the subdirectories under which feats, alignments, and references are stored. If ali_subdir or ref_subdir is None, they will not be searched for

  • ali_subdir (Optional[str]) – Change the names of the subdirectories under which feats, alignments, and references are stored. If ali_subdir or ref_subdir is None, they will not be searched for

  • ref_subdir (Optional[str]) – Change the names of the subdirectories under which feats, alignments, and references are stored. If ali_subdir or ref_subdir is None, they will not be searched for

  • params (Optional[SpectDataParams]) – Populates the parameters of this class with the instance. If unset, a new SpectDataParams instance is initialized.

  • feat_mean (Optional[Tensor]) – If specified and params.do_mvn is True, this tensor will be used as the mean in mean-variance normalization.

  • feat_std (Optional[Tensor]) – If specified and params.do_mvn is True, this tensor will be used as the standard deviation in mean-variance normalization.

  • suppress_alis (bool) – If True, ali will not be yielded, nor will alignment information be counted towards the list of utterance ids if available.

  • suppress_uttids (bool) – If True, uttid will not be yielded.

  • tokens_only (bool) – If True, ref will drop the segment information if present, always yielding tuples of shape (R,).

Yields:

tup – For a given utterance, a tuple:

  1. feat, the filter bank data.

  2. ali (if suppress_alis is False), frame-level alignments or None if not available.

  3. ref, a sequence of reference tokens or None if not available.

  4. uttid (if suppress_uttids is False), the string representing the utterance id.

Examples

Creating a spectral data directory with random data

>>> import os
>>> data_dir = 'data'
>>> os.makedirs(data_dir + '/feat', exist_ok=True)
>>> os.makedirs(data_dir + '/ali', exist_ok=True)
>>> os.makedirs(data_dir + '/ref', exist_ok=True)
>>> num_filts, min_frames, max_frames, min_ref, max_ref = 40, 10, 20, 3, 10
>>> num_ali_classes, num_ref_classes = 100, 2000
>>> for utt_idx in range(30):
>>>     num_frames = torch.randint(
...         min_frames, max_frames + 1, (1,)).long().item()
>>>     num_tokens = torch.randint(
...         min_ref, max_ref + 1, (1,)).long().item()
>>>     feats = torch.randn(num_frames, num_filts)
>>>     torch.save(feats, data_dir + '/feat/{:02d}.pt'.format(utt_idx))
>>>     ali = torch.randint(num_ali_classes, (num_frames,)).long()
>>>     torch.save(ali, data_dir + '/ali/{:02d}.pt'.format(utt_idx))
>>>     # usually these would be sorted by order in utterance. Negative
>>>     # values represent "unknown" for start and end frames
>>>     ref_tokens = torch.randint(num_ref_classes, (num_tokens,))
>>>     ref_starts = torch.randint(1, num_frames // 2, (num_tokens,))
>>>     ref_ends = 2 * ref_starts
>>>     ref = torch.stack([ref_tokens, ref_starts, ref_ends], -1).long()
>>>     torch.save(ref, data_dir + '/ref/{:02d}.pt'.format(utt_idx))

Accessing individual elements in a spectral data directory

>>> data = SpectDataSet('data')
>>> data[0]  # random access feat, ali, ref
>>> for feat, ali, ref in data:  # iterator
>>>     pass

Writing evaluation data back to the directory

>>> data = SpectDataSet('data')
>>> num_ali_classes, num_ref_classes, min_ref, max_ref = 100, 2000, 3, 10
>>> num_frames = data[3][0].shape[0]
>>> # pdfs (or more accurately, pms) are likelihoods of classes over data
>>> # per frame, used in hybrid models. Usually logits
>>> pdf = torch.randn(num_frames, num_ali_classes)
>>> data.write_pdf(3, pdf)  # will share name with data.utt_ids[3]
>>> # both refs and hyps are sequences of tokens, such as words or phones,
>>> # with optional frame alignments
>>> num_tokens = torch.randint(min_ref, max_ref, (1,)).long().item()
>>> hyp = torch.full((num_tokens, 3), INDEX_PAD_VALUE).long()
>>> hyp[..., 0] = torch.randint(num_ref_classes, (num_tokens,))
>>> data.write_hyp('special', hyp)  # custom name
find_utt_ids(warn_on_missing, subset_ids={})[source]

Returns a set of all utterance ids from data_dir

pydrobert.torch.data.context_window_seq_to_batch(seq, has_uttids=False)[source]

Convert a sequence of context window elements to a batch

This function is used to collate sequences of elements from a ContextWindowDataSet into batches.

Assume seq is a finite length sequence of pairs of window, ali, where window is of size (T, C, F), where T is some number of windows (which can vary across elements in the sequence), C is the window size, and F is some number of filters, and ali is of size (T,). This function batches all the elements of the sequence into a pair of windows, alis, where windows and alis will have shapes (N, C, F) and (N,) resp., where \(N = \sum T\) is the total number of context windows over the utterances.

If ali is None in any element, alis will also be None

Parameters:
  • seq (Sequence[Tuple[Union[Tensor, str, None], ...]]) –

    A finite-length (N) sequence of tuples, each tuple corresponding to an utterance and containing, in order:

    1. window_n, a tensor of size (T_n, C, F) representing windowed spectral features.

    2. ali_n, either None or a tensor of shape (T_n,) representing per-window alignment ids.

    3. uttid_n (if has_uttids is True), the utterance id.

  • has_uttids (bool) – Whether utt_n is part of the input values and both window_sizes and uttids are part of the output values.

Returns:

batch – A tuple containing the following elements:

  1. windows, a tensor of shape (sum_n T_n, C, F) containing the concatenated set of windows [window_1, window_2, ..., window_N]

  2. alis, either None or a tensor of shape (sum_n T_n,) containing the concatenated alignment ids [ali_1, ali_2, ..., ali_N].

  3. window_sizes (if has_uttids is True), a tensor of shape (N,) containing the sequence [T_1, T_2, ..., T_N].

  4. uttids (if has_uttids is True), an N-tuple of utterance ids.
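
Examples

A minimal sketch collating two utterances of 4 and 6 windows (window size C = 3, F = 5 filters) with no alignments:

>>> seq = [(torch.randn(4, 3, 5), None), (torch.randn(6, 3, 5), None)]
>>> windows, alis = context_window_seq_to_batch(seq)
>>> windows.shape
torch.Size([10, 3, 5])
>>> alis is None
True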

pydrobert.torch.data.extract_window(feat, frame_idx, left, right, reverse=False)[source]

Slice the feature matrix to extract a context window

Parameters:
  • feat (Tensor) – Of shape (T, F), where T is the time/frame axis, and F is the frequency axis

  • frame_idx (int) – The “center frame” 0 <= frame_idx < T

  • left (int) – The number of frames in the window to the left (before) the center frame. Any frames below zero are edge-padded

  • right (int) – The number of frames in the window to the right (after) the center frame. Any frames above T are edge-padded

  • reverse (bool) – If True, flip the window along the time/frame axis

Returns:

window (torch.Tensor) – Of shape (1 + left + right, F)
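
Examples

A minimal sketch; with frame_idx=0, left=2, and right=1, the first two frames of the window are edge-padded copies of frame 0:

>>> feat = torch.arange(4.).unsqueeze(-1).expand(4, 3)  # (T=4, F=3), frame t filled with t
>>> extract_window(feat, 0, 2, 1)[:, 0]
tensor([0., 0., 0., 1.])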

pydrobert.torch.data.lang_seq_to_batch(seq, batch_first=True, sort=True, has_uttids=False)[source]

Convert a sequence of reference sequences to a batch

This function is used to collate sequences of elements from a LangDataSet into batches.

Parameters:
  • seq (Sequence[Union[Tensor, Tuple[Tensor, str]]]) –

    A finite-length (N) sequence of either just ref_n or tuples ref_n, utt_n, where

    • ref_n is a tensor of size (R_n[, 3]) representing reference token sequences and optionally their frame shifts. Either all ref_n must contain the frame shift info (the 3 dimension) or none of them.

    • utt_n (if has_uttids is True) is the utterance id.

  • batch_first (bool) – If True, the batch dimension N comes before the sequence dimension R in refs.

  • sort (bool) – If True, the elements of seq are ordered in descending order of R_n before being batched.

  • has_uttids (bool) – Whether utt_n is part of the input values and uttids is part of the output values.

Returns:

batch (tuple) – A tuple of refs, ref_sizes[, uttids], where: refs is a tensor of shape (max_n R_n, N[, 3]) containing the right-padded sequences [ref_1, ref_2, ..., ref_N] and padded with pydrobert.torch.config.INDEX_PAD_VALUE; ref_sizes is a tensor of shape (N,) containing the sequence [R_1, R_2, ..., R_N]; and uttids (if has_uttids is True), is an N-tuple of strings matching the utterance ids.
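
Examples

A minimal sketch batching two token-only references (no segment information) with batch_first=True:

>>> seq = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
>>> refs, ref_sizes = lang_seq_to_batch(seq, batch_first=True)
>>> refs.shape, ref_sizes.tolist()
(torch.Size([2, 3]), [3, 2])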

pydrobert.torch.data.parse_arpa_lm(file_, token2id=None, to_base_e=None, ftype=<class 'float'>, logger=None)[source]

Parse an ARPA statistical language model

An ARPA language model is an n-gram model with back-off probabilities. It is formatted as:

\data\
ngram 1=<count>
ngram 2=<count>
...
ngram <N>=<count>

\1-grams:
<logp> <token[t]> <logb>
<logp> <token[t]> <logb>
...

\2-grams:
<logp> <token[t-1]> <token[t]> <logb>
...

\<N>-grams:
<logp> <token[t-<N>+1]> ... <token[t]>
...

\end\
Parameters:
  • file_ – Either the path or a file pointer to the file.

  • token2id (Optional[Dict[str, signedinteger]]) – A dictionary whose keys are token strings and values are ids. If set, tokens will be replaced with ids on read

  • to_base_e (Optional[bool]) – ARPA files store log-probabilities and log-backoffs in base 10. If True, values are converted to base e on read; otherwise they are kept in base 10 (currently the default; see the warning below).

  • ftype (Type[TypeVar(F, bound= Union[float, floating])]) – The floating-point type to store log-probabilities and backoffs as

  • logger (Optional[Logger]) – If specified, progress will be written to this logger at INFO level

Returns:

prob_dicts (list) – A list of the same length as there are orders of n-grams in the file (e.g. if the file contains up to tri-gram probabilities then prob_dicts will be of length 3). Each element is a dictionary whose key is the word sequence (earliest word first). For 1-grams, this is just the word. For n > 1, this is a tuple of words. Values are either a tuple of logp, logb of the log-probability and backoff log-probability, or, in the case of the highest-order n-grams that don’t need a backoff, just the log probability.

Warning

Version 0.3.0 and prior do not have the option to_base_e, always returning values in log base 10. While this remains the default, it is deprecated and will be removed in a later version.

This function is not safe for JIT scripting or tracing.
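
Examples

A minimal sketch, assuming a hypothetical file 'lm.arpa' in the format above whose highest order is 2 and which contains the unigram 'hello' and the bigram ('hello', '</s>'):

>>> prob_dicts = parse_arpa_lm('lm.arpa', to_base_e=False)
>>> logp, logb = prob_dicts[0]['hello']            # unigram: (log-prob, log-backoff)
>>> logp_hi = prob_dicts[1][('hello', '</s>')]     # highest order: just the log-prob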

pydrobert.torch.data.read_ctm(ctm, wc2utt=None)[source]

Read a NIST sclite “ctm” file into a list of transcriptions

sclite is a commonly used scoring tool for ASR.

This function converts a time-marked conversation file (“ctm” format) into a list of transcripts. Each element is a tuple of utt_id, transcript, where transcript is itself a list of triples token, start, end, token being a string, start being the start time of the token (in seconds), and end being the end time of the token (in seconds)

Parameters:
  • ctm (Union[TextIO, str]) – The time-marked conversation file pointer. Will open if ctm is a path

  • wc2utt (Optional[dict]) – “ctm” files identify utterances by waveform file name and channel. If specified, wc2utt consists of keys wfn, chan (e.g. '940328', 'A') to unique utterance IDs. If wc2utt is unspecified, the waveform file names are treated as the utterance IDs, and the channel is ignored

Returns:

transcripts (list) – Each element is a tuple of utt_id, transcript. utt_id is a string identifying the utterance. transcript is a list of triples token, start, end, token being the token (a string), start being a float of the start time of the token (in seconds), and end being the end time of the token.

Notes

“ctm”, like “trn”, has “support” for alternate transcriptions. It is, as of sclite version 2.10, very buggy. For example, it cannot handle multiple alternates in the same utterance. Plus, tools like Kaldi use the Unix command that the sclite documentation recommends to sort a ctm, sort +0 -1 +1 -2 +2nb -3, which does not maintain proper ordering for alternate delimiters. Thus, read_ctm() will error if it comes across those delimiters
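
Examples

A minimal sketch, assuming a hypothetical "ctm" snippet whose columns are waveform name, channel, start time, duration, and token:

>>> import io
>>> ctm = io.StringIO("utt1 A 0.10 0.30 hello\nutt1 A 0.40 0.35 world\n")
>>> transcripts = read_ctm(ctm)
>>> transcripts[0][0]   # waveform name used as the utterance id (no wc2utt given)
'utt1'
>>> [tok for tok, start, end in transcripts[0][1]]
['hello', 'world']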

pydrobert.torch.data.read_textgrid(tg, tier_id=0, fill_token=None)[source]

Read TextGrid file as a transcription

TextGrid is the transcription format of Praat.

Parameters:
  • tg (Union[TextIO, str]) – The TextGrid file. Will open if tg is a path.

  • tier_id (Union[str, int]) – Either the name of the tier (first occurrence) or the index of the tier to extract.

  • fill_token (Optional[str]) – If set, any intervals missing from the tier will be filled with an interval of this token before being returned.

Returns:

  • transcript (list) – A list of triples of token, start, end, token being the token (a string), start being a float of the start time of the token (in seconds), and end being the end time of the token. If the tier is a PointTier, the start and end times will be the same.

  • start_time (float) – The start time of the tier (in seconds)

  • end_time (float) – The end time of the tier (in seconds)

Notes

This function does not check for whitespace in or around token labels. This may cause issues if writing as another file type, like write_trn().

Start and end times (including any filled intervals) are determined from the tier’s values, not necessarily those of the top-level container. This is most likely a technicality, however: they should not differ normally.

pydrobert.torch.data.read_trn(trn, warn=True, processes=0, chunk_size=1000)[source]

Read a NIST sclite transcript file into a list of transcripts

sclite is a commonly used scoring tool for ASR.

This function converts a transcript input file (“trn” format) into a list of transcripts, where each element is a tuple of utt_id, transcript. transcript is a list split by spaces.

Parameters:
  • trn (Union[TextIO, str]) – The transcript input file. Will open if trn is a path.

  • warn (bool) – The “trn” format uses curly braces and forward slashes to indicate transcript alterations. This is largely for scoring purposes, such as swapping between filled pauses, not for training. If warn is True, a warning will be issued via the warnings module every time an alteration appears in the “trn” file. Alterations appear in transcripts as elements of ([[alt_1_word_1, alt_1_word_2, ...], [alt_2_word_1, alt_2_word_2, ...], ...], -1, -1) so that transcript_to_token() will not attempt to process alterations as token start and end times.

  • processes (int) – The number of processes used to parse the lines of the trn file. If 0, will be performed on the main thread. Otherwise, the file will be read on the main thread and parsed using processes many processes.

  • chunk_size (int) – The number of lines to be processed by a worker process at a time. Applicable when processes > 0

Returns:

transcripts (list) – A list of pairs utt_id, transcript where utt_id is a string identifying the utterance and transcript is a list of tokens in the utterance’s transcript.

Notes

Any null words (@) in the “trn” file are encoded verbatim.
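
Examples

A minimal sketch, assuming a hypothetical "trn" snippet in which each line ends with the utterance id in parentheses:

>>> import io
>>> trn = io.StringIO("hello world (utt1)\ngoodbye (utt2)\n")
>>> read_trn(trn)
[('utt1', ['hello', 'world']), ('utt2', ['goodbye'])]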

pydrobert.torch.data.read_trn_iter(trn, warn=True, processes=0, chunk_size=1000)[source]

Read a NIST sclite transcript file, yielding individual transcripts

Identical to read_trn(), but yields individual transcript entries rather than a full list. Ideal for large transcript files.

Parameters:
Yields:
pydrobert.torch.data.spect_seq_to_batch(seq, batch_first=True, sort=True, has_alis=True, has_uttids=False)[source]

Convert a sequence of spectral data to a batch

This function is used to collate sequences of elements from a SpectDataSet into batches.

Parameters:
  • seq (Sequence[Tuple[Union[Tensor, str, None], ...]]) –

    A finite-length (N) sequence of tuples, each tuple corresponding to an utterance and containing, in order:

    1. feat_n, a tensor of size (T_n, F) representing per-frame spectral features.

    2. ali_n (if has_alis is True), either None or a tensor of shape (T_n,) representing per-frame alignment ids.

    3. ref_n, either None or a tensor of size (R_n[, 3]) representing reference token sequences and optionally their frame shifts. Either all ref_n must contain the frame shift info (the 3 dimension) or none of them.

    4. utt_n (if has_uttids is True), the utterance id.

  • batch_first (bool) – If True, the batch dimension N comes before the sequence dimension T or R in the return values.

  • sort (bool) – If True, the tuples in seq are first sorted in descending order of T_n before being batched.

  • has_alis (bool) – Whether ali_n is part of the input values and alis is part of the output values. Note that has_alis should still be True if ali_n is present in seq but is None.

  • has_uttids (bool) – Whether utt_n is part of the input values and uttids is part of the output values.

Returns:

batch – A tuple containing the following elements:

  1. feats, a tensor of shape (max_n T_n, N, F) containing the right-padded sequences [feat_1, feat_2, ..., feat_N]. Padded with zeros.

  2. alis (if has_alis is True), either None or a tensor of shape (max_n T_n, N) containing the right-padded sequence [ali_1, ali_2, ... ali_N]. Padded with pydrobert.torch.config.INDEX_PAD_VALUE.

  3. refs, either None or a tensor of shape (max_n R_n, N[, 3])

    containing the right-padded sequences [ref_1, ref_2, ..., ref_N]. Padded with pydrobert.torch.config.INDEX_PAD_VALUE.

  4. feat_sizes, a tensor of shape (N,) containing the sequence [T_1, T_2, ..., T_N].

  5. ref_sizes, a tensor of shape (N,) containing the sequence [R_1, R_2, ..., R_N].

  6. uttids (if has_uttids is True), an N-tuple of the utterance ids.
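
Examples

A minimal sketch collating two utterances without alignments (has_alis=False) and with token-only references:

>>> seq = [
...     (torch.randn(10, 5), torch.tensor([1, 2, 3])),
...     (torch.randn(7, 5), torch.tensor([4, 5])),
... ]
>>> feats, refs, feat_sizes, ref_sizes = spect_seq_to_batch(seq, has_alis=False)
>>> feats.shape, refs.shape
(torch.Size([2, 10, 5]), torch.Size([2, 3]))
>>> feat_sizes.tolist(), ref_sizes.tolist()
([10, 7], [3, 2])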

pydrobert.torch.data.token_to_transcript(ref, id2token=None, frame_shift_ms=None)[source]

Convert a token sequence to a transcript

The inverse operation of transcript_to_token().

Parameters:
Returns:

transcript

Warning

The time interval inferred using frame_shift_ms is unlikely to be perfectly correct. See the note in transcript_to_token() for more details about the ambiguity in converting between seconds and frames.

pydrobert.torch.data.transcript_to_token(transcript, token2id=None, frame_shift_ms=None, unk=None, skip_frame_times=False)[source]

Convert a transcript to a token sequence

This function converts transcript of length R to a long tensor tok of shape (R, 3), the latter suitable as a reference or hypothesis token sequence for an utterance of SpectDataSet. An element of transcript can either be a token or a 3-tuple of (token, start, end). If token2id is not None, the token id is determined by checking token2id[token]. If the token does not exist in token2id and unk is not None, the token will be replaced with unk. If unk is None, token will be used directly as the id. If token2id is not specified, token will be used directly as the identifier. If frame_shift_ms is specified, start and end are taken as the start and end times, in seconds, of the token, and will be converted to frames for tok. If frame_shift_ms is unspecified, start and end are assumed to already be frame times. If start and end were unspecified, values of -1, representing unknown, will be inserted into tok[r, 1:]

Parameters:
  • transcript (Sequence[Union[str, Tuple[str, float, float]]]) –

  • token2id (Optional[dict]) –

  • frame_shift_ms (Optional[float]) –

  • unk (Union[str, int, None]) – The out-of-vocabulary token, if specified. If unk exists in token2id, the token2id[unk] will be used as the out-of-vocabulary identifier. If token2id[unk] does not exist, unk will be assumed to be the identifier already. If token2id is None, unk has no effect.

  • skip_frame_times (bool) – If True, tok will be of shape (R,) and contain only the token ids. Suitable for BitextDataSet.

Returns:

tok (torch.Tensor)

Warning

The frame index bounds inferred using frame_shift_ms should not be used directly in evaluation. See the below note.

Notes

If you are dealing with raw audio, each “frame” is just a sample. The appropriate value for frame_shift_ms is 1000 / sample_rate_hz (since there are sample_rate_hz / 1000 samples per millisecond).

Converting to frame indices from start and end times follows an overly-simplistic equation. Letting \((s_s, e_s)\) be the start and end times in seconds, \((s_f, e_f)\) be the corresponding start and end frames, \(\Delta\) be the frame shift in milliseconds, and \(I[\cdot]\) be the indicator function. Then

\[\begin{split}s_f = floor\left(\frac{1000s_s}{\Delta}\right) \\ e_f = \max\left(s_f + I[s_s = e_s], round\left(\frac{1000e_s}{\Delta}\right)\right)\end{split}\]

For a given token index r, tok[r, 1] = s_f and tok[r, 2] = e_f. tok[r, 1] is supposed to be the inclusive start frame of the segment and tok[r, 2] the exclusive end frame. \((s_f, e_f)\) fail to be these on two accounts. First, they do not consider the frame length: while frames may be spaced \(\Delta\) milliseconds apart, they will usually be overlapping, and because of this overlap the coefficients of frames \(s_f - 1\) and \(e_f\) may in part depend on the audio samples within the segment. Second, ignoring frame length, \(e_f = ceil(1000e_s/\Delta)\) would be more appropriate for an exclusive upper bound. However, in pydrobert.speech.compute (and other mainstream feature computation packages), the total number of frames in the utterance is calculated as \(T_f = ceil(1000T_s/\Delta)\), where \(T_s\) is the length of the utterance in seconds. The above equation ensures \(\max(e_f) \leq T_f\), which is a necessary criterion for a valid SpectDataSet (see validate_spect_data_set()).

Accounting for both of these assumptions would involve computing the support of each existing frame in seconds and intersecting that with the provided interval in seconds. As such, the derived frame bounds should not be used for an official evaluation. This function should suffice for most training objectives, however.
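
Examples

A minimal sketch, assuming a 10 ms frame shift; the frame bounds follow the (approximate) equation above:

>>> transcript = [('hello', 0.1, 0.3), ('world', 0.3, 0.5)]
>>> token2id = {'hello': 0, 'world': 1}
>>> tok = transcript_to_token(transcript, token2id, frame_shift_ms=10.0)
>>> # per the equation above, tok should be roughly [[0, 10, 30], [1, 30, 50]]
>>> transcript2 = token_to_transcript(tok, {0: 'hello', 1: 'world'}, frame_shift_ms=10.0)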

pydrobert.torch.data.validate_spect_data_set(data_set, fix=None)[source]

Validate SpectDataSet data directory

The data directory is valid if the following conditions are observed:

  1. All tensors are on the CPU.

  2. All features are tensor instances of the same dtype.

  3. All features have two dimensions.

  4. All features have the same size second dimension.

  5. If alignments are present:

    1. All alignments are long tensor instances.

    2. All alignments have one dimension.

    3. Features and alignments have the same size first axes for a given utterance id (same number of frames).

  6. If reference sequences are present:

    1. All references are long tensor instances.

    2. All references have the same number of dimensions: either 1 or 2.

    3. If 2-dimensional:

      1. The second dimension has length 3

      2. For the start and end points of a reference token, r[i, 1:], either both of them are negative (indicating no alignment), or 0 <= r[i, 1] <= r[i, 2] <= T, where T is the number of frames in the utterance. We do not enforce that tokens be non-overlapping.

Raises a ValueError if a condition is violated.

If fix is not None, the following changes to the data will be permitted instead of raising an error. Any of these changes will be warned of using warnings and then written back to disk.

  1. Any CUDA tensors will be converted into CPU tensors

  2. A reference or alignment of bytes or 32-bit integers can be upcast to long tensors.

  3. A reference token with only a start or end bound (but not both) will have the existing one removed.

  4. A reference token with an exclusive boundary exceeding the number of frames by at most fix will be decreased by that amount. This is only possible if the exclusive end remains above or at the inclusive start.

  5. Alignments exceeding the total number of frames by at most fix will be cropped to that amount.

Notes

The behaviour of condition 6.3.2. has changed slightly since version 0.3.0. We now allow for empty reference token segments (i.e. r[i, 1] can equal r[i, 2]).
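
Examples

A minimal sketch, assuming the 'data' directory from the SpectDataSet examples above (fix=1 is a hypothetical tolerance for off-by-one boundary issues):

>>> validate_spect_data_set(SpectDataSet('data'))         # raises ValueError on failure
>>> validate_spect_data_set(SpectDataSet('data'), fix=1)  # warn and repair small issues instead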

pydrobert.torch.data.write_textgrid(transcript, tg, start_time=None, end_time=None, tier_name='transcript', point_tier=None, precision=3)[source]

Write a transcription as a TextGrid file

TextGrid is the transcription format of Praat.

This function saves transcript as a tier within a TextGrid file.

Parameters:
  • transcript (Sequence[Tuple[str, float, float]]) – The transcription to write. Contains triples tok, start, end, where tok is the token, start is its start time, and end is its end time. transcript must be non-empty.

  • tg (Union[TextIO, str]) – The file to write. Will open if tg is a path.

  • start_time (Optional[float]) – The start time of the recording (in seconds). If not specified, it will be inferred from the minimum start time of the intervals in transcript.

  • end_time (Optional[float]) – The end time of the recording (in seconds). If not specified, it will be inferred from the maximum end time of the intervals in transcript.

  • tier_name (str) – What name to save the tier with.

  • point_tier (Optional[bool]) – Whether to save as a point tier (True) or an interval tier. If unset, the value is inferred to be a point tier if all segments are length 0 (within precision precision); an interval tier otherwise.

  • precision (int) – The precision of floating-point values to save times with.
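
Examples

A minimal round-trip sketch, writing an interval tier to an in-memory file and reading it back with read_textgrid():

>>> import io
>>> transcript = [('hello', 0.0, 0.5), ('world', 0.5, 1.0)]
>>> tg = io.StringIO()
>>> write_textgrid(transcript, tg)
>>> _ = tg.seek(0)
>>> transcript2, start_time, end_time = read_textgrid(tg)
>>> [tok for tok, start, end in transcript2]
['hello', 'world']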

pydrobert.torch.data.write_trn(transcripts, trn)[source]

From an iterable of transcripts, write to a NIST “trn” file

This is largely the inverse operation of read_trn(). In general, elements of a transcript (transcripts contains pairs of utt_id, transcript) could be tokens or tuples of x, start, end (providing the start and end times of tokens, respectively). However, start and end are ignored when writing “trn” files. x could be the token or a list of alternates, as described in read_trn().

Parameters: