References

[fan1962]

C. T. Fan, M. E. Muller, and I. Rezucha, “Development of sampling plans by using sequential (item by item) selection techniques and digital computers,” Journal of the American Statistical Association, vol. 57, no. 298, pp. 387-402, Jun. 1962, doi: 10.1080/01621459.1962.10480667.

[howard1972]

S. Howard, “Discussion on Professor Cox’s paper,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 34, no. 2, pp. 210-211, Jan. 1972, doi: 10.1111/j.2517-6161.1972.tb00900.x.

[williams1992]

R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, pp. 229-256, May 1992, doi: 10.1007/BF00992696.

[chen1994]

X.-H. Chen, A. P. Dempster, and J. S. Liu, “Weighted finite population sampling to maximize entropy,” Biometrika, vol. 81, no. 3, pp. 457-469, 1994, doi: 10.2307/2337119.

[mengerson1996]

K. L. Mengersen and R. L. Tweedie, “Rates of convergence of the Hastings and Metropolis algorithms,” The Annals of Statistics, vol. 24, no. 1, pp. 101-121, Feb. 1996, doi: 10.1214/aos/1033066201.

[graves2006]

A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning (ICML ’06), New York, NY, USA, 2006, pp. 369-376, doi: 10.1145/1143844.1143891.

[mikolov2010]

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” presented at Interspeech, Makuhari, Japan, 2010.

[heafield2011]

K. Heafield, “KenLM: Faster and smaller language model queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, 2011, pp. 187-197.

[bengio2013]

Y. Bengio, N. Léonard, and A. C. Courville, “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation,” CoRR, vol. abs/1308.3432, 2013, [Online]. Available: http://arxiv.org/abs/1308.3432

[cho2014]

K. Cho et al., “Learning phrase representations using RNN Encoder-Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1724-1734. [Online]. Available: https://www.aclweb.org/anthology/D14-1179

[bahdanau2015]

D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[gulcehre2015]

Ç. Gülçehre et al., “On using monolingual corpora in neural machine translation,” CoRR, vol. abs/1503.03535, 2015, [Online]. Available: http://arxiv.org/abs/1503.03535

[luong2015]

T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015, pp. 1412-1421.

[chan2016]

W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, Attend and Spell: A neural network for large vocabulary conversational speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2016, pp. 4960-4964, doi: 10.1109/ICASSP.2016.7472621.

[grathwohl2017]

W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. K. Duvenaud, “Backpropagation through the Void: Optimizing control variates for black-box gradient estimation,” CoRR, vol. abs/1711.00123, 2017, [Online]. Available: http://arxiv.org/abs/1711.00123

[vaswani2017]

A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 5998-6008.

[tucker2017]

G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein, “REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 2627-2636.

[prabhavalkar2018]

R. Prabhavalkar et al., “Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models,” presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4839-4843.

[sabour2018]

S. Sabour, W. Chan, and M. Norouzi, “Optimal Completion Distillation for Sequence Learning,” CoRR, vol. abs/1810.01398, 2018, [Online]. Available: http://arxiv.org/abs/1810.01398

[bert2019]

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional Transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2019, vol. 1, pp. 4171-4186. [Online]. Available: https://aclweb.org/anthology/papers/N/N19/N19-1423/

[park2019]

D. S. Park et al., “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613-2617, doi: 10.21437/Interspeech.2019-2680.

[park2020]

D. S. Park et al., “SpecAugment on large scale datasets,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 6879-6883, doi: 10.1109/ICASSP40776.2020.9053205.