Peer-Reviewed

Early Stages of Automatic Speech Recognition (ASR) in Non-English Speaking Countries and Factors That Affect the Recognition Process

Received: 22 April 2021    Accepted: 17 May 2021    Published: 31 May 2021
Abstract

Although there has been a considerable stream of research in ASR over the past few decades, it may seem strange that this field is still an active subject of study. There are many reasons, in part because the discipline was founded on the promise of human-level performance under real-world conditions, and that remains an intractable problem. In addition, the rapid advancement of technology in various fields has made the need for ASR ever more compelling; in particular, establishing such a system for the security sector of insecure third-world countries such as Afghanistan is an urgent need. This paper begins with a review of the necessary background on speech recognition and then proposes a novel method for building an automatic speech recognition (ASR) system for the Dari language using two of the most powerful open-source engines: CMUSphinx, from Carnegie Mellon University, and DeepSpeech v0.9.3. These systems are far more capable than early speech recognition systems. Using the author's own collected dataset, a speech-to-text model was trained for the Dari language. First, the dataset was filtered for the task; the paper then traces the pipeline from hidden Markov models (HMMs) and the phoneme concept through to RNN training. The system surpassed the expected results: the CMUSphinx documentation states that “for a typical 10-hour operation, the WER should be around 10%.” Finally, a WER of 3.3% was achieved with 10.3 hours of recorded audio using CMUSphinx, and a WER of 1% with DeepSpeech.
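For readers less familiar with the metric, the word error rate (WER) quoted above is the word-level edit distance between a reference transcript and the recognizer's hypothesis, normalized by the number of reference words: WER = (S + D + I) / N, where S, D, and I count substituted, deleted, and inserted words. The minimal Python sketch below illustrates how such a figure is computed; it is not the evaluation code used in the paper, and the example strings are hypothetical.

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level Levenshtein distance over reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit-distance table over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                               # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                               # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    # One substituted word in a six-word reference gives WER = 1/6, about 16.7%.
    print(wer("the cat sat on the mat", "the cat sat on a mat"))

Likewise, to make the DeepSpeech side concrete: a model trained with DeepSpeech v0.9.3 is typically exported as a .pbmm graph and decoded through the deepspeech Python package, as in the hedged sketch below. The file names dari.pbmm, dari.scorer, and sample.wav are hypothetical placeholders, not artifacts released with this paper.

    import wave
    import numpy as np
    from deepspeech import Model  # pip install deepspeech==0.9.3

    # Hypothetical file names; substitute your own exported model and scorer.
    ds = Model("dari.pbmm")
    ds.enableExternalScorer("dari.scorer")  # optional KenLM language-model scorer

    with wave.open("sample.wav", "rb") as w:
        # DeepSpeech 0.9 expects 16-bit mono PCM at the model's sample rate (16 kHz).
        assert w.getframerate() == ds.sampleRate()
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    print(ds.stt(audio))  # best transcript as a plain string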

Published in American Journal of Neural Networks and Applications (Volume 7, Issue 1)
DOI 10.11648/j.ajnna.20210701.13
Page(s) 15-22
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2021. Published by Science Publishing Group

Keywords

Dari Language, HMMs, Neural Network, Non-English Speaking Countries, RNN, Speech Recognition, WER

Cite This Article
  • APA Style

    Dunya Yousufzai. (2021). Early Stages of Automatic Speech Recognition (ASR) in Non-English Speaking Countries and Factors That Affect the Recognition Process. American Journal of Neural Networks and Applications, 7(1), 15-22. https://doi.org/10.11648/j.ajnna.20210701.13


    ACS Style

    Dunya Yousufzai. Early Stages of Automatic Speech Recognition (ASR) in Non-English Speaking Countries and Factors That Affect the Recognition Process. Am. J. Neural Netw. Appl. 2021, 7(1), 15-22. doi: 10.11648/j.ajnna.20210701.13


    AMA Style

    Dunya Yousufzai. Early Stages of Automatic Speech Recognition (ASR) in Non-English Speaking Countries and Factors That Affect the Recognition Process. Am J Neural Netw Appl. 2021;7(1):15-22. doi: 10.11648/j.ajnna.20210701.13


  • @article{10.11648/j.ajnna.20210701.13,
      author = {Dunya Yousufzai},
      title = {Early Stages of Automatic Speech Recognition (ASR) in Non-English Speaking Countries and Factors That Affect the Recognition Process},
      journal = {American Journal of Neural Networks and Applications},
      volume = {7},
      number = {1},
      pages = {15-22},
      doi = {10.11648/j.ajnna.20210701.13},
      url = {https://doi.org/10.11648/j.ajnna.20210701.13},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajnna.20210701.13},
      abstract = {Although there has been a considerable stream of research in ASR over the past few decades, it may seem strange that this field is still an active subject of study. There are many reasons, in part because the discipline was founded on the promise of human-level performance under real-world conditions, and that remains an intractable problem. In addition, the rapid advancement of technology in various fields has made the need for ASR ever more compelling; in particular, establishing such a system for the security sector of insecure third-world countries such as Afghanistan is an urgent need. This paper begins with a review of the necessary background on speech recognition and then proposes a novel method for building an automatic speech recognition (ASR) system for the Dari language using two of the most powerful open-source engines: CMUSphinx, from Carnegie Mellon University, and DeepSpeech v0.9.3. These systems are far more capable than early speech recognition systems. Using the author's own collected dataset, a speech-to-text model was trained for the Dari language. First, the dataset was filtered for the task; the paper then traces the pipeline from hidden Markov models (HMMs) and the phoneme concept through to RNN training. The system surpassed the expected results: the CMUSphinx documentation states that “for a typical 10-hour operation, the WER should be around 10%.” Finally, a WER of 3.3% was achieved with 10.3 hours of recorded audio using CMUSphinx, and a WER of 1% with DeepSpeech.},
      year = {2021}
    }
    


  • TY  - JOUR
    T1  - Early Stages of Automatic Speech Recognition (ASR) in Non-English Speaking Countries and Factors That Affect the Recognition Process
    AU  - Dunya Yousufzai
    Y1  - 2021/05/31
    PY  - 2021
    N1  - https://doi.org/10.11648/j.ajnna.20210701.13
    DO  - 10.11648/j.ajnna.20210701.13
    T2  - American Journal of Neural Networks and Applications
    JF  - American Journal of Neural Networks and Applications
    JO  - American Journal of Neural Networks and Applications
    SP  - 15
    EP  - 22
    PB  - Science Publishing Group
    SN  - 2469-7419
    UR  - https://doi.org/10.11648/j.ajnna.20210701.13
    AB  - Although there has been a considerable stream of research in ASR over the past few decades, it may seem strange that this field is still an active subject of study. There are many reasons, in part because the discipline was founded on the promise of human-level performance under real-world conditions, and that remains an intractable problem. In addition, the rapid advancement of technology in various fields has made the need for ASR ever more compelling; in particular, establishing such a system for the security sector of insecure third-world countries such as Afghanistan is an urgent need. This paper begins with a review of the necessary background on speech recognition and then proposes a novel method for building an automatic speech recognition (ASR) system for the Dari language using two of the most powerful open-source engines: CMUSphinx, from Carnegie Mellon University, and DeepSpeech v0.9.3. These systems are far more capable than early speech recognition systems. Using the author's own collected dataset, a speech-to-text model was trained for the Dari language. First, the dataset was filtered for the task; the paper then traces the pipeline from hidden Markov models (HMMs) and the phoneme concept through to RNN training. The system surpassed the expected results: the CMUSphinx documentation states that “for a typical 10-hour operation, the WER should be around 10%.” Finally, a WER of 3.3% was achieved with 10.3 hours of recorded audio using CMUSphinx, and a WER of 1% with DeepSpeech.
    VL  - 7
    IS  - 1
    ER  - 


Author Information
  • Computer Science, Software Engineering Department, Kabul University, Kabul, Afghanistan
