Project

Text normalization for speech recognition using deep learning

In the 21st century, it has become the norm for companies to use virtual assistants such as Siri,
Google Assistant, or Alexa to resolve many of the problems their customers face. These
technologies are powered by natural language processing and are good at understanding queries
asked by humans. The assistants are examples of speech-to-text conversion models, in which human
input is converted into text and processed in the cloud.
On the other hand, text-to-speech conversion, in which text must first be normalized and then
provided as input to the system, has attracted attention from researchers and practitioners.
Companies have developed text-to-speech synthesis (TTS) and automatic speech recognition (ASR)
systems. The biggest challenge is developing and testing grammars for the various rules. We
address that challenge by converting input tokens into meaningful words. Text normalization is
the process of converting written text into the words a human would speak.
The aim of this project is to design and implement DeepNarrator, a novel system that extends
existing TTS libraries to support the conversion of alphanumeric text into meaningful words.
In this study, we used the dataset provided by Google for the text normalization challenge on
Kaggle. The dataset has 16 classes and more than 9 million rows of data for training and testing.
DeepNarrator consists of four modules: vector extraction, token classification, text conversion,
and speech generation. In vector extraction, input tokens are converted into 100-dimensional
vectors. Multiple approaches were employed to classify each input token into one of the classes.
The GRU-based approach achieved 98.03% testing accuracy, while the LSTM-based approach achieved
98.14%. Based on the predicted class, the input text is converted into its spoken form using
regular expressions. Audio files are generated using the Google text-to-speech API.
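
The text conversion step described above, in which the predicted class selects a regular
expression and a conversion rule, can be illustrated with a minimal sketch. This is not
DeepNarrator's code: the function names (to_spoken, cardinal_to_words), the regular expressions,
and the three class labels handled here (CARDINAL, DIGIT, LETTERS) are assumptions chosen for
illustration, and the sketch only covers integers below one million.

```python
import re

# Hypothetical lookup tables for spelling out small integers.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def cardinal_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + cardinal_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + cardinal_to_words(rest) if rest else ""
    return cardinal_to_words(thousands) + " thousand" + tail


# Assumed patterns: one regular expression per class handled in this sketch.
CARDINAL_RE = re.compile(r"^[0-9]{1,3}(?:,[0-9]{3})*$|^[0-9]+$")
DIGIT_RE = re.compile(r"^[0-9]+$")
LETTERS_RE = re.compile(r"^[A-Za-z](?:\.?[A-Za-z])*\.?$")


def to_spoken(token: str, predicted_class: str) -> str:
    """Convert a token to spoken form according to its predicted class.

    Tokens whose class is not handled, or that do not match the pattern
    for their class, are returned unchanged.
    """
    if predicted_class == "CARDINAL" and CARDINAL_RE.match(token):
        value = int(token.replace(",", ""))
        if value < 1_000_000:
            return cardinal_to_words(value)
    elif predicted_class == "DIGIT" and DIGIT_RE.match(token):
        # Read the digits one at a time.
        return " ".join(ONES[int(ch)] for ch in token)
    elif predicted_class == "LETTERS" and LETTERS_RE.match(token):
        # Spell the token letter by letter, dropping any periods.
        return " ".join(ch.lower() for ch in token if ch.isalpha())
    return token
```

With these assumed rules, to_spoken("1,234", "CARDINAL") gives "one thousand two hundred
thirty four", while the same characters classified as DIGIT would be read digit by digit. This
is why the accuracy of the token classifier matters: the conversion rule is chosen entirely by
the predicted class.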

Project (M.S., Computer Science)--California State University, Sacramento, 2018.

