Artificial Intelligence Promotes the Development of Speech Recognition Technology

Speech is the most natural way for humans to interact. Since the invention of the computer, people have pursued the goal of making machines "understand" human language, grasp its meaning, and respond correctly. This involves three main technologies: automatic speech recognition (ASR), natural language processing (whose purpose is to let the machine understand human intentions), and speech synthesis (whose purpose is to let the machine speak).
Speech recognition technology has turned a long-held human dream, talking to a machine and having it understand what you say, into reality. It serves as the machine's "auditory system", allowing the machine to convert speech signals into corresponding text or commands through recognition and understanding.
The Origin and Development of Modern Intelligent Speech Recognition Technology
At Bell Labs in 1952, Davis et al. developed the world's first experimental system that could recognize the pronunciation of the ten English digits. In 1960, Denes et al. in the United Kingdom developed the first computer speech recognition system.
Large-scale speech recognition research began in the 1970s and made substantial progress in recognizing small vocabularies and isolated words. After the 1980s, the focus of research gradually shifted to large-vocabulary, speaker-independent continuous speech recognition.
At the same time, the research approach to speech recognition changed fundamentally: the traditional approach based on standard template matching gave way to one based on statistical models. In addition, experts in the field once again proposed introducing neural network technology into speech recognition.
After the 1990s, there was no major breakthrough in the system framework for speech recognition. However, great progress was made in the application and productization of speech recognition technology. For example, the DARPA program, funded in the 1970s by the US Department of Defense's Advanced Research Projects Agency, supported the research and development of language understanding systems. In the 1990s the DARPA program was still in progress; its research focus had shifted to the natural language processing part of the recognizer, and the recognition task was set to “air travel information retrieval”.
China's speech recognition research began in 1958, when the Institute of Acoustics of the Chinese Academy of Sciences used vacuum-tube circuits to recognize 10 vowels. Because of the limited conditions at the time, China's speech recognition research developed slowly. It was not until 1973 that the Institute of Acoustics began work on computer speech recognition.
Since the 1980s, with the gradual spread of computer applications in China and the further development of digital signal technology, many domestic institutions have acquired the basic conditions for researching speech technology. At the same time, speech recognition once again became an international research hotspot after years of silence. Against this background, many domestic institutions invested in this research.
In 1986, speech recognition was listed as an important part of the research of intelligent computer systems. With the support of the “863” program, China began to organize research on speech recognition technology and decided to hold a special session on speech recognition every two years. Since then, China's speech recognition technology has entered a new stage of development.
Since 2009, with the development of deep learning research in the field of machine learning and the accumulation of big data corpus, speech recognition technology has developed by leaps and bounds.
Deep learning, from the field of machine learning, was introduced into the training of acoustic models for speech recognition, and multi-layer neural networks with RBM pre-training were used to improve the accuracy of the acoustic model. Microsoft researchers took the lead in making breakthroughs here. After adopting the deep neural network (DNN) model, they reduced the speech recognition error rate by 30%, the fastest progress in speech recognition technology in the past 20 years.
Around 2009, most mainstream speech recognition decoders adopted a decoding network based on weighted finite-state transducers (WFST), which can integrate the language model, the dictionary and the shared acoustic phone set into one large decoding network. This greatly improves decoding speed and provides the basis for real-time applications of speech recognition.
With the rapid development of the Internet and the spread of mobile terminals, large amounts of text and speech corpora can be obtained from multiple channels. This provides abundant resources for training the language models and acoustic models used in speech recognition and makes it possible to build general-purpose, large-scale language models and acoustic models.
In speech recognition, how well the training data matches the task and how rich it is are among the most important factors driving system performance. However, annotating and analyzing a corpus requires long-term accumulation. With the advent of the era of big data, the accumulation of large-scale corpus resources will be raised to strategic importance.
Nowadays, speech recognition on mobile terminals is the hottest application. Voice dialogue robots, voice assistants and interactive tools are emerging one after another. Many Internet companies have invested human, material and financial resources in research and applications in this field, with the aim of quickly capturing customers through the new and convenient mode of voice interaction.
The main methods of speech recognition technology
At present, representative speech recognition methods mainly include dynamic time warping technology (DTW), hidden Markov model (HMM), vector quantization (VQ), artificial neural network (ANN), and support vector machine (SVM).
Dynamic time warping (DTW) is a simple and effective method for speaker-independent speech recognition. Based on the idea of dynamic programming, the algorithm solves the problem of matching templates whose pronunciations differ in length. It is one of the earliest and most commonly used algorithms in speech recognition. When DTW is applied, the preprocessed and framed test speech signal is compared with each reference speech template, the similarity between the two is computed according to a chosen distance measure, and the best alignment path is selected.
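The dynamic-programming comparison described above can be sketched as follows. This is a minimal illustration, not a production recognizer: the Euclidean frame distance, the three allowed alignment steps, and the function names `dtw_distance` and `recognize` are assumptions made for the example.

```python
def dtw_distance(test, template):
    """Return the DTW alignment cost between two sequences of feature frames."""
    def frame_dist(a, b):
        # Euclidean distance between two feature vectors (an assumed measure).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n, m = len(test), len(template)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning test[:i] with template[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(test[i - 1], template[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(test, templates):
    """Pick the label whose reference template has the smallest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))


# Toy one-dimensional "features": the test utterance is a slower version of "one".
templates = {"one": [[0.0], [1.0], [2.0]], "two": [[5.0], [5.0], [4.0]]}
test = [[0.0], [0.0], [1.0], [2.0], [2.0]]
best_label = recognize(test, templates)
```

Because the warping path may repeat template frames, the stretched test sequence aligns with the "one" template at zero cost even though the two differ in length.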
The hidden Markov model (HMM) is a statistical model used in speech signal processing. It evolved from the Markov chain, so it is a statistical recognition method based on a parametric model. Its pattern library is not a set of pre-stored pattern samples but a set of model parameters, formed through repeated training, that best match the training output signals in probability. During recognition, the likelihood of the speech sequence to be recognized is evaluated against the HMM parameters, and the best state sequence, the one that maximizes this probability, is taken as the recognition output. This makes the HMM a well-suited model for speech recognition.
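Finding the best state sequence for an observation sequence is usually done with the Viterbi algorithm. The sketch below assumes a small discrete HMM whose probabilities are all nonzero (so that logarithms are defined); the state names, observation symbols and probability values are invented purely for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log-probability, best state sequence) for a discrete HMM."""
    # best[s] = (log-prob of the best path ending in state s, that path)
    best = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor state that maximizes the path score.
            prev = max(states, key=lambda p: best[p][0] + math.log(trans_p[p][s]))
            score = (best[prev][0] + math.log(trans_p[prev][s])
                     + math.log(emit_p[s][obs]))
            new_best[s] = (score, best[prev][1] + [s])
        best = new_best
    final = max(states, key=lambda s: best[s][0])
    return best[final]


states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
log_prob, path = viterbi(["x", "x", "y"], states, start_p, trans_p, emit_p)
```

Working in log-probabilities avoids numerical underflow on long utterances, which is why the scores are summed rather than multiplied.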
Vector quantization (VQ) is an important method of signal compression. Compared with the HMM, vector quantization is mainly used for speech recognition with small vocabularies and isolated words. The process groups several samples of the speech waveform, or several characteristic parameters, into a vector and quantizes it as a whole in a multi-dimensional space. The vector space is divided into several small regions, each of which is assigned a representative vector, and any vector that falls into a region during quantization is replaced by that representative vector. Designing a vector quantizer means training a good codebook from a large number of signal samples, finding a good distortion measure from practical results, and building a quantization system that achieves the highest possible average signal-to-noise ratio with the least search and distortion computation.
In practical applications, researchers have also studied a variety of methods to reduce complexity, including memoryless vector quantization, vector quantization with memory, and fuzzy vector quantization.
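The codebook idea behind vector quantization can be sketched in a few lines. This example assumes squared Euclidean distance as the distortion measure and uses plain k-means (Lloyd) iterations to train the codebook; both are common choices but are assumptions here, and the sample data are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distortion between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return the index of the codeword nearest to `vector`."""
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def train_codebook(samples, codebook, iterations=10):
    """Refine an initial codebook with k-means (Lloyd) iterations."""
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for v in samples:
            cells[quantize(v, codebook)].append(v)
        # Move each codeword to the centroid of its cell; keep it if the cell is empty.
        codebook = [
            [sum(col) / len(cell) for col in zip(*cell)] if cell else old
            for cell, old in zip(cells, codebook)
        ]
    return codebook


samples = [[0, 0], [0, 1], [1, 0], [9, 9], [10, 9], [9, 10]]
codebook = train_codebook(samples, [[0.0, 0.0], [10.0, 10.0]])
index = quantize([8, 8], codebook)  # each vector is replaced by one codeword index
```

After training, every input vector is represented by a single small integer, which is the source of the compression.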
The artificial neural network (ANN) is a speech recognition method proposed in the late 1980s. It is essentially an adaptive nonlinear dynamic system that simulates the principles of human neural activity. Its adaptability, parallelism, robustness, fault tolerance and learning ability, together with its powerful classification ability and input-output mapping capability, make it very attractive for speech recognition. The method is an engineering model that simulates the thinking mechanism of the human brain. In contrast to the HMM, its ability to make classification decisions and to describe uncertain information is widely recognized, but its ability to describe dynamic time signals is unsatisfactory. The MLP classifier can only solve static pattern classification problems and does not handle time series. Although scholars have proposed many structures with feedback, these are still insufficient to characterize the dynamics of time series such as speech signals. Because the ANN cannot describe the temporal dynamics of speech well, it is often combined with traditional recognition methods so that each contributes its strengths, overcoming the respective shortcomings of the HMM and the ANN. In recent years, significant progress has been made in recognition algorithms that combine neural networks with hidden Markov models. Their recognition rates are close to those of HMM-based recognition systems, and they further improve the robustness and accuracy of speech recognition.
The support vector machine (SVM) is a learning model that applies statistical learning theory. It adopts structural risk minimization (SRM), which effectively overcomes the shortcomings of traditional empirical risk minimization. Because it balances training error against generalization ability, it performs well on small-sample, nonlinear and high-dimensional pattern recognition problems and has been widely applied in the field of pattern recognition.
Application of deep neural networks in speech recognition models
Deep learning is a general term for machine learning methods that use multiple layers of nonlinear signal and information processing to perform tasks such as signal conversion, feature extraction and pattern classification, through supervised or unsupervised training. Because a deep structure model is used to process signals and information, it is called "deep" learning. Many traditional machine learning models are shallow structure models, such as support vector machines, GMM, HMM, conditional random fields, linear or nonlinear dynamic systems, and neural networks with a single hidden layer.
What these models have in common is that the original input signal passes through relatively few levels of linear or nonlinear processing (usually one layer). The advantage of a shallow model is that its mathematics is relatively complete and its structure is simple and easy to learn. However, because a shallow model combines few linear or nonlinear transformations, it cannot effectively learn the complex structural information in a signal, and its ability to express complex signals is limited. A deep structure model is better suited to complex signals because its multiple layers of nonlinear transformation give it stronger expressive and modeling capabilities.
The generation and perception of human speech is an extremely complicated process, and it has been shown biologically to involve an obviously multi-level, even deep, processing structure. For speech recognition tasks, shallow structure models therefore clearly have great limitations. Using the multi-layer nonlinear transformations of a deep structure to extract structured and higher-level information from speech signals is a more reasonable choice.
Application and Limitations of DNN in Speech Recognition System
Since 2011, speech recognition based on the DNN-HMM acoustic model has achieved significant and consistent improvements over the traditional GMM-HMM acoustic model across multiple languages and tasks. The basic framework of the DNN-HMM speech recognition system is shown in the figure. It differs from the traditional GMM-HMM system in that a DNN replaces the GMM for modeling the observation probabilities of speech. The feedforward deep neural network was the first mainstream deep neural network because it is relatively simple.
Feature extraction for speech recognition requires first windowing and framing the waveform and then extracting features from each frame. The input for training a GMM is a single frame of features, whereas a DNN generally takes several adjacent frames spliced together as input. This allows structural information over a longer span of the speech signal to be described. Research shows that spliced-feature input is a key factor in the significant performance gains of the DNN over the GMM. Because of coarticulation, speech is a complex time-varying signal with strong correlation between frames. The pronunciation of a word is influenced by several words before and after it, and the length of this influence changes with the content of the speech. Although splicing frames lets the model learn a certain amount of context, the window length of the DNN input (i.e., the number of spliced frames) is fixed in advance, so the DNN can only learn a mapping from a fixed-length input window. This leaves it without enough flexibility to model longer-term correlations in the temporal information.
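Frame splicing as described above can be illustrated with a short sketch. The context widths `left` and `right` and the edge policy (repeating the first or last frame) are assumptions for the example; real systems choose these per task.

```python
def splice_frames(frames, left=5, right=5):
    """Concatenate each frame with `left` past and `right` future frames.

    Edges are padded by repeating the first or last frame (an assumed policy).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for k in range(t - left, t + right + 1):
            # Clamp the index so that out-of-range positions reuse the edge frame.
            window.extend(frames[min(max(k, 0), n - 1)])
        spliced.append(window)
    return spliced


# Four one-dimensional frames spliced with one frame of context on each side.
frames = [[1], [2], [3], [4]]
spliced = splice_frames(frames, left=1, right=1)
```

Each output vector has `(left + 1 + right)` times the original dimension, and that width is fixed before training, which is exactly the limitation the paragraph points out.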
Application of recurrent neural network in acoustic model
Speech signals exhibit significant coarticulation, so long-term correlation must be considered. Thanks to the strong long-term modeling ability of the recurrent neural network (RNN), the RNN has gradually replaced the DNN as the mainstream modeling solution for speech recognition. The network structures of the DNN and the RNN are shown in the figure. The RNN adds a feedback connection to the hidden layer, which is its biggest difference from the DNN. This means that the input to the RNN's hidden layer at the current moment includes not only the output of the previous layer but also the hidden layer's own output at the previous moment. This recurrent feedback connection allows the RNN, in principle, to see information from all previous moments, which amounts to giving the RNN a memory of history. For temporal signals such as speech, modeling with an RNN is therefore more appropriate.
However, the traditional RNN suffers from vanishing gradients during training, which makes the model difficult to train. To overcome this problem, researchers proposed the long short-term memory RNN (LSTM-RNN). The LSTM-RNN uses input gates, output gates and forget gates to control the flow of information, so that the gradient can propagate stably over a relatively long time span. When processing the current frame, the bidirectional LSTM-RNN (BLSTM-RNN) can use both past and future speech information, which makes accurate decisions easier, and it therefore achieves better performance than the unidirectional LSTM.
Although the bidirectional LSTM-RNN performs better, it is not suitable for real-time systems. Because it uses future information over a long span, the system has a large delay, so it is mainly used for offline speech recognition tasks. For this reason, researchers proposed model structures such as the latency-controlled BLSTM (LC-BLSTM) and the row-convolution BLSTM, which attempt a compromise between the unidirectional LSTM and the BLSTM: the forward LSTM remains unchanged, while the backward LSTM, which is used to see future information, is optimized. In the LC-BLSTM architecture, the standard backward LSTM is replaced by a backward LSTM with a look-ahead of at most N frames, whereas the row-convolution model integrates a row convolution with an N-frame look-ahead in its place.
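The effect of the feedback connection can be seen in a minimal simple-RNN (Elman) layer. The sketch below is an illustration under assumptions: a tanh activation, list-based weight matrices named `w_in` and `w_rec`, and toy values chosen only to show that an input at the first moment still influences later hidden states.

```python
import math


def rnn_forward(inputs, w_in, w_rec, bias):
    """Run a simple (Elman) RNN layer over a sequence; return all hidden states."""
    hidden_size = len(bias)
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(w_in[j][i] * x[i] for i in range(len(x)))           # previous layer
                + sum(w_rec[j][k] * h[k] for k in range(hidden_size))   # previous moment
                + bias[j]
            )
            for j in range(hidden_size)
        ]
        states.append(h)
    return states


# An impulse at the first frame, followed by silence.
inputs = [[1.0], [0.0], [0.0]]
with_feedback = rnn_forward(inputs, w_in=[[1.0]], w_rec=[[0.5]], bias=[0.0])
without_feedback = rnn_forward(inputs, w_in=[[1.0]], w_rec=[[0.0]], bias=[0.0])
```

With the recurrent weight set to zero the layer behaves like a feedforward layer and forgets the impulse immediately; with feedback, the later hidden states remain nonzero.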
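The gating mechanism can be sketched as a single LSTM time step. This follows the commonly used LSTM formulation without peephole connections, which is an assumption since the text does not specify a variant; the `params` layout, gate names and toy weights are likewise hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(w, x, u, h, b):
    """Compute w @ x + u @ h + b for small list-based matrices."""
    return [
        sum(wi * xi for wi, xi in zip(w_row, x))
        + sum(ui * hi for ui, hi in zip(u_row, h))
        + b_j
        for w_row, u_row, b_j in zip(w, u, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params[name] is a (W, U, b) triple for each gate."""
    def gate(name, activation):
        w, u, b = params[name]
        return [activation(z) for z in affine(w, x, u, h_prev, b)]

    i = gate("input", sigmoid)         # how much new content to write
    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new content
    c = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c


# Toy cell: forget gate almost fully open, input gate almost closed.
params = {
    "input": ([[0.0]], [[0.0]], [-20.0]),
    "forget": ([[0.0]], [[0.0]], [20.0]),
    "output": ([[0.0]], [[0.0]], [0.0]),
    "candidate": ([[1.0]], [[0.0]], [0.0]),
}
h, c = lstm_step([1.0], [0.0], [0.5], params)
```

When the forget gate stays near 1, the cell state is carried forward almost unchanged, which is what lets the gradient survive over long time spans.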
FSMN-based speech recognition system
At present, many academic and industrial organizations around the world conduct research within the RNN framework. The BLSTM-RNN-based speech recognition system, currently the most effective, suffers from excessive delay, which makes it unsuitable for real-time voice interaction systems such as voice input. Although the BLSTM can be adapted for real-time voice interaction through LC-BLSTM and row-convolution BLSTM, the RNN has a more complex structure than the DNN, so training RNN models on massive data takes a great deal of time. Finally, because the RNN fits context correlation more strongly, it falls into over-fitting more easily than the DNN and is prone to additional anomalous recognition errors caused by local problems in the training data.
To solve these problems, iFLYTEK combined the traditional DNN framework with the characteristics of the RNN and developed a new framework called the feedforward sequential memory network (FSMN), as shown in the figure. The FSMN uses a non-recurrent feedforward structure and needs only 180 ms of delay to match the performance of the BLSTM-RNN.
The structure of the FSMN is shown in the figure. It is mainly an improvement on the traditional DNN structure: a “memory module” is added next to the hidden layer of the DNN, and this module stores the historical and future information of the speech signal that is useful for judging the current speech frame. The figure shows the memory module unrolled in time over N frames of speech. The length N of the history and future information to be remembered can be adjusted to the needs of the actual task. The memory function of the FSMN memory block is implemented with a feedforward structure, unlike the traditional RNN, which relies on recurrent feedback. Storing information with this feedforward structure has two major advantages. First, a traditional bidirectional RNN must wait for the end of the speech input before it can judge the current speech frame, whereas a bidirectional FSMN only needs to wait for a limited number of future speech frames when remembering future information. This makes the delay of the FSMN controllable. Experiments show that a bidirectional FSMN with its delay controlled at 180 ms achieves the same effect as a traditional bidirectional RNN. Second, a traditional simple RNN cannot in practice remember unlimited history; because of the vanishing gradient problem during training, it can only remember history of limited length. The memory network of the FSMN, however, is based entirely on feedforward unrolling. During model training, the gradient is propagated back to each moment along the connection weights between the memory block and the hidden layer. These connection weights determine how much the input at each moment influences the judgment of the current speech frame; this gradient propagation is trainable, and its attenuation at any moment is constant.
This implementation gives the FSMN a long-term memory like that of the LSTM, which amounts to solving the vanishing gradient problem of the traditional RNN in a simpler way. In addition, because the FSMN is based entirely on a feedforward neural network structure, it parallelizes better and makes fuller use of GPU computing power, which yields a more efficient training process. The FSMN structure also performs better in terms of stability.
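A memory block of this kind can be sketched as a weighted sum over a fixed window of past and future hidden-layer frames. The sketch below assumes scalar tap coefficients (one weight per time offset); the function name `fsmn_memory` and the toy coefficients are hypothetical and are not taken from the system described in the text.

```python
def fsmn_memory(hidden, past_taps, future_taps):
    """Memory-block output for every frame of `hidden` (a list of vectors).

    past_taps[i] weights frame t - i (i = 0 is the current frame);
    future_taps[j] weights frame t + j + 1. Frames outside the utterance are skipped.
    """
    n, dim = len(hidden), len(hidden[0])
    memory = []
    for t in range(n):
        m = [0.0] * dim
        for i, a in enumerate(past_taps):
            if t - i >= 0:
                m = [mv + a * hv for mv, hv in zip(m, hidden[t - i])]
        for j, c in enumerate(future_taps, start=1):
            if t + j < n:
                m = [mv + c * hv for mv, hv in zip(m, hidden[t + j])]
        memory.append(m)
    return memory


# Three hidden-layer frames, one frame of history and one frame of look-ahead.
hidden = [[1.0], [2.0], [4.0]]
memory = fsmn_memory(hidden, past_taps=[1.0, 0.5], future_taps=[0.25])
```

The look-ahead is bounded by `len(future_taps)` frames, so the added delay is fixed in advance, and every output frame can be computed independently of the others, which is what makes the structure easy to parallelize.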
Speech recognition system based on convolutional neural network
The convolutional neural network (CNN), whose core is the convolution operation (or convolutional layer), is another model that can effectively use long-term contextual information. Following the successful application of the DNN to large-vocabulary continuous speech recognition, the CNN was reintroduced within the DNN-HMM hybrid model architecture. It was originally reintroduced to improve the stability of the model by handling variability along the frequency axis, because the HMM in the hybrid model already copes well with variable-length utterances in speech recognition. The early CNN-HMM models used only one or two convolutional layers stacked with fully connected DNN layers. Later, RNN layers such as the LSTM were also integrated into the model, forming the so-called CNN-LSTM-DNN (CLDNN) architecture.
Speech recognition based on the CNN-HMM framework has attracted a large number of researchers, but there have been few major breakthroughs. There are two basic reasons. First, these systems still take fixed-length spliced speech frames as input, following the approach of traditional feedforward neural networks, so the model does not see enough context. Second, they use few convolutional layers, generally only one or two, and treat the CNN as a feature extractor; the expressive power of such a convolutional network structure is very limited. In response to these problems, in 2016 iFLYTEK launched a new speech recognition framework called the deep fully convolutional neural network (DFCNN). Experiments have shown that DFCNN improves the recognition rate by more than 15% over the BLSTM speech recognition system, the strongest system in academia and industry.
As shown in the figure, DFCNN first applies a Fourier transform to the time-domain speech signal to obtain its spectrogram. DFCNN takes the whole utterance, converted directly into an image, as input, and its output units correspond directly to the final recognition results (such as syllables or Chinese characters). In the structure of DFCNN, time and frequency are the two dimensions of the image, and a combination of many convolutional and pooling layers is used to model the whole utterance. The principle of DFCNN is to treat the spectrogram as an image with specific patterns, from which an experienced phonetician can read what was said.
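Turning a waveform into a time-frequency image can be sketched with a short-time Fourier transform. The example below is only illustrative: the frame length, hop size and Hann window are assumptions, and the direct DFT is written out for clarity where a real system would use an FFT.

```python
import cmath
import math


def spectrogram(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of magnitudes for bins 0..frame_len // 2."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]  # Hann window
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            mags.append(abs(coeff))
        frames.append(mags)
    return frames


# A pure tone that completes 8 cycles per 64-sample frame.
signal = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
```

The result is a two-dimensional array indexed by time (frames) and frequency (bins), which is the kind of image a convolutional network can take as input.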
To understand the advantages of DFCNN, it helps to look at its input, its model structure and its output in turn. First, at the input end, traditional speech recognition systems extract features after the Fourier transform using various hand-designed filters, such as the log Mel filter bank, which loses information in the frequency domain, most noticeably in the high-frequency region. In addition, traditional speech features use a very large frame shift to reduce computation, which loses information in the time domain; this problem is more pronounced when the speaker talks quickly. DFCNN takes the spectrogram as input and so avoids information loss in both the frequency domain and the time domain, which gives it a natural advantage. Second, in terms of model structure, DFCNN borrows the best-performing network configurations from image recognition to strengthen the expressive power of the CNN. At the same time, to ensure that it can express the long-term correlation of speech, DFCNN stacks convolutional and pooling layers until it can see sufficiently long historical and future information. With these two points, DFCNN performs better than the BLSTM network structure. Finally, in terms of output, DFCNN is flexible and can easily be combined with other modeling methods, for example with connectionist temporal classification (CTC) to train the acoustic model end to end. Experiments have shown that on a Chinese speech recognition task with tens of thousands of hours of data, the DFCNN system achieves a further 15% performance improvement over the BLSTM-CTC system, the industry's strongest speech recognition framework.
Training of neural network acoustic model under large-scale speech data
Compared with the traditional GMM-HMM system, the DNN-HMM-based speech recognition system has achieved a tremendous performance improvement. But training the DNN acoustic model is very time-consuming. For example, training an acoustic model on 20,000 hours of speech data on a CPU configured with E5-2697 v4 takes about 116 days. The underlying cause is that neural network training uses the stochastic gradient descent (SGD) algorithm as its basic algorithm. SGD converges relatively slowly, and because it is a serial algorithm, it is difficult to parallelize. At present, the training data of mainstream industrial speech recognition systems generally ranges from several thousand hours to tens of thousands of hours. Improving the training speed and efficiency of deep neural networks on large-scale speech data has therefore become both a research hotspot and a problem that must be solved.
The model parameters of a deep neural network are very sparse. Exploiting this, more than 80% of the smaller parameters in a deep neural network model can be set to 0 with almost no performance loss, and the model size is greatly reduced. The training time, however, is not significantly reduced, because the highly random memory access caused by parameter sparsity is hard to optimize. Going further, representing each weight matrix in the deep neural network as the product of two low-rank matrices achieves an efficiency improvement of 30% to 50%.
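Both ideas in this paragraph can be illustrated with a small sketch. Magnitude-based pruning and the parameter count for a rank-r factorization are shown; the matrix sizes, the rank of 256 and the function names are hypothetical values chosen for the example, and ties at the pruning threshold may zero slightly more than the requested fraction.

```python
def prune_smallest(weights, fraction=0.8):
    """Zero out the `fraction` of entries with the smallest magnitude."""
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * fraction)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = flat[n_prune - 1]
    # Entries at or below the threshold magnitude are set to zero.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def low_rank_param_count(rows, cols, rank):
    """Parameters needed to store W as A @ B, with A: rows x rank and B: rank x cols."""
    return rank * (rows + cols)


weights = [[0.1, -0.2, 0.3, -0.4, 0.5],
           [0.6, -0.7, 0.8, -0.9, 1.0]]
pruned = prune_smallest(weights, fraction=0.8)

full_params = 2048 * 2048                                  # one dense 2048 x 2048 layer
factored_params = low_rank_param_count(2048, 2048, 256)    # same layer at rank 256
```

Pruning leaves the matrix shape unchanged, so dense matrix multiplication costs the same unless sparse kernels are used, whereas the factored form replaces one large multiplication with two smaller dense ones.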
Parallel training on multiple CPUs or GPUs is another feasible way to improve the efficiency of neural network training. The usual approach is to divide the training data into many small blocks and send them to different machines, which perform the matrix operations in parallel. One optimization scheme works as follows: in each iteration of the model, the training data is first divided into N completely disjoint subsets, a sub-MLP is then trained on each subset, and finally these sub-MLPs are merged. To further improve parallel efficiency, this method has been implemented on computing clusters with thousands of CPU cores, where the training of deep networks mainly uses the asynchronous gradient descent algorithm. The asynchronous gradient descent algorithm has also been applied to multiple GPUs. A pipelined BP algorithm has been proposed as well, which uses different GPU units to compute different layers of the neural network to achieve parallel training. Experiments show that, compared with training on a single GPU, this method achieves a speedup of about 3.1 times using four GPUs. However, extremely frequent data transfer between the computing units has become the major bottleneck in improving the training efficiency of such methods. Therefore, to better realize parallel training of neural networks, a new multi-deep-neural-network modeling method based on state clustering has been proposed. The method first clusters the training data at the state level and then divides it into disjoint subsets according to the state clusters. This greatly reduces the data transfer between the neural networks on different computing units, so that each neural network can be trained completely independently and in parallel. Experiments on the SWB (SwitchBoard) dataset with four GPUs show that the state-clustering multi-network method achieves about a 4-fold improvement in training efficiency when the number of clusters is 4.
In addition to its extensive application in acoustic modeling, deep learning has also been applied to another important component of speech recognition systems, the language model. Before deep neural networks became popular, speech recognition systems mainly used the traditional statistical N-gram language model. The N-gram model has obvious advantages: its structure is simple and it trains very efficiently. However, its number of parameters grows exponentially with the order and the vocabulary size, so higher orders cannot be used and performance easily hits a bottleneck. When the training corpus is relatively sparse, mature smoothing algorithms such as discounting and backing-off can be used to estimate the probabilities of low-frequency and unseen words and obtain more reliable model estimates.
In the early 2000s, some shallow feedforward neural networks were used for statistical language modeling. The neural network language model is a continuous-space language model. Its smooth word probability distribution function makes it more robust when estimating the probabilities of low-frequency and unseen words in the training corpus, and it generalizes better; it has also achieved remarkable results in speech recognition tasks. In recent years, researchers have used deep neural networks for language modeling and achieved further performance improvements.
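Discounting and backing-off can be illustrated with a tiny bigram model. The sketch below uses interpolated absolute discounting with a unigram fallback, which is one common smoothing scheme chosen here as an assumption; the class name `BigramLM`, the discount value and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


class BigramLM:
    """Bigram language model with interpolated absolute discounting."""

    def __init__(self, sentences, discount=0.5):
        self.discount = discount
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for words in sentences:
            self.unigrams.update(words)
            for prev, word in zip(words, words[1:]):
                self.bigrams[prev][word] += 1
        self.total = sum(self.unigrams.values())

    def unigram_prob(self, word):
        return self.unigrams[word] / self.total

    def prob(self, prev, word):
        followers = self.bigrams.get(prev)
        if not followers:
            return self.unigram_prob(word)  # unseen context: back off fully
        context_count = sum(followers.values())
        # Subtract a fixed discount from every seen bigram count ...
        discounted = max(followers[word] - self.discount, 0.0) / context_count
        # ... and hand the freed probability mass to the unigram distribution.
        backoff_mass = self.discount * len(followers) / context_count
        return discounted + backoff_mass * self.unigram_prob(word)


lm = BigramLM([["a", "b"], ["a", "c"], ["a", "b"]])
seen = lm.prob("a", "b")     # bigram observed twice
unseen = lm.prob("a", "a")   # bigram never observed, still gets nonzero probability
```

The unseen bigram receives probability only through the backed-off unigram term, while the probabilities over the vocabulary still sum to one for a given context.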
The relationship between deep learning, big data, and cloud computing
Speech recognition based on deep learning moved to center stage in the early 21st century not only because of advances in deep learning algorithms but also because the three elements of big data, cloud computing and deep learning promoted one another.
The earlier GMM-HMM speech recognition framework has limited expressive power, and its performance saturates easily on large-scale data. By contrast, the deep structure of multi-layer nonlinear transformations in the deep learning framework has stronger expressive and modeling ability. It gives the speech recognition model an unprecedented ability to mine and learn from complex data, so that massive data can be fully exploited. Big data “nurtures” deep learning algorithms like milk powder, making them more and more powerful.
With the popularity of mobile Internet, Internet of Things technologies and products, it is more important to adopt cloud computing methods to enable multiple types of massive data to be collected in the cloud. The requirement for large-scale data computing has significantly increased the reliance on cloud computing methods, so cloud computing has become one of the key drivers of this deep learning revolution. The deployment of the deep learning framework in the cloud significantly enhances the power of cloud computing.
It is precisely this mutual promotion of deep learning, big data and cloud computing that has driven the progress of speech technology and set off the current wave of artificial intelligence.
