
Hidden Markov Model and Speech Recognition

Seminar report On
Hidden Markov Model and Speech Recognition
by
Nirav S. Uchat
Roll No: 06305906
under the guidance of
Prof. Sivakumar
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai

Contents
1 Introduction
2 Mathematical Understanding of Hidden Markov Model
   2.1 Discrete Markov Process
   2.2 Extension to Hidden Markov Model
   2.3 Elements Of Hidden Markov Model
   2.4 The Three Basic Problems for HMMs
   2.5 Solutions to the Three Problems
      2.5.1 Forward Algorithm for the Evaluation Problem [P(O|λ)]
      2.5.2 Viterbi Algorithm for Decoding the Hidden State Sequence P(Q, O|λ)
      2.5.3 Baum-Welch Algorithm for Learning
3 Speech Recognition and HMM
   3.1 Block Diagram Of Speech Recognition
   3.2 Vector quantization
      3.2.1 How to generate the codebook (K-means clustering algorithm)
      3.2.2 Searching algorithm
   3.3 Evaluation Problem and Viterbi Algorithm
   3.4 Language model
4 Isolated Word Recognizer
   4.1 Architecture Of speech recognition
   4.2 Example
5 Conclusion and Future Work

Abstract
Building a signal model for speech recognition is a challenging task, but it gives us a great deal of information about the problem being modeled. In this seminar we will see how the Hidden Markov Model (HMM) is used to model a speech recognition application. We start with a mathematical understanding of HMMs, followed by the three basic problems associated with them and their solutions. We then move to the block diagram of speech recognition, which includes feature extraction, acoustic modeling and the language model, all of which work in tandem to generate the search graph. The use of HMMs in acoustic modeling is explained. At the end we look at an isolated word recognizer built using HMMs.

1 Introduction
Real-world processes generally produce observable outputs which can be characterized as signals. The signals can be discrete or continuous in nature. The signal source can be stationary (i.e., its statistical properties do not vary with time) or non-stationary (i.e., the signal properties vary over time). The signals can be pure or can be corrupted by other signal sources or by transmission distortions.[2] A problem of fundamental interest is characterizing such a signal in terms of a signal model. A signal model gives us
• a theoretical description of a signal processing system which can be used to process the signal so as to provide the desired output, and
• a way to learn a great deal about the signal source without having to have the source available.
One can classify signal models into two types:
• Deterministic Model: it exploits some known property of the signal (such as the amplitude of the wave).
• Statistical Model: it takes the statistical properties of the signal into account. Examples of this type of model are the Gaussian model, Poisson model, Markov model and Hidden Markov Model.
Speech Recognition: Speech recognition is the process of converting a speech signal into a sequence of words. Various approaches have been used for speech recognition, including dynamic programming and neural networks.
Figure 1: Classification

In this seminar we try to bridge speech recognition and HMMs and figure out how HMMs can be used effectively for the speech recognition problem.
Section 2 gives a mathematical understanding of the Hidden Markov Model. It focuses on the three fundamental problems for HMMs, namely: the probability of the observation sequence given the model, i.e., P(O|λ); the determination of the single best state sequence, given the observation sequence O = O1, O2, ..., OT; and the adjustment of the model parameters λ = (A, B, π) to maximize the recognition probability. It also describes methods to solve these problems efficiently.
Section 3 explains the block diagram of a speech recognition system. We start with acoustic model design using vector quantization, which is used to convert feature vectors to symbols. It also explains how the algorithms described in Section 2 are used to solve the problems associated with speech recognition. This section discusses in detail the evaluation problem and the Viterbi algorithm for finding the “single best” state sequence. Finally, the bi-gram language model is explained.
In Section 4, we apply all the techniques discussed in the previous sections to understand the working of an isolated word recognizer.
2 Mathematical Understanding of Hidden Markov Model
Why the Hidden Markov Model for speech recognition?
• HMMs are very rich in mathematical structure and hence can form the theoretical basis for a wide range of applications.
• HMM-based models, when applied properly, work well in practice for several important applications.
2.1 Discrete Markov Process
Consider a system which may be described at any time as being in one of a set of N distinct states S1, S2, S3, ..., SN. At regular time intervals the system undergoes a change of state (possibly back to the same state) according to a set of probabilities associated with the state. We denote the time instants associated with state changes as t = 1, 2, ..., and the actual state at time t as qt. A full probabilistic description of the above system would, in general, require specification of the current state as well as all the predecessor states. For the special case of a discrete, first-order Markov chain, this probabilistic description is truncated to just the current and the predecessor state, i.e.,

P[qt = Sj | qt−1 = Si, qt−2 = Sk, ...] = P[qt = Sj | qt−1 = Si]

Furthermore, we consider only those processes in which the right-hand side of the above equation is independent of time, thereby leading to the set of state transition probabilities aij of the form

aij = P[qt = Sj | qt−1 = Si], 1 ≤ i, j ≤ N

with the state transition coefficients having the properties

aij ≥ 0 and ∑_{j=1}^{N} aij = 1

The above process could be called an observable Markov model, since the output of the process is the set of states at each instant of time, where each state corresponds to an observable event.
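To make the observable Markov model concrete, the following minimal sketch simulates such a chain. The three-state transition matrix, initial distribution and state names are illustrative values invented for this example, not taken from the report.

```python
import random

# Illustrative 3-state Markov chain (states and probabilities are made up for this sketch).
states = ["S1", "S2", "S3"]
A = [  # A[i][j] = P(next state = Sj | current state = Si); each row sums to 1
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
]
pi = [0.5, 0.3, 0.2]  # initial state distribution

def sample(dist):
    """Draw an index according to a discrete probability distribution."""
    r, cum = random.random(), 0.0
    for i, p in enumerate(dist):
        cum += p
        if r < cum:
            return i
    return len(dist) - 1

def simulate(T):
    """Generate a state sequence of length T from the chain (A, pi)."""
    q = sample(pi)
    seq = [states[q]]
    for _ in range(T - 1):
        q = sample(A[q])   # the next state depends only on the current state
        seq.append(states[q])
    return seq

print(simulate(10))
```

Here the output of the process is the state sequence itself, which is exactly what the hidden extension below relaxes.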
2.2 Extension to Hidden Markov Model
In the above model each state corresponds to an observable event. This model is too restrictive to be applicable to many problems of interest. We now extend this idea to the case where the observation is a probabilistic function of the state. To get a clear understanding we take a coin tossing example.
Coin Toss Models: Assume the following scenario. You are in a room with a barrier through which you cannot see what is happening. On the other side of the barrier is another person who is performing a coin (or multiple coin) tossing experiment. The other person will not tell you anything about what exactly he is doing; he will only tell you the result of each coin flip. Thus a sequence of hidden coin tossing experiments is performed, with the observation sequence consisting of a series of heads and tails; e.g., a typical observation sequence would be
O = O1 O2 O3 ··· OT = (Head Tail Head ··· Tail)
To model this coin tossing experiment with an HMM, the fundamental questions are:
• What do the states in the model correspond to?
• How many states should there be in the model?
To understand this better, let us take three different HMMs:
1. A single biased coin with two states - each state representing head or tail
2. Two biased coins with two states - each state representing a single coin
3. Three biased coins with three states - each state representing a single coin
In case 1 a single biased coin is being tossed. We can model this situation with a 2-state model where each state corresponds to a side of the coin (i.e., heads or tails). This model is depicted in Fig. 2. In this case the Markov model is observable, and the only issue for complete specification of the model is deciding the best values for the probabilities of heads and tails. In case 2 there are 2 states in the model and each state corresponds to a different biased coin being tossed. Each state is characterized by a probability distribution over heads and tails, and transitions between states are characterized by a state transition matrix. The third form of HMM for explaining the observed sequence of coin toss outcomes is also given in Fig. 2. This model corresponds to using 3 biased coins, and choosing among the three based on some probabilistic event. Given the choice among the three models shown in Fig. 2 for explaining the observed sequence of heads and tails, a natural question is which model best matches the actual observations.

Figure 2: Three possible Markov models which can account for the coin tossing experiment
It should be clear that
• the simple 1-coin model has only 1 unknown parameter
• the 2-coin model has 4 unknown parameters
• the 3-coin model has 12 unknown parameters
Thus, with their greater degrees of freedom, the larger HMMs would seem more capable of modeling a series of coin tossing experiments than equivalently smaller models. However, it might be the case that only a single coin is being tossed; then using the 3-coin model would be inappropriate, since the actual physical event would not correspond to the model being used, i.e., we would be using an underspecified system.[2]
2.3 Elements Of Hidden Markov Model
We now define the elements of an HMM. An HMM is characterized by the following:
1. N, the number of states
2. M, the number of distinct observation symbols per state; the symbols are V = {V1, V2, ..., VM}

Figure 3: Elements of HMM
3. The state transition probability distribution A = {aij}, where aij = P[qt+1 = Sj | qt = Si], 1 ≤ i, j ≤ N
4. The observation symbol probability distribution in state j, B = {bj(k)}, where bj(k) = P[Vk at t | qt = Sj]
5. The initial state distribution π = {πi}, where πi = P[q1 = Si], 1 ≤ i ≤ N [2]
Given appropriate values of N, M, A, B and π, the HMM can be used as a generator to produce an observation sequence O = O1 O2 O3 ··· OT.
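As a small illustration of the HMM as a generator, the sketch below samples an observation sequence from a toy model λ = (A, B, π). The two-state, three-symbol model values are invented for this example.

```python
import random

# Toy HMM lambda = (A, B, pi) with N = 2 states and M = 3 symbols (values invented for illustration).
A  = [[0.7, 0.3],
      [0.4, 0.6]]        # A[i][j] = P(q_{t+1} = Sj | q_t = Si)
B  = [[0.5, 0.4, 0.1],
      [0.1, 0.3, 0.6]]   # B[j][k] = P(observe symbol Vk | state Sj)
pi = [0.6, 0.4]          # initial state distribution

def sample(dist):
    r, cum = random.random(), 0.0
    for i, p in enumerate(dist):
        cum += p
        if r < cum:
            return i
    return len(dist) - 1

def generate(T):
    """Generate an observation sequence O = O1 ... OT from the HMM."""
    q, O = sample(pi), []
    for _ in range(T):
        O.append(sample(B[q]))   # emit a symbol according to the current state's distribution
        q = sample(A[q])         # move to the next state
    return O

print(generate(8))   # e.g. [0, 2, 1, ...] -- indices into the symbol set V
```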

2.4 The Three Basic Problems for HMMs
Problem 1: Evaluation Problem
Given the observation sequence O = O1 O2 ··· OT and the model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?
Problem 1 is the evaluation problem: given an observation sequence and a model, how do we compute the probability that the observed sequence was produced by the model? We can think of it as a scoring problem: if we have to choose among many competing models, the one with the maximum probability gives the best result.
Figure 4: Evaluation Problem
Problem 2: Hidden State Determination (Decoding)
Given the observation sequence O = O1 O2 ··· OT and the model λ = (A, B, π), how do we choose a corresponding state sequence Q = q1 q2 ··· qT which is optimal in some meaningful sense?
Problem 2 is the one which attempts to uncover the hidden part of the problem. There is no single “correct” solution; in practice we use an optimality criterion to find the best possible state sequence. For the coin tossing example with observation sequence O = H H T T H, a corresponding state sequence in each of the models of Fig. 2 could be
Model 1: S = 1 1 2 2 1
Model 2: S = 2 1 1 1 2
Model 3: S = 3 1 2 3 2
Problem 3: Learning
How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
Problem 3 is the one in which we try to optimize the model parameters so as to best describe how a given observation sequence comes about. The observation sequences used here are called “training” sequences, since they are used to train the HMM. Training is one of the crucial elements of HMM usage: it allows us to adjust the model parameters so as to create the best model for the given training sequences.[2]
2.5 Solutions to the Three Problems
Given these problems, it is vital to solve them using efficient algorithms.

2.5.1 Forward Algorithm for the Evaluation Problem [P(O|λ)]
We want to find P(O|λ), given the observation sequence O = O1, O2, O3, ..., OT. The most straightforward way to find the solution is to enumerate every possible state sequence of length T. Consider one such state sequence Q = q1, q2, q3, ..., qT, such that q1 produces O1 with some probability, q2 produces O2 with some probability, and so on. Using the chain rule,

P(O|λ) = ∑_{q1,q2,...,qT} πq1 bq1(O1) aq1q2 bq2(O2) ··· aqT−1qT bqT(OT)

but the cost of this direct expansion is of the order of N^T, since at every t = 1, 2, ..., T there are N possible states which can be reached. This is clearly an inefficient algorithm; to overcome it the forward algorithm is used.
Forward Algorithm: The forward algorithm is a dynamic programming algorithm which uses the forward variable αt(i), defined as

αt(i) = P(O1, O2, ..., Ot, qt = Si | λ)

i.e., the probability of the partial observation sequence O1, O2, ..., Ot and state Si at time t, given the model λ. Because only the partial observation is taken into account, αt(i) can be computed inductively:

Initialization: α1(i) = πi bi(O1), 1 ≤ i ≤ N
Induction: αt+1(j) = [ ∑_{i=1}^{N} αt(i) aij ] bj(Ot+1), 1 ≤ t ≤ T − 1, 1 ≤ j ≤ N
Termination: P(O|λ) = ∑_{i=1}^{N} αT(i)

Figure 5: Forward variable calculation
Since αt(i) is the probability of the partial observation sequence O1 to Ot with state Si at time t, the product αt(i) aij is the probability of the joint event that O1 to Ot are observed and state Sj is reached at time t + 1 via state Si at time t. Summing over all N possible states Si (1 ≤ i ≤ N) at time t results in the probability of being in Sj at time t + 1 with all the accompanying partial observations. Multiplying this quantity by bj(Ot+1) gives αt+1(j). This computation is performed for all states j (1 ≤ j ≤ N) for a given t, and it is iterated through t = 1, 2, 3, ..., T − 1. Finally, the required P(O|λ) is the sum of the terminal forward variables αT(i); this is true because αT(i) = P(O1, O2, ..., OT, qT = Si | λ). This algorithm requires on the order of N²T calculations rather than the N^T of the chain-rule expansion.[1][2]
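A minimal sketch of the forward recursion follows, using the same list-based A, B, π conventions as the generator sketch above; the toy model values are invented for illustration.

```python
def forward(A, B, pi, O):
    """Compute P(O | lambda) with the forward algorithm.

    A[i][j] : P(q_{t+1} = Sj | q_t = Si)
    B[j][k] : P(symbol Vk | state Sj)
    pi[i]   : P(q_1 = Si)
    O       : observation sequence as a list of symbol indices
    """
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(O_{t+1})
    for t in range(1, len(O)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)

# Toy model (values invented for illustration)
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
print(forward(A, B, pi, [0, 1, 2]))   # probability of observing the symbols with indices 0, 1, 2
```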

2.5.2 Viterbi Algorithm for Decoding the Hidden State Sequence P(Q, O|λ)
There are various ways in which an optimal solution can be defined; the difficulty lies in the definition of the optimal state sequence. One way is to choose the states qt which are individually most likely. But this has a serious flaw: if two states i and j are selected such that aij = 0, then despite being the most likely states at times t and t + 1, they do not form a valid state sequence. So deciding the optimality criterion is crucial. A way to find the “single best” state sequence is a dynamic programming algorithm called the Viterbi algorithm.
Viterbi Algorithm: To find the single best state sequence Q = q1, q2, q3, ..., qT (which produces the given observation sequence O = O1, O2, O3, ..., OT), we define the quantity

δt(i) = max_{q1,q2,...,qt−1} P[q1 q2 ··· qt = i, O1 O2 ··· Ot | λ]

i.e., δt(i) is the best score along a single path, at time t, which accounts for the first t observations and ends in state Si. By induction,

δt+1(j) = [ max_i δt(i) aij ] bj(Ot+1)

In order to retrieve the state sequence we need to keep track of the state which maximizes the above equation; we do this via an array ψt(j) for each t and state j. Once the final state is reached, the corresponding state sequence is recovered by backtracking. The key point is that the Viterbi algorithm is similar in implementation to the forward algorithm (except for the backtracking part); the major difference is a maximization over the previous states in place of the summing procedure of the forward calculation.[1][2]
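A minimal sketch of the Viterbi recursion with backtracking, using the same toy A, B, π conventions as the forward sketch above (all model values are illustrative assumptions):

```python
def viterbi(A, B, pi, O):
    """Return the single best state sequence and its probability P(Q*, O | lambda)."""
    N, T = len(pi), len(O)
    # Initialization: delta_1(i) = pi_i * b_i(O_1)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]
    psi = []                                   # psi[t][j] = best previous state leading into j
    # Induction: delta_{t+1}(j) = max_i [delta_t(i) * a_ij] * b_j(O_{t+1})
    for t in range(1, T):
        new_delta, back = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][O[t]])
            back.append(best_i)
        delta, psi = new_delta, psi + [back]
    # Termination and backtracking
    q_T = max(range(N), key=lambda i: delta[i])
    path = [q_T]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path)), delta[q_T]

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
print(viterbi(A, B, pi, [0, 1, 2]))   # prints the best state path and its probability
```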

2.5.3 Baum-Welch Algorithm for Learning
The most challenging of the three problems is to adjust the model parameters λ = (A, B, π) to maximize the probability of the observation sequence given the model. Given any finite observation sequence as training data, there is no analytical way of estimating the globally optimal model parameters; there are, however, heuristic procedures which find a local optimum rather than the global one.[2]
Re-estimation Procedure: In order to define a procedure for re-estimation of the HMM parameters, we first define

ξt(i, j) = P(qt = Si, qt+1 = Sj | O, λ)

i.e., the probability of being in state Si at time t and state Sj at time t + 1, given the model and the observation sequence. We can write

ξt(i, j) = αt(i) aij bj(Ot+1) βt+1(j) / P(O|λ)

where βt(j) is the backward variable, i.e., the probability of the partial observation sequence from t + 1 to the end (T), given state Sj at time t and the model. We now define γt(i) as the probability of being in state Si at time t, given the observation sequence and the model; hence we can relate γt(i) to ξt(i, j) by summing over j:

γt(i) = ∑_{j=1}^{N} ξt(i, j)

If we sum γt(i) over the time index we get a quantity which can be interpreted as the expected number of times state Si is visited or, equivalently (excluding t = T), the expected number of transitions made from Si. Similarly, the summation of ξt(i, j) over time can be interpreted as the expected number of transitions made from state Si to state Sj. That is,

∑_{t=1}^{T−1} γt(i) = expected number of transitions from Si
∑_{t=1}^{T−1} ξt(i, j) = expected number of transitions from Si to Sj

Using the above quantities we can give a method for re-estimation of the parameters of an HMM:

π̄i = expected number of times in state Si at time t = 1 = γ1(i)

āij = (expected number of transitions from state Si to Sj) / (expected number of transitions from state Si)
    = ∑_{t=1}^{T−1} ξt(i, j) / ∑_{t=1}^{T−1} γt(i)

b̄j(k) = (expected number of times in state j observing symbol vk) / (expected number of times in state j)
      = ∑_{t=1, s.t. Ot=vk}^{T} γt(j) / ∑_{t=1}^{T} γt(j)
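To make the re-estimation concrete, the sketch below performs a single Baum-Welch pass on one training sequence. The toy model values and helper functions are assumptions made for this illustration; a real trainer would iterate until convergence and accumulate statistics over many sequences.

```python
def forward_all(A, B, pi, O):
    """All forward variables alpha_t(i), not just the final sum."""
    N, T = len(pi), len(O)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

def backward(A, B, O):
    """Backward variables: beta_t(i) = P(O_{t+1} ... O_T | q_t = Si, lambda)."""
    N, T = len(A), len(O)
    beta = [[0.0] * N for _ in range(T)]
    beta[T - 1] = [1.0] * N
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] for j in range(N))
    return beta

def baum_welch_step(A, B, pi, O):
    """One re-estimation pass: returns updated (A, B, pi) for a single training sequence."""
    N, M, T = len(pi), len(B[0]), len(O)
    alpha, beta = forward_all(A, B, pi, O), backward(A, B, O)
    prob = sum(alpha[T - 1][i] for i in range(N))          # P(O | lambda)
    # gamma_t(i) and xi_t(i, j) as defined in the text
    gamma = [[alpha[t][i] * beta[t][i] / prob for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] / prob
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if O[t] == k) /
              sum(gamma[t][j] for t in range(T)) for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
A, B, pi = baum_welch_step(A, B, pi, [0, 1, 2, 1, 0])   # one training iteration
```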
Using these three algorithms we can solve all three problems associated with HMMs. In the next section we will see how these algorithms are used in speech recognition with HMMs.

3 Speech Recognition and HMM
3.1 Block Diagram Of Speech Recognition
Figure 6: Block Diagram of Speech Recognition
Speech recognition consists of two main modules: feature extraction and feature matching. The purpose of the feature extraction module is to convert the speech waveform to some type of representation for further analysis and processing; this extracted information is known as the feature vector. The conversion of the voice signal to feature vectors is done by the signal-processing front-end module. As shown in the block diagram, the input to the front-end is a noise-free voice sample and its output is the feature vector. In feature matching, the feature vector extracted from the unknown voice sample is scored against the acoustic models; the model with the maximum score wins, and its output is taken as the recognized word. The following are a few methods for implementing the front-end (for extracting feature vectors):
• MFCC (Mel-Frequency Cepstrum Coefficients)
• LPC (Linear Predictive Coding)
Once the feature vectors are obtained we build the acoustic model. The acoustic model is used to score the unknown voice sample. As shown in the block diagram, the output of the front-end is given as input to the acoustic model. Different types of acoustic model are
• VQ codebook [1]
• GMM (Gaussian Mixture Model)
Acoustic Model Representation: In speech recognition, the basic unit of sound is the phoneme. A phoneme is a minimal unit that serves to distinguish between the meanings of words.

For example, the phoneme sequence for “CAT” is K, A and T. In the English language there are around 46 phonemes, and any word in an English dictionary can be constructed by a proper concatenation of these phonemes. In order to recognize a given word, we should therefore extract phonemes from the voice sample. Because of the slowly time-varying nature of the speech signal, short-term spectral analysis is the most common way to characterize it. When examined over a sufficiently short period of time (between 10 and 25 msec), its characteristics are fairly stationary; over longer periods of time the signal characteristics change to reflect the different speech sounds being spoken. Using this observation, we find that a feature vector extracted over 10 to 25 msec corresponds to a single phoneme. For speech recognition with HMMs, we assign each basic unit (phoneme) a unique HMM. Studies show that each phoneme HMM can be represented by three states, i.e., a begin, a middle and an end state, as shown in Fig. 7.
Figure 7: HMM for a single phoneme
Furthermore, each state corresponds to feature vectors. To understand this, take the example of an isolated word recognizer: each word in the vocabulary has a distinct HMM. When an unknown word comes in, it is scored against all HMM models, and the HMM with the maximum score is taken as the recognized word. As shown in the block diagram, the output of the acoustic model is a sequence of phonemes. By looking up the dictionary in reverse (phoneme to word) we can find the corresponding word. In general, however, several words can share the same phoneme sequence; for example, “car key” and “khakee” have the same phoneme sequence. In such cases the language structure comes into the picture: it uses context information to narrow down the recognized words to those that fit the given grammar constructs.
The model described so far is known as a mono-phone or context-independent model. The following are different types of HMM:
1. Context-Independent Phoneme HMM
• Number of states: d-state HMM for each phoneme (d is normally equal to 3)
• Accuracy: not accurate in continuous speech recognition
• Compact: d-state HMMs lead to fewer parameters to be calculated
• General: yes, we can build an HMM for a new word using existing phoneme HMMs
2. Context-Dependent Triphone HMM
• Number of states: d-state HMM for each phoneme
• Accuracy: accurate, as it captures the left-right phoneme relation

• Compact: each phoneme has an immediate left-right relation, so more parameters need to be calculated
• General: yes
3. Whole-Word HMM
• Number of states: no phoneme decomposition; a number of states is assigned to model the word as a whole
• Accuracy: yes, but it requires large amounts of training data and works only for small vocabularies
• Compact: no, it needs too many states as the vocabulary increases
• General: no, new words cannot be built from this representation
In practice the triphone model is widely used in speech recognition with Hidden Markov Models. We are not going to discuss how feature vectors are extracted from the voice signal; we assume that using either MFCC or LPC we get the required feature vectors. In the case of MFCC each feature vector has 39 dimensions. The question is: how do we map feature vectors to HMM states? For this we use a technique called vector quantization.
3.2 Vector quantization
Vector quantization is a technique for mapping MFCC-type feature vectors to symbols:
• create a training set of feature vectors
• cluster them into a small number of classes
• represent each class by a symbol
• for each class Vk, compute the probability that it is generated by a given HMM state
In vector quantization we define a codebook which contains an entry for each symbol, also known as a prototype vector or codeword. For example, if we have 256 classes (i.e., 8-bit VQ), we make 256 entries in the codebook. As shown in Fig. 8, the incoming vector is compared with each prototype vector, the one with the minimum distance is chosen, and its index value is assigned to the input vector.[1]
Figure 8: Vector Quantization

3.2.1 How to generate the codebook (K-means clustering algorithm)
• choose M vectors from the L training vectors, where M = 2^B, as the initial codewords (choose the M vectors with maximum distance between them)
• for each training vector, find the closest codeword (minimum distance) and assign the training vector to that cell
• for each cell, compute the centroid of that cell; the new codeword is the centroid
• repeat the last two steps until the average distance falls below a threshold
As shown in Fig. 9, all vectors are clustered around the newly formed centroids.
Figure 9: VQ Example
3.2.2 Searching algorithm
To compute p(ot|qj):
• compute the distance between the feature vector ot and each codeword entry (prototype vector)
• choose the codeword vk of the vector that is closest to ot
• look up the likelihood of vk given HMM state j in the matrix B
• i.e., bj(ot) = bj(vk) such that vk is the codeword of the closest vector to ot
For example, if a total of 60 vectors are mapped to state j and 20 of them are indexed to codeword k, then bj(vk) = 20/60. A small sketch of codebook generation and codeword lookup is given below.
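The following is a minimal sketch of codebook generation with K-means and nearest-codeword lookup, assuming 2-D feature vectors for readability (real MFCC vectors would be 39-dimensional); the data, the random initialization and the parameters are invented for illustration:

```python
import math, random

def dist(x, y):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest(codebook, x):
    """Index of the codeword closest to vector x (the VQ symbol for x)."""
    return min(range(len(codebook)), key=lambda k: dist(codebook[k], x))

def kmeans_codebook(vectors, M, iterations=20):
    """Build an M-entry codebook from training vectors with plain K-means."""
    codebook = random.sample(vectors, M)           # simple initialization for this sketch
    for _ in range(iterations):
        cells = [[] for _ in range(M)]
        for x in vectors:                          # assign each vector to its closest codeword
            cells[nearest(codebook, x)].append(x)
        for k, cell in enumerate(cells):           # move each codeword to the centroid of its cell
            if cell:
                codebook[k] = [sum(c) / len(cell) for c in zip(*cell)]
    return codebook

# Toy 2-D "feature vectors" and a 4-entry codebook (8-bit VQ would use M = 256)
train = [[random.gauss(cx, 0.3), random.gauss(cy, 0.3)]
         for cx, cy in [(0, 0), (3, 0), (0, 3), (3, 3)] for _ in range(25)]
codebook = kmeans_codebook(train, M=4)
print(nearest(codebook, [2.9, 0.1]))   # quantize an incoming vector to a codeword index
```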

3.3 Evaluation Problem and Viterbi Algorithm
In Section 2 we saw the evaluation problem, i.e., given an observation sequence O = O1, O2, O3, ..., OT, how to calculate P(O|λ). There we first applied the chain rule to calculate P(O|λ) as

P(O|λ) = ∑_{all state sequences} πq1 bq1(O1) aq1q2 bq2(O2) ··· aqT−1qT bqT(OT)

Figure 10: Chain rule - intractable algorithm
which is of the order of O(N^T). As illustrated in Fig. 10, the probability of (O1, O2, O3) over all state sequences is built from terms such as
P(o1, o2, o3 | q0, q0, q0) + P(o1, o2, o3 | q0, q0, q1) + P(o1, o2, o3 | q0, q1, q0) + P(o1, o2, o3 | q0, q1, q1) + ...
For better efficiency we use the forward variable, which uses dynamic programming to reduce the order to N²T. In Section 2.5.1 we defined the forward variable as αt(i) = P(O1, O2, ..., Ot, qt = Si | λ) and computed αt(j) by summing the previous values αt−1(i) over all i. Fig. 11 shows the calculation for the word “ONE”, whose phoneme sequence is W, AH and N. In the forward calculation we sum over all incoming nodes to find P(O|λ). Since our phoneme model has a bi-gram (left-to-right) structure, only the inputs from a state's predecessor states are considered when computing the sum of probabilities. For a better understanding refer to Fig. 11: at time t = 3, the input to state N comes from states AH and N only, and not from W. In this way we can find P(O|λ), given an observation sequence and a model λ.
Figure 11: Forward calculation
Now we look at the Viterbi algorithm for finding the best state sequence. As explained in Section 2.5.2, the Viterbi algorithm is the same in implementation as the forward algorithm (except for backtracking); the major difference is that in Viterbi we take the MAX over all incoming inputs to a given state. As shown in Fig. 12, only the path from state 1 at time t = 1 to state 2 at t = 3 is considered for the probability calculation, while in the forward algorithm we sum up all probabilities. So far we have seen:
• how feature vectors are mapped onto HMM states, using vector quantization
• how to calculate the probability of an observation sequence given the model, i.e., P(O|λ), using the forward algorithm
• how to find the single best state sequence given an observation sequence and model, i.e., the sequence of states which produces the required output, using the Viterbi algorithm

Figure 12: Viterbi Calculation
Figure 13: Language model, acoustic model and decoder

3.4 Language model
The model that computes P(W) is known as the language model. Fig. 13 shows how the acoustic model, decoder and language model work together to recognize a given sentence/word W. Given a sentence, how can one calculate its joint probability? One way is to use the chain rule. In general,
P(A, B, C, D) = P(A) · P(B|A) · P(C|A, B) · P(D|A, B, C)
For example, the probability P(computer, got, crashed, while, installing, windows) can be written as
P(computer) · P(got|computer) · P(crashed|computer, got) · P(while|computer, got, crashed) · P(installing|computer, got, crashed, while) · P(windows|computer, got, crashed, while, installing)
In practice we will never have enough data to estimate the probability of long prefixes such as P(windows|computer, got, crashed, while, installing). In order to build a practical model we make the assumption that only the previous word matters, and calculate P(got|computer), P(crashed|got), P(windows|installing) and so on. This is known as the bi-gram model. In general we write P(Wn|Wn−1). The language model helps us in generating the search graph. The search graph is a collection of the HMMs of each phoneme in the given vocabulary. Once the graph is created, incoming feature vectors are quantized and a search is performed to find the best sequence of phonemes. Fig. 14 gives a better understanding of the search graph.
Figure 14: Search graph for given vocabulary
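Returning to the bi-gram model itself, here is a minimal sketch of estimating bi-gram probabilities from counts and scoring a sentence under the bi-gram assumption; the tiny corpus and the unsmoothed maximum-likelihood estimate are assumptions made purely for illustration (a real model would need smoothing for unseen word pairs):

```python
from collections import Counter

# Tiny illustrative corpus; <s> marks the start of a sentence.
corpus = [
    "<s> computer got crashed while installing windows".split(),
    "<s> computer crashed while installing windows".split(),
    "<s> installing windows crashed computer".split(),
]

bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))
histories = Counter(w1 for sent in corpus for w1 in sent[:-1])  # words counted as bigram histories

def p_bigram(w, prev):
    """Unsmoothed maximum-likelihood estimate of P(w | prev)."""
    return bigrams[(prev, w)] / histories[prev] if histories[prev] else 0.0

def p_sentence(words):
    """P(W) under the bi-gram assumption: product of P(w_n | w_{n-1})."""
    prob, prev = 1.0, "<s>"
    for w in words:
        prob *= p_bigram(w, prev)
        prev = w
    return prob

print(p_sentence("computer got crashed while installing windows".split()))
```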

4 Isolated Word Recognizer
In this section we quickly go through all the components used in a speech recognition application.
4.1 Architecture Of speech recognition
Figure 15: Architecture
Front-End: The purpose of the front-end is to parametrize an input signal (e.g., audio) into a sequence of output features. Voice samples are taken every 10 - 25 msec. This sampled data is fed to the front-end module for further processing. The output of the front-end is a list of feature vectors (MFCC - 39 dimensions). These feature vectors are then mapped to symbols using vector quantization.[3]
Vector Quantization: It maps feature vectors to symbols; this is also known as acoustic modeling. The symbols represent HMM states. During the recognition process these symbols are matched against unknown symbols. This gives us a way to map complex vectors into a manageable symbol set. The method is discussed in Section 3.2.
HMM model creation: Depending on the implementation, HMMs are created for every basic sound unit, in our case the phoneme. All the HMMs are then linked together to represent the vocabulary under consideration. This linked representation is known as the search space for the given problem. During the recognition phase this graph is searched to find occurrences of the given word.

To represent the relation between the words in a given sentence, the bi-gram model is used. The language model, vocabulary and acoustic model form the core of speech recognition. The lexicon (Fig. 13) is used to map words from the vocabulary to HMMs. For example, for the word "ONE" the lexicon listing gives its corresponding phoneme sequence.
Training: The most difficult task is to adjust the model parameters to accurately represent the word under consideration. In training mode a large amount of voice data (from different speakers) is given to the HMM. Using this data, the HMM adjusts its probability distributions and transition matrix. There is no globally optimal algorithm for learning; every HMM must be trained to maximize its (locally optimal) recognition power. Initially the HMM for a phoneme (before learning) consists of 3 states, and its transition matrix and output probability distributions are initialized randomly. They get updated automatically once training starts.[4]
Recognition: The unknown word is fed to each HMM and its output probability is calculated.
The five steps of automatic speech recognition (ASR) are:
• Feature Extraction: 39 MFCC features
• Acoustic Model: vector quantization - output probability distribution for each state
• Lexicon: word to phoneme mapping - used in the creation of HMMs
• Language Model: N-grams for computing P(wi|wi−1)
• Decoder: the Viterbi algorithm, using dynamic programming to combine all of these and obtain a word sequence from speech
4.2 Example
The role of the recognizer is to map sequences of speech vectors to the wanted underlying symbol sequences. Two problems make this very difficult:
• The mapping from symbols to speech is not one-to-one, since different underlying symbols can give rise to similar speech sounds (the example of “car key” and “khakee” having the same phoneme sequence). Furthermore, there are large variations in the speech waveform due to speaker variability, noise, environment, etc.
• The boundaries between symbols cannot be identified explicitly from the speech waveform. Hence, it is not possible to treat the speech waveform as a sequence of concatenated patterns.
The second problem, of not knowing the word boundary locations, can be avoided by restricting the task to isolated word recognition (Fig. 16). Here the speech waveform corresponds to a single word chosen from a fixed vocabulary. Although this setting is very artificial, it has a wide range of practical applications, and it serves as a basic model for understanding the concepts of speech recognition. We assume that the parameters aij and bj(ot) are known for each model Mi. Given a set of examples of a given word, the HMM model parameters can be calculated automatically using the learning algorithm discussed in Section 2.5.3. Fig. 17 summarizes the use of HMMs in an isolated word recognizer.

Figure 16: Isolated word recognizer
First, an HMM is trained for each vocabulary word using a number of examples of that word (this adjusts aij and bj(vk)). In this example the words are cat, dial and call respectively. In order to recognize some unknown word, the observation sequence is given to each HMM, and the output of the HMM with the maximum P(O|λ) (calculated using the forward algorithm) is taken as the recognized word.[4]
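A minimal sketch of this decision rule, reusing the forward() function from the sketch in Section 2.5.1; the per-word models and the quantized observation sequence are placeholders invented for illustration, not trained values:

```python
# Assumes forward(A, B, pi, O) from the earlier sketch is available in scope.
# Each vocabulary word has its own HMM lambda_w = (A, B, pi); the numbers
# below are placeholders standing in for trained parameters.
word_models = {
    "cat":  ([[0.6, 0.4], [0.0, 1.0]], [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]], [1.0, 0.0]),
    "dial": ([[0.5, 0.5], [0.0, 1.0]], [[0.2, 0.7, 0.1], [0.3, 0.3, 0.4]], [1.0, 0.0]),
    "call": ([[0.8, 0.2], [0.0, 1.0]], [[0.4, 0.4, 0.2], [0.2, 0.1, 0.7]], [1.0, 0.0]),
}

def recognize(O):
    """Score the quantized observation sequence O against every word HMM
    and return the word whose model gives the maximum P(O | lambda)."""
    return max(word_models, key=lambda w: forward(*word_models[w], O))

unknown = [0, 1, 2, 2]          # VQ symbol indices from the front-end (placeholder)
print(recognize(unknown))
```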

Figure 17: Markov Model for isolated word recognizer

5 Conclusion and Future Work
We saw how an HMM can be used as the signal model for a speech recognition application. The Forward, Viterbi and Baum-Welch algorithms provide solutions to the three problems associated with HMMs. Speech recognition using HMMs gives good results owing to the resemblance between the architecture of the HMM and the time-varying nature of speech data. Neural networks are another method, using gradient descent with the back-propagation algorithm. Studies show that this method works well in the presence of large amounts of training data, and its recognition accuracy is high if the word under consideration comes from the training data set, while the recognition ability of the HMM is also good for unknown words. The HMM is a generic concept and is used in many areas of research. In the whole architecture of speech recognition, the HMM is just one block which helps in creating the search graph; it works in tandem with other blocks such as the front-end, language model and lexicon to achieve the desired goal. The purpose of the HMM is to map feature vectors to some representable states and emit symbols whose concatenation gives the desired phoneme sequence.
A problem with continuous speech recognition is determining word boundaries; it requires knowledge of the language constructs and an understanding of regional accents. Another problem is the presence of noise in the sample data. Efficient solutions to these two problems are required for accurate continuous speech recognition. What we discussed in this seminar is the recognition of English sentences; the model can be extended to recognize multi-lingual sentences.

References
[1] Dan Jurafsky. CS 224S / LINGUIST 181: Speech Recognition and Synthesis. Course material, http://www.stanford.edu/class/cs224s/.
[2] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 1989.
[3] Willie Walker, Paul Lamere, and Philip Kwok. Sphinx-4: A Flexible Open Source Framework for Speech Recognition. Sun Microsystems, 2004.
[4] Steve Young and Gunnar Evermann. The HTK Book. Microsoft Corporation, 2005.