Study on the Technologies of Text-independent Short-duration Speaker Recognition
|School||PLA Information Engineering University|
|Course||Communication and Information System|
|Keywords||Text-independent speaker recognition; short-duration; Mel Frequency Cepstral Coefficient; Gaussian Mixture Model; Support Vector Machine; manifold learning; fusion|
In recent years, driven by application demands and the development of related theory, research on speaker recognition has made great progress. Research institutions at home and abroad are actively advancing new theories, new methods and practical deployment. Within this work, training and testing on short-duration speech has attracted particular attention. Since 2004, the Speaker Recognition Evaluations organized by the National Institute of Standards and Technology (NIST) have defined test conditions by speech length; in the shortest-duration condition, neither the training speech nor the testing speech exceeds 10 seconds. The evaluation results show that, compared with conditions using longer speech, speaker recognition performance degrades drastically when the training and testing speech are reduced to 10 seconds. The main reason is that current speaker recognition systems are mostly built on probabilistic statistical models, so their performance depends largely on how well the training speech and testing speech match. However, the most widely used short-time cepstral features contain both speaker information and content information, and differences in content information reduce the match between training and testing speech. This also explains why text-dependent speaker recognition has a large advantage over text-independent speaker recognition: it guarantees that the contents of the training and testing speech match exactly.
In text-independent speaker recognition, however, if the training and testing speech are too short, their contents may be seriously mismatched. Because current speech signal processing technology cannot separate content information from speaker information in the speech signal, this mismatch becomes the main factor limiting the performance of text-independent speaker recognition.
To study the influence of speech length on speaker recognition performance and to improve short-duration text-independent speaker recognition, this thesis works in two directions. First, it studies how to reduce the influence of content information in short-time cepstral features, and proposes two schemes, one for speaker identification and one for speaker verification. Second, it studies how to extract more features from speech of limited length to enrich the description of the speaker, which helps improve speaker recognition performance in short-duration conditions.
The main work and contributions of this thesis are outlined as follows:
1) A feature transformation method based on a speaker attribution constraint is proposed to suppress the influence of content information on the distribution of short-time cepstral features. It makes features from the same speaker more concentrated and the distinction between different speakers more pronounced, which improves the identification rate in short-duration speaker identification. The thesis first exploits the fact that speech features lie on a nonlinear manifold and, by analyzing the local geometric structure of the features, constructs the neighborhood relationships among them. It then applies the speaker-attribution-constrained transformation to suppress the influence of content information in the short-time cepstral features. Finally, an explicit transformation matrix is derived.
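A minimal sketch of how a speaker-constrained, locality-preserving linear transformation of this kind might be computed. It assumes that the neighbor graph links only frames of the same speaker, so that the projection pulls same-speaker features together. This illustrates the general idea and is not the thesis's exact algorithm: the function name `speaker_constrained_lpp`, the heat-kernel weights, and the parameters `n_neighbors`, `n_components` and `reg` are all assumptions introduced here.

```python
import numpy as np

def speaker_constrained_lpp(X, speakers, n_neighbors=5, n_components=2, reg=1e-6):
    """Hypothetical sketch: locality-preserving projection on a same-speaker graph.

    X: (n_frames, dim) feature matrix; speakers: (n_frames,) speaker labels.
    Returns a (dim, n_components) projection matrix.
    """
    X = np.asarray(X, dtype=float)
    speakers = np.asarray(speakers)
    n, dim = X.shape
    Xc = X - X.mean(axis=0)

    # Pairwise squared Euclidean distances between frames.
    sq = np.sum(Xc ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Xc @ Xc.T), 0.0)

    # Heat-kernel affinity, kept only for the nearest neighbors of the same speaker.
    sigma2 = np.mean(d2) + 1e-12
    W = np.zeros((n, n))
    idx = np.arange(n)
    for i in range(n):
        same = np.flatnonzero((speakers == speakers[i]) & (idx != i))
        if same.size == 0:
            continue
        nearest = same[np.argsort(d2[i, same])[:n_neighbors]]
        W[i, nearest] = np.exp(-d2[i, nearest] / sigma2)
    W = np.maximum(W, W.T)  # symmetrize the graph

    D = np.diag(W.sum(axis=1))
    L = D - W  # graph Laplacian

    A = Xc.T @ L @ Xc
    B = Xc.T @ D @ Xc + reg * np.eye(dim)  # regularized so that B is positive definite

    # Solve A v = lambda B v by whitening with the Cholesky factor of B.
    C = np.linalg.cholesky(B)
    C_inv = np.linalg.inv(C)
    M = C_inv @ A @ C_inv.T
    M = 0.5 * (M + M.T)
    _, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order

    # The smallest eigenvalues keep same-speaker neighbors closest together.
    return C_inv.T @ eigvecs[:, :n_components]
```

Projected features are then obtained as `X @ V`, where `V` is the returned matrix, and could be passed to a recognizer in place of the raw features.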
The effect of this transformation is tested on a GMM-UBM baseline system. On the same test database, with 10 seconds of training data, the best relative improvement rates of the proposed transformation (SAC-LPP) over other feature transformation methods are 13.48%, 9.58%, 8.75%, 9.90% and 11.92% for testing data of 10, 8, 5, 3 and 2 seconds, respectively.
2) A text-independent speaker verification scheme based on UBM mixture subspaces is proposed. It searches for the units of the training and testing supervectors whose content information matches, so that the influence of the mismatched parts of the supervectors is reduced and the equal error rate in short-duration speaker verification is lowered. The scheme starts from two observations: text-dependent speaker recognition outperforms text-independent speaker recognition, and the main factor degrading short-duration speaker recognition is the mismatch between the content information of the training speech and the testing speech. The thesis therefore uses the neighborhood distribution of the UBM mixtures to roughly classify the content information in the training and testing speech, so that within each subspace text-independent speaker recognition is transformed into "content-dependent" speaker recognition. In addition, a dual-confidence subspace fusion scheme is proposed that assigns weights according to the feature distribution of the training and testing speech and the discriminative ability of each subspace, so that the detailed information in the speech is fully used.
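The subspace idea can be illustrated with a rough sketch, under several assumptions that do not come from the thesis: the UBM is a diagonal-covariance GMM, each mixture has already been assigned to a subspace (the hypothetical array `mixture_to_subspace`), each frame is routed to the subspace of its most likely mixture, and training and testing speech are compared only in subspaces that both of them cover. All function names are illustrative.

```python
import numpy as np

def top_mixture(frames, means, variances, weights):
    """Index of the most likely diagonal-covariance UBM mixture for each frame."""
    diff = frames[:, None, :] - means[None, :, :]
    log_prob = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    return np.argmax(log_prob, axis=1)

def subspace_means(frames, mixture_ids, mixture_to_subspace, n_subspaces):
    """Mean feature vector per subspace; NaN rows mark subspaces with no frames."""
    out = np.full((n_subspaces, frames.shape[1]), np.nan)
    sub_ids = mixture_to_subspace[mixture_ids]
    for s in range(n_subspaces):
        selected = frames[sub_ids == s]
        if len(selected):
            out[s] = selected.mean(axis=0)
    return out

def matched_subspace_distance(train_means, test_means):
    """Average distance over the subspaces seen in both training and testing speech."""
    shared = ~np.isnan(train_means[:, 0]) & ~np.isnan(test_means[:, 0])
    if not shared.any():
        return float("inf")  # no content overlap at all
    diffs = train_means[shared] - test_means[shared]
    return float(np.mean(np.linalg.norm(diffs, axis=1)))
```

Restricting the comparison to shared subspaces is what makes each comparison roughly "content-dependent" in this sketch. A per-subspace weighting, as in the dual-confidence fusion described above, would replace the plain average.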
On the same test database, with 10 seconds of training data, the best relative improvement rates of the proposed method over other subspace methods are 18.67%, 10.22%, 6.13%, 5.00% and 6.10% for testing data of 10, 8, 5, 3 and 2 seconds, respectively.
3) A "biomimetic neural network excitation source" feature is proposed, in which the idea of biomimetic pattern recognition is introduced to model the excitation source from speech data. The effectiveness of this feature for speaker recognition is validated, and it improves speaker recognition performance when combined with short-time cepstral features. To address the drawbacks of using an AANN to extract and model excitation source features from the linear prediction (LP) residual, the thesis uses a biomimetic neural network (BNN) to model the speaker's LP residual excitation features, and builds the excitation source feature and the corresponding recognition system. This method not only avoids the complicated iterative training of traditional neural networks, but also applies the principle of biomimetic pattern recognition, "recognition based on learning rather than distinguishing", which gives it good performance when few samples are available, that is, with short utterances. On the same test database, using LP residual vectors and 10 seconds of training data, the proposed BNN reduces the identification error rate relative to the AANN method by 6.98%, 11.59%, 9.67%, 9.00% and 8.45% for testing data of 10, 8, 5, 3 and 2 seconds, respectively. Because the LP residual excitation source is complementary to short-time cepstral features in describing the speaker, the thesis studies the integration of the two features in speaker recognition and designs a decision fusion method based on the confidence of each feature.
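For reference, the LP residual on which the excitation source feature above is built can be obtained by inverse filtering a speech frame with its own linear prediction coefficients. The sketch below uses the standard autocorrelation method with the Levinson-Durbin recursion. The function names and the default order of 10 are assumptions made for illustration, and the thesis's actual front end (windowing, pre-emphasis, frame length) is not specified here.

```python
import numpy as np

def lp_coefficients(frame, order):
    """Autocorrelation-method LP coefficients via the Levinson-Durbin recursion.

    Returns a with a[0] = 1, so the prediction error is e[n] = sum_k a[k] * s[n - k].
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predicted frame: stop early
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def lp_residual(frame, order=10):
    """Inverse-filter the frame with its LP coefficients to get the residual."""
    a = lp_coefficients(frame, order)
    return np.convolve(frame, a)[:len(frame)]
```

The residual keeps what the all-pole vocal tract model fails to predict, which is why it is commonly treated as an approximation of the excitation source.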
By measuring the correlations among the different features used in speaker recognition, the complementarity of the LP residual excitation source feature and the short-time cepstral feature is established theoretically. Different fusion methods are proposed for speaker identification and speaker verification. In the dynamic fusion method, the reliability of each feature is obtained from its individual recognition result; in the static fusion method, the weight of each feature is obtained from its inherent ability to distinguish speakers. Compared with using the short-time cepstral feature alone, with 10 seconds of training data, the fusion method reduces the identification error rate by 13.44%, 11.11%, 10.22%, 10.12% and 8.95% (speaker identification) and the equal error rate by 5.51%, 5.02%, 10.72%, 8.43% and 2.55% (speaker verification) for testing data of 10, 8, 5, 3 and 2 seconds, respectively.
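The dynamic fusion idea, in which each feature is weighted by how reliable its own recognition result looks, might be sketched as follows for speaker identification. This is only an illustration under stated assumptions: reliability is taken to be the gap between the best and second-best normalized scores, which is a choice made here and not necessarily the confidence measure used in the thesis, and all function names are hypothetical. At least two candidate speakers are required.

```python
import numpy as np

def normalize(scores):
    """Zero-mean, unit-variance scores so that the two systems are comparable."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    return (scores - scores.mean()) / std if std > 0 else np.zeros_like(scores)

def margin_confidence(scores):
    """Gap between the best and second-best normalized score (assumed reliability)."""
    top_two = np.sort(normalize(scores))[-2:]
    return float(top_two[1] - top_two[0])

def fuse_identification(cepstral_scores, residual_scores):
    """Return (identified speaker index, weight given to the cepstral system)."""
    c_conf = margin_confidence(cepstral_scores)
    r_conf = margin_confidence(residual_scores)
    total = c_conf + r_conf
    w = 0.5 if total == 0 else c_conf / total
    fused = w * normalize(cepstral_scores) + (1.0 - w) * normalize(residual_scores)
    return int(np.argmax(fused)), w

# Example: the cepstral system hesitates between speakers 0 and 1,
# while the residual system clearly favors speaker 0.
speaker, weight = fuse_identification([1.0, 1.1, 0.2], [3.0, 0.5, 0.4])
```

In this example the residual scores receive the larger weight because their top-two margin is wider, so the fused decision follows the more confident system. A static variant would fix the weight in advance from each feature's measured discriminative ability.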