Title Tools for Automatic Audio Indexing
Author Jørgensen, Kasper Winther (Cognitive Systems, Department of Informatics and Mathematical Modeling, Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark)
Mølgaard, Lasse Lohilahti (Cognitive Systems, Department of Informatics and Mathematical Modeling, Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark)
Supervisor Hansen, Lars Kai (Cognitive Systems, Department of Informatics and Mathematical Modeling, Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark)
Institution Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark
Thesis level Master's thesis
Year 2006
Abstract Current web search engines generally do not enable searches into audio files. Informative metadata would allow searches into audio files, but producing such metadata is a tedious manual task. Tools for automatic production of metadata are therefore needed. This project investigates methods for audio segmentation and speech recognition, which can be used for this metadata extraction. Classification models for classifying speech and music are investigated. A feature set consisting of zero-crossing rate, short-time energy, spectrum flux, and mel-frequency cepstral coefficients is integrated over a 1-second window to yield a 60-dimensional feature vector. A number of classifiers are compared, including artificial neural networks and a linear discriminant. The results obtained using the linear discriminant are comparable with the performance of more complex classifiers. The dimensionality of the feature vectors is decreased from 60 to 14 features using a selection scheme based on the linear discriminant. The resulting model using 14 features with the linear discriminant yields a test misclassification rate of 2.2%. A speaker change detection algorithm based on a vector quantization distortion (VQD) measure is proposed. The algorithm works in two steps. The first step finds potential change-points and the second step compensates for the false alarms produced by the first step. The VQD metric is compared with two other frequently used metrics, Kullback-Leibler divergence (KL2) and Divergence Shape Distance (DSD), and is found to yield better results. An overall F-measure of 85.4% is found. The false alarm compensation shows a relative improvement in precision of 59.7% with a relative loss of 7.2% in recall of the detected change-points. The choice of parameters based on one data set generalizes well to other independent data sets. The open source speech recognition system SPHINX-4 is used to produce transcripts of the speech segments. The system shows an overall word accuracy of approximately 75%.
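The abstract reports change-point detection quality as precision, recall, and F-measure, along with relative changes in those figures. As a minimal sketch of how such numbers could be computed, the Python below scores detected change-points against reference ones. The matching tolerance, the greedy one-to-one matching, and all function names are assumptions made for illustration; the record does not state how the thesis matched detected and true change-points.

```python
def count_hits(detected, reference, tolerance=1.0):
    """Count detected change-points (seconds) that match a reference point.

    Assumption: a detection counts as correct if it lies within `tolerance`
    seconds of a reference point that has not already been matched.
    """
    unmatched = sorted(reference)
    hits = 0
    for t in sorted(detected):
        # Pick the closest still-unmatched reference point within tolerance.
        candidates = [r for r in unmatched if abs(r - t) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda r: abs(r - t))
            unmatched.remove(best)
            hits += 1
    return hits


def precision_recall_f(detected, reference, tolerance=1.0):
    """Return (precision, recall, F-measure) for a set of detections."""
    hits = count_hits(detected, reference, tolerance)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


def relative_change(before, after):
    """Relative change of a metric, e.g. precision before and after a step."""
    return (after - before) / before


# Illustrative values only; these are not data from the thesis.
detected = [10.2, 25.0, 40.1, 55.0]
reference = [10.0, 40.0, 70.0]
precision, recall, f_measure = precision_recall_f(detected, reference)
```

In this reading, a false alarm compensation step removes detections, which can raise precision while lowering recall; `relative_change` expresses each shift as a fraction of the value before the step.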
Imprint Department of Informatics and Mathematical Modeling, Technical University of Denmark, DTU : DK-2800 Kgs. Lyngby, Denmark
Fulltext
Original PDF imm4472.pdf (2.78 MB)
Admin Creation date: 2006-10-06    Update date: 2012-12-19    Source: dtu    ID: 191644    Original MXD