Acoustic-based machine learning diagnostic tool for voice disorders

Wu, Huiyi

Thesis

Creator

Wu, Huiyi

Rights statement

Strathclyde Thesis Copyright

Awarding institution

University of Strathclyde

Date of award

2020

Thesis identifier

T15843

Person Identifier (Local)

201689995

Qualification Level

Doctoral (Postgraduate)

Qualification Name

Doctor of Philosophy (PhD)

Department, School or Faculty

Department of Electronic and Electrical Engineering

Abstract

The research presented in this thesis addresses the application of deep neural networks and digital signal processing algorithms to pathological voice detection. Four novel methods are presented: a deep acoustic recurrent model that combines frame-based cepstral and spectral features with a bidirectional long short-term memory (Bi-LSTM) network; a 10-layer convolutional neural network (CNN) model that takes the spectrogram of the speech as input; transfer learning from image recognition to pathological voice detection, using a time-frequency representation as input; and a novel CNN model that applies data augmentation and takes the scalogram of the speech as input. The deep acoustic recurrent model explores the relationship between frame-based cepstral features and a recurrent neural network (RNN) model. Two novel features based on the cepstrum are proposed: Second Peak Perturbation (SPP) and the standard deviation of the cepstrum (CepStd). These novel cepstral features are shown to improve classification performance on three databases. In addition, traditional acoustic analysis is compared with the proposed deep acoustic recurrent model. Frame-based cepstral features are shown to perform better overall with the deep recurrent model than with traditional classifiers. A 10-layer convolutional neural network is then proposed with the spectrogram of the speech as input. This is the first model that applies a time-frequency representation in deep learning for pathological voice detection. The experimental results show that it is an effective and efficient model for detecting pathological speech data. However, it overfits to some extent, a common problem when the data set is small. To address this issue, transfer learning with state-of-the-art CNN networks from the image recognition field is applied to pathological voice detection. The results show that transfer learning improves accuracy on the testing data. However, the overfitting problem remains severe. Finally, the concept of data augmentation is explored and a novel CNN model called the R-Net is proposed. This method uses the continuous wavelet transform to obtain scalograms of the speech onset, and applies data augmentation within a CNN environment. This model significantly reduces overfitting and improves testing performance by 15% to 20% on the most challenging database, SVD. It demonstrates the effectiveness of data augmentation for problems with small data sets.
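
As an illustration of the frame-based cepstral analysis mentioned in the abstract, the following is a minimal sketch and not the thesis's implementation. The abstract does not define SPP or CepStd, so `cepstrum_std_per_frame` is only a hypothetical reading of "standard deviation of cepstrum": it computes the real cepstrum of each frame and takes the standard deviation of its coefficients. The frame length, hop size, toy signal, and all function names are assumptions made for the example, and no window function is applied.

```python
import cmath
import math
import statistics


def dft(x):
    """Naive O(N^2) discrete Fourier transform; adequate for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, eps=1e-12):
    """Real cepstrum: inverse DFT of the log magnitude spectrum of the frame."""
    n = len(frame)
    # eps avoids log(0) for bins with no energy (an assumption of this sketch).
    log_mag = [math.log(abs(c) + eps) for c in dft(frame)]
    # For a real frame the log magnitude spectrum is symmetric, so the
    # inverse DFT is real up to rounding; keep only the real part.
    return [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]


def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames of equal length."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def cepstrum_std_per_frame(signal, frame_len=64, hop=32):
    """Hypothetical CepStd-like feature: one standard deviation per frame."""
    return [statistics.pstdev(real_cepstrum(frame))
            for frame in frame_signal(signal, frame_len, hop)]


# Toy two-tone signal standing in for a sustained vowel (illustrative only).
signal = [math.sin(2 * math.pi * 5 * t / 64) + 0.5 * math.sin(2 * math.pi * 10 * t / 64)
          for t in range(256)]
features = cepstrum_std_per_frame(signal)
```

The result is one value per frame. A sequence of such per-frame features is the kind of input a recurrent model such as a Bi-LSTM consumes, although the exact feature set used in the thesis is not given in the abstract.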

Advisor / supervisor

Lowit, Anja
Soraghan, John
Di Caterina, Gaetano

Resource Type

Doctoral thesis

Note

Previously held under moratorium from 15th April 2021 until 2nd May 2023.

DOI

10.48730/sw79-z053

Date Created

2020

Former identifier

9912981493502996

Relations

Items

Title	Date Uploaded	Visibility
PDF of thesis T15843	2023-04-17	Public
File	2021-07-02	Private
