Multivariate analysis of metabonomic data

Metabonomics is one of the main technologies used in the biomedical sciences to improve understanding of how the biological processes of living organisms work. It is considered more advanced than, for example, genomics and proteomics, because by studying the metabolic profiles of living organisms it can provide important evidence of molecular biomarkers for the diagnosis of diseases and the evaluation of beneficial or adverse drug effects. This is achieved by studying samples of various types, such as tissues and biofluids. The findings of a metabonomics study for a specific disease, disorder or drug effect could be applied to other diseases, disorders or drugs, making metabonomics an important tool for biomedical research. This thesis aims to review and study various multivariate statistical techniques that can be used in the exploratory analysis of metabonomics data. To motivate this research, a metabonomics data set containing the metabolic profiles of a group of patients with epilepsy was used. More specifically, the metabolic fingerprints (proton NMR spectra of blood serum) of 125 patients with epilepsy were obtained from the Western Infirmary, Glasgow, for the purposes of this project. These data were originally collected as baseline data in a study investigating whether treatment with Anti-Epileptic Drugs (AEDs) affects the seizure levels of patients with pharmacoresistant epilepsy. The response to drug treatment, measured as the reduction in these patients' seizure levels, enabled two main categories of response to be identified: responders and non-responders to AEDs. We explore the statistical methods used in metabonomics to analyse these data. Novel aspects of the thesis are the use of Self Organising Maps (SOM) and of Fuzzy Clustering Methods for pattern recognition in metabonomics data. 
Part I of the thesis defines metabonomics and the other main "omics" technologies, and gives a detailed description of the metabonomics data to be analysed, as well as a description of the two main analytical chemical techniques, Mass Spectrometry (MS) and Nuclear Magnetic Resonance Spectroscopy (NMR), that can be used to generate metabonomics data. Pre-processing and pre-treatment methods that are commonly applied to NMR-generated metabonomics data to enhance the quality and accuracy of the data are also discussed.
In Part II, several unsupervised statistical techniques are reviewed and applied to the epilepsy data to investigate their capability to discriminate the patients according to their type of response. The techniques reviewed include Principal Components Analysis (PCA), Multi-dimensional scaling (both Classical scaling and Sammon's non-linear mapping) and Clustering techniques. The latter include Hierarchical clustering (with emphasis on Agglomerative Nesting algorithms), Partitioning methods (Fuzzy and Hard clustering algorithms) and Competitive Learning algorithms (Self Organising Maps). The advantages and disadvantages of the different methods are examined for this kind of data. Results of the exploratory multivariate analyses showed that no natural clusters of patients existed with regard to their response to AEDs; therefore, none of these techniques was capable of discriminating these patients according to their clinical characteristics.
To examine the capability of an unsupervised technique such as PCA to identify groups in data such as the metabolic fingerprints of patients with epilepsy, a simulation algorithm was developed to run a series of experiments, covered in Part III of the thesis. The aim of the simulation study is to investigate the extent of the difference between the clusters in the data, and the conditions under which this difference is detectable by unsupervised techniques. 
Furthermore, the study examines whether the existence or lack of variation in the mean-shifted variables affects the discriminating ability of unsupervised techniques (in this case PCA). In each simulation experiment, a reference and a test data set were generated based on the original epilepsy data, and the discriminating capability of PCA was assessed. A test set was generated by mean-shifting a pre-selected number of variables in a reference set. The experiments used three methods of selecting the variables to mean-shift (maximum standard deviations, minimum standard deviations and maximum means), five subset sizes of 1, 3, 20, 120 and 244 variables (244 being the total number of variables in the data sets) and three sample sizes (100, 500 and 1000). For each experiment, the average values over 100 runs of two statistics, the misclassification rate and the average separation (Webb, 2002), were recorded. Results showed that the number of mean-shifted variables (in general) and the method used to select the variables (in some cases) are important factors for the discriminating ability of PCA, whereas the sample size of the two data sets does not play any role in the experiments (although experiments with large sample sizes showed greater stability in the results for the two statistics over 100 runs of any experiment). The results have implications for the use of PCA with metabonomics data generally.
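
The simulation design described above can be illustrated with a minimal Python sketch. It is not the thesis's algorithm, and several parts are assumptions made only for illustration: the data are synthetic independent Gaussians rather than data derived from the epilepsy spectra, the test set is an independent draw that is then mean-shifted, the shift is expressed in standard-deviation units through a hypothetical `shift` parameter, and misclassification is measured with a nearest-centroid rule on the first two principal component scores. The average separation statistic is left out because its definition is not given here. All function names are hypothetical.

```python
import numpy as np


def select_variables(reference, k, method="max_sd"):
    """Return the indices of the k variables to mean-shift."""
    if method == "max_sd":
        order = np.argsort(reference.std(axis=0))[::-1]
    elif method == "min_sd":
        order = np.argsort(reference.std(axis=0))
    elif method == "max_mean":
        order = np.argsort(reference.mean(axis=0))[::-1]
    else:
        raise ValueError(f"unknown selection method: {method}")
    return order[:k]


def pca_scores(data, n_components=2):
    """Project mean-centred data onto its leading principal components."""
    centred = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T


def misclassification_rate(scores, labels):
    """Assumed rule: assign each sample to the nearest group centroid."""
    c0 = scores[labels == 0].mean(axis=0)
    c1 = scores[labels == 1].mean(axis=0)
    d0 = np.linalg.norm(scores - c0, axis=1)
    d1 = np.linalg.norm(scores - c1, axis=1)
    predicted = (d1 < d0).astype(int)
    return float(np.mean(predicted != labels))


def run_experiment(n_samples=100, n_vars=244, k=20, shift=2.0,
                   method="max_sd", n_runs=100, seed=0):
    """Average misclassification rate over n_runs simulated data sets."""
    rng = np.random.default_rng(seed)
    # Synthetic per-variable standard deviations (an assumption).
    scales = rng.uniform(0.5, 2.0, size=n_vars)
    labels = np.repeat([0, 1], n_samples)
    rates = []
    for _ in range(n_runs):
        reference = rng.normal(0.0, scales, size=(n_samples, n_vars))
        test = rng.normal(0.0, scales, size=(n_samples, n_vars))
        idx = select_variables(reference, k, method)
        # Mean-shift the selected variables by `shift` reference SDs.
        test[:, idx] += shift * reference[:, idx].std(axis=0)
        scores = pca_scores(np.vstack([reference, test]))
        rates.append(misclassification_rate(scores, labels))
    return float(np.mean(rates))


if __name__ == "__main__":
    for k in (1, 3, 20, 120, 244):
        print(k, run_experiment(k=k, n_runs=10))
```

Under these assumptions, varying `k`, `method` and `n_samples` mirrors the three experimental factors in the abstract; the numbers produced by the sketch say nothing about the thesis's actual results.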

Resource Type

Doctoral thesis

DOI

10.48730/7kr8-7d57

Date Created

2014

Former identifier

1039515

Relations

Items

Thumbnail	Title	Date Uploaded	Visibility	Actions
	PDF of thesis T13852	2021-07-02	Public	Download
