Analyzing Metabolomics Data

Analysis starts with obtaining data. Several metabolomic databases are available, each serving a different purpose. Their shared goal is to organize metabolites so that researchers can easily find and analyze them.

The Human Metabolome Database (HMDB) has over 40,000 metabolite entries and aims to identify all the metabolites present in humans. MassBank is a spectral database with over 39,000 entries. METLIN, a database covering bacteria, animals, and plants, has over 75,000 entries. Lipid Metabolites and Pathways Strategy (LIPID MAPS) is the largest repository of lipid molecular structures. The Madison Metabolomics Consortium Database has over 20,000 entries and is a resource for mass spectrometry and nuclear magnetic resonance-based metabolomics research. Metabolic pathways are contained in the Kyoto Encyclopedia of Genes and Genomes (KEGG).

Types of Data Analysis Techniques

Two important factors in data analysis are the organization and the visualization of the data, which allow the data to be interpreted and hypotheses to be devised. Four major groups of techniques are available for analyzing metabolomics data:

Unsupervised learning method
Supervised learning methods
Pathway analysis methods
Time course data methods

Unsupervised Learning Method

During the analysis stage, we may want to get an idea of the data structure.

Unsupervised learning helps to explore the data; more precisely, it helps to discover trends in the data. The data are not labeled with any class, and the method finds structure on its own. It is therefore suited to cases where the researcher has little information or few assumptions about the data under analysis. As the first step in the analysis process, unsupervised learning also assists in visualizing the data. The following methods (PCA, k-means clustering, hierarchical clustering, and SOM) are the most frequently used for analyzing metabolites.

Principal component analysis (PCA): When the number of metabolites is large, finding a few linear combinations of the variables reduces the dimension of the data. Applying PCA to the original dataset shows how the total variation is distributed across these combinations. The first few principal components retain most of the information in the dataset and replace the correlated original variables. A score plot is used to find groups of samples, while a loading plot shows which variables separate the groups from each other.
Clustering: Clustering groups similar data, so that the data within one cluster are more alike than the data in different clusters. The two most widely used clustering techniques in metabolomics are k-means clustering and hierarchical clustering. In k-means clustering, the data are divided into k clusters that do not overlap. Unlike k-means clustering, hierarchical clustering does not stop at a specified number of clusters, but continues to split the data until a hierarchy of clusters is formed. It is often combined with a heat map to visualize the data matrix.
Self-organizing map (SOM): A visualization tool that assists in the visual discovery of clusters present in the data.
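
The PCA and k-means steps described above can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the function names (first_pc, kmeans) are hypothetical, the six-sample, three-metabolite matrix is invented, and only the first principal component is extracted, by power iteration on the covariance matrix. In practice a statistics package would normally be used instead.

```python
import math
import random

def first_pc(rows, iters=200):
    """First principal component of a samples x metabolites matrix.

    Returns (loadings, scores, fraction of total variance explained).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    X = [[x - m for x, m in zip(row, means)] for row in rows]  # center columns
    # Sample covariance matrix (p x p)
    cov = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    # Power iteration; assumes the start vector is not orthogonal to PC1
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                     for a in range(p))
    total_var = sum(cov[a][a] for a in range(p))
    scores = [sum(x * l for x, l in zip(row, v)) for row in X]
    return v, scores, eigenvalue / total_var

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: returns one cluster label per point."""
    def assign(centroids):
        return [min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(pt, centroids[c])))
                for pt in points]
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct samples as initial centroids
    for _ in range(iters):
        labels = assign(centroids)
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(centroids)

# Invented data: 6 samples x 3 metabolites, two visibly different groups
data = [
    [1.0, 2.0, 0.5], [1.2, 1.9, 0.4], [0.9, 2.1, 0.6],
    [5.0, 6.0, 0.5], [5.2, 6.1, 0.6], [4.9, 5.8, 0.4],
]
loadings, scores, explained = first_pc(data)
labels = kmeans(data, 2)
print("loadings:", loadings)
print("scores:", scores)
print("variance explained by PC1:", explained)
print("k-means labels:", labels)
```

In this toy example the scores play the role of a one-dimensional score plot (the two groups fall on opposite sides of zero), and the loadings show that the first two metabolites, not the third, drive the separation.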

Supervised Learning Method

Widely used in biomarker discovery, classification, and prediction, supervised learning methods deal with datasets that have response variables, which are either continuous or discrete. These methods model the association between covariates and response variables so that predictions can be made.

Partial least squares (PLS) is the method most often used in metabolomics research. It is widely used for identifying biomarkers and classifying diseases, while the support vector machine (SVM) is used in cancer research.
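
As a rough illustration of how PLS links covariates to a response, the sketch below fits a single-component PLS model to one response variable in plain Python. The function name pls1_one_component, the toy concentrations, and the 0/1 coding of the two groups are assumptions made for this example; real analyses typically use several components, cross-validation, and an established library.

```python
import math

def pls1_one_component(X, y):
    """Single-component PLS regression of y on X.

    Returns (weights, predict), where weights has one entry per metabolite
    and predict maps a new sample to a predicted response.
    """
    n, p = len(X), len(X[0])
    x_means = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    Xc = [[v - m for v, m in zip(row, x_means)] for row in X]  # centered X
    yc = [v - y_mean for v in y]                               # centered y
    # Weight vector: direction in metabolite space with maximal covariance with y
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: projection of each sample onto the weight vector
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    # Regress y on the scores
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)

    def predict(row):
        score = sum((v - m) * wj for v, m, wj in zip(row, x_means, w))
        return y_mean + q * score

    return w, predict

# Invented data: 6 samples x 3 metabolites; y codes two groups as 0 and 1
X = [
    [1.0, 5.0, 3.0], [1.2, 4.8, 3.1], [0.9, 5.1, 2.9],
    [3.0, 5.0, 3.0], [3.1, 4.9, 3.1], [2.9, 5.2, 2.9],
]
y = [0, 0, 0, 1, 1, 1]
weights, predict = pls1_one_component(X, y)
print("weights:", weights)
print("prediction for a group-0-like sample:", predict([1.0, 5.0, 3.0]))
print("prediction for a group-1-like sample:", predict([3.0, 5.0, 3.0]))
```

The metabolite with the largest absolute weight (the first one in this toy data) is the one most associated with the response, which is the sense in which PLS can point to candidate biomarkers. Thresholding the prediction at 0.5 gives a simple classifier.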

Pathway Analysis Methods

Pathway analysis helps to find the biological mechanisms in the list of identified metabolites. The two most common methods are 1) over-representation analysis (ORA) and 2) functional class scoring (FCS).

ORA is the simplest method, which is performed when the pathways differ considerably between the two study groups. Some limitations of ORA are addressed by the functional class scoring (FCS) method. In FCS, single-metabolite statistics are obtained first and then aggregated into a pathway-level statistic, which is either univariate or multivariate. The enrichment score, mean, and median are most often used as univariate pathway-level statistics. Hotelling’s T² statistic is widely used as a multivariate statistic.
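
ORA is commonly carried out as a one-sided hypergeometric test (equivalent to Fisher's exact test): given how many metabolites were measured, how many belong to a pathway, and how many were flagged as significant, it asks how likely the observed overlap would be by chance. The sketch below assumes this formulation; the function name ora_p_value and all counts are hypothetical.

```python
from math import comb

def ora_p_value(n_background, n_in_pathway, n_selected, n_overlap):
    """Probability of seeing at least n_overlap pathway members among
    n_selected metabolites drawn at random from n_background."""
    total = comb(n_background, n_selected)
    upper = min(n_in_pathway, n_selected)
    favorable = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return favorable / total

# Invented counts: 100 measured metabolites, 10 of them in the pathway,
# 20 flagged as significant, 6 of those 20 in the pathway
p = ora_p_value(100, 10, 20, 6)
print("ORA p-value:", p)
```

With these counts only about two pathway members would be expected among the flagged metabolites by chance, so an overlap of six yields a small p-value. When many pathways are tested, the resulting p-values would normally be corrected for multiple testing.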

Time Course Data Methods

Metabolite concentrations may vary with time, which adds a time dimension to the dataset. With this dimension included, unsupervised learning methods and visualization tools such as PCA and SOM can still be used. Additionally, profile graphs can be drawn to check the profiles of the metabolites in the various clusters. Statistical techniques such as analysis of variance (ANOVA)-based models are used to compare the different change patterns of the metabolites.
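
The simplest ANOVA-based check is a one-way ANOVA on a single metabolite, with the time points as groups. The sketch below computes only the F statistic and rests on a strong simplifying assumption: the samples at each time point are treated as independent, so it is not a repeated measures model. The function name one_way_anova_f and the concentrations are invented for illustration.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA; each group holds the values
    measured for one metabolite at one time point."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the time-point means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the samples around their own time-point mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Invented concentrations of one metabolite at three time points
changing = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
flat = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
f_changing = one_way_anova_f(changing)
f_flat = one_way_anova_f(flat)
print("F for the changing metabolite:", f_changing)
print("F for the flat metabolite:", f_flat)
```

A large F value indicates that the mean concentration differs between time points more than the within-time-point scatter would explain; comparing it with an F distribution with (k - 1, N - k) degrees of freedom gives a p-value.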

In metabolomics research, differences have been observed among metabolites within the same cluster. When this variation is large, a repeated measures (RM) model is used. When many metabolites must be analyzed in parallel while taking their correlation structure into account, a generalization of ANOVA called ANOVA-simultaneous component analysis (ASCA) is used.

Methods such as time series analysis, functional-based methods, smoothing splines mixed effects models, and hierarchical linear models have also been suggested.

Sources

https://www.nap.edu/read/23414/chapter/1#12
https://dragon.cchmc.org/Publications/15Metabolomics_RenS.pdf
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC43504
http://www.sciencedirect.com/science/article/pii/S2001037014600520
http://www.massbank.jp
http://www.hmdb.ca/
https://metlin.scripps.edu
http://www.lipidmaps.org
http://mmcd.nmrfam.wisc.edu
http://www.genome.jp/kegg/

Afsaneh Khetrapal