How to cite this WIREs title:
WIREs Data Mining Knowl Discov
Impact Factor: 2.111

Gene expression modular analysis: an overview from the data mining perspective

In this review, we discuss the main problems and state‐of‐the‐art solutions in the analysis of gene expression data. Specific data analysis workflows have been developed in parallel with the technology and now cover a very wide spectrum of methods and applications needed to answer the many scientific questions that this type of data raises. Computer science and, more specifically, data mining still benefit from a large set of real‐case scenarios in which to apply and develop new ideas and tools for discovering biological knowledge and new information from these experimental data. In this article, we make the reader aware of the main problems that still persist and describe the methodologies applied for classification, clustering, and functional exploration of gene expression data. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 381–396 DOI: 10.1002/widm.29

Figure 1.

Schematic representation of the gene expression matrix extracted from a series of DNA microarrays. (a) Expression profile of a gene: each gene is a spot on each microarray and is represented as a vector whose components are the intensity values from each chip. (b) Representation of the molecular profile of a sample or experimental condition measured in each microarray. In this case, the genes are the variables or components of the sample vector.
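
The two views in this caption can be sketched with a small array. This is a minimal illustration, assuming a genes-by-chips layout; the intensity values and the identifiers `gene_a` ... `chip_3` are invented for the example and do not come from the article.

```python
import numpy as np

# Hypothetical toy matrix: 4 genes (rows) x 3 microarrays (columns).
# The values stand in for spot intensities and are invented for illustration.
expression = np.array([
    [7.1, 6.8, 9.4],
    [5.0, 5.2, 5.1],
    [8.3, 2.9, 8.0],
    [3.3, 3.1, 6.7],
])
gene_ids = ["gene_a", "gene_b", "gene_c", "gene_d"]
sample_ids = ["chip_1", "chip_2", "chip_3"]

# (a) Expression profile of one gene: a row vector across all chips.
gene_profile = expression[gene_ids.index("gene_c"), :]

# (b) Molecular profile of one sample: a column vector across all genes.
sample_profile = expression[:, sample_ids.index("chip_2")]
```

Clustering genes then means comparing rows, while classifying or clustering samples means comparing columns.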

Figure 2.

Agglomerative hierarchical clustering representation of a gene expression matrix. The heatmap represents the expression levels of each gene (rows) across the different samples or experimental conditions (columns). The vertical and horizontal dendrograms reflect the clustering structure of the dataset and provide a visual intuition about the number of groups. Manual inspection and exploration are needed to select the final number of clusters, guided by the visual representation of the dendrograms. In this case, we have set eight gene clusters and two sample clusters, which are depicted in different colors.
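
A two-way clustering of this kind might be sketched with SciPy as follows. The random matrix, the average linkage, and the correlation distance are assumptions chosen for illustration; the caption does not state which linkage or metric was used for the figure. Only the cluster counts (eight and two) come from the caption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))  # synthetic stand-in: 60 genes x 10 samples

# Build one tree for the genes (rows) and one for the samples (columns).
# Linkage method and metric are illustrative assumptions.
gene_tree = linkage(X, method="average", metric="correlation")
sample_tree = linkage(X.T, method="average", metric="correlation")

# Cut each dendrogram into at most the requested number of groups,
# mirroring the manual choice of eight gene and two sample clusters.
gene_clusters = fcluster(gene_tree, t=8, criterion="maxclust")
sample_clusters = fcluster(sample_tree, t=2, criterion="maxclust")
```

The trees themselves can be drawn with `scipy.cluster.hierarchy.dendrogram`, which is the usual way to inspect the structure before choosing where to cut.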

Figure 3.

k‐means clustering results. In this example, six clusters were extracted with the k‐means algorithm using a distance based on the Pearson correlation coefficient. The Y‐axis represents the logarithm of the expression ratio and the X‐axis represents the samples (seven in this case). Each cluster is represented in a different color.
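
One common way to combine k‐means with a Pearson‐based distance is to center each profile and scale it to unit length, because the squared Euclidean distance between two such profiles equals 2(1 − r). The sketch below relies on that identity; whether the figure was produced this way is not stated in the source. The synthetic data, the deterministic farthest‐point initialization, and the use of two clusters instead of six are assumptions made to keep the example small.

```python
import numpy as np

def standardize_rows(X):
    """Center each row and scale it to unit length, so that squared
    Euclidean distance between rows equals 2 * (1 - Pearson r)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(Xc, axis=1, keepdims=True)
    return Xc / np.where(norms == 0, 1.0, norms)

def farthest_point_init(Z, k):
    """Deterministic seeding (an assumption of this sketch): start from
    row 0, then repeatedly add the row farthest from the chosen seeds."""
    chosen = [0]
    d2 = ((Z - Z[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = int(d2.argmax())
        chosen.append(nxt)
        d2 = np.minimum(d2, ((Z - Z[nxt]) ** 2).sum(axis=1))
    return Z[chosen].copy()

def kmeans(Z, k, n_iter=100):
    """Plain Lloyd iterations: assign to nearest centroid, recompute means."""
    centroids = farthest_point_init(Z, k)
    labels = np.full(len(Z), -1)
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's centroid as is
                centroids[j] = Z[labels == j].mean(axis=0)
    return labels, centroids

# Synthetic log-ratio profiles over seven samples: 20 rising, 20 falling.
rng = np.random.default_rng(0)
trend = np.linspace(-1.0, 1.0, 7)
up = trend + rng.normal(scale=0.1, size=(20, 7))
down = -trend + rng.normal(scale=0.1, size=(20, 7))
X = np.vstack([up, down])

Z = standardize_rows(X)
labels, centroids = kmeans(Z, k=2)
```

Because the profiles are standardized first, genes with the same shape but different amplitude fall into the same cluster, which is usually the intent when correlation is chosen over raw Euclidean distance.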

Figure 4.

Three‐dimensional scatterplot of a gene expression dataset projected onto its first three principal components, calculated using principal component analysis (PCA). Colors represent the clusters estimated by the k‐means algorithm (Figure 3).
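
The projection behind such a plot can be sketched with a singular value decomposition of the column‐centered matrix. The random data and the helper name `pca_scores` are assumptions for illustration; genes are treated as the points and samples as the variables, as in the figure.

```python
import numpy as np

def pca_scores(X, n_components=3):
    """Project the rows of X onto the leading principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)          # center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # same as Xc @ Vt[:n].T
    explained = (S ** 2) / (S ** 2).sum()            # variance fraction per PC
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))  # synthetic stand-in: 100 genes x 7 samples
scores, explained = pca_scores(X)
```

Each row of `scores` gives the three coordinates of one gene in the scatterplot, and `explained` indicates how much of the total variance those three axes retain.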

Figure 5.

Representation of a 7 × 5 self‐organizing map (SOM) applied to a gene expression matrix. Gene expression profiles are represented in each node (code vector). Note the large homogeneity within each code vector and the similarity of neighboring nodes. This smooth distribution of nodes across the map is precisely one of the most attractive features of SOM. Clusters can now be defined by selecting a set of adjacent profiles.
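
The smoothness of neighboring nodes comes from the update rule: each training profile pulls its best‐matching node and, to a lesser extent, the nodes around it on the grid. The following is a minimal online SOM sketch under stated assumptions; the Gaussian neighborhood, the linear decay schedules, the parameter values, and the random data are illustrative choices, not the settings used for the figure. Only the 7 × 5 grid size is taken from the caption.

```python
import numpy as np

def train_som(X, rows=7, cols=5, n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Online SOM training; returns code vectors shaped (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    # Grid coordinates of every node, flattened to (rows * cols, 2).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize code vectors from randomly chosen data rows.
    codebook = X[rng.integers(0, len(X), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                   # learning rate decays to 0
        sigma = sigma0 + (0.5 - sigma0) * frac    # radius shrinks to 0.5
        x = X[rng.integers(len(X))]               # one random profile
        bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))  # best-matching unit
        grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))  # neighborhood weights
        codebook += lr * h[:, None] * (x - codebook)
    return codebook.reshape(rows, cols, -1)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 7))  # synthetic stand-in: 200 genes x 7 samples
som = train_som(X)

# Assign each gene to its best-matching node (index into the flattened grid).
flat = som.reshape(-1, som.shape[-1])
gene_to_node = ((X[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
```

Genes mapped to the same or adjacent nodes have similar profiles, so a cluster can be read off as a contiguous patch of the grid.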

Figure 6.

Schematic representation of the biclustering process using the nonnegative matrix factorization (NMF) algorithm. A synthetic gene expression matrix X was generated with four clearly overlapping block structures over a random noisy background. Matrices W and H clearly identify the modules or biclusters. Factor 1, corresponding to the second bicluster in the original matrix, is depicted in blue.
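
The factorization X ≈ WH can be sketched with the classic multiplicative updates for the Frobenius objective. This is one standard NMF variant and is used here as an assumption; the caption does not say which variant produced the figure. The synthetic matrix below has two overlapping blocks rather than four, and the helper name `nmf` and its parameters are illustrative.

```python
import numpy as np

def nmf(X, k, n_iter=300, seed=0, eps=1e-9):
    """Multiplicative-update NMF minimizing ||X - W @ H|| (Frobenius norm)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k))  # genes x factors
    H = rng.random((k, m))  # factors x samples
    for _ in range(n_iter):
        # Updates keep W and H nonnegative; eps guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic matrix: two overlapping blocks over a noisy background.
rng = np.random.default_rng(1)
X = rng.random((30, 20)) * 0.2
X[0:15, 0:10] += 2.0
X[10:25, 8:18] += 2.0

W, H = nmf(X, k=2)
error = np.linalg.norm(X - W @ H)
```

A bicluster for a given factor can then be read by taking the genes with large entries in the corresponding column of W together with the samples with large entries in the matching row of H.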

Figure 7.

Illustrative example of concurrent enrichment analysis of biological annotations. The input query list is composed of 49 human genes, while the reference list contains 29,095 genes. The first row reads as follows: 8 of the 49 genes in the input list are annotated with both the forebrain development biological process and the plasma membrane cellular component according to Gene Ontology. Fifteen genes out of a total of 29,095 in the genome are also annotated with the same terms. We can then conclude with a high level of significance (corrected P value of 8.76e‐13) that these two functional annotations are enriched in the query list, shedding light on the interpretation of the experiment. Results were produced with the GENECODIS application sample file (http://genecodis.cnb.csic.es).
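
The reasoning in this caption can be illustrated with a hypergeometric tail probability: how likely is it to see at least 8 annotated genes when 49 genes are drawn at random from 29,095, of which 15 carry the annotation pair? The sketch below is an assumption about how such a test might look, not a description of the GENECODIS implementation. It also assumes that the 15 genome‐wide genes include the 8 in the query list, and it returns an uncorrected P value, so it is not expected to reproduce the corrected value quoted in the caption. The function name is hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_ref, n_ref_annotated, n_query, n_query_annotated):
    """Upper-tail probability P(X >= n_query_annotated) when n_query genes
    are drawn without replacement from n_ref genes, of which
    n_ref_annotated carry the annotation."""
    total = comb(n_ref, n_query)
    upper = min(n_query, n_ref_annotated)
    tail = sum(
        comb(n_ref_annotated, k) * comb(n_ref - n_ref_annotated, n_query - k)
        for k in range(n_query_annotated, upper + 1)
    )
    return tail / total  # exact integer arithmetic until this final division

# Counts taken from the first row of the figure.
p_raw = hypergeom_enrichment_p(
    n_ref=29095, n_ref_annotated=15, n_query=49, n_query_annotated=8
)
```

In practice many annotation combinations are tested at once, so the raw value would still need a multiple‐testing correction before being interpreted.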


Browse by Topic

Application Areas > Science and Technology
Technologies > Classification
Algorithmic Development > Biological Data Mining
