How to cite this WIREs title:
WIREs Comp Stat

A practical guide to text mining with topic extraction


Text analytics continue to proliferate as mass volumes of unstructured but highly useful data are generated at unbounded rates. Vector space models for text data, in which documents are represented by rows and words by columns, provide a translation of this unstructured data into a format that may be analyzed with statistical and machine learning techniques. This approach gives excellent results in revealing common themes, clustering documents, clustering words, and in translating unstructured text fields (such as an open‐ended survey response) to usable input variables for predictive modeling. After discussing the collection and processing of text, we explore properties and transformations of the document‐term matrix (DTM). We show how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation. This latent semantic analysis (LSA) approach produces factors that are compatible with graphical exploration and advanced analytics. We also explore Latent Dirichlet Allocation for topic analysis. We reference published R packages to implement the methods and conclude with a summary of other popular open‐source and commercial software packages.

WIREs Comput Stat 2015, 7:326–340. doi: 10.1002/wics.1361

This article is categorized under:
Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification
Statistical Learning and Exploratory Methods of the Data Sciences > Pattern Recognition
Statistical Learning and Exploratory Methods of the Data Sciences > Text Mining
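
The pipeline summarized in the abstract (document‐term matrix, truncated singular value decomposition, varimax rotation) can be sketched in a few lines. The article itself points to R packages; the Python/NumPy version below is only an illustrative sketch, not the authors' implementation. The toy corpus, the helper names `document_term_matrix` and `varimax`, the choice of `k = 2` factors, and the use of raw counts without stop‐word removal or weighting are all assumptions made for the example.

```python
import numpy as np

def document_term_matrix(docs):
    """Raw-count DTM: one row per document, one column per word (hypothetical helper)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    dtm = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            dtm[i, index[w]] += 1
    return dtm, vocab

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation via the usual SVD-based iteration."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = (rotated ** 2).sum(axis=0)  # column sums of squared loadings
        grad = loadings.T @ (rotated ** 3 - rotated @ np.diag(col_ss) / p)
        u_g, s_g, vt_g = np.linalg.svd(grad)
        rotation = u_g @ vt_g
        new_criterion = s_g.sum()
        if new_criterion < criterion * (1 + tol):
            break  # no meaningful improvement
        criterion = new_criterion
    return loadings @ rotation, rotation

# Toy corpus (assumed for illustration): two genetics and two algorithm documents.
docs = [
    "gene protein dna sequence analysis",
    "dna gene expression protein",
    "algorithm graph complexity analysis",
    "graph algorithm computation network",
]

dtm, vocab = document_term_matrix(docs)

# Thin SVD of the DTM; keep the first k factors.
u, s, vt = np.linalg.svd(dtm, full_matrices=False)
k = 2
term_loadings = vt[:k].T * s[:k]            # words x k
rotated, rotation = varimax(term_loadings)  # rotated word loadings
doc_scores = u[:, :k] @ rotation            # documents x k, same rotation applied

# List the words with the largest absolute loading on each rotated factor.
for j in range(k):
    top = np.argsort(-np.abs(rotated[:, j]))[:4]
    print(f"factor {j}:", [vocab[t] for t in top])
```

Because the rotation is orthogonal, the rank‐`k` approximation of the DTM is unchanged; only the axes are re‐oriented so that each factor loads heavily on fewer words, which is what makes the rotated factors easier to read as topics.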
Word frequency for National Science Foundation corpus with first four terms ‘the’, ‘of’, ‘and’.
Classification tree for fatality response variable with words as factors.
Records associated with low document loadings for Topic 36.
Singular value decomposition factors most associated with fatalities.
Stand‐alone topics from singular value decomposition factors using varimax rotation.
Plot of first two varimax rotated singular value decomposition factors clearly showing the genetic and also the computer algorithm research.
Plot of first two singular value decomposition factors for National Science Foundation abstracts showing no real insights.
Sparse National Science Foundation document‐term matrix reduced to a dense three‐dimensional representation.
Portion of the very sparse National Science Foundation document‐term matrix.
Distribution of word frequency by documents.

