How to cite this WIREs title:
WIREs Data Mining Knowl Discov
Impact Factor: 2.541

Seeing beyond reading: a survey on visual text analytics


Abstract We review recent visualization techniques aimed at supporting tasks that require the analysis of text documents, from approaches targeted at visually summarizing the relevant content of a single document to those aimed at assisting exploratory investigation of whole collections of documents. Techniques are organized according to their target input material (either single texts or collections of texts) and their focus, which may be on displaying content, emphasizing relevant relationships, highlighting the temporal evolution of a document or collection, or helping users to handle results from a query posed to a search engine. We describe the approaches adopted by distinct techniques and briefly review the strategies they employ to obtain meaningful text models. We also discuss how they extract the information required to produce representative visualizations, the tasks they intend to support, the interaction issues involved, and their strengths and limitations. Finally, we present a summary of techniques, highlighting their goals and distinguishing characteristics, and briefly discuss some open problems and research directions in the fields of visual text mining and text analytics. © 2012 Wiley Periodicals, Inc. This article is categorized under: Algorithmic Development > Text Mining; Technologies > Visualization

Tag‐cloud visual metaphor for the testimony of William Jefferson ‘Bill’ Clinton on his impeachment trial. (a) TagCrowd visual representation. (b) Wordle visual representation. The size of the font maps the frequency of the corresponding term occurring in the testimony, with larger fonts indicating more frequent terms. Images generated with the IBM Many Eyes visualization system (http://www‐958.ibm.com) accessed on November 7, 2011.
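
The frequency-to-font-size mapping described in this caption can be sketched as follows. This is a minimal illustration, not the method used by TagCrowd or Wordle: the linear scaling, the point-size range, and the function names `term_frequencies` and `font_sizes` are all assumptions made for the example.

```python
import re
from collections import Counter


def term_frequencies(text, stopwords=frozenset()):
    """Count lowercase word occurrences, skipping any stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stopwords)


def font_sizes(freqs, min_pt=10, max_pt=48):
    """Linearly map each term's frequency to a font size in points.

    The linear scale and the 10-48 pt range are illustrative assumptions.
    """
    if not freqs:
        return {}
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {
        term: min_pt + (count - lo) * (max_pt - min_pt) / span
        for term, count in freqs.items()
    }
```

The most frequent term receives `max_pt` and the least frequent receives `min_pt`; a logarithmic scale is a common alternative when a few terms dominate the counts.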

TileBars: visualization of the results of a search on medical documents. Each document appears as a rectangular icon composed of colored bars, spatially placed to indicate the frequencies and distribution of the query terms in the document. Squares in darker colors indicate higher frequencies of a particular query term set. (Reproduced with permission from Ref 46. Copyright 1995 ACM.)
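
The grid behind an icon of this kind can be sketched as a matrix with one row per query term set and one column per text segment. This is a simplified sketch under stated assumptions: segments are fixed-length word windows chosen for illustration, and the names `tilebar_grid` and `shade` are hypothetical.

```python
import re


def tilebar_grid(text, term_sets, segment_len=50):
    """Return counts[i][j]: hits of term set i in text segment j.

    Segments are fixed-length word windows (an assumption for this sketch).
    """
    words = re.findall(r"[a-z]+", text.lower())
    segments = [
        words[i:i + segment_len] for i in range(0, len(words), segment_len)
    ]
    return [
        [sum(word in terms for word in segment) for segment in segments]
        for terms in term_sets
    ]


def shade(count, max_count):
    """Map a count to a darkness level in [0, 1]; 1.0 is the darkest square."""
    return count / max_count if max_count else 0.0
```

For example, `tilebar_grid(doc, [{"cancer"}, {"therapy", "trial"}])` yields two rows of per-segment counts, and passing each count through `shade` gives the darkness of the corresponding square.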

Action Science Explorer (ASE): tool presenting multiple coordinated views of research papers in a particular field, including tables of papers, full texts, text summaries, and visualizations of the citation network and its groups. (Reproduced with permission from Ref 42. Copyright 2012 American Society for Information Science and Technology.)

Streamit: dynamic document map for a collection of abstracts describing projects funded by the US National Science Foundation Information and Intelligent Systems award between March 2000 and August 2003, generated with a dynamic force‐directed projection. Given latent Dirichlet allocation topics extracted in a preprocessing step, documents that match specific user‐selected topics are presented as pie charts, with slice sizes indicating the topic's weight in the corresponding document. Circle sizes represent the amount of funding to the project. Topical events are discovered with a dynamic clustering approach: (a) September 2000—red pie slice represents topic 16 (Query, Database, Data, XML, Stream, Edu) and green slices represent topic 19 (Data, Workflow, Privacy, Management, Web, Metadata); (b) September 2001—clusters 1 and 2 from Figure 7(a) have merged into cluster 3. Clusters 4 and 5 are new. (Reproduced with permission from Ref 38. Copyright 2012 IEEE.)

TextFlow: topic flows for scientific articles published in IEEE InfoVis from 2001 to 2010. Like ThemeRiver, TextFlow employs a metaphor based on river ‘streams’ to represent how the strength of different topics varies over time within a document collection. It adds extra visual marks to represent events associated with topics, such as topic birth, split, merge, and death. In this example, the event marked as d indicates that the topic document/temporal became a major topic in this collection around 2009. (Reproduced with permission from Ref 32. Copyright 2011 IEEE.)

ThemeRiver: visualization showing documents about the Cuban Missile Crisis, from December 1959 through June 1961. In this representation, the major topics addressed in the document collection are shown as colored ‘streams’, with stream width indicating the topic's strength at a certain moment. (Reproduced with permission from Ref 29. Copyright 2002 IEEE.)
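
The idea of stream width encoding topic strength over time can be sketched as below. This is an illustration only: measuring strength as the number of documents per period that contain a topic keyword, stacking the bands symmetrically around a central axis, and the names `topic_strengths` and `stream_bands` are assumptions for this example rather than details taken from the caption.

```python
from collections import defaultdict


def topic_strengths(docs, topics):
    """Count, per period, the documents mentioning each topic.

    docs: iterable of (period, text) pairs.
    topics: dict mapping a topic name to a set of lowercase keywords.
    """
    strengths = defaultdict(lambda: dict.fromkeys(topics, 0))
    for period, text in docs:
        words = set(text.lower().split())
        for name, keywords in topics.items():
            if words & keywords:
                strengths[period][name] += 1
    return dict(sorted(strengths.items()))


def stream_bands(strengths_at_period):
    """Stack topic widths around a central axis.

    Returns {topic: (lower, upper)} for a single period.
    """
    y = -sum(strengths_at_period.values()) / 2
    bands = {}
    for name, width in strengths_at_period.items():
        bands[name] = (y, y + width)
        y += width
    return bands
```

Computing `stream_bands` for each period in turn and joining the band boundaries across periods gives the outline of each stream; a real implementation would also smooth those boundaries.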

Document maps of a collection of scientific papers obtained with multidimensional projection techniques. (a) Least square projection (LSP) representation. (b) Hierarchical point placement (HiPP) representation. In the LSP map, circles represent documents and are placed so that circle proximity reflects the similarity among the corresponding documents. In the HiPP map, circles represent groups of similar documents and proximity reflects the similarity between the groups. Both maps are annotated with automatically extracted topics, and the colors reflect an existing classification of the documents. (Reproduced with permission from Refs 20 and 24. Copyright 2008 IEEE.)

History flow: this visualization highlights the temporal patterns of edits made by different authors to the Wikipedia entry about Microsoft. It shows each version of the target document as a vertical ‘revision line’, formed by several colored sections and with length proportional to the length of the corresponding text. Each author is assigned a different color, and the sections of each revision line are colored according to their original author. Text sections that have been preserved across consecutive versions are visually linked. (Reproduced with permission from Ref 15. Copyright 2004 ACM.)

Different visualizations that convey semantic relationships among terms occurring in the testimony of William Jefferson ‘Bill’ Clinton on his impeachment trial. (a) Word Tree representation. (b) Phrase Net representation. In the Word Tree, sequential terms in the text are linked, enabling users to navigate the text by selecting words and checking all sentences in which they occur. The Phrase Net creates a graph in which nodes correspond to terms and edges correspond to user‐specified relationships. In this example, the word ‘is’ defines the relationship connecting the terms. Images generated with the IBM Many Eyes visualization system (http://www‐958.ibm.com) accessed on November 7, 2011.
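
The graph underlying a Phrase Net of this kind can be sketched by matching a ‘X is Y’ pattern and counting the resulting term pairs as weighted edges. This is a rough sketch, not the Many Eyes implementation: the regular expression and the name `phrase_net_edges` are assumptions, and because matches do not overlap, a chain such as ‘A is B is C’ contributes only the first pair.

```python
import re
from collections import Counter


def phrase_net_edges(text, connector="is"):
    """Count directed (left, right) term pairs joined by the connector word."""
    pattern = re.compile(
        rf"\b(\w+)\s+{re.escape(connector)}\s+(\w+)\b", re.IGNORECASE
    )
    return Counter(
        (left.lower(), right.lower()) for left, right in pattern.findall(text)
    )
```

Each key of the returned counter is an edge from the left term to the right term, and its count can drive the edge thickness; changing `connector` to another word such as ‘and’ or ‘of’ selects a different relationship.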
