How to cite this WIREs title:
WIREs Data Mining Knowl Discov
Impact Factor: 4.476

A new anchor word selection method for the separable topic discovery



Abstract Separable nonnegative matrix factorization (SNMF) is an important method for topic modeling, where "separable" assumes that every topic contains at least one anchor word, defined as a word that has non‐zero probability only on that topic. SNMF exploits word co‐occurrence patterns to reveal topics in two steps: anchor word selection and topic recovery. The quality of the anchor words strongly influences the quality of the extracted topics. Existing anchor word selection algorithms greedily find an approximate convex hull in a high‐dimensional word co‐occurrence space. In this work, we propose a new anchor word selection method that associates the word co‐occurrence probability with word similarity and assumes that the semantically most dissimilar words are potential anchor word candidates. Therefore, if the similarity of a word pair is very low, the two words are very likely to be anchor words. From the statistical information of a text corpus, we can compute the similarity of all word pairs. We build a word similarity graph in which nodes correspond to words and edge weights stand for word‐pair similarity. We then design a greedy method that finds a minimum edge‐weight anchor clique of a given size in this graph for anchor word selection. Extensive experiments on real‐world corpora demonstrate the effectiveness of the proposed anchor word selection method, which outperforms the common convex hull‐based methods on the quality of the revealed topics. Meanwhile, our method is much faster than typical SNMF‐based methods. This article is categorized under: Algorithmic Development > Text Mining; Technologies > Machine Learning
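The two steps described in the abstract (building a word similarity graph, then greedily growing a minimum edge-weight K-clique) can be sketched as follows. This is a minimal illustration, not the paper's implementation: cosine similarity of word co-occurrence rows is an assumed similarity measure, and the seed-and-grow heuristic (start from the least-similar pair, then add the word with the smallest summed similarity to the current clique) is one plausible reading of the greedy minimum edge-weight clique search.

```python
import numpy as np

def greedy_anchor_clique(X, K):
    """Greedy sketch of minimum edge-weight anchor clique selection.

    X: (V, M) word-document count matrix (V words, M documents).
    K: desired number of anchor words (one per topic).
    Returns indices of K words forming a low-similarity clique.
    """
    # Stage 1: build the word similarity graph. Cosine similarity of
    # word co-occurrence rows is an assumed stand-in for the paper's
    # similarity derived from corpus statistics.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    Xn = X / norms
    S = Xn @ Xn.T                      # (V, V) similarity matrix
    np.fill_diagonal(S, np.inf)        # exclude self-similarity

    # Stage 2: greedily grow a K-clique with minimum total edge weight.
    # Seed with the least-similar word pair, then repeatedly add the
    # word whose summed similarity to the current clique is smallest.
    i, j = np.unravel_index(np.argmin(S), S.shape)
    clique = [int(i), int(j)]
    while len(clique) < K:
        cost = S[clique].sum(axis=0)   # summed similarity to the clique
        cost[clique] = np.inf          # do not pick a member twice
        clique.append(int(np.argmin(cost)))
    return clique
```

On a toy corpus where three words never co-occur with each other but a fourth word overlaps with two of them, the sketch selects the three mutually dissimilar words as anchors, matching the intuition that anchor words are the semantically most different ones.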
Our method consists of two stages: converting X_{V×M} to the word similarity graph (subsection The Word Similarity Graph) and finding K‐cliques (subsection Finding Cliques in the Word Similarity Graph)
An example illustrating that there may be a hidden topic model apart from the observed one. The documents may have five topics: sport, health, society, economy, and democracy. Sport may be divided into competition and exercise, and the other topics could be divided in the same way, so we may get another five different topics: exercise, competition, disease, politics, and law
An example illustrating that there could be different anchor words for a topic. We have chosen five topics: sport, entertainment, economy, politics, and health. Clique a and clique b do not have any anchor words in common, while clique c mixes some anchor words from cliques a and b. Nevertheless, each of them could classify the documents correctly
The running time of FAW and SC on the KOS corpus

