How to cite this WIREs title: WIREs Data Mining Knowl Discov (Impact Factor: 4.476)

Online streaming feature selection with incremental feature grouping



Abstract

Today, the dimensionality of data is increasing massively, so traditional feature selection techniques are not directly applicable. Consequently, recent research has developed a more efficient approach that selects features from a feature stream, known as streaming feature selection. Another active research area related to feature selection is feature grouping, which selects relevant features by evaluating the hidden information shared among features. However, although feature grouping is a promising technique, it is not directly applicable to feature streams. In this paper, we propose a novel and efficient algorithm that uses online feature grouping, embedded within a new incremental technique, to select features from a feature stream. The technique groups similar features together: each new incoming feature is either assigned to an existing group or placed in a new one. To the best of our knowledge, this is the first approach to use incremental feature grouping to perform feature selection from a feature stream. We have implemented the algorithm and evaluated it on benchmark datasets against state-of-the-art streaming feature selection algorithms that use feature grouping or incremental selection techniques. The results show that, by combining online selection and grouping, the proposed technique achieves superior prediction accuracy and running time.

This article is categorized under:

Algorithmic Development > Spatial and Temporal Data Mining
Technologies > Data Preprocessing
Technologies > Classification
Technologies > Machine Learning
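Read at face value, the abstract suggests a simple online loop over the feature stream. The following is a minimal sketch of that loop; the helper names (is_relevant, assign_to_group) and the group interface are hypothetical placeholders, not the authors' actual code:

```python
# Minimal sketch of the streaming selection loop described in the abstract.
# is_relevant and assign_to_group are placeholder callables; each group object
# is assumed to expose its most representative feature.
def stream_select(feature_stream, is_relevant, assign_to_group):
    groups = []                         # grows as the stream is processed
    for f in feature_stream:            # features arrive one at a time
        if is_relevant(f):              # keep only candidate-relevant features
            assign_to_group(f, groups)  # join an existing group or open a new one
    # final subset: one most representative feature per group
    return [g.representative for g in groups]
```

Because the model is maintained incrementally, the selected subset can be read off at any point in the stream, not only after all features have arrived.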
Data stream and feature selection categories
The chart shows the effect of q on decision-tree accuracy on the Madelon dataset. Accuracy is highest for q in the range (0.5, 2); beyond q = 2.5 it drops, stabilizing after q = 3
The chart shows the quality of the group distribution. q yields better results within the range (0.1, 2); outside this range, the group distribution is almost unchanged. Within this range, informative features are distributed better and a moderate total number of groups is created
Sensitivity analysis of the gain-ratio threshold parameter on the Arcene dataset
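The caption above refers to a gain-ratio threshold used to decide whether an arriving feature is relevant. For reference, here is a minimal sketch of the standard information-gain ratio for a discretized feature against class labels; the thresholding rule itself is an assumption, not the paper's stated procedure:

```python
# Minimal gain-ratio sketch for a discrete feature versus class labels.
# The relevance rule (keep f if gain_ratio >= threshold) is an assumption.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature, labels):
    n = len(labels)
    cond = 0.0   # conditional entropy H(labels | feature)
    split = 0.0  # split information H(feature)
    for v, cnt in Counter(feature).items():
        subset = [y for x, y in zip(feature, labels) if x == v]
        cond += (cnt / n) * entropy(subset)
        split -= (cnt / n) * log2(cnt / n)
    return (entropy(labels) - cond) / split if split > 0 else 0.0
```

A feature would then be kept only if gain_ratio(f, y) clears the threshold whose sensitivity the figure analyzes.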
Running-time comparison of the four competing approaches. The reported time covers streaming feature selection plus building the model used to report accuracy. In the case of our SFGS, the model is updated each time a new feature arrives
Accuracy results on MIMIC show the highest and most stable performance for our SFGS, relative to the competing approaches. SFGS performs well across most of the learning algorithms; the highest accuracy is achieved with random forest
Accuracy results on Madelon show the highest performance for our SFGS, relative to the competing approaches. SFGS performs well across the four learning algorithms; the highest accuracy, 68.02%, is achieved with random forest under 10-fold cross-validation
Accuracy results on Internet are shown only for our SFGS and the Alpha-Investing approach; the other two competing approaches failed to process the dataset and are therefore reported as zero accuracy
Accuracy results on Hiva are the closest among the four competing approaches; our SFGS shows the most stable performance
Accuracy results on Dorothea also show the highest performance for our SFGS relative to the competing approaches, achieving around 96%
Accuracy results on Arcene compare our SFGS with the competing approaches. SFGS achieves the highest accuracy across the four learning algorithms; the best result, 88.57%, is obtained with KNN under 10-fold cross-validation
The effect of q on generating the most representative subset: as q increases, the total number of generated groups decreases
An illustrative scenario in which the new candidate-relevant feature forms a new group by itself. Because dist(fi, G1) > q * AvgRad, fi is assigned to its own new group, of which it is also the centroid. Correspondingly, the most representative feature is updated
An illustrative scenario that involves adding a new candidate-relevant feature. Because dist(fi, G1) < q * AvgRad, fi is assigned to Group 1 and the group's centroid is redefined. Correspondingly, the most representative feature is updated
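The two scenarios illustrated above (join the nearest group when dist(fi, G1) < q * AvgRad, otherwise start a new group) could be realized roughly as follows. The centroid, radius, and representative computations here are one plausible reading of the captions, not the paper's exact definitions:

```python
import numpy as np

class Group:
    """A group of similar features, with a centroid and a representative."""
    def __init__(self, f):
        self.members = [f]
        self.centroid = f.copy()   # a singleton group is its own centroid
        self.representative = f    # ...and its own most representative feature

def assign_to_group(f, groups, q=2.0):
    """Join the nearest group if dist(f, G) < q * AvgRad, else open a new group."""
    if groups:
        dists = [np.linalg.norm(f - g.centroid) for g in groups]
        i = int(np.argmin(dists))
        # AvgRad read as the average member-to-centroid distance over all groups;
        # while every group is a singleton (radius 0) we bootstrap with the mean
        # distance from f to the centroids. Both choices are assumptions.
        radii = [np.mean([np.linalg.norm(m - g.centroid) for m in g.members])
                 for g in groups]
        avg_rad = np.mean(radii) if np.mean(radii) > 0 else np.mean(dists)
        if dists[i] < q * avg_rad:                    # scenario above: join
            g = groups[i]
            g.members.append(f)
            g.centroid = np.mean(g.members, axis=0)   # redefine the centroid
            g.representative = min(                   # update the representative
                g.members, key=lambda m: np.linalg.norm(m - g.centroid))
            return
    groups.append(Group(f))    # earlier scenario: feature starts its own group
```

Taking the representative as the member nearest the updated centroid is one simple choice; the paper may define "most representative" differently.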
The SFGS high-level design consists of three stages: (a) initialization; (b) online grouping, which assigns each new incoming relevant feature to a group; and (c) model update, which recalculates the groups' centroids and selects the final, most representative subset
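Tying the earlier sketches together, the three stages might be exercised as below. This assumes the gain_ratio and assign_to_group sketches above are in scope; the synthetic data, the 0.05 threshold, and the q value are placeholders, not values from the paper:

```python
# Hypothetical end-to-end run of the three stages, reusing the sketches above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))            # 100 samples; features stream column-wise
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels driven by the first two features

groups = []                               # stage (a): initialization
for j in range(X.shape[1]):               # stage (b): features arrive one by one
    f = X[:, j]
    bins = np.quantile(f, [0.25, 0.5, 0.75])          # discretize for gain ratio
    if gain_ratio(np.digitize(f, bins), y) >= 0.05:   # assumed relevance rule
        assign_to_group(f, groups, q=2.0)  # stage (c) happens inside: centroids
                                           # and representatives are updated
selected = [g.representative for g in groups]  # final representative subset
print(f"{len(groups)} groups, {len(selected)} features selected")
```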

