How to cite this WIREs title:
WIREs Data Mining Knowl Discov
Impact Factor: 2.541

Big data processing tools: An experimental performance evaluation


Big Data is currently a hot topic of research and development across several business areas, mainly due to recent innovations in information and communication technologies. One of the main challenges of Big Data is how to efficiently handle massive volumes of complex data. Because data collected from multiple sources is notoriously complex, and is typically gathered in increasing volumes at high velocity, efficient processing mechanisms are needed for data analysis. Motivated by the rapid growth in technology and in the development of tools and frameworks for Big Data, there is much discussion about Big Data querying tools and, specifically, about which are more appropriate for specific analytical needs. This paper describes and evaluates the following popular Big Data processing tools: Drill, HAWQ, Hive, Impala, Presto, and Spark. An experimental evaluation using the Transaction Processing Performance Council's TPC‐H benchmark is presented and discussed, highlighting the performance of each tool according to different workloads and query types.

This article is categorized under:
Technologies > Computer Architectures for Data Mining
Fundamental Concepts of Data and Knowledge > Big Data Mining
Technologies > Data Preprocessing
Application Areas > Data Mining Software Tools
Sample query set for 30 GB between the fastest tools
Sample query set for 10 GB between the fastest tools
Hadoop master/worker architecture
TPC‐H schema model. (Redrawn based on TPC‐H ())
Spark architecture. (Redrawn based on Laskowsky (2017))
Presto architecture. (Redrawn based on Apache Presto Overview (2018))
Impala architecture. (Redrawn based on Impala Overview ())
Hive architecture. (Redrawn based on Hortonworks (2018))
HAWQ architecture. (Redrawn based on Szegedi (2014))
Drillbit components. (Redrawn based on McDonald (2015))
Drill architecture. (Redrawn based on McDonald (2015))
Comparing total query execution time for 10, 30, and 100 GB
Increase in query execution time for 10, 30, and 100 GB
Total query execution time for 10, 30, and 100 GB
Sample query set for 100 GB

