DEV Community

Spark MLlib for Big data and Machine learning

D Siddhant Patro ・4 min read

In a world full of data, there is a good chance that you already know what Big data and Apache Spark are. If you don't, that's ok! I'll explain both, but before getting to big data and Spark, you need to understand what data is.

Data :- The quantities, characters, or symbols on which a computer performs operations. Data carries information, and it may be stored and transmitted as electrical signals and recorded on magnetic, optical, or mechanical recording media.

Now that you have an idea of what data is, it will be easier to understand what big data is.

Big data :- It is a collection of data that is huge in volume and high in complexity, often obtained from new data sources, and growing rapidly over time. These data sets are so voluminous that traditional data processing software cannot manage them.
It consists of three types of data: structured, semi-structured, and unstructured.

Machine learning :- It is an application of artificial intelligence (AI) that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed.

Apache Spark :- With an immense amount of data, we need a tool to digest it, and one such tool is Apache Spark. It is a fast, unified, open source engine for parallel data processing on computer clusters. It is designed to deliver the computational speed and scalability required for Big Data, specifically for streaming data, graph data, and machine learning applications.
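To make "parallel data processing" a little more concrete, here is a minimal plain-Python sketch of the general idea: split the data into partitions, process each partition independently, then combine the partial results. It does not use Spark, and it runs on threads in one machine rather than on a cluster. The sample lines and the function names are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, partition_count):
    """Distribute items round-robin across a fixed number of partitions."""
    return [items[i::partition_count] for i in range(partition_count)]


def count_words(lines):
    """Count words in one partition; this work is independent per partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


# Sample input, made up for this example.
lines = [
    "spark processes data",
    "data lives on a cluster",
    "spark splits data into partitions",
    "each partition is processed in parallel",
]

partitions = split_into_partitions(lines, 2)

# Process every partition at the same time, one worker per partition.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, partitions))

# Combine the per-partition results into one final answer.
total = sum(partial_counts, Counter())
print(total.most_common(2))
```

On a real cluster the partitions would live on different machines, and the engine would also handle scheduling and recovery from failures.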

Spark provides a unified data processing engine known as the Spark stack. This stack is built on top of a strong foundation called Spark Core, which provides the functionality needed to manage and run distributed applications, such as scheduling, coordination, and fault tolerance. The libraries available on top of Spark Core are Spark SQL, Spark Streaming, GraphX, Spark MLlib, and SparkR.

Spark SQL is for batch as well as interactive processing of structured data.
Spark Streaming is for real-time stream data processing.
GraphX is for graph processing.
Spark MLlib is for machine learning.
SparkR is for working with Spark from R, including from the R shell.

Spark MLlib is Spark's machine learning library. It simplifies the common tasks involved in building machine learning models, such as featurization, constructing pipelines, and evaluating and tuning models. Machine learning algorithms are iterative in nature, meaning they run through many iterations until a desired objective is achieved. Spark makes it easier to implement those algorithms and run them in a scalable manner across a cluster of machines.
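To illustrate what "iterative" means here, the following is a minimal plain-Python sketch of gradient descent fitting a straight line. It is not MLlib code and does not use Spark. The toy data, the learning rate, and the function name `fit_line` are assumptions chosen for illustration.

```python
def fit_line(points, learning_rate=0.05, max_iterations=5000, tolerance=1e-12):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    previous_loss = float("inf")
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        # Take one small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        loss = sum((w * x + b - y) ** 2 for x, y in points) / n
        # Stop once the loss has stopped improving: the objective is reached.
        if abs(previous_loss - loss) < tolerance:
            break
        previous_loss = loss
    return w, b, iteration


# Toy data generated from y = 2x + 1 (made up for this example).
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, iterations = fit_line(points)
print(f"w={w:.3f}, b={b:.3f} after {iterations} iterations")
```

Every iteration reads the whole data set again. That repeated pass over the data is the part that benefits from being spread across a cluster when the data is large.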

Spark MLlib tools are given below:-

  1. ML Algorithms
  2. Featurization
  3. Pipelines
  4. Model Tuning
  5. Persistence

ML Algorithms:-
ML Algorithms form the core of MLlib. These include common learning algorithms for classification, regression, clustering, and collaborative filtering. MLlib standardizes its APIs to make it easier to combine multiple algorithms into a single pipeline.

Featurization:-
Featurization includes feature extraction, transformation, dimensionality reduction, and selection.

  1. Feature Extraction is extracting features from raw data.
  2. Feature Transformation includes scaling, converting, and modifying features.
  3. Dimensionality Reduction is reducing the number of features while keeping most of the information they carry.
  4. Feature Selection involves selecting a subset of necessary features from a huge set of features.
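The extraction, transformation, and selection steps above can be sketched in plain Python as follows. This is only an illustration of the ideas, not MLlib's API (MLlib ships its own feature transformers for this kind of work). The vocabulary, the sample documents, and all function names are made up.

```python
from collections import Counter


def extract_features(text, vocabulary):
    """Feature extraction: turn raw text into word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def min_max_scale(rows):
    """Feature transformation: rescale every column to the range [0, 1]."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = high - low
        # A constant column has no spread; map all of its values to 0.0.
        scaled_columns.append([(v - low) / span if span else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def select_features(rows, keep_indices):
    """Feature selection: keep only the chosen columns."""
    return [[row[i] for i in keep_indices] for row in rows]


vocabulary = ["spark", "data", "slow"]
documents = ["Spark makes big data fast", "data data everywhere", "slow job on spark"]

raw = [extract_features(doc, vocabulary) for doc in documents]
scaled = min_max_scale(raw)
selected = select_features(scaled, keep_indices=[0, 1])
print(selected)
```

Here the columns to keep are picked by hand. In practice, selection is usually driven by a statistic that measures how useful each feature is.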

Pipelines:-
In machine learning, it is common to run a sequence of steps to clean and transform data, then train one or more ML algorithms to learn from the data. MLlib has a class called Pipeline, which consists of a sequence of Pipeline Stages (Transformers and Estimators) to be run in a specific order.
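The relationship between Transformers, Estimators, and a Pipeline can be sketched with a few toy classes. This is a simplified plain-Python imitation of the idea, not MLlib's implementation: a Transformer only has `transform()`, an Estimator has `fit()` and returns a fitted model, and that model is itself a Transformer. All class names other than `Pipeline` are hypothetical.

```python
class Lowercase:
    """Transformer: only has transform() and needs no training."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class VocabularyEstimator:
    """Estimator: fit() learns from the data and returns a fitted model."""

    def fit(self, rows):
        vocabulary = sorted({word for row in rows for word in row.split()})
        return VocabularyModel(vocabulary)


class VocabularyModel:
    """The fitted model is itself a Transformer."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [[row.split().count(word) for word in self.vocabulary] for row in rows]


class Pipeline:
    """Runs its stages in order, fitting every Estimator along the way."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator becomes a fitted model
            fitted.append(stage)
            rows = stage.transform(rows)  # output feeds the next stage
        return PipelineModel(fitted)


class PipelineModel:
    """A fitted pipeline: every stage is now a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


training = ["Spark is fast", "Spark scales"]
model = Pipeline([Lowercase(), VocabularyEstimator()]).fit(training)
vectors = model.transform(["FAST spark spark"])
print(vectors)
```

The benefit of this design is that the same ordered steps are applied at training time and at prediction time, so the two cannot drift apart.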

Model Tuning:-
The goal of model tuning is to train a model with the right set of parameters to achieve the best performance against the objective defined in the first step of the ML development process.
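One common way to tune a model is a grid search: train once per candidate parameter value, score each result on held-out data, and keep the best. Below is a minimal plain-Python sketch, assuming a one-feature ridge regression and made-up data. It is not MLlib code (MLlib provides its own tuning utilities), and the function names are hypothetical.

```python
def fit_ridge(train, reg_param):
    """Fit y = w * x with L2 regularization (closed form for one feature)."""
    sum_xy = sum(x * y for x, y in train)
    sum_xx = sum(x * x for x, _ in train)
    return sum_xy / (sum_xx + reg_param)


def mean_squared_error(w, rows):
    """Average squared difference between predictions and true values."""
    return sum((w * x - y) ** 2 for x, y in rows) / len(rows)


def grid_search(train, validation, reg_params):
    """Train once per candidate value and score each on the validation set."""
    scores = {}
    for reg_param in reg_params:
        w = fit_ridge(train, reg_param)
        scores[reg_param] = mean_squared_error(w, validation)
    best = min(scores, key=scores.get)
    return best, scores


# Made-up data: the training points sit slightly above the validation trend.
train = [(1.0, 2.5), (2.0, 4.6), (3.0, 6.6)]
validation = [(4.0, 8.4), (5.0, 10.4)]

best_reg_param, scores = grid_search(train, validation, [0.0, 1.0, 10.0])
print(best_reg_param, scores)
```

The parameter is chosen on data the model did not train on, which is what keeps the comparison fair.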

Persistence:-
Persistence helps in saving and loading ML algorithms, models, and pipelines. This saves time and effort: once a model has been persisted, it can be loaded and reused whenever it is needed.
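The idea behind persistence can be shown with a small plain-Python sketch that writes a trained model's parameters to disk and reads them back. This is an illustration only, assuming a model that is just a weight and a bias stored as JSON. MLlib has its own save and load support for models and pipelines, which is not what is shown here.

```python
import json
import os
import tempfile


def save_model(model, path):
    """Write the learned parameters to disk."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle)


def load_model(path):
    """Read previously saved parameters back into memory."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def predict(model, x):
    """Apply the linear model y = weight * x + bias."""
    return model["weight"] * x + model["bias"]


# Pretend these parameters came out of an expensive training run.
trained = {"weight": 2.0, "bias": 1.0}

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "model.json")
    save_model(trained, path)
    restored = load_model(path)

# The restored model predicts without being trained again.
print(predict(restored, 3.0))
```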

These are the tools through which you can apply machine learning algorithms on the Apache Spark framework to process massive volumes of data faster.

In the Python world, scikit-learn is one of the most popular open source machine learning libraries. It provides a set of supervised and unsupervised learning algorithms. It is designed to be simple and efficient, which makes it a great tool for learning and practicing machine learning on a single machine. But once the data no longer fits on a single machine, it is time to switch to Spark MLlib.

Thank you.
