Beatrice Akaeme for AWS Community Builders

AWS AND APACHE SPARK

WHAT IS APACHE SPARK?
Apache Spark is an open-source data processing framework that can run analytical tasks on very large data sets by distributing the work across multiple computers. It handles both batch and real-time analytics and data processing workloads, and it can run in Hadoop clusters through YARN or in Spark's standalone mode. It is very fast because it processes data in RAM, and it serves as a general-purpose engine for large-scale data processing. Apache Spark supports widely used programming languages such as Python, Java, Scala, and R. It can be used for many things, such as batch applications, iterative algorithms, interactive queries, SQL, and streaming.
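As a quick illustration, here is a minimal PySpark sketch of the kind of distributed job described above: a word count whose work Spark spreads across the cluster. The app name and input path are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; "WordCount" is an arbitrary app name.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file into an RDD; "s3://my-bucket/input.txt" is a placeholder path.
lines = spark.sparkContext.textFile("s3://my-bucket/input.txt")

# Classic distributed word count: split lines, map to (word, 1), reduce by key.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))  # Pull a small sample back to the driver
spark.stop()
```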

COMPONENTS
Apache Spark Core – The general execution engine of the Spark platform that all other components are built upon.
Spark SQL – A component on top of Spark Core that lets users query data stored in disparate sources using the common SQL language (see the sketch after this list).
Spark Streaming – Performs streaming analytics using Spark Core's fast scheduling capability. Data can be ingested in mini-batches from many sources such as Kafka, Flume, and HDFS.
MLlib (Machine Learning Library) – Apache Spark's library containing a wide array of machine learning and statistical algorithms.
GraphX – A distributed graph-processing framework on top of Apache Spark, used to model graph data and perform graph computations.
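To make the Spark SQL component concrete, the sketch below registers a small in-memory DataFrame as a temporary view and queries it with ordinary SQL; the column names and rows are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

# Build a small DataFrame in memory; the schema here is illustrative.
people = spark.createDataFrame(
    [("Ada", 36), ("Grace", 45), ("Alan", 41)],
    ["name", "age"],
)

# Expose the DataFrame to the SQL engine as a temporary view.
people.createOrReplaceTempView("people")

# Query it with standard SQL.
spark.sql("SELECT name FROM people WHERE age > 40").show()
```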

SPARK LANGUAGES
Spark is mainly written in Scala, which is the native language for interacting with the Spark core engine. Scala can be a good fit for those with prior knowledge of Java, and it enables developers to write cleanly designed Spark applications. The Scala shell can be accessed by running spark-shell.

Python – Of the supported languages, Python is a simple, open-source, general-purpose language that is very easy to learn and understand. Tasks such as creating schemas and calling REST APIs are much easier to perform with Python while working in Spark.
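For example, here is a short sketch of defining an explicit schema in PySpark, one of the tasks mentioned above; the field names and rows are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("SchemaDemo").getOrCreate()

# Declare the schema up front instead of relying on inference.
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])

df = spark.createDataFrame([("Ada", 36), ("Alan", None)], schema)
df.printSchema()
```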

The R programming language can also be downloaded and run in Spark. This lets users run the popular desktop data science language on large-scale distributed data sets in Spark and use it to build applications that rely on machine learning algorithms.

FEATURES

Apache Spark has the following features.

Speed − Spark can run an application in a Hadoop cluster up to 100 times faster when working in memory, and about 10 times faster when running on disk. How is this possible? By reducing the number of read/write operations to disk. Big data is distinguished by volume, variety, velocity, and veracity, and it needs to be processed at very high speed; Spark delivers fast performance, utilizes both memory and disk, and has built-in fault tolerance.

Uses multiple programming languages − Spark supports the use of multiple programming languages such as Java, Scala, and Python, so you can write applications in different languages.

Advanced Analytics − Spark supports 'map' and 'reduce' operations as well as SQL queries, streaming data, machine learning (ML), and graph algorithms, as the sketch below shows.
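The 'map' and 'reduce' operations mentioned above look like this in PySpark; the numbers are arbitrary sample data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MapReduceDemo").getOrCreate()
sc = spark.sparkContext

# Distribute a small list, square each element, then sum the results.
total = sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # 55
```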

Flexibility – Apache Spark supports numerous languages and allows developers to write applications in Scala, Java, R, or Python.

In-memory computing – Spark stores data in the RAM of servers, which permits very fast access and in turn increases the speed of analytics.
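A short sketch of how an application opts into this in-memory behavior: caching a DataFrame keeps it in executor memory across actions. The input path here is a hypothetical placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").getOrCreate()

# "s3://my-bucket/events.parquet" is a placeholder input path.
events = spark.read.parquet("s3://my-bucket/events.parquet")

# Mark the DataFrame for in-memory caching; it is materialized on first use.
events.cache()

events.count()                          # First action: reads from storage, fills the cache
events.filter("status = 'OK'").count()  # Subsequent actions reuse the cached data
```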

Real-time processing – Spark is able to process real-time streaming data. Unlike MapReduce, which processes only stored data, Spark handles real-time data and is therefore able to produce instant outcomes.
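As an illustration, here is a minimal job using Spark's newer Structured Streaming API that counts words arriving on a socket; the host and port are placeholders, and in practice the source would more likely be Kafka or Kinesis.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingDemo").getOrCreate()

# Read a stream of lines from a socket; localhost:9999 is a placeholder source.
lines = (spark.readStream.format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Split each line into words and keep a running count per word.
counts = (lines.select(explode(split(lines.value, " ")).alias("word"))
               .groupBy("word").count())

# Print the updated counts to the console for each micro-batch.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```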

Better analytics – Apache Spark comes with a rich set of SQL query capabilities, machine learning algorithms, complex analytics, and more. With all these components, analytics can be performed more effectively with Spark.

Use cases for Spark: In the healthcare sector, it makes data available to health workers. In the financial sector, it is used for financial product recommendations and investment banking. It is also used in the manufacturing and retail sectors to run businesses more efficiently.

AWS AND APACHE SPARK
On Amazon EMR
Where is the best place to run Apache Spark? You can run Apache Spark on Amazon EMR. Using the AWS Management Console, the AWS CLI, or the Amazon EMR API, you can rapidly and easily create managed Spark clusters. With Amazon EMR you get a managed Hadoop framework that makes it easy, fast, and cost-effective to process extensive amounts of data using EC2 instances.
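As a sketch, the same cluster creation can be done programmatically with boto3, the AWS SDK for Python; the region, cluster name, release label, and instance types below are illustrative, and the default EMR IAM roles are assumed to already exist in the account.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is illustrative

response = emr.run_job_flow(
    Name="spark-demo-cluster",            # hypothetical cluster name
    ReleaseLabel="emr-6.9.0",             # pick a current EMR release
    Applications=[{"Name": "Spark"}],     # install Spark on the cluster
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",    # assumes the default EMR roles exist
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```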

On AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service provided by AWS to prepare and load data for analytics. You can easily create and run an ETL job from the AWS Management Console. It is a very good data pipeline tool that can automatically create partitions to make queries more efficient.
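A Glue ETL job is essentially a PySpark script run by the service. Below is a skeleton of one; the database, table, and output path refer to a hypothetical Glue Data Catalog setup and S3 bucket.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Glue passes the job name in as a runtime argument.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# "my_database" and "my_table" are hypothetical Data Catalog entries.
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

# Write the data back out as Parquet; the S3 path is a placeholder.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
)

job.commit()
```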

Apache Spark on EC2
To install Apache Spark on EC2 instances you need to go through a set of standard configuration steps. The first thing you need is an EC2 instance: create an AWS account and sign in to the console with it. In the console, click Services, then EC2, then Launch instance. Once the instance is running, install Spark on it.

Spark on Kubernetes
Kubernetes is an open-source container orchestration system originally developed at Google. Spark can run on clusters managed by Kubernetes, and this deployment option has been growing in popularity. A Kubernetes cluster consists of a set of nodes on which you can run containerized Apache Spark applications.

Apache Spark in the Cloud
To deploy Apache Spark in the cloud, Amazon EMR is the best way to go. It allows you to launch Spark clusters in minutes without needing to do node provisioning, cluster setup, Spark configuration, or cluster tuning. EMR enables you to provision one, hundreds, or thousands of compute instances in minutes.

Amazon Redshift and Apache Spark
Amazon Redshift is a part of AWS: an analytical database and a fully managed, cloud-based data warehouse service designed to handle large volumes of raw data loaded through an ETL process. Amazon Redshift makes it easy to discover new insights from that data, and it is fast, easy to use, and inexpensive. Spark is used for real-time stream processing, while Redshift is suited to near-real-time batch operations. You can build an application with Spark and then use Redshift both as a source and a destination for data.
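One common pattern, sketched below, is to read a Redshift table into Spark over JDBC; the JDBC URL, credentials, and table name are placeholders, and this assumes the Amazon Redshift JDBC driver is available on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RedshiftDemo").getOrCreate()

# All connection details here are placeholders for illustration.
df = (spark.read.format("jdbc")
           .option("url", "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev")
           .option("dbtable", "public.sales")
           .option("user", "awsuser")
           .option("password", "********")
           .option("driver", "com.amazon.redshift.jdbc42.Driver")
           .load())

df.show(5)
```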
