Apache Spark in English

For an assignment, I was asked to choose a data science library or tool and write about it. I chose Apache Spark after a friend who has been helping me here and there mentioned it to me.

Apache Spark is an open source, very fast cluster computing system for large-scale data processing, intended for general use in both batch and real-time processing[1]. A cluster computing system uses many machines on a network and pools their resources to accomplish computing tasks. Because computers can be added or removed at any time, the amount of processing power available is far more flexible, and the total pool of resources can grow much larger than a single machine allows. If one of the computers in the cluster fails, the others pick up the slack and keep the process running[7]. As you might imagine, this makes these systems capable of processing a large and varying number of files or records at once, as in batch processing, or of reacting to data consistently within the very tight time window that a real-world process requires, as in real-time processing[4]. This leaves a great deal of room for high-level computing work on many platforms and with many different tools, and for tuning how the system processes data for speed.
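To make the idea of pooled resources a little more concrete, here is a minimal sketch in plain Python. It is not Spark code: the "machines" are just threads on one computer, and the names split_into_partitions, count_words, and cluster_word_count are made up for this illustration. It only shows the general pattern of splitting the data, letting each worker handle its own share, and then combining the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records round-robin into num_partitions smaller lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Work done independently by one worker on its share of the data."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def cluster_word_count(records, num_workers=3):
    """Split the data, count each partition in parallel, then merge."""
    partitions = split_into_partitions(records, num_workers)
    # Threads stand in for the machines of a cluster (an assumption of
    # this sketch; a real cluster spreads the work over a network).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = [
    "spark is fast",
    "spark scales out",
    "clusters pool resources",
    "fast clusters",
]
result = cluster_word_count(lines, num_workers=3)
print(result["spark"], result["fast"], result["clusters"])  # prints: 2 2 2
```

Changing num_workers changes how the work is divided but not the answer, which is the property that lets a cluster add or remove machines freely.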

One of the many problems Apache Spark addresses is speed. IP network traffic is expected to reach 396 exabytes per month by 2022, an increase of 274 exabytes per month from just five years before in 2017. One of the solutions that has been applied successfully is in-memory cluster computing, a term that can mean different things depending on which system is being used and why. To summarize, many groups are trying to simplify the way the computer gets at the data it processes. In traditional systems, data has to be moved between storage, memory, and the processor before it can be worked on, which limits how much data can be processed at a given time and can raise the system's power demand. Traditionally, many companies have tackled this problem through scaling, that is, making their devices smaller and more flexible. However, another method, applied by database vendors, is to process the data in main memory (DRAM) rather than storing it on solid state drives or disk drives in a server or another system. This speeds up transactions[3]. Many hail Apache Spark as having applied this approach to data processing in a way that other systems like it have not: it is reported to be up to 100 times faster in memory and ten times faster on disk than systems such as MapReduce, which gives it a reputation for low latency[5]. Another problem Spark attempts to solve is the abundance of other systems and high-level programming demands that could be placed on it, which it addresses by making itself accessible to many different processes.
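The benefit of keeping data in memory shows up most clearly when the same data is used more than once. The sketch below is a toy model in plain Python, not Spark itself: DiskBackedDataset and run_passes are hypothetical names invented here, and a small temporary file stands in for data on a disk. It simply counts how many times the file has to be read with and without an in-memory copy.

```python
import os
import tempfile


class DiskBackedDataset:
    """Toy dataset that counts how often it has to go back to disk."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0
        self._cache = None

    def load(self, keep_in_memory):
        # Serve from memory when a cached copy exists and is allowed.
        if keep_in_memory and self._cache is not None:
            return self._cache
        self.disk_reads += 1
        with open(self.path) as handle:
            values = [float(line) for line in handle]
        if keep_in_memory:
            self._cache = values
        return values


def run_passes(dataset, passes, keep_in_memory):
    """Sum the whole dataset several times, as an iterative job would."""
    total = 0.0
    for _ in range(passes):
        total += sum(dataset.load(keep_in_memory))
    return total


# Write the numbers 1..5 to a temporary file that plays the role of disk.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("\n".join(str(n) for n in range(1, 6)))
    data_path = handle.name

from_disk = DiskBackedDataset(data_path)
from_memory = DiskBackedDataset(data_path)
disk_total = run_passes(from_disk, passes=10, keep_in_memory=False)
memory_total = run_passes(from_memory, passes=10, keep_in_memory=True)
os.remove(data_path)

print(from_disk.disk_reads, from_memory.disk_reads)  # prints: 10 1
```

Both runs produce the same total, but the in-memory run touches the disk once instead of ten times. The real speedups quoted for Spark depend on the workload, so this only illustrates the direction of the effect, not its size.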

Spark fits best in six major areas of computing: fast data processing, iterative processing, near real-time processing, graph processing, machine learning, and joining datasets. I have already discussed where its data processing speed comes from and how this translates to near real-time processing, so I will begin with iterative processing. Spark uses the Resilient Distributed Dataset, or RDD, an immutable dataset that is split up and spread across multiple nodes in such a way that if one node fails, the data can be recovered and the other nodes can still process it[5]. Spark keeps a record of how each RDD was derived, so lost pieces can be recomputed. This makes Spark adept at processing and reprocessing the same data very quickly, since many different operations can be carried out on it in memory[2]. Spark also comes with a number of built-in libraries. Among them is GraphX, which uses RDDs for graph computation and benefits from Spark's affinity for iterative processing. A machine learning library is included as well, and it benefits from the same speed[5]. All of these things make Spark very competitive in the market with its main alternatives.
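Here is a rough sketch of why an immutable dataset is easy to recover after a failure. ToyRDD is a made-up class written in plain Python for this post, not Spark's real RDD, and its methods only imitate the general idea: every transformation returns a new object that remembers the steps used to build it (its lineage), so any partition can be rebuilt from the source data on demand.

```python
class ToyRDD:
    """Immutable, partitioned dataset that remembers how it was derived."""

    def __init__(self, source_partitions, lineage=()):
        # Tuples keep the source data and the lineage from being modified.
        self._source = tuple(tuple(part) for part in source_partitions)
        self._lineage = tuple(lineage)

    def map(self, fn):
        """Return a NEW dataset; the current one is left untouched."""
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        """Return a NEW dataset that keeps only rows where fn is true."""
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def compute_partition(self, index):
        """Rebuild one partition by replaying the lineage on its source."""
        rows = list(self._source[index])
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows

    def collect(self):
        """Compute every partition and return all rows as one list."""
        rows = []
        for index in range(len(self._source)):
            rows.extend(self.compute_partition(index))
        return rows


base = ToyRDD([[1, 2, 3], [4, 5, 6]])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())  # prints: [4, 16, 36]
print(base.collect())          # prints: [1, 2, 3, 4, 5, 6]

# Pretend the node holding partition 1 failed: replay the lineage for it.
recovered = even_squares.compute_partition(1)
print(recovered)               # prints: [16, 36]
```

Because nothing is ever changed in place, replaying the same steps on the same source always gives the same result, which is what makes recomputing a lost piece safe.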

Having looked around, I found many alternatives to Spark, though some are alternatives to only one of the functions included in it, such as graph processing or streaming. For this discussion I would like to focus on programs meant specifically for big data processing: Apache Storm, Apache Flink, IBM InfoSphere Streams, and TIBCO StreamBase.

Apache Storm is another open source program, designed for stream processing and near real-time event processing. In addition to performing many of the same functions as Spark, such as online machine learning and real-time analytics, it comes with a group of built-in programs for functions such as cluster management, queued messaging, and multicast messaging. Storm can be used with any programming language, which makes it more flexible than Spark in that respect[6].

Apache Flink does not use a micro-batch model; instead, it uses an operator-based model for computation. Under this model, every data element is pipelined through the included streaming program as soon as it is received. Flink is faster at graph processing and machine learning because of its propensity for closed-loop iterations, and it is comparable in speed while also allowing code from programs like Storm and MapReduce[6].
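The difference between the two models can be sketched with a little arithmetic. In a micro-batch model, which is how Spark's classic streaming groups incoming records, an event waits until the end of the interval it arrived in before it is processed. In a record-at-a-time model, it is handed on immediately. The toy functions below are plain Python with invented names and made-up arrival times; they ignore the actual processing time and only measure the waiting.

```python
def micro_batch_wait_ms(arrival_times_ms, batch_interval_ms):
    """Each event waits until the end of the batch interval it arrived in."""
    waits = []
    for arrival in arrival_times_ms:
        batch_end = (arrival // batch_interval_ms + 1) * batch_interval_ms
        waits.append(batch_end - arrival)
    return waits


def per_record_wait_ms(arrival_times_ms):
    """Each event is passed to the next operator as soon as it arrives."""
    return [0 for _ in arrival_times_ms]


# Hypothetical arrival times in milliseconds, with one-second batches.
arrivals = [0, 120, 480, 999, 1000, 1700]
batch_waits = micro_batch_wait_ms(arrivals, batch_interval_ms=1000)
record_waits = per_record_wait_ms(arrivals)

print(batch_waits)   # prints: [1000, 880, 520, 1, 1000, 300]
print(record_waits)  # prints: [0, 0, 0, 0, 0, 0]
```

An event that arrives just after a batch starts waits almost the full interval, while one that arrives just before the cutoff barely waits at all. Shrinking the interval reduces the wait, but in this simple model only record-at-a-time processing removes it.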

IBM InfoSphere Streams has everything necessary for stream processing, including integration capabilities and a highly scalable event server. It can uncover patterns in data flows as the information arrives and can fuse streams together, which helps in gaining insights from many different streams. Streams comes with security and network management features, a runtime environment for deploying and monitoring stream applications, and a programming model for writing applications in SPL[6].

TIBCO StreamBase is mainly for analyzing real-time data and for creating applications on top of it, with support aimed at making those applications faster and easier for developers to deploy. It is unique for its LiveView data mart, which takes continuously streaming data from real-time sources, stores it in an in-memory warehouse, and then returns push-based query results to users. Users can explore the returned data and use elements meant to make the desktop behave more like an interactive command application[6].

Apache Spark was originally developed at UC Berkeley in 2009. It is an entirely open source project that is currently hosted by the Apache Software Foundation and maintained by Databricks[1]. If you are interested in more information, please see the links below.

The Apache Spark Tutorial
https://www.tutorialspoint.com/apache_spark/index.htm

The Spark Quickstart guide
https://spark.apache.org/docs/latest/quick-start.html

Resources

[1] Apache Spark™ - What is Spark. (2020, April 13). Retrieved September 27, 2020, from https://databricks.com/spark/about
[2] Bekker, A. (2017, September 14). Spark vs. Hadoop MapReduce: Which big data framework to choose [Web log post]. Retrieved September 26, 2020, from https://www.scnsoft.com/blog/spark-vs-hadoop-mapreduce
[3] Lapedus, M. (2019, February 21). In-Memory Vs. Near-Memory Computing [Web log post]. Retrieved 2020, from https://semiengineering.com/in-memory-vs-near-memory-computing
[4] Schiff, L. (2020, May 13). Real Time vs Batch Processing vs Stream Processing [Web log post]. Retrieved 2020, from https://www.bmc.com/blogs/batch-processing-stream-processing-real-time/
[5] Vaidya, N. (2019, May 22). [Web log post]. Retrieved 2020, from https://www.edureka.co/blog/spark-architecture/
[6] Verma, A. (2018, May 25). What are the Best Alternatives for Apache Spark? [Web log post]. Retrieved September 26, 2020, from https://www.whizlabs.com/blog/apache-spark-alternatives
[7] What is Cluster Computing: A Concise Guide to Cluster Computing. (2020, May 18). Retrieved September 27, 2020, from https://www.educba.com/what-is-cluster-computing/