As a Data Scientist, one often works with large amounts of data. I never got the chance to work with large data earlier. Recently I came across 1.3 GB of sensor data, and it was a little hard to work with using a pandas DataFrame. I had to wait a couple of minutes to read or write data, or to perform any data manipulation.
I also realized that while working with big data we can't rely on a pandas DataFrame. It does not perform well at reading and writing files (I/O operations), and data manipulation also takes time. Reading a 1 GB CSV file took around 44 seconds using pandas, while PySpark took just 6 seconds (the time taken depends on hardware). It made me realize that I need to explore PySpark.
In this tutorial, we will walk through the PySpark installation steps and perform some basic operations with the DataFrame object.
You will need Java installed in the environment, along with a proper JAVA_HOME variable defined. Make sure you install a JDK or JRE.
To install PySpark, we just need to do a pip installation in conda or any Python environment:
pip install pyspark
Before doing any operation in PySpark, we need to initialize a Spark session. It can be done like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Practice').getOrCreate()
For appName, we can provide any name based on the objective. The session builder takes a little time to set up, but it is a one-time process.
Once it completes, PySpark is ready to use.
PySpark syntax is very similar to pandas. In the pandas library, we read a CSV file like this:
import pandas as pd
df_pandas = pd.read_csv('sample.csv')
df_pandas
Similarly, in PySpark we have the syntax below:
df_pyspark = spark.read.csv("sample.csv")
df_pyspark.show()
Note: in PySpark the DataFrame will not be displayed directly; we need to call show() on the DataFrame object.
There are some functions that work similarly on a PySpark DataFrame, like:
# head
df_pandas.head()
df_pyspark.head()
# describe
df_pandas.describe()
df_pyspark.describe()
and many more that give almost identical syntax and results.
There are also a few functions that work differently from pandas, such as column selection and slicing:
# column selection function
df_pandas['column1']
df_pyspark.select('column1').show()
With this note, the PySpark learning journey begins...