How to Manage Big Data With 5 Python Libraries

Python really is everywhere at this point.

Although gatekeepers still argue about whether someone is really a software developer if they don't code in a language harder than Python, it remains everywhere.

It's used to automate and manage websites, analyze data, and wrangle big data.

As data grows, the way we manage it becomes more and more fine-tuned.

We aren't limited to only using relational databases anymore.

That also means there are now more tools for interacting with these new systems, like Kafka, Hadoop (more specifically HBase), Spark, BigQuery, and Redshift (to name a few).

Each of these systems takes advantage of concepts like distribution, columnar architecture, and streaming data to provide information to the end user faster.

The need for faster, more up-to-date information will drive the need for data engineers and software engineers to utilize these tools.

That's why we wanted to provide a quick intro to some Python libraries that could help you out.


BigQuery

Google BigQuery is a very popular enterprise data warehouse that's built with a combination of the Google Cloud Platform and Bigtable.

This cloud service works great for all sizes of data and executes complex queries in a few seconds.

BigQuery is a RESTful web service that enables developers to perform interactive analysis on enormous data sets in conjunction with the Google Cloud Platform. Let's take a look at an example I put together in another piece located here.


from google.cloud import bigquery
from google.oauth2 import service_account

# TODO(developer): Set key_path to the path to the service account key
#                  file.
key_path = "project_g.json"

credentials = service_account.Credentials.from_service_account_file(
    filename=key_path
)

client = bigquery.Client(
    credentials=credentials,
    project=credentials.project_id,
)

# point at the public Medicare dataset and list its tables
medicare = client.dataset('cms_medicare', project='bigquery-public-data')
print([x.table_id for x in client.list_tables(medicare)])

This example shows you how you can connect to BigQuery and then start pulling information about the tables and data sets you'll be interacting with.

In this case, the Medicare data set is a public data set that anyone can access.

Another point about BigQuery is that it operates on top of Bigtable. It's important to understand that this warehouse is not a transactional database, so it can't be thought of as an online transaction processing (OLTP) database. It's designed specifically for big data workloads, which is why it handles petabyte-sized data sets so well.
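
To go one step further than listing tables, here's a minimal sketch of running a standard SQL query with the same client object. The table name is an assumption on my part, so swap in one of the table IDs printed above:

# Minimal sketch: run a query with the client created above.
# The table name below is illustrative; substitute one of the
# table_ids returned by list_tables().
query = """
    SELECT *
    FROM `bigquery-public-data.cms_medicare.inpatient_charges_2014`
    LIMIT 5
"""

query_job = client.query(query)   # starts the query as a job
for row in query_job.result():    # result() waits for the job to finish
    print(dict(row))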


Redshift and Sometimes S3

Next up, we have Amazon's popular Redshift and S3. Amazon S3 is basically a storage service that's used to store and retrieve enormous amounts of data from anywhere on the internet. With this service, you pay only for the storage you actually use. Redshift, on the other hand, is a fully managed data warehouse that handles petabyte-scale data efficiently. This service offers fast querying using SQL and BI tools.

Together, Amazon Redshift and S3 are a powerful combination for working with data: massive amounts of data can be pumped into the Redshift warehouse from S3. This powerful pairing, when coded in Python, becomes very convenient for developers. Let's have a look at a simple "Hello, World!" example for reference.

SOURCE FOR THE CODE BELOW

__author__ = 'fbaldo'

import psycopg2
import pprint


configuration = { 'dbname': 'database_name',
                  'user': 'user_name',
                  'pwd': 'user_password',
                  'host': 'redshift_endpoint',
                  'port': 'redshift_port'
                }


def create_conn(*args, **kwargs):

    config = kwargs['config']
    try:
        conn = psycopg2.connect(dbname=config['dbname'], host=config['host'], port=config['port'], user=config['user'], password=config['pwd'])
    except Exception as err:
        print(err)
        raise

    return conn


def select(*args, **kwargs):
    # need a connection with dbname=<username>_db
    cur = kwargs['cur']

    try:
        # retrieve all tables in my search_path
        cur.execute("""select tablename from pg_table_def""")
    except Exception as err:
        print(err)
        raise

    rows = cur.fetchall()
    for row in rows:
        print(row)


print('start')
conn = create_conn(config=configuration)
cursor = conn.cursor()
print('start select')
select(cur=cursor)
print('finish')

cursor.close()
for n in conn.notices:
    pprint.pprint(n)
conn.close()

This script is a basic connect-and-select using psycopg2.

In this case, I borrowed jaychoo's code.

But this, again, provides a quick guide on how to connect and then pull data from Redshift.
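
Since this combination is really about pumping data from S3 into Redshift, here's a hedged sketch of what that load step usually looks like, again over psycopg2. The bucket, file, table, and IAM role below are all hypothetical placeholders:

# Hypothetical sketch: load a CSV already sitting in S3 into a Redshift
# table with the COPY command. Bucket, table, and IAM role are placeholders.
copy_statement = """
    COPY my_schema.my_table
    FROM 's3://my-example-bucket/data/events.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV
    IGNOREHEADER 1;
"""

with conn.cursor() as cur:   # reuse the connection from create_conn()
    cur.execute(copy_statement)
conn.commit()                # COPY runs inside a transaction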


PySpark

All right, let's leave the world of data storage systems, and let's look into tools that help you process data quickly.

Apache Spark is a very popular open-source framework that performs large-scale distributed-data processing. It can also be used for machine learning.

This cluster-computing framework focuses mainly on streamlining analytics. It works with resilient distributed datasets (RDDs) and lets users manage the resources of a Spark cluster.

It's often used in conjunction with other Apache products (like HBase): Spark quickly processes the data and then stores it in tables that live in other data storage systems.

To get started, let's look at a basic example of running PySpark.


from pyspark import SparkContext
from pyspark.sql import Row
from pyspark.sql import SQLContext

# create the SparkContext and the SQL entry point
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

data = [('Big Burger Chain', 92), ('Small Burger Chain', 99), ('Taco Bar', 100), ('Thai Place', 70)]

# distribute the data across the cluster as an RDD
rdd = sc.parallelize(data)

# convert each tuple into a Row of (restaurant, rating)
restaurant_map = rdd.map(lambda x: Row(restaurant=x[0], rating=int(x[1])))

# build a DataFrame from the Rows and collect the results back to the driver
df_restaurant = sqlContext.createDataFrame(restaurant_map).collect()
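
For reference, newer Spark versions usually go through SparkSession as the single entry point. Here's a hedged sketch of the same toy example in that style (the app name and filter are mine, not from the original piece):

from pyspark.sql import SparkSession

# SparkSession wraps the SparkContext and SQLContext in a single entry point
spark = SparkSession.builder.appName("restaurant_ratings").getOrCreate()

data = [('Big Burger Chain', 92), ('Small Burger Chain', 99),
        ('Taco Bar', 100), ('Thai Place', 70)]

# build a DataFrame directly from the list of tuples
df = spark.createDataFrame(data, ["restaurant", "rating"])

# keep only the highly rated restaurants and print them
df.filter(df.rating > 90).show()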

Sometimes installing PySpark can be a challenge because of its dependencies: it runs on top of the JVM and therefore needs a working Java installation underneath. But in an era where Docker is everywhere, experimenting with PySpark is much more convenient.

Alibaba uses PySpark to personalize web pages and offer targeted advertising, as do many other large data-driven organizations.


Kafka Python

Kafka is a distributed publish-subscribe messaging system that allows users to maintain feeds of messages in both replicated and partitioned topics.

These topics are basically logs that receive data from clients and store it across partitions. Kafka Python is designed to work much like the official Java client, with a more Pythonic interface on top. It works best with newer brokers but is backward compatible with older versions.
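
Before the consumer and producer examples below, you need a topic to publish to. Here's a hedged sketch of creating one with kafka-python's admin client; the topic name, partition count, and replication factor are arbitrary, and many broker setups will auto-create topics anyway:

from kafka.admin import KafkaAdminClient, NewTopic

# connect to a local broker and create a partitioned, replicated topic
admin = KafkaAdminClient(bootstrap_servers=['localhost:9092'])
admin.create_topics([NewTopic(name='sebtest', num_partitions=3, replication_factor=1)])
admin.close()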

KafkaConsumer

from kafka import KafkaConsumer
from pymongo import MongoClient
from json import loads



# consume JSON messages from the 'sebtest' topic on a local broker
consumer = KafkaConsumer(
    'sebtest',
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    group_id='my-group',
    value_deserializer=lambda x: loads(x.decode('utf-8')))


# store each consumed message in a local MongoDB collection
client = MongoClient('localhost:27017')
collection = client.sebtest.sebtest

for message in consumer:
    message = message.value
    collection.insert_one(message)
    print('{} added to {}'.format(message, collection))

KafkaProducer

from time import sleep
from json import dumps
from kafka import KafkaProducer


# serialize each message to JSON and publish ten of them to 'sebtest'
producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                         value_serializer=lambda x:
                         dumps(x).encode('utf-8'))
for e in range(10):
    data = {'number' : 'Hello this is message '+str(e)}
    producer.send('sebtest', value=data)
    sleep(1)

As you can see in the examples above, working with Kafka Python involves both a consumer and a producer.

In Kafka Python, these two sides work together. The KafkaConsumer is basically a high-level message consumer that is intended to operate much like the official Java client.

Full consumer functionality requires brokers that support the group APIs. The KafkaProducer is an asynchronous message producer, also intended to operate much like the Java client. The producer is thread-safe and can be shared across threads, but the consumer is not, so multiprocessing is the recommended way to scale consumption.
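
To make that last point concrete, here's a hedged sketch of the usual scaling pattern: one KafkaConsumer per process, all sharing a group_id so Kafka splits the topic's partitions between them (the worker count is arbitrary):

from multiprocessing import Process
from kafka import KafkaConsumer


def consume(worker_id):
    # each process creates its own consumer; sharing group_id lets Kafka
    # assign each worker a subset of the topic's partitions
    consumer = KafkaConsumer('sebtest',
                             bootstrap_servers=['localhost:9092'],
                             group_id='my-group')
    for message in consumer:
        print(worker_id, message.value)


if __name__ == '__main__':
    workers = [Process(target=consume, args=(i,)) for i in range(3)]
    for w in workers:
        w.start()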


Pydoop

Let's get this out of the way. No, Hadoop isn't in itself a data storage system. Hadoop actually has several components, including MapReduce and the Hadoop Distributed File System (HDFS).

So Pydoop is on this list, but you'll need to pair Hadoop with other layers (such as Hive) to more easily wrangle data.

Pydoop is a Hadoop-Python interface that allows you to interact with the HDFS API and write MapReduce jobs using pure Python code.
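
As a small taste of the HDFS side, here's a hedged sketch using pydoop.hdfs, which mirrors the familiar open() and os-style helpers. It assumes a reachable HDFS cluster, and the paths are placeholders:

import pydoop.hdfs as hdfs

# list a directory on HDFS (the path is a placeholder)
print(hdfs.ls('/user'))

# dump a small string to a file on HDFS, then load it back
hdfs.dump('hello from pydoop\n', '/user/hadoop/hello.txt')
print(hdfs.load('/user/hadoop/hello.txt'))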

This library allows the developer to access important MapReduce functions, such as RecordReader and Partitioner, without needing to know Java. For this last example, I think the people at Edureka do it better than I could. So here's a great quick intro.

Find The Intro Here

Pydoop itself might be a little too low-level for most data engineers. More than likely, most of you will be writing ETLs in Airflow that run on top of these systems. But it's still great to at least get a general understanding of what you are working with.


So Where Will You Start?

Managing big data is only going to get tougher in the coming years.

Due to increasing networking abilities --- IoT, improved compute, etc. --- the flood of data that's coming at us is just going to continue to grow.

Thus, understanding some of these data systems, and the libraries you can use to interact with them, will be necessary if we're going to keep up.

Thanks for reading.

If you want to read more:
Airbnb's Airflow Vs Spotify's Luigi

Automating File Loading Into SQL Server With Python And SQL

5 Skills Every Software Engineer Needs Based Off Of A Job Description

The Top 10 Big Data Courses, Hadoop, Kafka And Spark

Data Science Use Cases That Are Improving the Finance Industry

Data Science Consulting: How To Get Clients
