Spark functions

#spark #dataengineering #python

We learned about Spark dataframes in the Data Engineering Zoomcamp and how to write Spark functions. This is one of the advantages of Spark. Spark can write SQL commands, but it also has many functions, and we can also create our own functions to manipulate the data in complex ways.

First we imported Spark functions: from pyspark.sql import functions as F. We refactored a column using the Spark function "to_date". This turned datetime into a plain date, eliminating the time aspect.

df \
    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \
    .withColumn('dropoff_date', F.to_date(df.dropoff_datetime)) \

Then we created a crazy function. This took the base number and added a different character if it was divisible by 7, 3, or neither. Also we changed the display to hexadecimal.

def crazy_stuff(base_num):
    num = int(base_num[1:])
    if num % 7 == 0:
        return f's/{num:03x}'
    elif num % 3 == 0:
        return f'a/{num:03x}'
    else:
        return f'e/{num:03x}'

We added a user defined function as follows:

crazy_stuff_udf = F.udf(crazy_stuff, returnType=types.StringType())

I actually left out the returnType, because the default type was string. Then I added this to the output, and called select and show:

df \
    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \
    .withColumn('dropoff_date', F.to_date(df.dropoff_datetime)) \
    .withColumn('base_id', crazy_stuff_udf(df.dispatching_base_num)) \
    .select('base_id', 'pickup_date', 'dropoff_date', 'PULocationID', 'DOLocationID') \
    .show()

And got this output:

+-------+-----------+------------+------------+------------+
|base_id|pickup_date|dropoff_date|PULocationID|DOLocationID|
+-------+-----------+------------+------------+------------+
|  e/9ce| 2021-01-07|  2021-01-07|         142|         230|
|  e/9ce| 2021-01-01|  2021-01-01|         133|          91|
|  e/acc| 2021-01-01|  2021-01-01|         147|         159|
|  e/b35| 2021-01-06|  2021-01-06|          79|         164|
|  s/b44| 2021-01-04|  2021-01-04|         174|          18|
|  e/b3b| 2021-01-04|  2021-01-04|         201|         180|
|  e/9ce| 2021-01-04|  2021-01-04|         230|         142|
|  e/b38| 2021-01-03|  2021-01-03|         132|          72|
|  s/af0| 2021-01-01|  2021-01-01|         188|          61|
|  e/9ce| 2021-01-04|  2021-01-04|          97|         189|
|  e/acc| 2021-01-01|  2021-01-01|         174|         235|
|  a/b37| 2021-01-05|  2021-01-05|          35|          76|

The point in defining this crazy function was to show how easy it was to do. If we had tried to do this with SQL it would have been a nightmare.

Unfortunately, I ran into another problem. The Python version I had installed was 3.11, and that version was incompatible with the version of Spark that we were using, which was older. I kept getting a strange error PicklingError: Could not serialise object: IndexError: tuple index out of range. Luckily that had been addressed in the FAQ. It said we had to use an older version of python.

I put the question in the Slack about whether I should use an older version of Python or just watch the video. The answer was, make it work. So I asked ChatGPT how to go about this. It said that I could type on the command line in terminal conda create -n myenv python=3.10 anaconda and then conda activate myenv. I was told that this created a conda environment named myenv with python3.10 installed. It worked, and I could go on. The problem is, I have to remember to type this command every time I log onto the virtual machine. I actually read about this today in Andrew Ng's newsletter. I'll quote here:

Dear friends,

I think the complexity of Python package management holds down AI application development more than is widely appreciated. AI faces multiple bottlenecks — we need more GPUs, better algorithms, cleaner data in large quantities. But when I look at the day-to-day work of application builders, there’s one additional bottleneck that I think is underappreciated: The time spent wrestling with version management is an inefficiency I hope we can reduce.
A lot of AI software is written in the Python language, and so our field has adopted Python’s philosophy of letting anyone publish any package online. The resulting rich collection of freely available packages means that, with just one “pip install” command, you now can install a package and give your software new superpowers! The community’s parallel exploration of lots of ideas and open-sourcing of innovations has been fantastic for developing and spreading not just technical ideas but also usable tools.
But we pay a price for this highly distributed development of AI components: Building on top of open source can mean hours wrestling with package dependencies, or sometimes even juggling multiple virtual environments or using multiple versions of Python in one application. This is annoying but manageable for experienced developers, but creates a lot of friction for new AI developers entering our field without a background in computer science or software engineering.
I don’t know of any easy solution. Hopefully, as the ecosystem of tools matures, package management will become simpler and easier. Better tools for testing compatibility might be useful, though I’m not sure we need yet another Python package manager (we already have pip, conda, poetry, and more) or virtual environment framework.
As a step toward making package management easier, maybe if all of us who develop tools pay a little more attention to compatibility — for example, testing in multiple environments, specifying dependencies carefully, carrying out more careful regression testing, and engaging with the community to quickly spot and fix issues — we can make all of this wonderful open source work easier for new developers to adopt.

Keep coding!
Andrew

DEV Community

Spark functions

Top comments (0)

Read next

Backtest Like a Pro with a Forex API

Synchronous Applications

Advent of Code '24 - Day9: Disk Fragmenter (Python)

Deep Dive into Microsoft MarkItDown