Enhancing Optimized PySpark Queries

Edward Turner ・ 1 min read

As the volume of data we process and store continues to grow, and as the pace of technological change keeps accelerating, innovative approaches to improving the run-time of our software and analyses are necessary.

These approaches include two very popular frameworks: Apache Spark and Apache Arrow. Spark enables users to process large volumes of data in a distributed fashion, while Arrow lets users process larger volumes of data more quickly through vectorized, columnar execution. Together, they make big-data analysis far more approachable. However, despite the power these frameworks offer, there is still room for improvement, specifically within the Python ecosystem. Why can we confidently identify pockets of improvement when using these frameworks from Python? Let's examine some features Python has.
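To make the vectorization point concrete, here is a minimal sketch (not from the original post) of the difference between row-at-a-time processing, as in a plain Python UDF, and batch-at-a-time processing, the style that Arrow-backed execution enables. It uses NumPy purely for illustration; the function names are my own, not part of any Spark or Arrow API.

```python
import numpy as np

def row_at_a_time(values):
    # Mimics a plain Python UDF: each element crosses the
    # interpreter boundary individually, paying per-row overhead.
    return [v * 2 + 1 for v in values]

def batch_at_a_time(values: np.ndarray) -> np.ndarray:
    # Mimics a vectorized (Arrow-style) UDF: the whole batch is
    # transformed in one call, executed in native code.
    return values * 2 + 1

data = np.arange(100_000, dtype=np.int64)
# Both paths compute the same result; the batch path avoids
# the per-element Python overhead.
assert batch_at_a_time(data).tolist() == row_at_a_time(data)
```

In PySpark, the analogous switch is from a regular `udf` to a `pandas_udf`, which moves data between the JVM and Python in Arrow batches rather than one row at a time.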


If you want to learn more, please continue reading here: https://towardsdatascience.com/enhancing-optimized-pyspark-queries-1d2e9685d882

