James Wachuka

Building a Weather Data Pipeline with PySpark, Prefect, and Google Cloud

In this article, I'll walk you through how to build a data pipeline that fetches weather data for multiple cities, processes the data using PySpark, and stores the output in Google Cloud.

We'll be using PySpark for distributed data processing, Prefect for workflow management, and Google Cloud Storage and BigQuery for data storage and analysis. The code is available on GitHub.
Looker dashboard

The pipeline fetches weather data for multiple cities from the OpenWeatherMap API using the requests library. The data is then processed and filtered using PySpark, and the output is stored as a CSV file in Google Cloud Storage. Finally, the data is loaded into a BigQuery table for further analysis.
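The fields the pipeline pulls out of the API come from the /weather endpoint's JSON payload. Abridged, and with illustrative values, the parts the code below relies on look roughly like this:

# Abridged sketch of an OpenWeatherMap /weather response (illustrative values);
# only the fields the pipeline reads are shown -- the real payload has many more.
sample_response = {
    "name": "London",
    "main": {"temp": 14.2, "humidity": 81},  # temperature in Celsius (units=metric), humidity in %
    "wind": {"speed": 3.6},                  # wind speed in m/s
}

temp = sample_response["main"]["temp"]
humidity = sample_response["main"]["humidity"]
wind_speed = sample_response["wind"]["speed"]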


data pipeline

from pyspark.sql import SparkSession
from google.cloud import storage
from google.cloud import bigquery
import requests
import json
from google.cloud.exceptions import NotFound
import random


# Define the OpenWeatherMap API key and base URL
api_key = ""
base_url = "https://api.openweathermap.org/data/2.5/weather"

# City list (abridged here); sample a random subset, capped at the list length
cities_100 = ['New York', 'London', 'Paris', 'Tokyo', 'Sydney', 'Moscow', 'Beijing', 'Rio de Janeiro', 'Mumbai', 'Cairo', 'Rome', 'Berlin', 'Toronto', 'Lagos', 'Bangkok', 'Melbourne', 'Johannesburg']

cities = random.sample(cities_100, min(25, len(cities_100)))

# Initialize a Spark session
spark = SparkSession.builder.appName("WeatherData").getOrCreate()

# Define a function to fetch weather data for a city and return a Spark dataframe
def fetch_weather_data(city):
    # Send a request to the OpenWeatherMap API for the city's weather data
    params = {"q": city, "appid": api_key, "units": "metric"}
    response = requests.get(base_url, params=params)
    response.raise_for_status()  # fail early if the city lookup or the API key is invalid
    data = response.json()

    # Extract the relevant weather data from the API response
    temp = data["main"]["temp"]
    humidity = data["main"]["humidity"]
    wind_speed = data["wind"]["speed"]

    # Create a Spark dataframe with the weather data for the city
    df = spark.createDataFrame([(city, temp, humidity, wind_speed)],
                               ["City", "Temperature", "Humidity", "WindSpeed"])
    return df

# Use the fetch_weather_data function to fetch weather data for all cities and merge them into a single dataframe
weather_data = None
for city in cities:
    city_weather_data = fetch_weather_data(city)
    if weather_data is None:
        weather_data = city_weather_data
    else:
        weather_data = weather_data.union(city_weather_data)

# Perform some basic processing and transformation on the weather data using PySpark
weather_data = weather_data.filter("Temperature > 10") \
                           .groupBy("City") \
                           .agg({"Humidity": "avg", "WindSpeed": "max"}) \
                           .withColumnRenamed("avg(Humidity)", "AverageHumidity") \
                           .withColumnRenamed("max(WindSpeed)", "MaxWindSpeed")

# Show the final processed and transformed weather data
weather_data.show()

# Write the weather data as a CSV file to a Google Cloud Storage bucket
bucket_name = "weather_app_dez"
file_name = "weather_data.csv"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(file_name)
blob.upload_from_string(weather_data.toPandas().to_csv(index=False), content_type="text/csv")

# Create a new BigQuery table and load the data from the CSV file
table_name = "dez-dtc-23-384116.weather_app.weather_data"
bigquery_client = bigquery.Client()
schema = [bigquery.SchemaField("City", "STRING"),
          bigquery.SchemaField("AverageHumidity", "FLOAT"),
          bigquery.SchemaField("MaxWindSpeed", "FLOAT")]
# Attach the schema to the load job; the job creates the table if it doesn't exist yet
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1, schema=schema)
job = bigquery_client.load_table_from_uri(f"gs://{bucket_name}/{file_name}", table_name, job_config=job_config)
job.result()
print(f"Loaded {job.output_rows} rows into BigQuery table {table_name}")

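Once the load job finishes, a quick way to confirm the rows landed is to query the table from the same script (a minimal sketch, reusing the bigquery_client and table_name defined above):

# Query the freshly loaded table to spot-check the results
query = f"SELECT City, AverageHumidity, MaxWindSpeed FROM `{table_name}` ORDER BY MaxWindSpeed DESC LIMIT 5"
for row in bigquery_client.query(query).result():
    print(row.City, row.AverageHumidity, row.MaxWindSpeed)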

adding Prefect functionality

from google.cloud import storage, bigquery
import requests
import json
from google.cloud.exceptions import NotFound
import random
from prefect import task, Flow
from pyspark.sql import SparkSession


# Define the OpenWeatherMap API key and base URL
api_key = ""
base_url = "https://api.openweathermap.org/data/2.5/weather"

spark = SparkSession.builder.appName("WeatherData").getOrCreate()

# Define the cities list
cities_100 = ['New York', 'London', 'Paris', 'Tokyo', 'Sydney', 'Moscow', 'Beijing', 'Rio de Janeiro', 'Mumbai', 'Cairo']
cities = random.sample(cities_100, 5)

# Define a task that fetches weather data for a list of cities and returns a single Spark dataframe
@task
def fetch_weather_data(cities):
    weather_data = None
    for city in cities:
        # Send a request to the OpenWeatherMap API for the city's weather data
        params = {"q": city, "appid": api_key, "units": "metric"}
        response = requests.get(base_url, params=params)
        response.raise_for_status()  # fail early if the city lookup or the API key is invalid
        data = response.json()

        # Extract the relevant weather data from the API response
        temp = data["main"]["temp"]
        humidity = data["main"]["humidity"]
        wind_speed = data["wind"]["speed"]

        # Create a Spark dataframe with the weather data for the city
        city_weather_data = spark.createDataFrame([(city, temp, humidity, wind_speed)],
                                ["City", "Temperature", "Humidity", "WindSpeed"])
        if weather_data is None:
            weather_data = city_weather_data
        else:
            weather_data = weather_data.union(city_weather_data)
    return weather_data


# Perform some basic processing and transformation on the weather data using PySpark
@task
def process_weather_data(weather_data):
    processed_data = weather_data.filter("Temperature > 10") \
                           .groupBy("City") \
                           .agg({"Humidity": "avg", "WindSpeed": "max"}) \
                           .withColumnRenamed("avg(Humidity)", "AverageHumidity") \
                           .withColumnRenamed("max(WindSpeed)", "MaxWindSpeed")
    return processed_data

# Write the weather data as a CSV file to a Google Cloud Storage bucket
@task
def write_weather_data_to_gcs(weather_data):
    bucket_name = "weather_app_dez"
    file_name = "weather_data.csv"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(file_name)
    blob.upload_from_string(weather_data.toPandas().to_csv(index=False), content_type="text/csv")
    return f"gs://{bucket_name}/{file_name}"

# Create a new BigQuery table and load the data from the CSV file
@task
def write_weather_data_to_bigquery(uri):
    table_name = "dez-dtc-23-384116.weather_app.weather_data"
    bigquery_client = bigquery.Client()
    schema = [bigquery.SchemaField("City", "STRING"),
              bigquery.SchemaField("AverageHumidity", "FLOAT"),
              bigquery.SchemaField("MaxWindSpeed", "FLOAT")]
    # Attach the schema to the load job; the job creates the table if it doesn't exist yet
    job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1, schema=schema)
    job = bigquery_client.load_table_from_uri(uri, table_name, job_config=job_config)
    job.result()
    print(f"Loaded {job.output_rows} rows into BigQuery table {table_name}")

with Flow("Weather Data Pipeline") as flow:

    cities = random.sample(cities_100, 5)
    weather_data = fetch_weather_data(cities)
    processed_data = process_weather_data(weather_data)
    uri = write_weather_data_to_gcs(processed_data)
    write_weather_data_to_bigquery(uri)

flow.run()
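The flow above uses the Prefect 1.x API (the Flow context manager plus flow.run()). If you are on Prefect 2.x, flows are plain decorated functions instead; a minimal sketch of the same orchestration, assuming the task definitions stay as they are:

from prefect import flow

@flow(name="Weather Data Pipeline")
def weather_pipeline():
    # Inside a Prefect 2 flow, calling a @task-decorated function runs it and returns its result
    cities = random.sample(cities_100, 5)
    weather_data = fetch_weather_data(cities)
    processed_data = process_weather_data(weather_data)
    uri = write_weather_data_to_gcs(processed_data)
    write_weather_data_to_bigquery(uri)

weather_pipeline()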

To run the pipeline, you'll need a Google Cloud account with billing enabled (the free tier works), and an API key for the OpenWeatherMap API.
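A small hardening step: instead of pasting the key into api_key at the top of the script, read it from an environment variable (a minimal sketch; OPENWEATHER_API_KEY is just an illustrative name):

import os

# Keep the OpenWeatherMap key out of source control by reading it from the environment
api_key = os.environ.get("OPENWEATHER_API_KEY", "")
if not api_key:
    raise RuntimeError("Set the OPENWEATHER_API_KEY environment variable before running the pipeline")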

Configuration

Create a new Google Cloud Storage bucket to store the output data. Create a new BigQuery dataset and table to store the output data. Update the bucket_name and table_name variables in the write_weather_data_to_gcs and write_weather_data_to_bigquery tasks, respectively, with the appropriate names of the bucket and table you created.
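If you would rather provision these resources from Python than from the console, the same client libraries can do it. A minimal sketch using the names from this article (the US location is an assumption; adjust the project, names, and location to your own setup):

from google.cloud import storage, bigquery

storage_client = storage.Client()
bigquery_client = bigquery.Client()

# Create the output bucket (bucket names must be globally unique)
storage_client.create_bucket("weather_app_dez", location="US")

# Create the dataset that will hold the weather_data table;
# the load job itself creates the table, so it doesn't need to exist up front
dataset = bigquery.Dataset("dez-dtc-23-384116.weather_app")
dataset.location = "US"
bigquery_client.create_dataset(dataset, exists_ok=True)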

Running the Pipeline

To run the pipeline, follow these steps. Open a terminal window, navigate to the project directory, and build the Docker image:

docker build -t weather-data-pipeline .

Then run the container:

docker run --rm -it -v $(pwd):/app -e GOOGLE_APPLICATION_CREDENTIALS=/app/your-credentials.json weather-data-pipeline

Note: replace your-credentials.json with the name of your Google Cloud Platform service account key file. The pipeline will run, and the output will be written to the Google Cloud Storage bucket and BigQuery table you specified in the configuration step.

Troubleshooting

If you encounter any issues while running the pipeline, please check the following:

Ensure that the Google Cloud Platform credentials you specified are valid and have the appropriate permissions to access GCS and BigQuery.
Ensure that the bucket and table names you specified in the configuration step are correct.
Check the logs for any error messages that might indicate the cause of the issue.
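A quick way to verify the first two points is to touch both services directly with the same credentials before running the full pipeline (a minimal sketch, using the names from the configuration step):

from google.cloud import storage, bigquery

# Both calls fail fast with a clear permission or not-found error if something is off
bucket = storage.Client().get_bucket("weather_app_dez")
dataset = bigquery.Client().get_dataset("dez-dtc-23-384116.weather_app")
print(f"OK: can reach gs://{bucket.name} and dataset {dataset.full_dataset_id}")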

In conclusion, building a data pipeline that fetches and processes weather data using PySpark, Prefect, and Google Cloud is an exciting project that showcases the power of these technologies. With this pipeline, you can easily collect and analyze weather data for multiple cities, and use it for various applications such as predictive modeling and weather forecasting.
