Exploiting Schema Inference in Apache Spark

#spark #scala #bigdata #datascience

One of the greatest features of Apache Spark is its ability to infer the schema on the fly. Reading the data and generating a schema as you go is easy to use, but it makes the read itself slower, because Spark has to go through the input once just to work out the types. However, there is a trick: generate the schema once, and then just load it from disk. Let’s dive in!

For the sake of this article, let’s assume that we are working with data that has a complex structure, where writing a case class or StructType by hand would take a hundred lines.

Saving the Schema

Let’s begin by reading some sample data and letting Spark infer its schema. Using Apache Spark with Scala, this can be done as follows:

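// Spark infers the schema of JSON sources whenever none is supplied;
// the "inferSchema" option is only used by the CSV reader and has no effect here
val carsDf = spark.read
    .option("inferSchema", "true")
    .json("src/main/resources/data/cars.json")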

This reads a sample JSON file (cars.json) into a DataFrame called carsDf. A sample record of this dataset looks like the following:

{
   "Name":"chevrolet chevelle malibu",
   "Miles_per_Gallon":18,
   "Cylinders":8,
   "Displacement":307,
   "Horsepower":130,
   "Weight_in_lbs":3504,
   "Acceleration":12,
   "Year":"1970-01-01",
   "Origin":"USA"
}

As we can see above, the structure is fairly straightforward. Now, let’s see what Spark has managed to infer from it:

val schemaJson = carsDf.schema.json
println(schemaJson)

The output looks something like this (pretty-printed here for readability; schema.json itself returns a compact single-line string):

{
   "type":"struct",
   "fields":[
      {
         "name":"Acceleration",
         "type":"double",
         "nullable":true,
         "metadata":{

         }
      },
      {
         "name":"Cylinders",
         "type":"long",
         "nullable":true,
         "metadata":{

         }
      },
      {
         "name":"Displacement",
         "type":"double",
         "nullable":true,
         "metadata":{

         }
      },
      {
         "name":"Horsepower",
         "type":"long",
         "nullable":true,
         "metadata":{

         }
      },
      {
         "name":"Miles_per_Gallon",
         "type":"double",
         "nullable":true,
         "metadata":{

         }
      },
      {
         "name":"Name",
         "type":"string",
         "nullable":true,
         "metadata":{

         }
      },
      {
         "name":"Origin",
         "type":"string",
         "nullable":true,
         "metadata":{

         }
      },
      {
         "name":"Weight_in_lbs",
         "type":"long",
         "nullable":true,
         "metadata":{

         }
      },
      {
         "name":"Year",
         "type":"string",
         "nullable":true,
         "metadata":{

         }
      }
   ]
}

That is how Spark describes the schema it managed to infer. Without a saved copy, this information has to be recreated every time we launch our Spark job. During development we relaunch the job a lot, so we lose a lot of time re-inferring the same schema. Let’s try to generate it once and reuse it in subsequent reads.

In Scala, the schema’s JSON can be saved to the local filesystem using the following code:

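import java.io.{BufferedWriter, File, FileWriter}

val file = new File("src/resources/schema.json")
val bw = new BufferedWriter(new FileWriter(file))
// Close the writer even if the write fails
try bw.write(schemaJson) finally bw.close()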
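As an alternative, here is a minimal sketch that writes the same file with java.nio.file.Files from the Java standard library, which opens, writes and closes the file in a single call. The schemaJson value below is only a placeholder standing in for carsDf.schema.json, and creating the parent directory is my own assumption about the project layout.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Placeholder standing in for carsDf.schema.json (assumption for this sketch)
val schemaJson: String = """{"type":"struct","fields":[]}"""

// Same target path as in the article
val schemaPath = Paths.get("src/resources/schema.json")

// Make sure the parent directory exists before writing
Files.createDirectories(schemaPath.getParent)

// Creates the file (or truncates an existing one) and writes the UTF-8 bytes
Files.write(schemaPath, schemaJson.getBytes(StandardCharsets.UTF_8))
```

Writing the bytes with an explicit charset avoids depending on the platform default encoding, which FileWriter uses unless told otherwise.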

This way, our inferred schema is saved to the local filesystem as a JSON file called schema.json. In the next step, we will see how to read it back and use it to speed up Spark.

Loading the Schema

Now, let’s see how to read the schema file back and turn it into a Spark schema in Scala. This can be done using the following code:

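import scala.io.Source
import scala.util.Try

import org.apache.spark.sql.catalyst.parser.LegacyTypeStringParser
import org.apache.spark.sql.types.{DataType, StructType}

// Read the whole schema file into one string, then release the file handle
val schemaSource = Source.fromFile("src/resources/schema.json")
val schemaJson = try schemaSource.getLines().mkString finally schemaSource.close()

// Parse as JSON first; fall back to Spark's legacy schema string format.
// LegacyTypeStringParser is an internal Catalyst class and may change between versions.
val schemaStructType = Try(DataType.fromJson(schemaJson))
    .getOrElse(LegacyTypeStringParser.parse(schemaJson)) match {
      case t: StructType => t
      case _ => throw new RuntimeException("Failed parsing JSON Schema")
    }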

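The code above reads the schema from the JSON file and parses it into a StructType instance, thanks to the Spark SQL package. DataType.fromJson returns a generic DataType, so I have used pattern matching to make sure the parsed result really is a StructType; a file describing anything else fails with an exception. Note that an incorrect path fails earlier, when Source.fromFile tries to open the file.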
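To illustrate why this works, here is a small sketch showing that schema.json and DataType.fromJson are inverses of each other. It needs only the Spark SQL types on the classpath, no running SparkSession. The three-field schema is a hand-built stand-in for part of the inferred cars.json schema, not the full thing.

```scala
import org.apache.spark.sql.types.{DataType, DoubleType, LongType, StringType, StructField, StructType}

// Hand-built stand-in for three of the inferred cars.json fields
val original = StructType(Seq(
  StructField("Name", StringType, nullable = true),
  StructField("Cylinders", LongType, nullable = true),
  StructField("Acceleration", DoubleType, nullable = true)
))

// Serialize to the same JSON representation that carsDf.schema.json produces
val asJson: String = original.json

// Parse it back; fromJson returns a generic DataType, hence the match
val restored: StructType = DataType.fromJson(asJson) match {
  case struct: StructType => struct
  case other =>
    throw new IllegalArgumentException(s"Expected a struct, got ${other.typeName}")
}

// Field names, types, nullability and metadata all survive the round trip
assert(restored == original)
```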

With the pre-generated schema available, Spark no longer needs the extra pass over the data to infer the types, so reading will be faster. Using our cars.json example, we can read the data like this:

val carsDf = spark.read
    .schema(schemaStructType)
    .json("src/main/resources/data/cars.json")
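The two halves of this article can be combined into one helper. The sketch below is only an illustration: readJsonWithCachedSchema is a hypothetical name of my own, not a Spark API, and for brevity it skips the legacy-format fallback shown earlier. It loads the schema file when it exists; otherwise it lets Spark infer the schema once and saves it for the next run.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helper (not part of Spark): read JSON with a cached schema
// when one exists, otherwise infer the schema once and cache it on disk.
def readJsonWithCachedSchema(
    spark: SparkSession,
    dataPath: String,
    schemaPath: String): DataFrame = {
  val schemaFile = Paths.get(schemaPath)

  if (Files.exists(schemaFile)) {
    // Fast path: parse the saved schema and skip inference
    val json = new String(Files.readAllBytes(schemaFile), StandardCharsets.UTF_8)
    val schema = DataType.fromJson(json) match {
      case struct: StructType => struct
      case other =>
        throw new IllegalArgumentException(s"Expected a struct, got ${other.typeName}")
    }
    spark.read.schema(schema).json(dataPath)
  } else {
    // Slow path: let Spark infer the schema, then save it for later runs
    val df = spark.read.json(dataPath)
    Option(schemaFile.getParent).foreach(dir => Files.createDirectories(dir))
    Files.write(schemaFile, df.schema.json.getBytes(StandardCharsets.UTF_8))
    df
  }
}
```

With the SparkSession from this article, it could be called like this:

val carsDf = readJsonWithCachedSchema(spark, "src/main/resources/data/cars.json", "src/resources/schema.json")

One caveat of this design: if the structure of the data changes, the saved schema goes stale, so the schema file has to be deleted to trigger a fresh inference.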

Summary

I hope you have found this post useful. If so, don’t hesitate to like or share this post. Additionally, you can follow me on my social media if you fancy so 🙂
