While working with a huge volume of data, it may be required to do analysis only on certain set of data specific to say days', months' data. It is not uncommon to store data in a year/month/date or even hour/minute format.
As loading data to dataframe requires a lot of compute power and time, any optimization on data load saves a tons of resources.
So for selectively searching data in specific folder using spark dataframe load method, following wildcards can be used in the path parameter.
Environment Setup:
The files are on Azure Blob Storage with the format of yyyy/MM/dd/xyz.txt. So as to see the results, the files themselves just have one line with the date in it for easier explanation.
The examples below might show for day alone, however you can
All the files for all the days.
Format to use:
"/*/*/*/*" (One each for each hierarchy level and the last * represents the files themselves).
df = spark.read.text(mount_point + "/*/*/*/*")
Specific days/ months folder to check
Format to use:
"/*/*/1[2,9]/*" (Loads data for Day 12th and 19th of all months of all years)
"/*/*//{09,19,23/}/*" (Loads data for 9th, 19th and 23rd of all months of all years)
df = spark.read.text(mount_point + "/*/*/1[2,9]/*")
df = spark.read.text(mount_point + "/*/*//{09,19,23/}/*")
Specific series of folders
Format to use:
"/*/*/1[3-6]/*" (Loads data from Day 13th to 16th of all months of all years)
df = spark.read.text(mount_point +"/*/*/1[3-6]/*")
Combining Specific folders and some series
Format to use:
"/*/*//{09,1[8-9],2[0-1]/}/*" (Loads data for Day 9th and from 18th to 21st of all months of all years)
df = spark.read.text(mount_point +"/*/*//{09,1[8-9],2[0-1]/}/*")
Ofcourse, the other folders in the path can also use wildcards or specific values, based on need. Below is an example of 10th, 20th and 30th of Sep-2020 alone.
df = spark.read.text(mount_point +"/2020/09//{10,20,30/}/*")
Happy data mining!
Top comments (0)