XML Batch Data Processing
For decades xml was the dominate format many applications used to communicate with one another. In recent years xml has become less and less popular as a data format for semi-structured data cases. Now in modern cloud native, microservice applications json or yaml has become the norm. However this doesn't change the fact that much of the systems, tooling, and business process are still in place. XML is also the defacto 'neutral' format for converting industry specific file types. As a professional analyst being able to process xml as easily as csv is a must.
In this post we'll create a batch data processing script for a typical lab reporting use case. The scenario, a technician, Susian, at the end of each day analyzes the departments test results. The lab equipment outputs test results as xml. Susian requires the xml to be converted into csv in order to feed an Excel report she was given.
I'll be using a sample lab-measurement-dataset I auto generated in a previous post here.
Contents
Convert XML to CSV
Example XML File
<?xml version="1.0" ?>
<DataFiles>
<DataFile id="d7ecf46df41f4355a23c42c4607266c7">
<machine_id>Machine_10</machine_id>
<test_id>3c0c95f773bf426585a3a68642d2d41a</test_id>
<technician>Brett Kerr</technician>
<test_routine>c/f/a/b</test_routine>
<batched>Yes</batched>
<loc_1>
<x_offset>8.63</x_offset>
<y_offset>0.39661</y_offset>
<z_offset>3736</z_offset>
</loc_1>
<loc_2>
<x_offset>-4.94</x_offset>
<y_offset>0.964363</y_offset>
<z_offset>3182</z_offset>
</loc_2>
</DataFile>
</DataFiles>
Target Table Structure
machine_id | test_id | technician | test_routine | batched | measurement_location_id | x_offset | y_offset | z_offset |
---|---|---|---|---|---|---|---|---|
Machine_03 | 87729203dfce4e9da7efbbb985c83bd9 | Stacey Simpson | d/a/f | Yes | loc_1 | 47.81 | 1.878128 | 871 |
Machine_03 | 87729203dfce4e9da7efbbb985c83bd9 | Stacey Simpson | d/a/f | Yes | loc_2 | -2.1 | 0.995672 | 4694 |
Machine_03 | 98910cae86864c3697e6e7fccfb8cc33 | Brittney Gray | f/d/a | Yes | loc_1 | 2.5 | 0.236024 | 3034 |
Parsing XML: Root element
To handel the xml parsing we'll use the import xml.etree.ElementTree
package. The key to parsing xml is creating a class representing a single file and methods that translate each element. In the example file the first element to isolate will be each DataFile
for the DataFiles
collection:
import xml.etree.ElementTree as ET
import uuid
import pandas as pd
import glob
file_path = '/sample-data-set/auto-gen/xml/a8971cf83bd84fd1b366bfb312278021.xml'
with open(file_path) as f:
tree = ET.parse(f)
rootElem = tree.getroot()
# Select each DataFile
for data_file in rootElem.findall('DataFile'):
print(data_file.get('id'))
>>> output:
d7ecf46df41f4355a23c42c4607266c7
Parsing XML: First level elements
Next extract the 'Header' information for each test
file_path = '/sample-data-set/auto-gen/xml/a8971cf83bd84fd1b366bfb312278021.xml'
with open(file_path) as f:
tree = ET.parse(f)
rootElem = tree.getroot()
for data_file in rootElem.findall('DataFile'):
m_id = data_file.find('machine_id').text
test_id = data_file.find('test_id').text
tech_id = data_file.find('technician').text
test_routine = data_file.find('test_routine').text
batched = data_file.find('batched').text
header_ar = [m_id, test_id, tech_id, test_routine, batched]
print(header_ar)
>>> output:
['Machine_03', '9e0f95807ed44a468271eb6d3ff85a44', 'Jennifer Johnson', 'd', 'N/A']
Parsing XML: Sub-elements
Finally, extract the measurement data for loc_1
and loc_2
file_path = '/sample-data-set/auto-gen/xml/a8971cf83bd84fd1b366bfb312278021.xml'
def parse_measurement_location(dataFile, loc_name):
res = []
measurement_loc = dataFile.find(loc_name)
measurement_loc_id = measurement_loc.tag
x_m = measurement_loc.find('x_offset').text
y_m = measurement_loc.find('y_offset').text
z_m = measurement_loc.find('z_offset').text
return [measurement_loc_id,x_m,y_m,z_m]
with open(file_path) as f:
tree = ET.parse(f)
rootElem = tree.getroot()
for data_file in rootElem.findall('DataFile'):
file_ar = []
m_id = data_file.find('machine_id').text
test_id = data_file.find('test_id').text
tech_id = data_file.find('technician').text
test_routine = data_file.find('test_routine').text
batched = data_file.find('batched').text
header_ar = [m_id, test_id, tech_id, test_routine, batched]
loc_ar = parse_measurement_location(data_file, 'loc_1')
file_ar.append(header_ar + loc_ar)
loc_ar = parse_measurement_location(data_file, 'loc_2')
file_ar.append(header_ar + loc_ar)
print(file_ar)
>>> output:
[['Machine_03', '9e0f95807ed44a468271eb6d3ff85a44', 'Jennifer Johnson', 'd', 'N/A', 'loc_1', '17.35', '1.3074', '752'],
['Machine_03', '9e0f95807ed44a468271eb6d3ff85a44', 'Jennifer Johnson', 'd', 'N/A', 'loc_2', '2.68', '0.575979', '4483']]
XML Parser Class
We then place the transformation logic into a class that can be called for any file:
# Parsing class
class Xml_Parser:
def __init__(self):
self.ResultAr = []
return
def parse_datafiles(self,file_id,rootElem):
res = []
for data_file in rootElem.findall('DataFile'):
file_ar = self.parse_file(data_file)
res += file_ar
return res
def parse_file(self, data_file):
file_ar = []
m_id = data_file.find('machine_id').text
test_id = data_file.find('test_id').text
tech_id = data_file.find('technician').text
test_routine = data_file.find('test_routine').text
batched = data_file.find('batched').text
header_ar = [m_id, test_id, tech_id, test_routine, batched]
loc_ar = self.parse_measurement_location(data_file, 'loc_1')
file_ar.append(header_ar + loc_ar)
loc_ar = self.parse_measurement_location(data_file, 'loc_2')
file_ar.append(header_ar + loc_ar)
return file_ar
def parse_measurement_location(self, dataFile, loc_name):
res = []
measurement_loc = dataFile.find(loc_name)
measurement_loc_id = measurement_loc.tag
x_m = measurement_loc.find('x_offset').text
y_m = measurement_loc.find('y_offset').text
z_m = measurement_loc.find('z_offset').text
return [measurement_loc_id,x_m,y_m,z_m]
Batch Execution
To do a batch conversion on a set of files we'll need to:
- Parse a given file by executing our parsing class
- Find all test files
# Conversion executor function
def convert_xml_to_list(file_path):
with open(file_path) as f:
tree = ET.parse(f)
root = tree.getroot()
parser_obj = Xml_Parser()
xml_list = parser_obj.parse_datafiles(f,root)
return xml_list
# Batch executor
def batch_convert_xml_to_df(xml_dir, dataset_columns,file_limit=-1):
i=0
converted_dataset = []
# Recursively convert each target file
for filepath in glob.iglob(xml_dir, recursive=True):
c_ds = convert_xml_to_list(filepath)
converted_dataset+=c_ds
i+=1
if (i >= file_limit) and (file_limit>=0): break
df = pd.DataFrame(converted_dataset, columns = dataset_columns)
return df
Save Converted Dataset
The last step is to bring all the pieces together convert and save a folder of xml files to a single csv.
measurement_file_path = '/sample-data-set/auto-gen/xml/*.xml'
measurement_columns = ['machine_id','test_id','technician','test_routine','batched','measurement_location_id','x_offset','y_offset','z_offset']
file_process_limit = 20 # Set to -1 for unlimited
converted_file_dir = '/sample-data-set/auto-gen/converted'
lab_measurement_df = batch_convert_xml_to_df(measurement_file_path, measurement_columns, file_process_limit)
# Save dataset to csv using unique name
destPath = '/'.join([converted_file_dir,str(uuid.uuid4().hex)])
destPath = '.'.join([destPath,'csv'])
lab_measurement_df.to_csv(destPath, index = False, header=True)
lab_measurement_df.head()
machine_id | test_id | technician | test_routine | batched | measurement_location_id | x_offset | y_offset | z_offset | |
---|---|---|---|---|---|---|---|---|---|
0 | Machine_03 | 87729203dfce4e9da7efbbb985c83bd9 | Stacey Simpson | d/a/f | Yes | loc_1 | 47.81 | 1.878128 | 871 |
1 | Machine_03 | 87729203dfce4e9da7efbbb985c83bd9 | Stacey Simpson | d/a/f | Yes | loc_2 | -2.1 | 0.995672 | 4694 |
2 | Machine_03 | 98910cae86864c3697e6e7fccfb8cc33 | Brittney Gray | f/d/a | Yes | loc_1 | 2.5 | 0.236024 | 3034 |
3 | Machine_03 | 98910cae86864c3697e6e7fccfb8cc33 | Brittney Gray | f/d/a | Yes | loc_2 | -13.38 | 0.795762 | 4337 |
4 | Machine_03 | f505ffc52ad34b56a9f7f95451a813c3 | John King | c/f/e | No | loc_1 | 44.2 | 1.982816 | 830 |
TL;DR
Complete script here
Top comments (2)
Hey, great post with a helpful example.
Ideally, we shouldn't have to write so much code just to read in XML.
Very soon, we will be able to read XML as easily as a csv, with just one line of code. See pandas
Last year I've created my own pandas-read-xml. You can test it out here. Of course, my solution will be obsolete as soon as pandas releases their dev version.
Thanks for bringing the great Pandas news. After xml is directly supported I'll be sad to see these functions leave our toolboxes, they've served me well.