DEV Community

Cover image for XML Batch Data Processing
DarrylBryson
DarrylBryson

Posted on

XML Batch Data Processing

XML Batch Data Processing

For decades xml was the dominate format many applications used to communicate with one another. In recent years xml has become less and less popular as a data format for semi-structured data cases. Now in modern cloud native, microservice applications json or yaml has become the norm. However this doesn't change the fact that much of the systems, tooling, and business process are still in place. XML is also the defacto 'neutral' format for converting industry specific file types. As a professional analyst being able to process xml as easily as csv is a must.

In this post we'll create a batch data processing script for a typical lab reporting use case. The scenario, a technician, Susian, at the end of each day analyzes the departments test results. The lab equipment outputs test results as xml. Susian requires the xml to be converted into csv in order to feed an Excel report she was given.

I'll be using a sample lab-measurement-dataset I auto generated in a previous post here.

Contents

Convert XML to CSV

Example XML File

<?xml version="1.0" ?>
<DataFiles>
  <DataFile id="d7ecf46df41f4355a23c42c4607266c7">
    <machine_id>Machine_10</machine_id>
    <test_id>3c0c95f773bf426585a3a68642d2d41a</test_id>
    <technician>Brett Kerr</technician>
    <test_routine>c/f/a/b</test_routine>
    <batched>Yes</batched>
    <loc_1>
      <x_offset>8.63</x_offset>
      <y_offset>0.39661</y_offset>
      <z_offset>3736</z_offset>
    </loc_1>
    <loc_2>
      <x_offset>-4.94</x_offset>
      <y_offset>0.964363</y_offset>
      <z_offset>3182</z_offset>
    </loc_2>
  </DataFile>
</DataFiles>
Enter fullscreen mode Exit fullscreen mode

Target Table Structure

machine_id test_id technician test_routine batched measurement_location_id x_offset y_offset z_offset
Machine_03 87729203dfce4e9da7efbbb985c83bd9 Stacey Simpson d/a/f Yes loc_1 47.81 1.878128 871
Machine_03 87729203dfce4e9da7efbbb985c83bd9 Stacey Simpson d/a/f Yes loc_2 -2.1 0.995672 4694
Machine_03 98910cae86864c3697e6e7fccfb8cc33 Brittney Gray f/d/a Yes loc_1 2.5 0.236024 3034

Parsing XML: Root element

To handel the xml parsing we'll use the import xml.etree.ElementTree package. The key to parsing xml is creating a class representing a single file and methods that translate each element. In the example file the first element to isolate will be each DataFile for the DataFiles collection:

DataFile element selection

import xml.etree.ElementTree as ET
import uuid
import pandas as pd
import glob
Enter fullscreen mode Exit fullscreen mode
file_path = '/sample-data-set/auto-gen/xml/a8971cf83bd84fd1b366bfb312278021.xml'
with open(file_path) as f:
    tree = ET.parse(f)
    rootElem = tree.getroot()

    # Select each DataFile
    for data_file in rootElem.findall('DataFile'):
        print(data_file.get('id'))
Enter fullscreen mode Exit fullscreen mode
>>> output:
d7ecf46df41f4355a23c42c4607266c7
Enter fullscreen mode Exit fullscreen mode

Parsing XML: First level elements

Next extract the 'Header' information for each test

Header element selection

file_path = '/sample-data-set/auto-gen/xml/a8971cf83bd84fd1b366bfb312278021.xml'
with open(file_path) as f:
    tree = ET.parse(f)
    rootElem = tree.getroot()
for data_file in rootElem.findall('DataFile'):
    m_id = data_file.find('machine_id').text
    test_id = data_file.find('test_id').text
    tech_id = data_file.find('technician').text
    test_routine = data_file.find('test_routine').text
    batched = data_file.find('batched').text

    header_ar = [m_id, test_id, tech_id, test_routine, batched]
    print(header_ar)
Enter fullscreen mode Exit fullscreen mode
>>> output:
['Machine_03', '9e0f95807ed44a468271eb6d3ff85a44', 'Jennifer Johnson', 'd', 'N/A']
Enter fullscreen mode Exit fullscreen mode

Parsing XML: Sub-elements

Finally, extract the measurement data for loc_1 and loc_2

Measurement element selection

file_path = '/sample-data-set/auto-gen/xml/a8971cf83bd84fd1b366bfb312278021.xml'

def parse_measurement_location(dataFile, loc_name):
    res = []
    measurement_loc = dataFile.find(loc_name)
    measurement_loc_id = measurement_loc.tag
    x_m = measurement_loc.find('x_offset').text
    y_m = measurement_loc.find('y_offset').text
    z_m = measurement_loc.find('z_offset').text

    return [measurement_loc_id,x_m,y_m,z_m]

with open(file_path) as f:
    tree = ET.parse(f)
    rootElem = tree.getroot()
for data_file in rootElem.findall('DataFile'):
    file_ar = []
    m_id = data_file.find('machine_id').text
    test_id = data_file.find('test_id').text
    tech_id = data_file.find('technician').text
    test_routine = data_file.find('test_routine').text
    batched = data_file.find('batched').text

    header_ar = [m_id, test_id, tech_id, test_routine, batched]

    loc_ar = parse_measurement_location(data_file, 'loc_1')
    file_ar.append(header_ar + loc_ar)

    loc_ar = parse_measurement_location(data_file, 'loc_2')
    file_ar.append(header_ar + loc_ar)
    print(file_ar)


Enter fullscreen mode Exit fullscreen mode
>>> output:
[['Machine_03', '9e0f95807ed44a468271eb6d3ff85a44', 'Jennifer Johnson', 'd', 'N/A', 'loc_1', '17.35', '1.3074', '752'], 

['Machine_03', '9e0f95807ed44a468271eb6d3ff85a44', 'Jennifer Johnson', 'd', 'N/A', 'loc_2', '2.68', '0.575979', '4483']]
Enter fullscreen mode Exit fullscreen mode

XML Parser Class

We then place the transformation logic into a class that can be called for any file:

# Parsing class
class Xml_Parser:
    def __init__(self):
        self.ResultAr = []
        return
    def parse_datafiles(self,file_id,rootElem):
        res = []
        for data_file in rootElem.findall('DataFile'):
            file_ar = self.parse_file(data_file)
            res += file_ar
        return res
    def parse_file(self, data_file):

        file_ar = []
        m_id = data_file.find('machine_id').text
        test_id = data_file.find('test_id').text
        tech_id = data_file.find('technician').text
        test_routine = data_file.find('test_routine').text
        batched = data_file.find('batched').text

        header_ar = [m_id, test_id, tech_id, test_routine, batched]

        loc_ar = self.parse_measurement_location(data_file, 'loc_1')
        file_ar.append(header_ar + loc_ar)

        loc_ar = self.parse_measurement_location(data_file, 'loc_2')
        file_ar.append(header_ar + loc_ar)

        return file_ar
    def parse_measurement_location(self, dataFile, loc_name):
        res = []
        measurement_loc = dataFile.find(loc_name)
        measurement_loc_id = measurement_loc.tag
        x_m = measurement_loc.find('x_offset').text
        y_m = measurement_loc.find('y_offset').text
        z_m = measurement_loc.find('z_offset').text

        return [measurement_loc_id,x_m,y_m,z_m]
Enter fullscreen mode Exit fullscreen mode

Batch Execution

To do a batch conversion on a set of files we'll need to:

  1. Parse a given file by executing our parsing class
  2. Find all test files

# Conversion executor function
def convert_xml_to_list(file_path):
    with open(file_path) as f:
        tree = ET.parse(f)
        root = tree.getroot()
        parser_obj = Xml_Parser()
        xml_list = parser_obj.parse_datafiles(f,root)
    return xml_list
Enter fullscreen mode Exit fullscreen mode
# Batch executor
def batch_convert_xml_to_df(xml_dir, dataset_columns,file_limit=-1):
    i=0
    converted_dataset = []
    # Recursively convert each target file
    for filepath in glob.iglob(xml_dir, recursive=True):
        c_ds = convert_xml_to_list(filepath)
        converted_dataset+=c_ds

        i+=1
        if (i >= file_limit) and (file_limit>=0): break

    df = pd.DataFrame(converted_dataset, columns = dataset_columns) 
    return df
Enter fullscreen mode Exit fullscreen mode

Save Converted Dataset

The last step is to bring all the pieces together convert and save a folder of xml files to a single csv.

measurement_file_path = '/sample-data-set/auto-gen/xml/*.xml'
measurement_columns = ['machine_id','test_id','technician','test_routine','batched','measurement_location_id','x_offset','y_offset','z_offset']
file_process_limit = 20 # Set to -1 for unlimited

converted_file_dir = '/sample-data-set/auto-gen/converted'

lab_measurement_df = batch_convert_xml_to_df(measurement_file_path, measurement_columns, file_process_limit)

# Save dataset to csv using unique name
destPath = '/'.join([converted_file_dir,str(uuid.uuid4().hex)])
destPath = '.'.join([destPath,'csv'])

lab_measurement_df.to_csv(destPath, index = False, header=True)

lab_measurement_df.head()
Enter fullscreen mode Exit fullscreen mode
machine_id test_id technician test_routine batched measurement_location_id x_offset y_offset z_offset
0 Machine_03 87729203dfce4e9da7efbbb985c83bd9 Stacey Simpson d/a/f Yes loc_1 47.81 1.878128 871
1 Machine_03 87729203dfce4e9da7efbbb985c83bd9 Stacey Simpson d/a/f Yes loc_2 -2.1 0.995672 4694
2 Machine_03 98910cae86864c3697e6e7fccfb8cc33 Brittney Gray f/d/a Yes loc_1 2.5 0.236024 3034
3 Machine_03 98910cae86864c3697e6e7fccfb8cc33 Brittney Gray f/d/a Yes loc_2 -13.38 0.795762 4337
4 Machine_03 f505ffc52ad34b56a9f7f95451a813c3 John King c/f/e No loc_1 44.2 1.982816 830

TL;DR

Complete script here

Top comments (2)

Collapse
 
minchulkim87 profile image
Min

Hey, great post with a helpful example.

Ideally, we shouldn't have to write so much code just to read in XML.

Very soon, we will be able to read XML as easily as a csv, with just one line of code. See pandas

Last year I've created my own pandas-read-xml. You can test it out here. Of course, my solution will be obsolete as soon as pandas releases their dev version.

Collapse
 
darrylbrysondev0 profile image
DarrylBryson • Edited

Thanks for bringing the great Pandas news. After xml is directly supported I'll be sad to see these functions leave our toolboxes, they've served me well.