Java: Infer Column Widths of a Fixed-Width Text File

#java #parsing #functional #streams

Delimited Files and Fixed-Width Files

Flat text files containing tables of data are usually organised in one of two ways: as delimited files, or as fixed-width files. Delimited files use one or more characters in series to separate the columns of the tabular data along each row (and line breaks are almost always used to separate rows). A common delimited file format is the CSV (comma-separated values) format:

287540,Smith,Jones,Accountant,"$55,000"
204878,Ross,Betsy,Senior Accountant,"$66,000"
208417,Arthur,Wilbur,CEO,"$123,000"

...the delimiter can sometimes show up within a value in a row, and when that happens, the value is usually surrounded by double-quotes. Quotes can also show up in values, and when they appear, they are escaped by doubling (""). RFC-4180 defines the standard CSV format.

A fixed-width file, on the other hand, enforces a fixed column width for each column (though not all columns necessarily have the same width) and pads the remaining space on the left or on the right, usually with spaces:

287540 Smith  Jones  Accountant         $55,000
204878 Ross   Betsy  Senior Accountant  $66,000
208417 Arthur Wilbur CEO               $123,000

There are advantages and distadvantages of each of these approaches. A delimited file can be easier to parse, unless there are escape characters and delimiters embedded in values. A delimited file also takes up less space than a fixed-width file, as it doesn't waste bytes padding the file full of spaces. Parsing CSV files can be simple enough if you have a good RegEx, but parsing a fixed-width file can be difficult. Either, the user has to know the column widths in advance and pass that to a parsing method, or the method has to infer the widths of the columns. The second one, being a bit more automated, is one that I would tend to prefer, so let's try to do that!

Read a Text File into a `List<String>`

The very first thing we want to do is get our fixed-width file into a List<String>. To do this, we simply get a java.io.Reader for the file as a BufferedReader and then use BufferedReader's readLine() method over and over until it returns null (it returns a String if it's successfully read a line):

jshell> String fileName = "src/main/resources/example_sql_windows.txt"
fileName ==> "src/main/resources/example_sql_windows.txt"

jshell> BufferedReader reader = new BufferedReader(new FileReader(fileName))
reader ==> java.io.BufferedReader@2353b3e6

jshell> List<String> lines = new ArrayList<>()
lines ==> []

jshell> String line = null // for use in the loop below
line ==> null

jshell> while ((line = reader.readLine()) != null) lines.add(line)

jshell> int nLines = lines.size() // save this for later
nLines ==> 22

That's it! Easy! Note that we had to instantiate an ArrayList because List is only an interface and can't be instantiated directly. We can also use the diamond operator <> to save some typing. Other than that, I hope the rest of the code above is more or less straightforward. Now we can access lines of our file by their indices in our lines list.

Count the Number of Non-Whitespace Characters Per Character Column

Next, we want to count the number of non-whitespace characters per character column (as opposed to data columns). A "character column" is a single-character-wide column of the file, while a data column is composed of one or more adjacent character columns. A character column with very few non-whitespace characters is likely to be a delimiter column (separating data columns). I'll explain the code step-by-step here, for clarity.

First, we want to take each line of our file and determine whether a character is a whitespace character or not. Basically, we want to convert our List<String> to a List<List<Boolean>>, where each element of the inner List is true if the character at that position on that line is not a whitespace character. To do that, we first break the String into a char[] array using String.toCharArray(). (To start, I'll use the first line of lines (lines.get(0)) as a placeholder for later, when we'll use a loop.)

jshell> lines.get(0).toCharArray()
$86 ==> char[771] { 'e', 'x', 'e', 'c', ...

At this point, we could convert this char[] to a Stream<Character> by surrounding the above with a CharBuffer.wrap(), then calling chars() on the resulting CharBuffer, using mapToObj() and so on, but there's a much more performant way of achieveing the same thing -- a good, old for loop:

jshell> List<List<Boolean>> charsNonWS = new ArrayList<>() // String line => List<Boolean> line
charsNonWS ==> []

jshell> for (int ll = 0; ll < nLines; ++ll) { // loop over lines read from file
   ...>   charsNonWS.add(new ArrayList<Boolean>()); // add new empty array to List
   ...>   List<Boolean> temp = charsNonWS.get(ll); // save reference to use below
   ...>   for (char ch : lines.get(ll).toCharArray()) // loop over chars in this line
   ...>     temp.add(!Character.isWhitespace(ch)); // true if char is non-whitespace
   ...> }

jshell> charsNonWS
charsNonWS ==> [[true, true, true, true, true, ...

Now, we want to count the number of non-whitespace characters per column, not per row. So we need to "rotate" our data, in a sense. To do this, let's first find the maximum number of character columns per row, then make an array of that length. Here, I use a Stream to save typing out another big for loop:

jshell> int nCharCols = charsNonWS.stream().mapToInt(e -> e.size()).max().orElse(0)
nCharCols ==> 771

charsNonWS.stream() converts charsNonWS from a List<List<Boolean>> to a Stream<List<Boolean>>. In other words, each element of the Stream is one line of the file, where characters have been converted to false/true values based on whether they're whitespace characters or not, respectively. Then, we map each List<Boolean> to a single Integer value with mapToInt(). That value is the length of the line, in number of characters, which we find by mapping each List<Boolean> to its size with mapToInt(e -> e.size()). Finally, we find the maximum value of the Stream (which is now a Stream<Integer>) with max(). max() returns an Optional, so we need to extract that value with a get() or something similar. I opted for an orElse(0), which will return 0 as the maximum line length (in characters) if something went wrong in the Stream.

So, the maximum number of characters that any row in our file has is 771. Now, let's create an int[] and count the number of non-whitespace characters in each of these 771 columns:

jshell> int[] counts = new int[nCharCols]
counts ==> int[771] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... , 0, 0, 0, 0, 0, 0, 0, 0 }

ints are initialised to 0, so we don't need to clear the array before we start working with it. Instead, let's go straight to our count of non-whitespace characters per column.

jshell> for (List<Boolean> row : charsNonWS) // loop over each "row" ("line" / inner List<Boolean>)
   ...>   for (int cc = 0; cc < row.size(); ++cc) // loop over each "column" (char) in that "row" (line)
   ...>     if (row.get(cc)) ++counts[cc]; // if the char is non-whitespace (true), increment column

jshell> counts
counts ==> int[771] { 4, 4, 4, 2, 4, 4, 4, 4, 2, ...

So counts now holds the number of non-whitespace characters in each character column of the text file.

Infer "Empty" Columns

Next, we want to get an overview of the number of non-whitespace characters per column. In other words, do any columns have no non-whitespace characters? Or is there a minimum number? Essentially what we want to do is make a histogram of counts. The easiest way to do this is probably by using another Stream:

jshell> Map<Integer, Long> map = Arrays.stream(counts). // convert int[] to Stream of primitive ints
   ...>   mapToObj(i -> (Integer)i). // convert primitive ints to Integers
   ...>   collect(Collectors.groupingBy( // group the Integers according to...
   ...>     Function.identity(), // their identity (value)
   ...>     Collectors.counting() // and then count the number in each group
   ...>   ))
map ==> {16=10, 0=9, 1=549, 17=113, 18=31, 2=39, 19=7, 3=2, 4=11}

So there are 9 lines with 0 non-whitespace characters, 549 lines with 1 non-whitespace character, and so on. It seems likely that those 9 "empty" character columns delimit the data columns. Let's programmatically extract the minimum number of non-whitespace characters in a given character column from the map above, using that to define "empty" columns:

jshell> int emptyColDef = Collections.min(map.keySet())
emptyColDef ==> 0

This might seem a bit overboard for this application, but in general it's a good idea to automate things like this. It makes your code more robust and reusable for future applications. The above code simply compares the keys of map (the number of non-whitespace characters per character column) and finds the minimum one.

Find Delimiting Columns

Now, we can find the character columns which define (delimit) the extents of the data columns. These are usually the columns with the fewest number of non-whitespace characters (when whitespace characters are used to pad the fixed-width data columns). We want the indices of these character columns, so let's get a Stream of counts and compare those values to our emptyColDef:

jshell> List<Boolean> emptyCols = Arrays.stream(counts). // convert int[] to Stream of primitive ints
   ...>   mapToObj(n -> n == emptyCol). // convert primitive ints to Booleans
   ...>   collect(Collectors.toList()) // collect in a List
emptyCols ==> [false, false, false, ...

Empty (delimiting) columns are those with true values in emptyCols. To find the indices, we simply loop over emptyCols:

jshell> List<Integer> emptyIndices = new ArrayList<>()
emptyIndices ==> []

jshell> for (int cc = 0; cc < nCharCols; ++cc)
   ...>   if (emptyCols.get(cc)) emptyIndices.add(cc)

jshell> emptyIndices
emptyIndices ==> [38, 89, 120, 151, 352, 553, 592, 631, 670]

The for loop above simply looks to see if the value at index cc in emptyCols is true. If it is, it adds that index to emptyIndices, which now holds the character column indices which delimit the data columns in our fixed-width file! The last thing to do is append a 0 to the beginning of the List, because we'll use adjacent values as the "start" and "end" character columns for each data column, and the first data column begins at the 0th character:

jshell> int nDataCols = emptyIndices.size()
nDataCols ==> 9

jshell> emptyIndices.add(0, 0) // add a value 0 at the 0th position in the List

jshell> emptyIndices
emptyIndices ==> [0, 38, 89, 120, 151, 352, 553, 592, 631, 670]

Parsing

Finally, we can use emptyIndices to parse our file. We can split each line at the given character indices, then do a String.trim() to remove leading and/or trailing whitespace. Note that some lines may be shorter than the "standard" line length (holding metadata or something similar) so we need to do a bounds check before we split the String line into substrings:

jshell> List<List<String>> tokens = new ArrayList<>(nLines) // pre-allocate space
tokens ==> []

jshell> for (int ll = 0; ll < nLines; ++ll) { // loop over all lines in file
   ...>   tokens.add(new ArrayList<String>()); // add new List<String> parsed tokens for line
   ...>   List<String> tokensList = tokens.get(ll); // get reference to List to use below
   ...>   String line = lines.get(ll); // get line as String
   ...>   int len = line.length(); // get length of line in characters
   ...>   for (int ii = 1; ii <= nDataCols; ++ii) { // loop over data columns
   ...>     if (len < emptyIndices.get(ii)) break; // check if line is long enough to have next token
   ...>     tokensList.add(line.substring(emptyIndices.get(ii-1), emptyIndices.get(ii)).trim()); // get token
   ...>   }
   ...> }

jshell> tokens
tokens ==> [[execBegan, SampleID, ExperimentID, ...

jshell> tokens.get(7) // for example
$142 ==> [2018-11-04 11:07:16.8570000, 0016M978, test, test, SP -> Gilson, Execution Completed, 2018-11-04 11:07:15.0000000, 2018-11-04 11:09:37.5330000, 2018-11-04 11:07:11.7870000]

Beautiful! Now, we have a List<List<String>> containing (in the outer List) the lines of the file broken up into (in the inner Lists) String tokens, with leading and trailing whitespace trimmed. We inferred the column widths of a fixed-width text file and parsed its contents! As a next step, we could attempt to infer the type of data held in each token, maybe using something like my Typifier, which infers the type of data held within Java Strings.

I hope this walkthrough was helpful and/or interesting! If you have any comments or questions, please let me know in the comments below. I've compiled the code above into a class and posted it to Gist, as well. Happy coding!