Last weekend, I set out to learn Spring Batch by creating a small project. I decided to parse WhatsApp messages into a relational database, do some interesting analysis, and produce reports from the data. I used JPA for database persistence into the two simple tables shown below.
┌──────────────────────────────┐ ┌────────────────────────┐
│ MESSAGES │ │ CONTACTS │
│ │ │ │
│ MESSAGE_ID (PK) ├────►│ CONTACT_ID (PK) │
│ MESSAGE_SENT (TIMESTAMP) │ │ CONTACT_NAME (VARCHAR) │
│ CONTACT_ID (FK) │ └────────────────────────┘
│ MESSAGE_TEXT (VARCHAR) │
│ MESSAGE_TYPE (VARCHAR, ENUM) │
└──────────────────────────────┘
I will not go into the details on how to set up a new Spring Batch project - there are many tutorials out there - but rather focus on a few interesting problems that I encountered. The full code for this project is at sualeh/whatstats if you are interested.
Parsing a Record Over Multiple Lines
Spring Batch is usually used to parse a record over a single line. There is an explanation of a technique on the Spring Batch website to parse files with records on multiple lines, but this does not apply to WhatsApp chat logs. The technique described assumes that each line has an identifiable regular expression pattern, with a footer line. A WhatsApp chat log is a series of messages, each with a timestamp and a message, but the message itself can span multiple lines. For example:
[2/3/22, 08:47:29] Some Chatter: First line of the chat message.
Blank line followed by another paragraph, which is still part of the same message.
There is no pattern that can identify when the message ends. The only way to identify the end of a message is to look for the first line of the next record which matches the pattern "[date, time] contact name: message". The only way to parse such a file is to create a custom item reader. So, borrowing from the technique on the Spring Batch website, I used a delegate flat file reader to read lines in a loop. An internal buffer (StringBuilder) keeps appending the lines until a line matches the pattern identifying the start of a new record. At that point, the buffer is converted to a message item. The internal buffer is reset, and the next line is added to the buffer before the message item is returned. The code looks something like this:
for (String line; (line = this.delegate.read()) != null; ) {
  if (buffer.isEmpty()) {
    // Very first line of the file - start buffering the first record
    buffer.append(line);
  } else if (linePrefix.matcher(line).matches()) {
    // Start of a new record - the buffer holds a complete message
    final Message message = lineMapper.mapLine(buffer.toString(), 0);
    buffer = new StringBuilder();
    buffer.append(line);
    return message;
  } else {
    // Continuation line of the current message
    buffer.append(lineSeparator()).append(line);
  }
}
There is an additional piece of code at the end that takes care of the last message in the buffer when an end of file is reached.
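The linePrefix pattern itself is not shown in the snippet above. A hypothetical version is sketched below; the actual pattern in whatstats may differ, and WhatsApp date and time formats vary by locale, so a real pattern may need adjusting.

```java
import java.util.regex.Pattern;

public class LinePrefixDemo {

  // Hypothetical pattern for the first line of a record:
  // "[date, time] contact name: message"
  static final Pattern LINE_PREFIX =
      Pattern.compile("\\[\\d{1,2}/\\d{1,2}/\\d{2,4}, \\d{1,2}:\\d{2}:\\d{2}\\] [^:]+: .*");

  static boolean startsNewRecord(final String line) {
    return LINE_PREFIX.matcher(line).matches();
  }

  public static void main(final String[] args) {
    System.out.println(startsNewRecord(
        "[2/3/22, 08:47:29] Some Chatter: First line of the chat message."));
    System.out.println(startsNewRecord(
        "Blank line followed by another paragraph, which is still part of the same message."));
  }
}
```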
Denormalizing the Data
Another interesting problem is how to persist messages without creating duplicate contacts, when we read messages one at a time. I needed to look up the database to see if the contact already existed. If the contact is not found, a new contact should be created. This is easily done with Spring Data repositories, but the question is at what point to do it. I did not want to interrupt the flow of the reader and mix the concerns of reading and writing. It turns out that this is a good case for an ItemWriteListener, where we can intercept the write operation and look up or create the contact before the message write is performed. The message has a placeholder contact, which is substituted with the existing contact if one is found in the database.
public void beforeWrite(final List<? extends Message> messages) {
  for (final Message message : messages) {
    final String from = message.getFrom().getName();
    // Look up the contact by name, creating it if it does not exist
    final List<Contact> contacts = contactRepository.findByName(from);
    final Contact contact;
    if (contacts.isEmpty()) {
      contact = new Contact(from);
      contactRepository.save(contact);
    } else {
      contact = contacts.get(0);
    }
    // Replace the placeholder contact with the persisted entity
    message.setFrom(contact);
  }
}
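For the listener to fire, it has to be registered on the step. A minimal sketch of the step configuration, assuming a Spring Batch 4 style StepBuilderFactory inside a @Configuration class (the bean names and chunk size here are illustrative, not necessarily those in whatstats):

```java
@Bean
public Step parseAndLoadStep(
    final MessageItemReader reader,
    final ItemWriter<Message> writer,
    final ItemWriteListener<Message> contactListener) {
  return stepBuilderFactory
      .get("parseAndLoadStep")
      .<Message, Message>chunk(10)
      .reader(reader)
      .writer(writer)
      // beforeWrite() resolves the placeholder contact for each message
      .listener(contactListener)
      .build();
}
```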
Validating Data
I wanted to validate the parse and load step, and make sure that some data was loaded before proceeding to generate reports. I created a StepExecutionListener
class that runs a count query after the step completes. The code looks like this:
public ExitStatus afterStep(final StepExecution stepExecution) {
  // Fail the job if the parse and load step put no messages in the database
  final long numMessages =
      jdbcTemplate.queryForObject(
          """
          SELECT
            COUNT(MESSAGE_ID) AS NUM_MESSAGES
          FROM
            MESSAGES
          """,
          Long.class);
  if (numMessages <= 0) {
    return ExitStatus.FAILED;
  } else {
    return ExitStatus.COMPLETED;
  }
}
Command-line Parameters
I wanted to pass the path to the chat log file in on the command-line. Spring Boot allows parameters to be passed in from the command-line, and Spring Batch allows you to call CommandLineJobRunner from the command-line with job parameters too. However, I wanted a really simple command-line for my application that simply takes the name of the file, so I wrote a custom main method to call CommandLineJobRunner, something like this:
public class WhatStatsApplication {
  public static void main(final String[] args) throws Exception {
    CommandLineJobRunner.main(
        new String[] {
          BatchConfiguration.class.getName(),
          "AnalyzeMessagesJob",
          "zone_offset=-05:00",
          "chat_log=" + args[0]
        });
  }
}
and then I was able to access these parameters using SpEL (the Spring Expression Language) in a step-scoped bean, something like this:
@Bean
@StepScope
public MessageItemReader reader(
    @Value("#{jobParameters['zone_offset']}") final String zoneOffsetString,
    @Value("#{jobParameters['chat_log']}") final String chatLogPathString) {
  // ... return a MessageItemReader;
}
Generating Reports
Reports usually present aggregated data for analysis. Generating them does not need to go item by item, so I used a tasklet instead of a chunk-oriented step.
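As an illustration, a report tasklet might run a single aggregate query and print the results. This sketch counts messages per contact; the class name and query are my own for illustration, not necessarily what whatstats does.

```java
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.jdbc.core.JdbcTemplate;

public class MessagesPerContactTasklet implements Tasklet {

  private final JdbcTemplate jdbcTemplate;

  public MessagesPerContactTasklet(final JdbcTemplate jdbcTemplate) {
    this.jdbcTemplate = jdbcTemplate;
  }

  @Override
  public RepeatStatus execute(
      final StepContribution contribution, final ChunkContext chunkContext) {
    // Aggregate in the database instead of processing item by item
    jdbcTemplate
        .queryForList(
            """
            SELECT
              CONTACTS.CONTACT_NAME,
              COUNT(MESSAGES.MESSAGE_ID) AS NUM_MESSAGES
            FROM
              MESSAGES
              INNER JOIN CONTACTS
                ON MESSAGES.CONTACT_ID = CONTACTS.CONTACT_ID
            GROUP BY
              CONTACTS.CONTACT_NAME
            ORDER BY
              NUM_MESSAGES DESC
            """)
        .forEach(row -> System.out.println(row));
    // A tasklet runs exactly once per step execution
    return RepeatStatus.FINISHED;
  }
}
```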
Conclusion
I had a fun time learning Spring Batch and developing some techniques for my toy project. And in the bargain, I got some interesting insights from chats with my group of friends. There is more to come. In the next evolution of this project, I will attempt to do some natural language parsing of the chat messages.