Previously I wrote about transforming an idea into working code using ChatGPT as an assistant. I took an idea and put ChatGPT through a BDD process and an Example Mapping process.
I was very impressed with the results, but also a bit disappointed by the lack of detail and the lack of determinism. It was not reliable enough to depend on, but I do think it is a useful tool for speeding the process up.
What about the second step - transforming an example scenario into code?
Picking up from Requirements
In the last post, ChatGPT helped with creating BDD-type scenarios for a simple email parser. Now we will try to convert some of those scenarios into tests and code, using a typical TDD process. We will use one failure case and one success case. The cases are not exactly as ChatGPT gave them to us because, as we learnt, it was not very good at getting down to the detail level, so between that step and this one some manual review and refinement of the cases was needed.
[In this post we will look at the failure case. The full article has details of implementing a success case.]
The failure case:
Scenario: Email with no file attachments returns error
Given an email with no image file attachments
When the email is sent to 'The System'
Then the system should return an error code with the message "One image file attachment must be present."
Setup
Configure ChatGPT
To turn ChatGPT into a developer, I gave it the following system command:
You are an expert python software developer. You know how to understand BDD scenarios of the form:
Scenario: scenario title
Given (an initial context)
When (an event occurs)
Then (a testable outcome is satisfied)
You know TDD and how to write tests first using pytest and pytest-bdd. When writing code, you favour functional programming and avoid side-effects.
You have been asked to assist another python software developer to develop tests and code for a small library to parse email content according to the given BDD specifications.
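(As an aside: if you drive ChatGPT through the API rather than the web UI, a system command like this is passed as a system role message. A rough sketch using the openai Python package, where the model name and prompt handling are my assumptions:)

import openai

# The system command shown above, abbreviated here
SYSTEM_PROMPT = """You are an expert python software developer. ..."""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Given the following scenario: ..."},
    ],
)
print(response.choices[0].message.content)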
Test 1: Simple failure case
Writing the test case
I gave ChatGPT the following prompt:
Given the following scenario:
<scenario from above>
Write a pytest-bdd test which satisfies this scenario.
and I received the following code:
import pytest
from email_parser import parse_email


@pytest.mark.parametrize("email_content, expected_output", [
    ("Hello World", "One image file attachment must be present."),
    ("", "One image file attachment must be present."),
    ("image.jpg", "One image file attachment must be present."),
])
def test_parse_email_no_file_attachments(email_content, expected_output):
    # Given an email with no image file attachments
    email = {
        "content": email_content,
        "attachments": []
    }

    # When the email is sent to 'The System'
    output = parse_email(email)

    # Then the system should return an error code with the message "One image file attachment must be present."
    assert output == (False, expected_output)
Not quite the start I was expecting. Some things to note:
- It assumed a name for the module and for the function under test, which is fine. We will see if it remembers this.
- It assumed an input format to the function under test as a single json structure
- It assumed an output from the function under test as a tuple.
- It did not use the pytest-bdd library, as we requested in the system message
- The parameterization of the test is not really necessary. The test really only needs one example of having no attachments.
The second item is interesting, and I like it. The design that initially came into my head was that the function in question would parse the entire email message and handle the extraction of body text and attachments. When I initially saw the output from ChatGPT I was annoyed but then it made me think: this is actually a better solution. Perhaps it was even following my direction to program in a functional way?
I like writing things in a functional way. Typically it is good to write simple functions that do only one thing with no side-effects, which makes them easier to test. In the long run I probably would have refactored the code to do this, but the ChatGPT output made me take that decision earlier, and I changed the design in my head: first parse the email into its sub-parts, then process the parts according to my requirements.
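Roughly, the idea is a small helper that splits the raw message into its parts, with parse_email then working only on those parts. A sketch of that shape (split_email is a hypothetical name, using the standard library's email module):

from email import message_from_string

def split_email(raw_email):
    # Break a raw email message into the body text and a list of attachment metadata
    msg = message_from_string(raw_email)
    content = ""
    attachments = []
    for part in msg.walk():
        filename = part.get_filename()
        if filename:
            attachments.append({"name": filename})
        elif part.get_content_type() == "text/plain":
            content += part.get_payload()
    return {"content": content, "attachments": attachments}

# parse_email then only needs to validate the already-extracted parts:
# result = parse_email(**split_email(raw_message))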
To fix the other issues, I instructed ChatGPT to refine the solution, by asking it to:
- use the pytest-bdd library format, providing it with a scenario file name
- return a json structure instead of a tuple
The updated test function was this:
import pytest
from pytest_bdd import given, when, then, scenario
from email_parser import parse_email


@scenario('../features/email-parsing.feature', 'Email with no file attachments returns error')
def test_parse_email_no_file_attachments():
    pass


@given("an email with no image file attachments")
def email_with_no_file_attachments():
    email = {
        "content": "Hello World",
        "attachments": []
    }
    return email


@when("the email is sent to 'The System'")
def send_email_to_system(email_with_no_file_attachments):
    return parse_email(**email_with_no_file_attachments)


@then("the system should return an error code with the message \"One image file attachment must be present.\"")
def check_error_message(send_email_to_system):
    expected_output = {
        "success": False,
        "error_message": "One image file attachment must be present."
    }
    assert send_email_to_system == expected_output
That looks like a pretty good BDD-style test to me.
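One thing ChatGPT did not generate is the feature file referenced by the @scenario decorator. Based on the scenario at the top of this post, a minimal features/email-parsing.feature would look like this (the feature title is mine):

Feature: Email parsing

  Scenario: Email with no file attachments returns error
    Given an email with no image file attachments
    When the email is sent to 'The System'
    Then the system should return an error code with the message "One image file attachment must be present."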
Implementing Code
When I ran the code, I got the usual errors, as you would expect from real TDD:
- No module
- No function
- No scenario file
I fixed these up and then asked ChatGPT to write the code.
Now write the code which satisfies the test function.
The response was this (line-splitting done by me):
def parse_email(content, attachments):
    # Check if there are any image file attachments
    has_image_attachment = False
    for attachment in attachments:
        if attachment.get("name", "").lower().endswith(".jpg") \
                or attachment.get("name", "").lower().endswith(".jpeg") \
                or attachment.get("name", "").lower().endswith(".png"):
            has_image_attachment = True
            break

    # If no image file attachments are present, return an error message
    if not has_image_attachment:
        return {
            "success": False,
            "error_message": "One image file attachment must be present."
        }

    # If one or more image file attachments are present, return a success message
    return {"success": True, "message": "Email parsed successfully."}
The interesting thing about this first implementation is that, because we had been talking about image attachments, it gave me code which already handled different types of image extensions. This was good, but it goes against the TDD philosophy of writing just enough code to pass the test. I decided to leave it in, however.
Personally I don't find this code readable, and in future the code will likely accept some extensions and not others, so we can ask ChatGPT to refactor it to use the split method and the in operator.
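A refactor along those lines might look something like this (my sketch rather than ChatGPT's output):

IMAGE_EXTENSIONS = {"jpg", "jpeg", "png"}

def parse_email(content, attachments):
    # An attachment counts as an image if its file extension is in the allowed set
    has_image_attachment = any(
        attachment.get("name", "").lower().split(".")[-1] in IMAGE_EXTENSIONS
        for attachment in attachments
    )
    if not has_image_attachment:
        return {
            "success": False,
            "error_message": "One image file attachment must be present."
        }
    return {"success": True, "message": "Email parsed successfully."}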
Running the test
The moment of truth... does this work? I ran pytest and got the following error:
E fixture 'email_with_no_file_attachments' not found
Hmmm... why? It turns out that the test code generated by ChatGPT used the pytest 'fixture' style to write the steps, but implemented it incorrectly. It's possible that this style is too new for ChatGPT, or it just didn't know how to do it, but either way it was something I had to dig into and fix, which took about an hour.
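For anyone hitting the same error: in current pytest-bdd versions a step's return value is only exposed to later steps if you declare it explicitly with target_fixture, so the working steps end up looking roughly like this (my reconstruction of the fix, not necessarily the exact code I used):

@given("an email with no image file attachments", target_fixture="email")
def email_with_no_file_attachments():
    return {"content": "Hello World", "attachments": []}


@when("the email is sent to 'The System'", target_fixture="parse_result")
def send_email_to_system(email):
    return parse_email(**email)


@then('the system should return an error code with the message "One image file attachment must be present."')
def check_error_message(parse_result):
    expected_output = {
        "success": False,
        "error_message": "One image file attachment must be present."
    }
    assert parse_result == expected_output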
After fixing that, hey presto, test passed! The question remaining is: considering the time it took to fix the error, would it have been faster for me to write this myself?
Summary
This was a valuable experiment with many new things learnt about using ChatGPT in this way.
The main benefit I found was that it prompted me to think more about design decisions - it did almost feel like I was 'pair programming'. However I was hoping that I would be able to continue thinking at a high level, but that didn't happen; because some of the output had mistakes or was poorly written (in my opinion), I had to jump between design-level and code-level thinking.
This raised an interesting question though - does this in fact fit into the TDD 'Red-Green-Refactor' process? Maybe it does and I have just been executing the 'refactor' stage a bit too early? I admit I sometimes neglected to wait and see whether the code that was produced passed before changing it, thus failing to follow the 'write the minimum code to turn the test green' rule.
Other associated points are:
- defining the expected inputs/outputs for unit tests is a bit laborious
- Some responses introduced assumptions in the details (e.g. I asked it to use URL-encoded-strings and it used outdated formats which I had to fix)
- The output is often verbose, leading to lengthy, duplicated or redundant code. This was especially true when parameterizing tests.
Some of these may be improved in GPT-4; we will have to see.
What about the more complex success case of the experiment? You can read about that at the original post.