Spending our limited time and resources on the right things is something that has obsessed me for over two decades. Using data and looking for signals has been my go-to mechanism for achieving this, and both are perfect corroborating partners for your instincts and judgement (never ignore those!). One example is how I review the usage of my code repositories to understand their traction and impact, and to help identify future content or demos to work on.
I currently use a simple Lambda function (see the code here) that grabs various metrics of interest - how many times a repo has been viewed and cloned, as well as who has referred any traffic. It uses the standard GitHub API and stores the data in CloudWatch.
{"eventType": "ReferralPath", "repo_name": "094459-amazon-q-git-demo", "path": "/094459/094459-amazon-q-git-demo", "title": "GitHub - 094459/094459-amazon-q-git-demo: A simple Java app that shows how Am...", "count": 3, "uniques": 2}
{"eventType": "View", "repo_name": "094459-amazon-q-git-demo", "count": 6, "uniques": 3}
{"eventType": "Clone", "repo_name": "094459-amazon-q-git-demo", "count": 5, "uniques": 4}
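The Lambda's code is linked rather than reproduced here, but the collection side boils down to calls like the following. This is a sketch, not the actual function: the helper names are mine, and you would substitute your own repo and token. The GitHub traffic endpoints (`/traffic/views`, `/traffic/clones`) are the standard API the post refers to.

```python
import json


def build_event(event_type, repo_name, payload):
    """Shape a GitHub traffic API response into the flat records shown above."""
    return {
        "eventType": event_type,
        "repo_name": repo_name,
        "count": payload["count"],
        "uniques": payload["uniques"],
    }


def fetch_traffic(owner, repo, token, kind="views"):
    """GET /repos/{owner}/{repo}/traffic/views (or /traffic/clones)."""
    import urllib.request  # imported here so build_event stays dependency-free

    url = f"https://api.github.com/repos/{owner}/{repo}/traffic/{kind}"
    req = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Each response is shaped into one flat record per event type, which is what ends up as a line in CloudWatch Logs.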
Why did I do this, I can hear some of you ask. GitHub only retains this data for fourteen days, so if you want any longer-term insight into how your code repos are performing, you need to implement something like this. I have built dashboards on top of it using CloudWatch Logs Insights, which has been a super frugal and simple way to generate reports, and one I have been using for a few years now.
The need to move - extracting more insights from the data
My current solution addresses my primary use case: storing my repo usage data beyond the fourteen-day window. Over the past eighteen months it has worked pretty well, and I really love the power and simplicity that CloudWatch Logs together with CloudWatch Logs Insights provides.
Over time, however, my needs have changed and I find that I want to do more with this data. I wanted to keep the data where it currently is, but also make it available to other systems, applications, and people. I also wanted to open up access so that I could use a broader set of SQL and other data tools, allowing me to explore the data in ways that perhaps I could not in CloudWatch. What I needed was an approach that would take that daily data and make it available via standard SQL, so that I could run queries and build reports against it, perhaps using data visualisation tools like Amazon QuickSight or OpenSearch (or even a simple front-end web application that I get Amazon Q Developer to help me build).
Note! I am currently creating an updated Lambda function that will store the data directly outside of CloudWatch Logs, but I first wanted to experiment with what that might look like. The ability to quickly prototype and experiment with new ideas is something Amazon Q Developer has really helped me with. From different AWS data services, to different approaches to storing the data, to suggested SQL for creating data structures, Amazon Q Developer helps me iterate faster and get to a place that solves my problem in the best way. I find this totally awesome!
This is actually something I had been wanting to do for some time, but it always felt like it would take up a lot of time. Working with and manipulating data is probably the thing I find hardest, and I think this contributed to me always putting it off! What changed things for me was being able to use Amazon Q Developer to close that knowledge gap and give me more confidence in working with this data. In fact, I was able to put together a solution in less than four hours!
So what was the magic this time? There is no magic, although sometimes, if you like Arthur C. Clarke's definition of magic, using Amazon Q Developer can seem like the best kind of magic there is. In this post I am going to share how I started from data structures, and from there used Amazon Q Developer to help me put together code that would let me reshape that data so I could use it in other AWS data services. Whilst this is a very simple example, I hope it gets you thinking about how you can apply Amazon Q Developer to do something similar.
Let's dive right into it.
The approach
In the past, I learned that it is critical to know WHAT questions you want to answer before you start to think about design and technical decisions. In the context of working with coding assistants like Amazon Q Developer this is even more true, as those questions provide useful context for your prompts. In my case, I wanted to generate graphs showing both view and clone data across each repository, identify the top performing repositories, get insights into trending patterns, and see whether any referring sites were consistently driving traffic to those repos.
With that in mind, I turned to my approach. Whilst I have been using CloudWatch to store my logging data for as long as I can remember, I would not say I am up to speed with all the possibilities it provides. So my first port of call was to gather information that would surface options, and then, based on what I wanted to do, narrow those down. In my head at least, I had the idea that I could generate standardised data output (for example, csv or similar) files from the original source data in CloudWatch Logs. Once I had that data, I would ask Amazon Q Developer for recommendations on approaches to consider, following up with suggestions around potential AWS data services.
So with some ideas and some questions, I turn to my trusted companion, Amazon Q Developer.
Exporting CloudWatch log data
The first thing I ask Amazon Q Developer is a simple prompt:
Amazon Q Developer prompt "If I want to export my CloudWatch log data, what are my options?"
I am surprised by some options that I had not considered or known about (try this for yourself and see what you get). Here is my output (summarised):
- Export to Amazon S3
- Export via Subscription filters
- Export to Third-Party Log Management Services
- Export to CloudWatch Logs Insights
As I mentioned before, one of the things that Amazon Q Developer has provided me with is the confidence to experiment more when approaching new capabilities within AWS that I have never used before. I think this is a great example, as three of these are new to me.
I spent around five to ten minutes using Amazon Q Developer to help me with the first one, using the prompt:
Amazon Q Developer prompt "Provide me with a step by step guide on how I can take an existing CloudWatch log group and export it to an s3 bucket".
In no time, I have a new .gz file which, when I open it up, provides me with some very familiar looking data:
2024-07-02T15:14:16.347Z Getting views for building-data-pipelines-apache-airflow
2024-07-02T15:14:16.527Z {'eventType': 'View', 'repo_name': 'building-data-pipelines-apache-airflow', 'count': 27, 'uniques': 8}
2024-07-02T15:14:16.527Z b'{"eventType": "View", "repo_name": "building-data-pipelines-apache-airflow", "count": 27, "uniques": 8}'
2024-07-02T15:14:16.527Z Getting clones for building-data-pipelines-apache-airflow
2024-07-02T15:14:16.699Z b'{"eventType": "Clone", "repo_name": "building-data-pipelines-apache-airflow", "count": 2, "uniques": 2}'
2024-07-02T15:14:16.699Z Getting referral data for cdk-mwaa-redshift
2024-07-02T15:14:16.883Z {'eventType': 'Referral', 'repo_name': 'cdk-mwaa-redshift', 'referrer': 'Google', 'count': 20, 'uniques': 2}
2024-07-02T15:14:16.883Z b'{"eventType": "Referral", "repo_name": "cdk-mwaa-redshift", "referrer": "Google", "count": 20, "uniques": 2}'
2024-07-02T15:14:16.883Z {'eventType': 'Referral', 'repo_name': 'cdk-mwaa-redshift', 'referrer': 'blog.beachgeek.co.uk', 'count': 3, 'uniques': 2}
Experimentation sometimes leads to unlikely insights, and one of the things I noticed when reviewing the output was that the code writing data to CloudWatch Logs was duplicating data - you can see it in the second and third lines of the example above. This meant that one of the things I would have to do is clean up the raw data before I could use it (and yes, figure out in the source code why it was doing this too - but one thing at a time!).
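As an aside, the step-by-step guide Amazon Q Developer produced maps onto a single API call, CreateExportTask. Here is a boto3 sketch of that step, assuming a destination bucket (the name is hypothetical) whose bucket policy allows CloudWatch Logs in your region to write to it:

```python
from datetime import datetime, timezone


def ms(dt):
    """create_export_task takes epoch-millisecond timestamps."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)


def export_log_group(log_group, bucket, prefix, start, end, region="eu-central-1"):
    """Kick off an asynchronous export of a log group to S3."""
    import boto3  # imported here so ms() stays dependency-free

    logs = boto3.client("logs", region_name=region)
    task = logs.create_export_task(
        logGroupName=log_group,
        fromTime=ms(start),
        to=ms(end),
        destination=bucket,
        destinationPrefix=prefix,
    )
    return task["taskId"]
```

Calling `export_log_group('/aws/lambda/github-traffic-cron', 'my-export-bucket', 'cloudwatch-exports', datetime(2024, 7, 1), datetime(2024, 7, 2))` produces the .gz objects under the given prefix once the task completes.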
One of the things I had not considered was that I could use my existing CloudWatch Logs Insights queries via the command line. Amazon Q Developer provided me with some helpful examples, and a few minutes later I was up and running.
LOG_GROUP_NAME="/aws/lambda/github-traffic-cron"
START_TIME=$(date -d "yesterday 00:00" +%s)000 # Linux
# START_TIME=$(date -j -f "%Y-%m-%d %H:%M:%S" "$(date -v-1d +"%Y-%m-%d 00:00:00")" +%s)000 # macOS - use instead of the line above
END_TIME=$(date +%s)000 # Current time
QUERY_STRING="fields @timestamp, @message | filter @message like /b'/ | filter eventType = 'Clone'"

aws logs start-query \
  --log-group-name "$LOG_GROUP_NAME" \
  --region eu-central-1 \
  --start-time "$START_TIME" \
  --end-time "$END_TIME" \
  --query-string "$QUERY_STRING" \
  --output text \
  --query 'queryId' \
  > query_id.txt
cat query_id.txt
9b6cb741-17ca-4387-8ac1-de65822ac52b
When I then run the following, again provided by Amazon Q Developer:
QUERY_ID=9b6cb741-17ca-4387-8ac1-de65822ac52b
aws logs get-query-results \
  --query-id "$QUERY_ID" \
  --region eu-central-1 \
  --cli-binary-format raw-in-base64-out \
  --output text | sed 's/\x1e/,/g' > logs.csv
Which provided me with something that looked familiar:
RESULTS @ptr Cn0KQAosNzA0NTMzMDY2Mzc0Oi9hd3MvbGFtYmRhL2dpdGh1Yi10cmFmZmljLWNyb24QAiIOCICDztaHMhCI7pWZiDISNRoYAgZiLvg6AAAAAbEoDuIABmhrviAAAAaiIAEorse88ocyMLSNvfKHMjhkQO14SK4pUMYhGAAgARA5GAE=
RESULTS @timestamp 2024-07-04 15:13:58.270
RESULTS @message b'{"eventType": "Clone", "repo_name": "ada-python-demo-app", "count": 3, "uniques": 3}'
RESULTS @ptr Cn0KQAosNzA0NTMzMDY2Mzc0Oi9hd3MvbGFtYmRhL2dpdGh1Yi10cmFmZmljLWNyb24QAiIOCICDztaHMhCI7pWZiDISNRoYAgZiLvg6AAAAAbEoDuIABmhrviAAAAaiIAEorse88ocyMLSNvfKHMjhkQO14SK4pUMYhGAAgARAwGAE=
RESULTS @timestamp 2024-07-04 15:13:57.402
RESULTS @message b'{"eventType": "Clone", "repo_name": "active-directory-on-aws-cdk", "count": 1, "uniques": 1}'
RESULTS @ptr Cn0KQAosNzA0NTMzMDY2Mzc0Oi9hd3MvbGFtYmRhL2dpdGh1Yi10cmFmZmljLWNyb24QAiIOCICDztaHMhCI7pWZiDISNRoYAgZiLvg6AAAAAbEoDuIABmhrviAAAAaiIAEorse88ocyMLSNvfKHMjhkQO14SK4pUMYhGAAgARAhGAE=
RESULTS @timestamp 2024-07-04 15:13:56.645
RESULTS @message b'{"eventType": "Clone", "repo_name": "094459-amazon-q-git-demo", "count": 4, "uniques": 3}'
RESULTS @ptr Cn0KQAosNzA0NTMzMDY2Mzc0Oi9hd3MvbGFtYmRhL2dpdGh1Yi10cmFmZmljLWNyb24QAiIOCICDztaHMhCI7pWZiDISNRoYAgZiLvg6AAAAAbEoDuIABmhrviAAAAaiIAEorse88ocyMLSNvfKHMjhkQO14SK4pUMYhGAAgARAOGAE=
STATISTICS 114774.0 32.0 706.0
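The same start/poll/fetch flow can be scripted with boto3 rather than the CLI. A sketch (the parsing helper and function names are mine, not from the post):

```python
import time


def rows_to_dicts(results):
    """Turn Logs Insights result rows ([{field, value}, ...] per row) into plain dicts."""
    return [{col["field"]: col["value"] for col in row} for row in results]


def run_insights_query(log_group, query, start, end, region="eu-central-1"):
    """Start a Logs Insights query and poll until it finishes."""
    import boto3  # imported here so rows_to_dicts stays dependency-free

    logs = boto3.client("logs", region_name=region)
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start,  # epoch seconds
        endTime=end,
        queryString=query,
    )["queryId"]
    while True:
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return rows_to_dicts(resp["results"])
        time.sleep(1)
```

Each returned row then has `@timestamp` and `@message` keys, which is a much friendlier shape than the tab-separated RESULTS output above.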
I also spent some time looking at subscription filters in CloudWatch, again using Amazon Q Developer to help answer questions about this functionality. I think it might be a useful part of a solution, specifically to help clean up the data. Something to think about - but it got me thinking that I perhaps needed to tweak my prompt. Whilst these approaches provided some good alternatives, I wanted something that allowed me more programmatic control. I ask a slightly different prompt:
Amazon Q Developer prompt "If I want to export my CloudWatch log data, what are my options? I want to do this programmatically, and run this on a daily schedule."
I then get slightly different responses:
- Use AWS Lambda with CloudWatch Events
- Use the AWS CLI to create an export task
- Use AWS Step Functions
- Use AWS Batch
I can also ask additional questions to understand trade-offs, or ask Amazon Q Developer to help me prioritise based on specific requirements (cost, for example, or a specific AWS region I might be interested in). This is the prompt I end up with:
Amazon Q Developer prompt "If I wanted to export my CloudWatch log data, what are my options? I want to do this programmatically, and run this on a daily schedule. Can you prioritise this list for 1/simplicity, 2/cost, 3/availability in the eu-central-1 region. I am a Python developer, with some bash scripting knowledge."
The steer I get is clear: "Based on your preference for simplicity, cost-effectiveness, and availability in the eu-central-1 region, as well as your Python and bash scripting knowledge, using the AWS CLI or AWS Lambda with CloudWatch Events would be the most suitable options for exporting your CloudWatch log data programmatically on a daily schedule."
So at the end of this stage, I have used Amazon Q Developer to get broader insight into potential approaches, to quickly experiment with and validate some of those ideas, and then, with some follow-up questions, to refine them into an approach that makes sense - let's build a new Lambda function!
Building the CloudWatch data export function with Amazon Q Developer
I had been thinking that I might end up writing a Lambda function to do much of the work, so I was happy when Amazon Q Developer confirmed this was a good option. The next stage was to get Amazon Q Developer to help me write the code.
I start off with the following prompt:
Amazon Q Developer prompt "I need to create a Python script that will process every CloudWatch log event. I want to display all events, but please skip/ignore any events that do not have "eventType" in the Message."
This provides me with my skeleton code, which initially does not do much other than dump all the GitHub data I have been recording in CloudWatch.
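The skeleton itself is not reproduced in the post, but given that prompt it likely resembled something like this (a sketch, not the actual generated code; the log group name is taken from the shell example earlier):

```python
def has_event_type(message):
    """Skip/ignore any event that does not mention eventType."""
    return "eventType" in message


def dump_events(log_group="/aws/lambda/github-traffic-cron"):
    """Print every matching event from the log group."""
    import boto3  # imported here so has_event_type stays dependency-free

    logs = boto3.client("logs")
    paginator = logs.get_paginator("filter_log_events")
    for page in paginator.paginate(logGroupName=log_group):
        for event in page["events"]:
            if has_event_type(event["message"]):
                print(event["timestamp"], event["message"])
```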
Reviewing this data was important, as it reminded me that I needed to clean it up. There was duplicated data - something I had been manually fixing in my CloudWatch Logs Insights queries, but which I would now need to address. Reviewing the data, it was clear I only needed to capture CloudWatch log entries that met specific criteria, so I ask Amazon Q Developer to help me adjust the code.
Amazon Q Developer prompt "How do I select only those entries that have the b' prefix from all event messages"
And it provides me with rewritten code:
# Check if the log event message contains the string "eventType" and starts with b'
if "eventType" in event['message'] and event['message'].startswith("b'"):
When I re-run the script, I now have de-duped GitHub repo data.
1720019679366 {'eventType': 'View', 'repo_name': 'ragna', 'count': 1, 'uniques': 1}
1720019680304 {'eventType': 'Clone', 'repo_name': 'ragna-bedrock', 'count': 5, 'uniques': 4}
1720019683064 {'eventType': 'Clone', 'repo_name': 'sigstore-demo', 'count': 1, 'uniques': 1}
The next thing I want to do is make the timestamp more human readable, as not everyone knows epoch time! We can ask Amazon Q Developer to help by asking:
Amazon Q Developer prompt Can you update the script so that the Timestamp value is converted to something more readable
And when I re-run the script, I now get the date and time in a format that is much easier to understand.
2024-07-03 16:14:39 {'eventType': 'View', 'repo_name': 'ragna', 'count': '1', 'uniques': '1'}
2024-07-03 16:14:40 {'eventType': 'Clone', 'repo_name': 'ragna-bedrock', 'count': '5', 'uniques': '4'}
2024-07-03 16:14:43 {'eventType': 'Clone', 'repo_name': 'sigstore-demo', 'count': '1', 'uniques': '1'}
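The conversion itself is a one-liner over the epoch-millisecond timestamps CloudWatch uses. Here it is in isolation, pinned to UTC so the result is deterministic (the script itself uses the local timezone, which is presumably why the output above reads 16:14 rather than 15:14):

```python
from datetime import datetime, timezone


def readable(ms_epoch, tz=timezone.utc):
    """Render an epoch-millisecond CloudWatch timestamp as YYYY-MM-DD HH:MM:SS."""
    return datetime.fromtimestamp(ms_epoch / 1000.0, tz=tz).strftime("%Y-%m-%d %H:%M:%S")


print(readable(1720019679366))  # 2024-07-03 15:14:39
```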
The next thing I want is for this data to be flattened, as I eventually want to export it as a csv file and make it more useful in other tools.
I ask Amazon Q Developer with the following prompt:
Amazon Q Developer prompt "How can I update the code so that the output Timestamp: 2024-07-01 16:14:35, Message: {'eventType': 'View', 'repo_name': 'ragna', 'count': 1, 'uniques': 1} is flattened?"
Amazon Q Developer provided additional code this time. It created a new function (called flatten_json) and then amended the code to use it.
for event in response['events']:
    # Check if the log event message contains the string "eventType" and starts with b'
    if "eventType" in event['message'] and event['message'].startswith("b'"):
        timestamp = datetime.fromtimestamp(event['timestamp'] / 1000.0)
        readable_timestamp = timestamp.strftime('%Y-%m-%d %H:%M:%S')
        clean_message = event['message'].replace("b'", '').replace("'", '')
        message_data = json.loads(clean_message)
        flattened_message = flatten_json(message_data)
The final part of this script was to generate the csv file, so I ask Amazon Q Developer:
Amazon Q Developer prompt "update the code so that it outputs to a csv file, and the output is comma separated. Make sure that Clones and Views are exported to one csv file, and Referral is exported to another csv file. Use the date as a prefix for the files."
Amazon Q Developer provides more code that completes this more complex task.
# Open two CSV files for writing
timestamp = datetime.now().strftime('%Y%m%d')
view_clone_file = open(f"{timestamp}_repo_activity.csv", "w", newline='')
other_events_file = open(f"{timestamp}_repo_referal.csv", "w", newline='')
view_clone_writer = csv.writer(view_clone_file)
other_events_writer = csv.writer(other_events_file)

# Process the log events
for event in response['events']:
    # Check if the log event message contains the string "eventType" and starts with b'
    if "eventType" in event['message'] and event['message'].startswith("b'"):
        timestamp = datetime.fromtimestamp(event['timestamp'] / 1000.0)
        readable_timestamp = timestamp.strftime('%Y-%m-%d %H:%M:%S')

        # Parse the JSON message and flatten it
        try:
            clean_message = event['message'].replace("b'", '').replace("'", '')
            message_data = json.loads(clean_message)
            flattened_message = flatten_json(message_data)

            # Write the event to the appropriate file based on the eventType
            if message_data['eventType'] in ['View', 'Clone']:
                csv_row = [readable_timestamp] + [f"{key}={value}" for key, value in flattened_message.items()]
                view_clone_writer.writerow(csv_row)
            else:
                csv_row = [readable_timestamp] + [f"{key}={value}" for key, value in flattened_message.items()]
                other_events_writer.writerow(csv_row)
        except json.JSONDecodeError:
            # Print the original message if it's not a valid JSON
            print(f"Timestamp: {readable_timestamp}, Message: {event['message']} (Not a valid JSON)")

# Close the CSV files
view_clone_file.close()
other_events_file.close()
When I run the updated script, I am delighted that I now have two files, {date}_repo_activity.csv and {date}_repo_referal.csv. When I look at the file contents, it looks pretty awesome!
2024-07-10 16:14:35,eventType=View,repo_name=ragna,count=1,uniques=1
2024-07-10 16:14:38,eventType=Clone,repo_name=robotics-resources,count=1,uniques=1
2024-07-10 16:14:39,eventType=Clone,repo_name=til,count=1,uniques=1
Thinking ahead, I realise that this data structure might be harder to work with, so I want to simplify it by removing the key from the following code:
csv_row = [readable_timestamp] + [f"{key}={value}" for key, value in flattened_message.items()]
to the following:
csv_row = [readable_timestamp] + [f"{value}" for key, value in flattened_message.items()]
When I rerun the script now, the output I get is cleaner.
2024-07-10 16:14:35,View,ragna,1,1
2024-07-10 16:14:38,Clone,robotics-resources,1,1
2024-07-10 16:14:39,Clone,til,1,1
I realise I want to add one last feature to this script: the ability to upload the files to an S3 bucket. I ask Amazon Q Developer:
Amazon Q Developer prompt "Update the script so that it looks for an environment variable called S3_TARGET and if found, it copies the csv files to this bucket"
Amazon Q Developer does not disappoint. After a short while, I get some additional code:
# (assumes "import os" and an s3_client = boto3.client('s3') earlier in the script)
s3_target_bucket = os.environ.get('S3_TARGET')
if s3_target_bucket:
    timestamp = datetime.now().strftime('%Y%m%d')
    view_clone_file_key = f"logs/activity/{timestamp}_repo_activity.csv"
    other_events_file_key = f"logs/referal/{timestamp}_repo_referal.csv"
    with open(f"{timestamp}_repo_activity.csv", "rb") as f:
        s3_client.upload_fileobj(f, s3_target_bucket, view_clone_file_key)
    with open(f"{timestamp}_repo_referal.csv", "rb") as f:
        s3_client.upload_fileobj(f, s3_target_bucket, other_events_file_key)
else:
    print("S3_TARGET environment variable is not set. Skipping file upload.")
I try this with and without setting the S3_TARGET environment variable, and confirm that I now have my csv files in my S3 bucket.
Now that this works, I ask Amazon Q Developer how to alter this script so that I can deploy it as a Lambda function, and it provides me with code that I am able to deploy.
Amazon Q Developer prompt "Convert this script so that it can run as a Lambda function"
Here is the completed code.
import boto3
from datetime import datetime
import json
import csv
import os
import io

def lambda_handler(event, context):
    def flatten_json(data, parent_key='', separator='.'):
        """
        Flatten a nested JSON data structure.
        """
        items = []
        for key, value in data.items():
            new_key = parent_key + separator + key if parent_key else key
            if isinstance(value, dict):
                items.extend(flatten_json(value, new_key, separator).items())
            elif isinstance(value, list):
                for idx, item in enumerate(value):
                    if isinstance(item, dict):
                        items.extend(flatten_json(item, new_key + separator + str(idx), separator).items())
                    else:
                        items.append((new_key + separator + str(idx), str(item)))
            else:
                items.append((new_key, str(value)))
        return dict(items)

    session = boto3.Session()
    logs_client = session.client('logs')
    s3_client = session.client('s3')

    # Specify the log group
    log_group_name = '/aws/lambda/github-traffic-cron'

    # Find the latest log stream in the log group
    response = logs_client.describe_log_streams(
        logGroupName=log_group_name,
        orderBy='LastEventTime',
        descending=True,
        limit=1
    )
    latest_log_stream_name = response['logStreams'][0]['logStreamName']
    response = logs_client.filter_log_events(
        logGroupName=log_group_name,
        logStreamNames=[latest_log_stream_name]
    )

    # Create in-memory CSV files
    view_clone_file = io.StringIO()
    other_events_file = io.StringIO()
    view_clone_writer = csv.writer(view_clone_file)
    other_events_writer = csv.writer(other_events_file)

    # Process the log events
    for event in response['events']:
        # Check if the log event message contains the string "eventType" and starts with b'
        if "eventType" in event['message'] and event['message'].startswith("b'"):
            timestamp = datetime.fromtimestamp(event['timestamp'] / 1000.0)
            readable_timestamp = timestamp.strftime('%Y-%m-%d %H:%M:%S')

            # Parse the JSON message and flatten it
            try:
                clean_message = event['message'].replace("b'", '').replace("'", '')
                message_data = json.loads(clean_message)
                flattened_message = flatten_json(message_data)

                # Write the event to the appropriate file based on the eventType
                if message_data['eventType'] in ['View', 'Clone']:
                    csv_row = [readable_timestamp] + [f"{value}" for key, value in flattened_message.items()]
                    view_clone_writer.writerow(csv_row)
                else:
                    csv_row = [readable_timestamp] + [f"{value}" for key, value in flattened_message.items()]
                    other_events_writer.writerow(csv_row)
            except json.JSONDecodeError:
                # Print the original message if it's not a valid JSON
                print(f"Timestamp: {readable_timestamp}, Message: {event['message']} (Not a valid JSON)")

    # Get the CSV file contents as strings
    view_clone_file_contents = view_clone_file.getvalue()
    other_events_file_contents = other_events_file.getvalue()

    # Upload the CSV files to the S3 bucket
    s3_target_bucket = os.environ.get('S3_TARGET')
    if s3_target_bucket:
        timestamp = datetime.now().strftime('%Y%m%d')
        view_clone_file_key = f"logs/activity/{timestamp}_repo_activity.csv"
        other_events_file_key = f"logs/referal/{timestamp}_repo_referals.csv"
        s3_client.put_object(Body=view_clone_file_contents.encode('utf-8'), Bucket=s3_target_bucket, Key=view_clone_file_key)
        s3_client.put_object(Body=other_events_file_contents.encode('utf-8'), Bucket=s3_target_bucket, Key=other_events_file_key)
    else:
        print("S3_TARGET environment variable is not set. Skipping file upload.")

    return {
        'statusCode': 200,
        'body': 'CSV files processed and uploaded to S3 bucket.'
    }
The only things I need to do when configuring the Lambda function are to 1/ ensure the timeout is set to around 30 seconds, 2/ set an environment variable for S3_TARGET, and 3/ make sure the Lambda function execution role has permissions to both read CloudWatch Logs and write to the S3 bucket.
I schedule this script to run once a day at 11am, and when I review my S3 bucket later, I can see that the script has been scheduled and executed as expected.
stree 094459-oss-projects
094459-oss-projects
└── logs
├── activity
│ └── 20240715_repo_activity.csv
└── referal
└── 20240715_repo_referals.csv
Note! If you are not familiar with stree, it is a really cool open source project that I featured in #189 of my open source newsletter, and a tool I use almost every day - it's awesome.
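The daily 11am schedule itself can be wired up with an EventBridge rule pointing at the function. A sketch with boto3 (rule name and function ARN are placeholders; the function additionally needs a resource policy allowing events.amazonaws.com to invoke it):

```python
def daily_cron(hour, minute=0):
    """EventBridge cron expression for a once-a-day schedule (times are UTC)."""
    return f"cron({minute} {hour} * * ? *)"


def schedule_daily(function_arn, rule_name="daily-cloudwatch-export", hour=11):
    """Create (or update) a daily rule and point it at the Lambda function."""
    import boto3  # imported here so daily_cron stays dependency-free

    events = boto3.client("events")
    events.put_rule(Name=rule_name, ScheduleExpression=daily_cron(hour))
    events.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": function_arn}])
```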
Extracting insights from my GitHub data
Now that sanitised, controlled data is being uploaded to a defined S3 bucket on a daily basis, I can begin the next step: using AWS analytics and data services to help me extract insights from that data.
First of all I want to see what Amazon Q Developer might suggest, with the following prompt:
Amazon Q Developer prompt "I have an S3 bucket that contains daily uploads of this csv data. Which AWS services should I consider to be able to query and get insights from this data."
I kind of knew what to expect, as in my head I had already decided that Amazon Athena was the way to go, and sure enough Amazon Q Developer confirms my choice. The guidance you get is determined by the context and additional information you provide. With the above prompt, I got some additional suggestions that did not make much sense, so I tweaked my prompt as follows:
Amazon Q Developer prompt "I have an S3 bucket that contains daily uploads of this csv data. Which AWS services should I consider to be able to query and get insights from this data. Provide recommendations that are simple to implement, cost effective, and serverless. The amount of processed data is very low, with a few thousand records being ingested daily. I need the ability to run SQL queries against the data, in an ad-hoc as well as planned way. I also need to make this data available to other AWS services"
This provided a much shorter list, with Amazon Athena as the top choice.
I have used Amazon Athena many times as part of various demos and presentations. I do not use it every day though, so I am a little rusty and need some time to get back into the groove. Luckily, Amazon Q Developer is the perfect guide here, making it easy to remind me of what I need to do and get me up and running in no time.
I am going to ask Amazon Q Developer for a quick refresher on how to set up Amazon Athena against my data. The first thing I do is open one of the sample data files in VSCode (in my case, repo_activity.csv), and then ask the following prompt.
Amazon Q Developer prompt "Using the data structure in the repo_activity.csv, provide step by step instructions on how I can use Amazon Athena to run queries on this data that will allow me to sort by repo_name, identify the repos with the highest count, and provide summarised views of total number of clones by a given date. The data files are stored in an S3 bucket called 094459-oss-projects"
The output was good; I just needed to adjust the S3 bucket details.
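The post does not reproduce the generated DDL verbatim, but based on the CSV layout and the column names used in the queries that follow, it would have looked something like this sketch (types and location are assumptions to adjust for your own setup):

```sql
CREATE EXTERNAL TABLE repo_activity (
    event_time string,
    event_type string,
    repo_name  string,
    count      int,
    uniques    int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://094459-oss-projects/logs/activity/';
```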
When we now run that in Amazon Athena, we can see that the query runs ok.
When we preview the data, we can see that our GitHub data is now available for us to run queries against.
The original output from the prompt also provided me with some sample queries to help me look at the data. The first one was to provide a list of repos by name, which looked very much like the preview data, so not that interesting. It then provided a query to identify repos with the highest count.
SELECT repo_name, SUM(count) AS total_count
FROM repo_activity
GROUP BY repo_name
ORDER BY total_count DESC;
Which generates the expected output
The next query was to find out the total clones by date, and the suggested query
SELECT date(event_time) AS event_date, SUM(count) AS total_clones
FROM repo_activity
WHERE event_type = 'Clone'
GROUP BY event_date
ORDER BY event_date;
generated an error "COLUMN_NOT_FOUND: line 4:10: Column 'event_date' cannot be resolved or requester is not authorized to access requested resources"
No problem, I can use Amazon Q Developer to help me figure out how to fix this. This is the prompt I use:
Amazon Q Developer prompt "When I run the summarise total clones by date query, it generates this error "COLUMN_NOT_FOUND: line 4:10: Column 'event_date' cannot be resolved or requester is not authorized to access requested resources This query ran against the "default" database, unless qualified by the query. " How do I fix this"
It suggests a new query:
SELECT event_date, SUM(total_clones) AS total_clones
FROM (
SELECT DATE(event_time) AS event_date, count AS total_clones
FROM repo_activity
WHERE event_type = 'Clone'
)
GROUP BY event_date
ORDER BY event_date;
This also fails, but with a different error. Again I turn to Amazon Q Developer for help. This is the prompt I use:
Amazon Q Developer prompt "This generates a different error "INVALID_CAST_ARGUMENT: Value cannot be cast to date: 2024-07-15 15:13:56"
This time Amazon Q Developer provides a more detailed response, together with an updated suggestion. In fact, it provides a couple of suggestions, as it has quite rightly determined that the date format is in fact date and time. The new query:
SELECT CAST(PARSE_DATETIME(event_time, 'yyyy-MM-dd HH:mm:ss') AS DATE) AS event_date, SUM(count) AS total_clones
FROM repo_activity
WHERE event_type = 'Clone'
GROUP BY CAST(PARSE_DATETIME(event_time, 'yyyy-MM-dd HH:mm:ss') AS DATE)
ORDER BY CAST(PARSE_DATETIME(event_time, 'yyyy-MM-dd HH:mm:ss') AS DATE);
runs perfectly, and provides me with exactly what I wanted to know.
As you can see, the output from Amazon Q Developer is not always perfect, but by combining the initial output with oversight of what you are trying to do, and using the interactive nature of the chat interface, you can quickly resolve the errors you come across.
We are nearly finished, but there is one more thing I need to do: ask Amazon Q Developer what happens, or what I need to do, as I add more data to the S3 bucket (or rather, when the scheduled Lambda function does that).
Amazon Q Developer prompt "When more data gets added to my S3 bucket, how do I make sure that Athena will use all the new data that has been added"
Amazon Q Developer provides me with a nice summary of options. The one I need, however, is:
MSCK REPAIR TABLE repo_activity;
I wait until the next day, so that there is more data in the S3 bucket, before running this query. It takes a few seconds to run, but when I then re-run the queries from before, I can see that I now have two days' worth of data.
You might be thinking: how do you automate that? (Or was it just me!) To find out my options, let's ask Amazon Q Developer:
Amazon Q Developer prompt "What is the best way of running MSCK REPAIR on a daily basis so that the indexes are always up to date"
And it provides a couple of options; the one I like most is running a daily Lambda function, and it provides some code you can use.
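A minimal sketch of what that daily function might look like, assuming an Athena results location in the same bucket (the path and database name are placeholders, not from the post):

```python
def repair_query(table):
    """The statement Athena needs to pick up newly added data."""
    return f"MSCK REPAIR TABLE {table};"


def lambda_handler(event, context):
    import boto3  # imported here so repair_query stays dependency-free

    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=repair_query("repo_activity"),
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://094459-oss-projects/athena-results/"},
    )
    return {"statusCode": 200}
```

Scheduled with the same kind of daily EventBridge rule as the export function, this keeps the table in sync with the bucket.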
I think that is pretty much where I want to leave this. I have only spent a couple of hours on it, across a couple of days. Time very well spent, I think.
What next
This has been a fun experiment to see how Amazon Q Developer could help me in an area that I have a good understanding and grasp of, but where I needed some hand-holding in the details and implementation. I certainly feel that using Amazon Q Developer in my data-related tasks is going to be a major productivity boost, and I think it is also going to help me explore and experiment more.
Stay tuned for further adventures of data with Amazon Q Developer. Who knew working with data could be so much fun!
If you want to learn more, check out the Amazon Q Developer Dev Centre which contains lots of resources to help get you started. You can also get more Amazon Q Developer related content by checking out the dedicated Amazon Q space, and if you are looking for more hands on content, take a look at my interactive tutorial that helps you get started with Amazon Q Developer.