If you work on customer-facing products, you have probably run into production issues. For everyone else: production issues are serious bugs that are live in production and affect customers to a great extent.
Every now and then a production issue happens because of an ignored test case, missing test cases, environmental issues, and so on.
Here is what I do when there is a production issue. You may hear about the issue from a teammate, the support team, Twitter, or some other source.
Check the issue and classify it. It usually belongs to one of the categories below.
1) DOS - People complain about connection timeouts.
2) Slow response - Almost all requests, or a certain type of request, are slow. This can lead to the first case.
3) Improper behaviour - The product is not working properly.
4) Login issues - Session-based issues.
5) High error rate - These are the silent issues that only developers know about.
Almost every organisation has some tool for monitoring its servers. If you are a new startup or product that does not use one yet, now is the right time to look into it.
Check the stats in the order below.
DOS and slow requests: check whether all requests are slow or only a particular type.
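If your monitoring tool cannot answer that question directly, a rough check over exported latency samples can. This is only a sketch: the `(request_type, latency_ms)` sample shape, the function name `slow_scope`, and the 2000 ms cutoff are my assumptions, not something a particular tool gives you.

```python
from collections import defaultdict
from statistics import median

SLOW_MS = 2000  # assumed cutoff for calling a request type "slow"


def slow_scope(samples, slow_ms=SLOW_MS):
    """Decide whether slowness is global or limited to some request types.

    samples: iterable of (request_type, latency_ms) pairs.
    Returns ("all", types), ("some", types) or ("none", []).
    """
    by_type = defaultdict(list)
    for request_type, latency_ms in samples:
        by_type[request_type].append(latency_ms)

    # A request type counts as slow when its median latency exceeds the cutoff.
    slow = sorted(t for t, values in by_type.items() if median(values) > slow_ms)

    if not slow:
        return "none", []
    if len(slow) == len(by_type):
        return "all", slow
    return "some", slow
```

A result of "all" points at the app server or a shared dependency, while "some" points at the code path behind those request types.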
If all requests are slow, the problem is in either the current app server or a dependency service such as MySQL, Redis, or Kafka. So check the following on the respective machine.
The checks include CPU (> 80%), free memory (< 10%), active threads/processes (> 90% of allocated), request wait time before processing (should stay under 2 s), GC time (> 10 s, if you use a GC-based language), free disk space (< 5%), and free heap memory (less than a few MB).
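The limits above can be written down as a small table and checked against whatever numbers your monitoring tool reports. The sketch below assumes the metrics arrive as a plain dictionary. The metric names are hypothetical, and I read the 2 s request wait figure as the healthy ceiling, so anything above it is flagged.

```python
# Limits taken from the checklist above. The metric names are hypothetical;
# map them to whatever your monitoring tool exposes.
LIMITS = {
    "cpu_percent": (">", 80),
    "free_memory_percent": ("<", 10),
    "active_threads_percent": (">", 90),
    "request_wait_seconds": (">", 2),
    "gc_time_seconds": (">", 10),
    "free_disk_percent": ("<", 5),
}


def breached(metrics, limits=LIMITS):
    """Return the names of metrics that cross their limit, in LIMITS order."""
    alerts = []
    for name, (op, limit) in limits.items():
        if name not in metrics:
            continue  # metric not reported for this machine
        value = metrics[name]
        if (op == ">" and value > limit) or (op == "<" and value < limit):
            alerts.append(name)
    return alerts
```

Running it once per machine (app server, MySQL, Redis, and so on) gives a short list of what to look at first.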
- If any of the above occurs, take the matching step below.
- High CPU - Wait for some time, then check which threads/processes are taking a long time. It is almost always a code issue; otherwise increase the machine size.
- Free memory - Almost always a code issue. Check the last deployment, or increase the machine size.
- Active threads - Too many active threads means too many requests are stuck because of a downstream or code issue. Check accordingly.
- GC time - Your code is the problem. Check the logs to see where GC occurs. Get a thread dump and go through it.
- Free disk size - Your code wrote something to the machine and did not clean it up. Check folder-wise disk usage on your app server.
- Heap memory - Your code created many objects and did not release them.
- Dependency services - Run the checks above for every service your app depends on, such as MySQL, Redis, etc.
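For the free disk size case, the folder-wise usage check can be done with a few lines of Python when `du` is not at hand. This is a minimal sketch using only the standard library. The function name `folder_sizes` is mine, and it only looks at the immediate subfolders of the path you give it.

```python
import os


def folder_sizes(root):
    """Return (bytes, path) for each immediate subfolder of root, largest first."""
    results = []
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        # os.walk does not follow symlinked directories by default.
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        results.append((total, entry.path))
    return sorted(results, reverse=True)
```

Calling it on the app's working directory usually shows quickly whether logs, temp files, or uploads are eating the disk.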
Most of the above cases can be fixed by reverting your code.
If the above turns up nothing, look through your logs for recent errors. They will give you an overview of what is happening. If you use a log tool, use its dashboard to break errors down by module or by region.
The checks above should reveal everything you need to fix the issue. If not, check with the experts in your team immediately.
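Without a log tool, a quick count of errors per module gives a similar overview. The sketch below assumes a log line shaped like `<date> <time> <LEVEL> <module>: <message>`. That format is an assumption, so adjust the regular expression to your own logs.

```python
import re
from collections import Counter

# Assumed line shape: "2024-01-01 12:00:00 ERROR payments: message".
LINE = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) (?P<module>[\w.]+): ")


def errors_by_module(lines):
    """Count ERROR lines per module, most frequent first."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group("level") == "ERROR":
            counts[match.group("module")] += 1
    return counts.most_common()
```

Feeding it the last few minutes of logs shows which module started failing, which is often enough to narrow the search.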
Check which areas were affected by the issue. If it is customer-facing, immediately inform the affected customers, or at least the severely affected ones. There is a chance that downstream services were affected too. Go through the list below and tick the areas that were affected.
- DB/Cache/Downstream services data corruption
- Data not sent to downstream
- Replication Lag
If it is severe, fix the issue with a quick patch and roll it out to all servers. The patch should be solid and might require approval.
If it is not severe, run the full test suite and fix the root cause.
For downstream-related problems, write a script to correct the data. If the data is completely corrupted, restore from backup, but customer consent is required.
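The first half of such a script is usually finding out what differs. A minimal sketch, assuming both sides can be loaded as dictionaries keyed by record id (the name `find_drift` and the data shape are hypothetical):

```python
def find_drift(source, downstream):
    """Compare source-of-truth records with what the downstream system holds.

    source, downstream: dicts mapping record id -> record value.
    Returns (missing, mismatched): sorted ids absent downstream, and sorted
    ids present downstream with a different value.
    """
    missing = sorted(k for k in source if k not in downstream)
    mismatched = sorted(
        k for k in source if k in downstream and downstream[k] != source[k]
    )
    return missing, mismatched
```

The `missing` ids can be re-sent and the `mismatched` ids overwritten, ideally after a dry run that only prints what would change.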
Write up a neat RCA (root cause analysis) and inform the team. Write test cases for all of those scenarios.
This is the process I follow, and it lets me fix production issues within 10 to 20 minutes.
** If you follow other good practices, let me know. **