What we did for our microservices was to generate a correlationId (or callId, or requestId...) in the first ms of the call chain (the API Gateway) and pass it along on every resulting request from one ms to another.
Then this id is printed in each stack trace and each log line. You just have to know your call chain, and you can look for the id in the consoles of the different Kubernetes pods.
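To make that concrete, here is a rough sketch of the idea in Python using only the standard library. It is not our actual code: the header name `X-Correlation-Id` and the helpers `accept_request`, `outgoing_headers` and `build_logger` are made up for illustration, and the incoming headers are assumed to be a plain dict.

```python
import contextvars
import logging
import uuid

# Hypothetical header name; any name works as long as every service agrees on it.
CORRELATION_HEADER = "X-Correlation-Id"

# Holds the id of the request currently being handled.
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation id onto every log record."""

    def filter(self, record):
        record.correlation_id = _correlation_id.get()
        return True


def accept_request(headers):
    """Reuse the caller's id, or generate one if we are first in the chain."""
    cid = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def outgoing_headers(extra=None):
    """Headers to attach to every downstream request so the id follows the chain."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = _correlation_id.get()
    return headers


def build_logger(name, stream=None):
    """Logger whose every line (including logged exceptions) carries the id."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(correlation_id)s] %(name)s: %(message)s")
    )
    handler.addFilter(CorrelationIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

The gateway calls `accept_request` with headers that do not contain the id yet, so a fresh one is generated; every other service receives it through `outgoing_headers` and reuses it. After that, finding one call across services is a text search for the id in each pod's log output.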
Once you have identified the faulty ms, you can reproduce the problem with a unit test by mocking the request responsible for the error.
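As an illustration of that last step, a test along these lines replays the downstream response found in the logs. Everything here is hypothetical (`charge_order`, `PaymentError`, the `/payments` path and the 503 reply are invented for the example); the only real API used is `unittest` and `unittest.mock` from the standard library.

```python
import unittest
from unittest import mock


class PaymentError(Exception):
    """Raised when the downstream payment service does not answer with 200."""


def charge_order(order_id, http_post):
    """Hypothetical handler: calls a downstream service through http_post."""
    status, body = http_post("/payments", {"order_id": order_id})
    if status != 200:
        raise PaymentError(f"payment service replied {status}: {body}")
    return body


class ChargeOrderTest(unittest.TestCase):
    def test_reproduces_downstream_failure(self):
        # Replay the response that the logs showed for the faulty request.
        fake_post = mock.Mock(return_value=(503, "upstream timeout"))
        with self.assertRaises(PaymentError):
            charge_order(42, fake_post)
        # The handler should have sent exactly the request we expected.
        fake_post.assert_called_once_with("/payments", {"order_id": 42})
```

Passing the HTTP call in as a parameter is just one way to make it mockable; patching the client with `mock.patch` would work as well.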
I was thinking more along the lines of tooling to visualize and monitor, but yeah, correlationIds are definitely very useful for debugging.