I need some practical advice on this.
The crash can happen at any time. When it does, there are probably many, if not thousands, of operations in flight if it's a popular application.
A simple example would be signing up. The steps include checking the user, creating a record, creating a token, sending a verification mail, and dispatching some events, with those events then doing their own jobs and a bunch of other stuff. The crash can also happen in a more complex scenario that involves multiple insert/update queries.
This is what I could figure out.
Log two states for each operation: a start marker and a completion marker such as
done_op1. When the server boots up again, it can resume where it left off by checking which operations started but never finished.
But writing a log entry for every operation adds overhead and potential latency in applications where that matters. Also, if I use something like Redis (even with persistence) for the state logs, the server crash can affect those logs too, given that few people will run a second server just for this.