Discussion on: Surviving the Linux OOM Killer

Phil Ashby • Edited

Thanks Raunak! Interestingly, in 20+ years of developing for Linux systems I've never played with the oom_score_adj feature, not even experimentally, never mind in production :) This path may well end up as a tragedy of the commons, where every process lowers its own score drastically - cf. IP packets have a user-settable priority field, and guess what it is always set to?
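
For anyone else who hasn't played with it, here's roughly what that knob looks like - a minimal sketch (Python, not from the article) of a process nominating itself as a preferred OOM victim by writing to /proc/self/oom_score_adj. Raising the score works as an unprivileged user; lowering it below 0 needs CAP_SYS_RESOURCE (typically root):

```python
# Sketch: adjust this process's oom_score_adj so the kernel prefers to kill
# it first under memory pressure. Valid range is -1000 (never kill) to 1000.
def set_oom_score_adj(score: int, pid: str = "self") -> None:
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(score))

if __name__ == "__main__":
    set_oom_score_adj(500)  # make this process a more attractive OOM target
    print(open("/proc/self/oom_score_adj").read().strip())
```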

I feel that your caveat is worth restating:

"Remember that OOM is a symptom of a bigger problem - low available memory."

I would add that well before the OOM killer does its thing, you should be getting alerts from your monitoring (you have monitoring in production, right?), and the system will likely be swapping madly (you have swap space, right?) - it's like working in treacle, but it buys you time to act!
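
As a rough illustration of what I mean by alerting early (a sketch assuming a modern /proc/meminfo that exposes MemAvailable, with an arbitrary threshold - not anyone's production code):

```python
# Sketch of a low-memory watchdog: parse /proc/meminfo and warn when
# MemAvailable drops below a threshold, long before the OOM killer acts.
def meminfo() -> dict:
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are reported in kB
    return info

def check(threshold_pct: float = 10.0) -> None:
    m = meminfo()
    available_pct = 100.0 * m["MemAvailable"] / m["MemTotal"]
    swap_used_kb = m["SwapTotal"] - m["SwapFree"]
    if available_pct < threshold_pct:
        print(f"ALERT: only {available_pct:.1f}% memory available, "
              f"{swap_used_kb} kB of swap in use")

if __name__ == "__main__":
    check()
```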

Your fixes are good for keeping the show on the road - throw money at it in the form of more hardware / VMs, to buy more time to resolve the design / implementation errors...

I /have/ had to track down and fix numerous memory leaks (usually me being a lazy C coder), poor allocation strategies (looking at you, long-running Python apps!), and poor configuration choices (let's allow 1000 Apache instances!) - e.g. recently resorting to scheduled restarts of the Azure Linux agent (waagent) to stop it eating my small server every 48-72 hours.
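
For the long-running Python case specifically, tracemalloc snapshot diffs are a cheap way to see which call sites keep accumulating allocations - a sketch, with the interval and frame count picked arbitrarily:

```python
import time
import tracemalloc

# Sketch: periodically diff tracemalloc snapshots to spot call sites whose
# allocations keep growing in a long-running Python service.
tracemalloc.start(25)                  # keep 25 frames of traceback per allocation
baseline = tracemalloc.take_snapshot()

def report_growth(top_n: int = 5) -> None:
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.compare_to(baseline, "lineno")[:top_n]:
        print(stat)                    # e.g. "app.py:42: size=3 MiB (+3 MiB), count=..."

while True:
    time.sleep(300)                    # stand-in for the service's real work loop
    report_growth()
```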

May the OOM never strike twice :)

edited to add: Julia (@b0rk) has an excellent series of Linux drawings, including one on memory management: drawings.jvns.ca/

Raunak Ramakrishnan • Edited

Agreed! There is no substitute for good monitoring. It catches many issues before they become bigger problems. Ultimately, we must fix the root cause of high memory usage, which is generally poor design/architecture.

What you said about the tragedy of the commons is exactly what happened with nice values for process priority.