
rndmh3ro

Originally published at zufallsheld.de

Docker container won’t start - Getting the final child’s pid from pipe caused “EOF”

Some Docker containers on a client’s Linux machine would randomly fail to start with the error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown.


It didn’t matter which container it was, where it ran, or who started it. Sometimes it happened with GitLab CI jobs, sometimes with manually started containers. It also didn’t happen every time: sometimes everything just worked, other times it failed for hours and I’d ignore it.

The last time it happened, I googled the error once more and again found this huge open issue on Docker’s GitHub issue tracker. This time I read every response in the thread and came upon this answer:

Okay, just found another interesting post on other forum. If you are running an vps which is virtualized with Virtuozzo you hosting provider maybe locked your tasks…

Im using strato and it seems to be that they have limited my server. Under /proc/user_beancounters you can find those settings. The numprocs is set to 700 and my actual held is 661. Starting an bigger docker stock seems to be impossible…

You can find more in this post https://serverfault.com/questions/1017994/docker-compose-oci-runtime-create-failed-pthread-create-failed/1018402

It seems to be there is no bug…

This sounded like my kind of problem as my client is using a Strato server, too!

English isn’t my native language, but I remembered the term “beancounter” from the BOFH, and I really do hope this feature somehow originates there (though I know the term is older than the BOFH).

Looking at the server’s beancounters gives the following output:

root@h2939459:~# cat /proc/user_beancounters 
Version: 2.5
    uid resource held maxheld barrier limit failcnt
2939459: kmemsize 353345536 1649811456 9223372036854775807 9223372036854775807 0
            lockedpages 0 32 9223372036854775807 9223372036854775807 0
            privvmpages 5861268 6833651 9223372036854775807 9223372036854775807 0
            shmpages 2222328 2255102 9223372036854775807 9223372036854775807 0
            dummy 0 0 9223372036854775807 9223372036854775807 0
            numproc 541 541 1100 1100 0
            physpages 1461605 8319910 8388608 8388608 0
            vmguarpages 0 0 9223372036854775807 9223372036854775807 0
            oomguarpages 1525697 8388608 0 0 0
            numtcpsock 0 0 9223372036854775807 9223372036854775807 0
            numflock 0 0 9223372036854775807 9223372036854775807 0
            numpty 1 2 9223372036854775807 9223372036854775807 0
            numsiginfo 16 147 9223372036854775807 9223372036854775807 0
            tcpsndbuf 0 0 9223372036854775807 9223372036854775807 0
            tcprcvbuf 0 0 9223372036854775807 9223372036854775807 0
            othersockbuf 0 0 9223372036854775807 9223372036854775807 0
            dgramrcvbuf 0 0 9223372036854775807 9223372036854775807 0
            numothersock 0 0 9223372036854775807 9223372036854775807 0
            dcachesize 264458240 1497632768 9223372036854775807 9223372036854775807 0
            numfile 4284 7930 9223372036854775807 9223372036854775807 0
            dummy 0 0 9223372036854775807 9223372036854775807 0
            dummy 0 0 9223372036854775807 9223372036854775807 0
            dummy 0 0 9223372036854775807 9223372036854775807 0
            numiptent 1997 2000 2000 2000 74


Take a look at the last line:

uid resource held maxheld barrier limit failcnt
     numiptent 1997 2000 2000 2000 74


This shows that the resource “numiptent” currently holds 1997 “units” out of a limit of 2000. “failcnt” shows the number of refused allocations. Whenever I started another container, this count increased - so this had to be the problem!
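To spot an exhausted beancounter without reading the whole table, a small helper can scan the file for non-zero failcnt values. This is only a sketch assuming an OpenVZ/Virtuozzo guest where /proc/user_beancounters exists; `check_beancounters` is a made-up name, not a standard tool:

```shell
# Print every beancounter whose failcnt (last column) is non-zero.
# Skips the "Version:" line and the column header; the first data row
# carries an extra uid field, so the resource name is at $(NF-5).
check_beancounters() {
  awk 'NR > 2 && $NF + 0 > 0 { print $(NF-5), "failcnt =", $NF }' \
    "${1:-/proc/user_beancounters}"
}
```

On the server above, this prints only the numiptent line, since it is the only resource with failed allocations.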

A quick search revealed that “numiptent” is the number of NETFILTER (IP packet filtering) entries - in simpler terms, the number of iptables rules.

I immediately knew the reason: the client uses fail2ban, which blocks IP addresses that try to brute-force SSH logins on the server. Looking at the fail2ban overview, I noticed that about 1900 IP addresses were blocked - conspicuously close to the beancounter limit of 2000!
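To check whether the iptables rule count really tracks the fail2ban bans, you can count the rules directly. A sketch, assuming root privileges for the live case; `count_rules` is a hypothetical helper, and the dump-file branch only exists so the counting logic can be verified against an iptables-save dump:

```shell
# Count iptables rules: each "-A" line is one rule and thus,
# roughly, one numiptent entry on an OpenVZ/Virtuozzo guest.
count_rules() {
  if [ -n "${1:-}" ]; then
    grep -c '^-A' "$1"           # count rules in an iptables-save dump
  else
    iptables -S | grep -c '^-A'  # live rule count (needs root)
  fi
}
```

Running this while fail2ban keeps banning addresses should show the count creeping toward the barrier of 2000.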

Restarting fail2ban unbanned all these IP addresses, which brought the “numiptent” beancounter back down.

Reason found! Why Strato decided to limit the number of iptables entries, and what to do about it (apart from disabling fail2ban) - I don’t know yet.
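One possible mitigation - untested here and only a sketch - would be to keep fail2ban’s ban table small, for example by shortening the ban time so fewer addresses are banned at once, or by switching to an ipset-based banaction so that bans go into an ipset rather than individual iptables rules. The option names below are standard fail2ban settings, but the values are placeholders, not recommendations:

```ini
; /etc/fail2ban/jail.local - hedged sketch to keep numiptent low
[sshd]
enabled   = true
bantime   = 1h    ; shorter bans mean fewer simultaneous entries
findtime  = 10m
maxretry  = 5
; alternatively, ban via ipset instead of one iptables rule per IP
; (whether ipset entries count against numiptent would need testing):
; banaction = iptables-ipset-proto6
```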

