When using Doris, the BE (backend) process may go down. Many users don't know how to find out why the BE went down, or how to handle the situation. This article explains how to troubleshoot a BE that is down.
Among the users we have worked with, there are two main causes of a BE going down: 1. the OOM killer killed the BE process; 2. the BE core dumped.
If you find that a BE is down, first open the log/be.out file. If there is log information in it, you are in the second case; go to the BE Core Dump section below. If there is no log information in it, the cause is most likely OOM.
1. The first and most common case: OOM
For the OOM case, we can check the system log with dmesg -T to confirm whether the BE process was killed because of an out-of-memory condition.
# dmesg -T | less
You should see an Out of memory log entry for doris_be; the anon-rss value is the actual memory the BE process occupied when it was killed. Also check that the timestamp of the entry matches the time the BE went down.
For example:

[Fri Sep 2 22:58:19 2022] Out of memory: Killed process 1332190 (doris_be) total-vm:21484964304kB, anon-rss:91048156kB, file-rss:1484kB, shmem-rss:0kB, UID:0 pgtables:225764kB oom_score_adj:0
[Fri Sep 2 22:58:22 2022] oom_reaper: reaped process 1332190 (doris_be), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
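Rather than paging through the whole kernel log, you can filter directly for OOM-killer activity. A small convenience sketch (on the BE host you would pipe `dmesg -T` into the same filter; here it is demonstrated against the sample message above):

```shell
# On the BE host you would run:
#   dmesg -T | grep -iE "out of memory|oom_reaper"
# Demonstrated here against the sample kernel log line shown above:
printf '%s\n' \
  "[Fri Sep 2 22:58:19 2022] Out of memory: Killed process 1332190 (doris_be) total-vm:21484964304kB, anon-rss:91048156kB, file-rss:1484kB, shmem-rss:0kB, UID:0 pgtables:225764kB oom_score_adj:0" \
  | grep -iE "out of memory|oom_reaper"
```

Matching on both patterns catches the original kill event and the later oom_reaper line, so you can compare both timestamps with the time the BE went down.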
If this is the cause, it means the BE failed to protect itself. That protection is implemented by the MemTracker, which controls query memory so that the overall memory of the BE process does not exceed the mem_limit configured by the user (80% by default), and therefore does not trigger the system's OOM killer.
If you run into this problem, as a short-term workaround you can add an item to the BE's conf/be.conf to explicitly reduce the maximum memory usage and lower the probability of triggering the OOM killer; in the long run, rely on the MemTracker for thorough protection.
# Configure this item in be.conf and restart BE to take effect
mem_limit=60%
2. BE Core Dump
In this case, be.out will contain the corresponding stack information.
For versions before 1.1.5, the query_id is not recorded in be.out. That case is a bit more complicated, and I will cover how to locate and handle it in a later article.
In versions 1.1.5 and later, the query_id of the SQL that caused the BE core dump is printed in the stack information. You can use this query_id to find the corresponding SQL in the FE's fe.audit.log, along with the related table creation statement, and then contact the community developers on Slack so they can quickly read, reproduce, and locate the problem.
The specific steps are as follows:

1. Find the TID (thread ID) that generated the core dump in be.out:

   ** SIGSEGV (@0x0) received by PID 31685 (TID 0x7f921416c700)

2. Convert the hexadecimal thread ID to decimal:

   [linux-terminal] printf "%d\n" 0x7f921416c700
   140265378989824

3. Grep the decimal thread ID in be.INFO to get the query_id:

   [linux-terminal] grep 140265378989824 be.INFO
   I1019 16:57:26.899314 260721 fragment_mgr.cpp:441] _exec_actual(): query_id=xxxxx-xxxx fragment_instance_id=xxxx-xxxxxx thread id by pthread_self()= 140265378989824

4. Grep the query_id in fe.audit.log to find the corresponding SQL.

At this point you have the SQL that caused the BE to go down.
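The first three steps above can be combined into a small helper. This is only a sketch, assuming the be.out and be.INFO line formats shown above; the log directory path is whatever your deployment uses:

```shell
# find_core_query_line <log_dir>
# Prints the be.INFO line carrying the query_id of the thread that crashed.
# Assumes <log_dir> contains be.out and be.INFO in the formats shown above.
find_core_query_line() {
  local log_dir="$1"

  # 1. Extract the hex TID from the "SIGSEGV ... (TID 0x...)" line in be.out
  local tid_hex
  tid_hex=$(grep -o 'TID 0x[0-9a-f]*' "$log_dir/be.out" | head -n1 | cut -d' ' -f2)

  # 2. Convert it to the decimal form that be.INFO logs
  local tid_dec
  tid_dec=$(printf '%d' "$tid_hex")

  # 3. Grep be.INFO for that thread id; the matching line contains the query_id
  grep "thread id by pthread_self()= $tid_dec" "$log_dir/be.INFO"
}
```

You would then take the query_id from the printed line and grep it in fe.audit.log as in step 4.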
At this point, ask your question in the GitHub or Slack community, attaching the corresponding table creation statement, the SQL, the stack information in be.out, and so on. We will see your report in the community and reply in time.
If you want to narrow down the cause further, for example to check whether the use of a particular function caused the BE to go down, you can restart the BE and modify your SQL to verify your hypothesis.
Finally, you are welcome to join the Slack community to discuss issues together: https://doris.apache.org/ And please give Doris a star: https://github.com/apache/doris