Elasticsearch unassigned shards warning every couple of hours

Our cluster has 3 Elasticsearch data pods / 3 master pods / 1 client pod and 1 exporter. The problem is a recurring Elasticsearch unassigned-shards warning caused by a circuit breaker exception. You can read more about it in this question
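
To see which shards are unassigned and the reason Elasticsearch records for each one, a quick sketch (assuming the same localhost:9200 endpoint used in the calls below):

# Count unassigned shards and list them with their unassignment reason
curl -s 'http://localhost:9200/_cluster/health?pretty' | grep unassigned_shards
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED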

Now, after hitting http://localhost:9200/_nodes/stats with curl, I can see that heap usage is moderate on all the data pods.

heap_used_percent for elasticsearch-data-0, 1 and 2 is 68%, 61% and 63% respectively.
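
Those numbers come from the same stats endpoint; filtering the response keeps only the relevant fields (sketch, same localhost:9200 endpoint assumed):

# Pull just the per-node heap_used_percent from _nodes/stats
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent&pretty'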

I made the following API calls and can see that the shards are spread almost evenly.

curl -s http://localhost:9200/_cat/shards | grep elasticsearch-data-0 | wc -l

145

curl -s http://localhost:9200/_cat/shards | grep elasticsearch-data-1 | wc -l

145

curl -s http://localhost:9200/_cat/shards | grep elasticsearch-data-2 | wc -l

142
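
The same per-node distribution, plus disk usage, can be seen in a single call (a minimal sketch, same endpoint assumed):

# One-call view of shard count and disk usage per data node
curl -s 'http://localhost:9200/_cat/allocation?v'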

Below is the output of the cluster allocation explain call.

curl -s http://localhost:9200/_cluster/allocation/explain | python -m json.tool

{
    "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
    "can_allocate": "no",
    "current_state": "unassigned",
    "index": "graph_24_18549",
    "node_allocation_decisions": [
        {
            "deciders": [
                {
                    "decider": "max_retry",
                    "decision": "NO",
                    "explanation": "shard has exceeded the maximum number of retries [50] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-31T09:18:44.115Z], failed_attempts[50], delayed=false, details[failed shard on node [nodeid1]: failed to perform indices:data/write/bulk[s] on replica [graph_24_18549][0], node[nodeid1], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=someid], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-31T09:16:42.146Z], failed_attempts[49], delayed=false, details[failed shard on node [nodeid2]: failed to perform indices:data/write/bulk[s] on replica [graph_24_18549][0], node[nodeid2], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=someid2], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-31T09:15:05.849Z], failed_attempts[48], delayed=false, details[failed shard on node [nodeid1]: failed to perform indices:data/write/bulk[s] on replica [tsg_ngf_graph_1_mtermmetrics1_vertex_24_18549][0], node[nodeid1], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=someid3], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-31T09:11:50.143Z], failed_attempts[47], delayed=false, details[failed shard on node [nodeid2]: failed to perform indices:data/write/bulk[s] on replica [graph_24_18549][0], node[o_9jyrmOSca9T12J4bY0Nw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=someid4], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-31T09:08:10.182Z], failed_attempts[46], delayed=false, details[failed shard on node [nodeid1]: failed to perform indices:data/write/bulk[s] on replica [graph_24_18549][0], node[nodeid1], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=someid6], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-31T09:07:03.102Z], failed_attempts[45], delayed=false, details[failed shard on node [nodeid2]: failed to perform indices:data/write/bulk[s] on replica [graph_24_18549][0], node[nodeid2], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=someid7], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-31T09:05:53.267Z], failed_attempts[44], delayed=false, details[failed shard on node [nodeid2]: failed to perform indices:data/write/bulk[s] on replica [graph_24_18549][0], node[nodeid2], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=someid8], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-31T09:04:24.507Z], failed_attempts[43], delayed=false, details[failed shard on node [nodeid1]: failed to perform indices:data/write/bulk[s] on replica [graph_24_18549][0], node[nodeid1], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=someid9], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-31T09:03:02.018Z], failed_attempts[42], delayed=false, details[failed shard on node [nodeid2]: failed to perform indices:data/write/bulk[s] on replica [graph_24_18549][0], node[nodeid2], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=someid10], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-31T09:01:38.094Z], failed_attempts[41], delayed=false, details[failed shard on node [nodeid1]: failed recovery, failure RecoveryFailedException[[graph_24_18549][0]: Recovery failed from {elasticsearch-data-2}{}{} into {elasticsearch-data-1}{}{}{IP}{IP:9300}]; nested: RemoteTransportException[[elasticsearch-data-2][IP:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would 
be [2012997826/1.8gb], which is larger than the limit of [1972122419/1.8gb], real usage: [2012934784/1.8gb], new bytes reserved: [63042/61.5kb]]; ], allocation_status[no_attempt]], expected_shard_size[4338334540], failure RemoteTransportException[[elasticsearch-data-0][IP:9300][indices:data/write/bulk[s][r]]]; nested: AlreadyClosedException[engine is closed]; ], allocation_status[no_attempt]], expected_shard_size[5040039519], failure RemoteTransportException[[elasticsearch-data-1][IP:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [2452709390/2.2gb], which is larger than the limit of [1972122419/1.8gb], real usage: [2060112120/1.9gb], new bytes reserved: [392597270/374.4mb]]; ], allocation_status[no_attempt]], expected_shard_size[2606804616], failure RemoteTransportException[[elasticsearch-data-0][IP:9300][indices:data/write/bulk[s][r]]]; nested: AlreadyClosedException[engine is closed]; ], allocation_status[no_attempt]], expected_shard_size[4799579998], failure RemoteTransportException[[elasticsearch-data-0][IP:9300][indices:data/write/bulk[s][r]]]; nested: AlreadyClosedException[engine is closed]; ], allocation_status[no_attempt]], expected_shard_size[4012459974], failure RemoteTransportException[[elasticsearch-data-1][IP:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [2045921066/1.9gb], which is larger than the limit of [1972122419/1.8gb], real usage: [1770141176/1.6gb], new bytes reserved: [275779890/263mb]]; ], allocation_status[no_attempt]], expected_shard_size[3764296412], failure RemoteTransportException[[elasticsearch-data-0][IP:9300][indices:data/write/bulk[s][r]]]; nested: AlreadyClosedException[engine is closed]; ], allocation_status[no_attempt]], expected_shard_size[2631720247], failure RemoteTransportException[[elasticsearch-data-1][IP:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [2064366222/1.9gb], which is larger than the limit of [1972122419/1.8gb], real usage: [1838754456/1.7gb], new bytes reserved: [225611766/215.1mb]]; ], allocation_status[no_attempt]], expected_shard_size[3255872204], failure RemoteTransportException[[elasticsearch-data-0][IP:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [2132674062/1.9gb], which is larger than the limit of [1972122419/1.8gb], real usage: [1902340880/1.7gb], new bytes reserved: [230333182/219.6mb]]; ], allocation_status[no_attempt]], expected_shard_size[2956220256], failure RemoteTransportException[[elasticsearch-data-1][IP:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [2092139364/1.9gb], which is larger than the limit of [1972122419/1.8gb], real usage: [1855009224/1.7gb], new bytes reserved: [237130140/226.1mb]]; ], allocation_status[no_attempt]]]"
                },
{
                    "decider": "same_shard",
                    "decision": "NO",
                    "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[graph_24_18549][0], node[nodeid2], [P], s[STARTED], a[id=someid]]"
                }
            ],
            "node_decision": "no",
            "node_id": "nodeid2",
            "node_name": "elasticsearch-data-2",
            "transport_address": "IP:9300"
        }
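
The limit in the failures above, 1972122419 bytes (~1.8 GB), is the parent circuit breaker, which in 7.x defaults to 95% of the JVM heap, so these data pods appear to be running with roughly a 2 GB heap. The breaker state can be watched directly (sketch, same localhost:9200 endpoint assumed):

# Inspect the parent circuit breaker that keeps tripping;
# breakers.parent.limit_size_in_bytes should match the 1972122419 (~1.8gb) value in the errors above
curl -s 'http://localhost:9200/_nodes/stats/breaker' | python -m json.tool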

What should I do now? I don't see the heap spiking. I already tried the API below, which helps and assigns all the unassigned shards, but the problem comes back every couple of hours.

curl -XPOST ':9200/_cluster/reroute?retry_failed'
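
For reference, the full form that the allocation explanation itself suggests, with retry_failed=true spelled out (assuming the same localhost:9200 endpoint as the other calls):

# Retry shards that hit the max_retry limit, then re-check cluster health
curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
curl -s 'http://localhost:9200/_cluster/health?pretty'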


person Gokul    schedule 02.11.2020    source
comment
Add /_cat/allocation?v, /_cat/thread_pool/search?v, /_cat/thread_pool/write?v and /_cat/nodes?v to the question.   -  person hamid bayat    schedule 02.11.2020
comment
How many shards per index? Do you have index/update-heavy indices with a single shard?   -  person hamid bayat    schedule 02.11.2020
comment
Heap usage looks fine and I don't see any spike, yet it still fails. I have edited the question accordingly. Thanks Hamid   -  person Gokul    schedule 02.11.2020


Answers (1)


Which Elasticsearch version are you running? 7.9.1 and 7.10.1 retry replication that failed because of a CircuitBreakingException more gracefully and handle indexing pressure better.
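
If unsure, the root endpoint reports the running version (version.number in the response):

# Check which Elasticsearch version the cluster is currently running
curl -s 'http://localhost:9200'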

I would recommend trying to upgrade the cluster. Version 7.10.1 seems to have fixed this issue for me. More details: Help with unassigned shards / CircuitBreakingException / Values less than -1 bytes are not supported

person Ricardo    schedule 08.01.2021