Elasticsearch - Cluster Intro PART2

Cluster 管理

集群部署

单节点单职责
- 不同角色的节点（前一篇文章）部署在不同的节点
- 多Node模式
Dedicate Coordinating Only Node
- 大的集群配置 Coordinating Only Nodes
- load balance，降低master/data node负担
- 负责gather/reduce
- 扛住客户端流量
Dedicate Master Node
- 从高可用 & 避免脑裂的角度出发 >=3
- 只有一个leader master
- 如果和数据节点或者 Coordinate 节点混合部署
  - 资源竞争，尤其是内存

水平扩展

非Data node 扩展比较容易
如查询较多，增加Coordinating Node 或者replica data node
读写分离
- 读 Coordinating node
- 写 ingest node
Data Node扩展
- re-index
- 增加replica shard

HA支持

单集群单AZ部署
- 按照 👆集群部署环节操作即可
- 网络性能最好的方案
单集群多AZ部署
- Two-zone clusters
  - master 分散在不同的可用区，A 2 B 1，不能均匀分布master node
  - replica >=2 确保每个分片在每个区域都有一份副本
  - 确保网络足够好，cluster state 更新及时
  - 缺点：3master，当有一个AZ出现网络抖动时，两个可用区可能无法选举出leader master
- 三个及以上可用区部署时可以解决👆的缺点
多集群
- active-passive
  - 也可以理解为cross-cluster replication
  - 只有一个cluster 提供写支持，其他集群follow，做relication
  - 可以跨region部署
  - 优点
    - Failover
    - Disaster Recovery
    - Data locality
    - Centralized reporting
    - Chain Replication
    - Bi-directional replication 每个cluster可以承担不同的index 写入，互相做replication
  - 缺点
    - 跨集群复制通过重放在主索引的分片上执行的单个写操作的历史记录来工作。Elasticsearch 需要在主分片上保留这些操作的历史记录
    - 数据同步依赖网络延时
- active-active
  - 多cluster同时支持写入
  - 需要抽象一层 global transaction management，做request分发
  - 同一个index 数据分布在不同的cluster
  - 数据同步和一致性是个大挑战
  - 如果跨region部署，受网络影响最大，不建议

Data tier 架构

主要针对log类文档（基于时序）

通过node.my_node.type进行设置,通过index life cycle进行移动处理

Hot tier nodes
- 保存最新、最频繁访问的数据
- 硬件配置需要高一些
Warm tier nodes
- 保存较少频繁访问且很少需要更新的时间序列数据
- 磁盘大一些
Cold tier nodes
- 保存不常访问且通常不更新的时间序列数据
- 为了节省空间，可以在冷层上保留完全挂载的可搜索快照索引
- 这些快走不需要副本，节省磁盘空间
- 与常规索引相比，所需磁盘空间减少了约50%
Frozen tier nodes
- 保存很少访问且从不更新的时间序列数据。冻结层仅存储部分挂载的可搜索快照索引。这进一步扩展了存储容量——与温层相比，最多可增加20倍

容量管理

分片管理
- 分片数 > 节点数时
  - 一旦集群中有新的数据节点加入，分片就可以自动进行分配
- 较少的数据单分片
- 大量数据，多分片
  - 优势
    - 水平扩展
    - 查询并行处理
    - 数据分散在多分片上，每个shard 容量相对单一分片少，有更快的recovery速度
  - 劣势
    - 每个分片是一个 Lucene 的索引，会使用机器的资源。过多的分片会导致额外的性能开销
    - 每次搜索的请求，需要从每个分片上获取数据
    - 分片的 Meta 信息由 Master 节点维护。过多，会增加管理的负担
      - 腾讯给出的经验时3-5w片是上限(节点数百情况下)
如何确定分片的数量？
- 存储角度，预估数据量和增长幅度
- 读写请求的场景tps latency
- 可用性要求SLA 99.99%
- 数据保持的周期life cycle
- 硬件资源,主要是内存和磁盘
  - FST 吃内存
  - index 吃磁盘
  - Node 分片数量小于1000
- 有一个内存参考线
  - jvm 内存不要超过node内存的一半👇有解释
  - 相对应shard分片的size要小于内存的一半
- 官方建议
  - 虽然在理论上每个分片的物理大小没有硬性限制，并且每个分片可以容纳超过20亿个文档，但经验表明，大小在10GB到50GB之间的分片在许多使用场景中通常表现良好，只要每个分片的文档数量保持在2亿以下
  - 如果您使用ILM（索引生命周期管理），请将滚动动作的max_primary_shard_size阈值设置为50GB，以避免分片超过50GB
- 假设之后，就是在生产硬件上使用生产数据进行基准测试，使用与生产中相同的查询和索引负载，根据测试结果进行调整
cluster scaling
- 增加 Coordinating / Ingest Node 解决 CPU 和内存开销的问题
- 增加Data Node
  - 解决存储的容量的问题
  - 为避免分片分布不均的问题，要提前监控磁盘空间，提前清理数据或增加节点（ 70%）

优化配置

系统配置

File Descriptors
- ulimit -n 65535
- 或者 /etc/security/limits.conf
  - elasticsearch - nofile 65535
Disable swapping
- swapoff -a
- 确保 sysctl 值 vm.swappiness 设置为 1。这样可以减少内核的交换倾向，在正常情况下不应导致交换，同时在紧急情况下仍然允许整个系统进行交换
Virtual memory
- sysctl -w vm.max_map_count = 262144
- 字节数自己根据业务情况调整
Number of threads
- ulimit -u 4096
TCP配置
- Linux用户应减少最大TCP重传次数
- sysctl -w net.ipv4.tcp_retries2=5 13s

JVM 配置

2016文档

单个节点上，最大内存建议不要超过 32 G 内存
将 Xms 和 Xmx 设置为不超过总内存的 50%。Elasticsearch 需要内存用于 JVM 堆以外的其他用途。例如，Elasticsearch 使用堆外缓冲区进行高效的网络通信，并依赖操作系统的文件系统缓存来高效访问文件。JVM 本身也需要一些内存

gc log/fatal error log

-Xlog:gc*,gc+age=trace,safepoint:file=/opt/my-app/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m

-XX:ErrorFile=

gc 策略
- XX:+UseG1GC
- XX:InitiatingHeapOccupancyPercent=30
- XX:G1ReservePercent=25
- XX:MaxDirectMemorySize=4294967296
netty相关
- io.netty.noUnsafe=true
- io.netty.noKeySetOptimization=true
- io.netty.recycler.maxCapacityPerThread=0
- io.netty.allocator.numDirectArenas=0
DNS cache
- es.networkaddress.cache.ttl and es.networkaddress.cache.negative.ttl

ES配置

network (system default)
- network.tcp.keep_alive
- network.tcp.keep_idle
- network.tcp.keep_interval
- network.tcp.keep_count
- network.tcp.no_delay
- network.tcp.reuse_address
- network.tcp.send_buffer_size
- network.tcp.receive_buffer_size
- http.max_content_length
- http.compression
- http.cors.enabled
Thread pool
- bulkhead模式
- 不同类型的thread pool 相互隔离
- 根据业务需要调整即可
Cricuit Breaker
- indices.breaker.total.use_real_memory
- indices.breaker.total.limit
- indices.breaker.fielddata.limit
- indices.breaker.fielddata.overhead
- indices.breaker.request.limit
- indices.breaker.request.overhead
- network.breaker.inflight_requests.limit
- network.breaker.inflight_requests.overhead
- indices.breaker.accounting.limit
- indices.breaker.accounting.overhead
index buffer & pressure & merge
- indices.memory.index_buffer_size
  - 请确保index_buffer_size足够大，以便每个进行大量索引操作的分片最多分配512MB的索引缓冲区（超过这个值通常不会显著提高索引性能）。Elasticsearch会将该设置（Java堆的百分比或绝对字节大小）用作所有活动分片的共享缓冲区。非常活跃的分片自然会比进行轻量级索引的分片更多地使用此缓冲区。
  - 默认值是10%，这通常已经足够了：例如，如果给JVM分配10GB内存，则会给索引缓冲区分配1GB，这足以支持两个进行大量索引的分片
- indices.memory.min_index_buffer_size
- indices.memory.max_index_buffer_size
- indexing_pressure.memory.limit
- index.merge.scheduler.max_thread_count
index blocks
- 索引阻塞限制了某个索引上可用的操作类型。阻塞有不同的类型，可以阻止写操作、读操作或元数据操作。阻塞可以通过动态索引设置进行设置/移除，也可以使用专用API添加。专用API还确保在写操作阻塞时，一旦成功返回给用户，索引的所有分片都能正确处理该阻塞。例如，添加写操作阻塞后，所有正在进行中的写操作都已完成
- index.blocks.read_only: true
- index.blocks.read_only_allow_delete
- index.blocks.read: true
- index.blocks.write: true
  - 将此项设置为 true 禁止对索引的数据写入操作。与 read_only 不同，此设置不会影响元数据
- index.blocks.metadata
Translog
- index.translog.sync_interval
- index.translog.durability
- index.translog.flush_threshold_size
Cache
- field cache
  - indices.fielddata.cache.size
- shard query cache
  - indices.requests.cache.size: 2%
  - indices.requests.cache.expire
    - 可以不进行设置，refresh时会自动过期
- node query cache
  - indices.queries.cache.size
Discovery and cluster
- discovery.seed_hosts
- cluster.initial_master_nodes
- expert setting 没有特殊需求可以不进行调整
  - 这里不进行展开
Shard allocation and routing
- 没有特殊需求可以不进行调整
- Enabling shard allocation awareness
  - node.attr.rack_id: rack_one
  - 根据rack进行感知调整
- shard allocation filtering
  - cluster.routing.allocation.exclude._ip" : "10.0.0.1"
Cross-cluster replication
- 这里不展开
index
Index recovery
- indices.recovery.max_bytes_per_sec
  - ≤ 4 GB 40 MB/s
  - >4 GB and ≤ 8 GB 60MB/s
  - > 8 GB and ≤ 16 GB 90 MB/s
  - > 16 GB and ≤ 32 GB 125MB/s
  - > 32 GB 250MB/s
- indices.recovery.max_concurrent_file_chunks
- indices.recovery.max_concurrent_operations
- indices.recovery.max_concurrent_snapshot_file_downloads
- indices.recovery.max_concurrent_snapshot_file_downloads_per_node
- node.bandwidth.recovery.disk.read
- node.bandwidth.recovery.disk.write
- node.bandwidth.recovery.*

health check

    health.master_history.has_master_lookup_timeframe
    （静态）在执行其他检查之前，节点回溯以查看是否曾观察到主节点的时间量。默认为 30 秒（30s）。

    master_history.max_age
    （静态）记录主节点历史以用于诊断集群健康的时间范围。超过此时间的主节点变化在诊断集群健康时将不予考虑。默认为 30 分钟（30m）。

    health.master_history.identity_changes_threshold
    （静态）节点目睹的主节点身份变化次数，指示集群不健康的阈值。默认为 4。

    health.master_history.no_master_transitions_threshold
    （静态）节点目睹的转变为无主节点的次数，指示集群不健康的阈值。默认为 4。

    health.node.enabled
    （动态）启用健康节点，允许健康API提供有关磁盘空间等集群范围的健康状况指示。

    health.reporting.local.monitor.interval
    （动态）确定集群中每个节点监控其本地健康状况（如磁盘使用情况）的间隔时间。

    health.ilm.max_time_on_action
    （动态）索引在被视为停滞之前必须处于索引生命周期管理（ILM）操作中的最短时间。默认为 1 天（1d）。

    health.ilm.max_time_on_step
    （动态）索引在被视为停滞之前必须处于ILM步骤中的最短时间。默认为 1 天（1d）。

    health.ilm.max_retries_per_step
    （动态）索引在被视为停滞之前由ILM步骤重试的最少次数。默认为 100。

    health.periodic_logger.enabled
    （动态）启用健康定期记录器，该记录器记录每个健康指示器以及健康API观察到的顶级健康状态。默认为 false。

    health.periodic_logger.poll_interval
    （动态，时间单位值）Elasticsearch记录集群及每个健康指示器的健康状态的频率，如健康API所观察到的。默认为 60 秒（60s）。

官方建议

Avoid large documents
Don’t return large result sets
Use bulk requests
Use multiple workers/threads to send data to Elasticsearch
Unset or increase the refresh interval
Disable replicas for initial loads
Disable swapping
Give memory to the filesystem cache (node memory/2)
Use auto-generated ids
Use faster hardware
Avoid hot spotting
increase indexing buffer size
Disable the features you do not need
- 如disable _source (json body)
Watch your shard size
Shrink index
Put fields in the same order in documents
Don’t use default dynamic string mappings
Use best_compression
Use the smallest numeric type that is sufficient
Use index sorting to colocate similar documents
Force-merge read-only indices
Put fields in the same order in documents
Roll up historical data
Search as few fields as possible
Pre-index data 减少聚合查询
Avoid scripts
Replicas might help with throughput, but not always
Tune your queries with the Search Profile
Faster phrase queries with index_phrases
Use constant_keyword to speed up filtering

参考

官方文档