Building a Resilient Log Aggregation Pipeline Using Elasticsearch and Kafka – Rafał Kuć
Time-based data, especially logs, is all around us. Every application, system, or piece of hardware logs something, from simple messages to large stack traces. In this talk we will learn how to build and tune a resilient log aggregation pipeline with Elasticsearch and Kafka at its heart. We will start by looking at the overall architecture and how to connect Elasticsearch and Kafka. We will look at how to scale the system through a hybrid approach that combines time- and size-based indices, and how to divide the cluster into tiers so that it can handle potentially spiky load in real time. Then we'll look at tuning individual nodes, covering everything from commits, buffers, merge policies, and doc values to OS settings such as the disk scheduler, SSD caching, and huge pages. Finally, we'll look at the pipeline that gets the logs into Elasticsearch and how to make it fast and reliable: where buffers should live, which protocols to use, where heavy processing (such as parsing unstructured data) should be done, and which tools from the ecosystem can help.
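As a rough illustration of the hybrid time- and size-based indexing idea mentioned above, the sketch below starts a new index either when the day changes or when the current index grows past a size limit. It is not taken from the talk: the names `IndexState`, `index_name`, and `next_state`, the naming pattern, and the size threshold are all hypothetical, and a real deployment would more likely rely on Elasticsearch's own rollover features than on client-side logic.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class IndexState:
    """Hypothetical bookkeeping for the index currently being written to."""
    day: date        # day the current index belongs to
    sequence: int    # counter for size-based rollovers within that day
    size_bytes: int  # approximate size of the current index


def index_name(prefix: str, state: IndexState) -> str:
    # Assumed naming pattern: <prefix>-<yyyy.mm.dd>-<zero-padded sequence>
    return f"{prefix}-{state.day:%Y.%m.%d}-{state.sequence:06d}"


def next_state(state: IndexState, today: date, max_bytes: int) -> IndexState:
    """Decide which index should receive the next batch of logs."""
    if today != state.day:
        # Time-based rollover: a new day always starts a fresh index.
        return IndexState(day=today, sequence=1, size_bytes=0)
    if state.size_bytes >= max_bytes:
        # Size-based rollover: same day, but the index has grown too large.
        return IndexState(day=state.day, sequence=state.sequence + 1, size_bytes=0)
    # Otherwise keep writing to the current index.
    return state
```

With this scheme a quiet day produces a single index, while a spike in log volume produces several smaller indices for the same day, which keeps individual index sizes bounded.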