Centralized Logging and APM at Scale with OpenSearch

In a distributed microservices architecture, logging is the first line of defense for troubleshooting application issues. However, if logs are scattered across individual virtual machines, containers, and serverless tasks, debugging an incident requires SSH-ing into multiple instances and manually grepping text files.

SRE teams resolve this by building centralized logging architectures. OpenSearch, an open-source fork of Elasticsearch, together with OpenSearch Dashboards, its fork of Kibana, has become a widely used choice for aggregating, indexing, searching, and visualizing high-volume log data and Application Performance Monitoring (APM) traces.

Here is how to design a scalable, centralized log ingestion pipeline.


The Log Ingestion Pipeline Architecture

A scalable logging pipeline has three core stages (collection, buffering and processing, and storage and indexing), arranged as follows:

[Application Containers] ──> [Fluent Bit Agent] ──> [Logstash Buffer] ──> [OpenSearch Cluster]

1. Collection (Fluent Bit)

Rather than writing logs directly to network endpoints (which can block application threads when the endpoint is slow or unreachable), services write logs to standard output (stdout). A lightweight log forwarding agent, like Fluent Bit or Filebeat (the successor to the discontinued Logstash Forwarder), runs on each host, reads the container log streams, and forwards them. Fluent Bit is highly efficient, typically consuming only a few megabytes of memory per instance.
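
As a minimal sketch of the application side, the following Python snippet writes one JSON object per line to stdout so that a host-level agent can pick it up. The class name JsonLineFormatter, the helper build_logger, and the service_name extra field are illustrative assumptions, not part of any particular library or of the pipeline described here.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # UTC ISO 8601 timestamp derived from the record's creation time.
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "log.level": record.levelname.lower(),
            # "service_name" is a hypothetical field passed via `extra=`.
            "service.name": getattr(record, "service_name", "unknown"),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that emits JSON lines to the given stream (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("auth")
    log.info("login failed", extra={"service_name": "auth-service"})
```

Because the application only writes to a local stream, shipping the logs elsewhere stays entirely the agent's responsibility.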

2. Buffering & Processing (Logstash)

In high-throughput environments, a sudden spike in application logs (e.g., during an outage) can overwhelm the indexing cluster. To protect the cluster, place a processing buffer like Logstash (or a message queue like Apache Kafka or AWS Kinesis) in front of OpenSearch.
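
The buffering idea can be illustrated with a toy in-memory model. This is not how Logstash, Kafka, or Kinesis are implemented (those offer durable, disk-backed queues); it is only a sketch, with a hypothetical BoundedLogBuffer class, of how a bounded buffer absorbs a burst and releases it downstream in fixed-size batches.

```python
from collections import deque
from typing import Callable, Deque, List


class BoundedLogBuffer:
    """Hold log events in memory and hand them downstream in fixed-size batches."""

    def __init__(
        self,
        capacity: int,
        batch_size: int,
        sink: Callable[[List[dict]], None],
    ) -> None:
        self._events: Deque[dict] = deque()
        self._capacity = capacity
        self._batch_size = batch_size
        self._sink = sink  # e.g. a function that sends one bulk request
        self.dropped = 0

    def offer(self, event: dict) -> bool:
        """Accept an event, or reject it and count the drop when the buffer is full."""
        if len(self._events) >= self._capacity:
            self.dropped += 1
            return False
        self._events.append(event)
        return True

    def drain_once(self) -> int:
        """Send at most one batch to the sink and return how many events were sent."""
        count = min(self._batch_size, len(self._events))
        batch = [self._events.popleft() for _ in range(count)]
        if batch:
            self._sink(batch)
        return len(batch)
```

The producer side can burst up to the buffer's capacity, while the consumer side calls drain_once at a rate the cluster can sustain. A real deployment would choose between dropping, blocking, or spilling to disk once the buffer is full.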

Logstash parses the logs (e.g., extracting JSON properties, converting timestamps, and scrubbing sensitive data like credit card numbers) before sending them to OpenSearch.
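
In Logstash these steps are expressed as filter configuration. To show the logic itself, here is a rough Python equivalent of the three transformations. The input field name ts (epoch seconds), the function names, and the card-number pattern are assumptions for illustration. The pattern is deliberately loose and would need tuning before real use.

```python
import json
import re
from datetime import datetime, timezone

# Loose illustrative pattern: 13 to 16 digits, optionally separated by
# single spaces or hyphens. It can also match other long digit runs.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def normalize_timestamp(epoch_seconds: float) -> str:
    """Convert epoch seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def process_line(raw_line: str) -> dict:
    """Parse one JSON log line, normalize its timestamp, and scrub card numbers."""
    event = json.loads(raw_line)
    # "ts" is a hypothetical source field holding epoch seconds.
    event["@timestamp"] = normalize_timestamp(event.pop("ts"))
    event["message"] = CARD_PATTERN.sub("[REDACTED]", event.get("message", ""))
    return event
```

Scrubbing at this stage means sensitive values never reach the index, which is simpler than trying to remove them after the fact.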

3. Storage and Indexing (OpenSearch)

OpenSearch receives the processed logs, indexes them, and makes them searchable in near real-time.
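
Log shippers usually send documents through the bulk API rather than one request per event. The sketch below builds a newline-delimited bulk request body and routes each event to a daily index. The logs prefix, the date-based naming scheme, and the function names are assumptions, and sending the body to the cluster's _bulk endpoint is left out.

```python
import json
from datetime import datetime
from typing import Iterable


def daily_index_name(prefix: str, timestamp: str) -> str:
    """Derive a per-day index name such as 'logs-2024.05.01' from an ISO 8601 timestamp."""
    day = datetime.fromisoformat(timestamp).strftime("%Y.%m.%d")
    return f"{prefix}-{day}"


def build_bulk_body(events: Iterable[dict], prefix: str = "logs") -> str:
    """Build a bulk request body: one action line followed by one document line per event."""
    lines = []
    for event in events:
        action = {"index": {"_index": daily_index_name(prefix, event["@timestamp"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(event))
    # The bulk format requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)
```

Writing to time-based indexes like this is what later allows whole indexes to be moved or deleted by age.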


Scaling OpenSearch: Index State Management (ISM)

Logging data is write-heavy and time-series based. If you write all logs to a single index, it will grow too large, query performance will degrade, and storage costs will surge.

To scale effectively, SREs implement Index State Management (ISM) policies using hot-warm-cold storage architectures:

  • Hot Nodes: Recent indexes (up to about 3 days old), including those currently receiving writes. These nodes run on high-performance compute instances with fast NVMe SSD storage.
  • Warm Nodes: Indexes from 3 to 7 days ago. They are no longer receiving writes but are queried frequently for debugging. These nodes run on cost-effective instances with standard SSDs.
  • Cold Nodes: Indexes older than 7 days. They are rarely queried but retained for compliance. Data is compressed and stored on low-cost storage, such as inexpensive EBS volumes (block storage) or S3 (object storage).
  • Deletion: After a defined retention period (e.g., 30 days), indexes are deleted automatically.
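
The tiers above map fairly directly onto an ISM policy document. The following sketch builds one as a Python dictionary that can be serialized to JSON and submitted to the ISM policy API. It assumes that nodes are tagged with a custom attribute named temp (values warm and cold) and that log indexes match logs-*; both are illustrative choices, as is the function name build_log_policy.

```python
import json


def transition(state_name: str, min_index_age: str) -> dict:
    """Build one ISM transition that fires once the index reaches the given age."""
    return {"state_name": state_name, "conditions": {"min_index_age": min_index_age}}


def build_log_policy(
    warm_after: str = "3d", cold_after: str = "7d", delete_after: str = "30d"
) -> dict:
    """Return a hot -> warm -> cold -> delete ISM policy body."""
    return {
        "policy": {
            "description": "Hot-warm-cold-delete lifecycle for log indexes",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [transition("warm", warm_after)],
                },
                {
                    "name": "warm",
                    # Assumes nodes carry a custom attribute "temp".
                    "actions": [{"allocation": {"require": {"temp": "warm"}}}],
                    "transitions": [transition("cold", cold_after)],
                },
                {
                    "name": "cold",
                    "actions": [
                        {"read_only": {}},
                        {"allocation": {"require": {"temp": "cold"}}},
                    ],
                    "transitions": [transition("delete", delete_after)],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to newly created matching indexes.
            "ism_template": {"index_patterns": ["logs-*"], "priority": 100},
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_log_policy(), indent=2))
```

The age thresholds are parameters so that the same policy shape can be reused with different retention requirements.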

Optimizing Query Performance

To keep OpenSearch queries fast:

  1. Avoid Wildcard Queries: Encourage developers to search specific fields (e.g., service.name: "auth-service") instead of performing full-text searches across all fields (*).
  2. Log in Structured JSON: Ensure all applications output logs as structured JSON rather than unstructured plain text. Structured logs automatically parse into indexable fields (e.g., http.response.status_code: 500), allowing for rapid filtering.
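  3. Map Fields Explicitly: Define explicit mappings and restrict dynamic mapping, especially for objects whose keys are unbounded (like maps keyed by user ID or UUID). With dynamic mapping, every new key becomes a new field, and the resulting "mapping explosion" degrades cluster performance. High-cardinality values are fine when they are stored in a single field mapped as keyword.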
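
The three recommendations can be sketched together: an explicit mapping with dynamic mapping turned off, and a query that filters on specific fields rather than searching everything. The field names follow the examples above, while user.id, the short type for status codes, and the helper errors_for_service are assumptions made for illustration.

```python
import json

# Index body with explicit field types. "dynamic": False tells OpenSearch to
# keep unmapped fields in the stored document without adding them to the mapping.
LOG_INDEX_MAPPING = {
    "mappings": {
        "dynamic": False,
        "properties": {
            "@timestamp": {"type": "date"},
            "message": {"type": "text"},
            "service": {"properties": {"name": {"type": "keyword"}}},
            # High-cardinality values live in one keyword field, not in field names.
            "user": {"properties": {"id": {"type": "keyword"}}},
            "http": {
                "properties": {
                    "response": {"properties": {"status_code": {"type": "short"}}}
                }
            },
        },
    }
}


def errors_for_service(
    service_name: str, min_status: int = 500, since: str = "now-15m"
) -> dict:
    """Build a field-scoped query: exact service match, status threshold, time window."""
    return {
        "query": {
            "bool": {
                # Filter clauses skip relevance scoring and can be cached.
                "filter": [
                    {"term": {"service.name": service_name}},
                    {"range": {"http.response.status_code": {"gte": min_status}}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(errors_for_service("auth-service"), indent=2))
```

Setting dynamic to "strict" instead would reject documents that contain unmapped fields, which is stricter but can cause ingestion failures when applications add new fields.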

Summary

Centralized logging is essential for maintaining microservices. By implementing a lightweight collector like Fluent Bit, buffering spikes with Logstash, and managing index lifecycles in OpenSearch, you can give your engineering teams a high-performance debugging platform without inflating your cloud storage bill.