Learn how to monitor your Fluent Bit data pipelines
Fluent Bit comes with built-in features that let you monitor the internals of your pipeline, connect to Prometheus and Grafana, run health checks, and use connectors to external services for the same purpose.
By default, configured plugins get an internal name at runtime in the format plugin_name.ID. For monitoring purposes, this can be confusing when several plugins of the same type are configured. To tell them apart, each configured input or output section can be given an alias, which is then used as the parent name for the metric.
The following example sets an alias on the INPUT section, which uses the CPU input plugin:
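A minimal sketch of such a configuration, in classic-mode syntax, is shown here. The alias values `server1_cpu` and `raw_output` are illustrative names chosen for this example, and the `[SERVICE]` section is included on the assumption that the built-in HTTP server is used to query metrics.

```text
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

[INPUT]
    Name   cpu
    # Illustrative alias; used as the metric's parent name instead of cpu.0
    Alias  server1_cpu

[OUTPUT]
    Name   stdout
    # Illustrative alias; used instead of stdout.0
    Alias  raw_output
    Match  *
```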
Now when querying the metrics we get the aliases in place instead of the plugin name:
Fluent Bit supports four configuration options to set up the health check:
Health_Check: enables the health check feature.
HC_Errors_Count: the error count threshold for the unhealthy state. The count is the sum across all output plugins within a defined HC_Period. Example of an output error: [2022/02/16 10:44:10] [ warn] [engine] failed to flush chunk '1-1645008245.491540684.flb', retry in 7 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.3 (out_id=3)
HC_Retry_Failure_Count: the retry failure count threshold for the unhealthy state. The count is the sum across all output plugins within a defined HC_Period. Example of a retry failure: [2022/02/16 20:11:36] [ warn] [engine] chunk '1-1645042288.260516436.flb' cannot be retried: task_id=0, input=tcp.3 > output=cloudwatch_logs.1
HC_Period: the time period, in seconds, over which errors and retry failures are counted.
Note: Not every error log line counts as an error. Only specific errors and retry failures are counted, namely those shown in the examples in the option descriptions.
The feature works as follows: within the configured HC_Period, if the number of errors exceeds HC_Errors_Count or the number of retry failures exceeds HC_Retry_Failure_Count, Fluent Bit is considered unhealthy, and the health endpoint returns HTTP status 500 and the string error. Otherwise Fluent Bit is healthy, and the endpoint returns HTTP status 200 and the string ok.
The equation is:
health status = (HC_Errors_Count > HC_Errors_Count config value) OR (HC_Retry_Failure_Count > HC_Retry_Failure_Count config value) IN the HC_Period interval
Note: HC_Errors_Count and HC_Retry_Failure_Count apply only to output plugins. Each is the sum of errors or retry failures across all running output plugins.
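The rule above can be sketched in a few lines of Python. This is an illustrative model only, not Fluent Bit's implementation: the function name `is_unhealthy` and its parameters are hypothetical, and the counts are assumed to have already been summed over all output plugins for one HC_Period window.

```python
def is_unhealthy(errors: int, retry_failures: int,
                 hc_errors_count: int, hc_retry_failure_count: int) -> bool:
    """Return True when either observed count exceeds its configured threshold.

    `errors` and `retry_failures` are assumed to be sums over all output
    plugins within a single HC_Period window.
    """
    return errors > hc_errors_count or retry_failures > hc_retry_failure_count


# Illustrative thresholds of 5 for both options.
print(is_unhealthy(6, 0, hc_errors_count=5, hc_retry_failure_count=5))  # True
print(is_unhealthy(5, 5, hc_errors_count=5, hc_retry_failure_count=5))  # False
```

Because the comparison is strict, a count that is equal to its threshold still reports healthy.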
See the config example:
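A configuration along these lines might look like the following sketch, in classic-mode syntax. The threshold and period values of 5 are chosen to match the worked equation further down, and the HTTP listen address and port are assumed to match the health endpoint URL used in this section.

```text
[SERVICE]
    HTTP_Server             On
    HTTP_Listen             0.0.0.0
    HTTP_PORT               2020
    # Enable the health check and set its thresholds
    Health_Check            On
    HC_Errors_Count         5
    HC_Retry_Failure_Count  5
    HC_Period               5
```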
Use the following command to call the health endpoint:
$ curl -s http://127.0.0.1:2020/api/v1/health
Depending on the Fluent Bit status, the result is:
HTTP status 200 and "ok" in the response when the status is healthy
HTTP status 500 and "error" in the response when the status is unhealthy
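For scripted checks, the same request can be made with Python's standard library. The sketch below assumes the endpoint address from the curl command above; the helper names `interpret_health` and `fetch_health` are hypothetical. Note that `urllib` raises `HTTPError` for a 500 response, so the unhealthy case is handled in an exception branch.

```python
import urllib.error
import urllib.request

# Address taken from the curl example; adjust to your HTTP_Listen/HTTP_PORT.
HEALTH_URL = "http://127.0.0.1:2020/api/v1/health"


def interpret_health(status: int) -> str:
    """Map the endpoint's HTTP status code to a label."""
    if status == 200:
        return "healthy"
    if status == 500:
        return "unhealthy"
    return "unknown"  # other status codes are not described by the health check


def fetch_health(url: str = HEALTH_URL, timeout: float = 2.0) -> str:
    """Query the health endpoint and return 'healthy', 'unhealthy', or 'unknown'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_health(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx responses; the status code is on the exception.
        return interpret_health(exc.code)
    except OSError:
        # Connection refused, timeout, and similar: the agent could not be reached.
        return "unknown"
```

Calling `fetch_health()` against a running agent returns one of the three labels, which a wrapper script or probe can turn into an exit code.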
With the example config, the health status is determined by the following equation:
Health status = (HC_Errors_Count > 5) OR (HC_Retry_Failure_Count > 5) IN 5 seconds
If (HC_Errors_Count > 5) OR (HC_Retry_Failure_Count > 5) IN 5 seconds is TRUE, then it's unhealthy.
If (HC_Errors_Count > 5) OR (HC_Retry_Failure_Count > 5) IN 5 seconds is FALSE, then it's healthy.
Calyptia Cloud is a hosted service that allows you to monitor your Fluent Bit agents including data flow, metrics and configurations.
Get Started with Calyptia Cloud
Registering your Fluent Bit agent takes less than one minute. The steps are:
In the left menu, click Settings and generate or copy your API key
In your Fluent Bit configuration file, append the following configuration section:
Make sure to replace your API key in the configuration.
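The configuration section referred to in the previous step might look like the following sketch. It assumes the `calyptia` custom plugin with its `api_key` option; `<YOUR_API_KEY>` is a placeholder for the key copied from the Settings page.

```text
[CUSTOM]
    name     calyptia
    # Placeholder: replace with the API key from your Calyptia Cloud settings
    api_key  <YOUR_API_KEY>
```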
A few seconds after you restart your Fluent Bit agent, the Calyptia Cloud Dashboard will list it. Metrics take around 30 seconds to show up.