Send logs, data, metrics to Amazon S3
The plugin can upload data to S3 using the multipart upload API or using S3 PutObject. Multipart is the default and is recommended; Fluent Bit will stream data in a series of 'parts'. This limits the amount of data it has to buffer on disk at any point in time. By default, every time 5 MiB of data have been received, a new 'part' will be uploaded. The plugin can create files up to gigabytes in size from many small chunks/parts using the multipart API. All aspects of the upload process are configurable using the configuration options.
The plugin allows you to specify a maximum file size and a timeout for uploads. A file will be created in S3 when the maximum size is reached or the timeout expires, whichever comes first.
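The "whichever comes first" rule can be illustrated with a minimal Python sketch. This is not the plugin's implementation; the class, function, and parameter names (`PendingFile`, `should_finish_file`, `max_file_size`, `timeout_seconds`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PendingFile:
    """Hypothetical bookkeeping for one file being assembled for S3."""
    bytes_buffered: int   # bytes received for this file so far
    started_at: float     # time the first byte arrived, in seconds


def should_finish_file(pending: PendingFile, now: float,
                       max_file_size: int, timeout_seconds: float) -> bool:
    """Return True when the file should be completed in S3.

    The file is finished when it reaches the maximum size or when the
    timeout has elapsed since it was started, whichever happens first.
    """
    size_reached = pending.bytes_buffered >= max_file_size
    timed_out = (now - pending.started_at) >= timeout_seconds
    return size_reached or timed_out


MIB = 1024 * 1024
# A small file that has been open for the whole timeout period.
small_but_old = PendingFile(bytes_buffered=3 * MIB, started_at=0.0)
finish = should_finish_file(small_but_old, now=600.0,
                            max_file_size=100 * MIB, timeout_seconds=600.0)
```

In this sketch `finish` is true because the timeout has elapsed, even though the file is far below the maximum size.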
Records are stored in files in S3 as newline delimited JSON.
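As a rough illustration of the newline delimited JSON layout, the following Python sketch parses the contents of such a file once it has been downloaded. The record field name (`log`) is a made-up example, not something the plugin guarantees.

```python
import json


def parse_ndjson(text):
    """Parse newline delimited JSON: one JSON object per non-empty line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        records.append(json.loads(line))
    return records


# Hypothetical file contents with two records.
sample = '{"log": "first"}\n{"log": "second"}\n'
records = parse_ndjson(sample)
```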
NOTE: The Prometheus success/retry/error metrics values outputted by Fluent Bit's built-in http server are meaningless for the S3 output. This is because S3 has its own buffering and retry mechanisms. The Fluent Bit AWS S3 maintainers apologize for this feature gap; you can track our progress fixing it on GitHub.
The plugin requires the following AWS IAM permissions:
The s3 output plugin is special because its use case is to upload files of non-trivial size to an Amazon S3 bucket. This is in contrast to most other outputs which send many requests to upload data in batches of a few Megabytes or less.
When Fluent Bit receives logs, it stores them in chunks, either in memory or on the filesystem depending on your settings. A chunk is usually around 2 MB in size. Fluent Bit sends the chunks in order to each output that matches their tag. Most outputs then send the chunk immediately to their destination. A chunk is sent to the output's "flush callback function", which must return one of FLB_OK, FLB_RETRY, or FLB_ERROR. Fluent Bit keeps count of the return values from each output's "flush callback function"; these counters are the data source for Fluent Bit's error, retry, and success metrics, available in Prometheus format via its monitoring interface.
The S3 output plugin is a Fluent Bit output plugin and thus it conforms to the Fluent Bit output plugin specification. However, since the S3 use case is to upload large files, generally much larger than 2 MB, its behavior is different. The S3 "flush callback function" simply buffers the incoming chunk to the filesystem and returns FLB_OK. Consequently, the Prometheus metrics available via the Fluent Bit HTTP server are meaningless for S3. In addition, the storage.total_limit_size parameter is not meaningful for S3 since it has its own buffering system in the store_dir. Instead, use store_dir_limit_size.
S3 uploads are primarily initiated via the S3 "timer callback function", which runs separately from its "flush callback function". Because S3 has its own system of buffering and its own callback to upload data, the normal sequential data ordering of chunks provided by the Fluent Bit engine may be compromised. Consequently, S3 has the preserve_data_ordering option, which ensures data is uploaded in the original order it was collected by Fluent Bit.
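The split between the two callbacks, and why the engine's counters say nothing about S3 upload success, can be sketched in Python as follows. This is a simplified model under stated assumptions, not the plugin's C code: the return codes are stand-in strings, an in-memory list stands in for the store_dir, and `BufferingOutput`, `run_engine`, and the `upload` callable are hypothetical names.

```python
from collections import Counter

# Stand-in strings for the engine's return codes (not the real C constants).
FLB_OK, FLB_RETRY, FLB_ERROR = "FLB_OK", "FLB_RETRY", "FLB_ERROR"


class BufferingOutput:
    """Toy model of an output that buffers on flush and uploads on a timer."""

    def __init__(self):
        self.buffer = []     # stands in for buffer files in the store_dir
        self.uploaded = []   # chunks that reached the destination

    def flush(self, chunk):
        """Flush callback: only buffers the chunk, so it always reports OK."""
        self.buffer.append(chunk)
        return FLB_OK

    def timer(self, upload):
        """Timer callback: tries to upload buffered chunks.

        `upload` is a callable returning True on success; failed chunks
        stay in the buffer for a later attempt.
        """
        remaining = []
        for chunk in self.buffer:
            if upload(chunk):
                self.uploaded.append(chunk)
            else:
                remaining.append(chunk)
        self.buffer = remaining


def run_engine(output, chunks):
    """Count flush return values, as the engine does for its metrics."""
    counts = Counter()
    for chunk in chunks:
        counts[output.flush(chunk)] += 1
    return counts


output = BufferingOutput()
counts = run_engine(output, [b"a", b"b", b"c"])
output.timer(lambda chunk: False)  # simulate every upload failing
```

Even though every simulated upload fails, `counts` records three successes and no retries or errors, because the counters only see the flush callback's return value.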
- 1. The HTTP Monitoring interface output metrics are not meaningful for S3: AWS understands that this is non-ideal; we have opened an issue with a design that will allow S3 to manage its own output metrics.
- 2. You must use store_dir_limit_size to limit the space on disk used by S3 buffer files.
- 3. The original ordering of data inputted to Fluent Bit may not be preserved unless you enable preserve_data_ordering.
In Fluent Bit, all logs have an associated tag. The s3_key_format option lets you inject the tag into the S3 key using the following syntax:
$TAG => the full tag
$TAG[n] => the nth part of the tag (index starting at zero). This syntax is copied from the rewrite tag filter. By default, “parts” of the tag are separated with dots, but you can change this with
In the example below, assume the date is January 1st, 2020 00:00:00 and the tag associated with the logs in question is
With the delimiters as . and -, the tag will be split into parts as follows:
So the key in S3 will be
The Fluent Bit S3 output was designed to ensure that previous uploads will never be overwritten by a subsequent upload. Consequently, the s3_key_format supports time formatters, $UUID, and $INDEX. $INDEX is special because it is saved in the store_dir; if you restart Fluent Bit with the same disk, then it can continue incrementing the index from its last value in the previous run.
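To make the key format substitution more concrete, here is a minimal Python sketch of how $TAG, $TAG[n], $INDEX, and time formatters could be expanded. It is an approximation for illustration only: the function names, the tag `app-logs.prod`, the index value, and the key format are all hypothetical, and the plugin's real parsing may differ in edge cases.

```python
import re
from datetime import datetime
from typing import List


def split_tag(tag: str, delimiters: str = ".") -> List[str]:
    """Split a tag into parts on any of the delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]"
    return re.split(pattern, tag)


def expand_key(key_format: str, tag: str, when: datetime,
               index: int, delimiters: str = ".") -> str:
    """Expand time formatters, $TAG[n], $TAG, and $INDEX in a key format."""
    parts = split_tag(tag, delimiters)

    # Time formatters such as %Y are expanded first, before any tag text
    # is inserted into the key.
    key = when.strftime(key_format)

    # $TAG[n] is replaced before $TAG so the longer form is not clobbered.
    # An out-of-range n raises IndexError in this sketch.
    key = re.sub(r"\$TAG\[(\d+)\]", lambda m: parts[int(m.group(1))], key)
    key = key.replace("$TAG", tag)
    return key.replace("$INDEX", str(index))


# Hypothetical inputs: the tag splits into ["app", "logs", "prod"].
key = expand_key("/$TAG[2]/$TAG[0]/%Y/%m/%d/$INDEX",
                 tag="app-logs.prod",
                 when=datetime(2020, 1, 1, 0, 0, 0),
                 index=7,
                 delimiters=".-")
```

With these inputs the sketch produces `/prod/app/2020/01/01/7`.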
For files uploaded with the PutObject API, the S3 output requires that a unique random string be present in the S3 key. This is because many of the use cases for PutObject uploads involve a short time period between uploads such that a timestamp in the S3 key may not be unique enough between uploads. For example, if you only specify minute granularity timestamps in the S3 key, with a small upload size, it is possible to have two uploads that have timestamps set in the same minute. This "requirement" can be disabled with
There are three cases where the PutObject API is used:
- 1. When you explicitly set
- 2. On startup, when the S3 output finds old buffer files in the store_dir from a previous run and attempts to send all of them at once.
- 3. On shutdown, when the S3 output attempts to send all currently buffered data at once to prevent data loss.
Consequently, you should always specify $UUID somewhere in your S3 key format. Otherwise, if the PutObject API is used, S3 will append a random 8 character UUID to the end of your S3 key. This means that a file extension set at the end of an S3 key will have the random UUID appended to it. This behavior can be disabled with
Let's walk through this with an example. In the first case, we attempt to set a .gz extension without specifying $UUID.
In the case where pending data is uploaded on shutdown, if the tag was app, the S3 key in the S3 bucket might be:
The S3 output appended a random string to the "extension", since this upload on shutdown used the PutObject API.
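The effect on file extensions can be sketched in Python. This is only a model of the behavior described here, not the plugin's code: `put_object_key` and `random_suffix` are hypothetical names, the example keys are made up, and the `-object` prefix before the random characters is an assumption inferred from the example object key shown in the Apache Arrow section of this page.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def random_suffix(length: int = 8) -> str:
    """Return a random alphanumeric string (8 characters by default)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def put_object_key(key_format: str) -> str:
    """Model how a PutObject key is made unique.

    If the format contains $UUID, the random string goes exactly there.
    Otherwise it is appended to the end of the key, after any extension.
    The "-object" prefix is an assumption for illustration.
    """
    if "$UUID" in key_format:
        return key_format.replace("$UUID", random_suffix())
    return key_format + "-object" + random_suffix()


appended = put_object_key("/app/logs.gz")    # extension is no longer last
placed = put_object_key("/app/$UUID.gz")     # extension is preserved
```

In the first call the key ends with the random string rather than `.gz`; in the second the random string sits before the extension, so the key still ends with `.gz`.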
There are two ways of disabling this behavior. Option 1, use
Option 2, explicitly define where the random UUID will go in the S3 key format:
store_dir is used to temporarily store data before it is uploaded. If Fluent Bit is stopped suddenly, it will try to send all data and complete all uploads before it shuts down. If it cannot send some data, on restart it will look in the store_dir for existing data and will try to send it.
Multipart uploads are ideal for most use cases because they allow the plugin to upload data in small chunks over time. For example, a 1 GB file can be created from 200 chunks of 5 MB each. While the file size in S3 will be 1 GB, only 5 MB will be buffered on disk at any one point in time.
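The arithmetic behind this example can be checked with a few lines of Python. The sketch assumes decimal units (1 MB = 1,000,000 bytes) to match the round numbers used here, and `parts_needed` is a hypothetical helper name.

```python
MB = 1_000_000
GB = 1_000 * MB


def parts_needed(file_size_bytes: int, part_size_bytes: int) -> int:
    """Number of parts required to cover the file, rounding up."""
    return -(-file_size_bytes // part_size_bytes)  # integer ceiling division


parts = parts_needed(1 * GB, 5 * MB)
# Only one part's worth of data needs to be buffered at a time.
peak_buffer_bytes = 5 * MB
```

Here `parts` is 200, while the peak amount buffered on disk stays at one part (5 MB) rather than the full file size.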
There is one minor drawback to multipart uploads: the file and data will not be visible in S3 until the upload is completed with a CompleteMultipartUpload call. The plugin will attempt to make this call whenever Fluent Bit is shut down to ensure your data is available in S3. It will also store metadata about each upload in the store_dir, ensuring that uploads can be completed when Fluent Bit restarts (assuming it has access to persistent disk and the store_dir files will still be present on restart).
If you run Fluent Bit in an environment without persistent disk, or without the ability to restart Fluent Bit and give it access to the data stored in the store_dir from previous executions, some considerations apply. This might occur if you run Fluent Bit on AWS Fargate.
In these situations, we recommend using the PutObject API, and sending data frequently, to avoid local buffering as much as possible. This will limit data loss in the event Fluent Bit is killed unexpectedly.
The following settings are recommended for this use case:
Fluent Bit 1.7 adds a new feature called workers, which enables outputs to have dedicated threads. The s3 plugin has partial support for workers. The plugin can only support a single worker; enabling multiple workers will lead to errors or indeterminate behavior.
If you enable a single worker, you are enabling a dedicated thread for your S3 output. We recommend starting without workers, evaluating the performance, and then enabling a worker if needed. For most users, the plugin can provide sufficient throughput without workers.
Then, the records will be stored into the MinIO server.
In order to send records into Amazon S3, you can run the plugin from the command line or through the configuration file.
The s3 plugin can read the parameters from the command line through the -p argument (property), e.g.:
$ fluent-bit -i cpu -o s3 -p bucket=my-bucket -p region=us-west-2 -m '*' -f 1
In your main configuration file append the following Output section:
An example that uses PutObject instead of multipart:
Amazon distributes a container image with Fluent Bit and this plugin.
Our images are available in the Amazon ECR Public Gallery. You can download images with different tags using the following command:
docker pull public.ecr.aws/aws-observability/aws-for-fluent-bit:<tag>
For example, you can pull the image with the latest version with:
docker pull public.ecr.aws/aws-observability/aws-for-fluent-bit:latest
If you see errors for image pull limits, try logging in to public ECR with your AWS credentials:
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
You can use our SSM Public Parameters to find the Amazon ECR image URI in your region:
aws ssm get-parameters-by-path --path /aws/service/aws-for-fluent-bit/
Starting from Fluent Bit v1.8, the Amazon S3 plugin includes support for Apache Arrow. The support is currently not enabled by default, as it depends on a shared version of libarrow as a prerequisite.
To use this feature, FLB_ARROW must be turned on at compile time:
$ cd build/
$ cmake -DFLB_ARROW=On ..
$ cmake --build .
Once compiled, Fluent Bit can upload incoming data to S3 in Apache Arrow format. For example:
As shown in this example, setting the compression to arrow makes Fluent Bit convert the payload into Apache Arrow format.
The stored data is very easy to load, analyze, and process using popular data processing tools (such as Python pandas, Apache Spark, and TensorFlow). The following code uses pyarrow to analyze the uploaded data:
>>> import pyarrow.feather as feather
>>> import pyarrow.fs as fs
>>> s3 = fs.S3FileSystem()
>>> file = s3.open_input_file("my-bucket/fluent-bit-logs/cpu.0/2021/04/27/09/36/15-object969o67ZF")
>>> df = feather.read_feather(file)
>>> print(df.head())
date cpu_p user_p system_p cpu0.p_cpu cpu0.p_user cpu0.p_system
0 2021-04-27T09:33:53.539346Z 1.0 1.0 0.0 1.0 1.0 0.0
1 2021-04-27T09:33:54.539330Z 0.0 0.0 0.0 0.0 0.0 0.0
2 2021-04-27T09:33:55.539305Z 1.0 0.0 1.0 1.0 0.0 1.0
3 2021-04-27T09:33:56.539430Z 0.0 0.0 0.0 0.0 0.0 0.0
4 2021-04-27T09:33:57.539803Z 0.0 0.0 0.0 0.0 0.0 0.0