Firehose Stream Delivers to S3 in Uncompressed Format Despite Compression Enabled


I have a Lambda function that uses Direct PUT to send JSON strings to a Firehose stream, which delivers batches of records to S3, and I want these records delivered as compressed .gz files.

However, despite the stream's Destination settings > Compression for data records being set to GZIP, the files are delivered in plaintext, even though they are assigned a .gz extension. I can tell this because a) I can download the file from S3 and it opens as text with no modification, and b) gzip -d ~/path/my_file.gz returns gzip: /path/my_file.gz: not in gzip format.

Why would Firehose deliver the data uncompressed even though compression is enabled? Am I missing something?

Code:

Lambda:

import json
import boto3

firehose = boto3.client("firehose")

my_stream_name = "my-stream-name"  # placeholder for the actual stream name

record = {'field_1': 'test'}               # dict/json
record_string = json.dumps(record) + '\n'  # Firehose expects ndjson

response = firehose.put_record(
    DeliveryStreamName=my_stream_name,
    Record={ 'Data': record_string }
)

Firehose (Terraform):

resource "aws_kinesis_firehose_delivery_stream" "my_firehose_stream" {
  name        = my_stream_name
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = my_role_arn
    bucket_arn = my_bucket_arn

    prefix              = "my_prefix/!{partitionKeyFromQuery:extracted}/"
    error_output_prefix = "my_error_prefix/"

    buffering_size      = 64     # MB
    buffering_interval  = 900    # seconds
    compression_format  = "GZIP" # Compress as GZIP

    # Enable dynamic partitioning via metadata extraction
    processing_configuration {
      enabled = true
      processors {
        type = "MetadataExtraction"
        parameters {
          parameter_name  = "JsonParsingEngine"
          parameter_value = "JQ-1.6"
        }
        parameters {
          parameter_name  = "MetadataExtractionQuery"
          parameter_value = "{extracted:.extracted}"
        }
      }
    }

    dynamic_partitioning_configuration {
      enabled = true
    }
  }
}
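
As an aside, the MetadataExtraction processor runs the JQ query above against each incoming record, so records must actually contain the field the query references for the partition key to resolve. The query can be sanity-checked locally with jq (the field value below is a hypothetical placeholder):

echo '{"field_1": "test", "extracted": "partition-value"}' | jq -c '{extracted: .extracted}'
# => {"extracted":"partition-value"}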

asked Jan 16 at 22:06 by mmarion
  • How are you downloading the files before testing their format? If you are doing it via a web browser, it is possible that the browser is auto-decompressing them, because browsers know how to handle web pages that are gzip-compressed. To fully test what is happening, you should download the file via the AWS CLI and then check the file contents. You could also compare the size of the file shown in S3 vs the size on your local disk. – John Rotenstein, Jan 17 at 2:22
  • @JohnRotenstein You nailed it: it is compressed in S3, but it's being auto-decompressed when downloaded from the AWS UI in Chrome, which is what was causing my confusion. Wow. It's super unfortunate that nothing indicates this: Chrome downloads the file and keeps the .gz extension (suggesting it's still compressed) even though it's been decompressed and is no longer really gzip. You just solved a huge issue of mine. Thank you so much!! – mmarion, Jan 17 at 19:14

1 Answer


If you are downloading the file via a web browser, it is possible that the browser is auto-decompressing the file because browsers know how to handle web pages that are gzip-compressed.

To fully test what is happening, you should download the file via the AWS CLI and then check the file contents.
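
For example (a minimal sketch; the bucket and key names are placeholders), pull the object down with the AWS CLI and inspect its raw bytes, since every gzip file begins with the magic bytes 1f 8b:

# Download the raw object, bypassing any browser content decoding
aws s3 cp s3://my-bucket/my_prefix/my_file.gz ./my_file.gz

# Test gzip integrity (exits 0 silently if the file really is gzip)
gzip -t ./my_file.gz

# Inspect the first two bytes; a gzip file starts with 1f 8b
head -c 2 ./my_file.gz | xxd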

You could also compare the size of the file shown in S3 vs the size on your local disk.
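
Something like this (again with placeholder names) would expose the discrepancy if the browser had silently decompressed the file on download:

# Size of the object as stored in S3 (compressed)
aws s3api head-object --bucket my-bucket --key my_prefix/my_file.gz --query ContentLength

# Size of the copy on local disk
wc -c < ./my_file.gz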

See: Is GZIP Automatically Decompressed by Browser?
