Connecting S3 to Golden Suite

Set up the S3 connector for CSV / Parquet / JSON files in a bucket.

The S3 connector reads CSV, Parquet, or newline-delimited JSON (JSONL) files from a bucket. Use it when your data plane writes scheduled exports to S3 (a common pattern with data warehouses, Kafka sinks, and third-party tools).

Prereqs

  • An S3 bucket with read access
  • IAM credentials (access-key + secret) with s3:GetObject + s3:ListBucket on the target prefix
  • Either:
    • One file per source — point at s3://bucket/path/to/file.csv
    • A prefix — Golden Suite reads all matching files under the prefix

Setup

  1. Create an IAM user for Golden Suite — read-only on the bucket:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:ListBucket"],
          "Resource": [
            "arn:aws:s3:::your-bucket",
            "arn:aws:s3:::your-bucket/exports/*"
          ]
        }
      ]
    }
    
  2. Generate access keys for the IAM user
  3. /golden/sources → Add source → S3
  4. Paste credentials + bucket + prefix or file path
  5. Pick file format — CSV / Parquet / JSON
  6. Test connection — Golden Suite issues a HEAD request against the path (you can run the same check yourself; see below)
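
If the connection test fails, or you want to check the credentials yourself first, the boto3 snippet below exercises the same two permissions the connector needs. The bucket name, prefix, region, and keys are placeholders; adjust them to your setup.

  # Verify the Golden Suite IAM user can list the prefix and read an object.
  import boto3

  s3 = boto3.client(
      "s3",
      aws_access_key_id="AKIA...",   # the Golden Suite IAM user's access key
      aws_secret_access_key="...",
      region_name="eu-west-1",       # must match the bucket's region
  )

  # s3:ListBucket lets the connector enumerate files under the prefix.
  resp = s3.list_objects_v2(Bucket="your-bucket", Prefix="exports/", MaxKeys=5)
  keys = [obj["Key"] for obj in resp.get("Contents", [])]
  print("visible keys:", keys)

  # The HEAD check needs s3:GetObject on the object itself.
  if keys:
      head = s3.head_object(Bucket="your-bucket", Key=keys[0])
      print(keys[0], "is", head["ContentLength"], "bytes")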

Supported formats

  • CSV: UTF-8 expected. Header row required. Same parsing as the upload connector.
  • Parquet: best choice for large files. Columnar; schema is preserved.
  • JSONL (newline-delimited JSON): one JSON object per line. Objects must be flat; flatten nested structures in your source pipeline first.
  • JSON (single document): only for small files. The whole document loads into memory.
  • gzip-compressed: .csv.gz, .json.gz, etc. are transparently decompressed.
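
If you control the export side, a short pandas script can produce any of these formats. The sketch below is illustrative: file names are made up, and Parquet output assumes pyarrow (or fastparquet) is installed.

  import pandas as pd

  df = pd.DataFrame([
      {"id": 1, "name": "Ada",  "contact": {"email": "ada@example.com"}},
      {"id": 2, "name": "Alan", "contact": {"email": "alan@example.com"}},
  ])

  # JSONL rows must be flat: json_normalize turns contact.email into a column.
  flat = pd.json_normalize(df.to_dict(orient="records"))

  flat.to_parquet("customers.parquet")                           # Parquet
  flat.to_json("customers.jsonl", orient="records", lines=True)  # JSONL
  flat.to_csv("customers.csv.gz", index=False)                   # gzip CSV, inferred from .gz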

Cursor strategy

For ongoing ingest of a prefix that gets new files daily, the cursor is file modified-time. Golden Suite tracks the latest-seen LastModified per source and only reads files newer than that on subsequent runs.

If you replace files in place (same path, new contents), the cursor still detects the LastModified update and re-reads them. To force a re-read of files that have not changed, use the "Reset cursor" button on the source detail page.
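
As a rough sketch of this behavior (not Golden Suite's actual implementation), a modified-time cursor over a prefix looks like the following; the bucket, prefix, and stored cursor value are assumptions.

  from datetime import datetime, timezone
  import boto3

  s3 = boto3.client("s3")
  cursor = datetime(2026, 5, 11, tzinfo=timezone.utc)  # latest LastModified from the last run

  new_keys, newest = [], cursor
  paginator = s3.get_paginator("list_objects_v2")
  for page in paginator.paginate(Bucket="your-bucket", Prefix="exports/customers/"):
      for obj in page.get("Contents", []):
          # Files replaced in place get a fresh LastModified, so they qualify too.
          if obj["LastModified"] > cursor:
              new_keys.append(obj["Key"])
              newest = max(newest, obj["LastModified"])

  # After a successful run, persist `newest` as the next cursor. "Reset cursor"
  # is equivalent to winding this value back to the epoch.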

Sample bucket layout

A common pattern — daily exports under a date-partitioned prefix:

s3://your-bucket/exports/customers/
  ├── 2026-05-10/customers.parquet
  ├── 2026-05-11/customers.parquet
  ├── 2026-05-12/customers.parquet
  └── ...

Configure the source with prefix exports/customers/. The connector reads new daily files; combined with goldenmatch.dedupe_df it produces fresh golden records each day.
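
Putting it together, a daily refresh might look like the sketch below. It assumes pandas with s3fs installed for s3:// paths, and that goldenmatch is importable under that name with dedupe_df taking and returning a DataFrame; check the goldenmatch reference for the exact parameters.

  import pandas as pd
  import goldenmatch

  day = "2026-05-12"  # in practice, derived from the schedule / cursor
  df = pd.read_parquet(f"s3://your-bucket/exports/customers/{day}/customers.parquet")

  golden = goldenmatch.dedupe_df(df)
  print(f"{len(df)} rows in, {len(golden)} golden records out")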

Common gotchas

  • IAM permission scoping. Grant s3:GetObject on the specific prefix, not the whole bucket, following the principle of least privilege.
  • Endpoint URL for non-AWS S3. The connector supports custom endpoint_url for S3-compatible services (Cloudflare R2, MinIO, Backblaze). Same code path.
  • Region. If your bucket is in eu-west-1 and your IAM user defaults to us-east-1, sign requests with the right region — Golden Suite asks for region during setup.
  • Large files. A single CSV over 1 GB will OOM the parser. Use Parquet (columnar streaming) or split the data into multiple files (see the sketch after this list).
  • Permissions are checked per object, not just at list time. If the user can list the bucket but lacks GetObject on some files, the connector errors on the first read; the error message identifies the failing key.
  • KMS-encrypted objects. Add kms:Decrypt to the IAM policy for the bucket's KMS key.
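
For the large-file gotcha, one option is to split an oversized CSV into Parquet parts before upload, as in this sketch (chunk size and paths are examples):

  import pandas as pd

  # Stream the CSV in 500k-row chunks so the whole file is never held in memory.
  for i, chunk in enumerate(pd.read_csv("customers_big.csv", chunksize=500_000)):
      chunk.to_parquet(f"exports/customers/part-{i:04d}.parquet")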

Cost considerations

  • S3 GET requests: $0.0004 per 1,000. Even a daily ingest of 1,000 files costs about $0.0004/day, which is negligible.
  • S3 data transfer out: free within the same region; $0.09/GB cross-region. Pin Golden Suite's backend region to match the bucket (Enterprise tier supports region pinning).

Next steps