Configuration
The syntax for running scratchdata is:
The following describes all of the configuration options. The Databases section describes DB-specific connection settings.
The most important sections are api and workers.
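Taken together, a minimal configuration file combining the sections described below might look like the following sketch. The values shown are illustrative examples, not defaults:

```yaml
# Illustrative minimal config; values are examples, not defaults.
api:
  enabled: true
  port: 3000
workers:
  enabled: true
  count: 1
  data_directory: ./data/worker
logs:
  pretty: true
  level: info
```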
REST API
Configuration options for the RESTful API. This is used for both inserting and querying data. The API is designed to return very quickly when new data is inserted.
api:
  enabled: true
  port: 3000
  # If this file exists, then the /healthcheck endpoint will
  # return an error. This is used to force a server to return
  # an unhealthy healthcheck, e.g. to rotate it out of a load balancer.
  health_check_path: ./healthcheck
Workers
Workers are responsible for batch inserting data into the database.
workers:
  enabled: true
  # The number of threads consuming data and
  # batch inserting to the DB
  count: 1
  # Where to store temporary data on disk
  data_directory: ./data/worker
  free_space_required_bytes: 1000000000
Logging
logs:
  # By default (pretty=false) logs are in JSON format. Otherwise we pretty-print.
  pretty: true
  # panic, fatal, error, warn, info, debug, trace
  level: trace
Data Sink
When data is written via the API, it's buffered before being inserted into the database.
Memory
This data sink receives data and immediately sends it to the destination. It is most commonly used for local testing.
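A sketch of the corresponding config, assuming the type value is `memory` (by analogy with the `filesystem` sink shown below; the exact value is an assumption):

```yaml
data_sink:
  type: memory  # assumed type value, mirroring "filesystem" below
```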
Filesystem
Data is buffered in a directory and uploaded to the database in bulk once one of three limits is reached: the number of rows in the file, the file size, or the file's age.
data_sink:
  type: filesystem
  settings:
    # Where we store intermediate data
    data: ./data/sink
    # How much free space is required (to avoid filling up the disk)
    free_space_required_bytes: 1000000000
    # The API batch inserts data to the database. It uploads a new
    # batch after max_age_seconds, or once max_size_bytes or max_rows
    # worth of data has accumulated, whichever comes first.
    max_age_seconds: 5
    max_size_bytes: 1000000
    max_rows: 10
    # Threads used to upload files to blob storage, where
    # data is staged before being inserted into the database
    workers: 4
Queue
Scratch uses a queue to communicate metadata to workers. Note that the queue is not used to transmit each individual piece of data ingested; rather, it signals when data is ready to be batch inserted into the database.
Scratch comes with the following queue implementations:
Memory
This is an in-memory queue, mostly useful for local development.
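A sketch of the corresponding config, assuming the type value is `memory` (by analogy with the `sqs` queue shown below; the exact value is an assumption):

```yaml
queue:
  type: memory  # assumed type value, mirroring "sqs" below
```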
SQS
queue:
  type: sqs
  settings:
    region: "us-east-1"
    access_key_id: "ACCESS_KEY_ID"
    secret_access_key: "SECRET_ACCESS_KEY"
    sqs: "https://sqs.us-east-1.amazonaws.com/.../..."
Blob Storage
This is where data is stored while waiting to be loaded to the database.
Memory
This stores raw bytes in memory, useful for local development.
S3
You may use any S3-compatible storage. We also recommend setting up lifecycle rules so data doesn't grow indefinitely (S3 can also serve as a backup).
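By analogy with the SQS queue settings above, an S3 blob-storage block might look like this sketch. The top-level key name and the `bucket` and `endpoint` setting names are assumptions, not confirmed by this page:

```yaml
blob_store:  # assumed section name
  type: s3
  settings:
    region: "us-east-1"
    access_key_id: "ACCESS_KEY_ID"
    secret_access_key: "SECRET_ACCESS_KEY"
    bucket: "my-scratch-bucket"  # assumed key name
    # For S3-compatible providers, an endpoint override would go here
    # (assumed key name)
    endpoint: "https://s3.us-east-1.amazonaws.com"
```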