Configuration

The syntax for running scratchdata is:

$ ./scratchdata config.yaml

The following sections describe all of the configuration options. Database-specific connection settings are covered in the Databases section.

The most important sections are api and workers.

REST API

Configuration options for the RESTful API. This is used for both inserting and querying data. The API is designed to return very quickly when new data is inserted.

api:
  enabled: true
  port: 3000

  # If this file exists, the /healthcheck endpoint will
  # return an error. This is used to force a server to report
  # an unhealthy status so it can be rotated out of a load balancer.
  health_check_path: ./healthcheck
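
The health-check file acts as a simple sentinel: while it exists, /healthcheck fails. A minimal sketch of how a deploy script might drain and restore a server this way (the path matches the config above; the helper names are hypothetical):

```python
from pathlib import Path

# Must match api.health_check_path in config.yaml
HEALTH_CHECK_PATH = Path("./healthcheck")

def drain() -> None:
    """Create the sentinel file so /healthcheck starts returning an
    error and the load balancer rotates this server out."""
    HEALTH_CHECK_PATH.touch()

def restore() -> None:
    """Remove the sentinel file so /healthcheck succeeds again."""
    HEALTH_CHECK_PATH.unlink(missing_ok=True)
```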

Workers

Workers are responsible for batch inserting data into the database.

workers:
  enabled: true

  # The number of threads consuming data and
  # batch inserting to the DB
  count: 1

  # Where to store temporary data on disk
  data_directory: ./data/worker
  free_space_required_bytes: 1000000000
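
free_space_required_bytes is a guard against filling the disk. An illustrative sketch (not the actual implementation) of the kind of check this option implies:

```python
import shutil

def has_enough_space(data_directory: str, free_space_required_bytes: int) -> bool:
    """True if the filesystem holding data_directory still has at least
    the required number of free bytes."""
    return shutil.disk_usage(data_directory).free >= free_space_required_bytes
```

With the settings above, a worker would need roughly 1 GB free before writing temporary data.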

Logging

logs:
  # By default (pretty: false) logs are emitted as JSON. Set
  # pretty: true to pretty-print them instead.
  pretty: true

  # panic, fatal, error, warn, info, debug, trace
  level: trace

Data Sink

When data is written via the API, it is buffered before being inserted into the database.

Memory

This data sink receives data and immediately sends it to the destination. It is most commonly used for local testing.

data_sink:
  type: memory

Filesystem

Data is buffered in a directory and uploaded to the database in bulk once any limit is reached: the number of rows in the file, the file size, or the file's age.

data_sink:
  type: filesystem
  settings:
    # Where we store intermediate data
    data: ./data/sink

    # How much free space is required (to avoid filling up the disk) 
    free_space_required_bytes: 1000000000

    # The API batch inserts data to the database. A new batch is
    # uploaded after max_age_seconds, or once max_size_bytes or
    # max_rows of data has accumulated, whichever comes first.
    max_age_seconds: 5
    max_size_bytes: 1000000
    max_rows: 10

    # Threads used to upload files to blob storage, where
    # data is staged before being inserted into the database
    workers: 4
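
The three limits combine as a "whichever comes first" rule. A hedged sketch of the flush decision, using the setting names above (the function itself is illustrative, not Scratch's code):

```python
import time
from typing import Optional

def should_flush(first_write_ts: float, size_bytes: int, rows: int,
                 max_age_seconds: int = 5, max_size_bytes: int = 1_000_000,
                 max_rows: int = 10, now: Optional[float] = None) -> bool:
    """A batch is uploaded once ANY limit is hit: age, size, or row count."""
    now = time.time() if now is None else now
    return (now - first_write_ts >= max_age_seconds
            or size_bytes >= max_size_bytes
            or rows >= max_rows)
```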

Queue

Scratch uses a queue to communicate metadata to workers. Note that the queue does not carry each individual record ingested; rather, it signals when data is ready for batch insert into the database.

Scratch comes with the following queue implementations:

Memory

This is an in-memory queue, mostly useful for local development.

queue:
  type: memory

SQS

queue:
  type: sqs
  settings:
    region: "us-east-1"
    access_key_id: "ACCESS_KEY_ID"
    secret_access_key: "SECRET_ACCESS_KEY"
    sqs: "https://sqs.us-east-1.amazonaws.com/.../..."

Blob Storage

This is where data is stored while waiting to be loaded into the database.

Memory

This stores raw bytes in memory, useful for local development.

blob_store:
  type: memory

S3

You may use any S3-compatible storage. We also recommend setting up lifecycle rules so stored data doesn't grow unboundedly (the bucket can double as a backup).

blob_store:
  type: s3
  settings:
    region: "us-east-005"
    access_key_id: "ACCESS_KEY_ID"
    secret_access_key: "SECRET_ACCESS_KEY"
    bucket: "bucketname"
    endpoint: "https://s3.us-east-005.backblazeb2.com"
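
Putting it together, a complete config.yaml combining the sections above might look like this (the filesystem sink with a memory queue and memory blob store is just one illustrative combination):

```yaml
api:
  enabled: true
  port: 3000
  health_check_path: ./healthcheck

workers:
  enabled: true
  count: 1
  data_directory: ./data/worker
  free_space_required_bytes: 1000000000

logs:
  pretty: true
  level: info

data_sink:
  type: filesystem
  settings:
    data: ./data/sink
    free_space_required_bytes: 1000000000
    max_age_seconds: 5
    max_size_bytes: 1000000
    max_rows: 10
    workers: 4

queue:
  type: memory

blob_store:
  type: memory
```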