ClickHouse Infrastructure & Setup

This document details the ClickHouse database infrastructure running in our GKE clusters. It covers the deployment architecture, tenant management, integration with Cloud SQL, and backup strategies.

Overview

ClickHouse is deployed on Kubernetes (GKE) using the Altinity ClickHouse Operator. It serves as our primary analytics engine.

  • Environment: Staging and Production (each in their own GCP Project).
  • Cluster Topology:
    • ClickHouse: 1 Shard, 2 Replicas (High Availability).
    • Keeper: 3 Replicas (Zookeeper replacement for coordination).
  • Operator: altinity-clickhouse-operator manages the cluster state.
  • Storage: Persistent Volumes using premium-rwo-cmek storage class.

Infrastructure Hierarchy

Our infrastructure is managed via Terraform across two main repositories:

  1. Base Infrastructure (deepdesk-gcp):
    • Deploys the ClickHouse Operator.
    • Deploys the ClickHouse Cluster (ClickHouseInstallation CRD) and Keeper (ClickHouseKeeperInstallation CRD).
    • Sets up the underlying Kubernetes resources (Service Accounts, PVCs, ConfigMaps).
    • Manages the global backup schedule.
  2. Account Provisioning (deepdesk-terraform):
    • Manages tenant-specific resources (Customers).
    • Creates Databases ({account}_analytics) and Users.
    • Configures Federation with Cloud SQL (PostgreSQL).
    • Manages Access Control Lists (ACLs) and Grants.

Architecture Components

1. The ClickHouse Cluster (deepdesk-gcp)

The cluster is defined in the clickhouse Terraform module. Key configurations include:

  • Resources: CPU and Memory requests are parameterized variables (e.g., cpu_requests, memory_requests).
  • Networking:
    • Exposed via an internal Nginx Ingress at clickhouse{region_subdomain}.deepdesk.com.
    • Service-to-Service communication is permitted via Istio Ambient mesh policies (namespace labeled istio.io/dataplane-mode: ambient).
  • Monitoring:
    • Prometheus metrics are exposed on port 9363.
    • ServiceMonitor resources automatically scrape metrics for the ClickHouse cluster and Keeper.

2. Cloud SQL Federation

Each ClickHouse pod runs a Cloud SQL Auth Proxy sidecar, implemented as a Kubernetes native sidecar (an initContainer with restartPolicy: Always).

  • Purpose: Allows ClickHouse to connect to our Cloud SQL (PostgreSQL) instances via localhost:5432.
  • Implementation:
    • A Named Collection is created for each tenant pointing to localhost:5432.
    • A federated database ${account}_admin_db is created using the PostgreSQL engine.
    • This allows us to query live data from Cloud SQL directly within ClickHouse for specific use cases.
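As a rough mental model, the tenant-provisioning step boils down to a PostgreSQL-engine DDL statement per account. The sketch below renders it from a shell helper; the role name, password placeholder, and the assumption that the Cloud SQL database shares the account name are illustrative, not the real Terraform-managed values:

```shell
# Hypothetical sketch of the federated-database DDL that tenant provisioning
# applies. 'federation_user' and '<password>' are placeholders, not the actual
# credentials (those live in Terraform / the named collection).
federation_ddl() {
  local account="$1"
  printf "CREATE DATABASE %s_admin_db ENGINE = PostgreSQL('localhost:5432', '%s', 'federation_user', '<password>');\n" \
    "$account" "$account"
}

federation_ddl acme
```

In practice the connection details come from the per-tenant Named Collection, so the credentials never appear in the DDL itself.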

3. ClickHouse Keeper

We use ClickHouse Keeper (a C++ implementation compatible with ZooKeeper) for cluster coordination.

  • It runs as a separate stateful set with 3 replicas to ensure quorum.
  • Configured via the ClickHouseKeeperInstallation CRD.

Tenant & Database Management (deepdesk-terraform)

When a new customer account is provisioned:

  1. Analytics Database: A native ClickHouse database named ${account}_analytics is created.
  2. Federated Database: A PostgreSQL-backed database named ${account}_admin_db is connected.
  3. Permissions:
    • The Terraform Provider for ClickHouse (clickhousedbops) manages grants.
    • Specific services (e.g., admin_celery_worker, backend, analytics_readonly) are granted granular permissions (SELECT, INSERT, ALTER DELETE) on these databases.
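Conceptually, the Terraform-managed grants reduce to ClickHouse GRANT statements like the ones rendered below. The privilege-to-service mapping shown here is an example for illustration only; the Terraform code in deepdesk-terraform is the source of truth:

```shell
# Illustrative sketch of the per-service grants Terraform manages for a tenant.
# The exact privileges per service are an assumption, not the real mapping.
render_grants() {
  local account="$1"
  printf 'GRANT SELECT, INSERT, ALTER DELETE ON %s_analytics.* TO backend;\n' "$account"
  printf 'GRANT SELECT, INSERT ON %s_analytics.* TO admin_celery_worker;\n' "$account"
  printf 'GRANT SELECT ON %s_analytics.* TO analytics_readonly;\n' "$account"
}

render_grants acme
```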

Backup Strategy

Backups are automated using a custom sidecar container running Supercronic.

  • Image: deepdesk-custom-docker-images/clickhouse-backup-cron.
  • Tool: clickhouse-backup utility.
  • Schedule: Runs hourly (0 * * * *).
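Under Supercronic, that schedule is a single crontab entry along these lines (the script path is an assumption about the image layout, not a verified location):

```
# Hourly backup job run by Supercronic; the script location is illustrative
0 * * * * /scripts/clickhouse-backups.sh
```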

Backup Workflow (clickhouse-backups.sh)

  1. Leader Only: The script executes only on the first replica (chi-cluster-main-0-0-0) to avoid duplication.
  2. Discovery: Iterates through all databases ending in _analytics.
  3. Storage: Backups are uploaded to Google Cloud Storage (GCS).
    • Bucket pattern: ${project}-backups-${account}.
  4. Full vs. Incremental:
    • Incremental: The default backup mode.
    • Full: Triggered if:
      • No previous full backup exists.
      • The latest full backup is > 24 hours old.
      • A "mutation" (cleanup/delete) event was detected since the last full backup (ensuring consistency).
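The full-vs-incremental decision above can be sketched as a small shell function. This is a simplification of what clickhouse-backups.sh does: the epoch-timestamp inputs and the 0/1 mutation flag are assumptions made to keep the sketch self-contained.

```shell
# Simplified sketch of the backup-mode decision. Inputs (all assumptions):
#   $1 last full backup time as a Unix epoch, empty if no full backup exists
#   $2 current time as a Unix epoch
#   $3 "1" if a cleanup/delete mutation was seen since the last full backup
choose_backup_mode() {
  local last_full_epoch="$1" now_epoch="$2" mutation_seen="$3"
  if [ -z "$last_full_epoch" ]; then
    echo "full"          # no previous full backup exists
  elif [ $(( now_epoch - last_full_epoch )) -gt 86400 ]; then
    echo "full"          # latest full backup is more than 24 hours old
  elif [ "$mutation_seen" = "1" ]; then
    echo "full"          # a mutation invalidated the incremental chain
  else
    echo "incremental"
  fi
}

choose_backup_mode "" "$(date +%s)" 0    # -> full (no previous full backup)
```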

Recovery

Backups are stored in GCS and can be restored using the clickhouse-backup tool.

  • List Backups: List the available backups by running the clickhouse-backup CLI inside the clickhouse-backup container on any ClickHouse cluster pod:

    kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
    GCS_BUCKET="${project}-backups-${account}" clickhouse-backup list remote
  • Restore Procedure: Restoring in a replicated environment requires specific steps to ensure consistency. You must restore the schema on all replicas but restore the data only on the first replica.

    In this example, we restore the ${account}_analytics database to a new temporary database ${account}_analytics_temp. This pattern allows for safe verification and manual recovery of specific data subsets without overwriting live data immediately.

    1. Restore Schema on Replica 0:

      kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
      GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --schema --tables="${account}_analytics.*" --restore-database-mapping="${account}_analytics:${account}_analytics_temp" <backup_name>
      exit
    2. Restore Schema on Replica 1:

      kubectl -n clickhouse exec -it chi-cluster-main-0-1-0 -c clickhouse-backup -- /bin/bash
      GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --schema --tables="${account}_analytics.*" --restore-database-mapping="${account}_analytics:${account}_analytics_temp" <backup_name>
      exit
    3. Restore Data on Replica 0:

      kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
      GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --tables="${account}_analytics.*" --restore-database-mapping="${account}_analytics:${account}_analytics_temp" <backup_name>
      exit

    Disaster Recovery (Destructive Restore)

    ⚠️ WARNING: This procedure DROPS existing tables and data before restoring. Use this ONLY in catastrophic scenarios where data is lost or corrupted and you need to revert to a backup completely.

    1. Restore Schema on Replica 1 (with --drop):

      kubectl -n clickhouse exec -it chi-cluster-main-0-1-0 -c clickhouse-backup -- /bin/bash
      GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --tables="${account}_analytics.*" --schema --drop <backup_name>
      exit
    2. Restore Schema & Data on Replica 0 (with --drop):

      kubectl -n clickhouse exec -it chi-cluster-main-0-0-0 -c clickhouse-backup -- /bin/bash
      GCS_BUCKET="${project}-backups-${account}" clickhouse-backup restore_remote --tables="${account}_analytics.*" --schema --data --drop <backup_name>
      exit
    3. Extras:

      Some account-specific configuration (users, grants, etc.) is managed via Terraform in the deepdesk-terraform repo.

      The safest way to ensure it is fully reapplied after a restore is to:

      • Go to the Cloud Build triggers page in the correct GCP project (e.g. staging)
      • Find the trigger named deploy-accounts-infra
      • Click Run
      • Set Branch to the tag of the latest release of the deepdesk-terraform repo for production environments, or to main for staging
      • Set the _ACCOUNT substitution variable to the account that was restored (${account})
      • Click Run trigger

      The above can also be achieved with the following gcloud command:

      gcloud builds triggers run \
        "$(gcloud builds triggers list --filter name=deploy-accounts-infra --format json --project ${project} | jq -r '.[0].id')" \
        --region global \
        --substitutions _ACCOUNT=${account} \
        $(
          if [ "${project}" = "deepdesk-nl-staging" ]; then
            echo -n "--branch main"
          else
            echo "--tag $(git ls-remote --tags --sort="v:refname" git@github.com:Deepdesk/deepdesk-terraform.git | tail -n1 | sed 's/.*\/\|\^{}//g')"
          fi
        ) \
        --project ${project}

Key Repositories

| Repository | Purpose | Path |
| --- | --- | --- |
| deepdesk-gcp | Base Infrastructure & Cluster Deployment | terraform/modules/clickhouse |
| deepdesk-terraform | Tenant DBs, Users & Grants | projects/analytics/, modules/database/ |
| deepdesk-custom-docker-images | Backup Cron Image | clickhouse-backup-cron/ |