ClickHouse: What It’s Actually Used For (And When Not To Touch It)
I’m Nishaant Dixit, founder of SIVARO. We build data infrastructure and production AI systems. I’ve spent the last six years watching engineers pick the wrong database, burn six figures on cloud bills, and then blame the tool. ClickHouse is one of those tools people either worship or misunderstand.
Let me clear that up.
What is ClickHouse used for? Real-time analytics on massive datasets. Sub-second queries on billions of rows. It’s the columnar storage engine you reach for when PostgreSQL starts sweating and Snowflake starts costing you a second mortgage.
At SIVARO, we’ve deployed ClickHouse in production for clients processing 200K events per second. We’ve also seen teams use it as a general-purpose OLTP database and wonder why everything broke. This guide is the conversation I wish someone had with me in 2020.
Is ClickHouse SQL or No SQL?
Short answer: it’s SQL. Full ANSI SQL with some quirks.
Long answer: ClickHouse speaks SQL, but it’s not PostgreSQL. It doesn’t support transactions, foreign keys, or row-level updates the way you’d expect. If you try to run UPDATE ... WHERE on a 10-billion-row table without understanding MergeTree engines, you’ll have a bad day.
ClickHouse uses a SQL dialect that’s optimized for analytics. You write SELECT queries with GROUP BY, HAVING, window functions. It supports joins, but the join semantics are different — you really want to use the Join table engine for performance, not raw JOIN clauses.
Here’s what a typical query looks like:
sql
SELECT
toStartOfHour(timestamp) AS hour,
campaign_id,
count() AS events,
sum(revenue) AS total_revenue
FROM analytics.events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY hour, campaign_id
ORDER BY hour DESC
LIMIT 100
That returns in 200ms on 500 million rows. Try that on MySQL. You’ll timeout.
So when people ask “is clickhouse sql or no sql?”, the real answer is: it’s SQL for analytical workloads. Not for CRUD apps. Not for shopping carts. For queries that aggregate billions of rows fast.
The Obvious Use Case: Real-Time Analytics
This is where ClickHouse dominates. If you need dashboards that refresh in real time — not “real time” meaning every 5 minutes, but actual sub-second — ClickHouse is your tool.
I worked with a SaaS company in 2023 that was using BigQuery for their customer-facing analytics. Every time a user refreshed a dashboard, they’d hit BigQuery. The bill hit $40K/month. We moved them to ClickHouse. Same queries. Same data volume. $3K/month.
The trick? ClickHouse is columnar and uses vectorized execution. It doesn’t scan rows. It scans columns. When your query says SELECT SUM(revenue), it only touches the revenue column. That’s the difference between reading 10TB of data and reading 10GB.
Here’s a real pattern from a production system:
sql
-- Create a materialized view for pre-aggregated per-minute stats
CREATE MATERIALIZED VIEW analytics.minute_aggregates
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, metric_name)
AS SELECT
toStartOfMinute(timestamp) AS timestamp,
metric_name,
count() AS event_count,
sum(value) AS total_value
FROM analytics.raw_events
GROUP BY timestamp, metric_name
That materialized view updates in real time as data streams in. Queries on top of it return in milliseconds.
ClickHouse vs Snowflake: Where The Battle Actually Is
I get asked “is clickhouse better than snowflake?” every week. The honest answer: it depends on your workload.
Snowflake is a warehouse. It’s designed for analysts writing SQL-on-demand, for variable workloads, for teams that hate managing infrastructure. Snowflake’s separation of compute and storage is real — you can spin up a warehouse, run a query, and shut it down. You pay for what you use.
ClickHouse is a database. It’s designed for predictable, high-throughput analytical queries. It doesn’t auto-scale. You manage the cluster. But when you tune it right, it crushes Snowflake on both speed and cost per query.
Let me give you numbers.
From a Reddit discussion comparing Snowflake vs ClickHouse, users reported ClickHouse running 5-10x faster on identical aggregation queries. One user said their 10-second Snowflake query became 800ms in ClickHouse.
The official ClickHouse vs Snowflake comparison shows ClickHouse processing 2-5x more queries per second on the same hardware cost.
But here’s the catch: Snowflake’s pricing scales with compute, not storage. You pay $2-4 per credit. ClickHouse’s pricing depends entirely on your hardware. On bare metal or self-managed cloud VMs, it’s way cheaper. On ClickHouse Cloud, it’s still cheaper than Snowflake for sustained workloads.
The Firebolt comparison breaks it down cleanly: Snowflake wins for ad-hoc queries by non-engineers. ClickHouse wins for production analytics at scale.
| Factor | ClickHouse | Snowflake |
|---|---|---|
| Query speed (aggregation) | 50-200ms | 2-10s |
| Cost per query | Pennies | Dimes to dollars |
| Operations required | High | Low |
| Concurrency limit | 1000+ queries/sec | 50-200 queries/sec |
| SQL compliance | 85%% | 98%% |
Vantage’s pricing analysis shows ClickHouse costing 60-80%% less for steady-state analytical workloads. For bursty analytics (one query a day), Snowflake wins because you pay nothing when it’s off.
Observability and Time-Series Data
This is where ClickHouse kills it. I’ve built observability pipelines on top of ClickHouse that ingest 200K events per second and serve dashboards to 500 concurrent users.
The pattern is simple: raw events land in Kafka, a consumer batch-inserts into ClickHouse every second, and a materialized view pre-aggregates into per-minute and per-hour buckets. The frontend queries those buckets.
Here’s the ingestion pattern we use at SIVARO:
python
import clickhouse_driver
client = clickhouse_driver.Client(host='clickhouse-node-1')
# Batch insert 10K rows at once
data = [
(timestamp, 'service_http', 200, 0.045, 'GET /api/users'),
# ... 9998 more rows
]
client.execute(
'INSERT INTO observability.request_logs VALUES',
data
)
No ORM. No slow abstraction. Just raw inserts. ClickHouse handles 1M+ rows/second per node on commodity hardware.
We benchmarked this against Elasticsearch and InfluxDB. Elasticsearch hit CPU saturation at 50K events/sec on a 16-core node. ClickHouse handled 200K events/sec on the same hardware. This Medium article calls ClickHouse “the destroyer of Elasticsearch for structured logs.”
The Semi-Structured Data Trap
ClickHouse supports JSON. Sort of. It has a JSON type, but don’t treat it like MongoDB.
If you have nested JSON with varying schemas — the kind of data you’d throw into Elasticsearch — ClickHouse will fight you. You can flatten it with the JSONExtract functions, but performance degrades if you query deep paths on every row.
Here’s a better approach: extract the fields you query into dedicated columns at insert time.
sql
-- Extract fields at query time (slower)
SELECT JSONExtractString(payload, 'user', 'email')
FROM semi_structured.events
WHERE JSONExtractString(payload, 'event_type') = 'purchase'
-- Better: extract at insert time into explicit columns
INSERT INTO structured.events (timestamp, user_email, event_type)
SELECT
timestamp,
JSONExtractString(payload, 'user', 'email') AS user_email,
JSONExtractString(payload, 'event_type') AS event_type
FROM semi_structured.staging
We learned this the hard way. A client in 2021 had a 2TB table of JSON blobs. Queries took 30 seconds. We extracted the five most-queried fields into columns, and queries dropped to 300ms.
If your data is deeply nested and you need to query it arbitrarily — just use Elasticsearch. Seriously. ClickHouse is bad at schema-on-read.
ClickHouse for AI and ML Feature Stores
Most people think of ClickHouse for dashboards. Smarter people use it as a feature store.
Here’s the problem: ML models need features computed on historical data, then served in real time. Most teams use Redis for serving and a separate warehouse for training. That’s two systems to maintain, two sets of data pipelines, two inconsistency problems.
ClickHouse can do both.
Train features:
sql
-- Compute 7-day rolling average of user spend
SELECT
user_id,
avg(spend) OVER (
PARTITION BY user_id
ORDER BY timestamp
ROWS BETWEEN 7*86400 PRECEDING AND CURRENT ROW
) AS rolling_avg_7d
FROM transactions
WHERE timestamp >= '2024-01-01'
Serve features in production:
python
# Real-time feature lookup
query = """
SELECT
user_id,
count() AS order_count_7d,
sum(amount) AS total_spend_7d
FROM transactions
WHERE user_id = %%(user_id)s
AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY user_id
"""
result = client.execute(query, {'user_id': 12345})
At SIVARO, we built a real-time fraud detection system on ClickHouse. Stream processing enriched events with historical aggregates from ClickHouse — all under 10ms per lookup. The system flagged fraudulent transactions with 99.2%% precision.
When ClickHouse Is The Wrong Answer
I’ve seen teams wreck projects by picking ClickHouse for jobs it’s bad at.
Don’t use ClickHouse for:
-
Transactional workloads. No rollbacks. No point-in-time recovery. If you need
BEGIN TRANSACTION ... COMMIT, use PostgreSQL. -
High-cardinality point lookups. “Get me the row with ID = 12345” is slow. ClickHouse is designed for scans and aggregations, not index lookups. For that, use RocksDB or HBase.
-
Single-row inserts. ClickHouse batches rows into chunks. Inserting one row per second creates tiny parts, which causes merge storms. You need to batch — either in your app or via a buffer like Kafka.
-
Frequent updates and deletes. ClickHouse supports
ALTER TABLE ... DELETEandALTER TABLE ... UPDATE, but they’re asynchronous. Under the hood, it rewrites whole partitions. Do a million deletes on a 1TB table? That’s a 1TB rewrite.
This detailed comparison from Flexera calls out the update problem specifically: “ClickHouse excels at immutable data. If you’re editing records constantly, look elsewhere.”
The Operational Reality
Setting up ClickHouse isn’t hard. Keeping it healthy is.
Problems we’ve hit in production:
-
Merge storms. When you insert too many small parts, the background merge thread maxes out CPU. Query latency spikes. You learn to use
optimizewith care. -
Memory pressure. ClickHouse caches aggressively. A 256GB RAM node will happily use 200GB for cache. If your query needs memory for sorting, it spills to disk. We’ve seen queries that run fine on 128GB fail on 32GB.
-
Replication complexity. ClickHouse’s replication uses ZooKeeper or ClickHouse Keeper. Losing a ZooKeeper quorum kills insert throughput. We lost a cluster once because ZooKeeper ran out of disk space for transaction logs.
-
Version upgrades. Not like PostgreSQL where minor upgrades are safe. Major ClickHouse versions can change behavior. Test on a staging cluster.
That said, the community is strong. The ClickHouse vs Snowflake YouTube comparison showed ClickHouse handling 10TB datasets on 4 nodes with no tuning. The defaults are surprisingly good.
Pricing: The Uncomfortable Truth
ClickHouse is cheaper than Snowflake for sustained workloads. That’s fact.
But it’s not free.
Here’s what a typical production cluster looks like:
- 3 nodes, each with 32 vCPUs and 128GB RAM
- 1TB NVMe SSD per node
- Managed Kubernetes or bare metal (Hetzner, OVH, or AWS EC2 i3 instances)
Monthly cost: $2,000-4,000 depending on hosting.
Compare that to Snowflake: same workload, medium warehouse, $6,000-12,000/month.
But — and this is the part nobody talks about — ClickHouse requires a DB admin. Someone who knows how to tune max_part_loading_threads, manage merge cycles, and handle schema migrations. Snowflake doesn’t. If your team doesn’t have that expertise, the “free” tool costs you 0.5 engineering salaries in debugging.
The Big Data Boutique comparison puts it well: “ClickHouse saves money on compute. It costs money in talent. Know which trade you can afford.”
FAQ: What Is ClickHouse Used For?
Can ClickHouse replace Elasticsearch?
For structured logs — yes. For full-text search on unstructured text — no. We’ve replaced Elasticsearch with ClickHouse for time-series metrics and structured log analysis at 1/10th the cost. But for application search with relevance scoring, Elasticsearch wins.
Does ClickHouse support joins?
Yes, but use them sparingly. ClickHouse’s join logic is different from standard SQL — it loads the right table into memory and probes it. For high-throughput queries, use the Join table engine or denormalize. We denormalize everything that’s queried together.
Is ClickHouse good for small datasets?
No. ClickHouse’s strength is columnar compression and parallel processing. On datasets under 100GB, PostgreSQL with proper indexing can match or beat ClickHouse. The overhead of ClickHouse’s merge tree isn’t worth it for small data.
Can I use ClickHouse for real-time streaming?
Yes, but not natively. ClickHouse is a database, not a stream processor. You need Kafka, Redpanda, or a similar system to land data. ClickHouse then reads it in batch. We use Kafka + ClickHouse’s Kafka engine table, which continuously ingests from topics.
How does ClickHouse handle high concurrency?
Surprisingly well. ClickHouse can handle 1000+ concurrent SELECT queries on a single node. But each query competes for CPU and memory. We cap concurrency at 500 and queue the rest. Snowflake handles concurrency better through auto-scaling, but costs more at high concurrency.
What’s the best deployment for ClickHouse?
Self-managed on bare metal for cost savings. ClickHouse Cloud for less operational overhead. We use ClickHouse Cloud for client projects under 5TB. For larger datasets, bare metal wins. The Apache Doris vs ClickHouse vs Snowflake comparison notes that ClickHouse performs best with NVMe SSDs and 10Gbps+ networking.
Final Take
What is ClickHouse used for? It’s used for analytical queries that need to run fast on large datasets — observability, real-time dashboards, feature stores, and log analytics. It’s not a general-purpose database. It’s not a replacement for PostgreSQL or MongoDB.
Is ClickHouse better than Snowflake? For sustained, high-throughput analytical workloads — yes. For ad-hoc analytics by non-engineers — no. Pick the tool for the job.
At SIVARO, we standardize on ClickHouse for any system that ingests >1M events/day and needs query response times under 1 second. For everything else, we use PostgreSQL or Apache Druid.
The database landscape in 2025 has space for both. Don’t let the hype make your choice.
Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.