2024-11-04
Simplifying Queries with ClickHouse Dictionaries
If you didn’t know, one of the core technologies powering Aggregations.io is ClickHouse, and it has a LOT of features. Let’s dive into one that is both bespoke and lesser-known: the Dictionary.
What’s a Dictionary?
You can read the ClickHouse Docs for the technicalities - but in practice, a ClickHouse dictionary lets you build fast lookups kept in memory, commonly used to eliminate joins and/or to simplify (and centralize) complex CASE statements and queries.
What makes a ClickHouse dictionary really useful though:
- You can populate it from a number of sources like HTTP APIs, ClickHouse tables, remote DBs and (if you’re self-hosting) local files.
- You can use complex types as keys.
- You can build range-based dictionaries.
- You can configure a LIFETIME to automatically keep a dictionary updated (and manually force a reload if necessary).
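To make that concrete before we dig in: once a dictionary exists, a lookup is just a function call in your query instead of a join. A minimal sketch, using the dict_product_names dictionary and sales_data table defined later in this post:

-- Fetch the friendly name for each product_id straight from the in-memory dictionary
SELECT
    product_id,
    dictGet('dict_product_names', 'product_name', product_id) AS product_name
FROM sales_data
LIMIT 5;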
Sources
You have a number of options when it comes to sourcing your dictionary data. A common use case is retrieving the current “friendly” names for some identifiers. This is where the direct connection to other DBs like Postgres/MySQL comes in handy. Maybe you want to expose a simple HTTP endpoint that converts enum values from your production system to readable strings. Having your ClickHouse cluster talk directly to a datastore to pull in and cache data via a dictionary means you don’t need to worry about writing small reference data to ClickHouse and dealing with something like a ReplacingMergeTree and its semantics. There’s a ton of detail in the docs about the different sources available. The flexibility here means less effort wiring up workflows and an overall simpler, more maintainable system design.
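As a rough sketch of what a remote-sourced dictionary might look like, here’s a dictionary pulling enum labels from Postgres. The host, credentials, table, and column names are all placeholders, not a real setup:

CREATE DICTIONARY dict_enum_labels
(
    enum_value UInt64,
    label String
)
PRIMARY KEY enum_value
SOURCE(POSTGRESQL(
    host 'prod-db.internal'
    port 5432
    user 'readonly'
    password 'secret'
    db 'app'
    table 'enum_labels'
))
LAYOUT(FLAT())
LIFETIME(MIN 300 MAX 600);

ClickHouse connects to Postgres, pulls the table into memory, and refreshes it on the LIFETIME schedule - no ETL job required.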
Complex Keys
Most of the storage formats for a ClickHouse dictionary involve a UInt64 key (or a value that can be cast to a UInt64, like a Date), which makes sense: computers love talking in numbers. The flat format simply creates an array the size of your largest value and performs point lookups. But you don’t HAVE to try to reduce your values to a number. You can use any of the storage options with the complex_key_ prefix. These let you use strings, tuples, etc. as the primary key (there’s a concrete example after the list below). ClickHouse wrote a great walkthrough of choosing your storage layout, which includes a flow chart to help you decide.
Simply put:
- If your lookup value is monotonic(ish) and has fewer than 500k values – use flat
- If your lookup value has > 500k values or is random/not monotonic…
  - If you have many attributes – use hashed_array or complex_key_hashed_array, depending on whether the key can be cast to UInt64
  - If you have few attributes – use hashed or complex_key_hashed, depending on whether the key can be cast to UInt64
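For example, a complex-key dictionary keyed on a (country, sku) string pair might look like this - a rough sketch, where the regional_skus source table and its columns are made up for illustration:

CREATE DICTIONARY dict_regional_sku_names
(
    country String,
    sku String,
    display_name String
)
PRIMARY KEY country, sku
SOURCE(CLICKHOUSE(TABLE 'regional_skus'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 60 MAX 120);

-- Lookups pass the composite key as a tuple
SELECT dictGet('dict_regional_sku_names', 'display_name', ('US', 'SKU-1234'));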
Range Keys
Let’s say your business runs on a different fiscal calendar. You’ve got a few options when writing queries:
- Write a CASE statement that’s big and ugly but very efficient. Don’t forget to update it in every query, everywhere, as time marches on.
- Create a UDF/stored procedure to centralize the logic.
- Create a “Calendar” table that has the days or ranges forever into the future, and join to it. (Almost like a dictionary!)
None of these are great options. You’re either entering a maintainability nightmare or adding an unnecessary performance drag. With a range_hashed ClickHouse dictionary, you can create custom calendars with ease. Like everything in ClickHouse, it’s very flexible too - you can decide how to treat overlapping ranges and how to deal with “open” ranges on either the start or end, like when something is “current” because it has no range_end.
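Here’s a rough sketch of what a fiscal calendar could look like as a range_hashed dictionary - the fiscal_periods source table and its columns are hypothetical:

CREATE DICTIONARY dict_fiscal_calendar
(
    calendar_id UInt64,
    period_start Date,
    period_end Date,
    fiscal_quarter String
)
PRIMARY KEY calendar_id
SOURCE(CLICKHOUSE(TABLE 'fiscal_periods'))
LAYOUT(RANGE_HASHED())
RANGE(MIN period_start MAX period_end)
LIFETIME(MIN 3600 MAX 7200);

-- Which fiscal quarter does a given date fall into?
-- (calendar_id is a constant 1 in the source table; the range columns do the real work)
SELECT dictGet('dict_fiscal_calendar', 'fiscal_quarter', toUInt64(1), toDate('2024-11-04'));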
Dictionary LIFETIME
When creating a Dictionary, you provide a LIFETIME clause that defines how often the dictionary is refreshed/reloaded from its source. The LIFETIME is defined in seconds and can be either a fixed interval or a min/max range. Understanding the LIFETIME on your Dictionary will ensure you’ve got fresh data in your queries while also not overwhelming your source with unnecessary fetches.
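For reference, the two forms look like this:

LIFETIME(300)              -- a single value: the refresh interval in seconds
LIFETIME(MIN 300 MAX 360)  -- a range: ClickHouse picks a point within it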
Some things to consider:
- A LIFETIME(0) will keep a dictionary frozen / never updating.
- When providing a min/max range, ClickHouse will choose a “uniformly random” time in the range as a means to prevent downstream sources from being overloaded, especially in a distributed cluster.
There are two other ways to ensure you’re not doing extraneous work:
- You can provide an invalidate_query that can serve as a means to check if anything has actually changed before fetching the full source.
- You can provide an update_field and optionally an update_lag when using sources like ClickHouse, Postgres, ODBC, etc. These will be added into the outer WHERE of your query or replace a {condition} placeholder.
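Taking the first of those, an invalidate_query is just a cheap query whose result ClickHouse compares between refresh cycles; if the result hasn’t changed, the full reload is skipped. A minimal sketch, assuming the source table has an updated_at column:

SOURCE(CLICKHOUSE(
    TABLE 'products'
    invalidate_query 'SELECT max(updated_at) FROM products'
))

The update_field option goes a step further, only fetching rows that have changed since the last refresh.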
Let’s say your ClickHouse-sourced dictionary looks like this:
CREATE DICTIONARY dict_product_names (
    product_id UInt64,
    product_name String
)
PRIMARY KEY product_id
SOURCE(CLICKHOUSE(TABLE 'products'))
LAYOUT(FLAT())
LIFETIME(MIN 10 MAX 90);
But in reality, your product_name doesn’t change often, and there’s no sense in overwriting the whole dictionary all the time. Instead you can do:
CREATE DICTIONARY dict_product_names (
    product_id UInt64,
    product_name String
)
PRIMARY KEY product_id
SOURCE(CLICKHOUSE(TABLE 'products' update_field 'updated_at' update_lag 10))
LAYOUT(FLAT())
LIFETIME(MIN 10 MAX 90);
You’ve just told ClickHouse to go from a query like
SELECT product_id, product_name
FROM products
to
SELECT product_id, product_name
FROM products
WHERE updated_at > {last_update_time - 10s}
Refreshing Manually
If you need to, you can instruct ClickHouse to refresh a dictionary manually. Use the SYSTEM RELOAD DICTIONARY command. This is handy if you’re building/populating sources as part of an ETL and want to ensure the order of operations.
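For example, to force a reload of the dictionary from the earlier examples:

SYSTEM RELOAD DICTIONARY dict_product_names;

-- or reload every dictionary on the server
SYSTEM RELOAD DICTIONARIES;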
WHY?
Why would you want to do all this?
There are two key benefits: performance and maintainability. Performance is clear - you’re being very explicit with ClickHouse: “this set of data matters, keep it in memory, I am restricting my join(ish) semantics to a key lookup.” Maintainability is more interesting, because you could argue this adds complexity and more to keep in mind when writing queries and designing workflows. But anyone working in large-scale analytics knows how quickly “dimension” tables can balloon and how messy those ETLs can be to maintain. Similar to the benefits of ClickHouse Materialized Views - Dictionaries remove some of that plumbing for you.
But seriously, don’t sleep on the performance benefits. Here’s a super basic test - I know there are far better ways to do this; it’s just for demonstration purposes.
The Setup
We’re going to create two tables: one to hold random/fake sales data and one to hold product names.
-- Sales Data
CREATE TABLE sales_data
(
product_id UInt64,
event_time DateTime64,
sale_amount Float64
)
ENGINE = MergeTree
ORDER BY (product_id, event_time);
-- Insert some fake data, across 2 years, 2k product ids and the sale ranging from $1 to $100
INSERT INTO sales_data
SELECT
    randUniform(0, 2000),
    now() - randUniform(0, 60*60*24*365*2),
    randUniform(1, 100.0)
FROM numbers(10000000);
-- Product Names
CREATE TABLE product_names
(
product_id UInt64,
product_name LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (product_id);
-- Insert 5k random names
INSERT INTO product_names
SELECT number, randomPrintableASCII(12)
FROM numbers(5000);
-- Create Our Dictionary
CREATE DICTIONARY dict_product_names
(
    product_id UInt64,
    product_name String
)
PRIMARY KEY product_id
SOURCE(CLICKHOUSE(TABLE 'product_names'))
LAYOUT(FLAT())
LIFETIME(MIN 10 MAX 90);
Comparing Performance
This is not meant to be scientific. There are many ways to do this better. But for a simple look at some tangible benefits…
-- Joining the names table to get the top 10 total per product name, all time.
SELECT
product_name,
sum(sale_amount)
FROM sales_data JOIN product_names ON sales_data.product_id=product_names.product_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
-- ~ 370 - 600 ms
-- Using our Dictionary
SELECT dictGetOrNull('dict_product_names','product_name', product_id),
sum(sale_amount)
FROM sales_data
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
-- ~207 - 308ms
Scaling the sales_data table up to ~110 million rows, the results were more staggering. The join took ~10.3s whereas the dictionary was consistently ~2.1s - an 80% reduction. The query stats give you a hint at why. For the join you see
Read 110,005,000 rows (1.76 GB)
but with the Dictionary:
Read 110,000,000 rows (1.76 GB)
The join is “reading” the product names and having to deal with full join semantics… vs a simple array index lookup based on the product’s ID. Those extra 5k rows might seem innocuous, but they’re certainly not.
Our Implementation
Now you might be wondering: beyond analytics, how can you utilize ClickHouse Dictionaries in your product development? Here’s one of our more recent use cases.
When building AutoDocs, one of the bedrock features was allowing users to restrict properties for PII or other sensitive reasons. We implement this at multiple layers of the stack. Once you disable a property, we skip over it when parsing payloads - but it’s possible you’ve already sent some of that data into the system. We enqueue it for deletion/scrubbing, but since ClickHouse doesn’t “delete” or “update” like a normal relational database, that’s not immediate. In the interim, we want to ensure the data never even leaves ClickHouse (although we also have app-side checks when returning results, to cover all bases).
We built this as a Dictionary (per organization), so we can easily check whether a given descriptor set and/or property has been disabled when returning events for the Events Explorer or Version Explorer. We don’t have to be “careful” with joins on large, complex tables; we just throw a few dictHas() = FALSE predicates into our WHERE clauses to ensure we’re not looking at any disabled data.
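A simplified sketch of that kind of filter - the dictionary, table, and column names here are illustrative rather than our actual schema:

SELECT *
FROM events
WHERE dictHas('dict_disabled_properties', (descriptor_set_id, property_name)) = FALSE
LIMIT 100;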
Conclusion
If you’re interested in learning more about how to utilize ClickHouse Dictionaries, definitely read the docs. There are even more advanced options, like resolving GEO info or IP ranges, as well as hierarchical data and more. Dictionaries are one of ClickHouse’s more underappreciated features.
I wouldn’t call Dictionaries a silver bullet for removing joins, but they’re pretty close. You should test out your use case and understand the options. If you’re struggling with slower-than-optimal query performance and have a few slowly updating dimension tables, they’re definitely something to consider.
Have other ClickHouse topics you’d like a practical look at? Let us know!