2024-11-04
Simplifying Queries with ClickHouse Dictionaries
If you didn’t know, one of the core technologies powering Aggregations.io is ClickHouse, and it has a LOT of features. Let’s dive into one that is both bespoke and lesser-known: the Dictionary.
What’s a Dictionary?
You can read the ClickHouse Docs for the technicalities - but in practice, a ClickHouse dictionary lets you build fast lookups kept in memory, commonly used to eliminate joins and/or to simplify (and centralize) complex CASE statements and queries.
What makes a ClickHouse dictionary really useful though:
- You can populate it from a number of sources like HTTP APIs, ClickHouse tables, remote DBs and (if you’re self-hosting) local files.
- You can use complex types as keys.
- You can build range-based dictionaries.
- You can configure a LIFETIME to automatically keep a dictionary updated (and manually force a reload if necessary).
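To make that concrete before we dig in: once a dictionary exists, a lookup is just a function call in your query instead of a join. A minimal sketch, using the dict_product_names dictionary and sales_data table defined later in this post:

-- Fetch the friendly name for each product_id straight from the in-memory dictionary
SELECT
    product_id,
    dictGet('dict_product_names', 'product_name', product_id) AS product_name
FROM sales_data
LIMIT 5;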
Sources
You have a number of options when it comes to sourcing your dictionary data. A common use case is retrieving the current “friendly” names for some identifiers. This is where the direct connection to other DBs like Postgres/MySQL comes in handy. Maybe you want to expose a simple HTTP endpoint that converts enum values from your production system to readable strings. Having your ClickHouse cluster talk directly to a datastore to pull in and cache data via a dictionary means you don’t need to worry about writing small reference data to ClickHouse and dealing with something like a ReplacingMergeTree and its semantics. There’s a ton of detail in the docs about the different sources available. The flexibility here means less effort wiring up workflows and an overall simpler, more maintainable system design.
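As a rough sketch of what a remote-sourced dictionary might look like, here’s a dictionary pulling enum labels from Postgres. The host, credentials, table, and column names are all placeholders, not a real setup:

CREATE DICTIONARY dict_enum_labels
(
    enum_value UInt64,
    label String
)
PRIMARY KEY enum_value
SOURCE(POSTGRESQL(
    host 'prod-db.internal'
    port 5432
    user 'readonly'
    password 'secret'
    db 'app'
    table 'enum_labels'
))
LAYOUT(FLAT())
LIFETIME(MIN 300 MAX 600);

ClickHouse connects to Postgres, pulls the table into memory, and refreshes it on the LIFETIME schedule - no ETL job required.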
Complex Keys
Most of the storage formats for a ClickHouse dictionary involve a UInt64 key (or a value that can be cast to a UInt64, like a Date), which makes sense: computers love talking in numbers. The flat format simply creates an array the size of your largest value and performs point lookups. But you don’t HAVE to try to reduce your values to a number. You can use any of the storage options with the complex_key_ prefix. These let you use strings, tuples, etc. as the primary key (there’s a concrete example after the list below). ClickHouse wrote a great walkthrough of choosing your storage layout, which includes a flow chart to help you decide.
Simply put:
- If your lookup value is monotonic(ish) and has fewer than 500k values – use flat
- If your lookup value has > 500k values or is random/not monotonic…
  - If you have many attributes – use hashed_array or complex_key_hashed_array, depending on whether the key can be cast to UInt64
  - If you have few attributes – use hashed or complex_key_hashed, depending on whether the key can be cast to UInt64
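For example, a complex-key dictionary keyed on a (country, sku) string pair might look like this - a rough sketch, where the regional_skus source table and its columns are made up for illustration:

CREATE DICTIONARY dict_regional_sku_names
(
    country String,
    sku String,
    display_name String
)
PRIMARY KEY country, sku
SOURCE(CLICKHOUSE(TABLE 'regional_skus'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 60 MAX 120);

-- Lookups pass the composite key as a tuple
SELECT dictGet('dict_regional_sku_names', 'display_name', ('US', 'SKU-1234'));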
Range Keys
Let’s say your business runs on a different fiscal calendar. You’ve got a few options when writing queries:
- Write a CASE statement that’s big and ugly but very efficient. Don’t forget to update it in every query, everywhere, as time marches on.
- Create a UDF/stored procedure to centralize the logic.
- Create a “Calendar” table that has the days or ranges forever into the future, and join to it. (Almost like a dictionary!)
None of these are great options. You’re either entering a maintainability nightmare or adding an unnecessary performance drag. With a range_hashed ClickHouse dictionary, you can create custom calendars with ease. Like everything in ClickHouse, it’s very flexible too - you can decide how to treat overlapping ranges and how to deal with “open” ranges on either the start or end, like when something is “current” because it has no range_end.
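Here’s a rough sketch of what a fiscal calendar could look like as a range_hashed dictionary - the fiscal_periods source table and its columns are hypothetical:

CREATE DICTIONARY dict_fiscal_calendar
(
    calendar_id UInt64,
    period_start Date,
    period_end Date,
    fiscal_quarter String
)
PRIMARY KEY calendar_id
SOURCE(CLICKHOUSE(TABLE 'fiscal_periods'))
LAYOUT(RANGE_HASHED())
RANGE(MIN period_start MAX period_end)
LIFETIME(MIN 3600 MAX 7200);

-- Which fiscal quarter does a given date fall into?
-- (calendar_id is a constant 1 in the source table; the range columns do the real work)
SELECT dictGet('dict_fiscal_calendar', 'fiscal_quarter', toUInt64(1), toDate('2024-11-04'));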
Dictionary LIFETIME
When creating a Dictionary, you provide a LIFETIME clause that defines how often the dictionary is refreshed/reloaded from its source. The LIFETIME is defined in seconds and can be either a fixed interval or a min/max range. Understanding the LIFETIME on your Dictionary will ensure you’ve got fresh data in your queries while also not overwhelming your source with unnecessary fetches.
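For reference, the two forms look like this:

LIFETIME(300)              -- a single value: the refresh interval in seconds
LIFETIME(MIN 300 MAX 360)  -- a range: ClickHouse picks a point within it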
Some things to consider:
- A LIFETIME(0) will keep a dictionary frozen / never updating.
- When providing a min/max range, ClickHouse will choose a “uniformly random” time in the range as a means to prevent downstream sources from being overloaded, especially in a distributed cluster.
There are two other ways to ensure you’re not doing extraneous work:
- You can provide an invalidate_query that can serve as a means to check if anything has actually changed before fetching the full source.
- You can provide an update_field and optionally an update_lag when using sources like ClickHouse, Postgres, ODBC, etc. These will be added into the outer WHERE of your query or replace a {condition} placeholder.
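Taking the first of those, an invalidate_query is just a cheap query whose result ClickHouse compares between refresh cycles; if the result hasn’t changed, the full reload is skipped. A minimal sketch, assuming the source table has an updated_at column:

SOURCE(CLICKHOUSE(
    TABLE 'products'
    invalidate_query 'SELECT max(updated_at) FROM products'
))

The update_field option goes a step further, only fetching rows that have changed since the last refresh.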
Let’s say your ClickHouse-sourced dictionary looks like this:
CREATE DICTIONARY dict_product_names (
    product_id UInt64,
    product_name String
)
PRIMARY KEY product_id
SOURCE(CLICKHOUSE(TABLE 'products'))
LAYOUT(FLAT())
LIFETIME(MIN 10 MAX 90);
But in reality, your product_name doesn’t change often, and there’s no sense in overwriting the whole dictionary all the time. Instead you can do:
CREATE DICTIONARY dict_product_names (
    product_id UInt64,
    product_name String
)
PRIMARY KEY product_id
SOURCE(CLICKHOUSE(TABLE 'products' update_field 'updated_at' update_lag 10))
LAYOUT(FLAT())
LIFETIME(MIN 10 MAX 90);
You’ve just told ClickHouse to go from a query like
SELECT product_id, product_name
FROM products
to
SELECT product_id, product_name
FROM products
WHERE updated_at > {last_update_time - 10s}
Refreshing Manually
If you need to, you can instruct ClickHouse to refresh a dictionary manually. Use the SYSTEM RELOAD DICTIONARY command. This is handy if you’re building/populating sources as part of an ETL and want to ensure the order of operations.
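For example, to force a reload of the dictionary from the earlier examples:

SYSTEM RELOAD DICTIONARY dict_product_names;

-- or reload every dictionary on the server
SYSTEM RELOAD DICTIONARIES;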
WHY?
Why would you want to do all this?
There are two key benefits: performance and maintainability. Performance is clear - you’re being very explicit with ClickHouse: “this set of data matters, keep it in memory, I am restricting my join(ish) semantics to a key lookup.” Maintainability is more interesting, because you could argue this adds complexity and more to keep in mind when writing queries and designing workflows. But anyone working in large-scale analytics knows how quickly “dimension” tables can balloon and how messy those ETLs can be to maintain. Similar to the benefits of ClickHouse Materialized Views - Dictionaries remove some of that plumbing for you.
But seriously, don’t sleep on the performance benefits. Here’s a super basic test - I know there are far better ways to do this; it’s just for demonstration purposes.
The Setup
We’re going to create two tables: one to hold random/fake sales data and one to hold product names.
-- Sales Data
CREATE TABLE sales_data
(
product_id UInt64,
event_time DateTime64,
sale_amount Float64
)
ENGINE = MergeTree
ORDER BY (product_id, event_time);
-- Insert some fake data, across 2 years, 2k product ids and the sale ranging from $1 to $100
INSERT INTO sales_data
SELECT
    randUniform(0, 2000),
    now() - randUniform(0, 60*60*24*365*2),
    randUniform(1, 100.0)
FROM numbers(10000000);
-- Product Names
CREATE TABLE product_names
(
product_id UInt64,
product_name LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (product_id);
-- Insert 5k random names
INSERT INTO product_names
SELECT number, randomPrintableASCII(12)
FROM numbers(5000);
-- Create Our Dictionary
CREATE DICTIONARY dict_product_names
(
    product_id UInt64,
    product_name String
)
PRIMARY KEY product_id
SOURCE(CLICKHOUSE(TABLE 'product_names'))
LAYOUT(FLAT())
LIFETIME(MIN 10 MAX 90);
Comparing Performance
This is not meant to be scientific. There are many ways to do this better. But for a simple look at some tangible benefits…
-- Joining the names table to get the top 10 total per product name, all time.
SELECT
product_name,
sum(sale_amount)
FROM sales_data JOIN product_names ON sales_data.product_id=product_names.product_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
-- ~ 370 - 600 ms
-- Using our Dictionary
SELECT dictGetOrNull('dict_product_names','product_name', product_id),
sum(sale_amount)
FROM sales_data
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
-- ~207 - 308ms
Scaling the sales_data table up to ~110 million rows, the results were more staggering. The join took ~10.3s whereas the dictionary was consistently ~2.1s - an 80% reduction. The query stats give you a hint at why. For the join you see
Read 110,005,000 rows (1.76 GB)
but with the Dictionary:
Read 110,000,000 rows (1.76 GB)
The join is “reading” the product names and having to deal with full join semantics… vs a simple array index lookup based on the product’s ID. Those extra 5k rows might seem innocuous, but they’re certainly not.
Our Implementation
Now you might be wondering: beyond analytics, how can you utilize ClickHouse Dictionaries in your product development? Here’s one of our more recent use cases.
When building AutoDocs, one of the bedrock features was allowing users to restrict properties for PII or other sensitive reasons. We implement this at multiple layers of the stack. Once you disable a property, we skip over it when parsing payloads - but it’s possible you’ve already sent some of that data into the system. We enqueue it for deletion/scrubbing, but since ClickHouse doesn’t “delete” or “update” like a normal relational database, that’s not immediate. In the interim, we want to ensure the data never even leaves ClickHouse (although we also have app-side checks when returning results, to cover all bases).
We built this as a Dictionary (per organization), so we can easily check whether a given descriptor set and/or property has been disabled when returning events for the Events Explorer or Version Explorer. We don’t have to be “careful” with joins on large, complex tables; we just throw a few dictHas() = FALSE predicates into our WHERE clauses to ensure we’re not looking at any disabled data.
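A simplified sketch of that kind of filter - the dictionary, table, and column names here are illustrative rather than our actual schema:

SELECT *
FROM events
WHERE dictHas('dict_disabled_properties', (descriptor_set_id, property_name)) = FALSE
LIMIT 100;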
Conclusion
If you’re interested in learning more about how to utilize ClickHouse Dictionaries, definitely read the docs. There are even more advanced options, like resolving GEO info or IP ranges, as well as hierarchical data and more. Dictionaries are one of ClickHouse’s more underappreciated features.
I wouldn’t call Dictionaries a silver bullet for removing joins, but they’re pretty close. You should test out your use case and understand the options. If you’re struggling with slower-than-optimal query performance and have a few slowly updating dimension tables, they’re definitely something to consider.
Have other ClickHouse topics you’d like a practical look at? Let us know!