Manage Data with TTL (Time-to-live)
Overview of TTL
TTL (time-to-live) refers to the capability of having rows or columns moved, deleted, or rolled up after a certain interval of time has passed. While the expression "time-to-live" sounds like it only applies to deleting old data, TTL has several use cases:
- Removing old data: no surprise, you can delete rows or columns after a specified time interval
- Moving data between disks: after a certain amount of time, you can move data between storage volumes - useful for deploying a hot/warm/cold architecture
- Data rollup: rollup your older data into various useful aggregations and computations before deleting it
TTL can be applied to entire tables or specific columns.
TTL Syntax
The `TTL` clause can appear after a column definition and/or at the end of the table definition. Use the `INTERVAL` clause to define a length of time (the TTL expression must evaluate to a `Date` or `DateTime` value). For example, the following table has two columns with `TTL` clauses:
```sql
CREATE TABLE example1 (
    timestamp DateTime,
    x UInt32 TTL now() + INTERVAL 1 MONTH,
    y String TTL timestamp + INTERVAL 1 DAY,
    z String
)
ENGINE = MergeTree
ORDER BY tuple()
```
- The `x` column has a time-to-live of 1 month from now
- The `y` column has a time-to-live of 1 day from the `timestamp` column
- When the interval lapses, the column expires. ClickHouse replaces the column value with the default value of its data type. If all the column values in the data part expire, ClickHouse deletes the column from the data part in the filesystem.
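To make the column-expiration behavior concrete, here is a sketch (assuming the `example1` table above has been populated and a TTL-triggering merge has already run):

```sql
-- Sketch: after the y TTL (timestamp + 1 day) lapses and a merge runs,
-- expired y values read back as the String default ('') and expired
-- x values as the UInt32 default (0); z is never expired.
SELECT x, y, z
FROM example1;
```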
TTL rules can be altered or deleted. See the Manipulations with Table TTL page for more details.
Triggering TTL Events
The deleting or aggregating of expired rows is not immediate - it only occurs during table merges. If you have a table that's not actively merging (for whatever reason), there are two settings that trigger TTL events:
- `merge_with_ttl_timeout`: the minimum delay in seconds before repeating a merge with a delete TTL. The default is 14400 seconds (4 hours).
- `merge_with_recompression_ttl_timeout`: the minimum delay in seconds before repeating a merge with a recompression TTL (rules that recompress data with a different codec before it is deleted). The default is 14400 seconds (4 hours).
So by default, your TTL rules will be applied to your table at least once every 4 hours. Just modify the settings above if you need your TTL rules applied more frequently.
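For example, since these are MergeTree table settings, one way to apply TTL rules more frequently on a specific table is to lower its merge delays (the one-hour value here is just an illustration):

```sql
-- Lower both merge delays from the 4-hour default to 1 hour (values in seconds)
ALTER TABLE example1
    MODIFY SETTING
        merge_with_ttl_timeout = 3600,
        merge_with_recompression_ttl_timeout = 3600;
```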
Not a great solution (or one that we recommend you use frequently), but you can also force a merge using `OPTIMIZE`:

```sql
OPTIMIZE TABLE example1 FINAL
```

`OPTIMIZE` initializes an unscheduled merge of the parts of your table, and `FINAL` forces a reoptimization if your table is already a single part.
Removing Rows
To remove entire rows from a table after a certain amount of time, define the TTL rule at the table level:
```sql
CREATE TABLE customers (
    timestamp DateTime,
    name String,
    balance Int32,
    address String
)
ENGINE = MergeTree
ORDER BY timestamp
TTL timestamp + INTERVAL 12 HOUR
```
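A table-level TTL can also be made conditional with a `DELETE WHERE` clause. Here is a sketch (the table name and the zero-balance condition are hypothetical) that removes only the expired rows matching the condition:

```sql
-- Sketch: DELETE is the default TTL action; WHERE limits which
-- expired rows are removed (hypothetical condition for illustration)
CREATE TABLE customers_conditional (
    timestamp DateTime,
    name String,
    balance Int32,
    address String
)
ENGINE = MergeTree
ORDER BY timestamp
TTL timestamp + INTERVAL 12 HOUR DELETE WHERE balance = 0
```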
Removing Columns
Instead of deleting the entire row, suppose you want just the balance and address columns to expire. Let's modify the `customers` table and add a 2-hour TTL to both columns:
```sql
ALTER TABLE customers
MODIFY COLUMN balance Int32 TTL timestamp + INTERVAL 2 HOUR,
MODIFY COLUMN address String TTL timestamp + INTERVAL 2 HOUR
```
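A column TTL can also be dropped later; one way is the `REMOVE TTL` clause of `ALTER TABLE ... MODIFY COLUMN`:

```sql
-- Drop the TTL from the balance column, leaving its data in place
ALTER TABLE customers
    MODIFY COLUMN balance REMOVE TTL;
```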
Implementing a Rollup
Suppose we want to delete rows after a certain amount of time but hang on to some of the data for reporting purposes. We don't want all the details - just a few aggregated results of historical data. This can be implemented by adding a `GROUP BY` clause to your `TTL` expression, along with some columns in your table to store the aggregated results.
Suppose in the following `hits` table we want to delete old rows, but hang on to the sum and maximum of the `hits` column before removing the rows. We will need fields to store those values, and we will need to add a `GROUP BY` clause to the `TTL` clause that rolls up the sum and maximum:
```sql
CREATE TABLE hits (
    timestamp DateTime,
    id String,
    hits Int32,
    max_hits Int32 DEFAULT hits,
    sum_hits Int64 DEFAULT hits
)
ENGINE = MergeTree
PRIMARY KEY (id, toStartOfDay(timestamp), timestamp)
TTL timestamp + INTERVAL 1 DAY
    GROUP BY id, toStartOfDay(timestamp)
    SET
        max_hits = max(max_hits),
        sum_hits = sum(sum_hits);
```
Some notes on the `hits` table:

- The `GROUP BY` columns in the `TTL` clause must be a prefix of the `PRIMARY KEY`, and we want to group our results by the start of the day. Therefore, `toStartOfDay(timestamp)` was added to the primary key
- We added two fields to store the aggregated results: `max_hits` and `sum_hits`
- Setting the default value of `max_hits` and `sum_hits` to `hits` is necessary for our logic to work, based on how the `SET` clause is defined
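As a hypothetical walkthrough of how the defaults and the `SET` clause interact (the id and values are made up): inserting only `timestamp`, `id`, and `hits` lets the defaults seed `max_hits` and `sum_hits`, and the TTL merge then collapses each `(id, day)` group:

```sql
-- The DEFAULT expressions copy hits into max_hits and sum_hits on insert
INSERT INTO hits (timestamp, id, hits) VALUES
    (now() - INTERVAL 2 DAY, 'home', 10),
    (now() - INTERVAL 2 DAY, 'home', 50),
    (now() - INTERVAL 2 DAY, 'home', 25);

-- Once the TTL merge runs, the three rows should collapse into a single
-- row for the group, with max_hits = 50 and sum_hits = 85
SELECT id, max_hits, sum_hits
FROM hits;
```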
Implementing a hot/warm/cold architecture
If you are using ClickHouse Cloud, the steps in this lesson are not applicable. You do not need to worry about moving old data around in ClickHouse Cloud.
A common practice when working with large amounts of data is to move that data around as it gets older. Here are the steps for implementing a hot/warm/cold architecture in ClickHouse using the `TO DISK` and `TO VOLUME` clauses of the `TTL` command. (By the way, it doesn't have to be a hot and cold thing - you can use TTL to move data around for whatever use case you have.)
- The `TO DISK` and `TO VOLUME` options refer to the names of disks or volumes defined in your ClickHouse configuration files. Create a new file named `my_system.xml` (or any file name) that defines your disks, then define volumes that use your disks. Place the XML file in `/etc/clickhouse-server/config.d/` to have the configuration applied to your system:
```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <default>
            </default>
            <hot_disk>
                <path>./hot/</path>
            </hot_disk>
            <warm_disk>
                <path>./warm/</path>
            </warm_disk>
            <cold_disk>
                <path>./cold/</path>
            </cold_disk>
        </disks>
        <policies>
            <default>
                <volumes>
                    <default>
                        <disk>default</disk>
                    </default>
                    <hot_volume>
                        <disk>hot_disk</disk>
                    </hot_volume>
                    <warm_volume>
                        <disk>warm_disk</disk>
                    </warm_volume>
                    <cold_volume>
                        <disk>cold_disk</disk>
                    </cold_volume>
                </volumes>
            </default>
        </policies>
    </storage_configuration>
</clickhouse>
```
- The configuration above refers to three disks that point to folders that ClickHouse can read from and write to. Volumes can contain one or more disks - we defined a volume for each of the three disks. Let's view the disks:
```sql
SELECT name, path, free_space, total_space
FROM system.disks
```

```
┌─name────────┬─path───────────┬───free_space─┬──total_space─┐
│ cold_disk   │ ./data/cold/   │ 179143311360 │ 494384795648 │
│ default     │ ./             │ 179143311360 │ 494384795648 │
│ hot_disk    │ ./data/hot/    │ 179143311360 │ 494384795648 │
│ warm_disk   │ ./data/warm/   │ 179143311360 │ 494384795648 │
└─────────────┴────────────────┴──────────────┴──────────────┘
```
- And…let's verify the volumes:
```sql
SELECT
    volume_name,
    disks
FROM system.storage_policies
```

```
┌─volume_name─┬─disks─────────┐
│ default     │ ['default']   │
│ hot_volume  │ ['hot_disk']  │
│ warm_volume │ ['warm_disk'] │
│ cold_volume │ ['cold_disk'] │
└─────────────┴───────────────┘
```
- Now we will add a `TTL` rule that moves the data between the hot, warm, and cold volumes:
```sql
ALTER TABLE my_table
MODIFY TTL
    trade_date TO VOLUME 'hot_volume',
    trade_date + INTERVAL 2 YEAR TO VOLUME 'warm_volume',
    trade_date + INTERVAL 4 YEAR TO VOLUME 'cold_volume';
```
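For context, `my_table` is assumed to already exist; the same rules could also be declared at creation time. A sketch (the column list is hypothetical):

```sql
-- Hypothetical definition of my_table with the move rules declared up front;
-- the table must use a storage policy that contains the target volumes
CREATE TABLE my_table (
    trade_date Date,
    symbol String,
    price Float64
)
ENGINE = MergeTree
ORDER BY (symbol, trade_date)
TTL trade_date TO VOLUME 'hot_volume',
    trade_date + INTERVAL 2 YEAR TO VOLUME 'warm_volume',
    trade_date + INTERVAL 4 YEAR TO VOLUME 'cold_volume'
SETTINGS storage_policy = 'default';
```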
- The new `TTL` rule is applied on subsequent merges, but you can force it to make sure:

```sql
ALTER TABLE my_table
MATERIALIZE TTL
```
- Using the `system.parts` table, verify that the parts of `my_table` have moved to their expected disks:
```sql
SELECT
    name,
    disk_name
FROM system.parts
WHERE (table = 'my_table') AND (active = 1)
```
The response will look like:

```
┌─name────────┬─disk_name─┐
│ all_1_3_1_5 │ warm_disk │
│ all_2_2_0   │ hot_disk  │
└─────────────┴───────────┘
```
Related Content
- Blog & Webinar: Using TTL to Manage Data Lifecycles in ClickHouse