Manage Data with TTL (Time-to-live)

Overview of TTL

TTL (time-to-live) refers to the capability of having rows or columns moved, deleted, or rolled up after a certain interval of time has passed. While the expression "time-to-live" sounds like it only applies to deleting old data, TTL has several use cases:

  • Removing old data: no surprise, you can delete rows or columns after a specified time interval
  • Moving data between disks: after a certain amount of time, you can move data between storage volumes - useful for deploying a hot/warm/cold architecture
  • Data rollup: roll up your older data into various useful aggregations and computations before deleting it
note

TTL can be applied to entire tables or specific columns.

TTL Syntax

The TTL clause can appear after a column definition and/or at the end of the table definition. Use the INTERVAL clause to define a length of time (added to an expression of Date or DateTime data type). For example, the following table has two columns with TTL clauses:

CREATE TABLE example1 (
    timestamp DateTime,
    x UInt32 TTL now() + INTERVAL 1 MONTH,
    y String TTL timestamp + INTERVAL 1 DAY,
    z String
)
ENGINE = MergeTree
ORDER BY tuple()
  • The x column has a time to live of 1 month from now
  • The y column has a time to live of 1 day from the timestamp column
  • When the interval lapses, the column expires. ClickHouse replaces the column value with the default value of its data type. If all the column values in the data part expire, ClickHouse deletes this column from the data part in the filesystem.
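
For example, here is a minimal sketch of that behavior using the example1 table above. The inserted values are made up, and the final state assumes a merge has run (the note on OPTIMIZE further down shows how to force one):

-- timestamp is two days old, so the TTL on y (timestamp + INTERVAL 1 DAY) has lapsed
INSERT INTO example1 VALUES (now() - INTERVAL 2 DAY, 1, 'old value', 'keep me');

-- force a merge so the TTL is evaluated
OPTIMIZE TABLE example1 FINAL;

-- y should now hold the default value for String (an empty string), while z is untouched
SELECT x, y, z FROM example1;
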
note

TTL rules can be altered or deleted. See the Manipulations with Table TTL page for more details.
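
As a quick sketch of what those manipulations look like (see that page for the full syntax):

-- add or replace a table-level TTL rule
ALTER TABLE example1 MODIFY TTL timestamp + INTERVAL 1 WEEK;

-- remove the table-level TTL rule
ALTER TABLE example1 REMOVE TTL;

-- remove the TTL from a single column
ALTER TABLE example1 MODIFY COLUMN y REMOVE TTL;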

Triggering TTL Events

The deleting or aggregating of expired rows is not immediate - it only occurs during table merges. If you have a table that's not actively merging (for whatever reason), there are two settings that trigger TTL events:

  • merge_with_ttl_timeout: the minimum delay in seconds before repeating a merge with delete TTL. The default is 14400 seconds (4 hours).
  • merge_with_recompression_ttl_timeout: the minimum delay in seconds before repeating a merge with recompression TTL (rules that roll up data before deleting). Default value: 14400 seconds (4 hours).

So by default, your TTL rules will be applied to your table at least once every 4 hours. Just modify the settings above if you need your TTL rules applied more frequently.
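
For example, a minimal sketch of lowering that delay for a single table (the same setting can also go in the SETTINGS clause of CREATE TABLE):

-- consider TTL merges for this table roughly every hour instead of every 4 hours
ALTER TABLE example1 MODIFY SETTING merge_with_ttl_timeout = 3600;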

note

Not a great solution (or one that we recommend you use frequently), but you can also force a merge using OPTIMIZE:

OPTIMIZE TABLE example1 FINAL

OPTIMIZE initializes an unscheduled merge of the parts of your table, and FINAL forces a reoptimization if your table is already a single part.

Removing Rows

To remove entire rows from a table after a certain amount of time, define the TTL rule at the table level:

CREATE TABLE customers (
    timestamp DateTime,
    name String,
    balance Int32,
    address String
)
ENGINE = MergeTree
ORDER BY timestamp
TTL timestamp + INTERVAL 12 HOUR
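
As a rough sketch of verifying the rule (made-up sample values, using the OPTIMIZE trick from the note above to force a merge):

-- one row older than 12 hours and one fresh row
INSERT INTO customers VALUES
    (now() - INTERVAL 1 DAY, 'old customer', 100, '123 Old St'),
    (now(), 'new customer', 200, '456 New Ave');

-- force a merge so the table-level TTL is evaluated
OPTIMIZE TABLE customers FINAL;

-- only the fresh row should remain
SELECT * FROM customers;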

Removing Columns

Instead of deleting the entire row, suppose you want just the balance and address columns to expire. Let's modify the customers table and add a TTL for both columns to be 2 hours:

ALTER TABLE customers
MODIFY COLUMN balance Int32 TTL timestamp + INTERVAL 2 HOUR,
MODIFY COLUMN address String TTL timestamp + INTERVAL 2 HOUR
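
To confirm the column TTLs were applied to the table definition, a quick optional check:

-- the TTL expressions should now appear on the balance and address columns
SHOW CREATE TABLE customers;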

Implementing a Rollup

Suppose we want to delete rows after a certain amount of time but hang on to some of the data for reporting purposes. We don't want all the details - just a few aggregated results of historical data. This can be implemented by adding a GROUP BY clause to your TTL expression, along with some columns in your table to store the aggregated results.

Suppose in the following hits table we want to delete old rows, but hang on to the sum and maximum of the hits column before removing the rows. We will need fields to store those values in, and we will need to add a GROUP BY clause to the TTL clause that rolls up the sum and maximum:

CREATE TABLE hits (
    timestamp DateTime,
    id String,
    hits Int32,
    max_hits Int32 DEFAULT hits,
    sum_hits Int64 DEFAULT hits
)
ENGINE = MergeTree
PRIMARY KEY (id, toStartOfDay(timestamp), timestamp)
TTL timestamp + INTERVAL 1 DAY
    GROUP BY id, toStartOfDay(timestamp)
    SET
        max_hits = max(max_hits),
        sum_hits = sum(sum_hits);

Some notes on the hits table:

  • The GROUP BY columns in the TTL clause must be a prefix of the PRIMARY KEY, and we want to group our results by the start of the day. Therefore, toStartOfDay(timestamp) was added to the primary key
  • We added two fields to store the aggregated results: max_hits and sum_hits
  • Setting the default value of max_hits and sum_hits to hits is necessary for our logic to work, based on how the SET clause is defined
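
Putting it together, here is a rough sketch of the rollup in action. The inserted values are made-up sample data, and the final state assumes a merge has run after the rows expired:

-- insert sample rows; max_hits and sum_hits default to hits
INSERT INTO hits (timestamp, id, hits) VALUES
    (now() - INTERVAL 2 DAY, 'home', 100),
    (now() - INTERVAL 2 DAY, 'home', 400),
    (now() - INTERVAL 2 DAY, 'about', 50);

-- force a merge so the TTL GROUP BY rollup runs
OPTIMIZE TABLE hits FINAL;

-- each (id, day) group should collapse to one row, with max_hits holding
-- the maximum and sum_hits holding the sum of hits for that group
SELECT id, toStartOfDay(timestamp) AS day, max_hits, sum_hits
FROM hits
ORDER BY id, day;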

Implementing a hot/warm/cold architecture

note

If you are using ClickHouse Cloud, the steps in this lesson are not applicable. You do not need to worry about moving old data around in ClickHouse Cloud.

A common practice when working with large amounts of data is to move that data around as it gets older. Here are the steps for implementing a hot/warm/cold architecture in ClickHouse using the TO DISK and TO VOLUME clauses of the TTL command. (By the way, it doesn't have to be a hot and cold thing - you can use TTL to move data around for whatever use case you have.)

  1. The TO DISK and TO VOLUME options refer to the names of disks or volumes defined in your ClickHouse configuration files. Create a new file named my_system.xml (or any file name) that defines your disks, then define volumes that use your disks. Place the XML file in /etc/clickhouse-server/config.d/ to have the configuration applied to your system:
<clickhouse>
    <storage_configuration>
        <disks>
            <default>
            </default>
            <hot_disk>
                <path>./hot/</path>
            </hot_disk>
            <warm_disk>
                <path>./warm/</path>
            </warm_disk>
            <cold_disk>
                <path>./cold/</path>
            </cold_disk>
        </disks>
        <policies>
            <default>
                <volumes>
                    <default>
                        <disk>default</disk>
                    </default>
                    <hot_volume>
                        <disk>hot_disk</disk>
                    </hot_volume>
                    <warm_volume>
                        <disk>warm_disk</disk>
                    </warm_volume>
                    <cold_volume>
                        <disk>cold_disk</disk>
                    </cold_volume>
                </volumes>
            </default>
        </policies>
    </storage_configuration>
</clickhouse>
  2. The configuration above refers to three disks that point to folders that ClickHouse can read from and write to. Volumes can contain one or more disks - we defined a volume for each of the three disks. Let's view the disks:
SELECT name, path, free_space, total_space
FROM system.disks
┌─name──────┬─path─────────┬───free_space─┬──total_space─┐
│ cold_disk │ ./data/cold/ │ 179143311360 │ 494384795648 │
│ default   │ ./           │ 179143311360 │ 494384795648 │
│ hot_disk  │ ./data/hot/  │ 179143311360 │ 494384795648 │
│ warm_disk │ ./data/warm/ │ 179143311360 │ 494384795648 │
└───────────┴──────────────┴──────────────┴──────────────┘
  3. And…let's verify the volumes:
SELECT
volume_name,
disks
FROM system.storage_policies
┌─volume_name─┬─disks─────────┐
│ default     │ ['default']   │
│ hot_volume  │ ['hot_disk']  │
│ warm_volume │ ['warm_disk'] │
│ cold_volume │ ['cold_disk'] │
└─────────────┴───────────────┘
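
The steps below modify a table named my_table, which this lesson does not otherwise define. Purely as a hypothetical sketch (only the trade_date column is implied by the TTL rule in the next step; the other columns are invented), it could be created with the storage policy that contains the volumes above:

CREATE TABLE my_table (
    trade_date Date,
    symbol String,
    price Float64
)
ENGINE = MergeTree
ORDER BY (symbol, trade_date)
SETTINGS storage_policy = 'default'  -- the policy defined in the XML above
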
  4. Now we will add a TTL rule that moves the data between the hot, warm and cold volumes:
ALTER TABLE my_table
    MODIFY TTL
        trade_date TO VOLUME 'hot_volume',
        trade_date + INTERVAL 2 YEAR TO VOLUME 'warm_volume',
        trade_date + INTERVAL 4 YEAR TO VOLUME 'cold_volume';
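
The intro also mentioned TO DISK. As an alternative sketch, a rule can target a named disk directly instead of a volume (keep in mind MODIFY TTL replaces the table's TTL expression, so you would use one form or the other):

ALTER TABLE my_table
    MODIFY TTL
        trade_date + INTERVAL 4 YEAR TO DISK 'cold_disk';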
  5. The new TTL rule should materialize, but you can force it to make sure:
ALTER TABLE my_table
MATERIALIZE TTL
  6. Verify your data has moved to its expected disks. Using the system.parts table, check which disks the parts of my_table are on:

SELECT
name,
disk_name
FROM system.parts
WHERE (table = 'my_table') AND (active = 1)

The response will look like:

┌─name────────┬─disk_name─┐
│ all_1_3_1_5 │ warm_disk │
│ all_2_2_0   │ hot_disk  │
└─────────────┴───────────┘