This week in Databend #47

Databend is an open-source, elastic, and reliable Modern Cloud Data Warehouse. It offers blazing-fast queries and combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

  • transient fuse table (#5968)
  • support field comment (#5952)
  • stop parsing at insert statement (#6048)
  • configurable repr of float denormals in output formats (#6065)
  • integrate cluster select query with new processor (#4544)

optimizer

  • decorrelate EXISTS subquery (#6073)
  • support push down filter through cross apply (#6079)

functions

  • support ifnull (#5921)
  • support multi_if (#6039)
  • support date_sub (#6050)
  • support to_nullable and assume_not_null (#6055)
  • support coalesce (#5922)

clickhouse handler

  • support settings (#5945)
  • improve output format (#6027)
  • support set database (#6097)

new planner

  • show statements (#6013)
  • show users/roles statement (#6016)
  • migrate grant (#6049)
  • migrate revoke (#6066)
  • migrate copy (#6074)
  • migrate insert statement (#5897)
  • support query log for new planner (#6053)
  • support SELECT ... FROM ... { AT TIMESTAMP } (#6056)

Improvement

  • stop supporting multiple statements (#6052)
  • make uuid return uuid values for each row (#6114)
  • add more meta cluster status (#6083)
  • catch panics of HTTP handlers (#6090)
  • adjust table snapshot timestamp precision to microseconds (#6144)

Bug fixes

  • fix de_csv with escaped quotes (#6008)
  • consider NULL for binary op in type checker (#6043)
  • fix error of EXISTS subquery (#6073)
  • transient object storage IO operation fault handling (#6045)
  • remove a wrong cut parser in subquery (#6111)

Tips

Let's learn a weekly tip from Databend.

Cluster key

Cluster key is intended to improve query performance by physically clustering data together.

For example, when you set a column as the cluster key for a table, the table data will be physically sorted by that column. This maximizes query performance if most of your queries filter on that column.

Set a cluster key when creating a table

CREATE TABLE <name> ... CLUSTER BY ( <expr1> [ , <expr2> ... ] )

Change the cluster key for a table

ALTER TABLE <name> CLUSTER BY ( <expr1> [ , <expr2> ... ] )

Delete the cluster key for a table

ALTER TABLE <name> DROP CLUSTER KEY
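Putting the syntax together, here is a minimal sketch; the `orders` table, its columns, and the query are hypothetical, chosen only to illustrate the feature:

```sql
-- Create a table clustered by order_date, so rows with nearby dates are
-- stored together and date-range scans can prune more blocks.
CREATE TABLE orders (
    id BIGINT,
    order_date DATE,
    amount DOUBLE
) CLUSTER BY (order_date);

-- Queries that filter on the cluster key benefit from the physical ordering.
SELECT SUM(amount) FROM orders WHERE order_date = '2022-06-01';
```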

Learn more:

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, AngleNet, b41sh, BohuTANG, cadl, dantengsky,
drmingdrmer, everpcpc, fkuner, hantmac, junnplus, leiysky,
lichuang, mergify[bot], PsiACE, RinChanNOWWW, soyeric128, sundy-li,
TCeason, TennyZhuang, TianLangStudio, Veeupup, wfxr, wubx,
Xuanwo, xudong963, youngsofun, yuuch, zhang2014, zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We look forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience as you use Databend in your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #46

Databend is an open-source, elastic, and reliable Modern Cloud Data Warehouse. It offers blazing-fast queries and combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

  • WINDOW function (#5401)
  • new table engine: RANDOM (#5896)
  • clickhouse http handler support TsvWithNamesAndTypes (#5898)
  • add benchmark scripts of metasrv (#5865)
  • support remove stage files (#5788)
  • support struct data type (#5940)

optimizer

  • support some common optimizer rules to refine optimizer framework (#5877)
  • support predicate push down through join (#5914)
  • support constant folding (#5924)

new planner

  • support left outer join and right outer join (#5972)
  • support semi and anti join in new planner (#5869)
  • support set operators in parser and planner (#5833)
  • support table statements in new planner (#5907)
  • add DDL STAGE for new planner framework (#5821)
  • support rename database in new planner (#5887)
  • support alter/drop view in new planner (#5862 & #5920)
  • migrate CREATE/DROP ROLE statement to new planner (#5935)

Improvement

  • row base serializer (#5791)
  • renaming for better readability (#5879)
  • list/remove stage files from meta api (#5857)
  • use unboxed_simple to simplify GATs in async-trait (#5999)
  • add meta grpc client network metrics (#5978)
  • improve clickhouse http handler (#5933)
  • use micromarshal to make references clearer (#5974)

Build / Testing / CI

  • use fuse engine instead of memory engine in test (#5530)

Bug fixes

  • fix read quoted string (#5870)
  • correctly handle catalog in statements (#5909)
  • fix correlated subquery with joins (#5947)
  • server panic when exceeds max active sessions (#5928)
  • fix status cause mysql client hang (#5961)
  • ProcessorExecutorStream lost data (#5983)
  • fix incorrect return datatype of function if (#5980)

Tips

Let's learn a weekly tip from Databend.

Using Join with Databend

As you can see, one of Databend's major efforts in recent times has been the development of, and migration to, the new planner. We have implemented support for JOIN on the new planner to better accommodate multi-table queries.

To use JOIN, you must enable the new Databend planner first. To do so, run the following command in the SQL client:

set enable_planner_v2=1;

A join allows you to combine columns from two or more tables into a single result set. Databend supports the following join types:

  • Inner Join
  • Natural Join
  • Cross Join
  • Left Join
  • Right Join

Let's see a CROSS JOIN example

A cross join returns a result set that includes each row from the first table joined with each row from the second table.

SELECT select_list
FROM table_a
    CROSS JOIN table_b;

Imagine we have the following tables:

Table "vip_info": This table stores the VIP client information.

| Client_ID | Region    |
| --------- | --------- |
| 101       | Toronto   |
| 102       | Quebec    |
| 103       | Vancouver |

Table "gift": This table lists the gift options for the VIP clients.

| Gift      |
| --------- |
| Croissant |
| Donut     |
| Coffee    |
| Soda      |

The following command returns a result set that assigns each gift option to each VIP client:

select * from vip_info cross join gift;

Output:

| Client_ID | Region    | Gift      |
| --------- | --------- | --------- |
| 101       | Toronto   | Croissant |
| 101       | Toronto   | Donut     |
| 101       | Toronto   | Coffee    |
| 101       | Toronto   | Soda      |
| 102       | Quebec    | Croissant |
| 102       | Quebec    | Donut     |
| 102       | Quebec    | Coffee    |
| 102       | Quebec    | Soda      |
| 103       | Vancouver | Croissant |
| 103       | Vancouver | Donut     |
| 103       | Vancouver | Coffee    |
| 103       | Vancouver | Soda      |
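The other join types follow the same pattern but match rows on a condition instead of combining every pair. As a sketch, suppose there were also a hypothetical `choice` table recording which gift each client picked (this table is an assumption, not part of the example above); an inner join would keep only the clients present in both tables:

```sql
-- Inner join: return one row per client that has a matching row in choice.
SELECT vip_info.Client_ID, vip_info.Region, choice.Gift
FROM vip_info
    INNER JOIN choice ON vip_info.Client_ID = choice.Client_ID;
```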

Learn more:

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, b41sh, BohuTANG, dantengsky, doki23, everpcpc,
fkuner, hantmac, junnplus, Kikkon, leiysky, lichuang,
mergify[bot], PsiACE, RinChanNOWWW, soyeric128, sundy-li, TCeason,
Xuanwo, xudong963, ygf11, youngsofun, ZeaLoVe, zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We look forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience as you use Databend in your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #45

Databend is an open-source, elastic, and reliable Modern Cloud Data Warehouse. It offers blazing-fast queries and combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

  • undrop database (#5770)
  • support async insert mode to improve throughput (#5567)
  • support alter table cluster key (#5718)

meta

  • add schema API and unit tests for GC of data beyond the drop retention time (#5746)
  • protobuf message has to persist MIN_COMPATIBLE_VER (#5785)

functions

  • support function timezone() (#5840)
  • add function NULLIF (#5772)

new planner

  • introduce InterpreterFactoryV2 for new planner (#5729)
  • support udf (#5751)
  • CREATE VIEW (#5816)
  • CREATE DATABASE (#5804) / DROP DATABASE (#5846)
  • CREATE USER (#5802) / ALTER USER (#5823) / DROP USER (#5813)

Improvement

  • refactor location parse logic (#5790)
  • show tenant quota for current tenant (#5750)
  • support errorcode hint in new planner (#5756)
  • don't load credential while reading stage (#5783)
  • move Clickhouse HTTP handler to its own port (#5797)

new parser

  • support cross join (#5730)
  • flatten join lists (#5742)

Build / Testing / CI

  • enable logic test (#5836)

Bug fixes

  • fix compressed buf not consumed correctly (#5727)
  • support abort for pipeline executor stream (#5803)
  • lz4raw compression of zero len buffer (#5806)
  • fix server hang when sync work panic (#5814)

Tips

Let's learn a weekly tip from Databend.

Databend Supports Async Insert Mode

When thousands of clients concurrently insert a small batch of data, each insert will be executed as follows:

Parser -> Planner -> Interpreter -> Pipeline

This is inefficient because of I/O depth and poor cache locality.

To solve this problem, we buffer small inserts into batches on the server, which sacrifices a tiny amount of latency for better insert throughput, a smaller block count, and larger DataBlocks in storage.

After this change, inserts into the same table are still parsed and planned individually. The insert data is converted to DataBlocks and buffered. When certain conditions are triggered, the buffered DataBlocks are interpreted at once, which also benefits the pipelines.
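As an illustration, the workload this targets looks like many tiny inserts arriving from concurrent clients; the `metrics` table and its values below are made up for the sketch. With async insert, these are buffered server-side and flushed as one larger DataBlock:

```sql
-- Issued by many clients concurrently; each statement is still parsed and
-- planned on its own, but the resulting rows are buffered on the server
-- and written out together as a single larger DataBlock.
INSERT INTO metrics VALUES (1, 'cpu', 0.93);
INSERT INTO metrics VALUES (2, 'mem', 0.41);
INSERT INTO metrics VALUES (3, 'net', 0.07);
```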

Learn more:

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, b41sh, BohuTANG, dantengsky, devillove084, drmingdrmer,
everpcpc, fkuner, flaneur2020, hantmac, junnplus, leiysky,
lichuang, LiuYuHui, mergify[bot], RinChanNOWWW, soyeric128, sundy-li,
TCeason, Xuanwo, xudong963, ygf11, youngsofun, ZeaLoVe,
zhang2014, zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We look forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience as you use Databend in your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #44

Databend is an open-source, elastic, and reliable Modern Cloud Data Warehouse. It offers blazing-fast queries and combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

  • support correlated subquery (#5593)
  • support select ... at (#5617 & #5628)
  • exchange protocol version with client (#5645)
  • add decompress support for COPY INTO and streaming loading (#5655)

new parser

  • implement string unescape (#5638)
  • allow mysql-style hex number and single-item array (#5654)

stage related

  • copy into stage support (#5579)
  • add system.stages table and show stages (#5581)

call functions

  • add call stats functions (#5646)
  • add call function system$search_tables (#5643)

new planner

  • support explain for new planner (#5587)
  • support tuple in new planner (#5640)
  • support cross join (#5715)

Variant related

  • support variant group by (#5694)
  • support variant order by (#5668)

Improvement

  • refine table optimize (#5589)
  • parse integer to u64 (#5692)
  • support FixedKey u128, u256, u512 in group query (#5678)
  • introduce ScalarEvaluator to evaluate Scalar (#5689)
  • utilize with HashMethodFixedKeys in hash join (#5693)
  • record count of tables for a tenant in KV space (#5708)

Build / Testing / CI

  • add mock module and tests for out-of-retention-time data (#5707)

Bug fixes

  • retry while meeting error during load_credential (#5590)
  • deny root login from other hosts (#5588)

Tips

Let's learn a weekly tip from Databend.

COPY INTO Databend with COMPRESSION Option

After PR 5655, Databend started to support decompressed reads in COPY INTO and streaming loading.

Loading Compressed Files from Amazon S3

Try loading data from a gzip-compressed CSV file and inserting it into mytable.

COPY INTO mytable
  FROM s3://mybucket/data.csv.gz
  credentials=(aws_key_id='<AWS_ACCESS_KEY_ID>' aws_secret_key='<AWS_SECRET_ACCESS_KEY>')
  FILE_FORMAT = (type = "CSV" field_delimiter = ',' record_delimiter = '\n' 
  skip_header = 1 compression = GZIP) size_limit=10;

COMPRESSION Option

The COMPRESSION option is a string that represents the compression algorithm.

| Values        | Notes                                                           |
| ------------- | --------------------------------------------------------------- |
| `AUTO`        | Auto detect compression via file extensions                     |
| `GZIP`        |                                                                 |
| `BZ2`         |                                                                 |
| `BROTLI`      | Must be specified if loading/unloading Brotli-compressed files. |
| `ZSTD`        | Zstandard v0.8 (and higher) is supported.                       |
| `DEFLATE`     | Deflate-compressed files (with zlib header, RFC1950).           |
| `RAW_DEFLATE` | Deflate-compressed files (without any header, RFC1951).         |
| `NONE`        | Indicates that the files have not been compressed.              |
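Since the file in the earlier example already carries a `.gz` extension, `AUTO` can detect the codec from the file name instead of it being spelled out. A sketch mirroring that example (bucket and credential placeholders unchanged):

```sql
COPY INTO mytable
  FROM s3://mybucket/data.csv.gz
  credentials=(aws_key_id='<AWS_ACCESS_KEY_ID>' aws_secret_key='<AWS_SECRET_ACCESS_KEY>')
  FILE_FORMAT = (type = "CSV" field_delimiter = ',' record_delimiter = '\n'
  -- AUTO detects GZIP from the .gz extension
  skip_header = 1 compression = AUTO);
```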

Learn more:

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, BohuTANG, Chasen-Zhang, dantengsky,
devillove084, drmingdrmer, everpcpc, flaneur2020, hanyisong, junnplus,
leiysky, lichuang, mergify[bot], PsiACE, RinChanNOWWW, ryanrussell,
soyeric128, sundy-li, TCeason, Xuanwo, xudong963, youngsofun,
ZeaLoVe, zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We look forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience as you use Databend in your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #43

Databend is an open-source, elastic, and reliable Modern Cloud Data Warehouse. It offers blazing-fast queries and combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

  • add system$clustering_information function (#5426)
  • add statistics to TableMeta (#5476)
  • add metasrv time travel functions (#5468 & #5566)
  • snapshot timestamp & navigation (#5535)
  • undrop table & show history (#5562)

Migrate to new planner

  • translate subquery into apply operator (#5510)
  • common tree structure formatter for plan display (#5512)
  • support TRIM function in new planner (#5541)
  • support array literal in new planner (#5551)

New functions

  • support object_keys function (#5461)
  • support compare variant with other data types (#5463)
  • support variant max/min functions (#5525)
  • support variant as function (#5442)
  • add user function (#5584)

Improvement

  • replace todos in datavalues with ErrorCode (#5475)
  • add stage quota for tenant (#5575)
  • simplify catalog and meta (#5560)

hash join performance improvement

  • optimize hash join, ~6x performance improvement (#5497)
  • use DataBlock::gather_blocks in hash join, ~2x performance improvement in some cases (#5534)

Build / Testing / CI

  • fix bugs of logic test scripts and some cases (#5578)

Documentation

  • add i18n support with crowdin (#5545)

Bug fixes

  • fix retention aggregation coredump bug (#5450)
  • fix server hang when parallel execute query (#5482)
  • fix wrong output in hash join (#5538 & #5539)

Tips

Let's learn a weekly tip from Databend.

Do Conversion Funnel Analysis With Databend

Funnel analysis measures the number of unique users who have performed a set of actions; we use it to see drop-off and conversion in multi-step processes.

Create a Table

CREATE TABLE events(user_id BIGINT, event_name VARCHAR, event_timestamp TIMESTAMP);

Now we have a table with the following fields:

  • user_id - a unique identifier for the user
  • event_name - the type of the event, such as login, visit, cart, or purchase
  • event_timestamp - the timestamp at which the event occurred
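A few sample rows make the funnel concrete; the values below are made up for illustration:

```sql
-- User 1 completes login -> visit -> cart within an hour;
-- user 2 only reaches login -> visit.
INSERT INTO events VALUES
    (1, 'login', '2022-06-01 10:00:00'),
    (1, 'visit', '2022-06-01 10:10:00'),
    (1, 'cart',  '2022-06-01 10:20:00'),
    (2, 'login', '2022-06-01 11:00:00'),
    (2, 'visit', '2022-06-01 11:30:00');
```

With this data, the funnel query below should report user 1 at level 3 and user 2 at level 2.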

Funnel Analysis

The Databend WINDOW_FUNNEL function makes it easy and efficient to find out how far each user_id gets through the event chain within a sliding one-hour window.

SELECT
    level,
    count() AS count
FROM
(
    SELECT
        user_id,
        window_funnel(3600000000)(event_timestamp, 
        event_name = 'login', event_name = 'visit', 
        event_name = 'cart', event_name = 'purchase') AS level
    FROM events
    GROUP BY user_id
)
GROUP BY level ORDER BY level ASC;

Learn more:

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

ariesdevil, b41sh, BohuTANG, Chasen-Zhang, dantengsky, drmingdrmer,
everpcpc, fkuner, flaneur2020, hantmac, hanyisong, junnplus,
Kikkon, leiysky, lichuang, mergify[bot], soyeric128, sundy-li,
TCeason, Xuanwo, xudong963, ygf11, youngsofun, ZeaLoVe,
zhang2014, ZhiHanZ, zhyass, ZuoFuhong

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We look forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience as you use Databend in your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.