This week in Databend #53

Databend is an open-source, elastic, and reliable modern cloud data warehouse. It offers blazing-fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

RFC

meta

  • add raft store fail metrics (#6927)
  • metasrv unittest logs tracing event with customized formatter (#6874)

storage

  • enable bloom filter index (#6639)
  • support query hive partition table (#6906)

RBAC

  • add auth role to jwt (#6829)

format

  • pass FileSplit instead of Vec (#6873)

new expression

  • make chunk support scalar values (#6918)
  • migrate quote, reverse and ascii (#6907)
  • migrate trim functions to new expression framework (#6921)

Improvement

  • Dedicate "See you again" to the old planner (#6895)
  • Remove unused reload config (#6933)

new expression

  • add NullableColumn and NullableColumnBuilder (#6867)
  • use Scalar to store constant in Expr (#6923)

Build/Testing/CI

Bug fixes

  • don't expand null scalar to column (#6834)
  • fix mistake using try_cast for cast (#6879)
  • fix session drop early in clickhouse handler (#6888)
  • fix binder create table (#6899)
  • fix mysql return 'Empty Set' when result set is empty (#6841)
  • fix case expr with case operator equal (#6950)
  • fix cannot kill query in cluster mode (#6954)

Tips

Let's learn a weekly tip from Databend.

Call for Migrating Functions to the New Expression Framework

If you are interested in strongly typed type systems, or you'd like to try your hand at a database project, take a look at how Databend does it.

We are now migrating some old functions to the new expression framework. Would you like to try it out?

Background

Databend has recently been working on a new expression framework that will bring some interesting features:

  • Type checking.
  • Type-safe downcast.
  • Enum-dispatched columns.
  • Generic types.
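
To give a concrete feel for the last two items, here is a minimal toy sketch in Rust of enum-dispatched columns with a type-safe downcast. The types below are purely illustrative, not Databend's actual definitions.

// Illustrative only: a column is an enum over typed vectors,
// so dispatching on the value type is a single match.
#[allow(dead_code)]
enum Column {
    Int64(Vec<i64>),
    String(Vec<String>),
}

impl Column {
    // Type-safe downcast: a type mismatch yields None instead of a panic.
    fn as_int64(&self) -> Option<&[i64]> {
        match self {
            Column::Int64(v) => Some(v.as_slice()),
            _ => None,
        }
    }
}

fn main() {
    let col = Column::Int64(vec![1, 2, 3]);
    assert_eq!(col.as_int64().map(|v| v.len()), Some(3));
}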

How To

Legacy functions live in common/functions/src/scalars. The task is to migrate all of them to common/functions-v2/src/scalars/.

Usually you can reuse the logic of the previous implementation; it just needs some rewriting to fit the new framework.

Similarly, the legacy tests in common/functions/tests/it/scalars/ should also be migrated to common/functions-v2/tests/it/scalars/.

The new tests are written using goldenfile, so you can generate test cases easily instead of handwriting every expected output.

Example

A unary function OCTET_LENGTH can be defined in six lines in common/functions-v2/src/scalars/strings.rs.

OCTET_LENGTH returns the length of a string in bytes.

registry.register_1_arg::<StringType, NumberType<u64>, _, _>(
    "octet_length",
    FunctionProperty::default(),
    |_| None,               // domain calculation: None means no domain can be inferred
    |val| val.len() as u64, // the function body: byte length of the input string
);

LENGTH is a synonym for OCTET_LENGTH.

We can easily define function aliases with one line.

registry.register_aliases("octet_length", &["length"]);

Next, let's write some tests to make sure it works correctly.

Edit common/functions-v2/tests/it/scalars/string.rs.

fn test_octet_length(file: &mut impl Write) {
    run_ast(file, "octet_length('latin')", &[]);
    run_ast(file, "octet_length(NULL)", &[]);
    run_ast(file, "length(a)", &[(
        "a",
        DataType::String,
        build_string_column(&["latin", "кириллица", "кириллица and latin"]),
    )]);
}

Then register it in the test_string function:

#[test]
fn test_string() {
    let mut mint = Mint::new("tests/it/scalars/testdata");
    let file = &mut mint.new_goldenfile("string.txt").unwrap();

    ...
    test_octet_length(file);
    ...
}

Next, let's try to generate these test cases from the command line.

REGENERATE_GOLDENFILES=1 cargo test -p common-functions-v2 --test it
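
This regenerates the expected outputs under tests/it/scalars/testdata (here, string.txt from the Mint setup above). Review the generated diff, then commit it together with your code.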

Well done, we did it.

Learn More

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, BohuTANG, dantengsky, drmingdrmer, flaneur2020, gaoxinge, leiysky, lichuang, mergify[bot], PsiACE, RinChanNOWWW, sandflee, soyeric128, sundy-li, TCeason, Xuanwo, xudong963, ygf11, youngsofun, ZeaLoVe, zhang2014, zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We look forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in using it for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #52

Databend is an open-source, elastic, and reliable modern cloud data warehouse. It offers blazing-fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

logging

  • implement RFC The New Logging (#6845)

meta

  • add grant and revoke object API in ShareApi (#6724)
  • show share api (#6790)
  • add get_share_grant_objects API in ShareApi (#6798)

http handler

  • http handler return session state (#6846)

processor

  • implement explain fragments (#6851)
  • support distributed subquery in new cluster framework (#6666)

new planner

  • support order by expression (#6725)
  • enable delete stmt (#6768)
  • implement distributed query (#6440)
  • support push down predicates to storage (#6842)

storage

  • add support for COPY from https (#6691)
  • construct leaf column statistics (#6731)
  • support read nested columns (#6612)

new expression

  • support float32, float64 and Map(T) datatype (#6711 & #6838)
  • add serializable expression (#6712)
  • support user-defined CAST and TRY_CAST (#6663)
  • migrate Boolean functions to new expression framework (#6763)
  • migrate some String functions to new expression framework (progress of migration #6766)

Improvement

  • purge mapping data in DB/table GC (#6753)
  • fuzz with afl (#6793)
  • make auto-nullable and auto-vectorization independent (#6797)
  • refactor pipeline builder (#6820)

new planner

  • make PRESIGN work on the old planner by forwarding (#6713)
  • forward COPY and STAGE to new planner entirely (#6853)
  • migrate more new planners to be enabled (#6716)

Build/Testing/CI

Bug fixes

  • fix uncorrelated scalar subquery returns error results (#6720)
  • fix bug in FileSplitter skip header (#6732)
  • fix oom when returning large results in clickhouse tcp handler (#6789)
  • Any/Exists subquery in projection (#6809)

Tips

Let's learn a weekly tip from Databend.

COPY INTO <table> FROM REMOTE FILES

Now that #6691 has been merged, Databend supports loading data into a table from one or more remote files by their URL.

Syntax

COPY INTO [<database>.]<table_name>
FROM 'https://<site>/<directory>/<filename>'
[ FILE_FORMAT = ( TYPE = { CSV | JSON | PARQUET } [ formatTypeOptions ] ) ]

Example

This example loads data into the table ontime200 from the remote files ontime_2006_200.csv, ontime_2007_200.csv, and ontime_2008_200.csv:

copy into ontime200 from 'https://repo.databend.rs/dataset/stateful/ontime_200{6,7,8}_200.csv' FILE_FORMAT = (type = 'CSV' field_delimiter = ','  record_delimiter = '\n' skip_header = 1)

Of course, this example could also be written with a range pattern, [6-8], which expands to the same three file names:

copy into ontime200 from 'https://repo.databend.rs/dataset/stateful/ontime_200[6-8]_200.csv' FILE_FORMAT = (type = 'CSV' field_delimiter = ','  record_delimiter = '\n' skip_header = 1)

Learn More

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, aseaday, b41sh, BohuTANG, ClSlaid, dantengsky, leiysky, lichuang, mergify[bot], PsiACE, soyeric128, sundy-li, TCeason, TianLangStudio, Xuanwo, xudong963, ygf11, youngsofun, ZeaLoVe, zhang2014, zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We look forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in using it for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #51

Databend is an open-source, elastic, and reliable modern cloud data warehouse. It offers blazing-fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

  • add StageFileFormatType::Tsv (#6651)

meta

  • add share metasrv ShareApi (create_share, drop_share) (#6582)
  • add share metasrv ShareApi {add|remove}_share_account (#6656)
  • add share id to share name map, add share test suites (#6670)
  • adds cli command to send RPC to a running meta cluster (#6559)

hive catalog

  • support read boolean, float, double, date, array columns (#6629)

new planner

  • support create table as select (#6618)
  • optimize correlated subquery by decorrelation (#6632)

new expression

  • Implement domain calculation (#6649)
  • implement error report (#6661)
  • allow function to return runtime error (#6662)
  • support UInt32, UInt64, Int32, Int64 (#6660)
  • support conversion between arrow (#6674)

Improvement

  • support insert zero date and zero datetime (#6592)
  • Stage Copy use internal InputFormat (#6638)
  • decouple Table from QueryContext (#6665)
  • refactor pipeline builder (#6695)

new planner

  • stage/tables/databases DDL statements default to use new planner (#6648)
  • users/roles/grants DDL statements default to use new planner (#6687)

Build/Testing/CI

  • add ydb test cases (#6681)

Bug fixes

  • fix range delete panic and incorrect statistics (of in_memory_size) (#6609)
  • disable null values in join (#6616)
  • COPY should be able to run under new planner (#6624)
  • fix InSubquery returns error result (#6641)
  • fix variant map access filter (#6645)
  • adhoc fix session leak (#6672)
  • support read i96 timestamp from parquet file (#6668)
  • check parquet schema mismatch (#6690)

Tips

Let's learn a weekly tip from Databend.

Send & Receive gRPC Metadata

Databend allows you to send and receive gRPC (gRPC Remote Procedure Calls) metadata (key-value pairs) to and from a running meta service cluster using command-line interface (CLI) commands.

Create or Update a Key-Value Pair

./databend-meta --grpc-api-address "<grpc-api-address>" --cmd kvapi::upsert --key <key> --value <value>

Get Value by a Key

./databend-meta --grpc-api-address "<grpc-api-address>" --cmd kvapi::get --key <key>

Get Values by Multiple Keys

./databend-meta --grpc-api-address "<grpc-api-address>" --cmd kvapi::mget --key <key1> <key2> ...

List Key-Value Pairs by a Prefix

./databend-meta --grpc-api-address "<grpc-api-address>" --cmd kvapi::list --prefix <prefix>
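
For example, a quick round trip against a locally running meta service might look like this (the address 127.0.0.1:9191 and the key foo are placeholders; substitute your own):

./databend-meta --grpc-api-address "127.0.0.1:9191" --cmd kvapi::upsert --key foo --value bar
./databend-meta --grpc-api-address "127.0.0.1:9191" --cmd kvapi::get --key foo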

Learn More

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, BohuTANG, dantengsky, dependabot[bot], drmingdrmer, everpcpc, jiaoew1991, lichuang, mergify[bot], PsiACE, RinChanNOWWW, sandflee, soyeric128, sundy-li, Xuanwo, xudong963, youngsofun, yuuch, ZeaLoVe, zhang2014

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We look forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in using it for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #50

Databend is an open-source, elastic, and reliable modern cloud data warehouse. It offers blazing-fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

  • migrate window function to new pipeline (#6500)
  • add format diagnostic (#6530)
  • add date_trunc function (#6540)
  • support global setting (#6579)
  • add {db,table}_id map to {(tenant,db_name), (db_id, table_name)} in metasrv (#6607)
  • support ALL and SOME subquery, mark join with non-equi condition, and make tpch q20 happy (#6534)

presign statement

  • add presign statement in parser (#6513)
  • implement presign support (#6529)

storage

  • allow COPY FROM/INTO different storage services (#6573)
  • allow create stage for different services (#6602)

new expression

  • add new crate common-expression (#6576)
  • implement pretty print for Chunk (#6597)

Improvement

  • improve performances for group by queries (#6551)
  • try abandon internal parquet2 patches (#6067)
  • refactor interpreter factory for reuse interpreters code (#6566)
  • replace infallible (#6568)
  • remove old processor useless code (#6584)
  • pretty format for explain (#6585)

Build/Testing/CI

Bug fixes

  • big query hang with clickhouse (#6583)
  • catchup planner update in http handler (#6572)
  • fix load json value by csv format (#6548)
  • fix input format CSV (#6524)
  • fix show query with limit failing when planner v2 is enabled (#6381)
  • add watch txn unit test (#6526)
  • fix thread-safety issue when scheduling processors (#6533)
  • fix database and user related functions in planner v2 (#6473)

Tips

Let's learn a weekly tip from Databend.

Presign Statement

Generates a pre-signed URL for a staged file from the stage name and file path you provide. The pre-signed URL enables you to access the file through a web browser or an API request.

Syntax

PRESIGN [ { DOWNLOAD | UPLOAD } ] @<stage_name>/.../<file_name> [ EXPIRE = <expire_in_seconds> ]

Example

This example generates the pre-signed URL for downloading the file books.csv on the stage my_stage:

PRESIGN @my_stage/books.csv
+--------+---------+---------------------------------------------------------------------------------+
| method | headers | url                                                                             |
+--------+---------+---------------------------------------------------------------------------------+
| GET    | {}      | https://example.s3.amazonaws.com/books.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&... |
+--------+---------+---------------------------------------------------------------------------------+
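
Generating an upload URL works the same way. Here is a sketch that follows the syntax above (the EXPIRE value is an arbitrary choice; it sets the URL's lifetime in seconds):

PRESIGN UPLOAD @my_stage/books.csv EXPIRE = 7200

The returned URL should use the PUT method, so the file can be uploaded with an ordinary HTTP client.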

Learn More

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, BohuTANG, dantengsky, Defined2014, everpcpc, fkuner, gaoxinge, GrapeBaBa, jiaoew1991, lichuang, mergify[bot], PsiACE, soyeric128, sundy-li, TCeason, Xuanwo, xudong963, youngsofun, ZeaLoVe, zhang2014

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We look forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in using it for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #49

Databend is an open-source, elastic, and reliable modern cloud data warehouse. It offers blazing-fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

  • add call procedure for sync stage (#6344)
  • show settings support LIKE (#6394)
  • support all JsonEachRowOutputFormat variants (#6434)
  • support any, all and some subquery in parser (#6438)
  • support geo_to_h3 function (#6389)

storage

  • add xz compression support (#6421)
  • introduce system.tables_with_history (#6435)

new planner

  • migrate call statement to new planner (#6361)
  • support IS [NOT] DISTINCT FROM in planner_v2 (#6170)
  • support qualified column name with database specified (#6444)
  • support mark join, (not)in/any subquery, make tpch16 and tpch18 happy (#6412)

RFC

  • add Presign statement (#6503)

Improvement

  • add span info for TableReference (#6370)
  • improve optimize table compact (#6373)

refactor

  • split formats (#6443)
  • intro common-http to reduce duplicate code (#6484)

Build/Testing/CI

  • logic test with clickhouse handler (#6329)
  • enable semantic PRs and fully migrate to mergify and gh cli (#6386, #6419 and more)

Bug fixes

  • fix hashmap memory leak (#6354)
  • fix array inner type with null (#6407)
  • fix lost event in resize processor (#6501)

cluster

  • show progress correctly in cluster mode (#6253)
  • fix cannot destroy thread in cluster mode (#6436)

format

  • add NestedCheckpointReader for input format parser (#6385)
  • fix tsv deserialization (#6453)

Tips

Let's learn a weekly tip from Databend.

Monitoring Databend with Sentry

Sentry is a cross-platform application monitoring service with a focus on error reporting.

Databend supports error tracking and performance monitoring with Sentry.

Preparing

To get started, you need a Sentry DSN: sign up for Sentry or deploy your own instance, create a project, and copy the project's DSN.

Error Tracking

This uses only the sentry-log feature, which helps us with error tracking.

DATABEND_SENTRY_DSN="<your-sentry-dsn>" ./databend-query

[Screenshot: Sentry error tracking]

Performance Monitoring

Setting SENTRY_TRACES_SAMPLE_RATE to a value greater than 0.0 allows Sentry to perform trace sampling, which enables performance monitoring.

DATABEND_SENTRY_DSN="<your-sentry-dsn>" SENTRY_TRACES_SAMPLE_RATE=1.0 LOG_LEVEL=DEBUG ./databend-query

Note: Set SENTRY_TRACES_SAMPLE_RATE to a lower value in production.

[Screenshot: Sentry performance monitoring]

Learn More

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

ariesdevil, b41sh, BohuTANG, ClSlaid, dantengsky, databend-bot, drmingdrmer, everpcpc, flaneur2020, junnplus, leiysky, lichuang, mergify[bot], PragmaTwice, PsiACE, soyeric128, sundy-li, TCeason, Veeupup, Xuanwo, xudong963, youngsofun, ZeaLoVe, zhang2014, ZhiHanZ, zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We look forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in using it for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.