This week in Databend #63

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features โœจ

sharing

  • add shared layer to read data from sharing endpoint (#8044)

planner

  • top-down optimization (#8080)
  • convert outer join to inner join (#7967)
  • convert mark join to semi join (#8064)
  • support EXPLAIN MEMO statement (#8125)

storage

  • parallel index pruning (#7893)
  • oss backend support (#8066 & #8079)
  • try to get segments parallel (#8076)
  • table column statistics api (#8152)

new expression

  • migrate agg function: stddev, window_funnel to func-v2 (#8034)
  • migrate variant func to func-v2 (#7972)
  • numeric/string can cast to boolean when it's in filter (#8051)
  • support Date and Interval in new expression framework (#8058)
  • add take_chunks kernel (#8137)

Code Refactor ๐ŸŽ‰

cache

  • remove not used disk cache (#8166)

meta

  • move MetaStorageError to a standalone crate (#8161)

query

  • remove precision in timestamp type (#8154)

deps

  • move several common dependency declarations to workspace Cargo.toml (#8171)

Build/Testing/CI Infra Changes ๐Ÿ”Œ

ci

Thoughtful Bug Fix ๐Ÿ”ง

build

  • fix build failure on newer versions of protoc (#8047)

datavalues

  • only_null for nullable column should return false when column is empty (#8045)

parser

pipelines

  • right semi join returns wrong answer (#8083)
  • support interrupt join build side (#8037)

storage

  • using table owned dal operator instead of get from query context (#8113)
  • pruning not work as expected (#8131)

new expression

  • properly repeat Scalar::Null to column (#8111)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

EXPLAIN MEMO

Databend supports EXPLAIN MEMO statement to visualize memo structure now.

Syntax

EXPLAIN MEMO <query_statement>

explain memo

Learn More

Tips on PRESIGN

Since #6529 was merged, Databend supports PRESIGN to generate the pre-signed URL for a staged file.

PRESIGN [{ DOWNLOAD | UPLOAD }] @<stage_name>/.../<file_name> [ EXPIRE = <expire_in_seconds> ]

[{ DOWNLOAD | UPLOAD }] specifies that the pre-signed URL is used for download or upload. So once you have the corresponding pre-signed URL, you can use curl to download or upload the file.

# download
curl '<pre-signed-url>' -o <file_name>
# upload
curl -F '[email protected]<file_path>' -XPUT '<pre-signed-url>' 

Learn More

Remove Precision in Timestamp Type

Timestamp datatype now keep the precision which can be used to output the DateTime with a custom style.

The storage stores the same value for now() which represents instants as the number of microseconds (ยตs) since 1970-01-01 00:00:00+00. The precision is only for output style, so it's useless to keep it in datatype, we can use other functions like: to_yyyy_mm_dd to handle the style.

Learn More

Changlogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandyariesdevilb41shBohuTANGClSlaiddantengsky
andylokandyariesdevilb41shBohuTANGClSlaiddantengsky
drmingdrmerflaneur2020leiyskymergify[bot]PsiACErdettai
drmingdrmerflaneur2020leiyskymergify[bot]PsiACErdettai
RinChanNOWWWsandfleesoyeric128sundy-liTCeasonXuanwo
RinChanNOWWWsandfleesoyeric128sundy-liTCeasonXuanwo
xudong963youngsofunzhang2014
xudong963youngsofunzhang2014

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #62

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features โœจ

meta

  • add snapshot_id codec support (#8005)

planner

  • support update ast and planner (#7925)

query

  • jsonb parser optimize (#7947)
  • impl externalLocation for create table (#7789)
  • use common_jsonb::compare to compare variants (#8027)

storage

  • accept SESSION_TOKEN for AWS temporary credentials (#7946)

cluster

  • experimental distributed eval index (#7867)

new expression

  • migrate retention to v2 (#7952)
  • support constructing array and CAST(... AS VARIANT) (#7781)

Code Refactor ๐ŸŽ‰

settings

  • add prefix "format_" for format related settings (#7960)

new expression

  • reorder comparision funciton priority (#7991)

unit tests

  • use goldenfile in tests on system tables (#7978 & #7982)

Thoughtful Bug Fix ๐Ÿ”ง

legacy parser

  • use unicode_segmentation to truncate INSERT statement (#8011)

planner

  • find smallest column for pruning unused columns (#7955 & #7962
  • union needs more than one coercion type (#8007)

processor

  • try fix cannot kill optimize table (#7959)
  • try fix cannot kill drop table (#7963)

storage

  • shrink min max index (#7958)

new expression

  • fix the bug in logic expression or and add test cases (#7966)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

Better Index in Databend

In the past, Databend used a Bloom Filter (Bitmap Index) to check if a key was exists. Databend has enabled Bloom Index at the block level (#6639) and delivered an 8x read performance improvement in certain scenarios (index / data ~= 10%). Due to the implementation policy, it can take up very large amounts of storage space and has poor performance when it comes to point queries.

Now, Databend is making a number of improvements to enhance the insert and read capabilities of large data sets. Some of this work revolves around the index.

We introduced the Xor Fliter to replace the Bloom Filter (#7870), which in some scenarios gives about twice the performance improvement and requires very little data to be scanned. Recent work has also included distributed index pruning (#7867) and local parallel execution of pruning (especially the index pruning) (#7893) , which we believe will further improve cpu and network utilisation and hence performance.

Changlogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandyariesdevilb41shBohuTANGChasen-ZhangClSlaid
andylokandyariesdevilb41shBohuTANGChasen-ZhangClSlaid
dantengskydrmingdrmerjunaireleiyskymergify[bot]PsiACE
dantengskydrmingdrmerjunaireleiyskymergify[bot]PsiACE
RinChanNOWWWsandfleesundy-liXuanwoxudong963youngsofun
RinChanNOWWWsandfleesundy-liXuanwoxudong963youngsofun
zenriazhang2014zhyass
zenriazhang2014zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #61

Databend is an open source elastic and reliable Modern Cloud Data Warehouse, it offers blazing fast query and combines elasticity, simplicity, low cost of the cloud, built to make the Data Cloud easy.

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features โœจ

share

  • add share database (#7932)

meta

  • add catalog in TableMeta (#7835)

planner

  • support full outer join (#7783)
  • support right semi/anti join (#7909)

index

  • add XOR filter (#7860)
  • enable XOR filter index (#7870)

jsonb

  • add jsonb builtin functions build_array and compare (#7802)

query

  • check memory_size() for building data block (#7927)
  • support unload multi files into stage (#7910)

new expression

  • add try_downcast_builder for ValueType (#7838)
  • migrate min/max/any functions (#7787)
  • migrate aggregation covariance functions (#7926)

Code Refactor ๐ŸŽ‰

index

  • split index Filter trait to two trait: FilterBuilder and Filter (#7937)

interpreter

  • try remove InterceptorInterpreter (#7796)

query

new expression

  • manually vectorize not() and xor() (#7801)

Thoughtful Bug Fix ๐Ÿ”ง

storage

  • fix oom when recluster (#7791)
  • warmup segment cache during insertion #7803)
  • use shortcut path if filter vector is empty during pruning #7877)

compatibility

  • fix mysql pt-archive compatibility #7853)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

External Location for Fuse Engine

Problems often encountered in the past with cloud services include the fact that data files are often invisible to the user, making it very difficult to migrate back locally. In addition, there is a lack of tools to help exchange data in the Big Data ecosystem and to better exploit the value in the data. In response to this need Databend has proposed an ISSUE: External Location for Fuse Engine.

This is part of the plan for Databend as Lakehouse and once this support is complete, users will be able to use Databend to manage the lifecycle of their data and perform data governance tasks, as well as having access to key features including Data Share and Time Travel.

Learn more:

Databend Automated Testing with SQLancer

Databend Automated Testing with SQLancer is one of the Databend community's projects in the Open Source Promotion Plan 2022. @hanyisong helped us with this important work, which has now been merged into sqlancer/sqlancer repository.

SQLancer (Synthesized Query Lancer) is a tool to automatically test Database Management Systems (DBMS) in order to find logic bugs in their implementation.

Learn more at https://github.com/sqlancer/sqlancer/pull/568

Changlogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandyb41shBohuTANGClSlaiddantengskydrmingdrmer
andylokandyb41shBohuTANGClSlaiddantengskydrmingdrmer
everpcpcflaneur2020hanyisongleiyskylichuangmergify[bot]
everpcpcflaneur2020hanyisongleiyskylichuangmergify[bot]
PsiACERinChanNOWWWsoyeric128sundy-liTCeasonTennyZhuang
PsiACERinChanNOWWWsoyeric128sundy-liTCeasonTennyZhuang
wubxXuanwoxudong963youngsofunzhang2014zhyass
wubxXuanwoxudong963youngsofunzhang2014zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #60

Databend is an open source elastic and reliable Modern Cloud Data Warehouse, it offers blazing fast query and combines elasticity, simplicity, low cost of the cloud, built to make the Data Cloud easy.

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features โœจ

planner

  • support independent right join (#7634)

storage

  • delay start the worker for simple select hive query (#7595)
  • get all parquet file list for fuse engine (#7631)

query

  • unify pipeline for all inputs with format (#7613)
  • add security token support for AWS S3 (#7758)
  • implement copy from ipfs (#7729)
  • add and_filters function (#7712)
  • idempotent-copy file (#7719)
  • support jsonb format (#7522)
  • add select from share db and show tables from share db SQL support (#7640)

cluster

  • auto discover ip when ip is unspecified or loop back (#7617)

new expression

  • migrate regexp func to func-v2 (#7459)

Code Refactor ๐ŸŽ‰

meta

planner

  • Old Planner Never See Again (Part 2) (#7767)

interpreter

  • remove sendable stream in interpreter (#7582)

processor

  • save pipeline executor into query context (#7722)

query

  • improve in function (#7645)
  • push all filters to prewhere and prune columns for it (#7646)
  • streaming load use planner v2 (#7756)

new expression

  • make unit test goldenfile only display the used columns (#7739)

Thoughtful Bug Fix ๐Ÿ”ง

tracing

  • fix: Jaeger layer not filtered (#7621)

planner

  • fix EXPLAIN AST for invalid query (#7737)
  • fix left join returns wrong answer (#7662)

settings

  • fix server hang when concurrent requests http auth (#7733)

query

  • fix cast deterministic error (#7686)

cluster

  • add statistics receiver runtime (#7679)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

Designing and Using JSON in Databend

JSON (JavaScript Object Notation) is a commonly used semi-structured data type. With the self-describing schema structure, JSON can hold all data types. The JSON data shared by various platforms through open interfaces, and the public datasets and application logs stored in JSON format.

Databend supports structured data types, as well as JSON. This post dives deeply into the JSON data type in Databend.

Learn more at https://databend.rs/blog/json-datatypes

Changlogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandyb41shBohuTANGClSlaiddantengskydrmingdrmer
andylokandyb41shBohuTANGClSlaiddantengskydrmingdrmer
everpcpchanyisongleiyskylichuangmergify[bot]PsiACE
everpcpchanyisongleiyskylichuangmergify[bot]PsiACE
RinChanNOWWWsandfleesoyeric128sundy-liTCeasonXuanwo
RinChanNOWWWsandfleesoyeric128sundy-liTCeasonXuanwo
xudong963xychuyoungsofunzhang2014zhyass
xudong963xychuyoungsofunzhang2014zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #59

Databend is an open source elastic and reliable Modern Cloud Data Warehouse, it offers blazing fast query and combines elasticity, simplicity, low cost of the cloud, built to make the Data Cloud easy.

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features โœจ

RFC

  • Idempotent Copy (#7541)

meta

  • new RPC to echo client ip (#7538)
  • save table stage file info into meta, remove these data when truncate table (#7558)
  • add grpc API kv_api for replacing read_msg and write_msg. (#7605)

query

  • support distributed insert select (#7527)
  • support purge option in copy into table (#7518)

storage

  • add clustering_history system table (#7535)

metrics

  • abstract active instance counting (#7545)

new expression

  • support variant type (#7572)
  • migrate string func insert to func-v2 (#7564)

Code Refactor ๐ŸŽ‰

meta

  • remove redundant ActionHandler; move logic into MetaServiceImpl (#7555)

planner

  • Old Planner Never See Again (Part 1) (#7576)
  • make planner depends on TableContext trait (#7600)

query

  • replace recursion for fast-path insert with loop (#7530)
  • always list from OpenDAL instead of meta (#7547)
  • fix set operation err format (#7575)

new expression

  • codegen function registers (#7556)
  • extract number types (#7553)
  • improve floats (#7574)

Build/Testing/CI Infra Changes ๐Ÿ”Œ

  • add compat test for CopyOptions::purge (#7526)
  • run sqllogic test with docker image (#7650)

Thoughtful Bug Fix ๐Ÿ”ง

planner

  • change generated alias name for scalar expression to lowercase (#7525)

query

  • add missing EOI (#7534)

cluster

  • stop tasks in cluster when select limit (#7542)

storege

  • scan_progress should be incr before prewhere filter (#7566)

new expression

  • fix ceil return type (#7520)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

RFC: Idempotent Copy

When streaming copy stage files into a table, there is a chance that some files have already been copied, So it needs some ways to avoid duplicate copying files, make it an idempotent operation.

  • Save copy into table stage files meta information in meta service
  • Avoiding duplicates when copy stage files into a table

Learn more: https://databend.rs/doc/contributing/rfcs/idempotent-copy

Databend Perf with Ontime JOIN

With several recent patches, Databend can fully support Ontime JOIN queries, so you can now also see them in the Databend Perf dashboard.

  • Q5 JOIN

    SELECT Carrier, c, c2, c*100/c2 as c3 FROM( SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY Carrier) q JOIN ( SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY Carrier ) qq USING (Carrier) ORDER BY c3 DESC;
    
  • Q6 JOIN

    SELECT Carrier, c, c2, c*100/c2 as c3 FROM( SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year>=2000 AND Year<=2008 GROUP BY Carrier) q JOIN ( SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*) AS c2 FROM ontime WHERE Year>=2000 AND Year<=2008 GROUP BY Carrier ) qq USING (Carrier) ORDER BY c3 DESC;
    
  • Q7 JOIN

    SELECT Year, c1/c2 FROM( select Year, count(*)*100 as c1 from ontime WHERE DepDelay>10 GROUP BY Year) q JOIN ( select Year, count(*) as c2 from ontime GROUP BY Year ) qq USING (Year) ORDER BY Year;
    

View dashboard: https://perf.databend.rs/

Changlogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandyariesdevilBohuTANGChasen-Zhangdrmingdrmereverpcpc
andylokandyariesdevilBohuTANGChasen-Zhangdrmingdrmereverpcpc
flaneur2020lichuangmergify[bot]RinChanNOWWWsoyeric128sundy-li
flaneur2020lichuangmergify[bot]RinChanNOWWWsoyeric128sundy-li
TCeasonXuanwoxudong963zhang2014zhyass
TCeasonXuanwoxudong963zhang2014zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.