This week in Databend #62

Databend is a powerful cloud data warehouse, built for elasticity and efficiency. It is free and open, and also available in the cloud: https://app.databend.com

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features ✨

meta

  • add snapshot_id codec support (#8005)

planner

  • support update ast and planner (#7925)

query

  • jsonb parser optimize (#7947)
  • impl externalLocation for create table (#7789)
  • use common_jsonb::compare to compare variants (#8027)

storage

  • accept SESSION_TOKEN for AWS temporary credentials (#7946)

cluster

  • experimental distributed eval index (#7867)

new expression

  • migrate retention to v2 (#7952)
  • support constructing array and CAST(... AS VARIANT) (#7781)

Code Refactor 🎉

settings

  • add prefix "format_" for format related settings (#7960)

new expression

  • reorder comparison function priority (#7991)

unit tests

  • use goldenfile in tests on system tables (#7978 & #7982)

Thoughtful Bug Fix 🔧

legacy parser

  • use unicode_segmentation to truncate INSERT statement (#8011)

planner

  • find smallest column for pruning unused columns (#7955 & #7962)
  • union needs more than one coercion type (#8007)

processor

  • try fix cannot kill optimize table (#7959)
  • try fix cannot kill drop table (#7963)

storage

  • shrink min max index (#7958)

new expression

  • fix the bug in the logical expression OR and add test cases (#7966)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

Better Index in Databend

In the past, Databend used a Bloom Filter (Bitmap Index) to check whether a key exists. Databend enabled the Bloom Index at the block level (#6639), which delivered an 8x read performance improvement in certain scenarios (index / data ~= 10%). However, because of how it is implemented, it can take up a very large amount of storage space and performs poorly on point queries.

Now, Databend is making a number of improvements to enhance the insert and read capabilities of large data sets. Some of this work revolves around the index.

We introduced the Xor Filter to replace the Bloom Filter (#7870), which in some scenarios gives about a 2x performance improvement and requires very little data to be scanned. Recent work also includes distributed index pruning (#7867) and local parallel execution of pruning, especially index pruning (#7893), which we believe will further improve CPU and network utilisation and hence performance.
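To illustrate how a block-level filter index lets a point query skip blocks, here is a minimal sketch in Python. It is not Databend's implementation: it uses a toy Bloom filter rather than an Xor filter, and the names `BloomFilter`, `build_block_filters`, and `prune_blocks`, as well as the "block = list of keys" layout, are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def build_block_filters(blocks):
    """Build one filter per block (assumed layout: a block is a list of keys)."""
    filters = []
    for keys in blocks:
        block_filter = BloomFilter()
        for key in keys:
            block_filter.add(key)
        filters.append(block_filter)
    return filters


def prune_blocks(filters, key):
    """Return the indexes of blocks that still have to be read for `key`."""
    return [i for i, f in enumerate(filters) if f.might_contain(key)]


# Three hypothetical blocks; a point query for "b2" must at least read block 1.
blocks = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
filters = build_block_filters(blocks)
candidates = prune_blocks(filters, "b2")
```

Because a filter of this kind can return false positives, `candidates` may contain extra blocks, but it never omits a block that really holds the key. The index size per block (here 1024 bits) is the storage cost that the paragraphs above refer to.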

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, BohuTANG, Chasen-Zhang, ClSlaid,
dantengsky, drmingdrmer, junaire, leiysky, mergify[bot], PsiACE,
RinChanNOWWW, sandflee, sundy-li, Xuanwo, xudong963, youngsofun,
zenria, zhang2014, zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #61

Databend is an open source, elastic, and reliable modern cloud data warehouse. It offers blazing-fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features ✨

share

  • add share database (#7932)

meta

  • add catalog in TableMeta (#7835)

planner

  • support full outer join (#7783)
  • support right semi/anti join (#7909)

index

  • add XOR filter (#7860)
  • enable XOR filter index (#7870)

jsonb

  • add jsonb builtin functions build_array and compare (#7802)

query

  • check memory_size() for building data block (#7927)
  • support unload multi files into stage (#7910)

new expression

  • add try_downcast_builder for ValueType (#7838)
  • migrate min/max/any functions (#7787)
  • migrate aggregation covariance functions (#7926)

Code Refactor 🎉

index

  • split index Filter trait into two traits: FilterBuilder and Filter (#7937)

interpreter

  • try remove InterceptorInterpreter (#7796)

query

new expression

  • manually vectorize not() and xor() (#7801)

Thoughtful Bug Fix 🔧

storage

  • fix oom when recluster (#7791)
  • warmup segment cache during insertion (#7803)
  • use shortcut path if filter vector is empty during pruning (#7877)

compatibility

  • fix mysql pt-archive compatibility (#7853)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

External Location for Fuse Engine

A common problem with cloud services is that data files are invisible to the user, which makes migrating data back to local storage very difficult. In addition, there is a lack of tools for exchanging data within the Big Data ecosystem and for better exploiting the value of the data. In response to this need, Databend has opened an issue: External Location for Fuse Engine.

This is part of the plan for Databend as a Lakehouse. Once this support is complete, users will be able to use Databend to manage the lifecycle of their data and perform data governance tasks, and will have access to key features including Data Share and Time Travel.

Databend Automated Testing with SQLancer

Databend Automated Testing with SQLancer is one of the Databend community's projects in the Open Source Promotion Plan 2022. @hanyisong helped us with this important work, which has now been merged into the sqlancer/sqlancer repository.

SQLancer (Synthesized Query Lancer) is a tool to automatically test Database Management Systems (DBMS) in order to find logic bugs in their implementation.

Learn more at https://github.com/sqlancer/sqlancer/pull/568

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, b41sh, BohuTANG, ClSlaid, dantengsky, drmingdrmer,
everpcpc, flaneur2020, hanyisong, leiysky, lichuang, mergify[bot],
PsiACE, RinChanNOWWW, soyeric128, sundy-li, TCeason, TennyZhuang,
wubx, Xuanwo, xudong963, youngsofun, zhang2014, zhyass

This week in Databend #60

Databend is an open source, elastic, and reliable modern cloud data warehouse. It offers blazing-fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features ✨

planner

  • support independent right join (#7634)

storage

  • delay start the worker for simple select hive query (#7595)
  • get all parquet file list for fuse engine (#7631)

query

  • unify pipeline for all inputs with format (#7613)
  • add security token support for AWS S3 (#7758)
  • implement copy from ipfs (#7729)
  • add and_filters function (#7712)
  • idempotent-copy file (#7719)
  • support jsonb format (#7522)
  • add select from share db and show tables from share db SQL support (#7640)

cluster

  • auto discover ip when ip is unspecified or loop back (#7617)

new expression

  • migrate regexp func to func-v2 (#7459)

Code Refactor 🎉

meta

planner

  • Old Planner Never See Again (Part 2) (#7767)

interpreter

  • remove sendable stream in interpreter (#7582)

processor

  • save pipeline executor into query context (#7722)

query

  • improve in function (#7645)
  • push all filters to prewhere and prune columns for it (#7646)
  • streaming load use planner v2 (#7756)

new expression

  • make unit test goldenfile only display the used columns (#7739)

Thoughtful Bug Fix 🔧

tracing

  • fix: Jaeger layer not filtered (#7621)

planner

  • fix EXPLAIN AST for invalid query (#7737)
  • fix left join returns wrong answer (#7662)

settings

  • fix server hang when concurrent requests http auth (#7733)

query

  • fix cast deterministic error (#7686)

cluster

  • add statistics receiver runtime (#7679)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

Designing and Using JSON in Databend

JSON (JavaScript Object Notation) is a commonly used semi-structured data type. With its self-describing schema structure, JSON can hold all data types. Examples include JSON data shared by various platforms through open interfaces, as well as public datasets and application logs stored in JSON format.

Databend supports structured data types, as well as JSON. This post dives deeply into the JSON data type in Databend.

Learn more at https://databend.rs/blog/json-datatypes

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, b41sh, BohuTANG, ClSlaid, dantengsky, drmingdrmer,
everpcpc, hanyisong, leiysky, lichuang, mergify[bot], PsiACE,
RinChanNOWWW, sandflee, soyeric128, sundy-li, TCeason, Xuanwo,
xudong963, xychu, youngsofun, zhang2014, zhyass

This week in Databend #59

Databend is an open source, elastic, and reliable modern cloud data warehouse. It offers blazing-fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features ✨

RFC

  • Idempotent Copy (#7541)

meta

  • new RPC to echo client ip (#7538)
  • save table stage file info into meta, remove these data when truncate table (#7558)
  • add grpc API kv_api for replacing read_msg and write_msg. (#7605)

query

  • support distributed insert select (#7527)
  • support purge option in copy into table (#7518)

storage

  • add clustering_history system table (#7535)

metrics

  • abstract active instance counting (#7545)

new expression

  • support variant type (#7572)
  • migrate string func insert to func-v2 (#7564)

Code Refactor 🎉

meta

  • remove redundant ActionHandler; move logic into MetaServiceImpl (#7555)

planner

  • Old Planner Never See Again (Part 1) (#7576)
  • make planner depends on TableContext trait (#7600)

query

  • replace recursion for fast-path insert with loop (#7530)
  • always list from OpenDAL instead of meta (#7547)
  • fix set operation err format (#7575)

new expression

  • codegen function registers (#7556)
  • extract number types (#7553)
  • improve floats (#7574)

Build/Testing/CI Infra Changes 🔌

  • add compat test for CopyOptions::purge (#7526)
  • run sqllogic test with docker image (#7650)

Thoughtful Bug Fix 🔧

planner

  • change generated alias name for scalar expression to lowercase (#7525)

query

  • add missing EOI (#7534)

cluster

  • stop tasks in cluster when select limit (#7542)

storage

  • scan_progress should be incr before prewhere filter (#7566)

new expression

  • fix ceil return type (#7520)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

RFC: Idempotent Copy

When stage files are copied into a table in a streaming fashion, there is a chance that some files have already been copied. Databend therefore needs a way to avoid copying files twice, so that the copy becomes an idempotent operation. The RFC covers two points:

  • Save meta information about the stage files copied into a table in the meta service
  • Avoid duplicates when copying stage files into a table

Learn more: https://databend.rs/doc/contributing/rfcs/idempotent-copy
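The idea behind the RFC can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the design from the RFC: `MetaService` is an in-memory stand-in, `StageFile.etag` is a hypothetical content fingerprint, and `copy_into` appends file paths to a list instead of actually loading data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StageFile:
    path: str
    etag: str  # hypothetical content fingerprint of the file


@dataclass
class MetaService:
    """In-memory stand-in for a meta service (assumption for this sketch)."""

    copied: dict = field(default_factory=dict)  # table -> {path: etag}

    def already_copied(self, table: str, f: StageFile) -> bool:
        return self.copied.get(table, {}).get(f.path) == f.etag

    def record(self, table: str, files) -> None:
        self.copied.setdefault(table, {}).update({f.path: f.etag for f in files})


def copy_into(meta: MetaService, table: str, loaded: list, stage) -> list:
    """Copy only files not yet recorded for `table`; return the paths copied."""
    pending = [f for f in stage if not meta.already_copied(table, f)]
    for f in pending:
        loaded.append(f.path)  # placeholder for really loading the file
    meta.record(table, pending)
    return [f.path for f in pending]


meta = MetaService()
loaded = []
stage = [StageFile("a.csv", "e1"), StageFile("b.csv", "e2")]

# The first run copies both files; the second run only copies the new one.
first = copy_into(meta, "t", loaded, stage)
second = copy_into(meta, "t", loaded, stage + [StageFile("c.csv", "e3")])
```

Keying the record on both path and fingerprint means a file that is replaced under the same name would be copied again, which is one possible design choice; the RFC linked above describes what Databend actually stores.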

Databend Perf with Ontime JOIN

With several recent patches, Databend can fully support Ontime JOIN queries, so you can now also see them in the Databend Perf dashboard.

  • Q5 JOIN

    SELECT Carrier, c, c2, c*100/c2 AS c3
    FROM (
        SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*) AS c
        FROM ontime
        WHERE DepDelay>10 AND Year=2007
        GROUP BY Carrier
    ) q
    JOIN (
        SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*) AS c2
        FROM ontime
        WHERE Year=2007
        GROUP BY Carrier
    ) qq USING (Carrier)
    ORDER BY c3 DESC;
    
  • Q6 JOIN

    SELECT Carrier, c, c2, c*100/c2 AS c3
    FROM (
        SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*) AS c
        FROM ontime
        WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
        GROUP BY Carrier
    ) q
    JOIN (
        SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*) AS c2
        FROM ontime
        WHERE Year>=2000 AND Year<=2008
        GROUP BY Carrier
    ) qq USING (Carrier)
    ORDER BY c3 DESC;
    
  • Q7 JOIN

    SELECT Year, c1/c2
    FROM (
        SELECT Year, count(*)*100 AS c1
        FROM ontime
        WHERE DepDelay>10
        GROUP BY Year
    ) q
    JOIN (
        SELECT Year, count(*) AS c2
        FROM ontime
        GROUP BY Year
    ) qq USING (Year)
    ORDER BY Year;
    

View dashboard: https://perf.databend.rs/

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, BohuTANG, Chasen-Zhang, drmingdrmer, everpcpc,
flaneur2020, lichuang, mergify[bot], RinChanNOWWW, soyeric128, sundy-li,
TCeason, Xuanwo, xudong963, zhang2014, zhyass

This week in Databend #58

Databend is an open source, elastic, and reliable modern cloud data warehouse. It offers blazing-fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

meta

  • add metrics last_seq (#7429)

query

  • add Users in config file (#7477)
  • avoid full tokenizer in parsing insert statement (#7485)
  • make aggregate function return null on empty set (#7412)
  • support using DEFAULT to fill default value in INSERT statement (#7436)

storage

  • keep a hint of last snapshot location while committing new snapshot (#7418)

share

  • save share config whenever share meta has been changed (#7430)

planner

  • implement join reordering (#7507)
  • fold simple count aggregation (#7414)

new expression

  • migrate math functions to function-v2 (#7514)
  • migrate string functions to function-v2 (#7425)
  • add new aggregate function ANY (#7419)

http handler

  • the first request no longer wait for query to start (#7410)

Improvement

meta

sessions

  • remove query context ref count (#7480)
  • eliminate strong ref for sessions manager and session (#7487)

storage

  • enable chunked reading of hive table (#7373)

Bug fixes

planner

  • column reference is ambiguous in using (#7431)

query

  • fix insert format size (#7441)
  • type_checker return type support nullable (#7504)
  • fix hashset capacity overflow (#7513)
  • cancel task when pipeline is finished (#7515)

cluster

  • fix performance degradation in cluster mode (#7451)

storage

  • fix hive table location not match partition location (#7398)
  • fix block pruning panic (#7492)

new expression

  • support serde for Scalar::Array (#7421)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

Deploy Databend with KubeSphere

Databend officially provides a Helm repository, so you can easily deploy Databend using KubeSphere.

  1. In your workspace, go to App Repositories under App Management, and then click Add.
  2. In the dialog that appears, specify the app repository name and enter the Databend repository URL: https://charts.databend.rs
  3. After you specify the required fields, click Validate to verify the URL. If the URL is available, a green check mark appears next to it. Click OK to finish.

Databend is now added to the KubeSphere App Repositories. You can refer to Deploy Apps from App Templates to complete the deployment.

New release for OpenDAL: Access data freely, painlessly, and efficiently

OpenDAL v0.15.0 has been released with new features 🤩.

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, BohuTANG, Chasen-Zhang, dantengsky,
drmingdrmer, hanyisong, leiysky, lichuang, mergify[bot], PsiACE,
RinChanNOWWW, sandflee, soyeric128, sundy-li, TCeason, xudong963,
youngsofun, zhang2014, zhyass
