This week in Databend #68

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features ✨

metrics

  • add metrics for query detail (#8800)

multiple catalog

  • multiple catalog config (#8743)

query

  • adjust max io requests when read block data to avoid oom (#8726)
  • change PrecommitBlock serde from serde_json to bincode (#8726)
  • support parallel final aggregator, 4X faster! (#8577)
  • parallel merge for distribute query (#8811)

storage

  • shuffle segments during distributed pruning (#8793)
  • add shuffle policy for Partitions (#8814)

new expression

  • add calc domain for comparison (#8754)

Code Refactor πŸŽ‰

io

  • replace NestedCheckpointReader with Cursor (#8716)

handler

  • use FieldEncoder to encode data (#8733)

format

  • refactor with FieldEncoder (#8778)

query

  • unified hashtable interface (#8681)

storage

new expression

  • refine domain (#8755)
  • allow sparse column id in chunk (#8789)

Thoughtful Bug Fix πŸ”§

handler

  • correct databend types to mysql types (#8745)

functions

  • l_col like r_col will generate a hashmap based on r_col, if r_col is huge, it will be oom kill (#8737)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

Shuffle Policy for Partitions

For cache affinity, we consider some strategies when re-shuffle partitions in plan_fragemnt.rs::redistribute_source_fragment, default kind is Seq.

pub enum PartitionsShuffleKind {
    // Bind the Partition to executor one by one with order.
    Seq,
    // Bind the Partition to executor by partition.hash()%executor_nums order.
    Mod,
    // Bind the Partition to executor by partition.rand() order.
    Rand,
}

Learn More

Databend x Rust China Hackathon 2022

The first Rust China Hackathon Online is here! The theme of this year's Hackathon is Rust for Fun and we look forward to working with you to unleash the possibilities of innovation with Rust.

As a co-organiser of this year's hackathon, Databend is sponsoring an enterprise track where participants can explore the appeal of cloud-native data warehouses by creating work around Databend components or the Databend ecosystem.

Learn More

Issues

Meet issues you may be interested in and try to solve it.

Switch to nextest in CI

cargo-nextest is a next-generation test runner for Rust projects.

We have noticed that it is very much faster than cargo test on many projects. However, there are still some challenges in applying this to Databend. For example, for different types of tests, the number of threads has to be adjusted to ensure that the tests are performed quickly and correctly. Also, some of the tests may need to be rewritten to get better results.

Issue 4376: switch to nexttest in ci

If you find it interesting, try to solve it or participate in discussions and PR reviews. Or you can click on https://link.databend.rs/i-m-feeling-lucky to pick up a good first issue, good luck!

Changlogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandyBohuTANGClSlaiddantengskydependabot[bot]everpcpc
andylokandyBohuTANGClSlaiddantengskydependabot[bot]everpcpc
lichuangmergify[bot]RinChanNOWWWsandfleesoyeric128sundy-li
lichuangmergify[bot]RinChanNOWWWsandfleesoyeric128sundy-li
TCeasonwubxXuanwoxudong963youngsofunzhang2014
TCeasonwubxXuanwoxudong963youngsofunzhang2014
ZhiHanZzhyass
ZhiHanZzhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #67

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features ✨

toolchain

  • upgrade to 1.67 nightly (#8631)

multiple catalog

  • multiple catalog create (planner and catalog manager) (#8620)

compact

  • optimize compact for data load (#8644)

planner

  • optimize left/single join (#8583)

query

  • support copy from xml (#8404)
  • add collation (#8610)
  • copy files order by last modified time asc (#8628)
  • improve sort, 10%~50% faster than the old one (#8452)

new expression

  • add to_xxx() cast functions (#8599)
  • add inlist expr in new expression (#8676)

Code Refactor πŸŽ‰

format

  • refactor output format with FieldEncoders (#8700)

planner

  • move plan from query/planner to sql/planner (#8660)

query

  • remove sqlparser-rs (#8670)
  • try move list files to read_partitions (#8673)

storage

  • move and group sub-crates in storages (#8613, #8621, #8627, etc.)
  • compact segments, which strictly preserves the order of ingestion (#8590)

new expression

  • migrate deserializations to expression (#8637)
  • use to_xxx() to evaluate CAST(xxx AS xxx) (#8637)

Build/Testing/CI Infra Changes πŸ”Œ

  • rust-toolchain nightly 1.67.0 (nightly-2022-11-07) (#8641)

Thoughtful Bug Fix πŸ”§

compatibility

  • problem when using Trino Mysql connector (#8668)

meta

  • emit kv change events after committing a transaction (#8674)

query

  • union's pairs are handled incorrectly (#8638)
  • max_threads can not determined automatically (#8707)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

Support Copy from XML

After #8404 was merged, Databend now offers support for loading data from XML formatted files.

Similar to the use of other formats, in the SQL statement it is only necessary to set the format option to XML and an example of using the streaming load API is given below.

curl -sH "insert_sql:insert into test_xml format XML" \
-F "[email protected]/tmp/simple_v1.xml" \
-u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load"

The content of your XML file needs to match one or more of the following types:

  • Column names as attributes and column values as attribute values:
<row column1="value1" column2="value2" .../>
  • Column names as tags and column values as the content of these tags:
<row>
  <column1>value1</column1>
  <column2>value2</column2>
</row>
  • Column names are the name attributes of tags, and values are the contents of these tags:
<row>
  <field name='column1'>value1</field>
  <field name='column2'>value2</field>
</row>

Learn More

Support for Char Collation

After #8610 was merged, Databend now supports setting collation to select the string encoding to be considered.

By default, collation is set to 'binary', as Databend stores string columns in binary format by default, which you can change to 'utf-8' with a statement like the following:

set collation = 'utf8';

This may help you to get the expected results when working with non-English strings.

statement query TI
select substr('εŸŽεŒΊδΈ»εŸŽεŒΊε…Άδ»–', 1, 6), length('ζˆ‘ηˆ±δΈ­ε›½');

----
城区	12


statement ok
set collation = 'utf8';


statement query TI
select substr('εŸŽεŒΊδΈ»εŸŽεŒΊε…Άδ»–', 1, 6), length('ζˆ‘ηˆ±δΈ­ε›½');

----
εŸŽεŒΊδΈ»εŸŽεŒΊε…Ά	4

Learn More

Issues

Meet issues you may be interested in and try to solve it.

Enable Xor Filter Index for IN

Databend introduced the Xor Fliter to replace the Bloom Filter (#7870), which in some scenarios gives about twice the performance improvement and requires very little data to be scanned.

Initially, we simply added this index for the string columns.Then, in #7958, it is enabled for the integer columns.

Now, we want to enable Xor Filter index for IN .

SELECT * FROM t1 where xx IN ('', '')

Issue 8625: performance: enable xor filter index for IN

If you find it interesting, try to solve it or participate in discussions and PR reviews. Or you can click on https://link.databend.rs/i-m-feeling-lucky to pick up a good first issue, good luck!

Changlogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandyb41shBohuTANGChasen-ZhangClSlaiddantengsky
andylokandyb41shBohuTANGChasen-ZhangClSlaiddantengsky
dependabot[bot]drmingdrmereliasyaoyclichuangmergify[bot]RinChanNOWWW
dependabot[bot]drmingdrmereliasyaoyclichuangmergify[bot]RinChanNOWWW
soyeric128sundy-liTCeasonXuanwoxudong963youngsofun
soyeric128sundy-liTCeasonXuanwoxudong963youngsofun
zhang2014ZhiHanZ
zhang2014ZhiHanZ

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #66

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features ✨

RFC

user stage

  • add internal and legacy internal stage support (#8526)
  • implement user stage support along with stateful tests (#8556)

multiple catalog

  • implement multiple catalog AST (#8608)

hive

  • hive predict pushdown (#8469)
  • suport substr && order by with null values compatible with hive functions (#8558)

rbac

  • only display available roles in SHOW ROLES statement (#8553)

compatibility

  • support mydumper dump data (#8500)

auth

  • add common auth module and support file token on sharing endpoint (#8607)

format

  • allow nested field in TSV (#8606)
  • unify format settings/options (#8566)

hashtable

  • adaptive string hash table (#7971)

meta

  • auto-clean expired keys (#8539)
  • export/import supports new key-space: Expire (#8578)

query

  • implement logging format for databend-query (#8466)

Code Refactor πŸŽ‰

error handler

  • add internal error and merge other not needed errors (#8581)

interpreter

  • move interpreter_common.rs to common (#8593)

query

  • remove legacy crate codes (#8412)
  • avoid extra memcpy (#8569)
  • migrate serializations to expression (#8582)

storage

  • make block meta easier to be cloned (#8548)

new expression

  • cast timestamp to variant string instead of variant int (#8580)

Thoughtful Bug Fix πŸ”§

compatibility

  • fed mysqldump 5.7.16 /*!code query (#8485)

planner

  • select view column not exists (#8543)

query

  • div nullable(Int) with null value should not err (#8495)
  • CURRENT ROLE behavior with AUTH ROLE is set (#8546)

storage

  • compact limit (#8473)
  • incorrect index size during reducing block metas (#8428)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

String Adaptive Hash Table

String Adaptive Hash Table is one of the Databend community's projects in the Open Source Promotion Plan 2022. @usamoi helped us with this important work, which has now been merged into Databend repository and has resulted in a 20% - 50% performance improvement in certain scenarios.

String Adaptive Hash Table is designed for the key to be a string type, and different data structures are applied to optimise the insertion and reading based on the different lengths of the key.

Learn More

RFC: User Stage

Databend only supports named internal stage:

CREATE STAGE @my_stage;
COPY INTO my_table FROM @my_stage;

However, named internal stages are complex to be used in some cases. Especially for users who only use stages to load data. By supporting the user stage, they can copy data more efficiently:

COPY INTO my_table from @~;

Learn More

Issues

Meet issues you may be interested in and try to solve it.

Replace RwLock<HashMap> and Mutex<HashMap> by Using DashMap

DashMap is an implementation of a concurrent associative array/hashmap in Rust. it tries to be very simple to use and to be a direct replacement for RwLock<HashMap<K, V>>.

Using DashMap will bring two advantages:

  • Code will be cleaner, and
  • Potential performance improvement

Issue 8601: Feature: replace RwLock and Mutex by using DashMap

If you find it interesting, try to solve it or participate in discussions and PR reviews. Or you can click on https://link.databend.rs/i-m-feeling-lucky to pick up a good first issue, good luck!

Changlogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandyb41shBohuTANGChasen-ZhangClSlaiddantengsky
andylokandyb41shBohuTANGChasen-ZhangClSlaiddantengsky
drmingdrmereverpcpcflaneur2020leiyskylichuangmergify[bot]
drmingdrmereverpcpcflaneur2020leiyskylichuangmergify[bot]
sandfleesoyeric128sundy-liTCeasonTszKitLo40usamoi
sandfleesoyeric128sundy-liTCeasonTszKitLo40usamoi
wubxXuanwoxudong963youngsofunZhiHanZzhyass
wubxXuanwoxudong963youngsofunZhiHanZzhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #65

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features ✨

RFC

  • Multiple Catalog (#8254)

meta

  • allow to join a cluster if a meta node has no log (#8384)
  • add key space Expire (#8441)

datablock

  • add metainfo in datablock (#8417)

functions

  • improve performance of count distinct (#8317)

parser

  • relax the order requirement about the COPY options (#8341)
  • support SELECT output nothing (#8390)

planner

  • optimize join plan to make filter push down (#8377)

handlers

  • http handler no longer need to call final_uri explicitly. (#8299)

rbac

  • support SET ROLE and current_role() (#8392)

storage

  • Compact Segment (#8261)
  • Add cache operator in common-storage (#8306)
  • introduce data metrics for table (#8363)
  • add gc status to data metrics and show in processlist (#8389)
  • allow loading data from local fs (#8431)
  • new system table: system.catalogs (#8423)

new expression

  • implement array function slice, remove_fist, remove_last (#8326)
  • add tuple() and get() for tuple (#8372)

tests

  • add join tests under large dataset (#8351)

Code Refactor πŸŽ‰

meta

  • merge two to-meta-server rpc into one (#8308)
  • try best to leave a cluster (#8298)

planner

  • use right mark join as subquery's default join type (#8427)

query

  • optimize get/upsert copied file info (#8282)
  • re-org query crates (#8336)

storage

  • use commit_mutation in segment compaction (#8350)
  • decouple meta readers from TableContext (#8395)

new expression

  • move expressoin test to function-v2 (#8397)

Build/Testing/CI Infra Changes πŸ”Œ

  • replace cargo udeps with cargo machete (#8343)
  • migrate deprecating set-output commands (#8381)

Thoughtful Bug Fix πŸ”§

config

  • throw errors while loading config failed (#8462)

planner

  • fix aggregation in cluster mode (#8333)
  • left join panic (#8325)
  • remove unnecessary required columns (#8443)

processor

  • try fix data lost when resize multi outputs (#8319)
  • try fix lost last message if finish at same time (#8333)

query

  • StringSearchLike vector_vector can not match '\n' (#8359)
  • optimize upsert table copied file info (#8409)
  • div zero return err (#8464)

storage

  • support mutation during insertion (#8205)
  • add a threshold for compact block (#8322)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

RFC: Multiple Catalog

Databend supports multiple catalogs now, but only in a static way.

To allow accessing the hive catalog, users need to configure hive inside databend-query.toml in this way:

[catalog]
meta_store_address = "127.0.0.1:9083"
protocol = "binary"

Users can't add/alter/remove the catalogs during runtime.

By allowing users to maintain multiple catalogs for the databend, we can integrate more catalogs like iceberg more quickly.

Learn More

Jepsen Test for Databend Meta Service

Jepsen is an open source software library for system testing. It is an effort to improve the safety of distributed databases, queues, consensus systems, etc.

For the past period of time, @lichuang has been working on the design and implementation of a Jepsen test solution for the Databend Meta Service.

If you are interested in this test, please check the corresponding GitHub Repo, which contains the steps, scripts and clients for the test.

Learn More

Issues

Meet issues you may be interested in and try to solve it.

New Key-Value services support for OpenDAL

OpenDAL means Open Data Access Layer and its goal is Access data freely, painlessly, and efficiently .

In past, OpenDAL has completed support for different storage backends such as local file system, AWS s3, Azure Blob, etc. And to support storing volatile data to provide cache solutions and temporary storage of data, OpenDAL has designed and implemented Key-Value service support (with kv::Adapter).

The following backends are currently available for the Key-Value service:

  • memory: Service based on BtreeMap
  • moka: Service based on the high-performance caching library moka.
  • redis: Service based on redis.

The community also plans to add support for the following Key-Value services:

If you find these interesting, try to solve them or participate in discussions and PR reviews. Or you can click on https://link.databend.rs/i-m-feeling-lucky to pick up a good first issue, good luck!

Changlogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandyb41shBohuTANGClSlaiddantengskydrmingdrmer
andylokandyb41shBohuTANGClSlaiddantengskydrmingdrmer
everpcpcflaneur2020guzzitleiyskylichuangmergify[bot]
everpcpcflaneur2020guzzitleiyskylichuangmergify[bot]
miles170RinChanNOWWWsoyeric128sundy-liTCeasonTszKitLo40
miles170RinChanNOWWWsoyeric128sundy-liTCeasonTszKitLo40
wubxXuanwoxudong963youngsofunzhang2014zhyass
wubxXuanwoxudong963youngsofunzhang2014zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #64

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features ✨

meta

  • add API list_all_tables (#8254)

jsonb

  • jsonb functions support json string (#8222)

planner

  • optimize right join (#8204)
  • optimize subquery by deleting redundancy join (#8264)

query

  • parallel read of ndjson in copy (#8199)

storage

  • add share table query (#8198)
  • add limit push down to fuse_snapshot (#8201)
  • retry fuse table write operations (#8244)

new expression

  • migrate array func to func-v2 (#8169)
  • support cast between date/timestamp and string (#8147)
  • implement arithmetic functions for Date and Timestamp (#8202)
  • migrate date to number functions to v2 (#8257)
  • migrate datetime rounder functions to v2 (#8281)

Code Refactor πŸŽ‰

meta

  • refactor the sled-store , remove unnecessary methods (#8176, #8190 & #8203)

new expression

  • improve combine_nullable (#8191)

planner

  • refactor hash join (#8173)

Thoughtful Bug Fix πŸ”§

meta

  • AsKeySpace::clear() should not entire sled tree (#8288)

sessions

  • fix dead lock when force kill sessions (#8242)

pipelines

  • support abort for semi or anti join, right join, left join, etc. (#8165, #8212, #8245, etc.)

query

  • query version is incorrect (#8272)

storage

  • fix undrop not working as expected if table is dropped by using drop table t all (#8177)
  • fix dead lock caused by incorrect semaphore permit control (#8226)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

Add cache support for Databend Query

In past discussions (Issues: #6786), Databend has planned to introduce a new cache solution to help better handle hot data.

We have recently tried to push this forward and have now implemented two cache layers in OpenDAL for metadata and content data respectively, and introduced a CacheOperator to Databend.

Next, we will consider using OpenDAL to take over Databend's existing in-memory cache, and then tune it according to practice.

Learn More

Changlogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandyb41shBohuTANGdantengskydrmingdrmereverpcpc
andylokandyb41shBohuTANGdantengskydrmingdrmereverpcpc
leiyskylichuangmergify[bot]PsiACERinChanNOWWWsandflee
leiyskylichuangmergify[bot]PsiACERinChanNOWWWsandflee
soyeric128sundy-liTCeasonwubxXuanwoxudong963
soyeric128sundy-liTCeasonwubxXuanwoxudong963
youngsofunzhang2014zhyass
youngsofunzhang2014zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.