This week in Databend #72

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

Special Note: This Week in Databend will be gradually migrated to the Databend Blog. We will keep the content in sync until the final migration is complete.

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Multiple Catalogs

  • extends show databases SQL (#9152)

Stage

  • support select from URI (#9247)

Streaming Load

  • support file_format syntax in streaming load insert sql (#9063)

Planner

  • push down limit to union (#9210)

Query

  • use analyze table instead of optimize table statistic (#9143)
  • fast parse insert values (#9214)

Storage

  • use distinct count calculated by the xor hash function (#9159)
  • read_parquet read meta before read data (#9154)
  • push down filter to parquet reader (#9199)
  • prune row groups before reading (#9228)

Open Sharing

  • add prototype open sharing and add sharing stateful tests (#9177)

Code Refactoring 🎉

*

  • simplify the global data registry logic (#9187)

Storage

  • refactor deletion (#8824)

Build/Testing/CI Infra Changes 🔌

  • release databend deb package and databend with hive (#9138, #9241, etc.)

Bug Fixes 🔧

Format

  • support ASCII control code hex as format field delimiter (#9160)

Planner

  • prewhere_column empty and predicate is not const will return empty (#9116)
  • don't push down topk to Merge when its child is Aggregate (#9183)
  • fix nullable column validity not equal (#9220)

Query

  • address unit test hang on test_insert (#9242)

Storage

  • too many io requests for read blocks during compact (#9128)
  • collect orphan snapshots (#9108)

What's On In Databend

Stay connected with the latest news about Databend.

Breaking Change: Unified File Format Options

To simplify, we're rolling out a set of unified file format options as follows for the COPY INTO command, the Streaming Load API, and all the other cases where users need to describe their file formats:

[ FILE_FORMAT = ( TYPE = { CSV | TSV | NDJSON | PARQUET | XML } [ formatTypeOptions ] ) ]

  • Please note that the current format options starting with format_* will be deprecated.
  • ... FORMAT CSV ... will still be accepted by the ClickHouse handler.
  • Support for customized formats created by CREATE FILE FORMAT ... will be added in a future release: ... FILE_FORMAT = (format_name = 'MyCustomCSV') ...
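
As a rough illustration of the clause shape above, here is a minimal Rust sketch that renders a FILE_FORMAT clause as a string. The FileType enum and the file_format_clause helper are hypothetical names made up for this example and are not part of Databend. Options are passed through verbatim, because the concrete formatTypeOptions are not listed here.

```rust
use std::fmt;

/// File types listed in the unified FILE_FORMAT syntax.
/// Hypothetical type for illustration; not a Databend API.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FileType {
    Csv,
    Tsv,
    NdJson,
    Parquet,
    Xml,
}

impl fmt::Display for FileType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let name = match self {
            FileType::Csv => "CSV",
            FileType::Tsv => "TSV",
            FileType::NdJson => "NDJSON",
            FileType::Parquet => "PARQUET",
            FileType::Xml => "XML",
        };
        f.write_str(name)
    }
}

/// Renders `FILE_FORMAT = ( TYPE = <type> [ options ] )`.
/// `options` are emitted as-is in `KEY = VALUE` form; this helper does not
/// know or validate any real option names.
pub fn file_format_clause(file_type: FileType, options: &[(&str, &str)]) -> String {
    let mut clause = format!("FILE_FORMAT = ( TYPE = {}", file_type);
    for (key, value) in options {
        clause.push_str(&format!(" {} = {}", key, value));
    }
    clause.push_str(" )");
    clause
}
```

Calling file_format_clause(FileType::Csv, &[]) yields the string FILE_FORMAT = ( TYPE = CSV ), which can then be appended to a statement that accepts the clause.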

Learn More

Open Sharing

Open Sharing is a simple and secure data-sharing protocol designed for databend-query nodes running in a multi-cloud environment.

  • Simple & Free: Open Sharing is open-source and basically a RESTful API implementation.
  • Secure: Open Sharing verifies incoming requesters' identities and access permissions, and provides an audit log.
  • Multi-Cloud: Open Sharing supports a variety of public cloud platforms, including AWS, Azure, GCP, etc.

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

We're about to refactor the stage-related tests so that they move files to a stage with the Streaming Load API instead of an AWS command like this:

aws --endpoint-url ${STORAGE_S3_ENDPOINT_URL} s3 cp s3://testbucket/admin/data/ontime_200.csv s3://testbucket/admin/stage/internal/s1/ontime_200.csv >/dev/null 2>&1

This is because Databend users do not need to care about, and may not even know, the stage paths that the AWS command requires.

Issue 8528: refactor stage related tests

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

ariesdevil, b41sh, BohuTANG, Chasen-Zhang, ClSlaid, dantengsky,
drmingdrmer, hantmac, lichuang, mergify[bot], PsiACE, RinChanNOWWW,
soyeric128, sundy-li, wubx, Xuanwo, xudong963, youngsofun,
ZhiHanZ, zhyass, zzzdong

Connect With Us

We'd love to hear from you. Feel free to run the code and see if Databend works for you. Submit an issue with your problem if you need help.

DatafuseLabs Community is open to everyone who loves data warehouses. Please join the community and share your thoughts.

This week in Databend #71

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

Special Note: This Week in Databend will be gradually migrated to the Databend Blog. We will keep the content in sync until the final migration is complete.

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Planner

  • optimize topk in cluster mode (#9092)

Query

  • support select * exclude [column_name | (col_name, col_name,...)] (#9009)
  • alter table flashback (#8967)
  • new table function read_parquet to read parquet files as a table (#9080)
  • support select * from @stage (#9123)

Storage

  • cache policy (#9062)
  • support hive nullable partition (#9064)

Code Refactoring 🎉

Memory Tracker

  • keep tracker state consistent (#8973)

REST API

  • drop ctx after query finished (#9091)

Bug Fixes 🔧

Configs

  • add more tests for hive config loading (#9074)

Planner

  • try to fix table name case sensitivity (#9055)

Functions

  • vector_const like bug fix (#9082)

Storage

  • update last_snapshot_hint file when purge (#9060)

Cluster

  • try fix broken pipe or connect reset (#9104)

What's On In Databend

Stay connected with the latest news about Databend.

RESTORE TABLE

Databend restores the table to the prior state captured by the snapshot ID or timestamp you specify in the command. To retrieve the snapshot IDs and timestamps of a table, use FUSE_SNAPSHOT.

-- Restore with a snapshot ID
ALTER TABLE <table> FLASHBACK TO (SNAPSHOT => '<snapshot-id>');
-- Restore with a snapshot timestamp
ALTER TABLE <table> FLASHBACK TO (TIMESTAMP => '<timestamp>'::TIMESTAMP);

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Adding Build Information to Error Report

An error report currently only contains an error code and some information about why the error occurred. When build information is available, troubleshooting will become easier.

"Code: xx. Error: error msg... (version ...)"
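
A minimal Rust sketch of how such a message could be assembled is shown below. The ErrorReport struct and its fields are hypothetical and only mirror the message shape quoted above. The version string is passed in by the caller rather than read from any real Databend build metadata.

```rust
use std::fmt;

/// A hypothetical error report; names are illustrative, not Databend types.
pub struct ErrorReport<'a> {
    pub code: u16,
    pub message: &'a str,
    /// Build information such as a version string; `None` when unavailable.
    pub build_info: Option<&'a str>,
}

impl fmt::Display for ErrorReport<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The part that error reports already contain: code and reason.
        write!(f, "Code: {}. Error: {}", self.code, self.message)?;
        // Append build information only when it is known.
        if let Some(version) = self.build_info {
            write!(f, " (version {})", version)?;
        }
        Ok(())
    }
}
```

Keeping the build information optional means a report can still be formatted when the version is unknown, for example in tests.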

Issue 9117: Add Build Information to the Error Report

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, b41sh, BohuTANG, dantengsky, drmingdrmer, everpcpc,
lichuang, mergify[bot], PsiACE, RinChanNOWWW, sandflee, soyeric128,
sundy-li, TCeason, Xuanwo, xudong963, youngsofun, zhang2014,
ZhiHanZ

Connect With Us

We'd love to hear from you. Feel free to run the code and see if Databend works for you. Submit an issue with your problem if you need help.

DatafuseLabs Community is open to everyone who loves data warehouses. Please join the community and share your thoughts.

This week in Databend #70

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

Special Note: This Week in Databend will be gradually migrated to the Databend Blog. We will keep the content in sync until the final migration is complete.

What's New

Check out what we've done this week to make Databend even better for you.

Features & Improvements ✨

Format

  • better checking of format options (#8981)
  • add basic schema infer for parquet (#9043)

Query

  • QualifiedName support 'db.table.' and 'table.' (#8965)
  • support bulk insert without expression (#8966)

Storage

  • add cache layer for fuse engine (#8830)
  • add system table system.memory_statistics (#8945)
  • add optimize statistic ddl support (#8891)

Code Refactoring 🎉

Base

  • remove common macros (#8936)

Format

  • TypeDeserializer get rid of FormatSetting (#8950)

Planner

  • refactor extract or predicate (#8951)

Processors

  • optimize join by merging build data block (#8961)

New Expression

  • allow sparse column id in chunk, redo #8789 with a new approach. (#9008)

Bug Fixes 🔧

Base

  • try fix lost tracker (#8932)

Meta

  • fix share db bug, create DatabaseIdToName if need (#9006)

Mysql handler

  • fix mysql conns leak (#8894)

Processors

  • try fix update list memory leak (#9023)

Storage

  • read and write block in parallel when compact (#8921)

What's On In Databend

Stay connected with the latest news about Databend.

Infer Schema at a Glance

You usually need to create a table before loading data from a file stored in a stage or elsewhere. Unfortunately, you might not always know the file's schema, or the schema might be too complex to type in by hand.

Introducing the capability to infer schema from an existing file will make the work much easier. You will even be able to query data directly from a stage using a SELECT statement like select * from @my_stage.

INFER 's3://mybucket/data.csv' FILE_FORMAT = ( TYPE = CSV );
+-------------+---------+----------+
| COLUMN_NAME | TYPE    | NULLABLE |
|-------------+---------+----------|
| CONTINENT   | TEXT    | True     |
| COUNTRY     | VARIANT | True     |
+-------------+---------+----------+

We've added support for inferring the basic schema from parquet files in #9043, and we're now working on #7211 to implement select from @stage.

Learn More

What's Up Next

We're always open to cutting-edge technologies and innovative ideas. You're more than welcome to join the community and bring them to Databend.

Add TLS Support for MySQL Handler

The recently released opensrv-mysql v0.3.0 includes support for TLS. It sounds like a good idea to introduce it to Databend.

let (is_ssl, init_params) = opensrv_mysql::AsyncMysqlIntermediary::init_before_ssl(
    &mut shim,
    &mut r,
    &mut w,
    &Some(tls_config.clone()),
)
.await
.unwrap();

opensrv_mysql::secure_run_with_options(shim, w, ops, tls_config, init_params).await

Issue 8983: Feature: tls support for mysql handler

Please let us know if you're interested in contributing to this issue, or pick up a good first issue at https://link.databend.rs/i-m-feeling-lucky to get started.

Changelog

You can check the changelog of Databend Nightly for details about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, BohuTANG, dantengsky, drmingdrmer,
everpcpc, flaneur2020, leiysky, lichuang, mergify[bot], PsiACE,
sandflee, soyeric128, sundy-li, TCeason, TracyZYJ, Xuanwo,
xudong963, youngsofun, yufan022, zhang2014, zhyass

Connect With Us

We'd love to hear from you. Feel free to run the code and see if Databend works for you. Submit an issue with your problem if you need help.

DatafuseLabs Community is open to everyone who loves data warehouses. Please join the community and share your thoughts.

This week in Databend #69

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features ✨

multiple catalog

  • implement drop user defined catalog (#8820)

meta

  • add cli command to delete a key and to expire a key (#8858)

planner

  • support broadcast join (#8779)
  • extract or clause to push down potential predicates for join (#8855)

query

  • optimize count(Nullable(col)) (#8805)
  • support unset settings (#8870)
  • add distinct count aggregator and column distinct count (#8825)

storage

  • compact segments in reversed order (#8806)

new expression

  • geo functions (#8481)
  • add methods to get memory size of ValueTypes (#8875)
  • add a global builtin function registry (#8912)

Code Refactor 🎉

memory tracker

  • send pointer addresses to mem tracker (#8879)
  • add StatBuffer to provide fine grained mem allocation stats buffer (#8880)

new expression

  • allow sparse column id for constant folder (#8821)

Build/Testing/CI Infra Changes 🔌

  • separate sqllogic test with handler (#8836)

Thoughtful Bug Fix 🔧

base

  • support track processor async task (#8871)

http handler

  • avoid dropping runtime when task on it not finished (#8894)

query

  • remove useless memcpy when group long string (#8851)

storage

  • snapshot removed unsafely during meta commit failure (#8850)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

Preview of New Expressions: Geo Functions

By supporting Geo functions, Databend will have the ability to perform operations on geographic inputs.

With the merging of #8481, geo functions like great_circle_distance, geo_distance, great_circle_angle and point_in_ellipses are already supported in the new expression system.

Databend is currently actively working on the migration to the new expressions, so keep an eye on the expression branch for progress!

Learn More

Unset Settings

The merging of #8870 brings the ability to unset settings to Databend.

UNSET restores one or more settings to their default values. Settings that were set at the GLOBAL level are also reset to the initial SESSION level.

UNSET <setting_name> | ( <setting_name> [, <setting_name> ...])
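
To make the "restore to default" behaviour concrete, here is a toy Rust sketch of a settings store with defaults and overrides. It is an assumption-laden model, not Databend's implementation: the Settings type, its methods and the setting names are all hypothetical, and the GLOBAL versus SESSION distinction is deliberately left out.

```rust
use std::collections::HashMap;

/// A toy settings store: built-in defaults plus user overrides.
/// Hypothetical model for illustration; not Databend's settings code.
pub struct Settings {
    defaults: HashMap<String, String>,
    overrides: HashMap<String, String>,
}

impl Settings {
    pub fn new(defaults: &[(&str, &str)]) -> Self {
        Settings {
            defaults: defaults
                .iter()
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .collect(),
            overrides: HashMap::new(),
        }
    }

    /// SET: record an override for a known setting.
    /// Returns false when the setting name is unknown.
    pub fn set(&mut self, name: &str, value: &str) -> bool {
        if !self.defaults.contains_key(name) {
            return false;
        }
        self.overrides.insert(name.to_string(), value.to_string());
        true
    }

    /// UNSET: drop the overrides so the default values apply again.
    pub fn unset(&mut self, names: &[&str]) {
        for name in names {
            self.overrides.remove(*name);
        }
    }

    /// Effective value: the override if present, otherwise the default.
    pub fn get(&self, name: &str) -> Option<&str> {
        self.overrides
            .get(name)
            .or_else(|| self.defaults.get(name))
            .map(|s| s.as_str())
    }
}
```

In this model, unsetting a name that was never overridden is a no-op, which matches the idea that UNSET can take a list of settings at once.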

Learn More

Issues

Here are some issues you may be interested in. Try solving one of them.

Add Compression Option to Create Table

Compression helps to reduce the size of databases. For IO-intensive loads, compression may provide some performance improvements.

Databend plans to introduce a compression option in the create table statement and to support compression algorithms such as LZ4 (default) and Snappy.

create table t1(a int) [compression="LZ4|SNAPPY"]

Issue 8903: feat: add compression option to create table

If you find it interesting, try to solve it or participate in discussions and PR reviews. Or you can click on https://link.databend.rs/i-m-feeling-lucky to pick up a good first issue, good luck!

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, ariesdevil, b41sh, BohuTANG, Chasen-Zhang, ClSlaid,
dantengsky, drmingdrmer, everpcpc, kemingy, lichuang, mergify[bot],
RinChanNOWWW, soyeric128, sundy-li, TCeason, wubx, Xuanwo,
xudong963, youngsofun, zhang2014, zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #68

Databend is a powerful cloud data warehouse. Built for elasticity and efficiency. Free and open. Also available in the cloud: https://app.databend.com .

What's Changed

Below is a list of some major changes that we don't want you to miss.

Exciting New Features ✨

metrics

  • add metrics for query detail (#8800)

multiple catalog

  • multiple catalog config (#8743)

query

  • adjust max io requests when read block data to avoid oom (#8726)
  • change PrecommitBlock serde from serde_json to bincode (#8726)
  • support parallel final aggregator, 4X faster! (#8577)
  • parallel merge for distribute query (#8811)

storage

  • shuffle segments during distributed pruning (#8793)
  • add shuffle policy for Partitions (#8814)

new expression

  • add calc domain for comparison (#8754)

Code Refactor 🎉

io

  • replace NestedCheckpointReader with Cursor (#8716)

handler

  • use FieldEncoder to encode data (#8733)

format

  • refactor with FieldEncoder (#8778)

query

  • unified hashtable interface (#8681)

new expression

  • refine domain (#8755)
  • allow sparse column id in chunk (#8789)

Thoughtful Bug Fix 🔧

handler

  • correct databend types to mysql types (#8745)

functions

  • l_col like r_col generates a hashmap based on r_col; if r_col is huge, the query is OOM-killed (#8737)

News

Let's take a look at what's new at Datafuse Labs & Databend each week.

Shuffle Policy for Partitions

For cache affinity, we consider several strategies when reshuffling partitions in plan_fragment.rs::redistribute_source_fragment. The default kind is Seq.

pub enum PartitionsShuffleKind {
    // Bind the Partition to executor one by one with order.
    Seq,
    // Bind the Partition to executor by partition.hash()%executor_nums order.
    Mod,
    // Bind the Partition to executor by partition.rand() order.
    Rand,
}
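
The following Rust sketch shows one way the Seq and Mod policies could map partitions to executors. It is only an interpretation of the comments in the enum above, not the code merged into Databend: ShuffleKind, assign and bucket_of are hypothetical names, Seq is read here as "contiguous runs in the original order", and Rand is left out because the standard library has no random number generator.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in covering two of the policies described above.
#[derive(Debug, Clone, Copy)]
pub enum ShuffleKind {
    Seq,
    Mod,
}

/// Index of the executor chosen by hash(partition) % executors.
pub fn bucket_of<P: Hash>(part: &P, executors: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    part.hash(&mut hasher);
    (hasher.finish() % executors as u64) as usize
}

/// Assigns partitions to `executors` buckets. Panics if `executors` is zero.
pub fn assign<P: Hash + Clone>(parts: &[P], executors: usize, kind: ShuffleKind) -> Vec<Vec<P>> {
    assert!(executors > 0, "need at least one executor");
    let mut buckets: Vec<Vec<P>> = vec![Vec::new(); executors];
    match kind {
        ShuffleKind::Seq => {
            // Assumed reading of Seq: contiguous runs in the original order,
            // with the run length rounded up so every partition gets a slot.
            let chunk = ((parts.len() + executors - 1) / executors).max(1);
            for (i, part) in parts.iter().enumerate() {
                buckets[i / chunk].push(part.clone());
            }
        }
        ShuffleKind::Mod => {
            // The same partition always hashes to the same executor index.
            for part in parts {
                buckets[bucket_of(part, executors)].push(part.clone());
            }
        }
    }
    buckets
}
```

With Mod, a given partition lands on the same executor index as long as the number of executors stays the same, which is what makes it attractive for cache affinity. Seq keeps neighbouring partitions together instead.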

Learn More

Databend x Rust China Hackathon 2022

The first Rust China Hackathon Online is here! The theme of this year's Hackathon is Rust for Fun and we look forward to working with you to unleash the possibilities of innovation with Rust.

As a co-organiser of this year's hackathon, Databend is sponsoring an enterprise track where participants can explore the appeal of cloud-native data warehouses by creating work around Databend components or the Databend ecosystem.

Learn More

Issues

Here are some issues you may be interested in. Try solving one of them.

Switch to nextest in CI

cargo-nextest is a next-generation test runner for Rust projects.

We have noticed that it is much faster than cargo test on many projects. However, there are still some challenges in applying it to Databend. For example, the number of threads has to be adjusted for each type of test to ensure that the tests run quickly and correctly. Also, some of the tests may need to be rewritten to get better results.

Issue 4376: switch to nexttest in ci

If you find it interesting, try to solve it or participate in discussions and PR reviews. Or you can click on https://link.databend.rs/i-m-feeling-lucky to pick up a good first issue, good luck!

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Contributors

Thanks a lot to the contributors for their excellent work this week.

andylokandy, BohuTANG, ClSlaid, dantengsky, dependabot[bot], everpcpc,
lichuang, mergify[bot], RinChanNOWWW, sandflee, soyeric128, sundy-li,
TCeason, wubx, Xuanwo, xudong963, youngsofun, zhang2014,
ZhiHanZ, zhyass

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.