This week in Databend #32

Databend aims to be an open source, elastic, and reliable cloud warehouse. It offers blazing fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

  • databend-query: external source with new processor: make an external source (like S3) usable as a table engine. by @BohuTANG, (#4277)
  • functions: support EXTRACT && toYear by @clark1013, (#4329)

Improvement

  • databend-query: add more metrics into query_log table: io metrics of dal level. by @sundy-li, (#4365)
  • databend-meta: refactor role identity: remove host for role identity. by @Junnplus, (#4370)
  • databend-query: refactor the comparison function by using ScalarBinaryExpression by @zhyass, (#4285)
  • databend-query: add CALL command: impl CALL syntax parser, add Trait for system function. by @Junnplus, (#4315)

Performance Improvement

  • datablocks&datavalues: support nullable group by: improved by 60% on the second query with metacache. by @sundy-li, (#4340)
  • datablocks: use `SmallVec` to improve HashMethod Serialize: improved by 30%~50% on the ontime dataset. by @sundy-li, (#4353)
  • datavalues: Simd Selection of column filter: improved by ~25%. by @platoneko, (#4271)

Bug fixes

Tips

Let's learn a weekly tip from Databend.

RFC: Semi-structured Data Types

Semi-structured data types are used to represent schemaless data formats, such as JSON, XML, and so on. In order to be compatible with Snowflake's SQL syntax, we plan to support the following three semi-structured data types:

  • Variant: A tagged universal type, which can store values of any other type, including Object and Array.
  • Object: Used to represent collections of key-value pairs, where the key is a non-empty string, and the value is a value of Variant type.
  • Array: Used to represent dense or sparse arrays of arbitrary size, where the index is a non-negative integer (up to 2^31-1), and values are Variant types.

For more information see Semi-structured data types design. For ongoing work see databend#4348.
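
To make the relationship between the three types concrete, here is a minimal sketch in Python. It is not Databend's implementation (which is written in Rust and described in the design document); the `classify` helper and its return labels are hypothetical and only mirror the rules listed above: an Object needs non-empty string keys, an Array index cannot exceed 2^31-1, and every nested value must itself be a valid Variant.

```python
import json

# Largest array index mentioned in the type list above.
MAX_ARRAY_INDEX = 2**31 - 1


def classify(value):
    """Return the most specific semi-structured type name for a parsed JSON value.

    Hypothetical helper for illustration only; not part of Databend.
    """
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str) or key == "":
                raise ValueError("Object keys must be non-empty strings")
            classify(item)  # every value must itself be a valid Variant
        return "Object"
    if isinstance(value, list):
        if len(value) > MAX_ARRAY_INDEX + 1:
            raise ValueError("Array index out of range")
        for item in value:
            classify(item)  # every element must itself be a valid Variant
        return "Array"
    if value is None or isinstance(value, (bool, int, float, str)):
        return "Variant"  # scalar stored directly in the tagged universal type
    raise TypeError(f"unsupported value: {value!r}")


if __name__ == "__main__":
    doc = json.loads('{"user": "databend", "tags": ["olap", "cloud"], "stars": null}')
    print(classify(doc))          # Object
    print(classify(doc["tags"]))  # Array
    print(classify(doc["user"]))  # Variant
```

Objects and Arrays are also Variants; the sketch simply reports the most specific label that applies.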

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #31

Databend aims to be an open source, elastic, and reliable cloud warehouse. It offers blazing fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

  • databend-meta: add rename_table for meta api: remove old table record and add a new one in a sled-tree-transaction. by @Junnplus, (#4288)
  • databend-query: support COPY from external location(S3): load files from a storage location (Amazon S3). by @BohuTANG, (#4170 & #4241)
  • bendctl: add package purge subcommand by @linyihai, (#4245)

Improvement

  • databend-query: use database_id/table_id as data file prefix: no more having all parquet files in one big directory. by @dantengsky, (#4273)
  • databend-query: support aggregate sum/avg booleans by @everpcpc, (#4237)
  • databend-query: zero extra cost of async trait: use the GAT feature instead of #[async_trait] in some hot paths. by @sundy-li, (#4269)
  • databend-query: implement new processor for system tables & github & memory & null engine by @zhang2014, (#4166 & #4272)

Build/Test/CI

Bug fixes

  • dal_context: use ObserveReader to calculate metrics: calculate the read cost time correctly. by @Xuanwo, (#4298)

Tips

Let's learn a weekly tip from Databend.

Loading files from S3 External Location

Thanks to #4170 and #4241, Databend can now load CSV files from an S3 external location.

Here's an example of reading 5 lines from a file:

COPY INTO ontime FROM 's3://databend-external/t_ontime/t_ontime.csv'
    CREDENTIALS=(aws_key_id='<your-key-id>' aws_secret_key='<your-secret-key>')
    FILE_FORMAT = (type = "CSV" field_delimiter = '\t'  record_delimiter = '\n' skip_header = 1)
    SIZE_LIMIT=5; /* only read 5 rows */

/* Check. */
SELECT * FROM ontime;

Note To learn more, check out:

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Ecosystem/Upstream

From open source, for open source. Our team is also committed to contributing to the Rust ecosystem and upstream dependencies.

Upstream

OpenDAL announced v0.1

We announced that Open Data Access Layer is in v0.1. Let's look at what's different together.

You are welcome to use OpenDAL (github, crates.io) to connect your data and applications.

Big changes are happening at OpenRaft

Although no new releases have been made recently, we have seen some important changes with @schreter's help.

  • Refactor storage APIs to allow clearer ownership of data. (openraft#199)
  • Make NodeId type configurable via RaftTypeConfig. (openraft#220)

Feel free to visit Insights/Pulse to see all the changes. It would be great if you tried them out in advance.

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #30

Databend aims to be an open source, elastic, and reliable cloud warehouse. It offers blazing fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

Improvement

In particular, with the merging of #4200 (rename datavalues2 to datavalues), we now have a whole new set of datavalues and have successfully migrated all the relevant code.

Bug fixes

Tips

Let's learn a weekly tip from Databend.

How to eliminate OOM at build time

Databend has one large crate that manages and implements most of its functions. As a result, the following error may be reported at build time:

(signal: 9, SIGKILL: kill)
warning: build failed, waiting for other jobs to finish...
error: build failed.

We observed that this is mainly caused by running out of memory during linking. Many large Rust projects are likely to face the same problem, so let's see how to solve it.

  1. Use a better linker. Modern linkers like mold are not only faster but also more memory friendly, which can reduce this problem to some extent.
  2. Enable the newer symbol mangling scheme. In the latest nightly Rust, this means passing -C symbol-mangling-version=v0. It generates smaller symbols, and we observed a large reduction in memory usage.
  3. Consider allocating more virtual memory. This is certainly a valid approach, but it may require adding another dozen GiB.
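
The first two options can be combined in a Cargo configuration file. The fragment below is a sketch under stated assumptions, not Databend's actual build configuration: it assumes a Linux x86_64 host with clang and mold installed, and a Rust toolchain that accepts the symbol mangling flag named in item 2. Adjust the target triple to match your machine.

```toml
# .cargo/config.toml (illustrative; the target triple is an assumption)
[target.x86_64-unknown-linux-gnu]
# Drive the link step through clang so that -fuse-ld can select mold.
linker = "clang"
rustflags = [
    # Link with mold instead of the default linker.
    "-C", "link-arg=-fuse-ld=mold",
    # Use the v0 symbol mangling scheme.
    "-C", "symbol-mangling-version=v0",
]
```

If the build is still killed after this, lowering the number of parallel jobs with `cargo build -j <N>` can also reduce peak memory, at the cost of a longer build.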

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #29

Databend aims to be an open source, elastic, and reliable cloud warehouse. It offers blazing fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

Improvement

Performance Improvement

Bug fixes

Tips

Let's learn a weekly tip from Databend.

Announce OpenDAL

Open Data Access Layer that connects the whole world together.

  • General: designed for any workload, not only for Databend.
  • Zero-Overhead: Using this lib is just like using the native SDK.
  • Easy to understand: Both for using and implementing.

This project has now been separated from Databend and is offered as a separate project -> datafuselabs/opendal.

Please see proposal: Vision of Databend DAL for the vision of OpenDAL.

See its change history details in #3677.

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Ecosystem/Upstream

From open source, for open source. Our team is also committed to contributing to the Rust ecosystem and upstream dependencies.

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.

This week in Databend #28

Databend aims to be an open source, elastic, and reliable cloud warehouse. It offers blazing fast queries, combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Big changes

Below is a list of some major changes that we don't want you to miss.

Features

Improvement

Experimental

Tips

Let's learn a weekly tip from Databend.

Databend Release & Maintenance

Release and maintenance are routine tasks for the Databend team. Let's take a look at how these processes work.

Release channels

Databend's release process follows the 'release train' model used by e.g. Rust, Firefox and Chrome, as well as 'feature staging'.

Databend is in its early stages, and for now we only update the nightly version number. We also release a minor version periodically, roughly every 6 weeks, depending on the iteration cycle.

For more information, please see databend.rs - Databend release channels.

Routine maintenance

Databend always tries to use newer toolchains and dependencies to ensure performance and reliability.

We currently have almost a thousand dependencies, and upgrading them has required manual intervention. Ideally, we would introduce a proper automatic batch upgrade mechanism with a rollback scheme.

For more information, please see databend.rs - Databend routine maintenance.

Changelogs

You can check the changelogs of Databend nightly to learn about our latest developments.

Meet Us

Please join the DatafuseLabs Community if you are interested in Databend.

We are looking forward to seeing you try our code. We have a strong team behind you to ensure a smooth experience in trying our code for your projects. If you are a hacker passionate about database internals, feel free to play with our code.

You can submit issues for any problems you find. We also highly appreciate any of your pull requests.