1️⃣ Metastore (Tarchia) #529
-
Performance considerations: T-digest is slower than Distogram by about a factor of 10, and Distograms can be combined, with some enhancements (already written). For each field, record source information. Record individual values, up to 8 unique values. We can't estimate cardinality over discrete datasets at speed (we could use HLL, but it isn't fast enough), so don't try.
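A minimal sketch of the two behaviours described here, assuming a simple Ben-Haim/Tom-Tov style histogram rather than the Distogram library itself: per-blob field profiles that can be merged, and individual values tracked only up to 8 uniques before cardinality estimation is abandoned. The class, its limits, and the bin count are illustrative assumptions.

```python
# Illustrative sketch (not the Distogram library): a mergeable streaming
# histogram plus capped distinct-value tracking, for numeric fields only.

class FieldProfile:
    MAX_UNIQUE = 8    # record individual values up to 8 uniques, then give up
    MAX_BINS = 32     # histogram resolution (an assumed, tunable value)

    def __init__(self):
        self.bins = []        # sorted list of [centroid, count]
        self.uniques = set()  # becomes None once MAX_UNIQUE is exceeded

    def add(self, value, count=1):
        if self.uniques is not None:
            self.uniques.add(value)
            if len(self.uniques) > self.MAX_UNIQUE:
                self.uniques = None  # too many distincts: don't estimate cardinality
        self.bins.append([float(value), count])
        self._compress()

    def merge(self, other):
        # merging is concatenating bins and re-compressing, which is why
        # per-blob profiles can be combined cheaply after the fact
        merged = FieldProfile()
        merged.bins = self.bins + other.bins
        merged._compress()
        if self.uniques is None or other.uniques is None:
            merged.uniques = None
        else:
            union = self.uniques | other.uniques
            merged.uniques = union if len(union) <= self.MAX_UNIQUE else None
        return merged

    def _compress(self):
        # repeatedly fuse the two closest centroids until within budget
        self.bins.sort()
        while len(self.bins) > self.MAX_BINS:
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]
```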
-
Detach the physical path from the logical name, allowing tables to have more consistent naming without moving the existing files. Alias columns to provide better compatibility across systems. Fill default values for schema evolution, based on the catalogue, not page 1.
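As an illustration, a catalogue entry might carry all three of these ideas at once; the field names below are hypothetical, not a proposed schema.

```python
# Illustrative catalogue entry: the logical name is what queries reference,
# the physical path is where the blobs actually live (files never move), and
# defaults cover columns added by schema evolution. Field names are hypothetical.
table_entry = {
    "logical_name": "sales.orders",
    "physical_path": "gs://legacy-bucket/raw/orders_v2/",
    "columns": [
        {
            "name": "order_id",
            "aliases": ["orderId", "ORDER_ID"],  # compatibility across systems
            "type": "INTEGER",
        },
        {
            "name": "discount",
            "type": "DOUBLE",
            "default": 0.0,  # filled from the catalogue when older blobs lack it
        },
    ],
}
```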
-
We have three distinct collections.
We're writing a dataset catalogue first; we can realize the benefits in planning before we start to do any work with statistics. The binder should map which field comes from which table as a priority. This will allow pushdowns in more complex queries, which in turn will allow us to do blob pruning, maybe only BRIN / small unique-values pruning.
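A sketch of the binder mapping fields to their source tables, which is the piece that makes pushdown decisions possible; the structures and names are assumptions for illustration.

```python
# Sketch: once the binder knows which relation each field came from, a
# predicate can be routed to the scan of that relation (a pushdown).
# All table and column names here are illustrative.
field_sources = {
    "orders.customer_id": "orders",
    "customers.customer_id": "customers",
    "customers.region": "customers",
}

def pushdown_target(predicate_field: str):
    """Return the table whose scan should receive this predicate."""
    return field_sources.get(predicate_field)

# e.g. WHERE customers.region = 'EU' can be applied while reading the
# 'customers' blobs, before the join, enabling blob pruning on that scan:
assert pushdown_target("customers.region") == "customers"
```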
-
Can we use something like a trie to get distribution details for strings? We could try to have a trie no more than three layers deep and count at the nodes.
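One way the three-layer counting trie could look, as a sketch rather than a committed design: counts at the prefix nodes approximate the string distribution without storing the values themselves.

```python
from collections import defaultdict

class PrefixTrie:
    """Depth-capped prefix counter: an approximate string distribution."""

    def __init__(self, max_depth=3):
        self.max_depth = max_depth
        self.counts = defaultdict(int)  # prefix -> number of values seen

    def add(self, value: str):
        # count the value at each node (prefix) down to max_depth
        for depth in range(1, min(self.max_depth, len(value)) + 1):
            self.counts[value[:depth]] += 1

    def estimate(self, prefix: str) -> int:
        """Approximate how many values start with prefix (truncated to max_depth)."""
        return self.counts.get(prefix[: self.max_depth], 0)

trie = PrefixTrie()
for word in ("alpha", "alps", "beta"):
    trie.add(word)
assert trie.estimate("al") == 2
```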
-
Tarchia will be the third system in the suite.
Mabel (or Opteryx) adding a blob should trigger statistics being built for it; ANALYZE should read the blobs rather than Tarchia. In effect this is ad hoc stats creation and refresh.
Initial implementation should just be via the API; Opteryx can support it via CREATE statements later.
Mabel should trigger a call to Tarchia directly, via Cloud Tasks, and via PubSub, to add new blobs.
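As a sketch of the direct path, assuming Tarchia exposes an HTTP API: the endpoint, payload, and function name below are all hypothetical, and the same message could just as well be delivered through Cloud Tasks or PubSub.

```python
import requests

def notify_blob_added(dataset: str, blob_path: str) -> None:
    # Hypothetical call Mabel would make after writing a blob; the URL and
    # payload shape are illustrative assumptions, not Tarchia's actual API.
    requests.post(
        "https://tarchia.example.com/v1/blobs",
        json={"dataset": dataset, "blob": blob_path},
        timeout=10,
    )
```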
Statistics have three key purposes, listed further below. For example, the Optimizer will be able to take advantage of new facts when planning after binding, such as knowing the range of a column to push a read filter to a joined table.
To do these things we need volume and count information, the bounds of the values (BRIN), and distributions of the values (a histogram). For non-blob stores, this would be a single set of statistics for the entire dataset.
We should probably start with an `ANALYZE TABLE` query to create metadata, recorded per partition, with nodes per blob within that partition. For non-blob stores we should just have min, max, count and AVG numbers, as full profiling may be slow. We should also implement `SHOW STATISTICS FOR table` (see https://trino.io/docs/current/sql/show-stats.html) to recall them. This will break the temporal filters. We should use the KVStore model, with extra interfaces to filter results.
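A sketch of what the KVStore model with a filtering interface might look like; the method names are assumptions, not an existing interface.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterator, Tuple

class KVStore(ABC):
    """Keys are partition paths; values are the recorded statistics."""

    @abstractmethod
    def get(self, key: str) -> dict: ...

    @abstractmethod
    def set(self, key: str, value: dict) -> None: ...

    @abstractmethod
    def scan(self, prefix: str) -> Iterator[Tuple[str, dict]]: ...

    def scan_where(
        self, prefix: str, predicate: Callable[[dict], bool]
    ) -> Iterator[Tuple[str, dict]]:
        # the 'extra interface to filter results'; a relational backend
        # (MySQL/Postgres) could push this predicate into a WHERE clause
        return ((k, v) for k, v in self.scan(prefix) if predicate(v))
```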
A KVStore should be written for a relational store (MySQL / Postgres).
The store should be referenceable by partition path, and contain the blobs in the partition, the fields in the blobs, and the attributes of the columns, with counts held at the partition level. This then allows (illustrated in the sketch below):
1. Fast COUNT, MIN, MAX and AVG responses.
2. Pruning blobs from the query if they don't contain values covering the range being searched for.
3. Cardinality and distribution approximations for query optimization (join ordering, filter ordering).
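A sketch of these three uses with per-blob min/max/count statistics; the paths and the layout of the stats records are illustrative.

```python
# Hypothetical per-blob stats for one column, keyed by blob path.
blob_stats = {
    "part=2024-01/blob-0001.parquet": {"min": 1.0, "max": 40.0, "count": 1000},
    "part=2024-01/blob-0002.parquet": {"min": 55.0, "max": 90.0, "count": 800},
}

# 1. fast aggregates without touching the blobs
total = sum(s["count"] for s in blob_stats.values())        # COUNT
low = min(s["min"] for s in blob_stats.values())            # MIN
high = max(s["max"] for s in blob_stats.values())           # MAX

# 2. BRIN-style pruning: WHERE x > 50 only needs blobs whose max exceeds 50
candidates = [path for path, s in blob_stats.items() if s["max"] > 50.0]
assert candidates == ["part=2024-01/blob-0002.parquet"]

# 3. the same records feed cardinality/distribution estimates to the optimizer
```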
The metastore API should be as declarative as practical to allow for a wide range of implementations underneath, e.g. 'here are these blobs, these predicates and aggregates; you work out what to do.'
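The 'you work out what to do' idea as a hypothetical request payload; the shape is illustrative, not a proposed wire format.

```python
# Declarative request: the caller states which blobs it has and which
# predicates and aggregates it wants; the implementation decides whether
# stored statistics suffice or the blobs must be read.
request = {
    "blobs": [
        "part=2024-01/blob-0001.parquet",
        "part=2024-01/blob-0002.parquet",
    ],
    "predicates": [("price", ">", 50.0)],
    "aggregates": ["COUNT(*)", "MIN(price)", "MAX(price)"],
}
```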