Figure out what to do with `table_column` catalog table and bulk schema loading in general

Currently we don't really use our [`Schema`](https://github.com/splitgraph/seafowl/blob/40b1158a90121422e66acbc66e4d536f6081b6d7/src/schema.rs#L10-L13) for anything except the `to_column_names_types` call made when persisting columns to the `table_column` metadata table. So it's possible to remove `Schema` altogether and work with the underlying `arrow_schema` directly (though that logic could be extracted to a separate function).

On a more general level, we also currently don't read anything from our `table_column` catalog table. When fetching the schema for a given table, whether for `information_schema.columns` or when calling `TableProvider::schema` somewhere in code (which is what [DF uses](https://github.com/apache/arrow-datafusion/blob/19bdcdc4140f0b36023626195d84bfbf970b752d/datafusion/core/src/catalog/information_schema.rs#L157) internally for `information_schema.columns` queries as well), we always rely on the Delta table's schema, which is ultimately [reconstructed](https://github.com/delta-io/delta-rs/blob/733b5ffdde99cbfb256b8f69ae4529aeb1174599/crates/deltalake-core/src/table/state.rs#L251-L254) from the logs. `information_schema.columns` in particular will pose a problem at some point, since every such query has to load all Delta tables one by one; see the comment here:
https://github.com/splitgraph/seafowl/blob/40b1158a90121422e66acbc66e4d536f6081b6d7/src/catalog.rs#L285-L293

The solution I outlined in that comment amounts to adding the ability to bulk-load Delta table schemas (which would involve changes in delta-rs and probably datafusion). A potentially better solution is for us to thinly wrap the Delta table inside our own table, use our own (bulk-loaded) catalog info in `TableProvider::schema`, and resolve only `TableProvider::scan` calls using the wrapped Delta table. The main drawback is the potential for mismatch between, and double tracking of, schemas (in our catalog and in the Delta logs), which might not be that bad.
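
To make the wrapping idea concrete, here is a rough, std-only sketch. Everything in it is a hypothetical stand-in: `Provider` mimics only the two relevant methods of DataFusion's `TableProvider` (the real trait is async and has a different signature), `FakeDelta` stands in for a lazily loaded `DeltaTable`, and `ColumnRow` is an assumed, simplified shape for a `table_column` row. It only illustrates the control flow: schemas come from one bulk catalog read, and the wrapped table is touched only on scan.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Assumed, simplified shape of one `table_column` catalog row.
#[derive(Debug, Clone)]
pub struct ColumnRow {
    pub table_name: String,
    pub name: String,
    pub data_type: String,
}

/// Stand-in for an Arrow schema: ordered (column name, type) pairs.
pub type Schema = Vec<(String, String)>;

/// Stand-in for the two `TableProvider` methods discussed above.
pub trait Provider {
    fn schema(&self) -> Arc<Schema>;
    fn scan(&self) -> Result<Vec<String>, String>;
}

/// Stand-in for a Delta table; counts how often its "log" is read.
pub struct FakeDelta {
    pub log_loads: Arc<AtomicUsize>,
    pub log_schema: Arc<Schema>,
}

impl Provider for FakeDelta {
    fn schema(&self) -> Arc<Schema> {
        // Pretend the schema has to be reconstructed from the log.
        self.log_loads.fetch_add(1, Ordering::SeqCst);
        Arc::clone(&self.log_schema)
    }

    fn scan(&self) -> Result<Vec<String>, String> {
        // Scanning needs the log as well; return column names as a dummy result.
        self.log_loads.fetch_add(1, Ordering::SeqCst);
        Ok(self.log_schema.iter().map(|(n, _)| n.clone()).collect())
    }
}

/// Thin wrapper: schema from our catalog, scans delegated to the inner table.
pub struct CatalogBackedTable<T> {
    pub catalog_schema: Arc<Schema>,
    pub inner: T,
}

impl<T: Provider> Provider for CatalogBackedTable<T> {
    fn schema(&self) -> Arc<Schema> {
        // Never touches the inner table, so no log is loaded here.
        Arc::clone(&self.catalog_schema)
    }

    fn scan(&self) -> Result<Vec<String>, String> {
        self.inner.scan()
    }
}

/// Group the rows of a single bulk catalog query into per-table schemas.
/// Column order follows row order.
pub fn group_schemas(rows: Vec<ColumnRow>) -> HashMap<String, Arc<Schema>> {
    let mut grouped: HashMap<String, Schema> = HashMap::new();
    for row in rows {
        grouped
            .entry(row.table_name)
            .or_default()
            .push((row.name, row.data_type));
    }
    grouped.into_iter().map(|(t, s)| (t, Arc::new(s))).collect()
}
```

In this sketch, the double-tracking drawback shows up as `catalog_schema` and `log_schema` being two independent values; a real wrapper would presumably need to decide whether to compare them once the inner table is finally loaded.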

There's also a minor matter of format: currently we store the fields using the [unofficial Arrow JSON representation](https://github.com/apache/arrow-rs/blob/ef6932f31e243d8545e097569653c8d3f1365b4d/arrow-integration-test/src/field.rs#L266-L295), while our storage layer has its own [schema/field types](https://github.com/delta-io/delta-rs/blob/main/crates/deltalake-core/src/kernel/schema.rs). There's also a possibility that we'll want to introduce our own field format (to facilitate better compatibility with Postgres?), in which case wrapping the Delta table would make even more sense.

	// Build a delta table but don't load it yet; we'll do that only for tables that are
	// actually referenced in a statement, via the async `table` method of the schema provider.
	// TODO: this means that any `information_schema.columns` query will serially load all
	// delta tables present in the database. The real fix for this is to make DF use `TableSource`
	// for the information schema, and then implement `TableSource` for `DeltaTable` in delta-rs.
	let table_log_store = self.object_store.get_log_store(table_uuid);

	let table = DeltaTable::new(table_log_store, Default::default());
	(Arc::from(table_name.to_string()), Arc::new(table) as _)

Figure out what to do with `table_column` catalog table and bulk schema loading in general #475