Description
Currently we're not really using our `Schema` for anything but the `to_column_names_types` call when persisting the columns to the `table_column` metadata table. So it's possible to remove that `Schema` altogether and just use the underlying `arrow_schema` call (though that logic could be extracted into a separate function).
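For illustration, a standalone helper along these lines could derive the persisted column info straight from the Arrow schema; the function name and the serialized type representation here are hypothetical placeholders, not the format we actually store:

```rust
use datafusion::arrow::datatypes::SchemaRef;

/// Hypothetical replacement for `Schema::to_column_names_types`: build the
/// (column name, column type) pairs to persist into `table_column` directly
/// from the Arrow schema, with no wrapper struct involved.
fn column_names_types(schema: &SchemaRef) -> Vec<(String, String)> {
    schema
        .fields()
        .iter()
        .map(|field| {
            (
                field.name().clone(),
                // Placeholder serialization; the current code stores the
                // unofficial Arrow JSON representation of the field instead
                // (see the note on formats at the end).
                field.data_type().to_string(),
            )
        })
        .collect()
}
```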
On a more general level, we also currently don't use anything from our `table_column` catalog table. When fetching a schema for a given table, such as in `information_schema.columns` or when calling `TableProvider::schema` somewhere in the code (which is what DF uses for `information_schema.columns` queries internally as well), we always rely on the Delta table's schema, which is ultimately reconstructed from the logs. `information_schema.columns` in particular will pose a problem at some point, see the snippet below:
Lines 285 to 293 in 40b1158:

```rust
// Build a delta table but don't load it yet; we'll do that only for tables that are
// actually referenced in a statement, via the async `table` method of the schema provider.
// TODO: this means that any `information_schema.columns` query will serially load all
// delta tables present in the database. The real fix for this is to make DF use `TableSource`
// for the information schema, and then implement `TableSource` for `DeltaTable` in delta-rs.
let table_log_store = self.object_store.get_log_store(table_uuid);
let table = DeltaTable::new(table_log_store, Default::default());
(Arc::from(table_name.to_string()), Arc::new(table) as _)
```
The solution I outlined in that comment really amounts to adding the ability to bulk-load Delta table schemas (which would involve changes in delta-rs and probably DataFusion). A potentially better solution is for us to thinly wrap the Delta table inside our own table, use our own (bulk-loaded) catalog info in `TableProvider::schema`, and only resolve `TableProvider::scan` via the wrapped Delta table, as sketched below. The main drawback there is the potential mismatch and double-tracking of schemas (in our catalog and in the Delta logs), which might not be that bad.
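A minimal sketch of what such a wrapper could look like; the type name `CatalogBackedDeltaTable` and its fields are hypothetical, and the exact `TableProvider` method signatures vary between DataFusion versions:

```rust
use std::any::Any;
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::datasource::{TableProvider, TableType};
use datafusion::error::Result;
use datafusion::execution::context::SessionState;
use datafusion::logical_expr::Expr;
use datafusion::physical_plan::ExecutionPlan;
use deltalake::DeltaTable;

/// Hypothetical wrapper: the schema comes from our own catalog, while scans
/// are delegated to the wrapped (lazily loaded) Delta table.
struct CatalogBackedDeltaTable {
    /// Schema bulk-loaded from our `table_column` catalog table.
    catalog_schema: SchemaRef,
    /// The wrapped Delta table; loading its log can be deferred until a scan.
    inner: Arc<DeltaTable>,
}

#[async_trait]
impl TableProvider for CatalogBackedDeltaTable {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema(&self) -> SchemaRef {
        // Answered from the catalog, so `information_schema.columns` never
        // needs to touch the Delta log.
        self.catalog_schema.clone()
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    async fn scan(
        &self,
        state: &SessionState,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // Only an actual scan hits the Delta table (in practice we'd also have
        // to make sure the table is loaded by this point).
        self.inner.scan(state, projection, filters, limit).await
    }
}
```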
There's also a minor matter of format: currently we store the fields using the unofficial Arrow JSON representation, while our storage layer has its own schema/field types. There's also a possibility we'll want to introduce our own field format (to facilitate better compatibility with Postgres?), so wrapping the Delta table would make even more sense in that case.