refactor(io): use the registry pattern for IO schemes#709

Open
alessandro-nori wants to merge 9 commits into apache:main from alessandro-nori:io_registry_single_module

Conversation

@alessandro-nori
Contributor

@alessandro-nori alessandro-nori commented Feb 2, 2026

Related to #696

This PR introduces a registry pattern for IO implementations, similar to the existing catalog package pattern.

Moved all cloud storage implementations to io/gocloud.

Extra notes

I decided to use a single subpackage because all the existing implementations use the same dependency, and it's easier to import just one package to register all of them. However, I think most of the integration tests only use s3, so multiple subpackages would also work fine.

@alessandro-nori alessandro-nori force-pushed the io_registry_single_module branch from 3ed4082 to 24273ec Compare February 2, 2026 13:04
@alessandro-nori alessandro-nori changed the title Io registry single module IO Registry Feb 2, 2026
@alessandro-nori alessandro-nori marked this pull request as ready for review February 2, 2026 13:27
io/registry.go Outdated
if factory == nil {
	panic("io: Register factory is nil")
}
defaultRegistry.set(scheme, factory)
Contributor

Should we have something here to handle the case where something is already registered for a given scheme? In the current version, it looks like a user importing two packages that register themselves on the same scheme could get surprising behavior.

For reference, database/sql actually checks for prior existence of a driver and panics in case of a duplicate registration.
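
A minimal sketch of the duplicate check described above, modeled on how database/sql.Register behaves. The Factory signature, the factories map, and the tryRegister helper are hypothetical simplifications and are not taken from this PR. Synchronization is left out on purpose so the sketch shows only the check itself.

```go
package main

import "fmt"

// Factory is a hypothetical, simplified stand-in for the PR's factory type.
type Factory func(location string) (string, error)

// factories maps a URL scheme to its factory. No locking here: this sketch
// assumes registration happens from a single goroutine (for example in init).
var factories = map[string]Factory{}

// Register panics on a nil factory or on a scheme that is already taken,
// in the same spirit as database/sql.Register.
func Register(scheme string, f Factory) {
	if f == nil {
		panic("io: Register factory is nil")
	}
	if _, dup := factories[scheme]; dup {
		panic("io: Register called twice for scheme " + scheme)
	}
	factories[scheme] = f
}

// tryRegister turns the panic into an error so the outcome can be inspected.
func tryRegister(scheme string, f Factory) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	Register(scheme, f)
	return nil
}

func main() {
	f := func(location string) (string, error) { return "opened " + location, nil }
	fmt.Println(tryRegister("mem", f)) // first registration succeeds
	fmt.Println(tryRegister("mem", f)) // second one reports the duplicate
}
```

Panicking makes a conflicting import fail loudly at startup instead of silently replacing an implementation.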

Contributor Author

@alessandro-nori alessandro-nori Feb 4, 2026

That's a fair point. I just mirrored what was done in the catalog package, but I think this is worth adding here.

Member

This makes sense. Currently the stated semantics of the catalog registry are that registering something overwrites anything already registered for that catalog type, but for file IO it might make sense to be a bit more strict.

Contributor Author

Thanks Alex and Matt!
I implemented the change in ec4790b

Contributor

Does it make sense to handle this atomically? Right now, two goroutines can register the same scheme concurrently and both skip the panic.

Member

One solution would be for the mutex to be in the Register and Unregister functions rather than in the set and remove methods of the registry object.
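
A rough sketch of that suggestion: the registry methods do no locking, and Register and Unregister hold the mutex across the existence check and the write, so the two steps cannot interleave. The SchemeFactory signature is simplified, and Register returns an error instead of panicking only so the goroutines below can count outcomes. Both choices, along with the raceToRegister helper, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// SchemeFactory is a simplified, hypothetical signature.
type SchemeFactory func(location string) (string, error)

// registry methods are unsynchronized; callers must hold regMutex.
type registry map[string]SchemeFactory

func (r registry) set(scheme string, f SchemeFactory) { r[scheme] = f }
func (r registry) remove(scheme string)               { delete(r, scheme) }

var (
	regMutex        sync.Mutex
	defaultRegistry = registry{}
)

// Register keeps the lock for both the check and the insert.
func Register(scheme string, f SchemeFactory) error {
	regMutex.Lock()
	defer regMutex.Unlock()
	if _, ok := defaultRegistry[scheme]; ok {
		return fmt.Errorf("io: scheme %q already registered", scheme)
	}
	defaultRegistry.set(scheme, f)
	return nil
}

// Unregister removes a scheme under the same lock.
func Unregister(scheme string) {
	regMutex.Lock()
	defer regMutex.Unlock()
	defaultRegistry.remove(scheme)
}

// raceToRegister starts n goroutines that all register the same scheme
// and returns how many of them succeeded.
func raceToRegister(scheme string, n int) int {
	var (
		wg      sync.WaitGroup
		countMu sync.Mutex
		wins    int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if Register(scheme, func(string) (string, error) { return "", nil }) == nil {
				countMu.Lock()
				wins++
				countMu.Unlock()
			}
		}()
	}
	wg.Wait()
	return wins
}

func main() {
	fmt.Println("successful registrations:", raceToRegister("mem", 50))
}
```

If the lock were taken only inside set, two goroutines could both pass the existence check before either one writes, which is the race raised in the comment above.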

Comment on lines 55 to 57
gocloud.S3Region: "us-east-1",
gocloud.S3AccessKeyID: "admin",
gocloud.S3SecretAccessKey: "password",
Member

Since these properties tend to be shared across IO implementations, we should probably leave these constants in the io package rather than moving them down to the gocloud package. The assumption is that the properties the catalog looks for should not depend on the IO implementation being used.

Contributor Author

Done in e18695d
🙇

Comment on lines +64 to +68
icebergio.Register("mem", func(ctx context.Context, parsed *url.URL, props map[string]string) (icebergio.IO, error) {
	bucket := memblob.OpenBucket(nil)

	return createBlobFS(ctx, bucket, defaultKeyExtractor(parsed.Host)), nil
})
Member

I don't think we had any tests for mem, can we add some?

Contributor Author

Sure, I added some basic tests to create, write, read and delete an in-memory file
068d8cf

@alessandro-nori alessandro-nori force-pushed the io_registry_single_module branch from 5411291 to 068d8cf Compare February 6, 2026 11:04
// under the License.

package io
package gocloud
Contributor

Due to the nesting, the integration tests are no longer running in CI.

Contributor Author

@alessandro-nori alessandro-nori Feb 10, 2026

Thanks 😬 ! Fixed it in 4556800

Contributor

Looks good 👍🏻

@github-actions github-actions bot added the INFRA label Feb 10, 2026
Member

@zeroshade zeroshade left a comment

Just the one nitpick on atomically handling register/unregister

Otherwise this looks good to me!

@zeroshade zeroshade changed the title IO Registry refactor(io): use the registry pattern for IO schemes Feb 10, 2026
io/registry.go Outdated
type registry map[string]SchemeFactory

var (
regMutex sync.Mutex
Contributor

It might be better to use sync.RWMutex.
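
To illustrate what switching to sync.RWMutex could look like: writers (registration) take the exclusive lock, while lookups take the shared read lock, so concurrent lookups do not block one another. The lookup helper, the simplified SchemeFactory signature, and the plain-map registry are hypothetical and only approximate the PR's code.

```go
package main

import (
	"fmt"
	"sync"
)

// SchemeFactory is a simplified, hypothetical signature.
type SchemeFactory func(location string) (string, error)

var (
	regMutex sync.RWMutex
	registry = map[string]SchemeFactory{}
)

// Register mutates the map, so it needs the exclusive lock.
func Register(scheme string, f SchemeFactory) {
	regMutex.Lock()
	defer regMutex.Unlock()
	registry[scheme] = f
}

// lookup only reads, so the shared lock is enough.
func lookup(scheme string) (SchemeFactory, bool) {
	regMutex.RLock()
	defer regMutex.RUnlock()
	f, ok := registry[scheme]
	return f, ok
}

func main() {
	Register("mem", func(location string) (string, error) {
		return "opened " + location, nil
	})

	// Several readers can hold the read lock at the same time.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if f, ok := lookup("mem"); ok {
				out, _ := f(fmt.Sprintf("mem://bucket/file-%d", id))
				fmt.Println(out)
			}
		}(i)
	}
	wg.Wait()
}
```

The benefit depends on the workload: registration is rare and lookups happen on every load, which is the read-heavy pattern RWMutex is meant for.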

io/io.go Outdated
}

var (
ErrIONotFound = errors.New("io scheme not registered")
Contributor

ErrIOSchemeNotFound?

io/registry.go Outdated
regMutex.Unlock()

if !ok {
return nil, fmt.Errorf("%w: %s", ErrIONotFound, parsed.Scheme)
Contributor

Previously the path was included in the error, but now it's removed. Is this a change we want? It could be useful for debugging.

Member

I agree, let's keep the path in the error

S3EndpointURL = "s3.endpoint"
S3ProxyURI = "s3.proxy-uri"
S3ConnectTimeout = "s3.connect-timeout"
S3SignerUri = "s3.signer.uri"
Contributor

Suggested change
S3SignerUri = "s3.signer.uri"
S3SignerURI = "s3.signer.uri"

GCSKeyPath = "gcs.keypath"
GCSJSONKey = "gcs.jsonkey"
GCSCredType = "gcs.credtype"
GCSUseJsonAPI = "gcs.usejsonapi" // set to anything to enable
Contributor

Suggested change
GCSUseJsonAPI = "gcs.usejsonapi" // set to anything to enable
GCSUseJSONAPI = "gcs.usejsonapi" // set to anything to enable

Comment on lines +44 to +49
AdlsSasTokenPrefix = "adls.sas-token."
AdlsConnectionStringPrefix = "adls.connection-string."
AdlsSharedKeyAccountName = "adls.auth.shared-key.account.name"
AdlsSharedKeyAccountKey = "adls.auth.shared-key.account.key"
AdlsEndpoint = "adls.endpoint"
AdlsProtocol = "adls.protocol"
Contributor

ADLS...

nit: these were here before so feel free to ignore but noting while we're on it


4 participants