Skip to content

Use UUID's for CFDataset's unique_id property#78

Open
jameshiebert wants to merge 1 commit into
masterfrom
unique-newyork
Open

Use UUID's for CFDataset's unique_id property#78
jameshiebert wants to merge 1 commit into
masterfrom
unique-newyork

Conversation

@jameshiebert
Copy link
Copy Markdown
Contributor

Constructing the unique_id from attributes of the dataset leads to problems. One can have multiple datasets that end up with the same unique_id (for example datasets that have a different spatial domain, but everything else is the same). When generated according to the standard methods, UUIDs are for practical purposes unique.

Constructing the unique_id from attributes of the dataset leads to
problems. One can have multiple datasets that end up with the same
unique_id (for example datasets that have a different spatial domain,
but everything else is the same). When generated according to the
standard methods, UUIDs are for practical purposes unique.
Copy link
Copy Markdown
Collaborator

@rod-glover rod-glover left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a problem here. See comment.

Comment thread nchelpers/__init__.py

return unique_id.replace('+', '-') # In original code, but why?
"""A unique id for this file"""
return str(uuid.uuid4())
Copy link
Copy Markdown
Collaborator

@rod-glover rod-glover Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generates a different random UUID every time it is executed. Which means that every time a program that uses nchelpers is run, a different UUID will be generated for any given dataset (file).

But I think the way we use unique_id relies on it being stable across time.

Possible problematic use case: Indexing. When a file is indexed in any given run, its unique_id will be generated (only once during that indexing run, but as a random value). Then when the same file is re-indexed in a later indexing run (it has moved location, say), its unique_id is generated, but it is by definition different than the first one. If the indexer (or anything else) tries to look up this file according to its unique_id, it will get incorrect results. Essentially, if anything (e.g., a modelmeta database) has a memory of a past unique_id for a file, it will be wrong the next time.

Copy link
Copy Markdown
Collaborator

@rod-glover rod-glover Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, however, this argument also can be used against using metadata to generate the unique_id, since reindexing could in principle include reindexing a data file with slightly different metadata. But I think that case is implicitly excluded; the idea was that the metadata uniquely identified the file. We have an identity crisis here (pun, but serious too). What is it that uniquely characterizes a file?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was concerned about that too, so thank you for pointing this out.

Admittedly, I filed this PR more as provocation than as a finished product.

Maybe it would be better to change this from a (assumed to be static) property to a method (e.g. create_unique_id()), so users of it have to be intentional about when they want a new one (for example, during indexing) and when they want the existent one.

Copy link
Copy Markdown
Collaborator

@rod-glover rod-glover Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds more robust.

Question is how to generate the unique id and, if it is not a deterministic computation based on the file contents, how to persist it. One solution is to make method create_unique_id() write its value to the metadata (attribute unique_id?) in the file, and for property unique_id to read it from there. If we want to emphasize "don't screw with this", then maybe name the attribute __unique_id__ or something similarly repellent.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a great idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants