-
Notifications
You must be signed in to change notification settings - Fork 2
Use UUID's for CFDataset's unique_id property #78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
jameshiebert
wants to merge
1
commit into
master
Choose a base branch
from
unique-newyork
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generates a different random UUID every time it is executed. Which means that every time a program that uses
nchelpersis run, a different UUID will be generated for any given dataset (file).But I think the way we use
unique_idrelies on it being stable across time.Possible problematic use case: Indexing. When a file is indexed in any given run, its
unique_idwill be generated (only once during that indexing run, but as a random value). Then when the same file is re-indexed in a later indexing run (it has moved location, say), itsunique_idis generated, but it is by definition different than the first one. If the indexer (or anything else) tries to look up this file according to itsunique_id, it will get incorrect results. Essentially, if anything (e.g., a modelmeta database) has a memory of a pastunique_idfor a file, it will be wrong the next time.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, however, this argument also can be used against using metadata to generate the unique_id, since reindexing could in principle include reindexing a data file with slightly different metadata. But I think that case is implicitly excluded; the idea was that the metadata uniquely identified the file. We have an identity crisis here (pun, but serious too). What is it that uniquely characterizes a file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was concerned about that too, so thank you for pointing this out.
Admittedly, I filed this PR more as provocation than as a finished product.
Maybe it would be better to change this from a (assumed to be static) property to a method (e.g.
create_unique_id()), so users of it have to be intentional about when they want a new one (for example, during indexing) and when they want the existent one.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds more robust.
Question is how to generate the unique id and, if it is not a deterministic computation based on the file contents, how to persist it. One solution is to make method
create_unique_id()write its value to the metadata (attributeunique_id?) in the file, and for propertyunique_idto read it from there. If we want to emphasize "don't screw with this", then maybe name the attribute__unique_id__or something similarly repellent.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a great idea.