Use UUID's for CFDataset's unique_id property#78
Conversation
Constructing the unique_id from attributes of the dataset leads to problems. One can have multiple datasets that end up with the same unique_id (for example datasets that have a different spatial domain, but everything else is the same). When generated according to the standard methods, UUIDs are for practical purposes unique.
6fcbad1 to
a2d400f
Compare
rod-glover
left a comment
There was a problem hiding this comment.
I think there is a problem here. See comment.
|
|
||
| return unique_id.replace('+', '-') # In original code, but why? | ||
| """A unique id for this file""" | ||
| return str(uuid.uuid4()) |
There was a problem hiding this comment.
This generates a different random UUID every time it is executed. Which means that every time a program that uses nchelpers is run, a different UUID will be generated for any given dataset (file).
But I think the way we use unique_id relies on it being stable across time.
Possible problematic use case: Indexing. When a file is indexed in any given run, its unique_id will be generated (only once during that indexing run, but as a random value). Then when the same file is re-indexed in a later indexing run (it has moved location, say), its unique_id is generated, but it is by definition different than the first one. If the indexer (or anything else) tries to look up this file according to its unique_id, it will get incorrect results. Essentially, if anything (e.g., a modelmeta database) has a memory of a past unique_id for a file, it will be wrong the next time.
There was a problem hiding this comment.
Hmm, however, this argument also can be used against using metadata to generate the unique_id, since reindexing could in principle include reindexing a data file with slightly different metadata. But I think that case is implicitly excluded; the idea was that the metadata uniquely identified the file. We have an identity crisis here (pun, but serious too). What is it that uniquely characterizes a file?
There was a problem hiding this comment.
Yeah, I was concerned about that too, so thank you for pointing this out.
Admittedly, I filed this PR more as provocation than as a finished product.
Maybe it would be better to change this from a (assumed to be static) property to a method (e.g. create_unique_id()), so users of it have to be intentional about when they want a new one (for example, during indexing) and when they want the existent one.
There was a problem hiding this comment.
That sounds more robust.
Question is how to generate the unique id and, if it is not a deterministic computation based on the file contents, how to persist it. One solution is to make method create_unique_id() write its value to the metadata (attribute unique_id?) in the file, and for property unique_id to read it from there. If we want to emphasize "don't screw with this", then maybe name the attribute __unique_id__ or something similarly repellent.
There was a problem hiding this comment.
Yeah, that's a great idea.
Constructing the
unique_idfrom attributes of the dataset leads to problems. One can have multiple datasets that end up with the same unique_id (for example datasets that have a different spatial domain, but everything else is the same). When generated according to the standard methods, UUIDs are for practical purposes unique.