A weave reference ("ref") has the following 2 formats:
- The W&B Artifact:
  `wandb-artifact:///{ENTITY}/{PROJECT}/{ARTIFACT_NAME}:{ALIAS}[/{FILE_PATH}[#REF_EXTRA]]`
  - Yes, the 3 forward slashes are correct.
- The Local Artifact:
  `local-artifact:///{ARTIFACT_NAME}:{ALIAS}[/{FILE_PATH}[#REF_EXTRA]]`
  - A known problem with local artifacts is that name collisions are possible between artifacts of the same name sourced from 2 different projects.
Path Component Details:
We will define the "CommonCharset" as alphanumeric characters, underscore (`_`), and dash (`-`).
- `ENTITY`: limited to CommonCharset
- `PROJECT`: limited to CommonCharset
- `ARTIFACT_NAME`: limited to CommonCharset
- `ALIAS`: can take one of the following forms:
  - "alias": limited to CommonCharset, for example `latest`
  - "version": `v#` where `#` is an integer value
  - "digest": a deterministic hex digest of the contents
  - "versionHash": a hex digest combining the "digest" with the prior version's digest in the sequence.
- `FILE_PATH`: (optional) a list of forward slash (`/`) separated `FILE_PATH_PART`s. Each `FILE_PATH_PART` is limited to CommonCharset and dot (`.`).
- `REF_EXTRA`: (optional, only allowed if `FILE_PATH` is present) a list of forward slash (`/`) separated `REF_EXTRA_TUPLE`s. A `REF_EXTRA_TUPLE` has the format `{REF_EXTRA_EDGE_TYPE}/{REF_EXTRA_PART}`, where `REF_EXTRA_EDGE_TYPE` is one of `ndx`, `key`, `atr`, `col`, `row`, `id`. A `REF_EXTRA_PART` is limited to CommonCharset.
  - Important: the `REF_EXTRA_EDGE_TYPE` of `id` is not yet implemented.
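The grammar above can be sketched as a small parser. This is a minimal illustration and not the Weave implementation: the names `parse_ref`, `ParsedRef`, and `REF_RE` are hypothetical, and the regular expression only encodes the character-set rules stated in this section.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

COMMON = r"[A-Za-z0-9_-]+"       # "CommonCharset"
FILE_PART = r"[A-Za-z0-9_.-]+"   # CommonCharset plus dot
EDGE_TYPES = {"ndx", "key", "atr", "col", "row", "id"}

# Hypothetical pattern derived from the two formats described above.
REF_RE = re.compile(
    rf"^(?P<scheme>wandb-artifact|local-artifact):///"
    rf"(?:(?P<entity>{COMMON})/(?P<project>{COMMON})/)?"
    rf"(?P<name>{COMMON}):(?P<alias>{COMMON})"
    rf"(?:/(?P<file_path>{FILE_PART}(?:/{FILE_PART})*)"
    rf"(?:#(?P<extra>{COMMON}(?:/{COMMON})*))?)?$"
)


@dataclass(frozen=True)
class ParsedRef:
    scheme: str
    entity: Optional[str]
    project: Optional[str]
    artifact_name: str
    alias: str
    file_path: Optional[str]
    extra: Tuple[Tuple[str, str], ...]  # (edge_type, part) pairs


def parse_ref(uri: str) -> ParsedRef:
    m = REF_RE.match(uri)
    if m is None:
        raise ValueError(f"not a valid ref: {uri!r}")
    # ENTITY/PROJECT appear only in the wandb-artifact format.
    is_wandb = m["scheme"] == "wandb-artifact"
    if is_wandb != (m["entity"] is not None):
        raise ValueError(f"entity/project do not match scheme in {uri!r}")
    tokens = m["extra"].split("/") if m["extra"] else []
    if len(tokens) % 2:
        raise ValueError("REF_EXTRA must be a list of EDGE_TYPE/PART pairs")
    pairs = tuple(zip(tokens[0::2], tokens[1::2]))
    for edge, _ in pairs:
        if edge not in EDGE_TYPES:
            raise ValueError(f"unknown REF_EXTRA_EDGE_TYPE: {edge!r}")
    return ParsedRef(m["scheme"], m["entity"], m["project"],
                     m["name"], m["alias"], m["file_path"], pairs)
```

Because `#REF_EXTRA` sits inside the optional `FILE_PATH` group, a ref that carries an extra without a file path is rejected, which matches the "only allowed if `FILE_PATH` is present" rule.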
When interpreting a reference, we apply the following rules:
- Look up the artifact itself using everything up to, but excluding, the `FILE_PATH`. If no `FILE_PATH` exists, then the reference points to an artifact and we halt.
- If `FILE_PATH` exists, then we fetch the file located at that path. There are two cases:
  - `FILE_PATH` exactly matches a member file of the artifact. In this case, the ref points to that specific file and we halt.
  - `FILE_PATH` is not contained in the artifact, but `FILE_PATH.type.json` is. In this case, the ref points to a "weave object". The Weave engine reads the `FILE_PATH.type.json` file to determine the type of the object. Reconstruction/deserialization of the object will often require reading 1 or more peer files; the rules for this are up to the type's implementation. By far the most common case is `FILE_PATH = "obj"`, where we have `obj.type.json` at the root of the artifact and a peer file, for example `obj.object.json`, containing the data payload itself. Note: the peer file needn't be called `obj.object.json`; this is up to the object type to determine. Importantly, this is the case where the ref points to a "weave object" and the file system is not important to the user. If no `REF_EXTRA` exists, we halt.
- If a `REF_EXTRA` exists (and by definition our `FILE_PATH` points to a weave object), then the `REF_EXTRA` tells us how to traverse the object itself to extract a nested data property. For example, you might have a class called `Model` with an attribute `prompt`. If the ref wants to point to the prompt field itself, the `REF_EXTRA` would be `atr/prompt`. As mentioned above, there are a number of `REF_EXTRA_EDGE_TYPE`s that allow the ref to point deep into the object. This is useful for things like datasets, where you might have a class called `Dataset` that has a property `rows` which is a list of dictionaries. At this point, we return the final data. The specific rules for `REF_EXTRA_EDGE_TYPE` are as follows:
  - If the current object is a Table (ArrowWeaveList), then:
    - `ndx/{INDEX}`: get the row at index `INDEX`
    - `col/{COLUMN_NAME}`: get the column called `COLUMN_NAME`
    - `id/{ID}`: get the row at id `ID`
  - If the current object is a Dict, then:
    - `key/{KEY}`: get the value at key `KEY`
  - If the current object is an Object, then:
    - `atr/{ATTRIBUTE}`: get the value at attribute `ATTRIBUTE`
  - If the current object is a List, then:
    - `ndx/{INDEX}`: get the item at position `INDEX`
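The file lookup and the edge rules above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Weave engine's code: `classify_file_path` and `traverse_extra` are hypothetical names, the artifact is modeled as a plain set of member file names, dict keys are assumed to be strings, and the Table (ArrowWeaveList) edges are left out because they depend on the real table type.

```python
from typing import Any, Set


def classify_file_path(members: Set[str], file_path: str) -> str:
    """Decide what FILE_PATH points to, given the artifact's member file names."""
    if file_path in members:
        return "file"            # exact match: the ref points to that file
    if f"{file_path}.type.json" in members:
        return "weave object"    # FILE_PATH.type.json describes the object
    raise FileNotFoundError(file_path)


def traverse_extra(obj: Any, extra: str) -> Any:
    """Follow a REF_EXTRA string such as 'atr/rows/ndx/10/key/input'."""
    tokens = extra.split("/") if extra else []
    if len(tokens) % 2:
        raise ValueError("REF_EXTRA must be a list of EDGE_TYPE/PART pairs")
    for edge, part in zip(tokens[0::2], tokens[1::2]):
        if edge == "key" and isinstance(obj, dict):
            obj = obj[part]                  # Dict: value at key
        elif edge == "ndx" and isinstance(obj, (list, tuple)):
            obj = obj[int(part)]             # List: item at position
        elif edge == "atr" and not isinstance(obj, (dict, list, tuple)):
            obj = getattr(obj, part)         # Object: value at attribute
        else:
            raise ValueError(
                f"edge {edge!r} does not apply to {type(obj).__name__}")
    return obj


# Hypothetical stand-in for a deserialized weave object.
class Dataset:
    def __init__(self, rows):
        self.rows = rows


ds = Dataset(rows=[{"input": "a"}, {"input": "b"}])
print(classify_file_path({"obj.type.json", "obj.object.json"}, "obj"))
print(traverse_extra(ds, "atr/rows/ndx/1/key/input"))
```

Raising on a mismatched edge (for example `key` applied to a list) keeps the traversal strict, so a malformed ref fails loudly instead of returning unrelated data.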
So putting this all together, the following ref (`wandb-artifact:///example_entity/example_project/example_artifact:abc123/obj#atr/rows/ndx/10/key/input`) should be interpreted as follows:
- Fetch the artifact corresponding to `example_entity/example_project/example_artifact:abc123` from W&B.
- Determine that `obj` is not a specific file entry but rather a "weave object".
- Get the `rows` property from the object (this could be a list or a table in this case).
- Get the row at index `10` (this is a dictionary).
- Get the value located at the `input` key.
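The same walk-through can be produced mechanically from the ref string. The helper below is a hypothetical sketch (`explain_ref` is not part of Weave) that uses the edge-type names defined earlier in this section (`atr`, `ndx`, `key`) and only describes the steps; it does not fetch anything.

```python
from typing import List

# Human-readable descriptions for the edge types whose rules are listed above.
EDGE_VERBS = {
    "atr": "get attribute",
    "key": "get the value at key",
    "ndx": "get the item at index",
    "col": "get the column called",
    "id": "get the row with id",
}


def explain_ref(uri: str) -> List[str]:
    scheme, rest = uri.split(":///", 1)
    path, _, extra = rest.partition("#")
    name, _, tail = path.partition(":")       # everything before the alias
    alias, _, file_path = tail.partition("/")  # alias, then optional FILE_PATH
    origin = "from W&B" if scheme == "wandb-artifact" else "locally"
    steps = [f"fetch artifact {name}:{alias} {origin}"]
    if file_path:
        steps.append(
            f"resolve {file_path} (a member file, or a weave object "
            f"via {file_path}.type.json)")
    tokens = extra.split("/") if extra else []
    for edge, part in zip(tokens[0::2], tokens[1::2]):
        verb = EDGE_VERBS.get(edge, f"follow {edge} edge to")
        steps.append(f"{verb} {part}")
    return steps


ref = ("wandb-artifact:///example_entity/example_project/"
       "example_artifact:abc123/obj#atr/rows/ndx/10/key/input")
for step in explain_ref(ref):
    print("-", step)
```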
Note: a careful reader will notice that the same piece of data might have multiple valid refs pointing to it. Consider the following case:
1. `wandb-artifact:///example_entity/example_project/example_artifact:abc123/obj#atr/rows/ndx/10/key/input`
2. `wandb-artifact:///example_entity/example_project/example_artifact:abc123/rows/0#ndx/10/key/input`
Both of these refs return exactly the same data (assuming that the `obj` object's `rows` property is a pointer to the `rows/0` entry). While this is perfectly valid, it has a problem: reference 2 breaks the relationship, as we no longer know that the data was derived from traversing into the dataset itself. If you are given reference 2, you have no way of knowing that it is actually a descendant member of `wandb-artifact:///example_entity/example_project/example_artifact:abc123/obj` (other than maybe by convention). Reference 2 is a byproduct of the serialization format, not the logical "thing" the user is working with. Therefore, when constructing refs, we always prefer reference 1. However, there is an important exception to this rule: if, during the object extra traversal, we "jump" to a completely new artifact, then we restart the ref there. This allows us to preserve the name of the object.
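One way to picture this construction rule is a helper that derives a child ref from its parent while traversing. This is a sketch under assumptions: `child_ref` and its `jumped_to` parameter are hypothetical names, and how a "jump" to another artifact is detected is left to the caller.

```python
from typing import Optional


def child_ref(parent_ref: str, edge_type: str, part: str,
              jumped_to: Optional[str] = None) -> str:
    """Ref for the value reached by following one edge from parent_ref.

    jumped_to is the ref of the object in the new artifact when following
    the edge crossed into a completely different artifact, else None.
    """
    if jumped_to is not None:
        # Exception to the rule: restart the ref at the new artifact.
        return jumped_to
    # Preferred form: extend the parent's REF_EXTRA so the lineage is kept.
    sep = "/" if "#" in parent_ref else "#"
    return f"{parent_ref}{sep}{edge_type}/{part}"


root = ("wandb-artifact:///example_entity/example_project/"
        "example_artifact:abc123/obj")
rows = child_ref(root, "atr", "rows")
row = child_ref(rows, "ndx", "10")
print(row)
```

Extending the parent's extra, instead of pointing at the serialized `rows/0` entry, is what keeps the derived ref recognizable as a descendant of `obj`.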
Further idea: we should probably add a `?hash=CONTENT_HASH` at the end of refs. This would allow us to know whether two entries in the same dataset actually have the same content. We can't rely purely on the artifact hash for uniqueness, since the ref could be pointing to a deep member of the artifact.
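As a rough illustration of this idea, the sketch below appends a content hash to a ref. Everything here is an assumption, since the text only proposes the suffix: the function name `with_content_hash`, the choice of SHA-256, and hashing a canonical JSON encoding of the value are all hypothetical.

```python
import hashlib
import json
from typing import Any


def with_content_hash(ref: str, value: Any) -> str:
    """Append ?hash=<digest> computed from the value the ref points to."""
    # Assumption: the value is JSON-serializable; sorted keys and fixed
    # separators give the same bytes for logically equal dictionaries.
    payload = json.dumps(value, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{ref}?hash={digest}"


base = ("wandb-artifact:///example_entity/example_project/"
        "example_artifact:abc123/obj#atr/rows/ndx/")
a = with_content_hash(base + "10", {"input": "hello", "label": 1})
b = with_content_hash(base + "11", {"label": 1, "input": "hello"})
# Two different rows with identical content share the same hash suffix.
print(a.split("?hash=")[1] == b.split("?hash=")[1])
```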