F006 specific back propagate #19

Open
tmsincomb wants to merge 59 commits into SciCrunch:dev from tmsincomb:master
Conversation

@tmsincomb
Collaborator

pipeline.sh and pipeline-old.sh export the same CSVs (tables to CSV within the same directory). They are not using the SPARC files, just the cassava pull and the CSVs from those links, to ingest a local database.

tgbugs and others added 30 commits February 18, 2025 17:05
these are in the instance_parent hierarchy even though they are
technically not material and not part of the physical derivation chain
these are needed for cases where the identifier for something is
implicit in the structure of the larger data container
was causing confusing error messages where it looked like something
that definitely was not a dataset was being reported as a dataset when
it was actually just not in the objects table at all
helps to avoid duplicate data among other things

the solution for supporting repeated measures and similar cases where
an instance might have multiple values for the same quantitative
descriptor is to add performances, at the moment the qd would have to
come from a separate object (even if the value was the same),
a mistake when i originally specced this out, i suspect
count distinct requires a slight variation so it is not implemented for now
limit needed because queries can now return intractably large numbers
of results, count is a temporary workaround to make it possible to see
how many results there are for the query

pagination of large result sets is not currently implemented so right
now the way to proceed if there are too many results would be to try to
refine the query (a sample of size limit is returned) until a
reasonable limit is reached
these were added to help with the intersection case but have bad
performance so removing them because the correct solution for objects
and related endpoints is to use union or fix ingest
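The limit + count workaround described above can be sketched as follows. This is a hypothetical illustration using an in-memory sqlite database; the table and column names are assumptions, not the real quantdb schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE values_quant (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO values_quant (value) VALUES (?)",
                 [(float(i),) for i in range(100_000)])

LIMIT = 100  # cap on rows returned to the client

# count first so the caller can see how many results the query matches
(total,) = conn.execute("SELECT count(*) FROM values_quant WHERE value > ?",
                        (50.0,)).fetchone()

# then return only a sample of size LIMIT; if total >> LIMIT the caller
# should refine the query rather than page through everything
sample = conn.execute(
    "SELECT id, value FROM values_quant WHERE value > ? LIMIT ?",
    (50.0, LIMIT)).fetchall()
print(total, len(sample))
```

The count query and the limited sample query share the same WHERE clause, so the count accurately reports how many rows the sample was drawn from.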
tgbugs and others added 26 commits July 25, 2025 13:30
further abstraction of the process, still having issues with having to
specify addresses in 3 different places (oof) instead of in one common
place (the values object x descriptor mappings that act as our schema)
was able to do a bit better by creating inv_vocd and inv_voqd but that
depends on execution order, we should only need to specify descriptor
to address mappings in one central place and everything else should be
able to leverage that, the InternalIds class is part of the problem in
this case

- ingest, all inserts are now batched and commits happen after each
  chunk (if commit is true) to reduce locking and increase throughput,
  the 20k batch size parameter came from my testing for interlex ingest

- samples, subjects, and sites are now derived from the curation export
  (dataset metadata) rather than path metadata (dataset paths), and
  source addresses have been updated accordingly. the implementation
  actually does this inside of the bit that processes the parsed paths,
  so that needs to be fixed, but importantly instance parents should
  always be populated from the subject/sample/site structure in the
  metadata, not from paths, because paths can be ambiguous and leave out
  intervening parts of the hierarchy. sites don't fit in the physical
  derivation hierarchy as mentioned, but they do fit in the derivation +
  differentiation hierarchy, so they should be included since they do
  have corresponding instances. performances are orthogonal in the
  sense that they will be used within the qdb to allow repeated
  measures of the same quantitative descriptor on the same instance
  (which currently are only allowed if they come from separate
  objects)

- for f006 in particular sam-/site- to fasc- mappings are derived from
  parsed paths (really should be from single file index) and fasc- to
  fiber- are derived from the combined data files, they could be
  derived from the individual fiber data files but it is vastly more
  efficient to ingest the merged files as the contents are the same

- fixed anat_index so that it doesn't break with subsegment, section,
  and site levels but have not extended it to position subsegments and
  sections within the coordinates of the parent segment, a v3 of the
  identifier derived coordinate system would spread sub-parts evenly
  within the parent coordinate range just as segments are spread
  evenly from 0 to 1 but we aren't there yet

- split processing of the raw anat index into its own function to
  simplify applying it to indexes derived from dataset metadata

- ingest into parents must be in topological order otherwise db
  validation checks will fail, most of it can be done quickly by
  prefix, but samples derived from other samples require a toposort,
  fortunately those are usually smaller in number than e.g. the half
  million fibers

- add path_from_blob to retrieve the contents of an object and return
  a local file system path where it is stored, reuses sparcur code for
  the time being

- ext_values has been updated so that it can manage the extraction of
  data from object records. it is a bit of a convoluted mess: it is
  passed ext_contents, and process_records applies process_record to
  each record in each object to produce one or more instances, parents,
  and any cat and quant values associated with the record; ext_values
  thus returns new values to propagate the extracted contents back out.
  this is not at all how this should be done, because it requires
  reading all objects into memory and cannot work on a stream of
  inputs, but it works for now
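The batched ingest described in the first bullet can be sketched roughly as below, using sqlite as a stand-in for the real database; the `instances` table and `ingest` helper are illustrative assumptions, only the 20k batch size comes from the commit message.

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 20_000  # batch size from the commit message; tune per workload

def batched(iterable, n):
    """Yield successive chunks of size n from any iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk

def ingest(conn, rows, commit=True):
    """Insert rows in batches, committing after each chunk (if commit is
    true) so locks are held briefly and throughput stays high."""
    for chunk in batched(rows, BATCH_SIZE):
        conn.executemany("INSERT INTO instances (id_formal) VALUES (?)", chunk)
        if commit:
            conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id_formal TEXT)")
ingest(conn, ((f"fiber-{i}",) for i in range(50_000)))
(n,) = conn.execute("SELECT count(*) FROM instances").fetchone()
print(n)  # 50000
```

Because the input is consumed as a generator, memory use stays bounded by the chunk size even for very large ingests.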
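The toposort requirement for samples derived from other samples can be met with the stdlib `graphlib` module. The child-to-parent map below is a hypothetical example; real data would come from the curation export.

```python
from graphlib import TopologicalSorter

# Hypothetical child -> parent derivation map for samples; real data
# would come from the curation export, not be hard-coded like this.
derived_from = {
    "sam-seg-1": "sam-nerve",   # segment derived from whole nerve
    "sam-sub-1": "sam-seg-1",   # subsegment derived from segment
    "sam-nerve": None,          # root sample, no parent
}

ts = TopologicalSorter()
for child, parent in derived_from.items():
    if parent is None:
        ts.add(child)
    else:
        ts.add(child, parent)  # child depends on its parent existing first

# parents always come before their derived children, so inserting in
# this order keeps db validation checks happy
insert_order = list(ts.static_order())
print(insert_order)
```

`static_order` raises `CycleError` if the derivation map is cyclic, which would also surface bad curation data before it reaches the database.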
these need to be run with ON CONFLICT DO NOTHING because the rows will
often already be present, but this provides a quick way to make sure
that we have what we need for a new ingestion process, extend as needed

yes this is yet another duplication of e.g. addresses etc, but is also
a very quick way to get what we need curated into shape
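The idempotent seeding pattern above looks like the following. This is a sketch against sqlite (which supports `ON CONFLICT DO NOTHING` since 3.24); the `addresses` table is a hypothetical stand-in for the real seed tables in the project SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical addresses table; the real schema lives in the project SQL.
conn.execute("CREATE TABLE addresses (addr TEXT PRIMARY KEY)")
conn.execute("INSERT INTO addresses (addr) VALUES ('subject_id')")

# Re-running the seed inserts is safe: rows that already exist are
# silently skipped instead of raising a uniqueness error.
conn.executemany(
    "INSERT INTO addresses (addr) VALUES (?) ON CONFLICT DO NOTHING",
    [("subject_id",), ("sample_id",), ("site_id",)])

rows = [r[0] for r in conn.execute("SELECT addr FROM addresses ORDER BY addr")]
print(rows)  # ['sample_id', 'site_id', 'subject_id']
```

Running the script twice leaves the table unchanged, which is what makes it safe to bake into setup scripts like dbsetup.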
quite an oversight, probably also means that we need a separate
hierarchy that is not used for search but for validating descriptor
domains because they diverge from what we need for the interface

also add better error messages for value_{cat,quant}_check_before

labels instead of ids, bit of a tradeoff but this way the mismatch can
be understood in principle without having to query
the 80% solution for aspect-of-type-in-context is to add a single
additional desc_inst column to the desc quant table to make it
possible to do counts and areas and other aspects of the population of
said thing in the context of the instance that the number is attached
to, it doesn't work for everything but it gets us much closer, in
theory the same field can be used to enable the reverse where the
aspect is attached to the child and the context is the parent, so
position of fiber in fascicle and the direction is inferred from the
partonomy over the instances
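A minimal sketch of the extra desc_inst column, assuming a much-simplified schema (all table and column names here are illustrative, not the project's real DDL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal version of the quant descriptor and values tables
# with the extra desc_inst column described above.
conn.executescript("""
CREATE TABLE descriptors_quant (
    id INTEGER PRIMARY KEY,
    aspect TEXT NOT NULL,   -- e.g. 'count', 'area'
    desc_inst TEXT          -- class being counted/measured in context
);
CREATE TABLE values_quant (
    desc_quant INTEGER REFERENCES descriptors_quant (id),
    instance TEXT NOT NULL, -- instance the number is attached to
    value REAL NOT NULL
);
""")
# "count of fibers in the context of this fascicle": the aspect is count,
# the counted class is fiber, the context instance is the fascicle
conn.execute("INSERT INTO descriptors_quant VALUES (1, 'count', 'fiber')")
conn.execute("INSERT INTO values_quant VALUES (1, 'fasc-1', 120.0)")
row = conn.execute("""
    SELECT d.aspect, d.desc_inst, v.instance, v.value
    FROM values_quant v JOIN descriptors_quant d ON v.desc_quant = d.id
""").fetchone()
print(row)  # ('count', 'fiber', 'fasc-1', 120.0)
```

The single column answers "a count of what?" without a separate context table, which is what makes it the 80% solution rather than a full model.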

in the process it also became clear that units should be part of
values NOT part of quantitative descriptors, since (beyond the fact
that quantitative values should _always_ have units and numbers should
always travel with units, re: the mars climate orbiter failure of having
unitless numbers that got read in incorrectly despite the fact that
the system had proper typing internally) they are the instance
equivalent for aspects

class -> instance (aka desc_inst -> instance)
aspect -> unit

in the meantime adding placeholder aspects that materialize the thing
being counted and using aggregation type summary to distinguish them,
thus the aspects hierarchy is complicated by the duplication of what
should be in the instance hierarchy, but ok for now

in theory we may want to ingest the source identifier (when available)
as a quantitative value (technically an arbitrary pointer to the inst)
so that it can be used as a sanity check inside the db for id_formal
in values_inst
reminder that nerve is really nerve or part of nerve
a follow on from the sql updates we can now ingest the areas and
counts for fibers in fascicles
queries for endpoints at the descriptor level (e.g. aspects, objects)
that do not include parameters that operate at the instance level
(e.g. value-quant-min, value-quant) will now query only at the
descriptor level unless force-inst=true is set, this makes these
queries much faster by default while still making it possible to
e.g. find all objects associated with instances that have a distance
measurement instead of all objects that contain distance measurements

the new behavior has been implemented for endpoints where there were
known performance issues, some endpoints such as desc/cat continue to
use the instances layer by default for now since performance is ok

also significantly improved performance and accuracy of results in
cases where querying for instances and looking to return any object,
aspect, unit, or quantitative descriptor associated with that object
rather than just those that were the source or matched the source
quant/cat, needs more review to ensure results are as expected

this change also ensures that union-cat-quant is applied when querying
for instances, this likely still needs a bit of tweaking, but it means
that the union/intersect operation is applied consistently to the set
of instances that is returned, not to the final results (note that
this is currently only implemented for the objects, units, aspects, and
desc/quant endpoints)
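The descriptor-level vs instance-level routing described above can be sketched as a small dispatch function. Only `force-inst`, `value-quant`, and `value-quant-min` appear in the commit message; the other parameter names and the function itself are assumptions for illustration.

```python
# Parameters that operate at the instance level and therefore force the
# slower instances-layer query path (value-quant-max is an assumed name).
INSTANCE_LEVEL_PARAMS = {"value-quant", "value-quant-min", "value-quant-max"}

def query_level(params: dict) -> str:
    """Decide whether an endpoint query must touch the instances layer."""
    if params.get("force-inst") == "true":
        return "instance"  # caller explicitly wants instance semantics
    if INSTANCE_LEVEL_PARAMS & params.keys():
        return "instance"  # per-instance filters require the slow path
    return "descriptor"    # fast path: no per-instance filtering needed

print(query_level({"aspect": "distance"}))                          # descriptor
print(query_level({"aspect": "distance", "value-quant-min": "5"}))  # instance
print(query_level({"aspect": "distance", "force-inst": "true"}))    # instance
```

This captures the semantic difference too: the descriptor-level path finds objects that contain distance measurements, while `force-inst=true` finds objects associated with instances that have a distance measurement.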
very rough but gets the job done

the main issue that has to be addressed is that the address to qd
mapping has to be maintained on a per-dataset basis because e.g.
the units for age usually aren't reported, but I guess we can look
into that as we expand what we ingest
vector vs scalar is going to be another source of issues for modeling
wound up being easier than expected but took a couple of tries

have to split values_inst into in and out but there should be no
impact on perf in the normal case since it is an exact id match

updated the docs to indicate that this is implemented now but things
might not be quite right, e.g. I may be doing joins on the wrong
values_inst in some cases so needs testing
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Resolved conflicts in bin/dbsetup and quantdb/ingest.py:
- Added inserts.sql execution to dbsetup
- Added simulation support (species translation, sample type, descriptor)
- Added new address mappings for subject_id, sample_id, site_id, site_type

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@tmsincomb tmsincomb self-assigned this Dec 22, 2025
@tmsincomb tmsincomb changed the base branch from master to dev December 22, 2025 00:31
