F006 specific back propagate #19

Open
tmsincomb wants to merge 59 commits into SciCrunch:dev from tmsincomb:master
Conversation

@tmsincomb
Collaborator

pipeline.sh and pipeline-old.sh export the same CSVs (tables to CSV within the same directory). They are not using the SPARC files, just the cassava pull and the CSVs from those links, to ingest a local database.

tgbugs and others added 30 commits February 18, 2025 17:05
these are in the instance_parent hierarchy even though they are
technically not material and not part of the physical derivation chain
these are needed for cases where the identifier for something is
implicit in the structure of the larger data container
was causing confusing error messages where it looked like something
that definitely was not a dataset was being reported as a dataset when
it was actually just not in the objects table at all
helps to avoid duplicate data among other things

the solution for supporting repeated measures and similar cases where
an instance might have multiple values for the same quantitative
descriptor is to add performances, at the moment the qd would have to
come from a separate object (even if the value was the same),
a mistake when i originally specced this out, i suspect
count distinct requires a slight variation so it is not implemented for now
limit needed because queries can now return intractably large numbers
of results, count is a temporary workaround to make it possible to see
how many results there are for the query

pagination of large result sets is not currently implemented so right
now the way to proceed if there are too many results would be to try to
refine the query (a sample of size limit is returned) until a
reasonable limit is reached
these were added to help with the intersection case but have bad
performance so removing them because the correct solution for objects
and related endpoints is to use union or fix ingest
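The limit + count workaround described above can be sketched as follows. This is a hypothetical illustration using an in-memory sqlite database; the table and column names are assumptions, not the real quantdb schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE values_quant (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO values_quant (value) VALUES (?)",
                 [(float(i),) for i in range(100_000)])

LIMIT = 100  # cap on rows returned to the client

# count first so the caller can see how many results the query matches
(total,) = conn.execute("SELECT count(*) FROM values_quant WHERE value > ?",
                        (50.0,)).fetchone()

# then return only a sample of size LIMIT; if total >> LIMIT the caller
# should refine the query rather than page through everything
sample = conn.execute(
    "SELECT id, value FROM values_quant WHERE value > ? LIMIT ?",
    (50.0, LIMIT)).fetchall()
print(total, len(sample))
```

The count query and the limited sample query share the same WHERE clause, so the count accurately reports how many rows the sample was drawn from.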
tgbugs and others added 26 commits July 25, 2025 13:30
further abstraction of the process, still having issues with having to
specify addresses in 3 different places (oof) instead of in one common
place (the values object x descriptor mappings that act as our schema)
was able to do a bit better by creating inv_vocd and inv_voqd but that
depends on execution order, we should only need to specify descriptor
to address mappings in one central place and everything else should be
able to leverage that, the InternalIds class is part of the problem in
this case

- ingest, all inserts are now batched and commits happen after each
  chunk (if commit is true) to reduce locking and increase throughput,
  the 20k batch size parameter came from my testing for interlex ingest

- samples, subjects, and sites are now derived from the curation export
  (dataset metadata) rather than path metadata (dataset paths), and
  source addresses have been updated accordingly. the implementation
  actually does this inside of the bit that processes the parsed paths,
  so that needs to be fixed, but importantly instance parents should
  always be populated from the subject/sample/site structure in the
  metadata, not from paths, because paths can be ambiguous and leave out
  intervening parts of the hierarchy. sites don't fit in the physical
  derivation hierarchy as mentioned, but they do fit in the derivation +
  differentiation hierarchy, so they should be included since they do
  have corresponding instances. performances are orthogonal in the
  sense that they will be used within the qdb to allow repeated
  measures of the same quantitative descriptor on the same instance
  (which currently are only allowed if they come from separate
  objects)

- for f006 in particular sam-/site- to fasc- mappings are derived from
  parsed paths (really should be from single file index) and fasc- to
  fiber- are derived from the combined data files, they could be
  derived from the individual fiber data files but it is vastly more
  efficient to ingest the merged files as the contents are the same

- fixed anat_index so that it doesn't break with subsegment, section,
  and site levels but have not extended it to position subsegments and
  sections within the coordinates of the parent segment, a v3 of the
  identifier derived coordinate system would spread sub-parts evenly
  within the parent coordinate range just as segments are spread
  evenly from 0 to 1 but we aren't there yet

- split processing of the raw anat index into its own function to
  simplify applying it to indexes derived from dataset metadata

- ingest into parents must be in topological order otherwise db
  validation checks will fail, most of it can be done quickly by
  prefix, but samples derived from other samples require a toposort,
  fortunately those are usually smaller in number than e.g. the half
  million fibers

- add path_from_blob to retrieve the contents of an object and return
  a local file system path where it is stored, reuses sparcur code for
  the time being

- ext_values has been updated so that it can manage the extraction of
  data from object records. it is a bit of a convoluted mess: it is
  passed ext_contents, and process_records applies process_record to
  each record in each object to produce one or more instances, parents,
  and any cat and quant values associated with the record; ext_values
  thus returns new values to propagate the extracted contents back out.
  this is not at all how this should be done, because it requires
  reading all objects into memory and cannot work on a stream of
  inputs, but it works for now
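The batched ingest described in the first bullet can be sketched roughly as below, using sqlite as a stand-in for the real database; the `instances` table and `ingest` helper are illustrative assumptions, only the 20k batch size comes from the commit message.

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 20_000  # batch size from the commit message; tune per workload

def batched(iterable, n):
    """Yield successive chunks of size n from any iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk

def ingest(conn, rows, commit=True):
    """Insert rows in batches, committing after each chunk (if commit is
    true) so locks are held briefly and throughput stays high."""
    for chunk in batched(rows, BATCH_SIZE):
        conn.executemany("INSERT INTO instances (id_formal) VALUES (?)", chunk)
        if commit:
            conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id_formal TEXT)")
ingest(conn, ((f"fiber-{i}",) for i in range(50_000)))
(n,) = conn.execute("SELECT count(*) FROM instances").fetchone()
print(n)  # 50000
```

Because the input is consumed as a generator, memory use stays bounded by the chunk size even for very large ingests.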
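The toposort requirement for samples derived from other samples can be met with the stdlib `graphlib` module. The child-to-parent map below is a hypothetical example; real data would come from the curation export.

```python
from graphlib import TopologicalSorter

# Hypothetical child -> parent derivation map for samples; real data
# would come from the curation export, not be hard-coded like this.
derived_from = {
    "sam-seg-1": "sam-nerve",   # segment derived from whole nerve
    "sam-sub-1": "sam-seg-1",   # subsegment derived from segment
    "sam-nerve": None,          # root sample, no parent
}

ts = TopologicalSorter()
for child, parent in derived_from.items():
    if parent is None:
        ts.add(child)
    else:
        ts.add(child, parent)  # child depends on its parent existing first

# parents always come before their derived children, so inserting in
# this order keeps db validation checks happy
insert_order = list(ts.static_order())
print(insert_order)
```

`static_order` raises `CycleError` if the derivation map is cyclic, which would also surface bad curation data before it reaches the database.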
these need to be run with ON CONFLICT DO NOTHING because the rows will
often already be present, but this provides a quick way to make sure
that we have what we need for a new ingestion process, extend as needed

yes this is yet another duplication of e.g. addresses etc, but is also
a very quick way to get what we need curated into shape
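The idempotent seeding pattern above looks like the following. This is a sketch against sqlite (which supports `ON CONFLICT DO NOTHING` since 3.24); the `addresses` table is a hypothetical stand-in for the real seed tables in the project SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical addresses table; the real schema lives in the project SQL.
conn.execute("CREATE TABLE addresses (addr TEXT PRIMARY KEY)")
conn.execute("INSERT INTO addresses (addr) VALUES ('subject_id')")

# Re-running the seed inserts is safe: rows that already exist are
# silently skipped instead of raising a uniqueness error.
conn.executemany(
    "INSERT INTO addresses (addr) VALUES (?) ON CONFLICT DO NOTHING",
    [("subject_id",), ("sample_id",), ("site_id",)])

rows = [r[0] for r in conn.execute("SELECT addr FROM addresses ORDER BY addr")]
print(rows)  # ['sample_id', 'site_id', 'subject_id']
```

Running the script twice leaves the table unchanged, which is what makes it safe to bake into setup scripts like dbsetup.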
quite an oversight, probably also means that we need a separate
hierarchy that is not used for search but for validating descriptor
domains because they diverge from what we need for the interface

also add better error messages for value_{cat,quant}_check_before

labels instead of ids, bit of a tradeoff but this way the mismatch can
be understood in principle without having to query
the 80% solution for aspect-of-type-in-context is to add a single
additional desc_inst column to the desc quant table to make it
possible to do counts and areas and other aspects of the population of
said thing in the context of the instance that the number is attached
to, it doesn't work for everything but it gets us much closer, in
theory the same field can be used to enable the reverse where the
aspect is attached to the child and the context is the parent, so
position of fiber in fascicle and the direction is inferred from the
partonomy over the instances
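A minimal sketch of the extra desc_inst column, assuming a much-simplified schema (all table and column names here are illustrative, not the project's real DDL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal version of the quant descriptor and values tables
# with the extra desc_inst column described above.
conn.executescript("""
CREATE TABLE descriptors_quant (
    id INTEGER PRIMARY KEY,
    aspect TEXT NOT NULL,   -- e.g. 'count', 'area'
    desc_inst TEXT          -- class being counted/measured in context
);
CREATE TABLE values_quant (
    desc_quant INTEGER REFERENCES descriptors_quant (id),
    instance TEXT NOT NULL, -- instance the number is attached to
    value REAL NOT NULL
);
""")
# "count of fibers in the context of this fascicle": the aspect is count,
# the counted class is fiber, the context instance is the fascicle
conn.execute("INSERT INTO descriptors_quant VALUES (1, 'count', 'fiber')")
conn.execute("INSERT INTO values_quant VALUES (1, 'fasc-1', 120.0)")
row = conn.execute("""
    SELECT d.aspect, d.desc_inst, v.instance, v.value
    FROM values_quant v JOIN descriptors_quant d ON v.desc_quant = d.id
""").fetchone()
print(row)  # ('count', 'fiber', 'fasc-1', 120.0)
```

The single column answers "a count of what?" without a separate context table, which is what makes it the 80% solution rather than a full model.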

in the process it also became clear that units should be part of
values NOT part of quantitative descriptors, since (beyond the fact
that quantitative values should _always_ have units and numbers should
always travel with units, re: the mars climate orbiter failure of having
unitless numbers that got read in incorrectly despite the fact that
the system had proper typing internally) they are the instance
equivalent for aspects

class -> instance (aka desc_inst -> instance)
aspect -> unit

in the meantime adding placeholder aspects that materialize the thing
being counted and using aggregation type summary to distinguish them,
thus the aspects hierarchy is complicated by the duplication of what
should be in the instance hierarchy, but ok for now

in theory we may want to ingest the source identifier (when available)
as a quantitative value (technically an arbitrary pointer to the inst)
so that it can be used as a sanity check inside the db for id_formal
in values_inst
reminder that nerve is really nerve or part of nerve
a follow on from the sql updates we can now ingest the areas and
counts for fibers in fascicles
queries for endpoints at the descriptor level (e.g. aspects, objects)
that do not include parameters that operate at the instance level
(e.g. value-quant-min, value-quant) will now query only at the
descriptor level unless force-inst=true is set, this makes these
queries much faster by default while still making it possible to
e.g. find all objects associated with instances that have a distance
measurement instead of all objects that contain distance measurements

the new behavior has been implemented for endpoints where there were
known performance issues, some endpoints such as desc/cat continue to
use the instances layer by default for now since performance is ok

also significantly improved performance and accuracy of results in
cases where querying for instances and looking to return any object,
aspect, unit, or quantitative descriptor associated with that object
rather than just those that were the source or matched the source
quant/cat, needs more review to ensure results are as expected

this change also ensures that union-cat-quant is applied when querying
for instances, this likely still needs a bit of tweaking, but it means
that the union/intersect operation is applied consistently to the set
of instances that is returned, not to the final results (note that
this is currently only implemented for the objects, units, aspects, and
desc/quant endpoints)
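The descriptor-level vs instance-level routing described above can be sketched as a small dispatch function. Only `force-inst`, `value-quant`, and `value-quant-min` appear in the commit message; the other parameter names and the function itself are assumptions for illustration.

```python
# Parameters that operate at the instance level and therefore force the
# slower instances-layer query path (value-quant-max is an assumed name).
INSTANCE_LEVEL_PARAMS = {"value-quant", "value-quant-min", "value-quant-max"}

def query_level(params: dict) -> str:
    """Decide whether an endpoint query must touch the instances layer."""
    if params.get("force-inst") == "true":
        return "instance"  # caller explicitly wants instance semantics
    if INSTANCE_LEVEL_PARAMS & params.keys():
        return "instance"  # per-instance filters require the slow path
    return "descriptor"    # fast path: no per-instance filtering needed

print(query_level({"aspect": "distance"}))                          # descriptor
print(query_level({"aspect": "distance", "value-quant-min": "5"}))  # instance
print(query_level({"aspect": "distance", "force-inst": "true"}))    # instance
```

This captures the semantic difference too: the descriptor-level path finds objects that contain distance measurements, while `force-inst=true` finds objects associated with instances that have a distance measurement.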
very rough but gets the job done

the main issue that has to be addressed is that the address to qd
mapping has to be maintained on a per-dataset basis because e.g.
the units for age usually aren't reported, but I guess we can look
into that as we expand what we ingest
vector vs scalar is going to be another source of issues for modeling
wound up being easier than expected but took a couple of tries

have to split values_inst into in and out but there should be no
impact on perf in the normal case since it is an exact id match

updated the docs to indicate that this is implemented now but things
might not be quite right, e.g. I may be doing joins on the wrong
values_inst in some cases so needs testing
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Resolved conflicts in bin/dbsetup and quantdb/ingest.py:
- Added inserts.sql execution to dbsetup
- Added simulation support (species translation, sample type, descriptor)
- Added new address mappings for subject_id, sample_id, site_id, site_type

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@tmsincomb tmsincomb self-assigned this Dec 22, 2025
@tmsincomb tmsincomb changed the base branch from master to dev December 22, 2025 00:31
