Add start and end positions in output matrix #135
acesnik wants to merge 2 commits into lazear:master
Looks good overall, left a few specific comments.
Additional considerations not addressed:
During deduplication of identical peptide sequences post-modification, also carry over the protein positions:
sage/crates/sage/src/database.rs
Line 197 in a969ca7
Ensure that positions and proteins identified are co-sorted:
sage/crates/sage/src/database.rs
Line 206 in a969ca7
Might be best to define a new struct for assigning proteins that includes both the identifier and positions, to preclude any potential bugs from the above.
    struct ProteinAssignment {
        identifier: String,
        // do these need to be usize (8 bytes)? u32 (4 bytes) is probably sufficient...
        start: usize,
        end: usize,
    }
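A minimal sketch of how such a struct could keep identifiers and positions together when peptides are merged and sorted. Everything here is an assumption for illustration, not code from Sage: the `u32` field widths, the `Arc<String>` identifier, the derived ordering, and the helper names `sorted_assignments` and `merge_assignments` are all hypothetical.

```rust
use std::sync::Arc;

/// Hypothetical struct following the suggestion above.
/// Field order matters: the derived `Ord` compares identifier, then start, then end.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ProteinAssignment {
    pub identifier: Arc<String>,
    /// 1-based inclusive start residue (u32 is an assumption).
    pub start: u32,
    /// 1-based inclusive end residue.
    pub end: u32,
}

/// Sort and deduplicate whole assignments, so a position can never
/// end up paired with the wrong protein identifier.
pub fn sorted_assignments(mut assignments: Vec<ProteinAssignment>) -> Vec<ProteinAssignment> {
    assignments.sort();
    assignments.dedup();
    assignments
}

/// Merge the assignments of two peptides that share an identical sequence.
pub fn merge_assignments(
    a: Vec<ProteinAssignment>,
    b: Vec<ProteinAssignment>,
) -> Vec<ProteinAssignment> {
    let mut merged = a;
    merged.extend(b);
    sorted_assignments(merged)
}
```

Because the identifier and positions live in one value, sorting or deduplicating the list moves them as a unit, which is what rules out the co-sorting bug that two parallel vectors would allow.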
crates/sage/src/peptide.rs
Outdated
    pub proteins: Vec<Arc<String>>,
    /// What residue does this peptide start at in the protein (1-based inclusive)?
    pub start_position: Vec<Arc<usize>>,
No need for Arc - this is a smart pointer (atomic reference counted) allocated on the heap. It's used for proteins: String (which is already heap allocated) to prevent repeated clones of protein identifiers. Doesn't make sense to use an Arc for a simple usize in this case!
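To illustrate the difference, here is a small sketch using only the standard library. The helper names `share_identifier` and `start_positions` are hypothetical and do not come from Sage.

```rust
use std::sync::Arc;

/// Cloning an `Arc<String>` only bumps a reference count; the string's
/// bytes are not copied, which is why it pays off for protein identifiers.
pub fn share_identifier(id: &Arc<String>, n: usize) -> Vec<Arc<String>> {
    (0..n).map(|_| Arc::clone(id)).collect()
}

/// `usize` is `Copy`, so positions can be stored by value. Wrapping each
/// one in an `Arc` would add a heap allocation and atomic counters for
/// a value that is already as small as a pointer.
pub fn start_positions(starts: &[usize]) -> Vec<usize> {
    starts.to_vec()
}
```

With this shape, the field in the diff above would simply be a `Vec<usize>`.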
crates/sage/src/enzyme.rs
Outdated
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        self.sequence.hash(state);
        self.position.hash(state);
        self.start_position.hash(state);
I don't think this is correct behavior (won't necessarily cause a bug though). We use hash to deduplicate digests prior to creating peptides.
Only sequence and position are considered to ensure that we don't accidentally deduplicate peptide sequences that occur on protein termini, which might be assigned termini-specific modifications. Including start_position will lead to extra (duplicated) digests being created -> more duplicated peptides that must be generated (expensive) and then trashed during de-duplication.
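A rough sketch of the deduplication behavior described here, under stated assumptions: the `Digest` and `Position` types below are simplified stand-ins rather than Sage's actual definitions, and `dedup_digests` is a hypothetical helper. The point is that `Hash` and `Eq` look only at `sequence` and `position`, while `start_position` rides along as metadata.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Simplified stand-in for where a digest sits within its protein.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Position {
    Nterm,
    Cterm,
    Full,
    Internal,
}

/// Simplified stand-in for a digest.
#[derive(Debug, Clone)]
pub struct Digest {
    pub sequence: String,
    pub position: Position,
    /// Metadata only: deliberately excluded from equality and hashing.
    pub start_position: usize,
}

impl PartialEq for Digest {
    fn eq(&self, other: &Self) -> bool {
        self.sequence == other.sequence && self.position == other.position
    }
}

impl Eq for Digest {}

impl Hash for Digest {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Must agree with `eq`: equal digests have to produce equal hashes.
        self.sequence.hash(state);
        self.position.hash(state);
    }
}

/// Collapse digests that share a sequence and a terminal position.
pub fn dedup_digests(digests: Vec<Digest>) -> HashSet<Digest> {
    digests.into_iter().collect()
}
```

In this sketch, two internal copies of the same sequence at different start positions collapse into one digest, while an N-terminal copy stays separate so that terminus-specific modifications can still be applied to it.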
Okay, that makes sense. I was thinking about maybe including a list of start indices for each protein in the proteins list, but that probably doesn't make sense from a peptide-centric search perspective, and other search engines just list the index for the leading protein.
    "label",
    "expmass",
    "calcmass",
    "measured_mass",
What's the rationale for changing column names?
Oh, the rationale here, although I'm not wedded to it, is to 1) keep the underscore delimiting used in the rest of the columns, 2) spell out calculated and experimental/measured, and 3) omit "number" after "scan", which doesn't feel necessary since every entry has "scan=[scan number]".
I'm not opposed to it, but the column names have been stable for ~2.5 years, so I don't see a huge need to rename them (and thus force people to retool pipelines downstream of Sage). They were originally named like this because Sage results used to be passed directly to mokapot for rescoring.
crates/sage/src/peptide.rs
Outdated
    missed_cleavages: value.missed_cleavages,
    semi_enzymatic: value.semi_enzymatic,
    proteins: vec![value.protein],
    start_position: value.start_position,
Thanks for the feedback on this! I'll continue working on the PR and mark it ready for review when it's done, or ask some follow-up questions along the way.
I'm drafting a PR for #110