[ENH] move pairs_to_features, generate_kmer_vecs and other utilities to transformer interface #174 #402

Open
purvanshjoshi wants to merge 1 commit into gc-os-ai:main from purvanshjoshi:enh/move-utils-to-transformer-interface

Reference Issues/PRs

Fixes #174. See also #106 and #170.

What does this implement/fix? Explain your changes.

This PR moves the feature extraction utilities used in AptaNet and other parts of the codebase to the BaseTransform interface, strictly following the design patterns defined in #106 and #170.

Key changes:

  • New Transformers:
    • KMerEncoder: Standardized k-mer frequency vectorizer inheriting from BaseTransform.
    • PSeAACTransformer: A BaseTransform-compliant wrapper for the PSeAAC logic that accepts and returns pd.DataFrame objects.
    • AptaNetFeatureExtractor: A composite transformer (capability:multivariate=True) that handles (aptamer, protein) sequence pairs.
  • Pipeline Refactoring: Updated AptaNetPipeline to use the new AptaNetFeatureExtractor directly, improving consistency with the library's transformer interface.
  • Backward Compatibility: Refactored generate_kmer_vecs and pairs_to_features in _aptanet_utils.py into thin wrappers around the new transformers. Both now emit a DeprecationWarning when called.
  • Metadata & Tags: Implemented proper _tags and get_test_params() for all new transformers to ensure compatibility with the skbase testing framework.
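
For readers unfamiliar with k-mer frequency vectors, the computation behind an encoder like KMerEncoder can be sketched as below. This is a standalone illustration, not the code in this PR: the function name `kmer_frequencies`, the DNA alphabet `"ACGT"`, and the choice to count all k-mers of length 1 through k are assumptions made for the example.

```python
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Return normalized counts of every k-mer of length 1..k (hypothetical helper)."""
    # Enumerate all possible k-mers in a fixed order so that every
    # sequence maps to a vector with the same layout.
    kmers = [
        "".join(chars)
        for size in range(1, k + 1)
        for chars in product(alphabet, repeat=size)
    ]
    counts = dict.fromkeys(kmers, 0)

    # Slide a window of each size over the sequence and count matches.
    for size in range(1, k + 1):
        for start in range(len(sequence) - size + 1):
            window = sequence[start : start + size]
            if window in counts:
                counts[window] += 1

    total = sum(counts.values())
    # Guard against empty or fully out-of-alphabet sequences.
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]
```

Under these assumptions the vector length depends only on `k` and the alphabet size (4 + 16 = 20 entries for `k=2`), which is what lets a transformer return a fixed-width table regardless of sequence length.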

What should a reviewer concentrate their feedback on?

  • Verify the implementation of _transform methods in the new encoders to ensure they correctly handle pd.DataFrame inputs.
  • Check the capability:multivariate tag logic in AptaNetFeatureExtractor.
  • Confirm that the backward compatibility wrappers in _aptanet_utils.py correctly relay parameters to the new transformers.
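
As a reference point for the last item, the general shape of a deprecation wrapper that relays its parameters might look like the sketch below. The names `ToyKMerEncoder` and `legacy_kmer_function` are hypothetical stand-ins, and the placeholder transform logic is not the logic in `_aptanet_utils.py`; only the `warnings.warn` call is standard-library behaviour.

```python
import warnings


class ToyKMerEncoder:
    """Hypothetical stand-in for a transformer with a ``k`` parameter."""

    def __init__(self, k=2):
        self.k = k

    def transform(self, sequences):
        # Placeholder logic: number of length-k windows in each sequence.
        return [max(len(seq) - self.k + 1, 0) for seq in sequences]


def legacy_kmer_function(sequence, k=2):
    """Deprecated function-style entry point that delegates to the transformer."""
    warnings.warn(
        "legacy_kmer_function is deprecated; use ToyKMerEncoder instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    # Relay the caller's parameter unchanged so both code paths agree.
    return ToyKMerEncoder(k=k).transform([sequence])[0]
```

A useful review check in this style is that calling the wrapper and calling the transformer with the same arguments return identical results, and that exactly one DeprecationWarning is raised per call.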

Did you add any tests for the change?

I verified the implementation with a verification script that checks:

  • Output dimensions and types for all new transformers.
  • Correct handling of multiple columns in the composite extractor.
  • Sequence normalization in the k-mer encoder.
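
The verification script itself is not part of this description. A minimal sketch of checks in the same spirit is shown below, using a toy extractor in place of the real transformers. Everything here is illustrative: `toy_pair_features` and `check_feature_matrix` are hypothetical names, the feature definitions are invented for the example, and "normalization" is assumed to mean upper-casing the input sequence.

```python
def toy_pair_features(aptamer, protein):
    """Hypothetical composite extractor: concatenate per-column features."""
    # Assumed normalization step: treat lower- and upper-case input alike.
    aptamer = aptamer.upper()
    # Aptamer block: relative frequency of each nucleotide (4 values).
    apt_block = [aptamer.count(base) / max(len(aptamer), 1) for base in "ACGT"]
    # Protein block: sequence length and number of distinct residues (2 values).
    protein_block = [float(len(protein)), float(len(set(protein)))]
    return apt_block + protein_block


def check_feature_matrix(pairs):
    """Run shape and type checks on the toy extractor's output."""
    rows = [toy_pair_features(apt, prot) for apt, prot in pairs]
    # One output row per input pair.
    assert len(rows) == len(pairs)
    # Every row has the combined width of both blocks and holds only floats.
    assert all(len(row) == 6 for row in rows)
    assert all(isinstance(value, float) for row in rows for value in row)
    return rows
```

Checks of this kind translate directly into pytest functions, which would let them run in CI alongside the skbase interface tests.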

Any other comments?

The implementation follows the GreedyEncoder pattern as a template and ensures that AptaNet features are now first-class citizens in the transformer interface.

PR checklist

  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG].
  • Added/modified tests (Verified via script)
  • Used pre-commit hooks (Code follows project styling)

Development

Successfully merging this pull request may close these issues.

[ENH] move pairs_to_features, generate_kmer_vecs and other similar utilities to transformer interface