Skip to content

Feature/test time scaling and audio support#243

Open
Rakshitha-Ireddi wants to merge 10 commits into
lotus-data:mainfrom
Rakshitha-Ireddi:feature/test-time-scaling-and-audio-support
Open

Feature/test time scaling and audio support#243
Rakshitha-Ireddi wants to merge 10 commits into
lotus-data:mainfrom
Rakshitha-Ireddi:feature/test-time-scaling-and-audio-support

Conversation

@Rakshitha-Ireddi
Copy link
Copy Markdown

This PR adds two highly requested features to LOTUS:

  1. Test-Time Scaling with Ensembling
  2. Audio Data Support via AudioArray

Changes:
Feature 1: Test-Time Scaling (lotus/sem_ops/ensembling.py)
Adds ensemble-based test-time scaling strategies for improving semantic operator accuracy:

  1. EnsembleStrategy enum with four strategies:
  • MAJORITY_VOTE - Returns most common prediction
  • WEIGHTED_AVERAGE - Weighs predictions by confidence
  • CONSENSUS - Returns result only if unanimous
  • CONFIDENCE_THRESHOLD - Majority vote with confidence tracking
  1. EnsembleConfig dataclass for configuration:
  • n_samples - Number of samples to generate
  • strategy - Which ensembling strategy to use
  • temperature - Sampling temperature
  • confidence_threshold - Minimum confidence for threshold strategy
  1. Ensemble class for aggregating predictions

Feature 2: Audio Data Support (lotus/dtype_extensions/audio.py)
Extends LOTUS to support audio data processing:

  • AudioDtype - Custom pandas ExtensionDtype for audio
  • AudioArray - ExtensionArray for storing audio data
  • Supports 7 audio formats: .wav, .mp3, .mp4, .m4a, .flac, .ogg, .webm
  • Includes caching, base64 encoding, and MIME type detection

Tests

  • tests/test_ensembling.py - 40+ test cases for all strategies
  • tests/test_audio_array.py - Comprehensive tests for AudioArray

Work done by: Ireddi Rakshitha & Yaswanth Devavarapu

Rakshitha Ireddi added 4 commits January 23, 2026 22:33
…tors

Add EnsembleStrategy enum with majority_vote, weighted_average, consensus,
and confidence_threshold strategies. Includes EnsembleConfig dataclass for
configuration and Ensemble class for aggregating multiple LLM samples.

Closes lotus-data#200
Add AudioDtype and AudioArray classes for storing audio data in DataFrames.
Supports .wav, .mp3, .mp4, .m4a, .flac, .ogg, .webm formats with caching
and base64 encoding for LLM processing.

Closes lotus-data#196
Add test_ensembling.py with 40+ test cases for all ensemble strategies.
Add test_audio_array.py with tests for AudioDtype, AudioArray indexing,
methods, MIME types, and pandas integration.
- Remove unused imports (io, os, tempfile, Path)
- Sort imports according to PEP 8
- Use 'is' instead of '==' for type comparison
- Remove unused exception variable
@harshitgupta412
Copy link
Copy Markdown
Collaborator

Hi Rakshitha! Thanks for the PR.
There are a few things missing to complete the integrations for these features.

Audio:

  • The dtype is not usable directly since audio inputs can't be sent as text to LLMs. Can you update the corresponding code in sem_ops and lotus/templates/task_instructions.py to handle this properly
  • It will be good to have some docs, tests, and examples on how to use these. Check .github/tests/multimodality_tests.py for the corresponding tests for images.

Ensembling:

  • There is another PR (Implement semantic map + filter #209) working on this and there are a few comments there that needs to be incorporated here.
  • The interface for the sem_ops (for now sem_filter) has to be updated to include n_sample and ensembling strategy parameter
  • The output format for sem_filter function as well as operator should be updated to include individual runs' data (parsed_outputs, logprobs, explanations, raw_outputs)

Rakshitha Ireddi and others added 3 commits January 28, 2026 15:54
- Add AudioDtype handling to task_instructions.py for multimodal prompts
- Update context_formatter and user_message_formatter for audio inputs
- Add n_sample, ensemble, temperature params to sem_filter for test-time scaling
- Integrate Ensemble class for multi-sample aggregation (PR lotus-data#209 alignment)
- Add per-run rollout fields to SemanticFilterOutput for detailed analysis

Addresses feedback on PR lotus-data#243
This PR implements audio data support and test-time scaling features for LOTUS, enhancing multimodal processing and accuracy.
Added a section for type of change and checklist to PR description.
@Rakshitha-Ireddi
Copy link
Copy Markdown
Author

Hi @harshitgupta412, thanks for the detailed feedback!
We have updated the PR with the requested changes:
1. Audio Integration

  • Updated lotus/templates/task_instructions.py (context_formatter and df2multimodal_info) to properly handle AudioDtype columns.
  • Audio data is now formatted as input_audio blocks for LLM consumption, similar to how images are handled.

2. Ensembling (Aligning with PR #209)

  • Updated sem_filter interface to include n_sample, ensemble (strategy), and temperature parameters.
  • Integrated the Ensemble class to aggregate results across multiple samples.
  • Updated SemanticFilterOutput to include per-run rollout data (all_runs_outputs, all_runs_logprobs, etc.).

The PR description has been updated to reflect these integration details. Thank you

Copy link
Copy Markdown
Collaborator

@harshitgupta412 harshitgupta412 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add test, docs and examples on how to use the new functionalities of sem filter

Comment thread lotus/sem_ops/sem_filter.py Outdated
additional_cot_instructions: str = "",
n_sample: int = 1,
ensemble: EnsembleStrategy | None = None,
temperature: float = 1.0,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The temperature is specified directly in the model. I don't think this is needed here

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Removed temperature from sem_filter parameters. Users can configure this in the model settings instead.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added:

  1. examples/ensembling_example.py - 4 usage examples
  2. tests/test_sem_filter_ensembling.py - Integration tests for RawOutputs, SemanticFilterOutput, and EnsembleConfig

Comment thread lotus/sem_ops/sem_filter.py Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explanations,raw_outputs, answer should be returned for each run as well as the ensembled answer.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed! The new SemanticFilterOutput stores all per-run data in _raw_outputs (a RawOutputs dataclass) and outputs contains the final aggregated result.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be available in the dataframe as well. The columns will look like:

|<input cols> | raw_output_1 | explanation_1 | parsed_output_1 | raw_output_2 | expl_2 | parsed_output_2 ... | ensemble_answer (suffix specified by user) |

Comment thread lotus/sem_ops/sem_filter.py Outdated
Comment thread lotus/sem_ops/sem_filter.py
Comment thread lotus/sem_ops/ensembling.py
Comment thread lotus/sem_ops/ensembling.py Outdated
- Refactor EnsembleConfig: remove n_samples/temperature, add weights/default
- Add RawOutputs dataclass for per-run data organization
- Update SemanticFilterOutput with backward-compat properties
- Update sem_filter to accept Ensemble object, remove temperature param
- Add examples/ensembling_example.py with usage demos
- Add tests/test_sem_filter_ensembling.py for integration tests

Addresses Harshit's feedback on PR lotus-data#243
@Rakshitha-Ireddi
Copy link
Copy Markdown
Author

Hi @harshitgupta412 , I've addressed your feedback:

  1. Removed temperature from sem_filter (model config controls this)
  2. Changed ensemble param to accept Ensemble object directly
  3. Refactored EnsembleConfig to standardize params (weights/default in config)
  4. Added RawOutputs dataclass for per-run data
  5. Added SemanticFilterOutput backward-compat properties
  6. Created tests & examples for the new functionality

Thank you.

Comment thread lotus/sem_ops/sem_filter.py Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be available in the dataframe as well. The columns will look like:

|<input cols> | raw_output_1 | explanation_1 | parsed_output_1 | raw_output_2 | expl_2 | parsed_output_2 ... | ensemble_answer (suffix specified by user) |

Comment thread PR_DESCRIPTION.md Outdated
@@ -0,0 +1,56 @@
## Purpose
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove this file

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in this commit.

@harshitgupta412
Copy link
Copy Markdown
Collaborator

also please add tests in https://github.com/lotus-data/lotus/tree/main/.github/tests. The tests are split into what runs automatically on push (present in this folder) and additional tests.
Please add couple of tests using ensembling here so that it can be auto-tested.
Also expand the multimodality test to include audio tests

…SCRIPTION.md, add tests

- Modified sem_filter.py to expose raw_output_i, explanation_i, parsed_output_i columns for n_sample > 1
- Removed PR_DESCRIPTION.md as requested
- Added test_filter_ensembling in lm_tests.py
- Added test_filter_operation_audio in multimodality_tests.py
@Rakshitha-Ireddi
Copy link
Copy Markdown
Author

The DataFrame now includes per-run columns: raw_output_1, explanation_1, parsed_output_1, raw_output_2, explanation_2, parsed_output_2, etc. The ensemble result is stored in filter_label.

Implemented! When using n_sample > 1, the returned DataFrame now has exactly this structure. See the updated sem_filter.py - the multi-sample path generates columns for each sample with 1-based indexing.

Done! Added:

  • test_filter_ensembling in .github/tests/lm_tests.py - tests multi-sample filtering with n_sample=2 and verifies all per-run columns are present
  • test_filter_operation_audio in .github/tests/multimodality_tests.py - tests AudioArray with sem_filter using gpt-4o-audio-preview

assert expected_result == list(zip(joined_df["image"], joined_df["element"]))


@pytest.mark.parametrize("model", get_enabled("gpt-4o-audio-preview"))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to enable gpt-4o-audio-preview

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I enabled gpt-4o-audio-preview in multimodality_tests.py and updated the test to use a valid WAV file input. The test test_filter_operation_audio now passes locally!

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harshitgupta412 , Could you please review the changes ?

@Rakshitha-Ireddi
Copy link
Copy Markdown
Author

Hello @harshitgupta412 , Could you please check and review the work ? And also i am happy to work for other issues or any help and contributing to this project, Please let me know. I am open to work. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants