Feature/test time scaling and audio support by Rakshitha-Ireddi · Pull Request #243 · lotus-data/lotus

Rakshitha-Ireddi · 2026-01-23T18:00:58Z

This PR adds two highly requested features to LOTUS:

Test-Time Scaling with Ensembling
Audio Data Support via AudioArray

Changes:
Feature 1: Test-Time Scaling (lotus/sem_ops/ensembling.py)
Adds ensemble-based test-time scaling strategies for improving semantic operator accuracy:

EnsembleStrategy enum with four strategies:

MAJORITY_VOTE - Returns most common prediction
WEIGHTED_AVERAGE - Weighs predictions by confidence
CONSENSUS - Returns result only if unanimous
CONFIDENCE_THRESHOLD - Majority vote with confidence tracking

EnsembleConfig dataclass for configuration:

n_samples - Number of samples to generate
strategy - Which ensembling strategy to use
temperature - Sampling temperature
confidence_threshold - Minimum confidence for threshold strategy

Ensemble class for aggregating predictions

Feature 2: Audio Data Support (lotus/dtype_extensions/audio.py)
Extends LOTUS to support audio data processing:

AudioDtype - Custom pandas ExtensionDtype for audio
AudioArray - ExtensionArray for storing audio data
Supports 7 audio formats: .wav, .mp3, .mp4, .m4a, .flac, .ogg, .webm
Includes caching, base64 encoding, and MIME type detection

Tests

tests/test_ensembling.py - 40+ test cases for all strategies
tests/test_audio_array.py - Comprehensive tests for AudioArray

Work done by: Ireddi Rakshitha & Yaswanth Devavarapu

…tors Add EnsembleStrategy enum with majority_vote, weighted_average, consensus, and confidence_threshold strategies. Includes EnsembleConfig dataclass for configuration and Ensemble class for aggregating multiple LLM samples. Closes lotus-data#200

Add AudioDtype and AudioArray classes for storing audio data in DataFrames. Supports .wav, .mp3, .mp4, .m4a, .flac, .ogg, .webm formats with caching and base64 encoding for LLM processing. Closes lotus-data#196

Add test_ensembling.py with 40+ test cases for all ensemble strategies. Add test_audio_array.py with tests for AudioDtype, AudioArray indexing, methods, MIME types, and pandas integration.

- Remove unused imports (io, os, tempfile, Path) - Sort imports according to PEP 8 - Use 'is' instead of '==' for type comparison - Remove unused exception variable

harshitgupta412 · 2026-01-26T20:59:35Z

Hi Rakshitha! Thanks for the PR.
There are a few things missing to complete the integrations for these features.

Audio:

The dtype is not usable directly since audio inputs can't be sent as text to LLMs. Can you update the corresponding code in sem_ops and lotus/templates/task_instructions.py to handle this properly
It will be good to have some docs, tests, and examples on how to use these. Check .github/tests/multimodality_tests.py for the corresponding tests for images.

Ensembling:

There is another PR (Implement semantic map + filter #209) working on this and there are a few comments there that needs to be incorporated here.
The interface for the sem_ops (for now sem_filter) has to be updated to include n_sample and ensembling strategy parameter
The output format for sem_filter function as well as operator should be updated to include individual runs' data (parsed_outputs, logprobs, explanations, raw_outputs)

- Add AudioDtype handling to task_instructions.py for multimodal prompts - Update context_formatter and user_message_formatter for audio inputs - Add n_sample, ensemble, temperature params to sem_filter for test-time scaling - Integrate Ensemble class for multi-sample aggregation (PR lotus-data#209 alignment) - Add per-run rollout fields to SemanticFilterOutput for detailed analysis Addresses feedback on PR lotus-data#243

This PR implements audio data support and test-time scaling features for LOTUS, enhancing multimodal processing and accuracy.

Added a section for type of change and checklist to PR description.

Rakshitha-Ireddi · 2026-01-28T11:09:42Z

Hi @harshitgupta412, thanks for the detailed feedback!
We have updated the PR with the requested changes:
1. Audio Integration

Updated lotus/templates/task_instructions.py (context_formatter and df2multimodal_info) to properly handle AudioDtype columns.
Audio data is now formatted as input_audio blocks for LLM consumption, similar to how images are handled.

2. Ensembling (Aligning with PR #209)

Updated sem_filter interface to include n_sample, ensemble (strategy), and temperature parameters.
Integrated the Ensemble class to aggregate results across multiple samples.
Updated SemanticFilterOutput to include per-run rollout data (all_runs_outputs, all_runs_logprobs, etc.).

The PR description has been updated to reflect these integration details. Thank you

harshitgupta412

Please add test, docs and examples on how to use the new functionalities of sem filter

harshitgupta412 · 2026-01-28T18:51:05Z

        additional_cot_instructions: str = "",
+        n_sample: int = 1,
+        ensemble: EnsembleStrategy | None = None,
+        temperature: float = 1.0,


The temperature is specified directly in the model. I don't think this is needed here

Done! Removed temperature from sem_filter parameters. Users can configure this in the model settings instead.

Added:

examples/ensembling_example.py - 4 usage examples

tests/test_sem_filter_ensembling.py - Integration tests for RawOutputs, SemanticFilterOutput, and EnsembleConfig

harshitgupta412 · 2026-01-28T18:53:11Z

explanations,raw_outputs, answer should be returned for each run as well as the ensembled answer.

Addressed! The new SemanticFilterOutput stores all per-run data in _raw_outputs (a RawOutputs dataclass) and outputs contains the final aggregated result.

- Refactor EnsembleConfig: remove n_samples/temperature, add weights/default - Add RawOutputs dataclass for per-run data organization - Update SemanticFilterOutput with backward-compat properties - Update sem_filter to accept Ensemble object, remove temperature param - Add examples/ensembling_example.py with usage demos - Add tests/test_sem_filter_ensembling.py for integration tests Addresses Harshit's feedback on PR lotus-data#243

Rakshitha-Ireddi · 2026-01-28T19:57:52Z

Hi @harshitgupta412 , I've addressed your feedback:

Removed temperature from sem_filter (model config controls this)
Changed ensemble param to accept Ensemble object directly
Refactored EnsembleConfig to standardize params (weights/default in config)
Added RawOutputs dataclass for per-run data
Added SemanticFilterOutput backward-compat properties
Created tests & examples for the new functionality

Thank you.

harshitgupta412 · 2026-02-05T20:05:05Z

harshitgupta412 · 2026-02-05T20:05:28Z

@@ -0,0 +1,56 @@
+## Purpose


please remove this file

Removed in this commit.

harshitgupta412 · 2026-02-05T20:08:57Z

also please add tests in https://github.com/lotus-data/lotus/tree/main/.github/tests. The tests are split into what runs automatically on push (present in this folder) and additional tests.
Please add couple of tests using ensembling here so that it can be auto-tested.
Also expand the multimodality test to include audio tests

…SCRIPTION.md, add tests - Modified sem_filter.py to expose raw_output_i, explanation_i, parsed_output_i columns for n_sample > 1 - Removed PR_DESCRIPTION.md as requested - Added test_filter_ensembling in lm_tests.py - Added test_filter_operation_audio in multimodality_tests.py

Rakshitha-Ireddi · 2026-02-06T17:05:46Z

The DataFrame now includes per-run columns: raw_output_1, explanation_1, parsed_output_1, raw_output_2, explanation_2, parsed_output_2, etc. The ensemble result is stored in filter_label.

Implemented! When using n_sample > 1, the returned DataFrame now has exactly this structure. See the updated sem_filter.py - the multi-sample path generates columns for each sample with 1-based indexing.

Done! Added:

test_filter_ensembling in .github/tests/lm_tests.py - tests multi-sample filtering with n_sample=2 and verifies all per-run columns are present
test_filter_operation_audio in .github/tests/multimodality_tests.py - tests AudioArray with sem_filter using gpt-4o-audio-preview

harshitgupta412 · 2026-02-07T05:33:49Z

    assert expected_result == list(zip(joined_df["image"], joined_df["element"]))
+
+
+@pytest.mark.parametrize("model", get_enabled("gpt-4o-audio-preview"))


need to enable gpt-4o-audio-preview

Done. I enabled gpt-4o-audio-preview in multimodality_tests.py and updated the test to use a valid WAV file input. The test test_filter_operation_audio now passes locally!

@harshitgupta412 , Could you please review the changes ?

Rakshitha-Ireddi · 2026-03-28T21:16:30Z

Hello @harshitgupta412 , Could you please check and review the work ? And also i am happy to work for other issues or any help and contributing to this project, Please let me know. I am open to work. Thank you.

Rakshitha Ireddi added 4 commits January 23, 2026 22:33

feat(audio): add AudioArray extension for audio data support

471afcd

Add AudioDtype and AudioArray classes for storing audio data in DataFrames. Supports .wav, .mp3, .mp4, .m4a, .flac, .ogg, .webm formats with caching and base64 encoding for LLM processing. Closes lotus-data#196

test: add comprehensive tests for ensembling and AudioArray

3df7d59

Add test_ensembling.py with 40+ test cases for all ensemble strategies. Add test_audio_array.py with tests for AudioDtype, AudioArray indexing, methods, MIME types, and pandas integration.

style: fix linting issues found by ruff

1877d7f

- Remove unused imports (io, os, tempfile, Path) - Sort imports according to PEP 8 - Use 'is' instead of '==' for type comparison - Remove unused exception variable

Rakshitha Ireddi and others added 3 commits January 28, 2026 15:54

Added audio data support and test-time scaling features

0fc10c0

This PR implements audio data support and test-time scaling features for LOTUS, enhancing multimodal processing and accuracy.

Enhance PR description with type and checklist sections

74a5300

Added a section for type of change and checklist to PR description.

harshitgupta412 reviewed Jan 28, 2026

View reviewed changes

harshitgupta412 reviewed Feb 5, 2026

View reviewed changes

harshitgupta412 reviewed Feb 7, 2026

View reviewed changes

Enable gpt-4o-audio-preview model and use valid WAV file for audio test

4a23937

harshitgupta412 approved these changes Mar 31, 2026

View reviewed changes

		assert expected_result == list(zip(joined_df["image"], joined_df["element"]))


		@pytest.mark.parametrize("model", get_enabled("gpt-4o-audio-preview"))

Conversation

Rakshitha-Ireddi commented Jan 23, 2026

Uh oh!

harshitgupta412 commented Jan 26, 2026

Uh oh!

Rakshitha-Ireddi commented Jan 28, 2026

Uh oh!

harshitgupta412 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Rakshitha-Ireddi commented Jan 28, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harshitgupta412 commented Feb 5, 2026

Uh oh!

Rakshitha-Ireddi commented Feb 6, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Rakshitha-Ireddi commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants