105 har missing modules#106
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR adds CSV/NPY time-series dataset abstractions, two HAR-specific dataset implementations (Rodrigues/CPC windowing and Xu/TNC neighbor sampling), multiple Lightning DataModules (folder-based, multimodal, CPC, and Xu variants), a DIET LightningModule with automatic linear-head construction, a DIET linear/adapter module, a small LFR encoder permute option, and comprehensive tests covering datasets, datamodules, and model training paths. ChangesHAR Dataset Infrastructure and DIET SSL Model
Comprehensive Test Coverage
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 17
🧹 Nitpick comments (9)
tests/models/nets/test_diet_linear.py (1)
17-17: ⚡ Quick winAvoid hard-coding batch size in adapter reshape.
Line 17 hard-codes
32, which makes the test fragile if the input batch size changes. Use the runtime batch dimension instead.♻️ Proposed fix
- adapted_model = AdaptedHead(model=model, adapter=lambda x: x.reshape(32, -1)) + adapted_model = AdaptedHead(model=model, adapter=lambda x: x.reshape(x.size(0), -1))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/models/nets/test_diet_linear.py` at line 17, The test hard-codes a batch size in the adapter passed to AdaptedHead (adapter=lambda x: x.reshape(32, -1)); replace the fixed 32 with the runtime batch dimension so the adapter uses x.shape[0] (or x.size(0)) to compute the first dimension dynamically; update the adapter to reshape using the input's batch size (e.g., lambda x: x.reshape(x.shape[0], -1) or lambda x: x.view(x.size(0), -1)) so AdaptedHead works for any batch size.minerva/data/data_modules/har_xu_23.py (2)
170-177: 💤 Low valueMutable default argument (Ruff B006).
Same issue as
HarDataset: replace the list literal default withNoneand assign the fallback inside__init__to avoid shared-state surprises if the module is ever instantiated multiple times with mutations to the list.♻️ Proposed change
- feature_column_prefixes: List[str] = [ - "accel-x", - "accel-y", - "accel-z", - "gyro-x", - "gyro-y", - "gyro-z", - ], + feature_column_prefixes: Optional[List[str]] = None, target_column: str = "standard activity code",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@minerva/data/data_modules/har_xu_23.py` around lines 170 - 177, The parameter feature_column_prefixes is using a mutable default list; change its signature in the class constructor (where feature_column_prefixes is defined) to feature_column_prefixes: Optional[List[str]] = None (import Optional from typing if not already), and inside __init__ of the same class (the constructor that currently uses feature_column_prefixes) set self.feature_column_prefixes = feature_column_prefixes or ["accel-x","accel-y","accel-z","gyro-x","gyro-y","gyro-z"] so each instance gets its own list rather than sharing a single default mutable object.
70-97: ⚡ Quick winDocstring default contradicts signature; consider deferring IO to
setup().
- Lines 70–72: docstring says “By default, this is True” for
use_val_with_train, but the signature default isFalse. Fix the doc.- Lines 84–86: all three
.npyarrays (including test) are loaded eagerly in__init__. With Lightning DataModules the idiomatic location issetup(stage)so the heavy IO runs once per process and only loads the splits required for the current stage; this also lets the module be constructed cheaply (and pickled across DDP workers without holding the arrays).- Lines 95–97: leftover commented
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@minerva/data/data_modules/har_xu_23.py` around lines 70 - 97, The docstring for use_val_with_train contradicts the function signature—update the docstring so its default matches the __init__ signature (use_val_with_train default False) or change the parameter default to True; then remove eager file IO from __init__ (np.load of "train_data.npy", "val_data.npy", "test_data.npy" that populate har_train, har_val, har_test) and defer those loads into a new or existing setup(self, stage=None) method so data is loaded per stage (and the DataModule can be cheaply instantiated/pickled); also delete the commented print lines at the end. Ensure logic that handles use_train_as_val and use_val_with_train remains in setup and still references har_train/har_val/har_test.minerva/data/datasets/har_xu_23.py (2)
334-410: ⚡ Quick winMutable default +
feature_column_prefixes/target_columnappear unused.Two concerns on the
HarDatasetAPI:
feature_column_prefixes: List[str] = [...](Lines 334–341) is a mutable default (Ruff B006). Replace withNoneand initialize inside__init__.feature_column_prefixesandtarget_columnare assigned toselfbut never consulted in__getitem__— features and labels come straight from*_data_subseq.npy/*_labels_subseq.npy. The constructor implies column selection that doesn't actually happen, which will confuse downstream users and tests. Either implement the selection or drop the parameters from the public signature.♻️ Suggested fix for the mutable default
- feature_column_prefixes: List[str] = [ - "accel-x", - "accel-y", - "accel-z", - "gyro-x", - "gyro-y", - "gyro-z", - ], + feature_column_prefixes: Optional[List[str]] = None, target_column: str = "standard activity code", flatten: bool = False, ): ... - self.feature_column_prefixes = feature_column_prefixes + self.feature_column_prefixes = feature_column_prefixes or [ + "accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z", + ]Also applies to: 441-446
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@minerva/data/datasets/har_xu_23.py` around lines 334 - 410, The HarDataset __init__ uses a mutable default for feature_column_prefixes and exposes feature_column_prefixes and target_column that are never used in __getitem__; change the signature to use feature_column_prefixes: Optional[List[str]] = None (and set self.feature_column_prefixes = ["accel-x",... ] if None inside __init__) to fix the Ruff B006 mutable-default issue, and then either (A) implement column selection in __getitem__ (use self.feature_column_prefixes and self.target_column to index or map loaded arrays before returning features/labels) or (B) remove feature_column_prefixes and target_column from the public API and delete their assignments in __init__ so the constructor matches the actual behavior; update any tests/usage accordingly and preserve flatten handling already on self.flatten.
201-218: ⚡ Quick winReplace bare
exceptwith a narrow exception type.Bare
except:(Line 217) also catchesKeyboardInterruptandSystemExit, which is dangerous inside long-running training loops.adfullertypically raisesValueError/numpy.linalg.LinAlgErroron degenerate windows — catch those (or at minimumException) instead. Ruff E722.♻️ Proposed change
- except: + except (ValueError, np.linalg.LinAlgError, Exception): corr.append(0.6)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@minerva/data/datasets/har_xu_23.py` around lines 201 - 218, The bare except in the loop that computes p-values from adfuller (inside the for w_t loop updating corr) should be tightened to catch specific exceptions (e.g., ValueError and numpy.linalg.LinAlgError, or at minimum Exception) instead of catching all exceptions; update the try/except around the adfuller call and p computation (references: adfuller call, p_val, corr.append) to catch the narrow exception types and fall back to appending 0.6 on those known failures so KeyboardInterrupt/SystemExit are not swallowed.minerva/data/datasets/har_rodrigues_24.py (1)
2-10: 💤 Low valueDuplicate
import numpy as np.Lines 2 and 9 both
import numpy as np. Remove the second one.♻️ Proposed fix
import numpy as np import os from minerva.utils.typing import PathLike import torch from torch.utils.data import Dataset from pathlib import Path import pandas as pd -import numpy as np from numpy.lib.stride_tricks import as_strided as ast🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@minerva/data/datasets/har_rodrigues_24.py` around lines 2 - 10, Remove the duplicate import of NumPy: there are two identical lines importing "numpy as np" in minerva/data/datasets/har_rodrigues_24.py (the second occurrence in the shown diff); delete the redundant "import numpy as np" so only a single import remains and ensure no other references are altered.minerva/data/data_modules/har_rodrigues_24.py (1)
100-125: 💤 Low valueRepeated
DataLoaderconstruction differs only by split and shuffle.The three
*_dataloadermethods duplicate identical kwargs. A small helper (mirroringMultiModalHARSeriesDataModule._get_loader) would remove the duplication and make future changes (e.g., addingpin_memory,persistent_workers) a single edit.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@minerva/data/data_modules/har_rodrigues_24.py` around lines 100 - 125, Extract the repeated DataLoader construction into a private helper (e.g., add a method like _get_loader(self, dataset, shuffle: bool)) and have train_dataloader, val_dataloader, and test_dataloader call this helper with self.train_dataset/self.val_dataset/self.test_dataset and the appropriate shuffle flag; the helper should centralize shared kwargs (batch_size=self.batch_size, drop_last=self.drop_last, num_workers=self.num_workers and easily add pin_memory or persistent_workers later) and mirror the pattern used by MultiModalHARSeriesDataModule._get_loader to keep behavior consistent.minerva/data/data_modules/har.py (1)
660-686: 💤 Low valueReplace
_get_loader.Lines 662–664 and 678 use
logging(or remove entirely once stable).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@minerva/data/data_modules/har.py` around lines 660 - 686, In _get_loader replace the two print(...) calls that announce the DataLoader configuration with logger calls: use the module/class logger (e.g., logging.getLogger(__name__) or an existing self.logger) and emit an appropriate level (info or debug) message that includes n_domains_per_sample and batch_size for the RandomDomainSampler branch and the shuffle value for the standard DataLoader branch; update references around RandomDomainSampler, DataLoader, n_domains_per_sample, batch_size, and drop_last to use the logger so messages are filterable and not printed to stdout.minerva/data/datasets/series_dataset.py (1)
343-377: ⚖️ Poor tradeoff
pad=Truedefeatslazy=Trueby reading every CSV at init.
_get_longest_sample_sizecallsself[i]for every index inside_disable_fix_length. Whenlazy=True, each__getitem__triggers_read_csv, reading and parsing every CSV at construction time just to determine the longest sequence. Users requesting lazy mode get neither the memory savings nor the deferred I/O they expect.Consider scanning only file row-counts (e.g., counting newlines or using
sum(1 for _ in open(f))) instead of fully parsing each CSV.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@minerva/data/datasets/series_dataset.py` around lines 343 - 377, _get_longest_sample_size currently forces eager loading by calling self[i] (which triggers __getitem__ -> _read_csv) inside _disable_fix_length; change it to scan file sizes directly from disk instead of indexing into the dataset: iterate over files from _scan_data(), open each Path and count data rows (e.g., count lines or use csv.reader to skip header) to compute the longest sequence length, respect the early return when not self.pad, and remove the reliance on _disable_fix_length for this operation so lazy=True no longer causes full parsing at init.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@minerva/data/data_modules/har_rodrigues_24.py`:
- Around line 25-54: Update the class docstring to accurately describe the
constructor parameter semantics: change the description of use_val_with_train to
state that when True the validation partition is concatenated into the training
partition (not replaced), explicitly document use_train_as_val behavior, and add
missing parameter entries for columns, num_workers, and drop_last with brief
types/defaults; reference HARDatasetCPC.load_dataset and the constructor
parameter names (use_val_with_train, use_train_as_val, columns, num_workers,
drop_last) so readers can correlate behavior with the loader implementation.
- Around line 55-98: The datasets are being built in __init__ causing
HARDatasetCPC to call load_dataset() three times (re-reading CSVs and rerunning
opp_sliding_window) and breaking Lightning's stage-aware lifecycle; move all
expensive dataset construction into setup(stage) and instead load the raw arrays
once (call load_dataset() / the CSV reader a single time and perform
opp_sliding_window once) then slice those arrays per phase
('train'/'val'/'test') and instantiate HARDatasetCPC with phase-specific slices
(or add a constructor flag to HARDatasetCPC to accept preloaded arrays) while
preserving use_val_with_train logic so concatenation happens only once.
In `@minerva/data/data_modules/har_xu_23.py`:
- Around line 260-269: The train_dataloader function currently calls
self._get_dataset_dataloader("train", shuffle=False); change this so the
training DataLoader is created with shuffle=True by calling
self._get_dataset_dataloader("train", shuffle=True) in the train_dataloader
method (function name: train_dataloader) to ensure training batches are shuffled
unless a sampler is intentionally provided elsewhere.
In `@minerva/data/data_modules/har.py`:
- Around line 87-94: The default features tuple is getting wrapped into a
single-item list downstream causing KeyError; fix by normalizing to a proper
list (not a list-of-tuple). Either change the default declaration of features to
a list literal or, better, coerce when storing/forwarding: in
UserActivityFolderDataModule (where features is stored) set self.features =
list(features) and/or update SeriesFolderCSVDataset.__init__ to convert non-list
iterables correctly (replace the current if not isinstance(features, list):
features = [features] with features = list(features) for iterable inputs).
Ensure the unique symbols touched are features (the default),
UserActivityFolderDataModule (assignment of features), and
SeriesFolderCSVDataset.__init__ (features normalization).
- Around line 528-538: The code currently materializes the entire Subset by
iterating over subset_ to build x_list and y_list solely for a debug print,
causing unnecessary O(N) work; remove the loop that populates x_list/y_list and
the debug print (the lines referencing x_list, y_list and the print f'\n\ny list
is {y_list} \n\n\n'), leaving sampled_indices, Subset(dataset, sampled_indices)
assigned to subset_ and returning subset_ unchanged so the DataLoader can lazily
load samples.
In `@minerva/data/datasets/har_rodrigues_24.py`:
- Around line 254-255: The labels are being cast to np.uint8 which will wrap
values ≥256 and breaks the "return_index_as_label" path where y =
np.arange(len(x)) can exceed 255; in the block that sets
datasets[phase]["labels"] (the same area referencing label and
"return_index_as_label"), change the cast to a wider integer type (e.g.,
np.int64) or conditionally skip/cast differently when label ==
"return_index_as_label" so sample indices are preserved without overflow.
- Around line 237-255: The CSV file iteration using phase_path.glob("*.csv") is
nondeterministic and may yield empty lists causing np.concatenate to throw;
update the loop in the loader that uses self.paths / phase_path (where data_x
and data_y are built and assigned into datasets[phase]) to iterate over a sorted
list of files (e.g., sorted(phase_path.glob("*.csv"))) to ensure reproducible
order, and handle the case of no files by checking if data_x/data_y are
non-empty before calling np.concatenate (or initialize datasets[phase]["data"]
and ["labels"] to empty arrays of the correct dtype when no CSVs are found) so
concatenation never raises ValueError.
- Around line 200-203: The current non-deterministic use of np.random.shuffle in
the label-assignment branch makes index labels unreproducible; update the
dataset to accept a seed (e.g., add a seed parameter to the constructor of the
class containing self.data and self.labels) and replace np.random.shuffle with a
per-instance RNG: create rng = np.random.default_rng(self.seed) and call
rng.shuffle(datum_index) when computing datum_index in the block guarded by if
self.label and self.label == "return_index_as_label"; ensure the new seed
parameter is stored on the instance (self.seed) and used anywhere else
randomness is required so behavior is fully reproducible.
In `@minerva/data/datasets/har_xu_23.py`:
- Around line 122-130: The constructor for TNCDataset does not validate that the
series length T is greater than 4 * window_size, causing
np.random.randint(2*window_size, T - 2*window_size) in __getitem__ to raise
low>=high later; add a validation in TNCDataset.__init__ after
self.T/self.window_size are set that checks if self.T > 4 * self.window_size and
raise a ValueError with a clear message (mention T and window_size) if the check
fails so misconfiguration fails fast; keep existing epsilon/delta logic intact.
- Around line 352-360: Docstring filenames are inconsistent with the actual
loader which uses "{annotate}_data_subseq.npy" for all splits and there is a
stale commented-out load line; update the docstring to list
train_data_subseq.npy, train_labels_subseq.npy, val_data_subseq.npy,
val_labels_subseq.npy, test_data_subseq.npy, test_labels_subseq.npy (matching
the loader that uses the annotate variable) and remove the stale comment "#
self.labels = np.load(...)" in the class/function that constructs/loads the
dataset (refer to the annotate variable used when building file names and the
constructor/__init__ or load method where the commented line appears).
- Around line 320-326: The fallback branch that checks `if len(x_n) == 0` is
unreachable and would crash if hit because it calls `.unsqueeze` on a NumPy
array; remove or guard this dead branch in the method that builds negatives
(look for variables `x_n`, `x`, `self.time_series` in the same function) and
instead ensure inputs and `np.random.randint(..., size=self.mc_sample_size)` are
called with valid bounds (or assert `self.mc_sample_size > 0`) so an empty `x_n`
cannot occur; if you still need a safe fallback, replace the `.unsqueeze(0)`
call with `np.expand_dims(x, 0)` (or `x[None, ...]`) to avoid AttributeError.
In `@minerva/data/datasets/series_dataset.py`:
- Around line 142-163: When feature_prefixes is None the code overwrites
self.feature_prefixes with all non-label columns then blindly reshapes using
len(self.feature_prefixes], collapsing time to 1; change this by detecting the
auto-discovery case before the reshape: preserve whether prefixes were
explicitly provided (e.g. store a boolean like was_prefixes_provided before you
overwrite), and if was_prefixes_provided is False and self.features_as_channels
is True either skip the channel reshape (do not call data.reshape(...)) or raise
a clear error asking for explicit feature_prefixes when
features_as_channels=True; modify the branch around the feature_prefixes
assignment and the block that calls data.reshape(...) to implement this guard
using the symbols self.feature_prefixes, self.features_as_channels, and the
existing data.reshape call.
In `@minerva/models/nets/diet_linear.py`:
- Around line 11-13: In DietLinear.forward, don't call self.model.forward(x)
directly; instead pass the adapted tensor through the model via the module call
(use self.model(x)) so PyTorch's __call__ and hooks run; update the forward
method that currently does x = self.adapter(x); return self.model.forward(x) to
use self.model(x) after adapter, leaving self.adapter and self.model references
intact.
In `@minerva/models/ssl/diet.py`:
- Around line 92-106: The lazy linear-head construction uses a CPU tensor for
sizing which can mismatch the backbone's device; update the random_input
creation to move it to the same device as the model/backbone (e.g., use the
backbone's parameter device or a model attribute) before calling self.backbone,
so that the forwards for self.backbone (and optional self.adapter) run on the
correct device and out.size(1) is computed without device errors; keep the rest
of the logic (flattening, encoding_size and creating self.linear_head)
unchanged.
- Around line 1-8: There are two identical imports of the torch module in
minerva/models/ssl/diet.py; remove the duplicate import so only a single "import
torch" remains (keep whichever import line fits the file's import
ordering/style), then run the linter/formatter to ensure imports are clean; look
for the duplicate symbol "torch" in the import block to locate the lines to
change.
- Line 19: Update the type annotation for the parameter named loss to be
explicit Optional[Callable] instead of Callable = None; import Optional from
typing if not already imported and ensure the signature (e.g., in the
constructor or function where loss is defined) uses Optional[Callable] for
clarity and PEP 484 compliance (keep the default = None unchanged).
In `@tests/data/datasets/test_series_dataset_multimodal.py`:
- Around line 368-377: The test currently shuffles labels_from_dataset in-place
which mutates the original returned labels and can hide defects; instead, keep
labels_from_dataset immutable and operate on a copy: create a copy (e.g.,
shuffled_labels = labels_from_dataset.copy()), shuffle that copy until it
differs from ground_truth, run the length and non-identity assertions against
shuffled_labels, and keep the final sorted check using the original
labels_from_dataset (sorted_labels_from_dataset = np.sort(labels_from_dataset))
to verify return_index_as_label preserves the identity when sorted.
---
Nitpick comments:
In `@minerva/data/data_modules/har_rodrigues_24.py`:
- Around line 100-125: Extract the repeated DataLoader construction into a
private helper (e.g., add a method like _get_loader(self, dataset, shuffle:
bool)) and have train_dataloader, val_dataloader, and test_dataloader call this
helper with self.train_dataset/self.val_dataset/self.test_dataset and the
appropriate shuffle flag; the helper should centralize shared kwargs
(batch_size=self.batch_size, drop_last=self.drop_last,
num_workers=self.num_workers and easily add pin_memory or persistent_workers
later) and mirror the pattern used by MultiModalHARSeriesDataModule._get_loader
to keep behavior consistent.
In `@minerva/data/data_modules/har_xu_23.py`:
- Around line 170-177: The parameter feature_column_prefixes is using a mutable
default list; change its signature in the class constructor (where
feature_column_prefixes is defined) to feature_column_prefixes:
Optional[List[str]] = None (import Optional from typing if not already), and
inside __init__ of the same class (the constructor that currently uses
feature_column_prefixes) set self.feature_column_prefixes =
feature_column_prefixes or
["accel-x","accel-y","accel-z","gyro-x","gyro-y","gyro-z"] so each instance gets
its own list rather than sharing a single default mutable object.
- Around line 70-97: The docstring for use_val_with_train contradicts the
function signature—update the docstring so its default matches the __init__
signature (use_val_with_train default False) or change the parameter default to
True; then remove eager file IO from __init__ (np.load of "train_data.npy",
"val_data.npy", "test_data.npy" that populate har_train, har_val, har_test) and
defer those loads into a new or existing setup(self, stage=None) method so data
is loaded per stage (and the DataModule can be cheaply instantiated/pickled);
also delete the commented print lines at the end. Ensure logic that handles
use_train_as_val and use_val_with_train remains in setup and still references
har_train/har_val/har_test.
In `@minerva/data/data_modules/har.py`:
- Around line 660-686: In _get_loader replace the two print(...) calls that
announce the DataLoader configuration with logger calls: use the module/class
logger (e.g., logging.getLogger(__name__) or an existing self.logger) and emit
an appropriate level (info or debug) message that includes n_domains_per_sample
and batch_size for the RandomDomainSampler branch and the shuffle value for the
standard DataLoader branch; update references around RandomDomainSampler,
DataLoader, n_domains_per_sample, batch_size, and drop_last to use the logger so
messages are filterable and not printed to stdout.
In `@minerva/data/datasets/har_rodrigues_24.py`:
- Around line 2-10: Remove the duplicate import of NumPy: there are two
identical lines importing "numpy as np" in
minerva/data/datasets/har_rodrigues_24.py (the second occurrence in the shown
diff); delete the redundant "import numpy as np" so only a single import remains
and ensure no other references are altered.
In `@minerva/data/datasets/har_xu_23.py`:
- Around line 334-410: The HarDataset __init__ uses a mutable default for
feature_column_prefixes and exposes feature_column_prefixes and target_column
that are never used in __getitem__; change the signature to use
feature_column_prefixes: Optional[List[str]] = None (and set
self.feature_column_prefixes = ["accel-x",... ] if None inside __init__) to fix
the Ruff B006 mutable-default issue, and then either (A) implement column
selection in __getitem__ (use self.feature_column_prefixes and
self.target_column to index or map loaded arrays before returning
features/labels) or (B) remove feature_column_prefixes and target_column from
the public API and delete their assignments in __init__ so the constructor
matches the actual behavior; update any tests/usage accordingly and preserve
flatten handling already on self.flatten.
- Around line 201-218: The bare except in the loop that computes p-values from
adfuller (inside the for w_t loop updating corr) should be tightened to catch
specific exceptions (e.g., ValueError and numpy.linalg.LinAlgError, or at
minimum Exception) instead of catching all exceptions; update the try/except
around the adfuller call and p computation (references: adfuller call, p_val,
corr.append) to catch the narrow exception types and fall back to appending 0.6
on those known failures so KeyboardInterrupt/SystemExit are not swallowed.
In `@minerva/data/datasets/series_dataset.py`:
- Around line 343-377: _get_longest_sample_size currently forces eager loading
by calling self[i] (which triggers __getitem__ -> _read_csv) inside
_disable_fix_length; change it to scan file sizes directly from disk instead of
indexing into the dataset: iterate over files from _scan_data(), open each Path
and count data rows (e.g., count lines or use csv.reader to skip header) to
compute the longest sequence length, respect the early return when not self.pad,
and remove the reliance on _disable_fix_length for this operation so lazy=True
no longer causes full parsing at init.
In `@tests/models/nets/test_diet_linear.py`:
- Line 17: The test hard-codes a batch size in the adapter passed to AdaptedHead
(adapter=lambda x: x.reshape(32, -1)); replace the fixed 32 with the runtime
batch dimension so the adapter uses x.shape[0] (or x.size(0)) to compute the
first dimension dynamically; update the adapter to reshape using the input's
batch size (e.g., lambda x: x.reshape(x.shape[0], -1) or lambda x:
x.view(x.size(0), -1)) so AdaptedHead works for any batch size.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 5781ce65-6b2b-41a3-a422-80fdb566cd28
📒 Files selected for processing (17)
minerva/data/data_modules/har.pyminerva/data/data_modules/har_rodrigues_24.pyminerva/data/data_modules/har_xu_23.pyminerva/data/datasets/har_rodrigues_24.pyminerva/data/datasets/har_xu_23.pyminerva/data/datasets/series_dataset.pyminerva/models/nets/diet_linear.pyminerva/models/nets/lfr_har_architectures.pyminerva/models/ssl/diet.pytests/data/data_modules/test_data_module_har.pytests/data/data_modules/test_data_module_tnc.pytests/data/datasets/test_har_rodrigues_24.pytests/data/datasets/test_har_xu_23.pytests/data/datasets/test_series_dataset_folder_csv.pytests/data/datasets/test_series_dataset_multimodal.pytests/models/nets/test_diet_linear.pytests/models/ssl/test_diet.py
Removed the HarDataModule_Downstream class and its methods, which handled loading and batching of data for human activity recognition tasks.
There was a problem hiding this comment.
Actionable comments posted: 6
♻️ Duplicate comments (1)
tests/data/datasets/test_series_dataset_multimodal.py (1)
369-378:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not mutate dataset output in the assertion path.
At Line 373, shuffling
labels_from_datasetin-place weakens the test and can let a brokenreturn_index_as_labelimplementation pass.✅ Suggested fix
- labels_from_dataset = dataset[:][1] ground_truth = np.arange(len(dataset)) - - while sum(abs(labels_from_dataset - ground_truth)) == 0: - np.random.shuffle(labels_from_dataset) - - assert len(labels_from_dataset) == len(ground_truth) - assert sum(abs(labels_from_dataset - ground_truth)) > 0 - sorted_labels_from_dataset = np.sort(labels_from_dataset) - assert sum(abs(sorted_labels_from_dataset - ground_truth)) == 0 + draws = [dataset[:][1] for _ in range(8)] + for labels_from_dataset in draws: + assert len(labels_from_dataset) == len(ground_truth) + np.testing.assert_array_equal(np.sort(labels_from_dataset), ground_truth) + + assert any( + not np.array_equal(labels_from_dataset, ground_truth) + for labels_from_dataset in draws + ), "return_index_as_label should eventually return a non-identity permutation."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/data/datasets/test_series_dataset_multimodal.py` around lines 369 - 378, The test currently shuffles labels_from_dataset in-place which mutates the dataset output; instead make and shuffle a copy, e.g. create a copy (labels_shuffled = labels_from_dataset.copy() or np.copy(labels_from_dataset)), run the while loop and np.random.shuffle on that copy, and use labels_shuffled for the inequality assertions and sorting; keep labels_from_dataset unchanged for the final sorted comparison with ground_truth so the original dataset output is not mutated (referencing variables labels_from_dataset, labels_shuffled, ground_truth, and dataset).
🧹 Nitpick comments (1)
minerva/data/data_modules/har.py (1)
653-654: ⚡ Quick winReplace debug
print()calls with proper logging.These print statements output on every dataloader creation, polluting stdout during training runs. Consider using Python's
loggingmodule or removing them entirely.♻️ Proposed fix
+import logging + +logger = logging.getLogger(__name__) + # In _get_loader method: if self.n_domains_per_sample is not None: - print( - f"Using DataLoader with RandomDomainSampler with n_domains_per_sample={self.n_domains_per_sample}" - ) + logger.debug( + "Using DataLoader with RandomDomainSampler with n_domains_per_sample=%d", + self.n_domains_per_sample, + ) # ... sampler code ... else: - print(f"Using DataLoader with shuffle={shuffle}") + logger.debug("Using DataLoader with shuffle=%s", shuffle)Also applies to: 669-669
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@minerva/data/data_modules/har.py` around lines 653 - 654, Replace the debug print statements in the HAR data module with proper logging: locate the print calls that output "Using DataLoader with RandomDomainSampler with n_domains_per_sample={self.n_domains_per_sample}" (and the similar one around line 669) inside the HAR data module (e.g., methods that build the DataLoader / RandomDomainSampler like train_dataloader/val_dataloader setup). Add or reuse a module-level logger (logger = logging.getLogger(__name__)) and change the prints to logger.debug(...) or logger.info(...) as appropriate, passing the formatted message and avoiding stdout prints so training logs remain clean.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@minerva/data/data_modules/har_xu_23.py`:
- Around line 72-75: The docstring for the parameter use_val_with_train
contradicts the actual default (the signature sets use_val_with_train=False);
update the docstring to state the correct default of False (or change the
function/class signature to True if you intended that behavior). Locate the
parameter use_val_with_train in the function or class signature in har_xu_23.py
and make the docstring parameter description match the actual default value so
they are consistent.
- Around line 172-179: The constructor currently uses a mutable list default for
feature_column_prefixes; change the parameter to accept None (e.g.,
feature_column_prefixes: Optional[List[str]] = None) and inside __init__ set
self.feature_column_prefixes = feature_column_prefixes or
["accel-x","accel-y","accel-z","gyro-x","gyro-y","gyro-z"] so the default list
is created per-instance; update the signature and assignment in the HarXu23 (or
the class containing feature_column_prefixes) and remove the mutable default
from the parameter list.
In `@minerva/data/data_modules/har.py`:
- Around line 415-416: The docstring for the HAR data module incorrectly
documents the parameter shuffle_train as type `str`; update the docstring to
reflect the actual type (`bool`) where shuffle_train is described (e.g., in the
class/function docstring that documents shuffle_train in
minerva.data.data_modules.har.py) so the doc comment matches the parameter
declaration and usage.
In `@minerva/data/datasets/har_xu_23.py`:
- Around line 219-220: The bare except block around the adfuller call should be
replaced with specific exception handling; update the try/except that currently
does "except: corr.append(0.6)" to catch the expected exceptions (e.g.,
ValueError, numpy.linalg.LinAlgError, or at minimum Exception) and append the
fallback 0.6 only for those cases; locate the adfuller invocation and the
corr.append call to make this change (ensure you import/qualify LinAlgError if
used) so we don't swallow system-exiting exceptions like
KeyboardInterrupt/SystemExit.
- Around line 336-343: The class currently defines feature_column_prefixes as a
module/class-level default list which is mutable; change the signature to accept
feature_column_prefixes: Optional[List[str]] = None and in the class __init__
assign self.feature_column_prefixes = feature_column_prefixes or
["accel-x","accel-y","accel-z","gyro-x","gyro-y","gyro-z"] so each instance gets
its own list and no shared mutable default is used; update any references to
feature_column_prefixes to use self.feature_column_prefixes.
In `@minerva/models/ssl/diet.py`:
- Around line 95-100: The lazy head probe creation assumes sliceable Dataset and
CPU tensors; fix by obtaining a sample input robustly (use try: sample =
training_dataset[0] except: sample = next(iter(training_dataset))) then extract
the tensor input (if sample is tuple/list use sample[0]) and infer device from
the backbone (device = next(self.backbone.parameters()).device or
next(self.backbone.buffers()).device). Create random_input with the same shape
on that device (torch.rand(sample_input.shape, device=device)) so the probe
supports all Dataset types and matches the backbone device before calling
self.backbone to compute embeddings (preserve handling of self.adapter and
subsequent code).
---
Duplicate comments:
In `@tests/data/datasets/test_series_dataset_multimodal.py`:
- Around line 369-378: The test currently shuffles labels_from_dataset in-place
which mutates the dataset output; instead make and shuffle a copy, e.g. create a
copy (labels_shuffled = labels_from_dataset.copy() or
np.copy(labels_from_dataset)), run the while loop and np.random.shuffle on that
copy, and use labels_shuffled for the inequality assertions and sorting; keep
labels_from_dataset unchanged for the final sorted comparison with ground_truth
so the original dataset output is not mutated (referencing variables
labels_from_dataset, labels_shuffled, ground_truth, and dataset).
---
Nitpick comments:
In `@minerva/data/data_modules/har.py`:
- Around line 653-654: Replace the debug print statements in the HAR data module
with proper logging: locate the print calls that output "Using DataLoader with
RandomDomainSampler with n_domains_per_sample={self.n_domains_per_sample}" (and
the similar one around line 669) inside the HAR data module (e.g., methods that
build the DataLoader / RandomDomainSampler like train_dataloader/val_dataloader
setup). Add or reuse a module-level logger (logger =
logging.getLogger(__name__)) and change the prints to logger.debug(...) or
logger.info(...) as appropriate, passing the formatted message and avoiding
stdout prints so training logs remain clean.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 148136c1-d979-461b-9f66-bc2ead1963bd
📒 Files selected for processing (19)
minerva/data/data_modules/har.pyminerva/data/data_modules/har_rodrigues_24.pyminerva/data/data_modules/har_xu_23.pyminerva/data/datasets/har_rodrigues_24.pyminerva/data/datasets/har_xu_23.pyminerva/data/datasets/series_dataset.pyminerva/models/nets/diet_linear.pyminerva/models/nets/lfr_har_architectures.pyminerva/models/ssl/diet.pytests/data/data_modules/test_data_module_har.pytests/data/data_modules/test_data_module_tnc.pytests/data/datasets/test_har_rodrigues_24.pytests/data/datasets/test_har_xu_23.pytests/data/datasets/test_series_dataset_folder_csv.pytests/data/datasets/test_series_dataset_multimodal.pytests/models/nets/test_diet_linear.pytests/models/nets/test_lfr_har_architectures.pytests/models/ssl/test_diet.pytests/models/ssl/test_lfr_implementation_har.py
✅ Files skipped from review due to trivial changes (3)
- tests/models/nets/test_lfr_har_architectures.py
- minerva/models/nets/diet_linear.py
- tests/models/ssl/test_lfr_implementation_har.py
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/data/data_modules/test_data_module_tnc.py
- tests/models/nets/test_diet_linear.py
- minerva/data/data_modules/har_rodrigues_24.py
- minerva/data/datasets/har_rodrigues_24.py
- tests/models/ssl/test_diet.py
- minerva/models/nets/lfr_har_architectures.py
- tests/data/datasets/test_har_xu_23.py
- tests/data/datasets/test_har_rodrigues_24.py
- minerva/data/datasets/series_dataset.py
|
LGTM |
Describe your changes
Adds the components required to replicate the paper
"Benchmarking Encoders and Self-Supervised Learning for Smartphone-Based Human Activity Recognition"
(IEEE Access, https://ieeexplore.ieee.org/document/11417778) on top of current Minerva
main. The paper's experiment fork contained modules that were never upstreamed, so the public release cannot currently reproduce the paper's pre-training pipeline. This PR brings those modules in.Data layer
SeriesDataset: Time-series base dataset (parent class for the HAR datasets).HARDatasetCPC(Rodrigues 2024),TNCDatasetandHarDataset(Xu 2023).MultiModalHARSeriesDataModuleand per-dataset HAR DataModules (har.py,har_rodrigues_24.py,har_xu_23.py).Models
DIET: SSL technique, one of the four benchmarked in the paper. The other three (TNC, TF-C, LFR) are alreadyin upstream.
DIETLinear: linear projection head used by DIET.LFRHARArchitecture: optionalpermuteflag onHARSCnnEncoder(used when feeding(B, T, C)instead of(B, C, T))Tests
New pytest files under
tests/coveringSeriesDataset(folder-CSV + multimodal), both HAR datasets, both HAR DataModules,DIETLinear,DIETCommit layout
feat(data): add SeriesDataset base class for time seriesfeat(data): add HAR datasets for Rodrigues 2024 and Xu 2023feat(data): add HAR DataModules (generic + Rodrigues 24, Xu 23)feat(ssl): add DIET and linear headfeat(nets): extend LFRHARArchitecture to be compatible with TNC (permute)test: cover SeriesDataset, HAR datasets/DataModules, DIET, DIETLinearNo changes to existing public APIs.
pyproject.toml,requirements.txt, and every existing__init__.pyare byte-identical to upstream, no dependency or export changes required.Issue ticket number and link (If apply)
Closes #105 [Feature Request] Missing modules required to replicate published HAR paper.
Checklist before requesting a review
No new dependencies are introduced.
pyproject.tomlandrequirements.txtare unchanged from upstreammain.Summary by CodeRabbit
New Features
Tests