The Potential and Limitations of Vision-Language Models for Human Motion Understanding: A Case Study in Data-Driven Stroke Rehabilitation

[Paper] [Project] [BibTeX]

Abstract: Vision–language models (VLMs) have demonstrated remarkable performance across a wide range of computer-vision tasks, sparking interest in their potential for digital health applications. Here, we apply VLMs to two fundamental challenges in data-driven stroke rehabilitation: automatic quantification of rehabilitation dose and impairment from videos. We formulate these problems as motion-identification tasks, which can be addressed using VLMs. We evaluate our proposed framework on a cohort of 29 healthy controls and 51 stroke survivors. Our results show that current VLMs lack the fine-grained motion understanding required for precise quantification: dose estimates are comparable to a baseline that excludes visual information, and impairment scores cannot be reliably predicted. Nevertheless, several findings suggest future promise. With optimized prompting and post-processing, VLMs can classify high-level activities from a few frames, detect motion and grasp with moderate accuracy, and approximate dose counts within 25% of ground truth for mildly impaired and healthy participants, all without task-specific training or finetuning. These results highlight both the current limitations and emerging opportunities of VLMs for data-driven stroke rehabilitation and broader clinical video analysis.

Reproducing the Results

Installation

git clone https://github.com/livctr/cvfm4rehab.git
cd cvfm4rehab
git submodule update --init --recursive
bash ./setup_cvfm4rehab_envs.sh [-y]

The bash script ./setup_cvfm4rehab_envs.sh creates four separate environments (cvfm4rehab, cvfm4rehab_llava, cvfm4rehab_vila, and cvfm4rehab_longva) for running different models. Add the -y flag to answer "yes" to all prompts. Alternatively, you can create only the environments required by the models you plan to use.

Environment	Models
`cvfm4rehab`	qwen2_5_vl_[7,32,72]b
`cvfm4rehab_llava`	internvl3p5_[2,8,38,30b_a3]b, internvl3_78b, llava_next_video_[7,72]b, llava_ov_[0p5,7,72]b
`cvfm4rehab_vila`	nvila_[8,15]b, longvila_8b
`cvfm4rehab_longva`	longva_7b

Running an Experiment

# (1) Manually fill in appropriate API keys / sbatch directive in `evaluate.sh.example`.
# (2) Run the following command
bash evaluate.sh.example --model [models] --task [tasks]

Replacements for [models]:

One model: qwen2_5_vl_7b, qwen2_5_vl_32b, qwen2_5_vl_72b, llava_ov_0p5b, llava_ov_7b, llava_ov_72b, llava_next_video_7b, llava_next_video_72b, internvl3p5_2b, internvl3p5_8b, internvl3p5_38b, internvl3_78b, internvl3p5_30b_a3b, internvl3p5_241b_a28b, longva_7b, nvila_8b, nvila_15b
Aliases: small, medium, big. Runs all models of size up to $\sim 8\text{B}$, $\sim 15\text{B}$ to $38\text{B}$, and $\sim 72\text{B}$ or larger parameters, respectively.
Alias: all. Runs all available models.
You can also run multiple models, comma-separated, as such: qwen2_5_vl_7b,qwen2_5_vl_32b,qwen2_5_vl_72b
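
As a rough illustration of how the size aliases partition the model list, the sketch below buckets model names by the parameter count embedded in each name and joins one bucket into a comma-separated --model value. The helper names (model_size_b, size_bucket) and the exact cut-offs are assumptions derived from the ranges described above; the membership actually used by evaluate.sh.example may differ, for example for the mixture-of-experts model internvl3p5_30b_a3b.

```python
import re

# Model names as listed above under "One model".
MODELS = [
    "qwen2_5_vl_7b", "qwen2_5_vl_32b", "qwen2_5_vl_72b",
    "llava_ov_0p5b", "llava_ov_7b", "llava_ov_72b",
    "llava_next_video_7b", "llava_next_video_72b",
    "internvl3p5_2b", "internvl3p5_8b", "internvl3p5_38b",
    "internvl3_78b", "internvl3p5_30b_a3b", "internvl3p5_241b_a28b",
    "longva_7b", "nvila_8b", "nvila_15b",
]


def model_size_b(name: str) -> float:
    """Read the first size token such as `_7b` or `_0p5b` (billions of parameters)."""
    match = re.search(r"_(\d+(?:p\d+)?)b", name)
    if match is None:
        raise ValueError(f"no size token found in {name!r}")
    return float(match.group(1).replace("p", "."))


def size_bucket(name: str) -> str:
    """Assumed cut-offs: up to 8B is small, up to 38B is medium, anything larger is big."""
    size = model_size_b(name)
    if size <= 8:
        return "small"
    if size <= 38:
        return "medium"
    return "big"


# Build a comma-separated value that could be passed to --model.
small_models = ",".join(m for m in MODELS if size_bucket(m) == "small")
print(small_models)
```

Passing such a comma-separated list explicitly is an alternative to relying on an alias when you want to control exactly which models run.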

Replacements for [tasks]

Choose which task to evaluate the VLM on. The task YAML files are under "lmms_eval/tasks/strokerehab/", with the important choices listed below. The letters correspond to those found on the right side of Table 1.
(A) Activity Identification: Ask a VLM to identify which one of nine activities is depicted in a video. See "postprocess/id/identification.ipynb" for results.
- strokerehab_identification_1: prompt with pre-existing descriptions of the activities.
- strokerehab_identification_2: use optimized prompts.
(B) Dose Quantification: Ask a VLM to identify one of five fine-grained actions for each $0.533$-s video segment. See "postprocess/primitives/exp_to_latex.ipynb".
- strokerehab_primitives_1, strokerehab_primitives_2: not explored in depth. The first asks the VLM to output multiple actions (since there could be $>1$ per segment), while the second asks the VLM to classify one action directly.
- strokerehab_primitives_3: explored in the paper. Breaks down the fine-grained actions along two axes: motion and grasp. This strategy prompts the VLM to output one answer for each axis and uses rule-based methods to reconstruct the action (two of the actions are distinguished based on a grasp in the immediate future).
(C) Dose Quantification RTT/Shelf: Same as above, but with a different dataset consisting of videos with more regular motions. See "postprocess/primitives/counting.ipynb" for results. You should run bash evaluate.sh --model qwen2_5_vl_prim --task strokerehab_counting --model_args "do_crop=True,do_postprocess=True". Here, qwen2_5_vl_prim runs Qwen2.5-VL-32B-Instruct with optimized prompting.
(D) Impairment Quantification: Ask a VLM to rate a subject's impairment level. See "postprocess/ia/eval.ipynb" for results.
- strokerehab_ia2_3_30,strokerehab_ia2_31_33: Ask the VLM several questions per item (numbered 3 through 33) and use rules to determine the final item-level score (0, 1, or 2).
- strokerehab_ia4_3_30,strokerehab_ia4_31_33: Prompt the VLM in a Chain-of-Thought manner and parse its final answer.
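
To make the two-axis strategy of strokerehab_primitives_3 in (B) more concrete, here is a minimal sketch of how per-segment motion and grasp answers could be recombined into actions and then counted. Everything named here is an assumption for illustration: the five action labels (reach, reposition, transport, stabilize, idle), the one-segment lookahead, and the choice to leave idle out of the count are not taken from the repository, whose actual rules live in the task and post-processing code.

```python
from itertools import groupby


def reconstruct_action(motion: bool, grasp: bool, grasp_soon: bool) -> str:
    """Map the two axes (plus a future-grasp flag) to one hypothetical action label."""
    if motion and grasp:
        return "transport"
    if grasp:
        return "stabilize"  # grasping without motion
    if not motion:
        return "idle"  # neither motion nor grasp
    # Motion without grasp: the two candidates differ only in whether a grasp follows.
    return "reach" if grasp_soon else "reposition"


def reconstruct_sequence(motion, grasp, lookahead=1):
    """Apply the rules segment by segment, peeking `lookahead` segments ahead for a grasp."""
    actions = []
    for i, (m, g) in enumerate(zip(motion, grasp)):
        grasp_soon = any(grasp[i + 1 : i + 1 + lookahead])
        actions.append(reconstruct_action(m, g, grasp_soon))
    return actions


def count_actions(actions, ignore=("idle",)):
    """Count runs of identical consecutive labels, skipping the ignored ones."""
    return sum(1 for label, _ in groupby(actions) if label not in ignore)


# Illustrative per-segment answers (made up, not data from the paper).
motion = [False, True, True, True, False, True, False]
grasp = [False, False, True, True, True, False, False]
actions = reconstruct_sequence(motion, grasp)
print(actions)
print(count_actions(actions))
```

Collapsing consecutive identical labels before counting means that one action spanning several $0.533$-s segments contributes a single count, which is why per-segment errors can shift the estimated dose noticeably.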

Running ablations

evaluate.sh.example also lets you replace the model and task arguments. Make sure that both argument types are specified explicitly; otherwise the run might break.
Example: bash evaluate.sh.example --model qwen2_5_vl_7b --task strokerehab_primitives_2 --model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_frames_num=1,sampling_strategy=dense,sampling_fps=1,overlap_frames_num=0"
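
The --model_args value in the example is a comma-separated list of key=value pairs. The sketch below shows one way to read such a string into a dictionary, which can help when checking an ablation setting before launching a run. The function parse_model_args is a hypothetical helper written for this illustration; it is not the parser used by the repository, and it leaves all values as strings.

```python
def parse_model_args(arg_string: str) -> dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict; values are kept as plain strings."""
    args = {}
    for pair in arg_string.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        args[key.strip()] = value.strip()
    return args


example = (
    "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_frames_num=1,"
    "sampling_strategy=dense,sampling_fps=1,overlap_frames_num=0"
)
parsed = parse_model_args(example)
for key, value in parsed.items():
    print(f"{key}: {value}")
```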

A Sample of Findings

See the paper for other findings and quantitative results. Here are some failure modes we came across:

(a) Large object bias: Left: The model misclassifies a combing activity as an RTT exercise, likely driven by contextual cues from larger objects such as the black mat and wristbands. Right: The model exhibits hand attribution errors, potentially due to the left hand's interaction with a dominant object (water bottle).

(b) Overreliance on 2D semantics: Timeline of a $6$-s video with dotted lines marking $0.533$-s segments. Colors indicate grasp state: gray (no grasp), orange (grasp), and black (mixed). The ground truth contains two distinct grasps. When queried, "Is the subject's right hand grasping the pink object? Answer 'Yes' or 'No' directly." for each segment, the VLM misinterprets visual proximity as physical contact and fails to distinguish the two separate grasp events.

(c) Hallucination: A patient with severe impairment attempts a shoulder flexion task (ground truth Fugl-Meyer score of $0$). For reference, the diagram on the right depicts successful completion of the task. The model hallucinates movement and incorrectly reports task success.
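
For failure mode (b), the following sketch shows how per-segment 'Yes'/'No' answers could be grouped into grasp events on a timeline of $0.533$-s segments, and how a few false 'Yes' answers merge two grasps into one. The answer sequences are invented for illustration and are not the data behind the figure; the function grasp_events and its handling of answers are assumptions, and the "mixed" state is ignored.

```python
SEGMENT_SECONDS = 0.533  # segment length stated above


def grasp_events(answers, segment_seconds=SEGMENT_SECONDS):
    """Group consecutive 'Yes' segments into (start_s, end_s) grasp intervals."""
    events = []
    start = None
    for i, answer in enumerate(answers):
        is_grasp = answer.strip().lower().startswith("yes")
        if is_grasp and start is None:
            start = i  # a new grasp begins at this segment
        elif not is_grasp and start is not None:
            events.append((start * segment_seconds, i * segment_seconds))
            start = None
    if start is not None:  # grasp still ongoing at the end of the video
        events.append((start * segment_seconds, len(answers) * segment_seconds))
    return events


# Eleven full segments fit in a 6-s video (6 / 0.533 is roughly 11.3).
truth = ["No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "No"]
# Hypothetical model output: proximity read as contact in segments 3 and 4.
model = ["No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

print(len(grasp_events(truth)), "grasp events in the illustrative ground truth")
print(len(grasp_events(model)), "grasp events in the illustrative model output")
```

Because downstream rules rely on where grasps begin and end, merged events like this affect more than the grasp accuracy of the individual segments.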

Citing cvfm4rehab

Coming soon.

