@quarpeeze

This walkthrough helps students navigate the repository at the beginning of the course. It should make getting started easier by not only describing the functionality included in pv211-utils but also giving a quick demo of it.

  1. For every project (Cranfield, ARQMath, CQADupStack, TREC) there is a brief description and a demo code snippet that lets the user load the given dataset and inspect its data.
  2. There is a section on how to use the datasets.py module to load data.
  3. Text preprocessing techniques such as lemmatization and stemming are demonstrated with examples.
  4. There is a full demo of the systems package, with code snippets that show how the package helps in retrieval tasks.
  5. The contents of the repo are mapped, as far as possible, to the Manning et al. (2008) book, with direct links to the book sections.
  6. Extra links and useful sources (such as articles about BERT) are included. More may need to be added, or the current ones filtered.

I tried to look at it from a student's perspective and make something I would have wanted when I started the course, so I hope it's at least a bit helpful :)

"\n",
"🔗**GitHub:** [https://github.com/MIR-MU/pv211-utils](https://github.com/MIR-MU/pv211-utils)\n",
"\n",
"🔗**GitLab:** [https://gitlab.fi.muni.cz/xstefan3/pv211-utils](https://gitlab.fi.muni.cz/xstefan3/pv211-utils)\n",

I'd rather not reference the GitLab repo -- it is no longer supported.

"Code implementations here are mostly linked to concepts from the book [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/) by Manning, Raghavan, & Schütze (2008). You can download the full PDF version [HERE](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf).\n",
"\n",
"\n",
"- Starting off (cloning and setting up the repo) -> [Click here](#starting-off)\n",

I am not sure why, but the links do not work when viewing the notebook on GitHub, i.e. here: https://github.com/MIR-MU/pv211-utils/blob/1ad51a5874f5f8ce6d7c44a5ed7db0a422f72bca/walkthrough.ipynb

Maybe there is an easy fix, but I understand that the intended use is probably to download the notebook and open it in a Jupyter environment, right?

@stefanik12

stefanik12 commented Sep 2, 2025

> This walkthrough helps students navigate the repository at the beginning of the course. It should make getting started easier by not only describing the functionality included in pv211-utils but also giving a quick demo of it.
>
> I tried to look at it from a student's perspective and make something I would have wanted when I started the course, so I hope it's at least a bit helpful :)

I think it would be nice to add this -- or any other statement of the intended users -- to the beginning of the notebook, so that the reader knows they are in the right place.

"cell_type": "markdown",
"metadata": {},
"source": [
"## Cranfield Collection"

I was thinking it might be nice to also walk the user through the concepts common to all the projects, i.e. that there are Documents, Judgements, an IRSystem, etc., and how they interact. Ideally this would also include links to the base implementations in the repo.
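
To make the idea more concrete, here is a rough sketch of what such an overview cell could convey. The class names (`Document`, `Query`, `ToySystem`) and the `precision_at_k` helper are hypothetical stand-ins written for illustration, not the actual pv211_utils base classes; only the `search` method and the `document_id` / `body` attributes mirror what the notebook already uses.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a project's document class."""
    document_id: str
    body: str


@dataclass(frozen=True)
class Query:
    """Hypothetical stand-in for a project's query class."""
    query_id: str
    body: str


# A judgement records that a document is relevant to a query.
Judgement = Tuple[str, str]  # (query_id, document_id)


class ToySystem:
    """Toy retrieval system: ranks documents by word overlap with the query."""

    def __init__(self, documents: Iterable[Document]) -> None:
        self.documents = list(documents)

    def search(self, query: Query) -> Iterable[Document]:
        query_terms = set(query.body.lower().split())

        def overlap(document: Document) -> int:
            return len(query_terms & set(document.body.lower().split()))

        ranked = sorted(self.documents, key=overlap, reverse=True)
        # Only return documents that share at least one word with the query.
        yield from (doc for doc in ranked if overlap(doc) > 0)


def precision_at_k(system: ToySystem, query: Query,
                   judgements: Set[Judgement], k: int) -> float:
    """Fraction of the top-k results that are judged relevant."""
    top_k = list(system.search(query))[:k]
    hits = sum((query.query_id, doc.document_id) in judgements for doc in top_k)
    return hits / k


documents = [
    Document("d1", "boundary layer flow over a flat plate"),
    Document("d2", "heat transfer in supersonic flow"),
    Document("d3", "structural fatigue of aircraft wings"),
]
query = Query("q1", "supersonic flow heat transfer")
judgements = {("q1", "d2")}

system = ToySystem(documents)
for doc in system.search(query):
    print(f"Doc {doc.document_id}: {doc.body[:80]}")
print(precision_at_k(system, query, judgements, k=1))
```

The point of the sketch is the flow: documents go into a system, a query goes into `search`, and judgements are only needed when the ranked results are evaluated.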


It would also be nice to show how these objects are used: you already cover that at the end of the notebook:

results_with_exp = list(retriever_with_exp.search(query))
for doc in results_with_exp:
    print(f"→ Doc {doc.document_id}: {doc.body[:80]}...")

...so maybe just copy a small demonstration of this kind to the beginning, before the specific projects (Cranfield, ...) are introduced?

"- ARQMath ---> *[ detailed demo...](#cranfield-collection)*\n",
"- Cranfield ---> *[detailed demo...](#cqadupstack-collection)*\n",
"- TREC ---> *[detailed demo...](#arqmath-collection)*\n",
"- BEIR ---> *[detailed demo...](#trec-collection)*\n",

I am not sure tbh where these links were intended to lead. Is this some future section that is not there yet?
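
One way to catch links like these before merging might be a small check over the notebook's markdown cells. The sketch below is an assumption-laden illustration: `github_style_slug` only approximates how heading anchors are generated (lowercase, punctuation dropped, spaces turned into hyphens), both function names are made up here, and the notebook is a toy in-memory dict rather than the real walkthrough.ipynb.

```python
import re
from typing import List


def github_style_slug(heading: str) -> str:
    """Approximate the anchor generated for a markdown heading (assumption)."""
    text = heading.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)  # drop punctuation
    return text.replace(" ", "-")


def find_broken_anchor_links(notebook: dict) -> List[str]:
    """Return in-notebook link targets (#...) that match no heading."""
    headings = set()
    links: List[str] = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", [])
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            match = re.match(r"^#{1,6}\s+(.*)$", line)
            if match:
                headings.add(github_style_slug(match.group(1)))
        links.extend(re.findall(r"\]\(#([^)]+)\)", text))
    return [link for link in links if link not in headings]


# Toy notebook used only for illustration; a real one could be read with
# json.load(open("walkthrough.ipynb", encoding="utf-8")).
toy_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["## Starting off"]},
        {
            "cell_type": "markdown",
            "source": [
                "- Setup -> [Click here](#starting-off)\n",
                "- Later -> [Click here](#not-written-yet)\n",
            ],
        },
    ]
}

broken = find_broken_anchor_links(toy_notebook)
print(broken)
```

This only reports targets that do not exist at all; it would not notice a link whose label and target point at two different existing sections.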

"source": [
"# first let's install nltk if not already installed\n",
"\n",
"!pip install nltk"

Are we missing nltk from the pv211_utils dependencies?
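
If the install cell stays in the notebook, it could at least be guarded so that pip only runs when the package is really missing. A minimal sketch, assuming the pip package name and the import name are the same (true for nltk, not for every package); `ensure_installed` is a hypothetical helper, not something pv211_utils provides.

```python
import importlib.util
import subprocess
import sys


def ensure_installed(package: str) -> bool:
    """Install `package` with pip only if it cannot be imported.

    Returns True if an install was attempted, False if the package
    was already available.
    """
    if importlib.util.find_spec(package) is not None:
        return False
    # Use the interpreter running the notebook so the right environment is targeted.
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    return True


# "json" ships with the standard library, so no install is attempted here.
print(ensure_installed("json"))

# In the notebook the call would be:
# ensure_installed("nltk")
```

Declaring nltk in the package dependencies would make this cell unnecessary, which seems like the cleaner fix.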

@stefanik12 stefanik12 self-assigned this Sep 2, 2025

@stefanik12 stefanik12 left a comment


@quarpeeze thank you for a very nice contribution! All the links will definitely be very useful.
I am thinking that maybe we could make this notebook the default first thing that users see when they enter the iirhub.

I am leaving a couple of ideas for further improving the notebook; you can decide which to address and how.
