
Feature/ paper discovery module chitchat #319

Open
fiifidawson wants to merge 22 commits into EPFLiGHT:v2 from fiifidawson:feature/-paper-discovery-module-chitchat

@fiifidawson

No description provided.

Introduce the mmore.paper_discovery package with initial components for a paper discovery pipeline:

- schema: dataclasses for CategoryQuery, Paper, and SynonymEntry to normalize data shapes.
- boolean: functions to load synonym tables and build category-level boolean queries (load_synonyms, _or_group, build_boolean_queries).
- logging_config: simple logger configuration for PaperDiscovery.
- sources/base: SourceAdapter protocol defining the search interface and guidance that adapters must not raise on network errors.
- Stubs added for other modules (config, pdf, pipeline, and individual source modules) to be implemented later.

These changes set up the core types and query-building logic used by downstream source adapters and pipeline stages.
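
As an illustration of the `SourceAdapter` contract described above (a `search` interface that must not raise on network errors), here is a minimal sketch. The `Paper` fields, the exact `search()` signature, and the `FlakyAdapter` class are assumptions made for the example, not the code in this PR.

```python
from dataclasses import dataclass
from typing import List, Protocol, runtime_checkable


@dataclass
class Paper:
    # Hypothetical minimal shape; the real schema.Paper carries more fields.
    title: str
    source: str


@runtime_checkable
class SourceAdapter(Protocol):
    """Structural interface: a name plus a search() method."""

    name: str

    def search(self, query: str, max_results: int = 10) -> List[Paper]:
        ...


class FlakyAdapter:
    """Hypothetical adapter whose backend always fails."""

    name = "flaky"

    def _fetch(self, query: str, max_results: int) -> List[Paper]:
        # Stand-in for an HTTP call that breaks.
        raise ConnectionError("simulated network failure")

    def search(self, query: str, max_results: int = 10) -> List[Paper]:
        try:
            return self._fetch(query, max_results)
        except Exception:
            # Contract: never raise from search(); return no results instead.
            return []


adapter = FlakyAdapter()
print(isinstance(adapter, SourceAdapter), adapter.search("llm"))
```

Swallowing the exception inside `search()` lets a pipeline loop over several sources without one failing backend aborting the whole run.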

- Introduce src/mmore/paper_discovery/sources/__init__.py, which centralizes source adapter imports and registers them in REGISTRY (openalex, europepmc, arxiv; google_scholar is commented out).
- Add typed get_adapter(name, **kwargs) to instantiate a SourceAdapter or raise a ValueError for unknown sources, providing a single entrypoint for resolving source adapters.

- Introduce a new `paper-discovery` subcommand and entrypoint to run the Paper Discovery pipeline.
- Add src/mmore/run_paper_discovery.py, which loads a PaperDiscoveryConfig and constructs and runs PaperDiscoveryPipeline (with timing, dotenv, and profiling support), and wire it into the CLI (src/mmore/cli.py).
- Also remove a stale tmp_pdf file.

  - boolean: skip categories with no resolved synonyms instead of emitting an empty boolean string
  - schema: include search_category in Paper.to_dict() output
  - openalex: widen exception catch to never raise from search()
  - tests/conftest: make langchain_milvus import optional so the suite is collectible on partial installs

Add robust PDF fetching with paywall detection and optional EZproxy proxying, plus progress/counting and partial-result handling.

- Add pdf_proxy_prefix config option and docs.
- Replace simple download return value with DownloadResult (path, paywalled, errored, status), detect common paywall statuses, and use a polite default User-Agent.
- Introduce _proxify helper to wrap PDF URLs via an EZproxy prefix and use it for initial/follow-up fetches.
- Improve logging and error handling for network failures; surface paywalled PDFs separately from errors.
- In the pipeline, add tqdm progress (with a lightweight fallback), track succeeded/paywalled/errored/skipped counts, log a summary and a tip when paywalled PDFs are encountered, and ensure partial results are written on KeyboardInterrupt.
- Add arXiv client improvements: request timeout, 429 backoff handling, and related logging.
- Update .gitignore to include local paper_discovery artifacts (pdf cache and papers.json) and normalize example output paths.

@fabnemEPFL fabnemEPFL left a comment

Collaborator

pretty cool PR with this new module. several things to change to integrate it properly into mmore

Comment on lines +23 to +28
<!-- ```bash
pip install scholarly
```

`scholarly` is **not** in the `paper_discovery` extra by design — it is
captcha-prone. Install only if needed. -->

Collaborator

why have this commented?

Author

It is captcha-prone and a bit unreliable. I'll uncomment the install block and the Sources row, and re-document it as an opt-in source.

| OpenAlex | none | ~1 req/s; inverted-index abstracts are rebuilt automatically |
| Europe PMC | none | ~1 req/s; uses `resultType=core` |
| arXiv | none | **1 req / 3 s** (strict, ToS); query is simplified to top terms; 30 s back-off on 429 |
<!-- | Google Scholar | none | Opt-in (`scholarly`), captcha-prone, best-effort | -->

Collaborator

why have this commented?

Author

Will uncomment in the next push

```json
[
  {"word": "Foundation model",
   "synonyms": ["LLM", "large language model", "GPT"]},
```

Collaborator

is this case-sensitive?

Author

Yep

| `pdf_dir` | `./pdf_cache` | Reused across runs (see *PDF caching* below) |
| `force_redownload` | `false` | Set `true` to ignore the on-disk cache and re-fetch every PDF |
| `pdf_proxy_prefix` | `null` | Optional EZproxy prefix for institutional access (see *Paywalled PDFs* below) |
| `user_agent` | `mmore-paper-discovery/1.0` | Be polite — customize per app |

Collaborator

I don't get what the user_agent field is about

Author

It's the HTTP User-Agent header sent on every outbound request to OpenAlex / Europe PMC / arXiv / publisher PDF endpoints.
Sources prefer (and OpenAlex requires) a string that identifies the caller so they can contact us.
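
For concreteness, attaching such a header to an outbound request could be sketched as follows with the standard library. The `build_request` helper, the contact address, and the URL are placeholders for illustration; the PR's own HTTP client may differ.

```python
import urllib.request

# Hypothetical value: the documented default plus a contact address.
USER_AGENT = "mmore-paper-discovery/1.0 (mailto:you@example.org)"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Build a request that carries an identifying User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# No network call is made here; we only inspect the prepared request.
req = build_request("https://example.org/api/search?q=llm")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```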

| `force_redownload` | `false` | Set `true` to ignore the on-disk cache and re-fetch every PDF |
| `pdf_proxy_prefix` | `null` | Optional EZproxy prefix for institutional access (see *Paywalled PDFs* below) |
| `user_agent` | `mmore-paper-discovery/1.0` | Be polite — customize per app |
| `arxiv_category_map` | `null` | Maps category-name substrings to arXiv category codes (e.g. `cs.LG`) |

Collaborator

please write a more concrete example of what the user should write in this field when it's not null
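
One possible shape for a non-null value, sketched under assumptions: keys are substrings of category names and each value is a single arXiv category code. The `resolve_arxiv_category` helper, the one-code-per-key layout, and the case-insensitive matching are hypothetical and only illustrate the substring lookup described in the table row.

```python
from typing import Dict, Optional

# Hypothetical map: category-name substring -> arXiv category code.
ARXIV_CATEGORY_MAP: Dict[str, str] = {
    "machine learning": "cs.LG",
    "vision": "cs.CV",
}


def resolve_arxiv_category(
    category_name: str, category_map: Optional[Dict[str, str]]
) -> Optional[str]:
    """Return the first code whose key is a substring of category_name."""
    if not category_map:
        return None  # null map: no arXiv category filter is applied
    lowered = category_name.lower()
    for substring, code in category_map.items():
        if substring.lower() in lowered:
            return code
    return None


print(resolve_arxiv_category("Machine Learning for Health", ARXIV_CATEGORY_MAP))
```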

Comment thread src/mmore/paper_discovery/pipeline.py Outdated
Comment on lines +91 to +93
def _enrich_with_pdf_text(self, papers: Iterable[Paper]) -> None:
    cfg = self.config
    papers = list(papers)

Collaborator

Suggested change
    -def _enrich_with_pdf_text(self, papers: Iterable[Paper]) -> None:
    -    cfg = self.config
    -    papers = list(papers)
    +def _enrich_with_pdf_text(self, papers: List[Paper]) -> None:
    +    cfg = self.config

}


def get_adapter(name: str, **kwargs) -> SourceAdapter:

Collaborator

add a docstring
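
A documented version might look like the sketch below. `DummyAdapter`, the contents of `REGISTRY`, and the wording of the error message are placeholders; only the `get_adapter(name, **kwargs)` shape and the `ValueError` on unknown sources come from the PR description.

```python
from typing import Any, Callable, Dict, List


class DummyAdapter:
    """Hypothetical stand-in for a real adapter such as the OpenAlex one."""

    name = "dummy"

    def __init__(self, user_agent: str = "mmore-paper-discovery/1.0") -> None:
        self.user_agent = user_agent

    def search(self, query: str, max_results: int = 10) -> List[dict]:
        return []


# Assumed registry shape: source name -> adapter class (or factory).
REGISTRY: Dict[str, Callable[..., Any]] = {"dummy": DummyAdapter}


def get_adapter(name: str, **kwargs: Any) -> Any:
    """Instantiate the source adapter registered under ``name``.

    Args:
        name: Key in ``REGISTRY`` (for example ``"dummy"`` in this sketch).
        **kwargs: Forwarded unchanged to the adapter constructor.

    Returns:
        A new adapter instance.

    Raises:
        ValueError: If ``name`` is not a registered source.
    """
    try:
        adapter_cls = REGISTRY[name]
    except KeyError:
        available = ", ".join(sorted(REGISTRY))
        raise ValueError(f"Unknown source {name!r}; available: {available}") from None
    return adapter_cls(**kwargs)


print(get_adapter("dummy", user_agent="my-app/0.1").user_agent)
```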



def load_synonyms(path: Union[str, Path]) -> List[SynonymEntry]:
    """Load synonyms from a JSON file.

Collaborator

specify the input properly

Comment thread src/mmore/paper_discovery/boolean.py Outdated
def load_synonyms(path: Union[str, Path]) -> List[SynonymEntry]:
    """Load synonyms from a JSON file.

    Expected format:

Collaborator

Suggested change
    -Expected format:
    +Returns:


def build_boolean_queries(
    synonyms: List[SynonymEntry],
    categories: Dict[str, List[str]],

Collaborator

be more specific about the expected format for categories

…ass A)

  - Adapters now explicitly inherit `SourceAdapter` (EPFLiGHT#20, EPFLiGHT#22, EPFLiGHT#23, EPFLiGHT#24)
  - `SourceAdapter` Protocol clarified: `name` + `search()` only;
    ctor kwargs documented as `get_adapter()` surface (EPFLiGHT#11, EPFLiGHT#12, EPFLiGHT#27, EPFLiGHT#29)
  - Public docstrings filled in (`PaperDiscoveryPipeline`, `run`,
    `load_synonyms`, `build_boolean_queries`, `get_adapter`, `extract_text`)
    (EPFLiGHT#17, EPFLiGHT#30, EPFLiGHT#31, EPFLiGHT#33, EPFLiGHT#34, EPFLiGHT#35, EPFLiGHT#36)
  - `_enrich_with_pdf_text` signature `Iterable[Paper]` -> `List[Paper]` (EPFLiGHT#32)
  - `boolean.by_word` lookup made case-insensitive (EPFLiGHT#3)
  - `config.py` / docs: clarified what `user_agent` is and gave a
    concrete example (EPFLiGHT#4, EPFLiGHT#5, EPFLiGHT#13)
  - Docs: uncommented Google Scholar install + sources-table row,
    added `user_agent` subsection, rewrote "User-managed UA" ->
    "Why we don't spoof the UA" (EPFLiGHT#1, EPFLiGHT#2, EPFLiGHT#6)
  - `examples/config.yaml`: commented `google_scholar` source with
    doc pointer; clarified `arxiv_category_map` purpose (EPFLiGHT#7, EPFLiGHT#8)
  - Drive-by: fixed latent `list[str](REGISTRY)` typo in `get_adapter`
    error path
2 participants