Feature/ paper discovery module chitchat by fiifidawson · Pull Request #319 · EPFLiGHT/mmore

fiifidawson · 2026-06-05T07:20:39Z

No description provided.

Introduce the mmore.paper_discovery package with initial components for a paper discovery pipeline:

- schema: dataclasses for CategoryQuery, Paper, and SynonymEntry to normalize data shapes.
- boolean: functions to load synonym tables and build category-level boolean queries (load_synonyms, _or_group, build_boolean_queries).
- logging_config: simple logger configuration for PaperDiscovery.
- sources/base: SourceAdapter protocol defining the search interface and guidance that adapters must not raise on network errors.
- Stubs added for other modules (config, pdf, pipeline, and individual source modules) to be implemented later.

These changes set up the core types and query-building logic used by downstream source adapters and pipeline stages.
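The adapter contract described in this commit (a `search` interface that must not raise on network errors) could look roughly like the following. This is a minimal sketch, not the PR's code: the `Paper` fields, the `search` signature, and the `FlakyAdapter` class are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Protocol, runtime_checkable


@dataclass
class Paper:
    # Hypothetical minimal shape; the real dataclass likely has more fields.
    title: str
    source: str


@runtime_checkable
class SourceAdapter(Protocol):
    name: str

    def search(self, query: str, max_results: int = 10) -> List[Paper]:
        """Return matching papers; return [] instead of raising on network errors."""
        ...


class FlakyAdapter:
    """Illustrative adapter whose backend always fails."""

    name = "flaky"

    def _fetch(self, query: str, max_results: int) -> List[Paper]:
        raise ConnectionError("simulated network failure")

    def search(self, query: str, max_results: int = 10) -> List[Paper]:
        try:
            return self._fetch(query, max_results)
        except Exception:
            # Contract: swallow the failure and report no results.
            return []
```

With this shape, a pipeline can loop over adapters without wrapping each call in its own `try`/`except`; a failing source simply contributes no papers.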

- currently not being used

- Introduce src/mmore/paper_discovery/sources/__init__.py which centralizes source adapter imports and registers them in REGISTRY (openalex, europepmc, arxiv; google_scholar is commented).
- Adds typed get_adapter(name, **kwargs) to instantiate a SourceAdapter or raise a ValueError for unknown sources. Provides a single entrypoint for resolving source adapters.
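A registry plus `get_adapter` along the lines described here might be sketched as follows. The `OpenAlexAdapter` stand-in, its constructor argument, and the error-message wording are assumptions; only the `REGISTRY` name, the `get_adapter(name, **kwargs)` signature, and the `ValueError` behaviour come from the commit message.

```python
from typing import Callable, Dict, List


class OpenAlexAdapter:
    # Stand-in class for illustration; the real adapter lives in its own module.
    name = "openalex"

    def __init__(self, user_agent: str = "mmore-paper-discovery/1.0") -> None:
        self.user_agent = user_agent

    def search(self, query: str) -> List[dict]:
        return []


REGISTRY: Dict[str, Callable[..., object]] = {
    "openalex": OpenAlexAdapter,
}


def get_adapter(name: str, **kwargs):
    """Instantiate the source adapter registered under ``name``.

    Args:
        name: Registry key such as ``"openalex"``.
        **kwargs: Forwarded unchanged to the adapter constructor.

    Raises:
        ValueError: If ``name`` is not a registered source.
    """
    try:
        factory = REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        raise ValueError(f"Unknown source {name!r}; known sources: {known}") from None
    return factory(**kwargs)
```

Listing the known keys in the error message makes a typo in a config file easy to diagnose.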

- Introduce a new `paper-discovery` subcommand and entrypoint to run the Paper Discovery pipeline.
- Adds src/mmore/run_paper_discovery.py which loads a PaperDiscoveryConfig, constructs and runs PaperDiscoveryPipeline (with timing, dotenv, and profiling support), and wires it into the CLI (src/mmore/cli.py).
- Also removes a stale tmp_pdf file.

- boolean: skip categories with no resolved synonyms instead of emitting an empty boolean string
- schema: include search_category in Paper.to_dict() output
- openalex: widen exception catch to never raise from search()
- tests/conftest: make langchain_milvus import optional so the suite is collectible on partial installs

Add robust PDF fetching with paywall detection and optional EZproxy proxying, plus progress/counting and partial-result handling.

- Add pdf_proxy_prefix config option and docs.
- Replace simple download return value with DownloadResult (path, paywalled, errored, status), detect common paywall statuses, and use a polite default User-Agent.
- Introduce _proxify helper to wrap PDF URLs via an EZproxy prefix and use it for initial/follow-up fetches.
- Improve logging and error handling for network failures; surface paywalled PDFs separately from errors.
- In the pipeline, add tqdm progress (with a lightweight fallback), track succeeded/paywalled/errored/skipped counts, log a summary and a tip when paywalled PDFs are encountered, and ensure partial results are written on KeyboardInterrupt.
- Add arXiv client improvements: request timeout, 429 backoff handling, and related logging.
- Update .gitignore to include local paper_discovery artifacts (pdf cache and papers.json) and normalize example output paths.

fabnemEPFL

pretty cool PR with this new module. several things to change to integrate it properly into mmore

fabnemEPFL · 2026-06-13T17:50:13Z

+<!-- ```bash
+pip install scholarly
+```
+
+`scholarly` is **not** in the `paper_discovery` extra by design — it is
+captcha-prone. Install only if needed. -->


why have this commented?

It is captcha-prone and a bit unreliable. I'll uncomment the install block and the Sources row, and re-document it as an opt-in source.

fabnemEPFL · 2026-06-13T17:50:44Z

+| OpenAlex       | none | ~1 req/s; inverted-index abstracts are rebuilt automatically |
+| Europe PMC     | none | ~1 req/s; uses `resultType=core` |
+| arXiv          | none | **1 req / 3 s** (strict, ToS); query is simplified to top terms; 30 s back-off on 429 |
+<!-- | Google Scholar | none | Opt-in (`scholarly`), captcha-prone, best-effort | -->


why have this commented?

Will uncomment in the next push
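The OpenAlex row in the table above mentions that inverted-index abstracts are rebuilt automatically. OpenAlex returns abstracts as a mapping from each word to the positions where it occurs, so rebuilding means sorting by position. A minimal sketch, with `rebuild_abstract` as a hypothetical helper name rather than the PR's function:

```python
from typing import Dict, List, Optional


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> str:
    """Turn a word -> positions mapping back into running text."""
    if not inverted_index:
        return ""
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    # Sorting the (position, word) pairs restores the original word order.
    return " ".join(word for _, word in sorted(positioned))
```

For example, `{"Foundation": [0], "models": [1, 4], "are": [2], "large": [3]}` becomes `"Foundation models are large models"`.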

fabnemEPFL · 2026-06-13T17:53:22Z

+```json
+[
+  {"word": "Foundation model",
+   "synonyms": ["LLM", "large language model", "GPT"]},


is this case-sensitive?
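One way to make the word lookup independent of case is to index entries by a casefolded key. The sketch below assumes the JSON shape shown in the hunk above; `synonyms_for` is a hypothetical helper, not the PR's API.

```python
from typing import Dict, List

entries = [
    {"word": "Foundation model", "synonyms": ["LLM", "large language model", "GPT"]},
]

# Index by casefolded word so lookups ignore case.
by_word: Dict[str, List[str]] = {
    entry["word"].casefold(): entry["synonyms"] for entry in entries
}


def synonyms_for(word: str) -> List[str]:
    """Return the synonyms for ``word`` regardless of case, or [] if unknown."""
    return by_word.get(word.strip().casefold(), [])
```

With this index, `synonyms_for("foundation MODEL")` and `synonyms_for("Foundation model")` resolve to the same entry.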

fabnemEPFL · 2026-06-13T17:56:53Z

+| `pdf_dir` | `./pdf_cache` | Reused across runs (see *PDF caching* below) |
+| `force_redownload` | `false` | Set `true` to ignore the on-disk cache and re-fetch every PDF |
+| `pdf_proxy_prefix` | `null` | Optional EZproxy prefix for institutional access (see *Paywalled PDFs* below) |
+| `user_agent` | `mmore-paper-discovery/1.0` | Be polite — customize per app |


I don't get what the user_agent field is about

It's the HTTP User-Agent header sent on every outbound request to OpenAlex / Europe PMC / arXiv / publisher PDF endpoints.
Sources prefer (and OpenAlex requires) a string that identifies the caller so they can contact us.
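To make this concrete, here is a sketch of how a configured `user_agent` value could end up on an outbound request, using only the standard library. The `build_request` helper and the customized string with a contact address are assumptions for illustration; no request is actually sent.

```python
import urllib.request

DEFAULT_USER_AGENT = "mmore-paper-discovery/1.0"


def build_request(url: str, user_agent: str = DEFAULT_USER_AGENT) -> urllib.request.Request:
    """Create a request object carrying the configured User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical customized value that identifies the caller and a contact address.
req = build_request(
    "https://api.openalex.org/works?search=foundation+model",
    user_agent="my-lab-crawler/0.1 (mailto:you@example.org)",
)
```

Note that `urllib` normalizes header names, so the stored key is `User-agent` when read back with `req.get_header`.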

fabnemEPFL · 2026-06-13T17:57:31Z

+| `force_redownload` | `false` | Set `true` to ignore the on-disk cache and re-fetch every PDF |
+| `pdf_proxy_prefix` | `null` | Optional EZproxy prefix for institutional access (see *Paywalled PDFs* below) |
+| `user_agent` | `mmore-paper-discovery/1.0` | Be polite — customize per app |
+| `arxiv_category_map` | `null` | Maps category-name substrings to arXiv category codes (e.g. `cs.LG`) |


please write a more concrete example of what the user should write in this field when it's not null
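A concrete non-null value might look like the mapping below. This is a guess at the shape based on the table description ("maps category-name substrings to arXiv category codes"): whether the real field takes a list of codes or a single code per key is not stated in the PR, and `arxiv_codes_for` is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Hypothetical non-null value: substring of a category name -> arXiv codes.
arxiv_category_map: Dict[str, List[str]] = {
    "machine learning": ["cs.LG", "stat.ML"],
    "language": ["cs.CL"],
}


def arxiv_codes_for(category: str, mapping: Optional[Dict[str, List[str]]]) -> List[str]:
    """Collect codes for every map key that occurs inside the category name."""
    if not mapping:
        return []
    name = category.casefold()
    codes: List[str] = []
    for substring, mapped in mapping.items():
        if substring.casefold() in name:
            # Preserve order and avoid duplicates across overlapping keys.
            codes.extend(code for code in mapped if code not in codes)
    return codes
```

Under these assumptions, a category named "Machine Learning for Language" would be searched in `cs.LG`, `stat.ML`, and `cs.CL`.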

fabnemEPFL · 2026-06-14T07:41:29Z

+    def _enrich_with_pdf_text(self, papers: Iterable[Paper]) -> None:
+        cfg = self.config
+        papers = list(papers)


Suggested change

-    def _enrich_with_pdf_text(self, papers: Iterable[Paper]) -> None:
-        cfg = self.config
-        papers = list(papers)
+    def _enrich_with_pdf_text(self, papers: List[Paper]) -> None:
+        cfg = self.config
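The difference between the two signatures matters when the function needs the length or more than one pass. A small sketch of the two variants, using plain strings instead of `Paper` objects; the function names are illustrative only.

```python
from typing import Iterable, List


def count_then_process(items: Iterable[str]) -> List[str]:
    """Accepts any iterable, so it must materialize before calling len()."""
    items = list(items)  # the defensive copy the original code made
    total = len(items)
    return [f"{i + 1}/{total} {item}" for i, item in enumerate(items)]


def process_list(items: List[str]) -> List[str]:
    """With a List[...] parameter, len() and re-iteration are already safe."""
    total = len(items)
    return [f"{i + 1}/{total} {item}" for i, item in enumerate(items)]
```

Typing the parameter as `List[Paper]` moves the responsibility for materializing to the caller and removes the extra copy.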

fabnemEPFL · 2026-06-14T07:46:29Z

+}
+
+
+def get_adapter(name: str, **kwargs) -> SourceAdapter:


add a docstring

fabnemEPFL · 2026-06-14T07:47:53Z

+
+
+def load_synonyms(path: Union[str, Path]) -> List[SynonymEntry]:
+    """Load synonyms from a JSON file.


specify the input properly

fabnemEPFL · 2026-06-14T07:48:04Z

+def load_synonyms(path: Union[str, Path]) -> List[SynonymEntry]:
+    """Load synonyms from a JSON file.
+
+    Expected format:


Suggested change

-    Expected format:
+    Returns:
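Taken together, the two docstring comments above ask for a documented input and a `Returns:` section. One possible shape is sketched below, assuming the JSON layout from the docs hunk (`"word"` plus `"synonyms"`); the `SynonymEntry` fields and the parsing details are assumptions, not the PR's implementation.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Union


@dataclass
class SynonymEntry:
    # Assumed fields, mirroring the JSON keys shown in the docs.
    word: str
    synonyms: List[str] = field(default_factory=list)


def load_synonyms(path: Union[str, Path]) -> List[SynonymEntry]:
    """Load synonyms from a JSON file.

    Args:
        path: Path to a UTF-8 JSON file containing a list of objects, each
            with a ``"word"`` string and a ``"synonyms"`` list of strings.

    Returns:
        One ``SynonymEntry`` per object, in file order.
    """
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return [SynonymEntry(item["word"], list(item.get("synonyms", []))) for item in raw]
```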

fabnemEPFL · 2026-06-14T07:49:36Z

+
+def build_boolean_queries(
+    synonyms: List[SynonymEntry],
+    categories: Dict[str, List[str]],


be more specific about the expected format for categories
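To illustrate what a more specific description of `categories` could say, here is a simplified sketch. Several things are assumptions: that `categories` maps a category name to the words it covers, that OR groups within a category are joined with AND, and that the result is a plain dict (the real function may return `CategoryQuery` objects instead).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SynonymEntry:
    word: str
    synonyms: List[str] = field(default_factory=list)


def _or_group(terms: List[str]) -> str:
    """Quote each term and join with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_boolean_queries(
    synonyms: List[SynonymEntry],
    categories: Dict[str, List[str]],
) -> Dict[str, str]:
    """Build one boolean string per category.

    ``categories`` maps a category name to the words it covers, e.g.
    ``{"AI": ["Foundation model"]}``; each word should match a
    ``SynonymEntry.word`` (compared case-insensitively).
    """
    by_word = {entry.word.casefold(): entry for entry in synonyms}
    queries: Dict[str, str] = {}
    for category, words in categories.items():
        groups = []
        for word in words:
            entry = by_word.get(word.casefold())
            if entry is not None:
                groups.append(_or_group([entry.word, *entry.synonyms]))
        if groups:  # skip categories with no resolved synonyms
            # Assumption: groups within a category are AND-ed together.
            queries[category] = " AND ".join(groups)
    return queries
```

A category whose words resolve to no synonym entry is omitted from the result rather than mapped to an empty string.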

…ass A)

- Adapters now explicitly inherit `SourceAdapter` (EPFLiGHT#20, EPFLiGHT#22, EPFLiGHT#23, EPFLiGHT#24)
- `SourceAdapter` Protocol clarified: `name` + `search()` only; ctor kwargs documented as `get_adapter()` surface (EPFLiGHT#11, EPFLiGHT#12, EPFLiGHT#27, EPFLiGHT#29)
- Public docstrings filled in (`PaperDiscoveryPipeline`, `run`, `load_synonyms`, `build_boolean_queries`, `get_adapter`, `extract_text`) (EPFLiGHT#17, EPFLiGHT#30, EPFLiGHT#31, EPFLiGHT#33, EPFLiGHT#34, EPFLiGHT#35, EPFLiGHT#36)
- `_enrich_with_pdf_text` signature `Iterable[Paper]` -> `List[Paper]` (EPFLiGHT#32)
- `boolean.by_word` lookup made case-insensitive (EPFLiGHT#3)
- `config.py` / docs: clarified what `user_agent` is and gave a concrete example (EPFLiGHT#4, EPFLiGHT#5, EPFLiGHT#13)
- Docs: uncommented Google Scholar install + sources-table row, added `user_agent` subsection, rewrote "User-managed UA" -> "Why we don't spoof the UA" (EPFLiGHT#1, EPFLiGHT#2, EPFLiGHT#6)
- `examples/config.yaml`: commented `google_scholar` source with doc pointer; clarified `arxiv_category_map` purpose (EPFLiGHT#7, EPFLiGHT#8)
- Drive-by: fixed latent `list[str](REGISTRY)` typo in `get_adapter` error path

…thub.com/fiifidawson/mmore into feature/-paper-discovery-module-chitchat

fiifidawson added 20 commits May 22, 2026 05:56

docs: chitchat modules

a379d0f

docs: chitchat module implementation plan

e890f9c

Update .gitignore

1120f78

chore: openalex-api-integration

4f52420

chore: europepmc-api-integration

fc50836

chore: arxiv-api-integration

d460622

chore: google-scholary-api-integration-niu

0061a78

- currently not being used

chore: pdf-extraction-integration

8d73495

chore: config-pipeline-scripts

7937c84

chore: initial module

645d1ad

feat: paper-discovery integration with pyproject.toml

5a90667

feat(paper-discovery): PDF cache reuse with force_redownload override

7cfe002

docs: paper-discovery-module

2fd4e86

Merge branch 'v2' into feature/-paper-discovery-module-chitchat

69cdd01

Merge branch 'v2' into feature/-paper-discovery-module-chitchat

c2ef096

fabnemEPFL requested changes Jun 14, 2026


fiifidawson added 2 commits June 26, 2026 07:15

Merge branch 'feature/-paper-discovery-module-chitchat' of https://gi…

030ab07

…thub.com/fiifidawson/mmore into feature/-paper-discovery-module-chitchat



