Feature/ paper discovery module chitchat#319
Conversation
Introduce the mmore.paper_discovery package with initial components for a paper discovery pipeline:
- schema: dataclasses for CategoryQuery, Paper, and SynonymEntry to normalize data shapes.
- boolean: functions to load synonym tables and build category-level boolean queries (load_synonyms, _or_group, build_boolean_queries).
- logging_config: simple logger configuration for PaperDiscovery.
- sources/base: SourceAdapter protocol defining the search interface and guidance that adapters must not raise on network errors.
- Stubs added for other modules (config, pdf, pipeline, and individual source modules) to be implemented later.

These changes set up the core types and query-building logic used by downstream source adapters and pipeline stages.
- currently not being used
- Introduce src/mmore/paper_discovery/sources/__init__.py, which centralizes source adapter imports and registers them in REGISTRY (openalex, europepmc, arxiv; google_scholar is commented).
- Add typed get_adapter(name, **kwargs) to instantiate a SourceAdapter or raise a ValueError for unknown sources. This provides a single entrypoint for resolving source adapters.
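A minimal sketch of the registry pattern described in this commit, assuming adapters are plain classes keyed by name. `DummyAdapter`, the `search()` signature, and the error wording are hypothetical and only illustrate the lookup; the real adapters live in the individual source modules.

```python
from typing import Dict, List, Protocol, Type


class SourceAdapter(Protocol):
    """Assumed shape of the protocol: a name plus a search() method."""

    name: str

    def search(self, query: str, max_results: int = 10) -> List[dict]: ...


class DummyAdapter:
    """Hypothetical stand-in for a real adapter (e.g. the OpenAlex one)."""

    name = "dummy"

    def __init__(self, user_agent: str = "example/0.1") -> None:
        self.user_agent = user_agent

    def search(self, query: str, max_results: int = 10) -> List[dict]:
        return []


# Name -> adapter class. The real REGISTRY holds openalex, europepmc, arxiv.
REGISTRY: Dict[str, Type] = {"dummy": DummyAdapter}


def get_adapter(name: str, **kwargs) -> SourceAdapter:
    """Instantiate the adapter registered under `name`, or raise ValueError."""
    try:
        cls = REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        raise ValueError(f"Unknown source {name!r}; known sources: {known}") from None
    return cls(**kwargs)
```

Constructor keyword arguments pass straight through, so `get_adapter("dummy", user_agent="my-app/0.1")` returns a configured instance.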
- Introduce a new `paper-discovery` subcommand and entrypoint to run the Paper Discovery pipeline.
- Add src/mmore/run_paper_discovery.py, which loads a PaperDiscoveryConfig, constructs and runs PaperDiscoveryPipeline (with timing, dotenv, and profiling support), and wire it into the CLI (src/mmore/cli.py).
- Also remove a stale tmp_pdf file.
- boolean: skip categories with no resolved synonyms instead of emitting an empty boolean string
- schema: include search_category in Paper.to_dict() output
- openalex: widen exception catch to never raise from search()
- tests/conftest: make langchain_milvus import optional so the suite is collectible on partial installs
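The "never raise from search()" rule can be sketched as below. This is an illustration under assumptions, not the OpenAlex adapter itself: `SafeAdapter` and the injected `fetch` callable are hypothetical, and the logger name is borrowed from the logging_config description.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("PaperDiscovery")


class SafeAdapter:
    """Hypothetical adapter whose search() swallows every failure."""

    name = "example"

    def __init__(self, fetch: Callable[[str], List[dict]]) -> None:
        # `fetch` stands in for the real HTTP call so the sketch runs offline.
        self._fetch = fetch

    def search(self, query: str) -> List[dict]:
        try:
            return self._fetch(query)
        except Exception as exc:  # deliberately broad: one bad source must not stop the run
            logger.warning("%s search failed for %r: %s", self.name, query, exc)
            return []
```

Returning an empty list on failure lets the pipeline continue with the remaining sources and still write partial results.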
Add robust PDF fetching with paywall detection and optional EZproxy proxying, plus progress/counting and partial-result handling.
- Add pdf_proxy_prefix config option and docs.
- Replace simple download return value with DownloadResult (path, paywalled, errored, status), detect common paywall statuses, and use a polite default User-Agent.
- Introduce _proxify helper to wrap PDF URLs via an EZproxy prefix and use it for initial/follow-up fetches.
- Improve logging and error handling for network failures; surface paywalled PDFs separately from errors.
- In the pipeline, add tqdm progress (with a lightweight fallback), track succeeded/paywalled/errored/skipped counts, log a summary and a tip when paywalled PDFs are encountered, and ensure partial results are written on KeyboardInterrupt.
- Add arXiv client improvements: request timeout, 429 backoff handling, and related logging.
- Update .gitignore to include local paper_discovery artifacts (pdf cache and papers.json) and normalize example output paths.
fabnemEPFL
left a comment
pretty cool PR with this new module. several things to change to integrate it properly into mmore
    <!-- ```bash
    pip install scholarly
    ```

    `scholarly` is **not** in the `paper_discovery` extra by design — it is
    captcha-prone. Install only if needed. -->
why have this commented?
It is captcha-prone and a bit unreliable. I'll uncomment the install block and the Sources row, and re-document it as an opt-in source.
    | OpenAlex | none | ~1 req/s; inverted-index abstracts are rebuilt automatically |
    | Europe PMC | none | ~1 req/s; uses `resultType=core` |
    | arXiv | none | **1 req / 3 s** (strict, ToS); query is simplified to top terms; 30 s back-off on 429 |
    <!-- | Google Scholar | none | Opt-in (`scholarly`), captcha-prone, best-effort | -->
why have this commented?
Will uncomment in the next push
    ```json
    [
      {"word": "Foundation model",
       "synonyms": ["LLM", "large language model", "GPT"]},
is this case-sensitive?
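One way a case-insensitive lookup over the synonym table could work is sketched below, assuming entries shaped like the JSON excerpt above. `build_by_word` and `lookup` are hypothetical helper names, not functions from the PR.

```python
from typing import Dict, List

entries = [
    {"word": "Foundation model",
     "synonyms": ["LLM", "large language model", "GPT"]},
]


def build_by_word(items: List[dict]) -> Dict[str, List[str]]:
    """Index synonyms by a case-folded, whitespace-trimmed key."""
    return {item["word"].strip().casefold(): item["synonyms"] for item in items}


def lookup(by_word: Dict[str, List[str]], word: str) -> List[str]:
    """Normalize the query the same way as the keys before looking it up."""
    return by_word.get(word.strip().casefold(), [])


by_word = build_by_word(entries)
print(lookup(by_word, "foundation MODEL"))
```

Normalizing both the stored key and the query with `casefold()` keeps the original spelling in the data while making "Foundation model" and "foundation model" resolve to the same entry.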
    | `pdf_dir` | `./pdf_cache` | Reused across runs (see *PDF caching* below) |
    | `force_redownload` | `false` | Set `true` to ignore the on-disk cache and re-fetch every PDF |
    | `pdf_proxy_prefix` | `null` | Optional EZproxy prefix for institutional access (see *Paywalled PDFs* below) |
    | `user_agent` | `mmore-paper-discovery/1.0` | Be polite — customize per app |
I don't get what the user_agent field is about
It's the HTTP User-Agent header sent on every outbound request to OpenAlex / Europe PMC / arXiv / publisher PDF endpoints.
Sources prefer (and OpenAlex requires) a string that identifies the caller so they can contact us.
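To make this concrete, here is a small standard-library sketch of attaching the header to a request. The `build_request` helper, the agent string, and the contact address are placeholders chosen for illustration; no request is actually sent.

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying the configured User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request(
    "https://api.openalex.org/works?search=llm",
    "my-lab-paper-discovery/0.1 (mailto:me@example.org)",  # placeholder identity
)
# urllib normalizes header names, so the key is stored as "User-agent".
print(req.get_header("User-agent"))
```

A string that names the application and a contact address is the usual convention for identifying a client to an API operator.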
    | `force_redownload` | `false` | Set `true` to ignore the on-disk cache and re-fetch every PDF |
    | `pdf_proxy_prefix` | `null` | Optional EZproxy prefix for institutional access (see *Paywalled PDFs* below) |
    | `user_agent` | `mmore-paper-discovery/1.0` | Be polite — customize per app |
    | `arxiv_category_map` | `null` | Maps category-name substrings to arXiv category codes (e.g. `cs.LG`) |
please write a more concrete example of what the user should write in this field when it's not null
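A possible concrete value is sketched below as a Python dict, assuming one arXiv code per substring and first-match-wins, case-insensitive matching. Both the value shape and the `resolve_arxiv_category` helper are assumptions; the table row only says the field maps category-name substrings to arXiv category codes.

```python
from typing import Dict, Optional

# Hypothetical non-null value: substring of the category name -> arXiv code.
arxiv_category_map: Dict[str, str] = {
    "machine learning": "cs.LG",
    "vision": "cs.CV",
    "language": "cs.CL",
}


def resolve_arxiv_category(
    category_name: str, mapping: Optional[Dict[str, str]]
) -> Optional[str]:
    """Return the code of the first substring found in the category name."""
    if not mapping:
        return None  # null config: no arXiv category restriction
    lowered = category_name.lower()
    for substring, code in mapping.items():
        if substring.lower() in lowered:
            return code
    return None


print(resolve_arxiv_category("Machine Learning Methods", arxiv_category_map))
```

With this mapping, a category named "Machine Learning Methods" would restrict its arXiv search to `cs.LG`, while a category that matches no substring stays unrestricted.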
    def _enrich_with_pdf_text(self, papers: Iterable[Paper]) -> None:
        cfg = self.config
        papers = list(papers)
Suggested change:
    - def _enrich_with_pdf_text(self, papers: Iterable[Paper]) -> None:
    -     cfg = self.config
    -     papers = list(papers)
    + def _enrich_with_pdf_text(self, papers: List[Paper]) -> None:
    +     cfg = self.config
    }


    def get_adapter(name: str, **kwargs) -> SourceAdapter:

    def load_synonyms(path: Union[str, Path]) -> List[SynonymEntry]:
        """Load synonyms from a JSON file.
specify the input properly
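A docstring that pins down the input could look like the sketch below. The body is a guess at a minimal implementation, and the two-field `SynonymEntry` is assumed from the JSON example in the docs rather than taken from the actual schema module.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Union


@dataclass
class SynonymEntry:
    """Assumed shape: a canonical word plus its synonyms."""

    word: str
    synonyms: List[str] = field(default_factory=list)


def load_synonyms(path: Union[str, Path]) -> List[SynonymEntry]:
    """Load synonyms from a JSON file.

    Args:
        path: Path to a UTF-8 JSON file containing a list of objects, each
            with a "word" string and a "synonyms" list of strings.

    Returns:
        One SynonymEntry per object, in file order.
    """
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return [
        SynonymEntry(word=item["word"], synonyms=list(item.get("synonyms", [])))
        for item in raw
    ]
```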
    def load_synonyms(path: Union[str, Path]) -> List[SynonymEntry]:
        """Load synonyms from a JSON file.

        Expected format:
Suggested change:
    -     Expected format:
    +     Returns:
    def build_boolean_queries(
        synonyms: List[SynonymEntry],
        categories: Dict[str, List[str]],
be more specific about the expected format for categories
…ass A)
- Adapters now explicitly inherit `SourceAdapter` (EPFLiGHT#20, EPFLiGHT#22, EPFLiGHT#23, EPFLiGHT#24)
- `SourceAdapter` Protocol clarified: `name` + `search()` only; ctor kwargs documented as `get_adapter()` surface (EPFLiGHT#11, EPFLiGHT#12, EPFLiGHT#27, EPFLiGHT#29)
- Public docstrings filled in (`PaperDiscoveryPipeline`, `run`, `load_synonyms`, `build_boolean_queries`, `get_adapter`, `extract_text`) (EPFLiGHT#17, EPFLiGHT#30, EPFLiGHT#31, EPFLiGHT#33, EPFLiGHT#34, EPFLiGHT#35, EPFLiGHT#36)
- `_enrich_with_pdf_text` signature `Iterable[Paper]` -> `List[Paper]` (EPFLiGHT#32)
- `boolean.by_word` lookup made case-insensitive (EPFLiGHT#3)
- `config.py` / docs: clarified what `user_agent` is and gave a concrete example (EPFLiGHT#4, EPFLiGHT#5, EPFLiGHT#13)
- Docs: uncommented Google Scholar install + sources-table row, added `user_agent` subsection, rewrote "User-managed UA" -> "Why we don't spoof the UA" (EPFLiGHT#1, EPFLiGHT#2, EPFLiGHT#6)
- `examples/config.yaml`: commented `google_scholar` source with doc pointer; clarified `arxiv_category_map` purpose (EPFLiGHT#7, EPFLiGHT#8)
- Drive-by: fixed latent `list[str](REGISTRY)` typo in `get_adapter` error path
…thub.com/fiifidawson/mmore into feature/-paper-discovery-module-chitchat
No description provided.