A text-crawling tool built on Scrapy for the scientific community, especially chemistry, materials science, biology, and environmental science.
- Python 3.8 or higher
- MongoDB server (local or remote)
- API keys for publisher platforms (where required)
```bash
git clone https://github.com/Laaery/scicrawler.git
cd scicrawler
pip install -r requirements.txt
```

- Prepare a list of DOIs to scrape, categorized by publisher, in separate CSV files within the `doi_list/` directory (a sketch for generating such a file follows the usage example below).
- Establish a MongoDB database to store the scraped data (a minimal connection check is also sketched below).
- Run the `run_spider.py` script. For example, to download data via the Elsevier API, use the following command:
```bash
python run_spider.py \
--spider spider_elsevier_api \
--domain api.elsevier.com \
--doi_file ./doi_list/your_doi.csv \
--publisher Elsevier \
    --api_key YOUR_API_KEY
```
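The expected CSV layout is not specified above, so here is a minimal sketch of preparing a per-publisher DOI file, assuming one DOI per row under a single `doi` header column; the file name, header, and example DOIs are illustrative, so check the repository for the exact format the spiders expect:

```python
import csv
from pathlib import Path

# Hypothetical DOIs for illustration; replace with your own.
dois = [
    "10.1016/j.example.2023.101234",
    "10.1002/example.202300567",
]

Path("doi_list").mkdir(exist_ok=True)

# Assumption: one DOI per row under a single "doi" header column.
with open("doi_list/elsevier_dois.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["doi"])
    writer.writerows([d] for d in dois)
```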
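Similarly, a quick way to confirm the MongoDB server is reachable before starting a crawl, assuming a local instance on the default port; the `scicrawler` database name is illustrative, not a name the tool requires:

```python
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

# Assumption: local MongoDB on the default port; adjust the URI as needed.
client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=5000)

try:
    client.admin.command("ping")  # cheap round-trip to confirm the server is up
    print("MongoDB is reachable.")
except ConnectionFailure:
    raise SystemExit("Cannot reach MongoDB; start the server or fix the URI.")

# Illustrative database name; the spiders may be configured with their own.
db = client["scicrawler"]
print(db.list_collection_names())
```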
Important: Users are solely responsible for complying with:

- Publisher terms of service
- Copyright laws
- API rate limits (a throttling sketch follows this list)
- `robots.txt` directives
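To stay within rate limits, Scrapy's standard throttling settings can help. The values below are illustrative and not prescribed by this project; where they belong (the project's `settings.py` or a spider's `custom_settings`) depends on the repository layout:

```python
# Illustrative Scrapy throttling settings (all standard Scrapy options).
DOWNLOAD_DELAY = 1.0                  # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # keep per-domain concurrency low
AUTOTHROTTLE_ENABLED = True           # adapt delays to observed server latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
ROBOTSTXT_OBEY = True                 # honor robots.txt directives
```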
Disclaimer: This tool is provided for research purposes only. The developers are not responsible for any misuse or violations of publisher policies committed by users.