Web Scraping API
Please clone this Git repository:
git clone https://gitlab.com/miloszsobiczewski/webscraper.git/
Change directory to webscraper and run:
sudo docker-compose up
After the Docker image installation is complete, you can find the application running in your browser at the following URL:
http://0.0.0.0:8000/api/
The API interface allows you to select a few options (see the example request after this list):
- Site URL - the URL that will be the target of the web scraping
- Text indicator - indicates whether or not to scrape text and save it to the local file system
- Image indicator - indicates whether or not the site will be scanned for images, which are then saved to the local file system if possible
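Besides the browser form, a task could in principle be scheduled programmatically by posting the same options to the endpoint. The sketch below is only an illustration; the field names (site_url, text_indicator, image_indicator) are assumptions and may differ from the actual form parameters:

    # Minimal sketch of scheduling a scraping task over HTTP (assumed field names).
    import requests

    payload = {
        "site_url": "https://example.com",   # target of the web scraping
        "text_indicator": True,              # scrape and save text
        "image_indicator": True,             # scan the site for images
    }
    response = requests.post("http://0.0.0.0:8000/api/", data=payload)
    print(response.status_code, response.text)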
The standard Django SQLite database is used. The following information is stored for each scraping task:
- site URL
- status
- text indicator
- image indicator
- text file location (if saved)
- image file locations (if saved)
- schedule date
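For illustration, the fields above roughly map onto a Django model along the following lines; the field names and types here are assumptions, not the project's actual model definition:

    # Hypothetical sketch of the task model; the real definition may differ.
    from django.db import models

    class Task(models.Model):
        site_url = models.URLField()                      # target site URL
        status = models.CharField(max_length=32)          # e.g. scheduled / completed
        text_indicator = models.BooleanField(default=False)
        image_indicator = models.BooleanField(default=False)
        text_file = models.CharField(max_length=255, blank=True)   # text file location (if saved)
        image_files = models.TextField(blank=True)                  # image file locations (if saved)
        schedule_date = models.DateTimeField()            # when the scraping is scheduled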
All data regarding scheduled tasks can be found here:
http://0.0.0.0:8000/admin/api/task/
login: new
password: New_12345
Scraped files are stored in the ./static/api/TASK_ID directory.
The database mentioned above is also used to monitor scraping tasks. Additionally, a task status dashboard is available at the following URL:
http://0.0.0.0:8000/api/tasks/
The dashboard allows you to check the current status of each task and download the scraped data (text and images) for all completed tasks.
Unit tests were written for all dedicated methods used in the service. To run them, use the following command with the appropriate CONTAINER_ID:
sudo docker exec -it CONTAINER_ID python unittests.py
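As a rough idea of what such a test looks like, here is a small, self-contained sketch in the same unittest style; the site_is_available helper is defined here for illustration only and is not the project's actual code:

    # Illustrative unit test sketch; the helper below is an assumption, not project code.
    import unittest
    from unittest import mock

    import requests

    def site_is_available(url):
        # Assumed helper: True when the target URL responds successfully.
        try:
            return requests.get(url, timeout=10).ok
        except requests.RequestException:
            return False

    class SiteAvailabilityTest(unittest.TestCase):
        def test_unreachable_site_is_reported_as_unavailable(self):
            with mock.patch("requests.get", side_effect=requests.ConnectionError):
                self.assertFalse(site_is_available("http://example.invalid"))

    if __name__ == "__main__":
        unittest.main()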
- The provided site URL will be checked for response availability.
- Either the text or the image option needs to be selected (at least one of the two scraping options); a sketch of these two checks follows this list.
- Tested and working well on URLs like:
- Not very effective on the following ones:
- It is not a 100% REST API, though it could be transformed using the Django REST Framework, which I have not had the opportunity to use so far.
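The two validation rules mentioned above could look roughly like the sketch below; the function and parameter names are assumptions and this is not the code actually used by the service:

    # Illustrative sketch of the task validation rules (assumed names).
    import requests

    def validate_task(site_url, text_indicator, image_indicator):
        # At least one of the scraping options has to be selected.
        if not (text_indicator or image_indicator):
            raise ValueError("Select the text indicator, the image indicator, or both.")
        # The provided site URL is checked for response availability.
        try:
            response = requests.get(site_url, timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            raise ValueError(f"Site URL is not available: {exc}") from exc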