web-scraper

This project contains the implementation to scrap the linkedin profiles, store them to the database and retrieve the scraped data from the the database. The REST APIs are provided to retrieve the list of profiles or the details of any specific profile.

Technology Stack

We have used,

1. Python 3
2. Django REST Framework
3. sqlite3 Database
4. Google Chrome v.83.0.4103.61    
5. Google Chrome driver v.83.0.4103.61

Project Structure:

├── API.md: API documentation
├── Assumption.md: asumption made for tracxn scraper app
├── config.ini.example: example file for config.ini to store linkedin credentials
├── linkedin_scraper
│   ├── admin.py
│   ├── apps.py
│   ├── chromedriver_linux64
│   ├── fixtures : contains inital data to load
│   ├── __init__.py
│   ├── migrations: database migrations
│   ├── models.py: database models for linkedin scraper app
│   ├── serializers.py: serializers for the models
│   ├── tests.py: test cases for view
│   ├── urls.py: url endpoints of linkedin scraper app
│   ├── utilities.py: Utility functions used in views
│   └── views.py: These views are called by API endpoints
├── Makefile: initial commands
├── manage.py
├── README.md: documentation file
├── requirements.txt: requirements needs to be install
├── task.md:task description
├── tracxn_scraper
│   ├── admin.py
│   ├── apps.py
│   ├── fixtures: contains inital data to load
│   ├── __init__.py
│   ├── migrations: database migrations
│   ├── models.py: database models for tracxn scraper app
│   ├── serializers.py: serializers for the models
│   ├── tests.py: test cases for view
│   ├── urls.py: urls for tracxn scraper app
│   └── views.py: These views are called by API endpoints
└── webscraper
    ├── asgi.py
    ├── __init__.py
    ├── settings.py: settings file for the project.
    ├── urls.py: base urls for apps of the projects
    └── wsgi.py

Running Locally

First, clone the repository to your local machine:

git clone https://github.com/fulltostack/web-scraper

cd web-scraper

Install the requirements:

pip install -r requirements/dev.txt

Create the .env file to the root directory of the project. You can refer this example file-

.env.example

Create the config.ini file to the root directory of the project. You can refer this example file-

config.ini.example

####Install Redis on the system:

Install Redis:

sudo apt update -y
sudo apt install redis-server -y

Test redis installation:

redis-cli ping

You should see:

$ redis-cli ping
pong

also you can refer this link for redis installation.

Apply the database migrations:

python manage.py migrate

Load the initial data:

python manage.py loaddata companies

Create administrator/super user:

python manage.py createsuperuser 


Note: It will prompt to enter username, email and password one by one. Please remember the username and password,
it will be used to login admin area or to hit an API to scrap/refresh the linkedin profiles.

Finally, run the development server:

python manage.py runserver

The site will be available at 127.0.0.1:8000.

Now, open a new terminal window, Run Celery with in your project root where manage.py lives:

celery -A webscraper worker -l info

Linting:

make lint

APIs Details

See documentation here.

Testing:

python manage.py test

SAMPLE USAGE

from scrape_scripts.linkedin_profile import Person, login
from selenium import webdriver

driver = webdriver.Chrome()
login(driver, email, password)

Person("https://www.linkedin.com/in/andre-iguodala-65b48ab5", driver=driver)

output:

Andre Iguodala 
Professional Basketball Player (NBA), Venture Partner at Catalyst Fund, Jumia Board Member, Investor, Bestselling Author 
San Francisco Bay Area 
https://www.linkedin.com/in/andre-iguodala-65b48ab5

It's scrape the person name, location and linkedin profile url.

Assumptions:

Assumptions

Further improvements

Along with the linkedin, tracxn data can be scraped, stored and retrieved in similar fashion.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

web-scraper

Technology Stack

Project Structure:

Running Locally

.env.example

config.ini.example

Linting:

APIs Details

Testing:

SAMPLE USAGE

Assumptions:

Assumptions

Further improvements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
linkedin_scraper		linkedin_scraper
scrape_scripts		scrape_scripts
tracxn_scraper		tracxn_scraper
webscraper		webscraper
.editorignore		.editorignore
.env.example		.env.example
.flake8		.flake8
.gitignore		.gitignore
API.md		API.md
Assumption.md		Assumption.md
Makefile		Makefile
README.md		README.md
config.ini.example		config.ini.example
manage.py		manage.py
requirements.txt		requirements.txt
task.md		task.md

Folders and files

Latest commit

History

Repository files navigation

web-scraper

Technology Stack

Project Structure:

Running Locally

.env.example

config.ini.example

Linting:

APIs Details

Testing:

SAMPLE USAGE

Assumptions:

Assumptions

Further improvements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages