
DocCluster

Clustering scraped TechCrunch articles with Spark

Data Collection Part

singletechcrunchpaper.py
- Python script that scrapes a single TechCrunch page / article and writes it to a MongoDB instance hosted on mLab.
techcrunch.py
- Finds the URLs of all the latest posts and passes each one to singletechcrunchpaper.py.
scrapyTechCrunch.sh
- Script for the crontab job, which runs exactly once a day.
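
The flow between the two Python scripts can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function name collect_latest and its three callback parameters are hypothetical, and skipping duplicate URLs is an assumption added for the example. In the real scripts the callbacks would correspond to the Scrapy-based URL finder, the single-page scraper, and a Pymongo insert.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical shape of one scraped article (field names are assumptions).
Article = Dict[str, str]


def collect_latest(
    find_urls: Callable[[], Iterable[str]],
    scrape_page: Callable[[str], Article],
    store: Callable[[Article], None],
) -> List[str]:
    """Scrape every newly found URL once and hand each result to `store`.

    Returns the URLs that were processed, in the order they were found.
    """
    seen = set()
    processed = []
    for url in find_urls():
        if url in seen:
            # Assumption: a listing page may repeat a link, so skip repeats.
            continue
        seen.add(url)
        store(scrape_page(url))
        processed.append(url)
    return processed
```

Passing the scraper and the database write as callbacks keeps the loop testable without network or database access; for a dry run, store can simply be the append method of a list.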

Data Read Part

SparkMongoConnector.scala
- Scala singleton object that connects to MongoDB and performs basic operations on the data.

Technology Used

Python libs
- Scrapy
- Pymongo
DB Used
- MongoDB
DB Connector
- Spark MongoDB Connector
Data Processing
- Apache Spark
OS scheduling
- Crontab

Files Needed to Run

application.conf file, which contains the MongoDB username, password, and connection link
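
As a rough illustration of how those three values usually combine, the sketch below builds a standard mongodb:// connection string from them. The function name build_mongo_uri, its parameters, and the host and database values in the usage note are hypothetical; the actual keys and layout of application.conf are not documented here. Percent-escaping the username and password is needed whenever they contain reserved characters such as @ or /.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username: str, password: str, host: str, database: str) -> str:
    """Return a mongodb:// URI with the credentials percent-escaped.

    `host` may include a port, e.g. "example-host:27017" (placeholder value).
    """
    return "mongodb://{user}:{pwd}@{host}/{db}".format(
        user=quote_plus(username),  # escape reserved characters in the username
        pwd=quote_plus(password),   # escape reserved characters in the password
        host=host,
        db=database,
    )
```

For example, build_mongo_uri("user", "p@ss/word", "example-host:27017", "techcrunch") yields a URI in which the password appears as p%40ss%2Fword.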