DocCluster

Clustering scraped TechCrunch articles with Spark

Data Collecting Part

  1. singletechcrunchpaper.py

    • Python script that scrapes a single TechCrunch page or article and writes it to a MongoDB database hosted on mLab.
  2. techcrunch.py

    • Finds the URLs of all the latest posts and passes each one to singletechcrunchpaper.py.
  3. scrapyTechCrunch.sh

    • Shell script for the crontab job; runs exactly once every day.
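
As a rough illustration of the single-article step above, the sketch below turns one page of HTML into a document and upserts it into a MongoDB collection. It is not the code from singletechcrunchpaper.py: the names `TitleAndTextParser`, `article_to_document`, and `save_article`, the field names, and the choice of the standard-library HTML parser are all assumptions made for this example. `save_article` expects a pymongo-style collection object with an `update_one` method.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._in_paragraph = False
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph:
            # Text inside nested tags (links, emphasis) is appended too.
            self.paragraphs[-1] += data


def article_to_document(url, html):
    """Build the dictionary that would be stored for one article (hypothetical schema)."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    body = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {
        "url": url,
        "title": parser.title.strip(),
        "body": body,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }


def save_article(collection, document):
    """Upsert keyed on the URL so a repeated daily run does not duplicate an article."""
    collection.update_one({"url": document["url"]}, {"$set": document}, upsert=True)
```

Keying the upsert on the URL is a design choice of this sketch: because the job runs every day and the list of latest posts overlaps between runs, an insert-only approach would store the same article more than once.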

Data Read Part

  1. SparkMongoConnector.scala
    • Scala singleton object that connects to MongoDB and performs basic operations on the data.

Technology Used

  1. Python libs

  2. DB Used

  3. DB Connector

  4. Data Processing

  5. OS scheduling

Files Needed to Run

  1. application.conf file, which contains the MongoDB username, password, and connection link
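
The exact keys in application.conf are not documented here, so the following is only a sketch of how a username, password, and host could be combined into a MongoDB connection string. The function name `build_mongo_uri`, its parameters, and the example values are hypothetical. The one firm point is that the username and password should be percent-encoded, since characters such as `@` or `/` would otherwise break the URI.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host, database):
    """Assemble a mongodb:// URI, percent-encoding the credentials."""
    return "mongodb://{user}:{pwd}@{host}/{db}".format(
        user=quote_plus(username),
        pwd=quote_plus(password),
        host=host,
        db=database,
    )


if __name__ == "__main__":
    # Placeholder values only; real credentials belong in application.conf,
    # which should stay out of version control.
    print(build_mongo_uri("user", "p@ss/word", "example.com:27017", "techcrunch"))
```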