A simple tool that collects massive amounts of links and text data from the internet.
TextHarvester is an easy-to-use tool that crawls the Internet to collect URLs and then downloads the content of the collected pages into a text file. It can be used to gather large amounts of text efficiently for general-purpose NLP.
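To give a rough idea of what collecting links from a page involves, here is a minimal standard-library sketch. It is not TextHarvester's own code or API: `LinkCollector` and `extract_links` are hypothetical names used only for illustration, and the sample HTML is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects absolute http(s) URLs from <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop "#fragment" parts.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_web_link = absolute.startswith(("http://", "https://"))
                if is_web_link and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique http(s) links found in html, in order of appearance."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample page: one relative link, one link with a fragment, one non-web link.
sample = (
    '<a href="/about">About</a> '
    '<a href="https://example.org/x#top">X</a> '
    '<a href="mailto:a@b.c">Mail</a>'
)
print(extract_links(sample, "https://example.com/index.html"))
```

A crawler would repeat this step on each newly found page, which is how a small set of starting URLs can grow into a large list.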
Just follow the steps below; collecting links and downloading content is simple. You can forcibly stop the download process at any time. The content downloaded so far is saved in a text file, along with the URLs it has already been downloaded from.
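The stop-and-resume behaviour described above can be sketched with two append-only files. This is only an illustration under stated assumptions, not how TextHarvester is actually implemented: the function names, the `fetch` callback, and the file names `corpus.txt` and `done_urls.txt` are all hypothetical.

```python
import os
import tempfile


def load_done(done_path):
    """Return the set of URLs recorded as already downloaded."""
    if not os.path.exists(done_path):
        return set()
    with open(done_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def harvest(urls, fetch, text_path, done_path):
    """Append fetch(url) to text_path for every URL not yet listed in done_path."""
    done = load_done(done_path)
    with open(text_path, "a", encoding="utf-8") as text_file, \
         open(done_path, "a", encoding="utf-8") as done_file:
        for url in urls:
            if url in done:
                continue  # downloaded in an earlier run
            try:
                text = fetch(url)
            except KeyboardInterrupt:
                break  # forced stop while fetching: keep what is already written
            text_file.write(text + "\n")
            done_file.write(url + "\n")
            # Flush after each page so a hard stop loses at most the page in progress.
            text_file.flush()
            done_file.flush()
            done.add(url)
    return done


def fake_fetch(url):
    """Stand-in for a real download, so the sketch runs offline."""
    return "text from " + url


workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "corpus.txt")      # hypothetical file name
done_path = os.path.join(workdir, "done_urls.txt")   # hypothetical file name

harvest(["https://example.com/a"], fake_fetch, text_path, done_path)
# Second run: /a is skipped because it is already recorded, only /b is fetched.
harvest(["https://example.com/a", "https://example.com/b"], fake_fetch, text_path, done_path)
```

Recording each URL only after its text has been written means a restart may at worst fetch the interrupted page again, rather than silently skipping it.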
- Main Repo: https://github.com/techboy-coder/Textharvester
- Docs (Google Colab): https://colab.research.google.com/drive/1vpVg_bQzoKjZNX3-7DMUJ_5zLnnhGIkC?usp=sharing
- Install
- Import
- Collect/Harvest Links
- Download Content from Links
© 2020 - Techboy-Coder – (https://github.com/techboy-coder)