This repository was archived by the owner on Sep 18, 2024. It is now read-only.

Commit 9acce0a

Create README.md
1 parent 6bf6382 commit 9acce0a

1 file changed

Lines changed: 5 additions & 0 deletions

README.md
@@ -0,0 +1,5 @@
# news-crawler

Crawling web news and storing it in JSON format

This project crawls news RSS feeds and can optionally retrieve articles stored in the Web Archive. It extracts the full text of each article and dumps it to a semi-structured JSON file, with every article associated with a timestamp. Articles that could not be parsed are dumped as raw HTML in an `extra` folder. Have a look at the shell script for an idea of how to run everything.
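
The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function names (`parse_feed`, `extract_text`, `store_article`) are hypothetical, the naive `<p>`-tag extractor stands in for whatever full-text extractor the project really uses, and downloading pages (including the optional Web Archive lookup) is left out so the sketch runs offline with only the standard library.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser
from pathlib import Path


class _ParagraphExtractor(HTMLParser):
    """Collects text found inside <p> tags (a deliberately naive extractor)."""

    def __init__(self):
        super().__init__()
        self._depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self.chunks.append(data.strip())


def parse_feed(rss_xml):
    """Return one dict (title, link, timestamp) per <item> of an RSS 2.0 document."""
    entries = []
    for item in ET.fromstring(rss_xml).iter("item"):
        pub = item.findtext("pubDate")
        try:
            when = parsedate_to_datetime(pub) if pub else None
        except (TypeError, ValueError):
            when = None
        if when is None:
            # Assumption: fall back to the crawl time when the feed has no usable date.
            when = datetime.now(timezone.utc)
        entries.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "timestamp": when.isoformat(),
        })
    return entries


def extract_text(html):
    """Join the text of all <p> elements; returns '' when nothing was found."""
    parser = _ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def store_article(entry, html, out_dir, name):
    """Write <name>.json on success, or the raw page to extra/<name>.html otherwise."""
    out_dir = Path(out_dir)
    text = extract_text(html)
    if text:
        out_dir.mkdir(parents=True, exist_ok=True)
        path = out_dir / f"{name}.json"
        record = {**entry, "text": text}
        path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    else:
        extra = out_dir / "extra"
        extra.mkdir(parents=True, exist_ok=True)
        path = extra / f"{name}.html"
        path.write_text(html, encoding="utf-8")
    return path
```

In this sketch, each entry returned by `parse_feed` would be paired with the HTML downloaded from its `link` and passed to `store_article`, which decides between the JSON output and the `extra` folder depending on whether any text could be extracted.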
