Spider based on the Storm platform
Each allowed URL pattern has configurable settings (detailed below):
- limitation
- reset interval
- expire time
- parallelism
The system re-fetches settings after a certain time (they are cached), so settings can be updated dynamically without restarting the topology.
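A minimal sketch of such a time-based settings cache, assuming the settings live in a Redis hash; the class name, field names, and the `spider_settings` key are illustrative, not from the project:

```java
import redis.clients.jedis.Jedis;

import java.util.Map;

public class SettingsCache {
    private final Jedis jedis = new Jedis("localhost");   // assumed Redis location
    private final long refetchIntervalMs;                 // how long cached settings stay valid
    private Map<String, String> settings;
    private long lastFetchMs;

    public SettingsCache(long refetchIntervalMs) {
        this.refetchIntervalMs = refetchIntervalMs;
    }

    // Return cached settings, re-fetching from Redis once the cache expires,
    // so updated values take effect while the topology keeps running.
    public synchronized Map<String, String> get() {
        long now = System.currentTimeMillis();
        if (settings == null || now - lastFetchMs > refetchIntervalMs) {
            settings = jedis.hgetAll("spider_settings");  // assumed key layout
            lastFetchMs = now;
        }
        return settings;
    }
}
```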
The topology has one spout (URLReader) and five bolts: URLFilter, Downloader, HTMLParser, HTMLSaver, and URLSaver.
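Below is a sketch of how this topology might be wired with Storm's `TopologyBuilder`. The component IDs, parallelism hints, shuffle groupings, and the `org.apache.storm` package names (Storm 1.x+) are assumptions, not taken from the project:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class SpiderTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // One spout feeding a chain of bolts; parallelism hints are illustrative.
        builder.setSpout("url-reader", new URLReader(), 1);
        builder.setBolt("url-filter", new URLFilter(), 2)
               .shuffleGrouping("url-reader");
        builder.setBolt("downloader", new Downloader(), 8)
               .shuffleGrouping("url-filter");
        builder.setBolt("html-parser", new HTMLParser(), 4)
               .shuffleGrouping("downloader");
        // Both savers subscribe to the parser: one stores pages, the other
        // stores newly discovered URLs (an assumed wiring).
        builder.setBolt("html-saver", new HTMLSaver(), 2)
               .shuffleGrouping("html-parser");
        builder.setBolt("url-saver", new URLSaver(), 2)
               .shuffleGrouping("html-parser");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("spider", new Config(), builder.createTopology());
        Thread.sleep(60_000);   // let the local topology run for a minute
        cluster.shutdown();
    }
}
```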
This bolt (URLFilter) is the controller, in charge of:
- handling repeated URLs (deduplication)
- counting downloads per pattern, ignoring URLs whose pattern has exceeded its limitation (see the sketch after this list)
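A minimal sketch of those two checks, assuming Redis backs both the seen-URL set and the per-pattern counters; the key names (`seen_urls`, `count:<pattern>`) are illustrative:

```java
import redis.clients.jedis.Jedis;

public class URLFilterLogic {
    private final Jedis jedis = new Jedis("localhost");

    // Decide whether a URL should be passed on to the Downloader.
    public boolean shouldDownload(String url, String pattern,
                                  long limitation, int intervalSeconds) {
        // 1. Repeated-URL check: SADD returns 0 if the URL was already seen.
        if (jedis.sadd("seen_urls", url) == 0) {
            return false;
        }
        // 2. Per-pattern download counter, reset via TTL after each interval.
        String counterKey = "count:" + pattern;
        long count = jedis.incr(counterKey);
        if (count == 1) {
            jedis.expire(counterKey, intervalSeconds);
        }
        // Drop the URL once its pattern exceeds the limitation.
        return count <= limitation;
    }
}
```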
The following settings can be configured:
allowed_url_patterns (Redis sorted set, required; patterns are read in priority order from the highest score (5) to the lowest (1) via ZREVRANGEBYSCORE): the URL patterns allowed to be downloaded. Each pattern carries the settings below (see the sketch after this list):
- **limitation**: maximum number of downloads within one interval
- **interval**: duration after which the download count is reset
- **expire**: cache expiry time
- **parallelism**: maximum number of workers working on this pattern (host)
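A sketch of loading the patterns in priority order with ZREVRANGEBYSCORE, using the Jedis client (an assumption; any Redis client works the same way):

```java
import redis.clients.jedis.Jedis;

import java.util.Set;

public class PatternLoader {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost")) {
            // Patterns come back ordered from score 5 (highest priority) down to 1.
            Set<String> patterns =
                    jedis.zrevrangeByScore("allowed_url_patterns", 5, 1);
            for (String pattern : patterns) {
                System.out.println("allowed pattern: " + pattern);
            }
        }
    }
}
```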
- ignore non-text pages (binary files)
- consume URLs faster
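One way the non-text check could work is to issue a HEAD request and skip the download unless the Content-Type is textual; this is purely illustrative and not taken from the project:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ContentTypeCheck {
    // Returns true if the URL appears to serve a text page (HTML, plain text).
    public static boolean isTextPage(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");          // avoid downloading the body
        String contentType = conn.getContentType();
        conn.disconnect();
        return contentType != null && contentType.startsWith("text/");
    }
}
```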