Spider based on the Storm platform
Each allowed URL pattern has configurable settings (detailed below):
- limitation
- reset interval
- expire time
- parallelism
The system re-fetches settings after a certain time (they are cached), so settings can be updated dynamically without restarting the topology.
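A minimal sketch of such a time-based settings cache, assuming the settings live in a Redis hash; the class name, field names, and the `spider_settings` key are illustrative, not from the project:

```java
import redis.clients.jedis.Jedis;

import java.util.Map;

public class SettingsCache {
    private final Jedis jedis = new Jedis("localhost");   // assumed Redis location
    private final long refetchIntervalMs;                 // how long cached settings stay valid
    private Map<String, String> settings;
    private long lastFetchMs;

    public SettingsCache(long refetchIntervalMs) {
        this.refetchIntervalMs = refetchIntervalMs;
    }

    // Return cached settings, re-fetching from Redis once the cache expires,
    // so updated values take effect while the topology keeps running.
    public synchronized Map<String, String> get() {
        long now = System.currentTimeMillis();
        if (settings == null || now - lastFetchMs > refetchIntervalMs) {
            settings = jedis.hgetAll("spider_settings");  // assumed key layout
            lastFetchMs = now;
        }
        return settings;
    }
}
```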
The topology has one spout (URLReader) and five bolts: URLFilter, Downloader, HTMLParser, HTMLSaver, and URLSaver.
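Below is a sketch of how this topology might be wired with Storm's `TopologyBuilder`. The component IDs, parallelism hints, shuffle groupings, and the `org.apache.storm` package names (Storm 1.x+) are assumptions, not taken from the project:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class SpiderTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // One spout feeding a chain of bolts; parallelism hints are illustrative.
        builder.setSpout("url-reader", new URLReader(), 1);
        builder.setBolt("url-filter", new URLFilter(), 2)
               .shuffleGrouping("url-reader");
        builder.setBolt("downloader", new Downloader(), 8)
               .shuffleGrouping("url-filter");
        builder.setBolt("html-parser", new HTMLParser(), 4)
               .shuffleGrouping("downloader");
        // Both savers subscribe to the parser: one stores pages, the other
        // stores newly discovered URLs (an assumed wiring).
        builder.setBolt("html-saver", new HTMLSaver(), 2)
               .shuffleGrouping("html-parser");
        builder.setBolt("url-saver", new URLSaver(), 2)
               .shuffleGrouping("html-parser");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("spider", new Config(), builder.createTopology());
        Thread.sleep(60_000);   // let the local topology run for a minute
        cluster.shutdown();
    }
}
```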
This bolt (URLFilter) is the controller, in charge of:
- handling repeated URLs (deduplication)
- counting downloads per pattern, ignoring URLs whose pattern has exceeded its limitation (see the sketch after this list)
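A minimal sketch of those two checks, assuming Redis backs both the seen-URL set and the per-pattern counters; the key names (`seen_urls`, `count:<pattern>`) are illustrative:

```java
import redis.clients.jedis.Jedis;

public class URLFilterLogic {
    private final Jedis jedis = new Jedis("localhost");

    // Decide whether a URL should be passed on to the Downloader.
    public boolean shouldDownload(String url, String pattern,
                                  long limitation, int intervalSeconds) {
        // 1. Repeated-URL check: SADD returns 0 if the URL was already seen.
        if (jedis.sadd("seen_urls", url) == 0) {
            return false;
        }
        // 2. Per-pattern download counter, reset via TTL after each interval.
        String counterKey = "count:" + pattern;
        long count = jedis.incr(counterKey);
        if (count == 1) {
            jedis.expire(counterKey, intervalSeconds);
        }
        // Drop the URL once its pattern exceeds the limitation.
        return count <= limitation;
    }
}
```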
The following settings can be configured:
allowed_url_patterns (Redis sorted set, required; patterns are read in priority order from the highest score (5) to the lowest (1) via ZREVRANGEBYSCORE): the URL patterns allowed to be downloaded. Each pattern carries the settings below (see the sketch after this list):
- **limitation**: maximum number of downloads within one interval
- **interval**: duration after which the download count is reset
- **expire**: cache expiry time
- **parallelism**: maximum number of workers working on this pattern (host)
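A sketch of loading the patterns in priority order with ZREVRANGEBYSCORE, using the Jedis client (an assumption; any Redis client works the same way):

```java
import redis.clients.jedis.Jedis;

import java.util.Set;

public class PatternLoader {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost")) {
            // Patterns come back ordered from score 5 (highest priority) down to 1.
            Set<String> patterns =
                    jedis.zrevrangeByScore("allowed_url_patterns", 5, 1);
            for (String pattern : patterns) {
                System.out.println("allowed pattern: " + pattern);
            }
        }
    }
}
```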
- ignore non-text pages (binary files)
- consume URLs faster
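One way the non-text check could work is to issue a HEAD request and skip the download unless the Content-Type is textual; this is purely illustrative and not taken from the project:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ContentTypeCheck {
    // Returns true if the URL appears to serve a text page (HTML, plain text).
    public static boolean isTextPage(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");          // avoid downloading the body
        String contentType = conn.getContentType();
        conn.disconnect();
        return contentType != null && contentType.startsWith("text/");
    }
}
```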