katkad/KeywordsSpider
NAME
KeywordsSpider - web spider searching for keywords
SYNOPSIS
use KeywordsSpider;

KeywordsSpider::run(
    outfile => "output_test",
    infile => "export.sql",
    keyfile => "keywords_new",
    debug => 1,
    skip_ref_regexp => "(^http://trala|^null|twig.html\$)",
    allowed_keywords => "allowed_keywords",
    web_depth => 5
);
DESCRIPTION
KeywordsSpider is a web spider that takes URLs and keywords from files and writes the URLs matching the keywords to another file.
Referers can be specified in the input file. Their domain is matched against the website's domain.
It spiders in 10 parallel processes.
It takes files as arguments and prepares attributes for Spider.
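The parallel crawl mentioned above can be pictured with a minimal sketch. It assumes plain fork and waitpid with a round-robin split of the URLs; the README does not say how KeywordsSpider actually distributes work, and the URLs and variable names here are placeholders.

```perl
use strict;
use warnings;

my $workers = 10;                                # matches the 10 processes mentioned above
my @urls = map { "site$_.example" } 1 .. 25;     # placeholder URLs, not real input

# Round-robin the URLs into one bucket per worker.
my @buckets = map { [] } 1 .. $workers;
push @{ $buckets[ $_ % $workers ] }, $urls[$_] for 0 .. $#urls;

my @pids;
for my $bucket (@buckets) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child process: a real spider would fetch and scan each URL in @$bucket here.
        exit 0;
    }
    push @pids, $pid;                            # parent remembers the child
}

# Parent waits for every child to finish.
waitpid($_, 0) for @pids;
```

Each child gets its own bucket, so no locking is needed between workers; collecting results into one output file would need extra coordination that this sketch leaves out.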
ARGUMENTS
infile
File with website and referer URLs, one comma-separated pair per line. For example:
'domain.sk/twig.html','null'
domain.sk,domain2.sk
another-domain.sk/twig.html,null
another-domain.sk/twig.html,http://trala.sk
There must be no space after the comma. The apostrophes are optional.
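A minimal sketch of reading one line in this format follows. The helper name parse_infile_line is hypothetical and not part of KeywordsSpider's documented interface. It also assumes the first comma separates the two fields, so a website URL containing a comma would be split incorrectly.

```perl
use strict;
use warnings;

# Hypothetical helper: split one infile line into (website, referer).
sub parse_infile_line {
    my ($line) = @_;
    chomp $line;
    # Split on the first comma only; the format has no space after it.
    my ($website, $referer) = split /,/, $line, 2;
    # Apostrophes around either field are optional, so strip them if present.
    for ($website, $referer) {
        next unless defined;
        s/^'//;
        s/'$//;
    }
    return ($website, $referer);
}

my ($site, $ref) = parse_infile_line("'domain.sk/twig.html','null'");
print "$site <- $ref\n";
```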
keyfile
File with newline-separated keywords. For example:
word1
wuord2
wiaord3
allowed_keywords
File with newline-separated keywords that do not trigger an ALERT in the output file. For example:
wuord2
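How the two keyword lists interact can be sketched as follows. This is an illustration only: the helper classify_matches, the plain substring matching, and the printed labels are assumptions, since the README does not describe how the Spider module matches keywords or how it formats the output file.

```perl
use strict;
use warnings;

# Hypothetical helper: find which keywords occur in $text and split them
# into those that would raise an alert and those that are allowed.
sub classify_matches {
    my ($text, $keywords, $allowed) = @_;
    my %is_allowed = map { $_ => 1 } @$allowed;
    my (@alert, @quiet);
    for my $kw (@$keywords) {
        next unless index($text, $kw) >= 0;    # plain substring match (an assumption)
        if ($is_allowed{$kw}) { push @quiet, $kw } else { push @alert, $kw }
    }
    return (\@alert, \@quiet);
}

my ($alert, $quiet) = classify_matches(
    "page text mentioning word1 and wuord2",
    [qw(word1 wuord2 wiaord3)],    # contents of keyfile
    [qw(wuord2)],                  # contents of allowed_keywords
);
print "ALERT: @$alert\n"   if @$alert;    # label format is illustrative only
print "allowed: @$quiet\n" if @$quiet;
```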
outfile
output file
debug
Print debug messages to standard output. Turned off by default.
skip_ref_regexp
You can specify several referers for the same website. If you do not want to crawl a specific domain
or any URL matching a given pattern, put a regular expression for it here. For example:
(^http://trala|^null|twig.html\$)
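The effect of such a pattern can be sketched as below. It assumes the expression is tested against each referer URL, which the argument name suggests but the README does not spell out; the helper should_skip and the sample referers are hypothetical.

```perl
use strict;
use warnings;

# In a double-quoted Perl string, "\$" yields a literal "$", which then acts
# as the end-of-string anchor once the string is compiled as a regex.
my $pattern = "(^http://trala|^null|twig.html\$)";
my $re = qr/$pattern/;

# Hypothetical helper: true when a referer should not be crawled.
sub should_skip {
    my ($url, $regexp) = @_;
    return $url =~ $regexp ? 1 : 0;
}

my @referers = ('null', 'http://trala.sk', 'domain2.sk', 'http://domain.sk/twig.html');
my @kept = grep { !should_skip($_, $re) } @referers;
print join(", ", @kept), "\n";
```

Note that the unescaped dot in twig.html matches any character, so a stricter pattern would write it as twig\.html.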
METHODS
run %ARGS
Runs the spider with the given arguments.
SEE ALSO
Spider -- core spidering and matching module
COPYRIGHT
Copyright 2013 katkad
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.