| Modifier and Type | Field and Description |
|---|---|
| protected static org.slf4j.Logger | logger |
| protected CrawlController | myController<br>The controller instance that has created this crawler thread. |
| protected int | myId<br>The id associated with the crawler thread running this instance. |
| Constructor and Description |
|---|
| WebCrawler() |
| Modifier and Type | Method and Description |
|---|---|
| CrawlController | getMyController() |
| int | getMyId()<br>Gets the id of the current crawler instance. |
| Object | getMyLocalData()<br>The CrawlController instance that has created this crawler instance will call this function just before terminating this crawler thread. |
| Thread | getThread() |
| protected void | handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription)<br>This function is called once the header of a page is fetched. |
| protected WebURL | handleUrlBeforeProcess(WebURL curURL)<br>This function is called before the page's URL is processed. Subclasses can override it to tweak the URL before processing. |
| void | init(int id, CrawlController crawlController)<br>Initializes the current instance of the crawler. |
| boolean | isNotWaitingForNewURLs() |
| void | onBeforeExit()<br>This function is called just before the termination of the current crawler instance. |
| protected void | onContentFetchError(WebURL webUrl)<br>This function is called if the content of a URL could not be fetched. |
| protected void | onPageBiggerThanMaxSize(String urlStr, long pageSize)<br>This function is called if the content of a URL is bigger than the allowed size. |
| protected void | onParseError(WebURL webUrl)<br>This function is called if there has been an error in parsing the content. |
| protected void | onRedirectedStatusCode(Page page)<br>This function is called if the crawler encounters a page with a 3xx status code. |
| void | onStart()<br>This function is called just before this crawler instance starts crawling. |
| protected void | onUnexpectedStatusCode(String urlStr, int statusCode, String contentType, String description)<br>This function is called if the crawler encounters an unexpected HTTP status code (a status code other than 3xx). |
| protected void | onUnhandledException(WebURL webUrl, Throwable e)<br>This function is called when an unhandled exception is encountered during fetching. |
| void | run() |
| void | setThread(Thread myThread) |
| protected boolean | shouldFollowLinksIn(WebURL url)<br>Determines whether links found at the given URL should be added to the queue for crawling. |
| boolean | shouldVisit(Page referringPage, WebURL url)<br>Classes that extend WebCrawler should override this function to tell the crawler whether the given URL should be crawled or not. |
| void | visit(Page page)<br>Classes that extend WebCrawler should override this function to process the content of the fetched and parsed page. |
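The method summary says that subclasses decide which URLs get crawled by overriding shouldVisit. The sketch below shows the kind of decision such an override might make: stay under one URL prefix and skip static assets. It is a minimal stand-in that works on plain strings so it compiles without the crawler library. `UrlFilterSketch`, `STATIC_ASSETS`, and `allowedPrefix` are hypothetical names, not part of the documented API. In a real subclass the string would presumably come from the `WebURL` argument.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/**
 * Stand-in for the decision a shouldVisit(Page, WebURL) override might make.
 * All names here are hypothetical; this is not the library's implementation.
 */
class UrlFilterSketch {
    // Skip URLs whose path ends in a common static-asset extension.
    private static final Pattern STATIC_ASSETS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz|pdf)$");

    private final String allowedPrefix;

    UrlFilterSketch(String allowedPrefix) {
        // Compare in lower case so the check is case-insensitive.
        this.allowedPrefix = allowedPrefix.toLowerCase(Locale.ROOT);
    }

    /** Returns true when the URL is under the allowed prefix and is not a static asset. */
    boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        return href.startsWith(allowedPrefix)
                && !STATIC_ASSETS.matcher(href).matches();
    }
}
```

Keeping the rule in a small helper like this makes it testable on its own. The override in a WebCrawler subclass would then only need to delegate to it.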
protected static final org.slf4j.Logger logger

protected int myId

protected CrawlController myController

public void init(int id, CrawlController crawlController) throws InstantiationException, IllegalAccessException
Parameters:
id - the id of this crawler instance
crawlController - the controller that manages this crawling session
Throws:
IllegalAccessException
InstantiationException

public int getMyId()

public CrawlController getMyController()

public void onStart()

public void onBeforeExit()

protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription)
Parameters:
webUrl - WebURL containing the status code
statusCode - HTTP status code number
statusDescription - HTTP status code description

protected WebURL handleUrlBeforeProcess(WebURL curURL)
Parameters:
curURL - current URL, which can be tweaked before processing

protected void onPageBiggerThanMaxSize(String urlStr, long pageSize)
Parameters:
urlStr - the URL whose content is bigger than the allowed size

protected void onRedirectedStatusCode(Page page)
Parameters:
page - partial page object

protected void onUnexpectedStatusCode(String urlStr, int statusCode, String contentType, String description)
Parameters:
urlStr - URL in which an unexpected error was encountered while crawling
statusCode - HTTP status code
contentType - type of content
description - error description

protected void onContentFetchError(WebURL webUrl)
Parameters:
webUrl - URL whose content failed to be fetched

protected void onUnhandledException(WebURL webUrl, Throwable e)
Parameters:
webUrl - URL where an unhandled exception occurred

protected void onParseError(WebURL webUrl)
Parameters:
webUrl - URL which failed on parsing

public Object getMyLocalData()

public boolean shouldVisit(Page referringPage, WebURL url)
Parameters:
url - the URL whose inclusion in the crawl is being decided
referringPage - the Page in which this URL was found

protected boolean shouldFollowLinksIn(WebURL url)
Parameters:
url - the URL of the page under consideration

public void visit(Page page)
Parameters:
page - the page object that has just been fetched and parsed

public Thread getThread()

public void setThread(Thread myThread)

public boolean isNotWaitingForNewURLs()
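The status-related hooks are described in terms of status code ranges: onRedirectedStatusCode for 3xx codes and onUnexpectedStatusCode for unexpected codes other than 3xx. As a rough illustration only, the sketch below maps a status code to the name of the hook those descriptions associate with it. `StatusHookSketch` and `hookFor` are hypothetical names. Treating every 2xx code as a success that triggers no error hook is an assumption; the text here does not state it.

```java
/**
 * Sketch of how a fetched status code could map onto the documented hooks.
 * This is an illustration under stated assumptions, not the library's code.
 */
class StatusHookSketch {
    /** Names the hook that the method descriptions associate with a status code. */
    static String hookFor(int statusCode) {
        // 3xx codes are described as going to onRedirectedStatusCode.
        if (statusCode >= 300 && statusCode < 400) {
            return "onRedirectedStatusCode";
        }
        // Assumption: any 2xx code counts as success and triggers no error hook.
        if (statusCode >= 200 && statusCode < 300) {
            return "none";
        }
        // Everything else is treated as an unexpected status code.
        return "onUnexpectedStatusCode";
    }
}
```

For example, `StatusHookSketch.hookFor(404)` returns `"onUnexpectedStatusCode"`, which is where a subclass might log broken links.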
Copyright © 2017. All rights reserved.