A B C D E F G H I L M N O P R S T U V W 

A

add(String, String) - Method in class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives
Add a rule to the list of rules for this user agent.
addAuthInfo(AuthInfo) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
addDirectives(UserAgentDirectives) - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
Store set of directives
addSeed(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Adds a new seed URL.
addSeed(String, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Adds a new seed URL.
addSeenUrl(String, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
This function can be called to assign a specific document ID to a URL.
addUrlAndDocId(String, int) - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
ALLOWED - Static variable in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 
allows(String) - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
Check if the host directives allow visiting the path.
allows(WebURL) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
Please note that in the case of a bad URL, TRUE will be returned
AllTagMapper - Class in edu.uci.ics.crawler4j.parser
Maps all HTML tags (without ignoring any of them).
AllTagMapper() - Constructor for class edu.uci.ics.crawler4j.parser.AllTagMapper
 
authenticationType - Variable in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
AuthInfo - Class in edu.uci.ics.crawler4j.crawler.authentication
Created by Avi Hayun on 11/23/2014.
AuthInfo() - Constructor for class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
Constructs a new AuthInfo.
AuthInfo(AuthInfo.AuthenticationType, FormSubmitEvent.MethodType, String, String, String) - Constructor for class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
This constructor should only be used by extending classes
AuthInfo.AuthenticationType - Enum in edu.uci.ics.crawler4j.crawler.authentication
 

B

BasicAuthInfo - Class in edu.uci.ics.crawler4j.crawler.authentication
Created by Avi Hayun on 11/25/2014.
BasicAuthInfo(String, String, String) - Constructor for class edu.uci.ics.crawler4j.crawler.authentication.BasicAuthInfo
Constructor
beginTransaction() - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
BinaryParseData - Class in edu.uci.ics.crawler4j.parser
 
BinaryParseData() - Constructor for class edu.uci.ics.crawler4j.parser.BinaryParseData
 
byteArray2Int(byte[]) - Static method in class edu.uci.ics.crawler4j.util.Util
 
byteArray2Long(byte[]) - Static method in class edu.uci.ics.crawler4j.util.Util
 

C

characters(char[], int, int) - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
checkAccess(String) - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
Check if any of the rules say anything about the specified path
checkAccess(String, String) - Method in class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives
 
close() - Method in class edu.uci.ics.crawler4j.frontier.Counters
 
close() - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
close() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
close() - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
commit(Transaction) - Static method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
compare(UserAgentDirectives, UserAgentDirectives) - Method in class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives.UserAgentComparator
 
config - Variable in class edu.uci.ics.crawler4j.crawler.Configurable
 
config - Variable in class edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
 
Configurable - Class in edu.uci.ics.crawler4j.crawler
Several core components of crawler4j extend this class to make them configurable.
Configurable(CrawlConfig) - Constructor for class edu.uci.ics.crawler4j.crawler.Configurable
 
connect(HttpClientConnection, HttpRoute, int, HttpContext) - Method in class edu.uci.ics.crawler4j.fetcher.SniPoolingHttpClientConnectionManager
 
connectionManager - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
connectionMonitorThread - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
contains(String) - Method in class edu.uci.ics.crawler4j.url.TLDList
 
contentCharset - Variable in class edu.uci.ics.crawler4j.crawler.Page
The charset of the content.
contentData - Variable in class edu.uci.ics.crawler4j.crawler.Page
The content of this page in binary format.
contentEncoding - Variable in class edu.uci.ics.crawler4j.crawler.Page
The encoding of the content.
ContentFetchException - Exception in edu.uci.ics.crawler4j.crawler.exceptions
Created by Avi Hayun on 12/8/2014.
ContentFetchException() - Constructor for exception edu.uci.ics.crawler4j.crawler.exceptions.ContentFetchException
 
contentType - Variable in class edu.uci.ics.crawler4j.crawler.Page
The ContentType of this page.
Counters - Class in edu.uci.ics.crawler4j.frontier
 
Counters(Environment, CrawlConfig) - Constructor for class edu.uci.ics.crawler4j.frontier.Counters
 
counters - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
Counters.ReservedCounterNames - Class in edu.uci.ics.crawler4j.frontier
 
counterValues - Variable in class edu.uci.ics.crawler4j.frontier.Counters
 
CrawlConfig - Class in edu.uci.ics.crawler4j.crawler
 
CrawlConfig() - Constructor for class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
CrawlController - Class in edu.uci.ics.crawler4j.crawler
The controller that manages a crawling session.
CrawlController(CrawlConfig, PageFetcher, RobotstxtServer) - Constructor for class edu.uci.ics.crawler4j.crawler.CrawlController
 
CrawlController.WebCrawlerFactory<T extends WebCrawler> - Interface in edu.uci.ics.crawler4j.crawler
 
crawlersLocalData - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
Once the crawling session finishes the controller collects the local data of the crawler threads and stores them in this List.
createLayeredSocket(Socket, String, int, HttpContext) - Method in class edu.uci.ics.crawler4j.fetcher.SniSSLConnectionSocketFactory
 
customData - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
The 'customData' object can be used for passing custom crawl-related configurations to different components of the crawler.

D

delete(int) - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
deleteFolder(File) - Static method in class edu.uci.ics.crawler4j.util.IO
 
deleteFolderContents(File) - Static method in class edu.uci.ics.crawler4j.util.IO
 
DISALLOWED - Static variable in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 
disallows(String) - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
Check if the host directives explicitly disallow visiting path.
discardContentIfNotConsumed() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
docIdServer - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
 
DocIDServer - Class in edu.uci.ics.crawler4j.frontier
 
DocIDServer(Environment, CrawlConfig) - Constructor for class edu.uci.ics.crawler4j.frontier.DocIDServer
 

E

edu.uci.ics.crawler4j.crawler - package edu.uci.ics.crawler4j.crawler
 
edu.uci.ics.crawler4j.crawler.authentication - package edu.uci.ics.crawler4j.crawler.authentication
 
edu.uci.ics.crawler4j.crawler.exceptions - package edu.uci.ics.crawler4j.crawler.exceptions
 
edu.uci.ics.crawler4j.fetcher - package edu.uci.ics.crawler4j.fetcher
 
edu.uci.ics.crawler4j.frontier - package edu.uci.ics.crawler4j.frontier
 
edu.uci.ics.crawler4j.parser - package edu.uci.ics.crawler4j.parser
 
edu.uci.ics.crawler4j.robotstxt - package edu.uci.ics.crawler4j.robotstxt
 
edu.uci.ics.crawler4j.url - package edu.uci.ics.crawler4j.url
 
edu.uci.ics.crawler4j.util - package edu.uci.ics.crawler4j.util
 
ENABLE_SNI - Static variable in class edu.uci.ics.crawler4j.fetcher.SniSSLConnectionSocketFactory
 
endElement(String, String, String) - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
entity - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
entryToObject(TupleInput) - Method in class edu.uci.ics.crawler4j.frontier.WebURLTupleBinding
 
env - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
 
env - Variable in class edu.uci.ics.crawler4j.frontier.Counters
 
equals(Object) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
ExtractedUrlAnchorPair - Class in edu.uci.ics.crawler4j.parser
 
ExtractedUrlAnchorPair() - Constructor for class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 
extractUrls(String) - Static method in class edu.uci.ics.crawler4j.util.Net
 

F

fetchContent(Page, int) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
fetchedUrl - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
fetchPage(WebURL) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
fetchResponseHeaders - Variable in class edu.uci.ics.crawler4j.crawler.Page
Headers which were present in the response of the fetch request
finish() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
finished - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
Is the crawling of this session finished?
FormAuthInfo - Class in edu.uci.ics.crawler4j.crawler.authentication
Created by Avi Hayun on 11/25/2014.
FormAuthInfo(String, String, String, String, String) - Constructor for class edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo
Constructor
frontier - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
 
Frontier - Class in edu.uci.ics.crawler4j.frontier
 
Frontier(Environment, CrawlConfig) - Constructor for class edu.uci.ics.crawler4j.frontier.Frontier
 

G

get(int) - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
getAnchor() - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 
getAnchor() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getAuthenticationType() - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
getAuthInfos() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getBaseUrl() - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
getBodyText() - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
getCacheSize() - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
getCanonicalURL(String) - Static method in class edu.uci.ics.crawler4j.url.URLCanonicalizer
 
getCanonicalURL(String, String) - Static method in class edu.uci.ics.crawler4j.url.URLCanonicalizer
 
getCleanupDelaySeconds() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getConfig() - Method in class edu.uci.ics.crawler4j.crawler.Configurable
 
getConnectionTimeout() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getContentCharset() - Method in class edu.uci.ics.crawler4j.crawler.Page
 
getContentData() - Method in class edu.uci.ics.crawler4j.crawler.Page
 
getContentEncoding() - Method in class edu.uci.ics.crawler4j.crawler.Page
 
getContentType() - Method in class edu.uci.ics.crawler4j.crawler.Page
 
getCrawlDelay() - Method in class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives
Return the configured crawl delay in seconds
getCrawlersLocalData() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Once the crawling session finishes the controller collects the local data of the crawler threads and stores them in a List.
getCrawlStorageFolder() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getCustomData() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
getDatabaseEntryKey(WebURL) - Static method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
getDefaultHeaders() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Return a copy of the default header collection.
getDepth() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getDocCount() - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
getDocId(String) - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
Returns the docid of an already seen url.
getDocid() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getDocIdServer() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
getDomain() - Method in class edu.uci.ics.crawler4j.crawler.authentication.NtAuthInfo
 
getDomain() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getEntity() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
getFetchedUrl() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
getFetchResponseHeaders() - Method in class edu.uci.ics.crawler4j.crawler.Page
Returns headers which were present in the response of the fetch request
getFrontier() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
getHost() - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
getHref() - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 
getHtml() - Method in class edu.uci.ics.crawler4j.parser.BinaryParseData
 
getHtml() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
getHttpMethod() - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
getIgnoreUADiscrimination() - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
getInstance() - Static method in class edu.uci.ics.crawler4j.url.TLDList
 
getLanguage() - Method in class edu.uci.ics.crawler4j.crawler.Page
 
getLastAccessTime() - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 
getLength() - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
getLoginTarget() - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
getMaxConnectionsPerHost() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMaxDepthOfCrawling() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMaxDownloadSize() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMaxOutgoingLinksToFollow() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMaxPagesToFetch() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMaxTotalConnections() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getMetaTags() - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
getMetaTags() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
getMovedToUrl() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
getMyController() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
getMyId() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
Get the id of the current crawler instance
getMyLocalData() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
The CrawlController instance that has created this crawler instance will call this function just before terminating this crawler thread.
getNewDocID(String) - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
getNextURLs(int, List<WebURL>) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
getNumberOfAssignedPages() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
getNumberOfProcessedPages() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
getNumberOfScheduledPages() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
getOutgoingUrls() - Method in class edu.uci.ics.crawler4j.parser.BinaryParseData
 
getOutgoingUrls() - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
getOutgoingUrls() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
getOutgoingUrls() - Method in interface edu.uci.ics.crawler4j.parser.ParseData
 
getOutgoingUrls() - Method in class edu.uci.ics.crawler4j.parser.TextParseData
 
getPageFetcher() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
getPageSize() - Method in exception edu.uci.ics.crawler4j.crawler.exceptions.PageBiggerThanMaxSizeException
 
getParentDocid() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getParentUrl() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getParseData() - Method in class edu.uci.ics.crawler4j.crawler.Page
 
getPassword() - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
getPasswordFormStr() - Method in class edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo
 
getPath() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getPolitenessDelay() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getPort() - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
getPreferredHost() - Method in class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives
Return the specified preferred host name in robots.txt.
getPriority() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getProtocol() - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
getProxyHost() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getProxyPassword() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getProxyPort() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getProxyUsername() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getQueueLength() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
getRedirectedToUrl() - Method in class edu.uci.ics.crawler4j.crawler.Page
 
getResponseHeaders() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
getRobotstxtServer() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
getSitemap() - Method in class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives
Return the listed sitemaps, or null if none was specified
getSocketTimeout() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getStatusCode() - Method in class edu.uci.ics.crawler4j.crawler.Page
 
getStatusCode() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
getSubDomain() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getTag() - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 
getTag() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getText() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
getTextContent() - Method in class edu.uci.ics.crawler4j.parser.TextParseData
 
getThread() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
getThreadMonitoringDelaySeconds() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getThreadShutdownDelaySeconds() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getTitle() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
getURL() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
getUserAgentName() - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
getUserAgentString() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
getUsername() - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
getUsernameFormStr() - Method in class edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo
 
getValue(String) - Method in class edu.uci.ics.crawler4j.frontier.Counters
 
getWebURL() - Method in class edu.uci.ics.crawler4j.crawler.Page
 

H

handlePageStatusCode(WebURL, int, String) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called once the header of a page is fetched.
handleUrlBeforeProcess(WebURL) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called before processing of the page's URL. It can be overridden by subclasses to tweak the URL before it is processed.
hasBinaryContent(String) - Static method in class edu.uci.ics.crawler4j.util.Util
 
hashCode() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
hasPlainTextContent(String) - Static method in class edu.uci.ics.crawler4j.util.Util
 
host - Variable in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
host2directivesCache - Variable in class edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
 
HostDirectives - Class in edu.uci.ics.crawler4j.robotstxt
 
HostDirectives(RobotstxtConfig) - Constructor for class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 
HtmlContentHandler - Class in edu.uci.ics.crawler4j.parser
 
HtmlContentHandler() - Constructor for class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
HtmlParseData - Class in edu.uci.ics.crawler4j.parser
 
HtmlParseData() - Constructor for class edu.uci.ics.crawler4j.parser.HtmlParseData
 
httpClient - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
httpMethod - Variable in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 

I

IdleConnectionMonitorThread - Class in edu.uci.ics.crawler4j.fetcher
 
IdleConnectionMonitorThread(PoolingHttpClientConnectionManager) - Constructor for class edu.uci.ics.crawler4j.fetcher.IdleConnectionMonitorThread
 
increment(String) - Method in class edu.uci.ics.crawler4j.frontier.Counters
 
increment(String, long) - Method in class edu.uci.ics.crawler4j.frontier.Counters
 
init(int, CrawlController) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
Initializes the current instance of the crawler
inProcessPages - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
InProcessPagesDB - Class in edu.uci.ics.crawler4j.frontier
This class maintains the list of pages which are assigned to crawlers but are not yet processed.
InProcessPagesDB(Environment) - Constructor for class edu.uci.ics.crawler4j.frontier.InProcessPagesDB
 
int2ByteArray(int) - Static method in class edu.uci.ics.crawler4j.util.Util
 
IO - Class in edu.uci.ics.crawler4j.util
 
IO() - Constructor for class edu.uci.ics.crawler4j.util.IO
 
isDiscardElement(String) - Method in class edu.uci.ics.crawler4j.parser.AllTagMapper
 
isEmpty() - Method in class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives
 
isEnabled() - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
isFinished() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
isFinished - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
isFinished() - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
isFollowRedirects() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
isIncludeBinaryContentInCrawling() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
isIncludeHttpsPages() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
isNotWaitingForNewURLs() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
isOnlineTldListUpdate() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
isProcessBinaryContentInCrawling() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
isRedirect() - Method in class edu.uci.ics.crawler4j.crawler.Page
 
isResumableCrawling() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
isSeenBefore(String) - Method in class edu.uci.ics.crawler4j.frontier.DocIDServer
 
isShutdownOnEmptyQueue() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
isShuttingDown() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
isTruncated() - Method in class edu.uci.ics.crawler4j.crawler.Page
 
isWildcard() - Method in class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives
 

L

lastFetchTime - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
load(HttpEntity, int) - Method in class edu.uci.ics.crawler4j.crawler.Page
Loads the content of this page from a fetched HttpEntity.
logger - Static variable in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
logger - Static variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
logger - Static variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
logger - Static variable in class edu.uci.ics.crawler4j.fetcher.SniPoolingHttpClientConnectionManager
 
logger - Static variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
logger - Static variable in class edu.uci.ics.crawler4j.parser.Parser
 
logger - Static variable in class edu.uci.ics.crawler4j.robotstxt.PathRule
 
logger - Static variable in class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives
 
loginTarget - Variable in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
long2ByteArray(long) - Static method in class edu.uci.ics.crawler4j.util.Util
 

M

mapSafeAttribute(String, String) - Method in class edu.uci.ics.crawler4j.parser.AllTagMapper
 
mapSafeElement(String) - Method in class edu.uci.ics.crawler4j.parser.AllTagMapper
 
match(String) - Method in class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives
Match the current user agent directive set with the given user agent.
matches(String) - Method in class edu.uci.ics.crawler4j.robotstxt.PathRule
Check if the specified path matches this rule
matchesRobotsPattern(String, String) - Static method in class edu.uci.ics.crawler4j.robotstxt.PathRule
Check if the specified path matches a robots.txt pattern
movedToUrl - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
mutex - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
mutex - Variable in class edu.uci.ics.crawler4j.frontier.Counters
 
mutex - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
mutex - Variable in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
myController - Variable in class edu.uci.ics.crawler4j.crawler.WebCrawler
The controller instance that has created this crawler thread.
myId - Variable in class edu.uci.ics.crawler4j.crawler.WebCrawler
The id associated with the crawler thread running this instance

N

needsRefetch() - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 
Net - Class in edu.uci.ics.crawler4j.util
Created by Avi Hayun on 9/22/2014.
Net() - Constructor for class edu.uci.ics.crawler4j.util.Net
 
newHttpUriRequest(String) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetcher
Creates a new HttpUriRequest for the given url.
newInstance() - Method in interface edu.uci.ics.crawler4j.crawler.CrawlController.WebCrawlerFactory
 
NotAllowedContentException - Exception in edu.uci.ics.crawler4j.parser
Created by Avi on 8/19/2014.
NotAllowedContentException() - Constructor for exception edu.uci.ics.crawler4j.parser.NotAllowedContentException
 
NtAuthInfo - Class in edu.uci.ics.crawler4j.crawler.authentication
Authentication information for Microsoft Active Directory
NtAuthInfo(String, String, String, String) - Constructor for class edu.uci.ics.crawler4j.crawler.authentication.NtAuthInfo
 

O

objectToEntry(WebURL, TupleOutput) - Method in class edu.uci.ics.crawler4j.frontier.WebURLTupleBinding
 
onBeforeExit() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called just before the termination of the current crawler instance.
onContentFetchError(WebURL) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called if the content of a url could not be fetched.
onPageBiggerThanMaxSize(String, long) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called if the content of a url is bigger than the allowed size.
onParseError(WebURL) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called if there has been an error in parsing the content.
onRedirectedStatusCode(Page) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called if the crawler encounters a page with a 3xx status code
onStart() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called just before starting the crawl by this crawler instance.
onUnexpectedStatusCode(String, int, String, String) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called if the crawler encounters an unexpected HTTP status code (a status code other than 3xx)
onUnhandledException(WebURL, Throwable) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
This function is called when an unhandled exception is encountered during fetching
openCursor(Transaction) - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 

P

Page - Class in edu.uci.ics.crawler4j.crawler
This class contains the data for a fetched and parsed page.
Page(WebURL) - Constructor for class edu.uci.ics.crawler4j.crawler.Page
 
PageBiggerThanMaxSizeException - Exception in edu.uci.ics.crawler4j.crawler.exceptions
Created by Avi Hayun on 12/8/2014.
PageBiggerThanMaxSizeException(long) - Constructor for exception edu.uci.ics.crawler4j.crawler.exceptions.PageBiggerThanMaxSizeException
 
pageFetcher - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
 
PageFetcher - Class in edu.uci.ics.crawler4j.fetcher
 
PageFetcher(CrawlConfig) - Constructor for class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
pageFetcher - Variable in class edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
 
PageFetchResult - Class in edu.uci.ics.crawler4j.fetcher
 
PageFetchResult() - Constructor for class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
parse(Page, String) - Method in class edu.uci.ics.crawler4j.parser.Parser
 
parse(String, RobotstxtConfig) - Static method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtParser
 
parseData - Variable in class edu.uci.ics.crawler4j.crawler.Page
The parsed data populated by parsers
ParseData - Interface in edu.uci.ics.crawler4j.parser
 
ParseException - Exception in edu.uci.ics.crawler4j.crawler.exceptions
Created by Avi Hayun on 12/8/2014.
ParseException() - Constructor for exception edu.uci.ics.crawler4j.crawler.exceptions.ParseException
 
Parser - Class in edu.uci.ics.crawler4j.parser
 
Parser(CrawlConfig) - Constructor for class edu.uci.ics.crawler4j.parser.Parser
 
password - Variable in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
PathRule - Class in edu.uci.ics.crawler4j.robotstxt
 
PathRule(int, String) - Constructor for class edu.uci.ics.crawler4j.robotstxt.PathRule
Create a new path rule, based on the specified pattern
pattern - Variable in class edu.uci.ics.crawler4j.robotstxt.PathRule
 
port - Variable in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
PROCESSED_PAGES - Static variable in class edu.uci.ics.crawler4j.frontier.Counters.ReservedCounterNames
 
protocol - Variable in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
put(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.WorkQueues
 
putIntInByteArray(int, byte[], int) - Static method in class edu.uci.ics.crawler4j.util.Util
 

R

redirect - Variable in class edu.uci.ics.crawler4j.crawler.Page
Redirection flag
redirectedToUrl - Variable in class edu.uci.ics.crawler4j.crawler.Page
The URL to which this page will be redirected
removeURL(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.InProcessPagesDB
 
ReservedCounterNames() - Constructor for class edu.uci.ics.crawler4j.frontier.Counters.ReservedCounterNames
 
resolveUrl(String, String) - Static method in class edu.uci.ics.crawler4j.url.UrlResolver
Resolves a given relative URL against a base URL.
responseHeaders - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
robotsPatternToRegexp(String) - Static method in class edu.uci.ics.crawler4j.robotstxt.PathRule
Match a pattern defined in a robots.txt file to a path, following the pattern definition as stated on: https://support.google.com/webmasters/answer/6062596?hl=en&ref_topic=6061961. That page defines the following items: "*" matches any sequence of characters, including "/"; "$" matches the end of the line.
RobotstxtConfig - Class in edu.uci.ics.crawler4j.robotstxt
 
RobotstxtConfig() - Constructor for class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
RobotstxtParser - Class in edu.uci.ics.crawler4j.robotstxt
 
RobotstxtParser() - Constructor for class edu.uci.ics.crawler4j.robotstxt.RobotstxtParser
 
robotstxtServer - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
 
RobotstxtServer - Class in edu.uci.ics.crawler4j.robotstxt
 
RobotstxtServer(RobotstxtConfig, PageFetcher) - Constructor for class edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
 
RobotstxtServer(RobotstxtConfig, PageFetcher, int) - Constructor for class edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
 
run() - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
run() - Method in class edu.uci.ics.crawler4j.fetcher.IdleConnectionMonitorThread
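
The robotsPatternToRegexp(String) entry above describes two wildcard rules: '*' matches any sequence of characters and '$' matches the end of the line. The following is a minimal sketch of how such a translation could look using only java.util.regex. It is not the crawler4j implementation; the class name RobotsPatternSketch and its methods are hypothetical, and treating a '$' that is not the last character as a literal is an assumption of this sketch.

```java
import java.util.regex.Pattern;

/** Illustrative only: not the crawler4j PathRule implementation. */
final class RobotsPatternSketch {

    /** Translates a robots.txt path pattern into a java.util.regex Pattern. */
    static Pattern toRegex(String robotsPattern) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < robotsPattern.length(); i++) {
            char c = robotsPattern.charAt(i);
            boolean last = i == robotsPattern.length() - 1;
            if (c == '*' || (c == '$' && last)) {
                appendQuoted(regex, literal);
                // '*' becomes "any sequence"; a trailing '$' becomes an end anchor.
                regex.append(c == '*' ? ".*" : "$");
            } else {
                // Everything else is matched literally (assumption: so is a '$' in the middle).
                literal.append(c);
            }
        }
        appendQuoted(regex, literal);
        return Pattern.compile(regex.toString());
    }

    /** Appends the pending literal text, quoted so regex metacharacters lose their meaning. */
    private static void appendQuoted(StringBuilder regex, StringBuilder literal) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }

    /** lookingAt() anchors only at the start, so a pattern without '$' acts as a prefix match. */
    static boolean matches(String robotsPattern, String path) {
        return toRegex(robotsPattern).matcher(path).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(matches("/private/*.html$", "/private/a/b.html"));
        System.out.println(matches("/tmp", "/tmp/file"));
    }
}
```

Quoting the literal segments keeps characters such as '.' or '?' in a rule from being interpreted as regular-expression syntax.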
 

S

schedule(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
scheduleAll(List<WebURL>) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
SCHEDULED_PAGES - Static variable in class edu.uci.ics.crawler4j.frontier.Counters.ReservedCounterNames
 
scheduledPages - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
setAnchor(String) - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 
setAnchor(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setAuthenticationType(AuthInfo.AuthenticationType) - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
setAuthInfos(List<AuthInfo>) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setBinaryContent(byte[]) - Method in class edu.uci.ics.crawler4j.parser.BinaryParseData
 
setCacheSize(int) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
setCleanupDelaySeconds(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setConnectionTimeout(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setContentCharset(String) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setContentData(byte[]) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setContentEncoding(String) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setContentType(String) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setCrawlStorageFolder(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
The folder used by the crawler to store intermediate crawl data.
setCustomData(Object) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
setDefaultHeaders(Collection<? extends Header>) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Set the default header collection (creating copies of the provided headers).
setDepth(short) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setDocid(int) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setDocIdServer(DocIDServer) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
setDomain(String) - Method in class edu.uci.ics.crawler4j.crawler.authentication.NtAuthInfo
 
setEnabled(boolean) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
setEntity(HttpEntity) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
setFetchedUrl(String) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
setFetchResponseHeaders(Header[]) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setFollowRedirects(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setFrontier(Frontier) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
setHost(String) - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
setHref(String) - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 
setHtml(String) - Method in class edu.uci.ics.crawler4j.parser.BinaryParseData
 
setHtml(String) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
setHttpMethod(FormSubmitEvent.MethodType) - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
setIgnoreUADiscrimination(boolean) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
setIncludeBinaryContentInCrawling(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setIncludeHttpsPages(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setLanguage(String) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setLoginTarget(String) - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
setMaxConnectionsPerHost(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setMaxDepthOfCrawling(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Maximum depth of crawling. For unlimited depth, set this parameter to -1.
setMaxDownloadSize(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setMaxOutgoingLinksToFollow(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setMaxPagesToFetch(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Maximum number of pages to fetch. For an unlimited number of pages, set this parameter to -1.
setMaxTotalConnections(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setMetaTags(Map<String, String>) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
setMovedToUrl(String) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
setOnlineTldListUpdate(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Should the TLD list be updated automatically on each run? Alternatively, it can be loaded from the embedded tld-names.txt resource file that was obtained from https://publicsuffix.org/list/effective_tld_names.dat
setOutgoingUrls(Set<WebURL>) - Method in class edu.uci.ics.crawler4j.parser.BinaryParseData
 
setOutgoingUrls(Set<WebURL>) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
setOutgoingUrls(Set<WebURL>) - Method in interface edu.uci.ics.crawler4j.parser.ParseData
 
setOutgoingUrls(Set<WebURL>) - Method in class edu.uci.ics.crawler4j.parser.TextParseData
 
setPageFetcher(PageFetcher) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
setParentDocid(int) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setParentUrl(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setParseData(ParseData) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setPassword(String) - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
setPasswordFormStr(String) - Method in class edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo
 
setPath(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setPolitenessDelay(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Politeness delay in milliseconds (delay between sending two requests to the same host).
setPort(int) - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
setPriority(byte) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setProcessBinaryContentInCrawling(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Should we process binary content such as images, audio, ...
setProcessed(WebURL) - Method in class edu.uci.ics.crawler4j.frontier.Frontier
 
setProtocol(String) - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
setProxyHost(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setProxyPassword(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
If the crawler runs behind a proxy that requires a username and password, this parameter specifies the password.
setProxyPort(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setProxyUsername(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setRedirect(boolean) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setRedirectedToUrl(String) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setResponseHeaders(Header[]) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
setResumableCrawling(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
If this feature is enabled, you can resume a previously stopped or crashed crawl.
setRobotstxtServer(RobotstxtServer) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
setShutdownOnEmptyQueue(boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Should the crawler stop running when the queue is empty?
setSocketTimeout(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setStatusCode(int) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
setStatusCode(int) - Method in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
 
setTag(String) - Method in class edu.uci.ics.crawler4j.parser.ExtractedUrlAnchorPair
 
setTag(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setText(String) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
setTextContent(String) - Method in class edu.uci.ics.crawler4j.parser.TextParseData
 
setThread(Thread) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
 
setThreadMonitoringDelaySeconds(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setThreadShutdownDelaySeconds(int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
setTitle(String) - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
setURL(String) - Method in class edu.uci.ics.crawler4j.url.WebURL
 
setUseOnline(boolean) - Static method in class edu.uci.ics.crawler4j.url.TLDList
If online is set to true, the list of TLDs will be downloaded and refreshed; otherwise the copy cached in src/main/resources/tld-names.txt will be used.
setUserAgent(String) - Method in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
Change the user agent string used to crawl after initialization.
setUserAgentName(String) - Method in class edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
 
setUserAgentString(String) - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
The user-agent string used to identify your crawler to web servers.
setUsername(String) - Method in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
setUsernameFormStr(String) - Method in class edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo
 
setValue(String, long) - Method in class edu.uci.ics.crawler4j.frontier.Counters
 
setWebURL(WebURL) - Method in class edu.uci.ics.crawler4j.crawler.Page
 
shouldFollowLinksIn(WebURL) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
Determine whether links found at the given URL should be added to the queue for crawling.
shouldVisit(Page, WebURL) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
Classes that extend WebCrawler should override this function to tell the crawler whether the given URL should be crawled.
shutdown() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Set the current crawling session to 'shutdown'.
shutdown() - Method in class edu.uci.ics.crawler4j.fetcher.IdleConnectionMonitorThread
 
shutDown() - Method in class edu.uci.ics.crawler4j.fetcher.PageFetcher
 
shuttingDown - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
Whether the crawling session is set to 'shutdown'.
sleep(int) - Static method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
SniPoolingHttpClientConnectionManager - Class in edu.uci.ics.crawler4j.fetcher
Class to work around the exception thrown by the SSL subsystem when the server is incorrectly configured for SNI.
SniPoolingHttpClientConnectionManager(Registry<ConnectionSocketFactory>) - Constructor for class edu.uci.ics.crawler4j.fetcher.SniPoolingHttpClientConnectionManager
 
SniSSLConnectionSocketFactory - Class in edu.uci.ics.crawler4j.fetcher
Class to work around the exception thrown by the SSL subsystem when the server is incorrectly configured for SNI.
SniSSLConnectionSocketFactory(SSLContext, HostnameVerifier) - Constructor for class edu.uci.ics.crawler4j.fetcher.SniSSLConnectionSocketFactory
 
start(Class<T>, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Start the crawling session and wait for it to finish.
start(CrawlController.WebCrawlerFactory<T>, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Start the crawling session and wait for it to finish.
start(CrawlController.WebCrawlerFactory<T>, int, boolean) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
 
startElement(String, String, String, Attributes) - Method in class edu.uci.ics.crawler4j.parser.HtmlContentHandler
 
startNonBlocking(CrawlController.WebCrawlerFactory<T>, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Start the crawling session and return immediately.
startNonBlocking(Class<T>, int) - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Start the crawling session and return immediately.
statisticsDB - Variable in class edu.uci.ics.crawler4j.frontier.Counters
 
statusCode - Variable in class edu.uci.ics.crawler4j.crawler.Page
Status of the page
statusCode - Variable in class edu.uci.ics.crawler4j.fetcher.PageFetchResult
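
The setPolitenessDelay(int) entry above describes a delay, in milliseconds, between two requests sent to the same host. The sketch below illustrates that idea in isolation: it records when each host was last contacted and reports how long a caller should wait before the next request. It is not how crawler4j implements the setting; the class PolitenessSketch, the method reserve, and the per-host map are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a per-host politeness timer, not crawler4j code. */
final class PolitenessSketch {
    private final long delayMillis;
    // Host name -> time (in ms) at which the most recent request was, or will be, sent.
    private final Map<String, Long> lastRequestAt = new HashMap<>();

    PolitenessSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Returns how many milliseconds the caller should wait before contacting
     * the host, and records the time at which that request will go out.
     */
    synchronized long reserve(String host, long nowMillis) {
        Long last = lastRequestAt.get(host);
        long wait = (last == null) ? 0 : Math.max(0, last + delayMillis - nowMillis);
        lastRequestAt.put(host, nowMillis + wait);
        return wait;
    }

    public static void main(String[] args) {
        PolitenessSketch politeness = new PolitenessSketch(200);
        System.out.println(politeness.reserve("a.example", 1000)); // first contact: 0
        System.out.println(politeness.reserve("a.example", 1050)); // 150 ms still to wait
        System.out.println(politeness.reserve("b.example", 1050)); // other host: 0
    }
}
```

Passing the current time as a parameter, rather than reading the system clock inside the method, keeps the sketch deterministic and easy to check.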
 

T

TextParseData - Class in edu.uci.ics.crawler4j.parser
 
TextParseData() - Constructor for class edu.uci.ics.crawler4j.parser.TextParseData
 
TLDList - Class in edu.uci.ics.crawler4j.url
This class is a singleton that obtains a list of TLDs (from an online source or a local file) in order to compare against those TLDs.
toByteArray(HttpEntity, int) - Method in class edu.uci.ics.crawler4j.crawler.Page
Read contents from an entity, with a specified maximum.
toString() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
 
toString() - Method in class edu.uci.ics.crawler4j.parser.BinaryParseData
 
toString() - Method in class edu.uci.ics.crawler4j.parser.HtmlParseData
 
toString() - Method in interface edu.uci.ics.crawler4j.parser.ParseData
 
toString() - Method in class edu.uci.ics.crawler4j.parser.TextParseData
 
toString() - Method in class edu.uci.ics.crawler4j.url.WebURL
 
truncated - Variable in class edu.uci.ics.crawler4j.crawler.Page
Whether the content was truncated because the received data exceeded the imposed maximum
type - Variable in class edu.uci.ics.crawler4j.robotstxt.PathRule
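
The toByteArray(HttpEntity, int) and truncated entries above describe reading content up to a specified maximum and flagging when the received data exceeded it. The following sketch shows one way such a bounded read could work. It reads from a plain java.io.InputStream instead of an HttpEntity so that it needs no external library; the class BoundedReadSketch and its Result type are hypothetical and do not reflect the crawler4j source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative only: bounded read with a truncation flag, not crawler4j code. */
final class BoundedReadSketch {

    /** The bytes that were kept, plus whether more data was available beyond the limit. */
    static final class Result {
        final byte[] data;
        final boolean truncated;

        Result(byte[] data, boolean truncated) {
            this.data = data;
            this.truncated = truncated;
        }
    }

    /** Reads at most maxBytes from the stream and reports whether content was cut off. */
    static Result read(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
            if (n == -1) {
                // Stream ended before the limit was reached: nothing was cut off.
                return new Result(out.toByteArray(), false);
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
        // Limit reached: one more read tells us whether any data was left over.
        boolean truncated = in.read() != -1;
        return new Result(out.toByteArray(), truncated);
    }

    public static void main(String[] args) throws IOException {
        Result r = read(new ByteArrayInputStream(new byte[10]), 4);
        System.out.println(r.data.length + " bytes, truncated=" + r.truncated);
    }
}
```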
 

U

UNDEFINED - Static variable in class edu.uci.ics.crawler4j.robotstxt.HostDirectives
 
url - Variable in class edu.uci.ics.crawler4j.crawler.Page
The URL of this page.
URLCanonicalizer - Class in edu.uci.ics.crawler4j.url
See http://en.wikipedia.org/wiki/URL_normalization for a reference. Note: some parts of the code are adapted from: http://stackoverflow.com/a/4057470/405418
URLCanonicalizer() - Constructor for class edu.uci.ics.crawler4j.url.URLCanonicalizer
 
UrlResolver - Class in edu.uci.ics.crawler4j.url
 
UrlResolver() - Constructor for class edu.uci.ics.crawler4j.url.UrlResolver
 
UserAgentDirectives - Class in edu.uci.ics.crawler4j.robotstxt
The UserAgentDirectives class stores the configuration for a single user agent as defined in the robots.txt.
UserAgentDirectives(Set<String>) - Constructor for class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives
Create a UserAgentDirectives clause
UserAgentDirectives.UserAgentComparator - Class in edu.uci.ics.crawler4j.robotstxt
 
userAgents - Variable in class edu.uci.ics.crawler4j.robotstxt.UserAgentDirectives
 
username - Variable in class edu.uci.ics.crawler4j.crawler.authentication.AuthInfo
 
Util - Class in edu.uci.ics.crawler4j.util
 
Util() - Constructor for class edu.uci.ics.crawler4j.util.Util
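
The UrlResolver and URLCanonicalizer entries above concern resolving a relative URL against a base URL and normalizing the result. As a rough illustration of the same idea using only the JDK, the sketch below relies on java.net.URI. It is not the crawler4j implementation, the class UrlResolveSketch and its resolve method are hypothetical, and java.net.URI follows RFC 2396, so its results may differ from crawler4j or from browsers in edge cases.

```java
import java.net.URI;

/** Illustrative only: relative URL resolution with java.net.URI, not crawler4j code. */
final class UrlResolveSketch {

    /** Resolves relativeUrl against baseUrl and removes "." and ".." path segments. */
    static String resolve(String baseUrl, String relativeUrl) {
        return URI.create(baseUrl).resolve(relativeUrl).normalize().toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/a/b/c.html";
        System.out.println(resolve(base, "../d.html"));
        System.out.println(resolve(base, "/x/y"));
        System.out.println(resolve(base, "http://other.example/z"));
    }
}
```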
 

V

validate() - Method in class edu.uci.ics.crawler4j.crawler.CrawlConfig
Validates the configs specified by this instance.
valueOf(String) - Static method in enum edu.uci.ics.crawler4j.crawler.authentication.AuthInfo.AuthenticationType
Returns the enum constant of this type with the specified name.
values() - Static method in enum edu.uci.ics.crawler4j.crawler.authentication.AuthInfo.AuthenticationType
Returns an array containing the constants of this enum type, in the order they are declared.
visit(Page) - Method in class edu.uci.ics.crawler4j.crawler.WebCrawler
Classes that extend WebCrawler should override this function to process the content of the fetched and parsed page.
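
Subclasses of WebCrawler decide which URLs to crawl in shouldVisit(Page, WebURL) and handle fetched pages in visit(Page). The sketch below shows, as a plain function with no crawler4j dependency, the kind of decision a shouldVisit override might make: skip URLs that end in common static or binary file extensions and stay within one site. The class VisitFilterSketch, the extension list, and the allowedPrefix parameter are all assumptions chosen for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustrative only: a URL filter of the kind a shouldVisit override might apply. */
final class VisitFilterSketch {

    // Hypothetical list of extensions to skip; adjust to the crawl's needs.
    private static final Pattern SKIPPED_EXTENSIONS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz)$");

    /** Returns true if the URL is under allowedPrefix and does not end in a skipped extension. */
    static boolean shouldVisit(String url, String allowedPrefix) {
        String href = url.toLowerCase(Locale.ROOT);
        return !SKIPPED_EXTENSIONS.matcher(href).matches()
                && href.startsWith(allowedPrefix.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String site = "https://www.example.com/";
        System.out.println(shouldVisit("https://www.example.com/docs/index.html", site));
        System.out.println(shouldVisit("https://www.example.com/logo.PNG", site));
        System.out.println(shouldVisit("https://other.example/page", site));
    }
}
```

In a real crawler the same check would typically run inside the overridden shouldVisit method, using the URL string taken from the WebURL argument.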

W

waitingList - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
waitingLock - Variable in class edu.uci.ics.crawler4j.crawler.CrawlController
 
waitUntilFinish() - Method in class edu.uci.ics.crawler4j.crawler.CrawlController
Wait until this crawling session finishes.
WebCrawler - Class in edu.uci.ics.crawler4j.crawler
The WebCrawler class is the Runnable class that is executed by each crawler thread.
WebCrawler() - Constructor for class edu.uci.ics.crawler4j.crawler.WebCrawler
 
WebURL - Class in edu.uci.ics.crawler4j.url
 
WebURL() - Constructor for class edu.uci.ics.crawler4j.url.WebURL
 
WebURLTupleBinding - Class in edu.uci.ics.crawler4j.frontier
 
WebURLTupleBinding() - Constructor for class edu.uci.ics.crawler4j.frontier.WebURLTupleBinding
 
workQueues - Variable in class edu.uci.ics.crawler4j.frontier.Frontier
 
WorkQueues - Class in edu.uci.ics.crawler4j.frontier
 
WorkQueues(Environment, String, boolean) - Constructor for class edu.uci.ics.crawler4j.frontier.WorkQueues
 

Copyright © 2017. All rights reserved.