Method and system for incremental web crawling

A Web crawler creates an index of documents in a document store on a computer network. In an initial crawl, the crawler creates a first full index for the document store. This first full crawl is based on a set of predefined "seed" URLs and crawl restrictions, and involves recursively retrieving each folder and document directly or indirectly linked to the seed URLs. While creating the first full index, the crawler also creates a History Table listing the URL of each folder and document found in the first full crawl. The History Table also records a local commit time (LCT) for each document and, for each folder, a deleted documents count (DDC) and an LCT or maximum LCT (MLCT); this assumes that the store supports a folder hierarchy and the MLCT, LCT and DDC properties. Thereafter, in an incremental crawl, the crawler determines, for each folder, (1) whether the DDC for that folder has changed and (2) whether the MLCT is more recent than the corresponding value in the History Table. If the DDC has changed, the crawler obtains a full list of items (URLs) in that folder and compares it with the URLs in the History Table to identify the deleted documents, which are then removed from the History Table and the index. If the MLCT is more recent, the crawler queries the document store for the URLs of linked documents having an LCT more recent than the MLCT in the History Table for the folder. The History Table and index are then updated to reflect the changes to the document store.
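The two per-folder checks described in the abstract (DDC changed, MLCT more recent) can be illustrated with a minimal sketch. This is not the patented implementation: the `Folder` and `FolderHistory` classes, the in-memory `store` dictionary, the integer timestamps, and the function names are all hypothetical stand-ins for a real document store and index, and the folder hierarchy is flattened to a single mapping from folder URL to folder.

```python
from dataclasses import dataclass, field


@dataclass
class Folder:
    """Hypothetical store-side folder exposing DDC and MLCT properties."""
    ddc: int = 0    # deleted documents count
    mlct: int = 0   # maximum local commit time seen in this folder
    docs: dict = field(default_factory=dict)  # document URL -> LCT

    def commit(self, url, lct):
        """Add or modify a document; MLCT never moves backwards."""
        self.docs[url] = lct
        self.mlct = max(self.mlct, lct)

    def delete(self, url):
        """Remove a document and bump the deleted documents count."""
        del self.docs[url]
        self.ddc += 1


@dataclass
class FolderHistory:
    """History Table entry for one folder, as of the previous crawl."""
    ddc: int
    mlct: int
    docs: dict  # document URL -> LCT recorded at the previous crawl


def full_crawl(store):
    """Initial crawl: record every folder and document in a new History Table."""
    return {url: FolderHistory(f.ddc, f.mlct, dict(f.docs))
            for url, f in store.items()}


def incremental_crawl(store, history):
    """Update `history` in place; return sorted (deleted, changed) URL lists."""
    deleted, changed = [], []
    for url, folder in store.items():
        # A folder not seen before starts with MLCT 0, so all its documents
        # are picked up by check (2), assuming every LCT is greater than 0.
        entry = history.setdefault(url, FolderHistory(folder.ddc, 0, {}))

        # Check (1): DDC changed, so list the folder and diff against history.
        if folder.ddc != entry.ddc:
            gone = set(entry.docs) - set(folder.docs)
            for doc_url in gone:
                del entry.docs[doc_url]
            deleted.extend(gone)
            entry.ddc = folder.ddc

        # Check (2): MLCT is more recent, so fetch only documents whose LCT
        # is newer than the MLCT stored in the History Table.
        if folder.mlct > entry.mlct:
            newer = {u: t for u, t in folder.docs.items() if t > entry.mlct}
            entry.docs.update(newer)
            changed.extend(newer)
            entry.mlct = folder.mlct
    return sorted(deleted), sorted(changed)
```

In this sketch, a caller would run `full_crawl(store)` once, keep the returned table, and then call `incremental_crawl(store, history)` on each later pass, re-indexing the `changed` URLs and dropping the `deleted` ones from the index. A folder whose DDC and MLCT both match the History Table is skipped without listing its contents, which is the saving the abstract describes.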
