Discussion:
[General] Webboard: index only new pages
b***@mnogosearch.org
2017-04-10 23:40:36 UTC
Author: Jeff Dwork
Email: ***@gmail.com
Message:
I'm indexing a mailing list archive. Pages never change. Every
week a few pages are added in a new directory. The archive is on
the same machine as the index, so my server directive is

Server https://domain.com/msgs/ file:///var/www/domain/msgs/

I ran a full index (indexer --drop; indexer --create; indexer -a)
after creating the archive. The next week I add new messages in a
new directory (for example: /var/www/domain/msgs/v117n013/). I
cannot get the new pages indexed. I tried 'indexer' with no
options and several variations on
indexer -a -u '%/v117n013/%'
all report 0 documents indexed.
So I have to run another full index.

How can I get only the new pages indexed?

Thanks,
Jeff

Reply: <http://www.mnogosearch.org/board/message.php?id=21817>
b***@mnogosearch.org
2017-04-11 09:55:56 UTC
Author: Alexander Barkov
Post by b***@mnogosearch.org
I'm indexing a mailing list archive. Pages never change. Every
week a few pages are added in a new directory. The archive is on
the same machine as the index, so my server directive is
Server https://domain.com/msgs/ file:///var/www/domain/msgs/
I ran a full index (indexer --drop; indexer --create; indexer -a)
after creating the archive. The next week I add new messages in a
new directory (for example: /var/www/domain/msgs/v117n013/). I
cannot get the new pages indexed. I tried 'indexer' with no
options and several variations on
indexer -a -u '%/v117n013/%'
all report 0 documents indexed.
So I have to run another full index.
How can I get only the new pages indexed?
You need to re-crawl the index page:

indexer -am -u https://domain.com/msgs/

Then you can run it like this:

indexer -u '%/v117n013/%'


Btw, don't forget to set Period to some huge value.
Reply: <http://www.mnogosearch.org/board/message.php?id=21818>
b***@mnogosearch.org
2017-04-12 09:08:42 UTC
Author: Jeff Dwork
Email: ***@gmail.com
Message:
Unfortunately it did not work, but I found a working method.

I added 'Period 30y' before my 'Server' command in config file and
did
indexer --drop
indexer --create
indexer -a
It ran forever. I killed it (ctrl-C) and it reported crawling over
500000 pages - there are about 16000 pages on the site.

I removed the 'Period' command and reindexed the site. I then
added a new directory with the newest pages and did:

indexer -ai -u 'https://domain.com/msgs/v117n014/%.html'
indexer --index

This processed only the new pages and correctly added them to the
index.

Thanks,
Jeff

Reply: <http://www.mnogosearch.org/board/message.php?id=21819>
b***@mnogosearch.org
2017-04-13 05:42:48 UTC
Author: Alexander Barkov
Post by b***@mnogosearch.org
Unfortunately it did not work, but I found a working method.
I added 'Period 30y' before my 'Server' command in config file and
did
indexer --drop
indexer --create
indexer -a
It ran forever. I killed it (ctrl-C) and it reported crawling over
500000 pages - there are about 16000 pages on the site.
It seems 30y causes an integer overflow.
Should work with "Period 1y".
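
One possible reading of the overflow, as a rough sketch only: it assumes that "y" means 365 days and that the next-index time is kept as a signed 32-bit epoch value. Neither assumption is confirmed in this thread, and the function name and the fixed timestamp below are made up for the illustration.

```shell
#!/bin/sh
# Assumptions (not confirmed by the thread): a "y" is 365 days, and the
# next-index timestamp is stored as a signed 32-bit number of seconds.
INT32_MAX=2147483647
YEAR=$((365 * 24 * 60 * 60))

# fits_int32 NOW YEARS: succeeds if NOW + YEARS years still fits in 32 bits.
fits_int32() {
    [ $(( $1 + $2 * YEAR )) -le "$INT32_MAX" ]
}

now=1491955200   # 2017-04-12 00:00:00 UTC, fixed so the result is repeatable
for years in 1 30; do
    if fits_int32 "$now" "$years"; then
        echo "Period ${years}y: fits"
    else
        echo "Period ${years}y: overflows"
    fi
done
```

Under those assumptions, "now + 30y" lands past the 32-bit limit while "now + 1y" does not, which would be consistent with the suggestion to use "Period 1y".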
Post by b***@mnogosearch.org
I removed the 'Period' command and reindexed the site. I then
indexer -ai -u 'https://domain.com/msgs/v117n014/%.html'
indexer --index
The above command will insert 'https://domain.com/msgs/v117n014/%.html' into the database. This is probably not what you need.


It should be:

indexer -ai -u 'https://domain.com/msgs/v117n014/'
indexer --index
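
The two commands above could be wrapped in a small weekly script, sketched here. BASE_URL, the function names and the DRY_RUN switch are hypothetical helpers for this example; only the two indexer invocations come from this thread.

```shell
#!/bin/sh
# Hypothetical wrapper around the two commands above.
# BASE_URL mirrors the Server directive quoted earlier in the thread.
BASE_URL="https://domain.com/msgs/"

# new_dir_url DIR: build the URL of a newly added archive directory,
# always ending in exactly one slash.
new_dir_url() {
    printf '%s%s/\n' "$BASE_URL" "${1%/}"
}

# index_new_dir DIR: insert the directory URL, then run the index step.
# With DRY_RUN set, the commands are printed instead of executed.
index_new_dir() {
    url=$(new_dir_url "$1")
    ${DRY_RUN:+echo} indexer -ai -u "$url"
    ${DRY_RUN:+echo} indexer --index
}

# Run only when a directory name is given on the command line.
if [ "$#" -gt 0 ]; then
    index_new_dir "$1"
fi
```

Saved under some name such as index-new.sh (the name is arbitrary), it could be tried first with DRY_RUN=1 sh index-new.sh v117n014 to see the commands it would run.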
Reply: <http://www.mnogosearch.org/board/message.php?id=21820>
