Discussion:
[General] Webboard: tagging or categorizing without crawling again
b***@mnogosearch.org
2014-12-05 14:07:28 UTC
Author: bruno
Email: ***@gmail.com
Message:
Hi Alexander, and big congrats on the amazing tool you've built.
I intend to use it as an SEO tool, but I ran into an issue: I would like to
tag or categorize the URLs after having already fetched the content, but I
can't figure out how to do it.
We sometimes get the structure wrong, and it's really a pain to have to
crawl the whole site again to rebuild the categorization when the URLs are
already in the database.

Many thanks for your help!
kind regards,
Bruno

Reply: <http://www.mnogosearch.org/board/message.php?id=21666>
b***@mnogosearch.org
2014-12-06 18:34:18 UTC
Author: Alexander Barkov
Email: ***@mnogosearch.org
Message:
Hi Bruno,
How would you like to tag? Manually? Or in some automated way,
using document properties (e.g. document words, URL, etc)?


Reply: <http://www.mnogosearch.org/board/message.php?id=21667>
b***@mnogosearch.org
2014-12-08 17:17:04 UTC
Author: bruno
Email: ***@gmail.com
Message:
Thanks for your reply,

It would be by using document properties.
Actually, the way tags and categories work is perfect, but I don't want
to crawl the whole site again just because I didn't write my tagging rule
correctly the first time.

Many thanks!
Bruno

Reply: <http://www.mnogosearch.org/board/message.php?id=21668>
b***@mnogosearch.org
2014-12-12 05:07:45 UTC
Author: Alexander Barkov
Post by b***@mnogosearch.org
Actually, the way tags and categories work is perfect, but I don't want
to crawl the whole site again just because I didn't write my tagging rule
correctly the first time.
This task consists of two parts:

a. Update what you have in the tables "server" and "srvinfo".
This is done automatically when you start crawling;
"indexer -n0" will do this. Note that this alone is enough when you just
need to rename a tag to a new value.

But usually this is not enough,
as you might want to redistribute documents between tags
(i.e. split a single tag into multiple ones, or join multiple tags
into a single one, or do some more complex redistribution).
In these cases part "b" is also needed.


b. Update the table "url" to refer to the table "server" properly.
There is no special command for this. Normally, documents are
updated properly only the next time they are crawled.
But there is a trick: use the "skip" option temporarily
to avoid actually downloading anything.
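
To illustrate why part "a" alone covers a rename but not a split, here is a minimal Python sketch using an in-memory SQLite database. It is only a toy model: the tables contain just the columns mentioned in this thread, the small integer ids are made up (real ids look like the large numbers in the UPDATE shown further down), and the hand-written UPDATE statements stand in for what indexer does. It is not a suggestion to edit the real tables by hand.

```python
import sqlite3

# Toy stand-ins for the "server" and "url" tables. Only the columns
# named in this thread are modelled; the real tables have more.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE server (rec_id INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE url (rec_id INTEGER PRIMARY KEY, url TEXT, server_id INTEGER);
    INSERT INTO server VALUES (1, 'doc');
    INSERT INTO url VALUES (10, 'http://host/doc/a/1.html', 1);
    INSERT INTO url VALUES (11, 'http://host/doc/b/1.html', 1);
""")


def tags_by_url(conn):
    """Return {url: tag} by joining url.server_id to server.rec_id."""
    rows = conn.execute(
        "SELECT url.url, server.tag FROM url, server "
        "WHERE url.server_id = server.rec_id"
    )
    return dict(rows.fetchall())


# Part "a" only: renaming a tag changes one row in "server", and every
# document that points at that row picks up the new name.
conn.execute("UPDATE server SET tag = 'docs' WHERE rec_id = 1")
renamed = tags_by_url(conn)

# A split needs part "b" as well: the new "server" rows exist, but the
# documents keep the old tag until url.server_id is repointed.
conn.execute("INSERT INTO server VALUES (2, 'doca')")
conn.execute("INSERT INTO server VALUES (3, 'docb')")
before_repoint = tags_by_url(conn)

conn.execute("UPDATE url SET server_id = 2 WHERE url LIKE 'http://host/doc/a/%'")
conn.execute("UPDATE url SET server_id = 3 WHERE url LIKE 'http://host/doc/b/%'")
after_repoint = tags_by_url(conn)

print(renamed)
print(before_repoint)
print(after_repoint)
```

In this toy model both documents still show the old tag after the new "server" rows are added, and only the change to url.server_id moves them to "doca" and "docb".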


Suppose you want to split a section of your site
into two subsections and assign a different tag to each.

What you do is:

1. Change indexer.conf:

# Remove the old command
Tag doc
Server http://host/doc/


# And add two new commands instead
Tag doca
Server skip http://host/doc/a/

Tag docb
Server skip http://host/doc/b/


Notice the "skip" option in the new commands.


2. Run "indexer -am -u 'http://host/doc/%'"

It will kind of "crawl" all documents, but without actually downloading them.
In fact it will do nothing but execute a query like this
for every document:

UPDATE url SET status=200,next_index_time=1418965297, site_id=-1519382294,server_id=-1738492707 WHERE rec_id=259;


3. Make sure not to forget to remove the "skip" options
from the new "Server" commands in indexer.conf.

4. Check that everything went well:
SELECT server.tag,url.url FROM url,server WHERE url.server_id=server.rec_id;
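
On a large site the full listing from step 4 can be long, so a per-tag count may be easier to read. Below is a minimal Python sketch of such a check. The tag_report helper is hypothetical (it is not part of mnoGoSearch), the in-memory SQLite rows are invented for the demonstration, and only the columns named in the query above are assumed. Against a real installation you would pass a connection to your own database instead.

```python
import sqlite3


def tag_report(conn):
    """Return ({tag: document count}, [URLs with no matching server row])."""
    counts = dict(conn.execute(
        "SELECT server.tag, COUNT(*) FROM url, server "
        "WHERE url.server_id = server.rec_id GROUP BY server.tag"
    ).fetchall())
    # Documents whose server_id matches no "server" row were probably
    # missed by the re-tagging run, so list them separately.
    orphans = [row[0] for row in conn.execute(
        "SELECT url.url FROM url LEFT JOIN server "
        "ON url.server_id = server.rec_id "
        "WHERE server.rec_id IS NULL ORDER BY url.url"
    )]
    return counts, orphans


# Invented demo data standing in for a database after the split.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE server (rec_id INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE url (rec_id INTEGER PRIMARY KEY, url TEXT, server_id INTEGER);
    INSERT INTO server VALUES (2, 'doca');
    INSERT INTO server VALUES (3, 'docb');
    INSERT INTO url VALUES (10, 'http://host/doc/a/1.html', 2);
    INSERT INTO url VALUES (11, 'http://host/doc/a/2.html', 2);
    INSERT INTO url VALUES (12, 'http://host/doc/b/1.html', 3);
    INSERT INTO url VALUES (13, 'http://host/other/x.html', 99);
""")

counts, orphans = tag_report(conn)
print(counts)
print(orphans)
```

With the demo rows above, the report counts two documents under "doca" and one under "docb", and lists the one URL whose server_id points at no "server" row.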




Reply: <http://www.mnogosearch.org/board/message.php?id=21669>
b***@mnogosearch.org
2014-12-12 13:40:53 UTC
Author: bruno
Email: ***@gmail.com
Message:
Thank you Alexander, that was exactly what I was looking for!
Kind regards,
Bruno

Reply: <http://www.mnogosearch.org/board/message.php?id=21670>
