Discussion:
[General] Webboard: Index full html code in DDB
b***@mnogosearch.org
2016-05-30 13:19:44 UTC
Author: rafikCyc
Email: ***@gmail.com
Message:
Hello,

I would like to crawl the whole HTML code for each URL.

Is there any way to do this?

I've tried this in indexer.conf, but it doesn't work:

Section headhtml 25 2058 "<head([^>]*)>(*.)</head>" $2
Section bodyhtml 26 2058 "<body([^>]*)>(*.)</body>" $2
Section htmlcode 25 2058 "<html([^>]*)>(*.)</html>" $2

Section body 1 2018 afterheaders html
gets the body, but with all HTML tags stripped out :(
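
A side note on the three regex sections above: each one writes the capture group as (*.), with the quantifier in front of the dot, where (.*) was presumably intended. The sketch below uses Python's re module only to illustrate that difference; mnogoSearch's own regex engine may treat the pattern differently, and nothing in this thread confirms that correcting it would make the indexer keep raw markup. The sample page and the helper names is_valid and capture are hypothetical.

```python
import re

# Hypothetical sample page, used only for illustration.
SAMPLE = """<html lang="en">
<head><title>Demo</title></head>
<body class="main"><p>Hello <b>world</b></p></body>
</html>"""


def is_valid(pattern):
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def capture(tag, html):
    """Return the raw markup between <tag ...> and </tag>, or None."""
    pattern = r"<%s([^>]*)>(.*)</%s>" % (tag, tag)
    # DOTALL lets "." span line breaks; IGNORECASE accepts <BODY> as well.
    match = re.search(pattern, html, re.DOTALL | re.IGNORECASE)
    return match.group(2) if match else None


# Pattern as posted: the "*" has nothing to repeat, so Python rejects it.
print(is_valid(r"<head([^>]*)>(*.)</head>"))
# Presumed intent: "." followed by "*".
print(is_valid(r"<head([^>]*)>(.*)</head>"))
# With the presumed intent, the tags inside <body> are kept as-is.
print(capture("body", SAMPLE))
```

In Python the posted form fails to compile at all, while the reordered form captures the markup with its tags intact.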


Thank you for your help


Reply: <http://www.mnogosearch.org/board/message.php?id=21772>
b***@mnogosearch.org
2016-05-30 19:13:50 UTC
Author: Alexander Barkov
Email: ***@mnogosearch.org
Message:
Hello,
Post by b***@mnogosearch.org
Hello,
I would like to crawl the whole HTML code for each URL.
Perhaps a cached copy is what you're looking for.
In 3.4.x, cached copies are stored in a separate table, "cachedcopy".
Cached copies are compressed by default, but compression can
be switched off:

http://www.mnogosearch.org/doc34/msearch-cmdref-cachedcopyencoding.html
Reply: <http://www.mnogosearch.org/board/message.php?id=21773>