Discussion:
Webboard: Saving html code in database
b***@mnogosearch.org
2013-12-10 18:44:22 UTC
Permalink
Author: fasfuuiios
Email:
Message:
I'm trying to use mnogosearch as a simple parser because, in my opinion, it is
much better than other scripts created specifically for data extraction and
analysis. Is it possible to store the full HTML code in the database using
"Section"? I have tried, but it always strips HTML tags. CachedCopy looks
encrypted. I want to save full pages and then explore the dump with a prepared
parser to extract structured data.

If such a thing is not possible with "Section" by default, which source
code files should I explore? Is any simple hack possible?

Reply: <http://www.mnogosearch.org/board/message.php?id=21607>
b***@mnogosearch.org
2013-12-10 21:20:54 UTC
Permalink
Author: Alexander Barkov
Post by b***@mnogosearch.org
I'm trying to use mnogosearch as a simple parser because, in my opinion, it is
much better than other scripts created specifically for data extraction and
analysis. Is it possible to store the full HTML code in the database using
"Section"? I have tried, but it always strips HTML tags. CachedCopy looks
encrypted.
It's compressed content (using "deflate"), then wrapped in base64.
So to get the full HTML code, you can base64-decode it, followed by
zlib's inflate. This needs some programming. A simple PHP program
should do the trick.
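
Something along these lines should work as a starting point (a minimal
sketch, not tested against real data; $cachedcopy is assumed to already
hold the column value, and since it is not confirmed here whether the
stored stream is raw deflate or zlib-wrapped, both variants are tried):

<?php
// Decode a 3.3 CachedCopy value fetched from the database.
// $cachedcopy is assumed to hold the raw column value.
$raw = base64_decode($cachedcopy);
$html = @gzinflate($raw);            // raw "deflate" stream
if ($html === false) {
    $html = @gzuncompress($raw);     // zlib-wrapped stream
}
echo $html;
?>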

Alternatively, you can extract cached copies using search.cgi,
like this:
./search.cgi "&cc=1&URL=http://www.site.com/test.html"
Post by b***@mnogosearch.org
I want to save full pages and
then explore the dump with a prepared parser to extract structured data.
If such a thing is not possible with "Section" by default, which source
code files should I explore? Is any simple hack possible?
Storing the original HTML code is possible in version 3.4.
You can download a pre-release of 3.4.0 from here:
http://www.mnogosearch.org/Download/mnogosearch-3.4.0.tar.gz

3.4 stores cached copies differently (compared to 3.3):
- in a new table "urlinfob", separately from the "Section" values.
- without base64 encoding (in a "BLOB" instead of "TEXT" column)
- compressed by default using deflate,
but with an option to switch compression off.

To store cached copies uncompressed, add this command
into indexer.conf:

CachedCopyEncoding identity

Note, the table name "urlinfob" will probably change to "cachedcopy"
in the final 3.4.0 release.
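
If you prefer to read the new table directly rather than going through
search.cgi, a rough PHP sketch could look like the following (the column
names "rec_id" and "content" and the connection details are hypothetical,
for illustration only; check the actual 3.4 schema, and remember the table
may be renamed as noted above; with "CachedCopyEncoding identity" the
inflate step is not needed):

<?php
// Rough sketch for 3.4: fetch a cached copy straight from the new table.
// Table name "urlinfob" is from above; column names "rec_id" and "content"
// are assumptions -- check the real schema of your installation.
$pdo = new PDO('mysql:host=localhost;dbname=mnogosearch', 'user', 'password');
$stmt = $pdo->prepare('SELECT content FROM urlinfob WHERE rec_id = ?');
$stmt->execute(array(42));
$blob = $stmt->fetchColumn();
$html = @gzinflate($blob);   // compressed by default ("deflate")
if ($html === false) {
    $html = $blob;           // already plain if CachedCopyEncoding is "identity"
}
echo $html;
?>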

The 3.4 manual is already online.
These pages might be of interest to you:
http://www.mnogosearch.org/doc34/msearch-changelog.html
http://www.mnogosearch.org/doc34/msearch-cmdref-cachedcopyencoding.html


Reply: <http://www.mnogosearch.org/board/message.php?id=21608>
b***@mnogosearch.org
2013-12-13 12:46:07 UTC
Permalink
Author: fasfuuiios
Email:
Message:
With such options, mnogosearch can be positioned not only as a search
engine but also as a universal data miner for collecting and analyzing
data with external parsing libraries. In most cases, so-called parsers
can't crawl sites properly. So if anyone needs to download a site, it
is better to use mnogosearch. With wget it becomes unpredictable.
Probably the only competitor of mnogosearch is the Python library named
Scrapy. But it also needs preparation for everything. And it is
unpredictable on high volumes of data, in my opinion.

Reply: <http://www.mnogosearch.org/board/message.php?id=21613>
