Discussion:
[General] Webboard: Some section not indexed in DB
b***@mnogosearch.org
2016-02-10 11:45:57 UTC
Author: Guillaume
Email: ***@atlza.com
Message:
Hi Again,

I'm having a problem with some Section lines in indexer.conf with mnogosearch 3.4.1.

Here is an extract of my indexer.conf:

Section ResponseTime 0 32
# Standard sections: body, title
Section body 1 1024
Section title 2 256

# HTML meta tags, e.g. <META NAME="KEYWORDS" CONTENT="xxxx">
Section meta.keywords 3
Section meta.description 4 256

# Incoming link text
Section ilinktext 5 128

# Document's URL part
Section url.file 6 0
Section url.path 7 0
Section url.host 8 0
Section url.proto 9 0

# Useful meta information
Section Charset 10 32
Section Content-Type 11 64
Section Content-Language 12 16

# Message/rfc822 headers
#Section msg.from 15
#Section msg.to 16
#Section msg.subject 17

# A user defined section example.
# Extract text between <h1> and </h1> tags:
#Section h1 20 128 "<h1>(.*)</h1>" $1
Section h1 26 256 "<h1[^>]*>(.*)</h1>" $1
Section h2 26 256 "<h2[^>]*>(.*)</h2>" $1
Section h3 26 256 "<h3[^>]*>(.*)</h3>" $1
Section canonical 33 1024 '<link rel="canonical" +href="([^"]*)"' $1
Section ogdescription 33 300 '<meta property="og:description" +content="([^"]*")' $1
Section ogtitle 34 128 '<meta property="og:title" +content="([^"]*")' $1

# Uncomment the following lines if you want index MP3 tags.
#Section MP3.Song 25
#Section MP3.Album 26
#Section MP3.Artist 27
#Section MP3.Year 28

# HTTP headers, e.g. "Server" HTTP header
#Section header.server 30
Section header 30 128
Section header.server 30 128
Section header.Date 30 128
Section header.Last-Modified 30 128
Section header.Etag 30 128
Section header.X-Robots-Tag 30 128
# HTML tag attributes
Section attribute.alt 35 128
Section attribute.label 36 128
Section attribute.summary 37 128
Section attribute.title 38 128

----

After a crawl, the only sections saved in the urlinfo table are:
Canonical
Charset
Content-language
Content-type
h1
h2
h3
ogdescription
ogtitle
ResponseTime

As we can see, several sections are missing, including important ones such as title and meta.description, which I have checked are present on my server.
The results are the same for various documents and various servers.
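
The comparison described above (sections declared in indexer.conf versus section names found in urlinfo) can be sketched in a few lines of Python. This is only an illustration, not part of mnogosearch: the function names are hypothetical, the parser assumes one whitespace-separated "Section <name> <id> ..." command per line, and names are compared case-insensitively because the stored names above differ in case from the configured ones (e.g. "Canonical" vs. "canonical").

```python
def declared_sections(conf_text):
    """Return the names from uncommented 'Section' lines, in file order."""
    names = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out commands
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "Section":
            names.append(parts[1])
    return names


def missing_sections(conf_text, stored_names):
    """Return declared sections that do not appear among the stored names."""
    stored = {name.lower() for name in stored_names}
    return [n for n in declared_sections(conf_text) if n.lower() not in stored]


# Shortened sample taken from the configuration and the stored list above.
CONF = """\
Section ResponseTime 0 32
Section body 1 1024
Section title 2 256
Section meta.description 4 256
#Section msg.from 15
Section Charset 10 32
"""
STORED = ["Charset", "ResponseTime"]

print(missing_sections(CONF, STORED))
```

With the full indexer.conf and the complete list of names read from urlinfo, the same function lists every declared section that was not stored.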

I've also tried not setting a length for title, body and meta.description, as in the 3.4 documentation example, but it doesn't work any better.

Did I miss something?

Thanks for the help, mnogosearch is a great tool!



Reply: <http://www.mnogosearch.org/board/message.php?id=21746>
b***@mnogosearch.org
2016-02-12 11:29:32 UTC
Author: Alexander Barkov
Email: ***@mnogosearch.org
Message:
Hi Guillaume,

title, body and meta.description do not really need to be in urlinfo for search purposes in 3.4.x. Search and search result presentation should work fine without them.

But you might of course need them for other, external purposes, e.g. site analysis. The intention of the latest changes in 3.4.x
was not to store sections in urlinfo by default, but they should be
stored if the "length" parameter is set to a non-zero value.
It seems something went wrong. I'll check it after the weekend
(I'm currently away from my development box).
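
As a rough illustration of the intended rule described above (a section is stored in urlinfo only when its "length" parameter is non-zero), here is a small Python sketch. It is not mnogosearch code and says nothing about what 3.4.1 actually does. The function name is hypothetical, and treating a Section line with no length field as length 0 is an assumption.

```python
def expected_in_urlinfo(section_lines):
    """Return section names whose length parameter is non-zero.

    Assumes the layout 'Section <name> <id> [<length>] ...'.
    A missing or non-numeric length is treated as 0 (an assumption).
    """
    expected = []
    for line in section_lines:
        parts = line.split()
        if len(parts) < 3 or parts[0] != "Section":
            continue  # not an active Section command
        length = int(parts[3]) if len(parts) > 3 and parts[3].isdigit() else 0
        if length > 0:
            expected.append(parts[1])
    return expected


# A few lines taken from the configuration in the original post.
LINES = [
    "Section title 2 256",
    "Section meta.keywords 3",
    "Section url.file 6 0",
    "Section meta.description 4 256",
    "#Section msg.from 15",
]

print(expected_in_urlinfo(LINES))
```

Under this reading, title and meta.description from the posted configuration would be expected in urlinfo, which is why their absence looks like a bug rather than a configuration mistake.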
Reply: <http://www.mnogosearch.org/board/message.php?id=21747>
