Discussion:
Webboard: Regex syntax for sections with multiple matches
b***@mnogosearch.org
2013-11-27 13:53:48 UTC
Author: Felix Heller
Email: ***@aimcom.de
Message:
Hello,

I installed and configured MnoGoSearch as a powerful full-text search engine for
CMS websites a few days ago, but right now I am a little confused about the
configuration of document sections.

I would like to index the headlines (<h1>, <h2>, <h3>) in special fields so that I
can weight them more in comparison to the body text.

There is one example given in indexer.conf:
Section h1 26 128 "<h1>(.*)</h1>" $1

This works fine because normally there is only one <h1> on a webpage. But when I try
to index all <h2> headlines using the regular expression "<h2>(.*)</h2>" $1, the
whole content between the first <h2> and the last </h2> gets indexed. What I would
like to get is only the text inside each <h2>...</h2> pair.

Could somebody please tell me if there is a solution for that problem?

Thanks a lot for your help
Felix

Reply: <http://www.mnogosearch.org/board/message.php?id=21590>
b***@mnogosearch.org
2013-11-27 16:22:21 UTC
Author: Alexander Barkov
Email: ***@mnogosearch.org
Message:
Hello,
There are two problems here:
1. Nested tags: <h2>...<xxx>...</xxx>...</h2>

Unfortunately, there is no general solution for this,
because the underlying regexp library does not support
so-called "non-greedy quantifiers". We definitely need
to switch to the PCRE library eventually to make this possible.
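
For readers unfamiliar with the term, the difference between greedy and non-greedy quantifiers can be illustrated outside of mnoGoSearch. The sketch below uses Python's `re` module, which does support the non-greedy `.*?` form; it is only an illustration of the matching behaviour, not of anything indexer.conf accepts, and the sample HTML string is made up.

```python
import re

# Made-up page fragment with two <h2> headlines.
html = "<h2>First</h2><p>body text</p><h2>Second</h2>"

# Greedy: .* runs as far as it can, so the match ends at the LAST </h2>.
greedy = re.search(r"<h2>(.*)</h2>", html).group(1)

# Non-greedy: .*? stops as early as it can, at the FIRST </h2>.
lazy = re.search(r"<h2>(.*?)</h2>", html).group(1)

print(greedy)  # First</h2><p>body text</p><h2>Second
print(lazy)    # First
```

The greedy result is the behaviour described in the question: everything from the first opening tag to the last closing tag ends up in the section.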

But there is a workaround that I think should work for <h2> and <h3>.
The idea is that <h2> and <h3> usually do not have nested tags,
so the regexp can scan everything until the next '<' character:

Section h2 27 128 "<h2>([^<]*)</h2>" $1
Section h3 28 128 "<h3>([^<]*)</h3>" $1

It will work for: <h2>text text</h2>

It will not work for: <h2>text <xxx>text</xxx> text</h2>
where xxx is some other tag.
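
The behaviour of the `[^<]*` workaround can be checked quickly in any regex engine, since it uses no non-greedy syntax. Here is a small sketch in Python's `re` module (not the library mnoGoSearch uses); `<em>` stands in for the unspecified `xxx` tag and is only an assumption for the example.

```python
import re

# Same pattern as in the Section lines above: capture text up to the next '<'.
pattern = re.compile(r"<h2>([^<]*)</h2>")

plain = "<h2>text text</h2>"
nested = "<h2>text <em>text</em> text</h2>"   # <em> is a stand-in for xxx
two = "<h2>First</h2><p>body</p><h2>Second</h2>"

m_plain = pattern.search(plain)
m_nested = pattern.search(nested)
m_two = pattern.search(two)

print(m_plain.group(1))  # text text
print(m_nested)          # None: the nested tag breaks the match
print(m_two.group(1))    # First: no longer runs on to the last </h2>
```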

Do you know any tags that are possible inside <h2></h2> or <h3></h3>?


2. Multiple <h2> or <h3> tags.
User-defined sections do not support multiple entries.
They catch only the first match. Adding support for multiple
matches (e.g. concatenating them) will need some coding.
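
To make the idea of concatenating multiple matches concrete, here is a rough sketch in Python. It is not mnoGoSearch code and does not reflect how the indexer is implemented; `collect_headlines` is a hypothetical helper name, and the approach (find every match, join them with spaces) is just one possible reading of "concatenate them".

```python
import re

def collect_headlines(html: str, tag: str = "h2") -> str:
    """Join the text of every <tag>...</tag> pair that has no nested tags."""
    # Same [^<]* idea as the workaround: stop at the next '<'.
    pattern = re.compile(r"<{0}>([^<]*)</{0}>".format(re.escape(tag)))
    # findall returns the captured group of every non-overlapping match.
    return " ".join(text.strip() for text in pattern.findall(html))

page = "<h2>First</h2><p>body</p><h2> Second </h2>"
print(collect_headlines(page))  # First Second
```

A page without any matching headline simply yields an empty string.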
Reply: <http://www.mnogosearch.org/board/message.php?id=21591>
