Discussion:
Webboard: Indexer with regex
b***@mnogosearch.org
2013-11-09 17:58:41 UTC
Author: Laurent
Email:
Message:
Hi Guys,

It has been a long time since my udm-gw script back in Y2K.
I am back on mnoGoSearch and facing a newbie issue I can't solve.

I want to index a server but exclude URLs that match certain patterns.
I tried Disallow together with Server, and every attempt fails.
A Server disallow with a pattern does not seem possible to me, so I did not try it.

Here I want to index www.a.com/
without www.a.com/news/*/2000/*
and www.a.com/index.html?*setlang=za

I did:
Disallow regex www.a.com/news/*/2000/*
Disallow regex www.a.com/index.html\?*setlang=za
Server allow www.a.com/

I also tried using .* instead of * as the match-anything pattern, without success.

Any help appreciated :-)

Thanks

Reply: <http://www.mnogosearch.org/board/message.php?id=21576>
b***@mnogosearch.org
2013-11-09 18:15:08 UTC
Author: Alexander Barkov
Email: ***@mnogosearch.org
Message:
Hi,
Post by b***@mnogosearch.org
> Hi Guys,
> It has been a long time since my udm-gw script back in Y2K.
> I am back on mnoGoSearch and facing a newbie issue I can't solve.
> I want to index a server but exclude URLs that match certain patterns.
> I tried Disallow together with Server, and every attempt fails.

Can you please clarify what fails?
Does it crawl the entire site?
Or does it crawl nothing?

> A Server disallow with a pattern does not seem possible to me, so I did not try it.
> Here I want to index www.a.com/
> without www.a.com/news/*/2000/*
> and www.a.com/index.html?*setlang=za
> Disallow regex www.a.com/news/*/2000/*
> Disallow regex www.a.com/index.html\?*setlang=za
> Server allow www.a.com/

The correct command is:

Server http://www.a.com/

Notice the "http://" prefix.

> I also tried using .* instead of * as the match-anything pattern, without success.

".*" is correct.

Btw, which version are you using?
Reply: <http://www.mnogosearch.org/board/message.php?id=21577>
b***@mnogosearch.org
2013-11-09 19:17:03 UTC
Author: Laurent
Email:
Message:
Hi Alex,

Thanks for your answer.

I did not write the URL exactly.
What you wrote is what I did, and apparently it does not work.
I am on FreeBSD, mnoGoSearch 3.3.14.

Disallow regex www.a.com/news/*/2000/*
Disallow regex www.a.com/index.html\?*setlang=za
Server https://allow www.a.com/

Is this the correct format?

In the log, I see https://www.a.com/index.php?title=Toto&value=1&setlang=za
as well as:
https://www.a.com/index.html?Special/file_2007_Conference

thanks



Reply: <http://www.mnogosearch.org/board/message.php?id=21578>
b***@mnogosearch.org
2013-11-09 19:32:50 UTC
Author: Alexander Barkov
Post by b***@mnogosearch.org
> Hi Alex,
> Thanks for your answer.
> I did not write the URL exactly.
> What you wrote is what I did, and apparently it does not work.
> I am on FreeBSD, mnoGoSearch 3.3.14.
> Disallow regex www.a.com/news/*/2000/*
> Disallow regex www.a.com/index.html\?*setlang=za
> Server https://allow www.a.com/
> Is this the correct format?

Try this:

Disallow regex "www[.]a[.]com/news/.*/2000/.*"
Disallow regex "www[.]a[.]com/index[.]html[?].*setlang=za"
Server allow https://www.a.com/

If it does not help, try this command:

indexer -amv6 -u "https://www.a.com/index.php?title=Toto&value=1&setlang=za"

It will print debug output and explain why this URL
is accepted or rejected. Please post its output here.
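
The difference between the two sets of patterns comes down to regex syntax: in a regular expression, "*" repeats the preceding character instead of standing for "anything", and an unescaped "." matches any character. The sketch below illustrates this with Python's re module as a stand-in for the indexer's regex engine. That is an assumption: the engines are not identical, although they agree on these basic constructs. The first two sample URLs are made up for illustration; the third is the one reported in the log.

```python
import re

# Patterns as first posted: "*" only repeats the preceding "/" or "?".
old_news = r"www.a.com/news/*/2000/*"
old_lang = r"www.a.com/index.html\?*setlang=za"

# Patterns suggested in the reply: ".*" matches any run of characters,
# "[.]" and "[?]" match a literal dot and a literal question mark.
new_news = r"www[.]a[.]com/news/.*/2000/.*"
new_lang = r"www[.]a[.]com/index[.]html[?].*setlang=za"

def matches(pattern, url):
    """True if pattern is found anywhere in url (unanchored search)."""
    return re.search(pattern, url) is not None

# Hypothetical sample URLs, not taken from the real site.
news_url = "https://www.a.com/news/sports/2000/jan.html"
lang_url = "https://www.a.com/index.html?title=Toto&setlang=za"
# URL reported in the indexer log; note it is index.php, not index.html.
php_url = "https://www.a.com/index.php?title=Toto&value=1&setlang=za"

for name, pattern, url in [
    ("old news", old_news, news_url),
    ("new news", new_news, news_url),
    ("old lang", old_lang, lang_url),
    ("new lang", new_lang, lang_url),
    ("new lang vs index.php", new_lang, php_url),
]:
    print(f"{name}: {matches(pattern, url)}")
```

Under these assumptions the first-posted patterns match neither sample URL, while the suggested ones match both. A pattern that names index.html does not cover the index.php URL seen in the log, so that URL would need a pattern of its own.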
Reply: <http://www.mnogosearch.org/board/message.php?id=21579>
b***@mnogosearch.org
2013-11-10 07:44:32 UTC
Author: Laurent
Email:
Message:
indexer from mnogosearch-3.3.14-mysql started with '/usr/local/etc/mnogosearch/indexer.conf'
[57177]{01} URL: https://www.a.com/index.php/code_2007_:_Selection
[57177]{01} Server Path Allow 'https://www.a.com/'
[57177]{01} Allow Regex InSensitive '\.php$|\.cgi$|\.pl$'
[57177]{01} ROBOTS: https://www.a.com/robots.txt
[57177]{01} Request.Accept-Encoding: gzip,deflate,compress
[57177]{01} Request.Accept-Language: en, fr
[57177]{01} Request.From: ***@toto.com
[57177]{01} Request.Host: www.a.com
[57177]{01} Request.User-Agent: bot
[57177]{01} Response.Accept-Ranges: bytes
[57177]{01} Response.Connection: close
[57177]{01} Response.Content-Encoding: gzip
[57177]{01} Response.Content-Length: 0
[57177]{01} Response.Content-Type: text/plain
[57177]{01} Response.Date: Sun, 10 Nov 2013 07:41:01 GMT
[57177]{01} Response.DefaultLang: en
[57177]{01} Response.DetectClones: 1
[57177]{01} Response.ETag: "1ea26b-0-4e0f93dabf240"
[57177]{01} Response.Last-Modified: Mon, 08 Jul 2013 05:23:13 GMT
[57177]{01} Response.Method: Disallow
[57177]{01} Response.Period: 604800
[57177]{01} Response.Request.Accept-Language: en, fr
[57177]{01} Response.Request.From: ***@toto.com
[57177]{01} Response.Request.User-Agent: bot
[57177]{01} Response.ResponseLine: HTTP/1.1 200 OK
[57177]{01} Response.ResponseSize: 360
[57177]{01} Response.Server: Apache
[57177]{01} Response.Status: 200
[57177]{01} Response.Tag: www_en
[57177]{01} Response.URL: https://www.a.com/robots.txt
[57177]{01} Response.URL_ID: -1277106540
[57177]{01} Response.Vary: Accept-Encoding
[57177]{01} Response.VaryLang: en fr
[57177]{01} Response.X-Frame-Options: Deny
[57177]{01} Response.X-XSS-Protection: 1; mode=block
[57177]{01} Request.Accept-Encoding: gzip,deflate,compress
[57177]{01} Request.Accept-Language: en, fr
[57177]{01} Request.From: ***@toto.com
[57177]{01} Request.Host: www.a.com
[57177]{01} Request.User-Agent: bot
[57177]{01} Response.body: <NULL>
[57177]{01} Response.Cache-Control: private, must-revalidate, max-age=0
[57177]{01} Response.CachedCopy: <NULL>
[57177]{01} Response.Charset: <NULL>
[57177]{01} Response.Connection: close
[57177]{01} Response.Content-Encoding: gzip
[57177]{01} Response.Content-Language: en
[57177]{01} Response.Content-Length: 7496
[57177]{01} Response.Content-Type: text/html
[57177]{01} Response.crc32: 1003223498
[57177]{01} Response.crc32old: 1003223498
[57177]{01} Response.crosswords: <NULL>
[57177]{01} Response.Date: Sun, 10 Nov 2013 07:41:01 GMT
[57177]{01} Response.DefaultLang: en
[57177]{01} Response.DetectClones: 1
[57177]{01} Response.Expires: Thu, 01 Jan 1970 00:00:00 GMT
[57177]{01} Response.Hops: 14
[57177]{01} Response.ID: 405428
[57177]{01} Response.Last-Modified: Mon, 14 Oct 2013 15:14:00 GMT
[57177]{01} Response.MaxDocPerSite: 0
[57177]{01} Response.MaxHops: 256
[57177]{01} Response.meta.description: <NULL>
[57177]{01} Response.meta.keywords: <NULL>
[57177]{01} Response.Method: Disallow
[57177]{01} Response.msg.from: <NULL>
[57177]{01} Response.msg.subject: <NULL>
[57177]{01} Response.msg.to: <NULL>
[57177]{01} Response.Period: 604800
[57177]{01} Response.PrevStatus: 200
[57177]{01} Response.Request.Accept-Language: en, fr
[57177]{01} Response.Request.From: ***@toto.com
[57177]{01} Response.Request.User-Agent: bot
[57177]{01} Response.ResponseLine: HTTP/1.1 200 OK
[57177]{01} Response.ResponseSize: 7952
[57177]{01} Response.Server: Apache
[57177]{01} Response.Server-Charset: utf-8
[57177]{01} Response.Server_id: -1149994654
[57177]{01} Response.Site_id: -1149994654
[57177]{01} Response.Status: 200
[57177]{01} Response.Tag: www_en
[57177]{01} Response.title: <NULL>
[57177]{01} Response.URL: https://www.a.com/index.php/code_2007_:_Selection
[57177]{01} Response.url.file: <NULL>
[57177]{01} Response.url.host: <NULL>
[57177]{01} Response.url.path: <NULL>
[57177]{01} Response.url.proto: <NULL>
[57177]{01} Response.URL_ID: 1908964734
[57177]{01} Response.Vary: Accept-Encoding,Cookie
[57177]{01} Response.VaryLang: en fr
[57177]{01} Response.X-Content-Type-Options: nosniff
[57177]{01} Response.X-Frame-Options: Deny
[57177]{01} Response.X-XSS-Protection: 1; mode=block
[57177]{01} Status: 200 OK
[57177]{01} Stored rec_id: 405428 Size: 25459 Ratio: 29.35%
[57177]{01} Guesser: Lang: en, Charset: utf-8
[57177]{01} SectionFilter: Allow by default
[57177]{01} Link '/favicon.ico' https://www.a.com/favicon.ico
[57177]{01} Server applied: site_id: -1149994654 URL: https://www.a.com/
[57177]{01} Allow Regex InSensitive '\.php$|\.cgi$|\.pl$'
[57177]{01} Link '/opensearch_desc.php' https://www.a.com/opensearch_desc.php
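
Most of this output is request and response headers; the lines that say which rule was applied to the URL are few. As a rough aid for reading such a log, the sketch below keeps only the messages that name the URL or a rule. The list of prefixes is an assumption based solely on the lines visible in this log, and "Disallow " is included by analogy without having been observed here.

```python
import re

# Line layout seen in the log above: "[pid]{thread} message".
LINE = re.compile(r"^\[(\d+)\]\{(\d+)\}\s+(.*)$")

# Message prefixes that report the URL under test and the rules applied to it.
# "Disallow " is included by analogy with "Allow "; it does not occur in this log.
PREFIXES = ("URL:", "Server ", "Allow ", "Disallow ")

def rule_lines(lines):
    """Return only the messages that name a URL or a matching rule."""
    kept = []
    for line in lines:
        m = LINE.match(line.strip())
        if m and m.group(3).startswith(PREFIXES):
            kept.append(m.group(3))
    return kept

# A few lines copied from the log above.
sample = [
    "[57177]{01} URL: https://www.a.com/index.php/code_2007_:_Selection",
    "[57177]{01} Server Path Allow 'https://www.a.com/'",
    r"[57177]{01} Allow Regex InSensitive '\.php$|\.cgi$|\.pl$'",
    "[57177]{01} Response.Status: 200",
    "[57177]{01} Status: 200 OK",
]

for message in rule_lines(sample):
    print(message)
```

Applied to the full log, this leaves the "Server Path Allow" and "Allow Regex" lines, which show the URL being accepted by an Allow rule and no Disallow rule being reported for it.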

Reply: <http://www.mnogosearch.org/board/message.php?id=21581>
b***@mnogosearch.org
2013-11-11 20:22:45 UTC
Author: Alexander Barkov
Post by b***@mnogosearch.org
> indexer from mnogosearch-3.3.14-mysql started with '/usr/local/etc/mnogosearch/indexer.conf'
> [57177]{01} URL: https://www.a.com/index.php/code_2007_:_Selection
> [57177]{01} Server Path Allow 'https://www.a.com/'
> [57177]{01} Allow Regex InSensitive '\.php$|\.cgi$|\.pl$'

Can you please send your indexer.conf to ***@mnogosearch.org?
Thanks.


Reply: <http://www.mnogosearch.org/board/message.php?id=21582>
b***@mnogosearch.org
2013-11-13 08:48:32 UTC
Author: Laurent
Email:
Message:
Hi Alex,

OK, I finally found the issue.

First, this line was active:
Allow NoMatch Regex \.php$|\.cgi$|\.pl$

Because of it, almost all URLs were accepted: this Allow came before the Disallow lines for the server.
This completely changed how I organize the indexer configuration file.

Before, I allowed or disallowed broad patterns first (*.suffix etc.), then disallowed URLs, and then allowed URLs.
Now I disallow the server URLs first, then allow/disallow broad patterns, and finally add the Server allows.

This also greatly reduced my unsupported Content-Type statistics :-)

Thanks for your support!
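
The effect of rule order can be sketched with a small model. It assumes that filter rules are checked in the order they appear in the file and that the first rule that fires decides, which is what the finding above suggests. The decide function, the rule tuples, and the default of "Allow" are hypothetical illustrations, not mnoGoSearch code; the sample URL is made up.

```python
import re

def decide(rules, url, default="Allow"):
    """Return the action of the first rule that fires for url.

    Each rule is (action, kind, pattern). A "Match" rule fires when the
    pattern is found in the URL, a "NoMatch" rule when it is not.
    The default action is an assumption of this sketch.
    """
    for action, kind, pattern in rules:
        matched = re.search(pattern, url) is not None
        if matched == (kind == "Match"):
            return action
    return default

# The broad rule from the message above, plus the two Disallow patterns
# suggested earlier in the thread.
allow_scripts = ("Allow", "NoMatch", r"\.php$|\.cgi$|\.pl$")
deny_news = ("Disallow", "Match", r"www[.]a[.]com/news/.*/2000/.*")
deny_lang = ("Disallow", "Match", r"www[.]a[.]com/index[.]html[?].*setlang=za")

old_order = [allow_scripts, deny_news, deny_lang]  # broad Allow first
new_order = [deny_news, deny_lang, allow_scripts]  # specific Disallow first

# Hypothetical sample URL.
url = "https://www.a.com/news/sports/2000/jan.html"
print("old order:", decide(old_order, url))
print("new order:", decide(new_order, url))
```

In this model the URL does not end in .php, .cgi or .pl, so the NoMatch rule fires first in the old order and the Disallow lines are never reached; with the specific Disallow lines first, the same URL is rejected.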

Reply: <http://www.mnogosearch.org/board/message.php?id=21584>
