[General] Webboard: exclude mime types

Discussion:

b***@mnogosearch.org

2016-10-12 08:41:27 UTC

Author: fabien
Email: ***@gmail.com
Message:
Hi all,

Is it possible to exclude certain MIME types, such as RSS feeds?

Thanks in advance,
Fabien.

Reply: <http://www.mnogosearch.org/board/message.php?id=21788>

b***@mnogosearch.org

2016-10-12 13:37:26 UTC

Author: Alexander Barkov
Email:
Message:
Hi,

Post by b***@mnogosearch.org
Hi all,
Is it possible to exclude certain mime types such as rss feeds ?

This can be done using the NoIndexIf command:

http://www.mnogosearch.org/doc34/msearch-cmdref-noindexif.html

Put this command into indexer.conf to disallow a certain Content-Type:

NoIndexIf Content-Type application/rss+xml

Another option is to use NoIndexIf in combination with a user-defined section, to check raw content fragments:

http://www.mnogosearch.org/doc34/msearch-cmdref-section.html#cmdref-section-user-defined

The idea is to define a user section using a regex pattern to catch some known RSS text fragments, and then use NoIndexIf with this section.

Reply: <http://www.mnogosearch.org/board/message.php?id=21789>

b***@mnogosearch.org

2016-10-12 20:48:42 UTC

Author: fabien
Email: ***@gmail.com
Message:
Thanks for your quick answer.

I tried adding the NoIndexIf command, but I cannot get it to work.

I used the default indexer.conf file and added the following two lines at the end of that file:
Server http://www.wearethelous.com/feed/
NoIndexIf Content-Type application/rss+xml

I got the following log:

[71598]{--} Clearing
[71598]{--} Clearing done 0.01
[71600]{--} indexer from mnogosearch-3.4.1-mysql-pqsql started with '/etc/mnogosearch/indexer.conf'
[71600]{01} URL: http://www.wearethelous.com/feed/
[71600]{01} Server Path Allow 'http://www.wearethelous.com/feed/'
[71600]{01} Allow by default
[71600]{01} ROBOTS: http://www.wearethelous.com/robots.txt
[71600]{01} Request.Accept-Encoding: gzip,deflate,compress
[71600]{01} Request.Host: www.wearethelous.com
[71600]{01} Request.User-Agent: MnoGoSearch/3.4.1
[71600]{01} Response.Connection: close
[71600]{01} Response.Content-Encoding: gzip
[71600]{01} Response.Content-Length: 67
[71600]{01} Response.Content-Type: text/plain
[71600]{01} Response.Date: Wed, 12 Oct 2016 20:42:46 GMT
[71600]{01} Response.Link: <http://www.wearethelous.com/wp-json/>; rel="https://api.w.org/"
[71600]{01} Response.ResponseLine: HTTP/1.1 200 OK
[71600]{01} Response.ResponseSize: 475
[71600]{01} Response.ResponseTime: 2261
[71600]{01} Response.Server: Apache/2.2.31 (Unix) mod_ssl/2.2.31 OpenSSL/1.0.1e-fips mod_bwlimited/1.4
[71600]{01} Response.Server-Charset: utf-8
[71600]{01} Response.Status: 200
[71600]{01} Response.URL: http://www.wearethelous.com/robots.txt
[71600]{01} Response.URL_ID: 1928115922
[71600]{01} Response.Vary: Accept-Encoding,User-Agent
[71600]{01} Response.X-Powered-By: PHP/5.5.29
[71600]{01} Response.X-Robots-Tag: noindex, follow
[71600]{01} Request.Accept-Encoding: gzip,deflate,compress
[71600]{01} Request.Host: www.wearethelous.com
[71600]{01} Request.User-Agent: MnoGoSearch/3.4.1
[71600]{01} Response.body:
[71600]{01} Response.Charset:
[71600]{01} Response.Connection: close
[71600]{01} Response.Content-Encoding: gzip
[71600]{01} Response.Content-Language:
[71600]{01} Response.Content-Length: 2337
[71600]{01} Response.Content-Type: application/rss+xml
[71600]{01} Response.crc32: 0
[71600]{01} Response.crc32old: 0
[71600]{01} Response.Date: Wed, 12 Oct 2016 20:42:48 GMT
[71600]{01} Response.ETag: "7059155a990290887650add31475f88e"
[71600]{01} Response.Hops: 0
[71600]{01} Response.ID: 5
[71600]{01} Response.ilinktext:
[71600]{01} Response.Last-Modified: Thu, 29 Sep 2016 12:48:50 GMT
[71600]{01} Response.Link: <http://www.wearethelous.com/wp-json/>; rel="https://api.w.org/"
[71600]{01} Response.MaxDocPerSite: 0
[71600]{01} Response.MaxHops: 256
[71600]{01} Response.meta.description:
[71600]{01} Response.meta.keywords:
[71600]{01} Response.msg.from:
[71600]{01} Response.msg.subject:
[71600]{01} Response.msg.to:
[71600]{01} Response.PrevStatus: 0
[71600]{01} Response.ResponseLine: HTTP/1.1 200 OK
[71600]{01} Response.ResponseSize: 2842
[71600]{01} Response.ResponseTime: 1455
[71600]{01} Response.Server: Apache/2.2.31 (Unix) mod_ssl/2.2.31 OpenSSL/1.0.1e-fips mod_bwlimited/1.4
[71600]{01} Response.Server-Charset: utf-8
[71600]{01} Response.Server_id: -2050898686
[71600]{01} Response.Status: 200
[71600]{01} Response.title:
[71600]{01} Response.URL: http://www.wearethelous.com/feed/
[71600]{01} Response.url.file:
[71600]{01} Response.url.host:
[71600]{01} Response.url.path:
[71600]{01} Response.url.proto:
[71600]{01} Response.URL_ID: -2050898686
[71600]{01} Response.Vary: Accept-Encoding,User-Agent
[71600]{01} Response.X-Powered-By: PHP/5.5.29
[71600]{01} Response.X-Robots-Tag: noindex, follow
[71600]{01} Status: 200 OK
[71600]{01} Guesser: Lang: , Charset: utf-8
[71600]{01} SectionFilter: NoIndexIf Match Wild Insensitive 'Content-Type' 'application/rss+xml'
[71600]{01} Flushing word cache
[71600]{01} Flushing word cache done 0.00
[71600]{01} Done (4 seconds, 1 documents, 2842 bytes, 0.69 Kbytes/sec.)

I can see that the SectionFilter line mentions the NoIndexIf filter that I added, but the URL is still indexed.
So what could be wrong?

Thanks in advance for your help.
Fabien.

Reply: <http://www.mnogosearch.org/board/message.php?id=21790>

b***@mnogosearch.org

2016-10-12 20:52:39 UTC

Author: fabien
Email: ***@gmail.com
Message:
And to be more precise, what I ultimately want is to index only HTML pages and none of the other types of data (CSS/JS/pictures/PDF/RSS/...).

Fabien.

Reply: <http://www.mnogosearch.org/board/message.php?id=21791>

b***@mnogosearch.org

2016-10-13 04:23:03 UTC

Author: Alexander Barkov

I tried the same thing, and it seems to work fine.
This page is not returned in search results.

If I remove the NoIndexIf command, this page IS returned by search results.

Note that indexer shows the URL in its log because it still has to
download this URL to learn its content type.
But the "SectionFilter:..." line in the log tells you that indexer
marked the document as "not for indexing" and therefore stored no data in the underlying tables cachedcopy and bdicti, so "indexer --index" does not see it later when creating the search index.

Reply: <http://www.mnogosearch.org/board/message.php?id=21792>

b***@mnogosearch.org

2016-10-13 04:27:00 UTC

Author: Alexander Barkov

Note: if you know that documents under a certain location return application/rss+xml or some other unwanted content type,
consider using Disallow instead. In that case indexer will
not even download these documents.

NoIndexIf is meant for cases where the "bad" documents cannot be described by a URL pattern.
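
As a rough sketch of the Disallow approach for the feed discussed in this thread: the wildcard pattern and the Server URL below are assumptions about the site's URL layout (the pattern is chosen to cover the /feed/ URL from the log above) and should be adapted to the real site.

```
# Hypothetical pattern: URLs containing /feed/ are skipped without being downloaded
Disallow */feed/*

# Assumed site root; the thread only shows the /feed/ URL itself
Server http://www.wearethelous.com/
```

NoIndexIf Content-Type application/rss+xml can stay in the configuration as a fallback for feeds served from URLs that the pattern does not cover.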

<cut>

Reply: <http://www.mnogosearch.org/board/message.php?id=21793>

b***@mnogosearch.org

2016-10-13 04:31:39 UTC

Author: Alexander Barkov

Post by b***@mnogosearch.org
And to be more precise, i finally want to index only html pages and not all other types of data (css/js/pictures/pdf/rss/...) .

Something like this should do the trick:

NoIndexIf NoMatch Content-Type text/html*

Additionally, try using the Disallow command to reduce the number of URLs that indexer actually has to download.
See here for details:
http://www.mnogosearch.org/board/message.php?id=21793
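
Putting both suggestions together, a minimal sketch might look like the following. The extension list is only an assumption about which non-HTML files the site serves and is not taken from the thread; adjust it to the actual content.

```
# Hypothetical extension list: these URLs are never downloaded
Disallow *.css *.js *.png *.jpg *.gif *.pdf

# Fallback: anything that was downloaded but is not HTML is not indexed
NoIndexIf NoMatch Content-Type text/html*
```

The Disallow line saves bandwidth for URLs that can be recognized by name, while the NoIndexIf line catches documents (such as the RSS feed) whose type is only known after download.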

Reply: <http://www.mnogosearch.org/board/message.php?id=21794>

b***@mnogosearch.org

2016-10-13 19:54:13 UTC

Author: fabien
Email: ***@gmail.com
Message:
Hi,

Today I tried the Disallow statements, and they work like a charm! :)
I can now exclude the typical useless URLs before they get downloaded by the indexer.

Thanks for your help and for your work !

Reply: <http://www.mnogosearch.org/board/message.php?id=21795>