Discussion:
[General] Webboard: Links without specific protocol
b***@mnogosearch.org
2017-01-23 12:16:24 UTC
Author: Julien D.
Email: ***@clustaar.com
Message:
Hello,

I couldn't find any information on this subject.
As more sites move to HTTPS, I run into more and more problems when crawling
links that omit the protocol (protocol-relative links).

Take this example of a link found on http://www.example.com/page-a.html:
<a href="//www.example.com/page-b.html">text</a>

The crawler sees it as http://www.example.com/www.example.com/page-b.html,
which of course causes a 404 error.
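
For reference, a link starting with // is a network-path reference in RFC 3986: it keeps the scheme of the page it appears on and replaces the host and path. Below is a minimal Python sketch of the expected resolution, using the standard library's urljoin. The helper name resolve_link is hypothetical and this is not mnoGoSearch code; it only illustrates the result a crawler should arrive at.

```python
from urllib.parse import urljoin


def resolve_link(base_url: str, href: str) -> str:
    """Resolve an href against the URL of the page it was found on.

    urljoin follows the RFC 3986 reference-resolution rules, so an href
    starting with "//" inherits only the scheme of the base URL.
    """
    return urljoin(base_url, href)


# The example from the post: the link is found on page-a.html.
base = "http://www.example.com/page-a.html"
href = "//www.example.com/page-b.html"

print(resolve_link(base, href))
# prints "http://www.example.com/page-b.html"

# The same link found on an HTTPS page inherits https instead.
print(resolve_link("https://www.example.com/page-a.html", href))
# prints "https://www.example.com/page-b.html"
```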

Any idea how to get the right links?

Thanks.

Reply: <http://www.mnogosearch.org/board/message.php?id=21808>
b***@mnogosearch.org
2017-01-25 11:11:33 UTC
Author: Alexander Barkov
Email:
Message:
Hello,
The crawler stores full URLs in the database,
but you can strip the protocol at search time
using the search template language.

In 3.4.x use regex_substr:
http://www.mnogosearch.org/doc34/msearch-templates.html#template-functions

In 3.3.x use the EREG template operator:
http://www.mnogosearch.org/doc33/msearch-templates-oper.html#templates-oper-misc
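
As a rough illustration of what such a regular expression would do, here is a small Python sketch that strips the scheme from a stored URL and leaves a protocol-relative link. This is not mnoGoSearch template syntax; the pattern and the helper name strip_scheme are assumptions made for the example, and the exact regex_substr or EREG usage should be taken from the documentation linked above.

```python
import re

# Matches a URL scheme such as "http:" or "https:" at the start of the
# string, but only when it is followed by "//". The pattern is an
# assumption for this sketch, not taken from the mnoGoSearch docs.
SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:(?=//)")


def strip_scheme(url: str) -> str:
    """Return the URL without its scheme, e.g. "//host/path"."""
    return SCHEME_PREFIX.sub("", url, count=1)


print(strip_scheme("http://www.example.com/page-b.html"))
# prints "//www.example.com/page-b.html"
```

A URL that is already protocol-relative passes through unchanged, so the function can be applied to every stored URL.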


Reply: <http://www.mnogosearch.org/board/message.php?id=21809>