Discussion:
[General] Webboard: Links without specific protocol
b***@mnogosearch.org
2017-01-31 15:57:10 UTC
Author: Julien D.
Hello,
I couldn't find any information on this subject.
As people start using HTTPS, I get more and more problems when crawling
links that don't specify a protocol (protocol-relative links). For example:
<a href="//www.example.com/page-b.html">text</a>
is seen as: http://www.example.com/www.example.com/page-b.html
and of course causes a 404 error.
Any idea how to get the right links?
Thanks.
The crawler stores full URLs in the database.
But you can remove the protocol at search time,
using the search template language functionality.
http://www.mnogosearch.org/doc34/msearch-templates.html#template-functions
http://www.mnogosearch.org/doc33/msearch-templates-oper.html#templates-oper-misc
Hello Alexander,

Thanks for the answer.
However, the problem occurs during the indexing phase: the crawler tries to index
http://www.example.com/www.example.com/page-b.html (which does not exist)
instead of http://www.example.com/page-b.html

Can I prevent those 404 errors?

Thanks !
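
For reference, a protocol-relative link such as //www.example.com/page-b.html is a "network-path reference" in RFC 3986 terms: it names a host and inherits only the scheme of the page it appears on. The sketch below is a minimal illustration in Python, not mnoGoSearch code. The function naive_resolve is a hypothetical resolver, written only to reproduce the wrong URL reported above, and the base page URL is an assumption. The correct result comes from the standard library's urllib.parse.urljoin.

```python
from urllib.parse import urljoin, urlsplit


def naive_resolve(base: str, href: str) -> str:
    """Hypothetical buggy resolver (an assumption, not mnoGoSearch's code).

    It treats every href without a scheme as a path on the base host,
    which is what turns "//host/path" into "/host/path".
    """
    if urlsplit(href).scheme:
        return href  # already an absolute URL
    parts = urlsplit(base)
    return f"{parts.scheme}://{parts.netloc}/{href.lstrip('/')}"


def resolve(base: str, href: str) -> str:
    """RFC 3986 resolution: a leading "//" keeps only the base scheme."""
    return urljoin(base, href)


# Assumed page on which the link was found (hypothetical URL).
BASE = "http://www.example.com/some-page.html"
HREF = "//www.example.com/page-b.html"

print(naive_resolve(BASE, HREF))  # http://www.example.com/www.example.com/page-b.html
print(resolve(BASE, HREF))        # http://www.example.com/page-b.html

# The same link found on an HTTPS page inherits https instead.
print(resolve("https://www.example.com/some-page.html", HREF))
```

The last call shows why sites use this form of link: the same markup resolves to http or https depending on how the containing page was fetched.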

Reply: <http://www.mnogosearch.org/board/message.php?id=21810>
b***@mnogosearch.org
2017-02-01 10:48:27 UTC
Author: Alexander Barkov
Message:
Hello Julien,
Oops. This is not supported yet, indeed. I thought it was.
It should be easy to add this. Which version are you using?


Reply: <http://www.mnogosearch.org/board/message.php?id=21811>
b***@mnogosearch.org
2017-02-01 14:14:36 UTC
Author: Julien D.
Hello Alexander,

I currently use 3.4.1.

Is there a new release I am not aware of?

Thank you for your quick answers!

Reply: <http://www.mnogosearch.org/board/message.php?id=21812>
b***@mnogosearch.org
2017-02-06 07:07:31 UTC
Author: Alexander Barkov
Message:
No, 3.4.1 is the latest.


Reply: <http://www.mnogosearch.org/board/message.php?id=21815>
b***@mnogosearch.org
2017-02-06 19:37:28 UTC
Author: Alexander Barkov
Message:
Hello Julien,
I have added support for protocol-relative URLs to the next release, 3.4.2. I hope to make it available for download this week.

Note that the database structure in 3.4.2 differs slightly from 3.4.1,
so a full re-crawl will be needed. I hope that won't be a serious problem.


Reply: <http://www.mnogosearch.org/board/message.php?id=21816>
