Webboard: In/out links and fetching time for each page + xpath

Discussion:

b***@mnogosearch.org

2013-12-04 20:40:18 UTC

Author: Mamadoo
Email: ***@gmail.com
Message:
Hi there,

Is it possible to obtain the following information after crawling a website:
- Fetching / downloading time of each page
- Total in and out links (from the website structure itself)

Would it be possible to add XPath support as an alternative to regex for Sections,
either as a plugin or natively?

Many thanks !

Reply: <http://www.mnogosearch.org/board/message.php?id=21597>

b***@mnogosearch.org

2013-12-05 18:09:27 UTC

Author: Alexander Barkov
Email: ***@mnogosearch.org
Message:
Hi,

Post by b***@mnogosearch.org
Hi there,
- Fetching / downloading time of each page
- Total in and out links (from the website structure itself)

This is possible in mnogosearch-3.4.0, which is currently at the pre-alpha stage. If you'd like to give it a try, please download it from here:
http://www.mnogosearch.org/Download/mnogosearch-3.4.0.tar.gz
(note, this is not the final 3.4.0).

- See the ResponseTime special purpose section here:
http://www.mnogosearch.org/doc34/msearch-cmdref-section.html#cmdref-section-special

- The structure of the table "links" has changed.
It can now store all links between the pages.
Please see here for how to configure it:
http://www.mnogosearch.org/doc34/msearch-cmdref-collectlinks.html

Post by b***@mnogosearch.org
Would it be possible to add xpath support instead of regex for Sections ?
Using a plugin or natively.

I guess you need this for XML files.

XPath is currently not supported. We could take advantage
of libxml2 to add XPath support, but this would need some
development effort.

Post by b***@mnogosearch.org
Many thanks !

Reply: <http://www.mnogosearch.org/board/message.php?id=21598>

b***@mnogosearch.org

2013-12-05 18:20:43 UTC

Author: Alexander Barkov
Email: ***@mnogosearch.org
Message:
<skip>

Post by b***@mnogosearch.org
I guess you need this for XML files.
XPath is currently not supported. We could take advantage
of libxml2 to add XPath support, but this would need some
development effort.

Btw, simple extraction from a given XML tag is supported
in 3.3.x, with the help of the Section command.

For example:

<xml>
<a>
<b>I want to extract this</b>
</a>
</xml>

A command like this will do the trick:

Section xml.a.b 10 128
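
To make the idea of a dotted tag path concrete, here is a rough Python sketch of the same kind of lookup. It only mimics the behaviour described above and is not how mnogosearch implements it; the function name extract_by_dotted_path is made up for illustration, and treating the last number (128) as a length limit is an assumption.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
<a>
<b>I want to extract this</b>
</a>
</xml>"""


def extract_by_dotted_path(xml_text, dotted_path, max_len):
    """Return the text of the element addressed by a dotted path such as
    'xml.a.b', truncated to max_len characters, or None if it is absent.

    Hypothetical helper: it imitates the lookup, not mnogosearch itself.
    """
    root = ET.fromstring(xml_text)
    first, *rest = dotted_path.split(".")
    if root.tag != first:
        return None
    # ElementTree uses '/' as the step separator for child lookups.
    node = root.find("/".join(rest)) if rest else root
    if node is None:
        return None
    return (node.text or "")[:max_len]


print(extract_by_dotted_path(SAMPLE, "xml.a.b", 128))
```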

Reply: <http://www.mnogosearch.org/board/message.php?id=21599>

b***@mnogosearch.org

2013-12-06 10:43:23 UTC

Author: Mamadoo
Email: ***@gmail.com
Message:
For the fetching time, OK, thanks! Great news!
As for the in/out links per page, is there any chance you will add this one day?

For XPath, thanks, but no, it's not for XML parsing.
I would need it, for example, to scrape specific content from my pages.

Reply: <http://www.mnogosearch.org/board/message.php?id=21600>

b***@mnogosearch.org

2013-12-06 11:15:34 UTC

Author: Alexander Barkov

Post by b***@mnogosearch.org
For fetching time, ok thanks ! Great news !
For the in / out links per page, any chance you add this one day ?

As I said in the previous message, in 3.4.0
*ALL* in/out links can be collected into the table "links".
It's trivial to count incoming and outgoing links
for any URL by using a simple SQL query.
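
A minimal sketch of such a count, run against an in-memory SQLite database so it is self-contained. The column names src_id and dst_id are assumptions made for illustration only; the actual layout of the "links" table in 3.4.0 may differ, so please check the schema before adapting the queries.

```python
import sqlite3

# Hypothetical schema: src_id / dst_id are assumed column names,
# not necessarily those of the real mnogosearch "links" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (src_id INTEGER, dst_id INTEGER)")
conn.executemany(
    "INSERT INTO links (src_id, dst_id) VALUES (?, ?)",
    [(1, 2), (1, 3), (2, 3), (3, 1)],  # made-up sample links
)


def count_links(conn, url_id):
    """Return (incoming, outgoing) link counts for one document id."""
    incoming = conn.execute(
        "SELECT COUNT(*) FROM links WHERE dst_id = ?", (url_id,)
    ).fetchone()[0]
    outgoing = conn.execute(
        "SELECT COUNT(*) FROM links WHERE src_id = ?", (url_id,)
    ).fetchone()[0]
    return incoming, outgoing


print(count_links(conn, 3))
```

With the sample rows above, document 3 is linked from documents 1 and 2 and links out to document 1.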

Post by b***@mnogosearch.org
For xpath, thanks but no, it's not for XML parsing.
I would need it, for example, to scrape specific content from my pages.

XPath is a query language for addressing parts of an XML document. It assumes a well-formed XML document,
so it does not work for an arbitrary HTML file.
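
The well-formedness point can be illustrated with Python's standard XML parser, used here only as a stand-in for any strict XML parser. The two sample strings are invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Every tag is closed, so this parses as XML.
xhtml_like = "<html><body><p>Hello<br/>world</p></body></html>"
# Typical tag soup: <br> and <p> are never closed.
tag_soup = "<html><body><p>Hello<br>world</body></html>"

print(is_well_formed(xhtml_like), is_well_formed(tag_soup))
```

A strict parser rejects the second string before any XPath expression could be evaluated against it.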

Reply: <http://www.mnogosearch.org/board/message.php?id=21601>

b***@mnogosearch.org

2013-12-06 11:38:11 UTC

Author: Mamadoo
Email: ***@gmail.com
Message:
Many thanks

I use XPath every day to find content in XHTML pages and it works pretty well.

Thank you so much for your answers.

Any idea when 3.4 might be released?

Reply: <http://www.mnogosearch.org/board/message.php?id=21602>

b***@mnogosearch.org

2013-12-09 13:07:10 UTC

Author: Alexander Barkov

Post by b***@mnogosearch.org
Many thanks
I use XPath every day to find content in XHTML pages and it works pretty well.

XHTML is valid XML, so XPath should work.
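
For illustration, a small sketch of XPath-style extraction from an XHTML page outside mnogosearch, using Python's standard library. Note that ElementTree implements only a subset of XPath; the sample page, the id value "content" and the prefix "x" are made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="nav">Menu</div>
    <div id="content"><p>First paragraph</p><p>Second paragraph</p></div>
  </body>
</html>"""

# XHTML elements live in a namespace, so the path needs a prefix mapping.
NS = {"x": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)
# Select every <p> directly inside the <div> whose id is "content".
paragraphs = [p.text for p in root.findall(".//x:div[@id='content']/x:p", NS)]

print(paragraphs)
```

The namespace mapping is the usual stumbling block: without it, the same path matches nothing in an XHTML document.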

Post by b***@mnogosearch.org
Thank you so much for your answers.
Any idea of when the 3.4 could be released ?

Around January 2014, if everything goes fine.

Reply: <http://www.mnogosearch.org/board/message.php?id=21605>