Discussion:
[General] Extra hit with SQL query and word position in the original file
Teijo
2017-03-23 21:24:25 UTC
Hello,

If I search for a given word with search.cgi, I get the correct number of
occurrences.

But if I run the search with SQL (in either MySQL or SQLite3), the result
shows one extra occurrence. For example, if a word appears twice in the
original file, the query reports three occurrences. The SQL query is almost
the same as the one in the mnoGoSearch manual, except that I am using only
one word:

SELECT url.url, count(*) AS RANK FROM dict, url WHERE
url.rec_id=dict.url_id AND dict.word IN ('word') GROUP BY url.url ORDER
BY rank DESC;

I'd also like to find out (by SQL query) the position of a word in the
original file (to use it with the filepos function). There is at least a
coord column in the dict table. If I have understood correctly, coord
contains the section ID and the word's position relative to that section.
How can I extract the relative position from coord, or is the position
information stored elsewhere in the database? If I disabled all sections,
would coord contain the absolute position?

I'm using the "single" storage mode for the database.

Best regards,

Teijo
Alexander Barkov
2017-03-24 01:59:47 UTC
Hello Teijo,
Coord is a 32-bit number.

- The highest 8 bits are the section ID (e.g. title, body, etc.,
according to the Section commands in indexer.conf).

- The lowest 24 bits are the position inside this section.

- The last hit inside each combination (url_id, word, secno) is not a
real hit: it stores the section length (i.e. the total number of words
in this section of this document).
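
The bit layout described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the helper names decode_coord and encode_coord are made up for this example and are not part of mnoGoSearch.

```python
from typing import Tuple


def decode_coord(coord: int) -> Tuple[int, int]:
    """Split a 32-bit coord value into (section id, position in section)."""
    secno = (coord >> 24) & 0xFF   # highest 8 bits
    pos = coord & 0xFFFFFF         # lowest 24 bits
    return secno, pos


def encode_coord(secno: int, pos: int) -> int:
    """Inverse of decode_coord (hypothetical helper, handy for test data)."""
    return ((secno & 0xFF) << 24) | (pos & 0xFFFFFF)


print(decode_coord(encode_coord(1, 14)))  # (1, 14)
print(decode_coord(0x02000006))           # (2, 6)
```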


This MySQL query returns the information in a readable form:

SELECT url_id,word,coord>>24 AS secno,coord&0xFFFFFF AS pos FROM dict
WHERE word='mnogosearch' ORDER BY secno,pos;

+--------+-------------+-------+-----+
| url_id | word        | secno | pos |
+--------+-------------+-------+-----+
|      1 | mnogosearch |     1 |   1 |
|      1 | mnogosearch |     1 |  14 |
|      1 | mnogosearch |     1 |  28 |
|      1 | mnogosearch |     1 |  42 |
|      1 | mnogosearch |     1 |  76 |
|      1 | mnogosearch |     1 |  77 |
|      1 | mnogosearch |     1 |  85 |
|      1 | mnogosearch |     1 | 105 | <- section 1 length
|      1 | mnogosearch |     2 |   1 |
|      1 | mnogosearch |     2 |   6 | <- section 2 length
|      1 | mnogosearch |     3 |  54 |
|      1 | mnogosearch |     3 |  69 | <- section 3 length
|      1 | mnogosearch |     4 |   1 |
|      1 | mnogosearch |     4 |  11 | <- section 4 length
|      1 | mnogosearch |     8 |   2 |
|      1 | mnogosearch |     8 |   4 | <- section 8 length
+--------+-------------+-------+-----+


Lines that are not marked as "section X length" are actual word hits.
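
If every (url_id, word, secno) combination carries exactly one such length record, as described above, then a plain count(*) over dict is too high by one per section. A minimal sketch of a corrected count follows, using Python's sqlite3 module with an in-memory database. The three-column dict table here is a cut-down assumption for illustration, not the real mnoGoSearch schema, and the sample rows are copied from the result table in this message.

```python
import sqlite3

# (secno, pos) pairs copied from the result table in this message.
ROWS = [(1, 1), (1, 14), (1, 28), (1, 42), (1, 76), (1, 77), (1, 85), (1, 105),
        (2, 1), (2, 6), (3, 54), (3, 69), (4, 1), (4, 11), (8, 2), (8, 4)]


def build_db():
    """Create a simplified, hypothetical dict table filled with the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dict (url_id INTEGER, word TEXT, coord INTEGER)")
    conn.executemany(
        "INSERT INTO dict VALUES (1, 'mnogosearch', ?)",
        [((secno << 24) | pos,) for secno, pos in ROWS],
    )
    return conn


# Subtract one record per distinct section: the section-length entry.
HITS_SQL = """
SELECT url_id, COUNT(*) - COUNT(DISTINCT coord >> 24) AS hits
FROM dict WHERE word = ? GROUP BY url_id ORDER BY hits DESC
"""


def count_hits(conn, word):
    """Return (url_id, real hit count) pairs for the given word."""
    return conn.execute(HITS_SQL, (word,)).fetchall()


conn = build_db()
print(count_hits(conn, "mnogosearch"))  # [(1, 11)]
```

For a word that occurs twice in a single section, this gives 3 - 1 = 2, which matches the case from the original question.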
_______________________________________________
General mailing list
http://lists.mnogosearch.org/listinfo/general
Teijo
2017-03-24 13:45:02 UTC
Hello,

Thank you very much for this information! I'm about to apply it to one
of my subdomains.

Best regards,

Teijo