Discussion:
Webboard: Indexing email with multiple file type attachments
g***@mnogosearch.org
2012-01-19 23:42:36 UTC
Author: dsbcpas
Email: ***@dsbcpas.com
Message:
We would like to index emails that are each saved as a single file (as opposed to an mbox, for example) and that contain attachments of multiple file types. Specifically, we save and file emails with a .eml suffix, which our POP3 email client (Thunderbird) can read. As I understand it, an .eml file is basically a container with nested attachments identified by MIME headers.

Since mnogosearch appears to select parsers based on file suffix rather than MIME type, it seems necessary to split each type into separate files before indexing.

Any ideas how one might index each embedded MIME type and also reference the original email URL rather than the various embedded files?

My preliminary idea: use the mpack package. It includes munpack, which unpacks a message by MIME header and writes each part to a separate file named as in the message, which normally includes the correct suffix for its type. The text part of the message is a bit trickier: munpack can either ignore it or write it to files with no suffix. But I still have no idea how to pipe this into mnogosearch.

mpack and munpack are available at http://ftp.andrew.cmu.edu/pub/mpack/

All ideas welcome. The solution might be a good addition to the documentation.

Reply: <http://www.mnogosearch.org/board/message.php?id=21397>
g***@mnogosearch.org
2012-01-23 14:17:03 UTC
Author: Alexander Barkov
Post by g***@mnogosearch.org
We would like to index emails that are each saved as a single file (as opposed to an mbox, for example) and that contain attachments of multiple file types. Specifically, we save and file emails with a .eml suffix, which our POP3 email client (Thunderbird) can read. As I understand it, an .eml file is basically a container with nested attachments identified by MIME headers.
Since mnogosearch appears to select parsers based on file suffix rather than MIME type, it seems necessary to split each type into separate files before indexing.
Any ideas how one might index each embedded MIME type and also reference the original email URL rather than the various embedded files?
You can try writing an external program that breaks each message into parts, runs the appropriate converter on each part, and then collects the output from all parts into a single output stream, so that indexer treats it as a single file.
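
The approach described above could be sketched roughly as follows. This is only an illustration in Python using the standard library `email` package, not anything shipped with mnoGoSearch: the `CONVERTERS` table and the function names `part_to_text` and `flatten_message` are hypothetical, and real converters for binary attachment types would have to be supplied. How the script is hooked into indexer is not shown and would need to be checked against the mnoGoSearch documentation on external parsers.

```python
import sys
from email import policy
from email.parser import BytesParser

# Hypothetical converter table: MIME type -> function(bytes) -> str.
# Converters for PDF, Word files, etc. would be registered here.
CONVERTERS = {}


def part_to_text(part):
    """Return the text of one MIME part, or '' if it cannot be converted."""
    ctype = part.get_content_type()
    if ctype in CONVERTERS:
        return CONVERTERS[ctype](part.get_payload(decode=True) or b"")
    if part.get_content_maintype() == "text":
        try:
            return part.get_content()
        except (LookupError, UnicodeError):
            # Unknown or broken charset: fall back to a lossy decode.
            raw = part.get_payload(decode=True) or b""
            return raw.decode("utf-8", errors="replace")
    return ""  # no converter known for this type, so skip the part


def flatten_message(raw_bytes):
    """Parse one .eml message and return all convertible parts as one text."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    chunks = []
    subject = msg["subject"]
    if subject:
        chunks.append(str(subject))
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        text = part_to_text(part).strip()
        if text:
            chunks.append(text)
    return "\n\n".join(chunks)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed): python eml2txt.py message.eml
    with open(sys.argv[1], "rb") as fh:
        sys.stdout.write(flatten_message(fh.read()) + "\n")
```

Because everything is written to one output stream, the indexed document keeps the URL of the original .eml file rather than the URLs of the individual attachments.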
Reply: <http://www.mnogosearch.org/board/message.php?id=21399>
g***@mnogosearch.org
2012-01-23 20:18:05 UTC
Author: dsbcpas
Email: ***@dsbcpas.com
Message:
Thank you for the reply. That makes sense and does not sound too difficult, though it somewhat duplicates what mnogosearch already does based on file suffix.

Could you recommend anyone with some familiarity with mnogosearch who would be willing to code that for us?

Reply: <http://www.mnogosearch.org/board/message.php?id=21401>
g***@mnogosearch.org
2012-01-25 12:10:46 UTC
Author: Alexander Barkov
Post by g***@mnogosearch.org
Thank you for the reply. That makes sense and does not sound too difficult, though it somewhat duplicates what mnogosearch already does based on file suffix.
Suppose we decide to implement built-in support for multi-part messages in mnoGoSearch.

I'm curious about the following:

1. What should be displayed in excerpts in search results?

a. An excerpt only from the "regular" message body part.

b. Excerpts from all message parts that indexer was able to parse and that contain the searched words, with every excerpt obeying the ExcerptSize setting individually. So, for example, if a message has 10 attachments, there will be up to 11 excerpts (10 attachments + body).

c. A single excerpt collected by looping through the body and all attachments until the cumulative size of the excerpt reaches ExcerptSize.

2. What should the "cached copy" look like?

Probably it should include all message parts, with some separator (e.g. <hr>) between the parts.
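
To make option 1c concrete, here is a rough sketch in Python. It is not mnoGoSearch code: the function `build_excerpt`, its `context` parameter, and the " ... " separator are assumptions made for illustration, and `excerpt_size` merely stands in for the ExcerptSize setting. It takes the first hit of each search word in each part, and it assumes that lowercasing does not change string length, which holds for plain ASCII text.

```python
def build_excerpt(parts, query_words, excerpt_size, context=40):
    """Collect one excerpt across all parts (body first, then attachments),
    stopping once the cumulative snippet length reaches excerpt_size.
    The ' ... ' separators are not counted toward the limit."""
    words = [w.lower() for w in query_words]
    pieces = []
    used = 0
    for text in parts:
        lowered = text.lower()
        for word in words:
            pos = lowered.find(word)
            if pos < 0:
                continue  # this part does not contain the word
            room = excerpt_size - used
            if room <= 0:
                return " ... ".join(pieces)  # size budget exhausted
            start = max(0, pos - context)
            end = min(len(text), pos + len(word) + context)
            snippet = text[start:end][:room]  # trim to the remaining budget
            pieces.append(snippet)
            used += len(snippet)
    return " ... ".join(pieces)


if __name__ == "__main__":
    parts = [
        "The invoice is attached.",               # message body
        "Invoice total: 1200 USD due in March.",  # converted attachment
    ]
    print(build_excerpt(parts, ["invoice"], excerpt_size=256))
```

Option 1b would instead apply the size limit to each part separately and return one excerpt per matching part.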



<skip>



Reply: <http://www.mnogosearch.org/board/message.php?id=21403>
g***@mnogosearch.org
2012-01-25 17:06:26 UTC
Author: dsbcpas
Email: ***@dsbcpas.com
Message:
By item:

1c, single excerpt: an excellent solution if possible. Since the indexing implementation bundles the attachments into one file, I guess it would require some kind of header marker between attachments. Comment on option 1a: since a user can always save an attachment separately if a separate excerpt is required, I think it is equally appropriate to treat the regular message body part as the primary excerpt.

2. We don't cache copies.

I would like to contribute monetarily to the effort. What might it cost to implement?

Reply: <http://www.mnogosearch.org/board/message.php?id=21405>
g***@mnogosearch.org
2012-01-26 09:11:14 UTC
Author: Alexander Barkov
Post by g***@mnogosearch.org
1c, single excerpt: an excellent solution if possible. Since the indexing implementation bundles the attachments into one file, I guess it would require some kind of header marker between attachments. Comment on option 1a: since a user can always save an attachment separately if a separate excerpt is required, I think it is equally appropriate to treat the regular message body part as the primary excerpt.
2. We don't cache copies.
I would like to contribute monetarily to the effort. What might it cost to implement?
We can do it under the terms of our "extended monthly support".

Please contact me at ***@mnogosearch.org.



Reply: <http://www.mnogosearch.org/board/message.php?id=21407>
g***@mnogosearch.org
2012-01-27 14:02:15 UTC
Author: dsbcpas
Email: ***@dsbcpas.com
Message:
Your email server replied - Deferred: Connection refused by mail.mnogo.ru.

Reply: <http://www.mnogosearch.org/board/message.php?id=21411>
g***@mnogosearch.org
2012-01-27 14:10:35 UTC
Author: Alexander Barkov
Post by g***@mnogosearch.org
Your email server replied - Deferred: Connection refused by mail.mnogo.ru.
Sorry, the mail system was most likely down for some reason.

It's working now. Please try again.

Reply: <http://www.mnogosearch.org/board/message.php?id=21413>