Textual analysis of spam emails from spam archives

I am trying to implement an anti-spam engine using a probabilistic approach. The very first step is to research which words appear in spam and how frequent they are. So I wrote a very simple Java program to extract words from spam: I break the entire text file into lines, and the lines into words, using split("\\W") (\W matches any non-word character, not only whitespace).
I downloaded spam archives and thought I could easily scan these .txt files with this application. But I soon ran into a major problem: the text files contain HTML tags, links, email headers, and so on.
Now I am wondering how to tackle this. Should I use an HTML parser, or strengthen my own logic for analysing these files?
The answer mainly depends on whether I will face the same problem in the implementation phase. What do current spam filters do?

The mail envelope and headers are a standard, although mostly invisible, part of every email. Without these headers, the message would not reach you. There is no need to write the parsing logic yourself when standard libraries do the job.
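from email import message_from_string
# Read the raw message text (the path is a placeholder)
with open("mfile_path_to_message") as mailfd:
    message = message_from_string(mailfd.read())
# Header lookup is case-insensitive
print(message.get("from"))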
If your messages are in Unix mbox format, the mailbox library will be helpful. For parsing rich text like HTML, BeautifulSoup is among the better options.
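As a rough sketch of how these pieces could fit together for word counting, the following uses only the standard library: email separates headers from body, html.parser stands in for BeautifulSoup to drop the tags, and a regular expression does the tokenizing. The names TextExtractor and message_words and the sample message are made up for illustration.

```python
import re
from collections import Counter
from email import message_from_string
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between HTML tags, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def message_words(raw_message):
    """Return lowercase words from the text/plain and text/html parts."""
    message = message_from_string(raw_message)
    words = []
    for part in message.walk():
        content_type = part.get_content_type()
        if content_type not in ("text/plain", "text/html"):
            continue  # skips multipart containers and attachments
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name, common in spam
            text = payload.decode("utf-8", errors="replace")
        if content_type == "text/html":
            extractor = TextExtractor()
            extractor.feed(text)
            extractor.close()
            text = " ".join(extractor.chunks)
        words.extend(re.findall(r"[a-z']+", text.lower()))
    return words


sample = (
    "From: seller@example.com\n"
    "Subject: Offer\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body><p>Buy <b>cheap</b> pills, buy now!</p></body></html>\n"
)
counts = Counter(message_words(sample))
print(counts.most_common(3))
```

Headers never reach the word list because only the decoded body parts are tokenized; if header words matter for the analysis, they can be read separately with message.get().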

Related

Python Gmail Attachment

After reading the question "How can I download all emails with attachments from Gmail?", I ran the script from it.
For some reason, not all of my labels or messages with attachments are downloaded. I think either there is a problem with identifying the labels, or it is because I have seven Gmail accounts that I send from. For example, my main one is example1#gmail.com, but I can send mail as example_2#gmail.com from example1#gmail.com.
I did find gmailbackup (another Python module) and have it running currently. The problem is that it doesn't download attachments, as far as I have been able to find in the documentation and in the program. I am trying to jury-rig these two source codes together, but I am having trouble reading and understanding the gmailbackup script.
So the problem was with the m.select() function. The m.list() function does not make a list for you to query for each label.
So I added these lines:
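# m is the imaplib connection from the original script
status, inboxes = m.list()
for item in inboxes:
    # each item looks like: (\HasNoChildren) "/" "INBOX" (bytes on Python 3)
    mailbox = item.decode().split('"')[3]
    m.select(mailbox) # here you can choose a mailbox like INBOX instead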
There are probably better and more efficient ways of doing this; however, I was in a hurry. Using the re module and its match objects would probably be a good improvement.
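The regex idea could look roughly like the sketch below. It assumes LIST response lines in the common form (flags) "delimiter" name and does not handle escaped quotes or IMAP literals; LIST_LINE and parse_list_line are illustrative names, not part of imaplib.

```python
import re

# One IMAP LIST response line: (flags) "delimiter" mailbox-name
LIST_LINE = re.compile(rb'\((?P<flags>[^)]*)\) (?P<delim>"[^"]*"|NIL) (?P<name>.+)')


def parse_list_line(line):
    """Return the mailbox name from one LIST line, or None if it does not match."""
    match = LIST_LINE.match(line)
    if match is None:
        return None
    name = match.group("name").decode("ascii", "replace").strip()
    # Names containing spaces arrive quoted; simple names may not be.
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        name = name[1:-1]
    return name


print(parse_list_line(b'(\\HasNoChildren) "/" "[Gmail]/All Mail"'))
```

Unlike split('"')[3], this also copes with names the server sends unquoted and returns None for lines it cannot read instead of raising IndexError.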
Edit:
A second way is to set up a Gmail filter that adds a label to every message with an attachment, and then download from that label.
Edit 2
Gmail has a limit to downloads per day, so if you are dealing with large emails you might want to track what you have already downloaded.
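One simple way to track what has been downloaded is a plain text file with one message ID per line. This is only a sketch: the helper names and the tracking file are hypothetical, and the IDs could be IMAP UIDs or Message-ID headers, whichever the download script has at hand.

```python
import os
import tempfile


def load_seen(path):
    """Return the set of message IDs recorded in the tracking file."""
    if not os.path.exists(path):
        return set()
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def mark_seen(path, message_id):
    """Append one message ID so that a later run can skip it."""
    with open(path, "a") as handle:
        handle.write(message_id + "\n")


def pending(all_ids, seen):
    """Return the IDs that still need downloading, in their original order."""
    return [message_id for message_id in all_ids if message_id not in seen]


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as folder:
    tracking_file = os.path.join(folder, "downloaded_ids.txt")
    mark_seen(tracking_file, "101")
    print(pending(["101", "102"], load_seen(tracking_file)))
```

Calling mark_seen only after an attachment has been written to disk means an interrupted run resumes where it stopped.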

Website Input/Output using a premade .exe (Need direction)

I am trying to build a part of a website which takes in a text passage as input, and outputs the same text passage, except with the definition of each word appearing when the user rolls over (or clicks) any given word. I have a pre-made .exe file which maps input words to their definitions (takes in words from standard input and outputs the definition to standard output).
My problem, then, is to run user input through the .exe file on the website's server, then put the output back onto the page. It seems like a fairly trivial problem, but I have no idea where to start.
So my questions are: is this project even possible? If so, what languages/tools do I need to be able to use in order to implement it? Are there keywords that describe what I'm talking about that I could use to look up tutorials/solutions on the Web?
I have rudimentary knowledge of PHP, HTML, and Javascript, but so little experience that I can't judge whether (and how) they can be used to approach this problem.
Note: I do not have access to the .exe source, so I must use the .exe itself as my input-output mechanism.
With AJAX and PHP, you can accomplish this with minimal effort.
JavaScript's AJAX features would send the word you input to the PHP page, and from the PHP page, you can run the external exe file with the sent word as an argument (sanitize it, please. People can inject code which will explode your servers!):
<?php
$word = $_POST['input_word']; // MAKE SURE YOU SANITIZE THIS. If you don't, system security goes down the toilet.
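exec('myprogram.exe ' . escapeshellarg($word), $output_array); // '.' concatenates strings in PHP ('+' does not); escapeshellarg() quotes the input for the shell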
print_r($output_array);
?>
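The same precaution, sketched in Python purely for illustration (the answer itself uses PHP): check the word against an allow-list pattern and pass it as a separate argument, so that no shell ever interprets it. The function names and the allowed-character pattern are assumptions, and command stands for the path to the real .exe.

```python
import re
import subprocess
import sys

# Assumed rule: a word starts with a letter and contains letters, ' or - only.
WORD_PATTERN = re.compile(r"[A-Za-z][A-Za-z'-]{0,63}")


def is_safe_word(word):
    return WORD_PATTERN.fullmatch(word) is not None


def look_up(command, word):
    """Run the external program with the word as its only extra argument."""
    if not is_safe_word(word):
        raise ValueError("rejected input: %r" % word)
    # A list of arguments (no shell=True) avoids shell injection entirely.
    result = subprocess.run(
        command + [word], capture_output=True, text=True, check=True, timeout=10
    )
    return result.stdout.splitlines()


# Stand-in for the real executable: a Python one-liner that echoes its argument.
echo = [sys.executable, "-c", "import sys; print('definition of', sys.argv[1])"]
print(look_up(echo, "apple"))
```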

How do I create a plain-text version of an HTML email?

We send EDMs, and we build them manually. This time, however, the EDM consists of only one big image and some text at the bottom (the image is like a Christmas card). It goes directly to spam, but we have no reason to add more text at the bottom. One approach I've read about is to add a "text-only" version.
But how?
Are there other ways to lower the spam score?
You have not said how you are creating your email, and that will have a bearing on any answers you might get about inserting a text-only version. You could upload your image to a website and send a text-only email with a URL link to it. How you do that depends on your authoring tool.
As for lowering your spam score, have a look at http://www.mailingcheck.com. It is a free service that lets you test the spam score of your email.
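If the email is assembled by a script rather than an authoring tool, a text-only version is usually attached as the first part of a multipart/alternative message. A minimal sketch with Python's standard email package follows; the addresses, subject, and image URL are placeholders, and whether this changes the spam score depends on the filter.

```python
from email.message import EmailMessage


def build_newsletter(sender, recipient, subject, text_body, html_body):
    """Return a message carrying both a plain-text and an HTML version."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(text_body)                      # text/plain part
    message.add_alternative(html_body, subtype="html")  # text/html part
    return message


message = build_newsletter(
    "news@example.com",
    "reader@example.com",
    "Season's greetings",
    "Season's greetings from all of us.\nView the card: https://example.com/card.png\n",
    '<html><body><img src="https://example.com/card.png" alt="Greeting card"></body></html>',
)
print(message.get_content_type())
```

Mail clients that cannot or will not render HTML show the first part, so the plain text should carry the same message as the image.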

Automating book citation search

I have a list of books, listed by title, in a text file. I want to write a script that uses a web service like Google Scholar or Amazon to search for the books and returns an XML or BibTeX file with citation info for each book.
Which programming tools can I use for this kind of automated search?
Python would be my recommendation.
Get names from the text file, simple file reading
Construct a REST URL request to google's book API
http://books.google.com/books/feeds/volumes?q=Elizabeth+Bennet&start-index=21&max-results=10
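Simple Python code to get data from this URL (it may need an API key; I would advise adding error handling around the request, e.g. urllib2 rather than plain urllib on Python 2, or urllib.request with urllib.error on Python 3)
Sample code,
from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen
url = 'http://foo.api.request'  # placeholder URL
data = urlopen(url).read()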
See the return schemas for this API (you can use the XML however you like).
See BibTeXML for conversion between the two formats.
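Put together, the steps above might look like the following sketch. It reuses the feed URL quoted in this answer without checking that the endpoint is still served, and the function names are illustrative. Nothing is requested over the network unless fetch() is called.

```python
import urllib.error
import urllib.parse
import urllib.request

FEED_URL = "http://books.google.com/books/feeds/volumes"  # endpoint quoted above


def read_titles(lines):
    """Return the non-empty, stripped lines (one book title per line)."""
    return [line.strip() for line in lines if line.strip()]


def build_query_url(title, max_results=10):
    """Build the search URL for one title, with proper URL encoding."""
    query = urllib.parse.urlencode({"q": title, "max-results": max_results})
    return FEED_URL + "?" + query


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError) as error:
        print("request failed:", error)
        return None


# With a real file: titles = read_titles(open("books.txt"))
for title in read_titles(["Pride and Prejudice\n", "\n", "Emma\n"]):
    print(build_query_url(title))
```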
HTH
I think it could be useful if you specify what kind of script you want to write!
Anyway, you could do some low-level work and write your own HTTP requests for Google and Amazon, or you could just rely on their APIs, for example: http://code.google.com/apis/books/
There is a great project that does something similar to what you want to do; it's called Shelves. It's written for Android but should give you some ideas about how to handle your requests. Instead of downloading citations, it downloads the cover.
http://code.google.com/p/shelves/
Just as a quick side note, saving your books in an XML file could be an option as well. In some cases it makes parsing them easier.

Any way to get SharePoint to STOP converting attachment file names to all lower case on inbound emails?

I have a document library that is successfully receiving and saving attachments from inbound emails. The problem is that it is taking my nice, mixed-case file names and converting them to all lower case. I have looked on the Internet for a solution and only found a bunch of "me too" people with the same issue. Is there any way to control this behavior? Since the file name becomes the link text it would be nice if SharePoint left it alone.
It's a known bug according to:
http://social.technet.microsoft.com/Forums/en-US/sharepointgeneral/thread/0daedbf4-ddd1-4dd9-b7f0-70c03dc4afc4/
(links to the update packages are there, too)

Resources