Spark Email Processing - apache-spark

Spark Email Processing - apache-spark

We are developing a big data solution in which one requirement is to process incoming emails. The technology stack is not finalized yet but mostly we might go with Sendmail as MTA and Procmail as MDA. We are open to any other very efficient solution.
These emails are essentially carry data in attachments and are not meant for end user, so the email flow ends with Spark processing.
My first thought was it would be great if there was a message queuing system such as Apache-Kafka which could accept emails as messages and then provide them to the client such as Spark on demand but it seems that sort of technology/approach is not available in any of the message brokering systems.
This means we would have to receive emails via SMTP MTA and then extract the information from the MDA.
We could use Procmail to extract the contents of the email and the attachments and put them in a folder per email and then scan the folders and process them in spark.
Alternatively if Spark has any plugins which could pull in emails from an MDA and break it down into it's attachments it would make life much simpler.
If there is any other smarter solution it would be welcome.
So the fundamental question is what technology is available for channelizing emails through Spark for processing. Connectors etc.

Mailgun or Sendgrid incoming email processing is so easy that I could hardly imagine any alternative for a new, especially big, system. I only played with them, but my impression was that my any actual or potential (billions of emails) problem related to emails is solved for good. Not related to Spark, those system just post email content as http POST request to a URL you provide.
Sendgrid used to incorrectly parse encoding, their support ignored my emails and eventually deleted a ticket without solving the problem. Mailgun always returns UTF8 regardless of original encoding. Manual MIME parsing is such a grandiose task itself so it is better to use existing solutions, unless emails are generated by a computer. But even then, IaaS services are so much cheaper than developer time.

Related

trigger a .sh script when a specific subject email is received

Anyway, I have a script that I want to run whenever I receive an email on gmail. And if possible a subject specific email. is such a thing possible and if so, what programs do I need to allow it.

You can't instruct gmail to trigger an external script for you. I think you've got a few basic choices. In order of increasing difficulty and complexity:
1) Configure a gmail filter to deliver your desired messages to a special folder. Write a script to poll that folder, download (or delete or mark as read) messages it finds there, and then launch your local script. Set up a cron on your local machine to run the script every few minutes. You can poll the folder with IMAP or the GMAIL API. IMAP is probably easier. This will be tricky with shell, you're better of with Python, PHP, or similar.
2) Configure a gmail filter to forward your desired messages to an address on a mail server that you control. Use procmail or similar to intercept the incoming messages and launch your script.
3) Set up an account at Mailgun and configure the emails so they get delivered there directly. (Or forward from gmail as in #2.) Configure Mailgun to launch an API request when it receives messages. Build an API handler to receive the request. Launch your process from your API handler.

I have never done it, but I guess the first thing you should do is to take a look at the Google's Gmail API...
What is the Gmail API?
The Gmail API gives you flexible, RESTful access to the user's inbox,
with a natural interface to Threads, Messages, Labels, Drafts, and
History.
It seems to fit what you want - at least, without knowing the details of what you want to do.
The Gmail API can be used in a variety of different applications,
including, typically:
Read-only mail extraction, indexing, and backup
Label management
(add/remove labels)
Automated or programmatic message sending
You can use several programming languages - maybe the trick is using your programming language of choice to write a wrapper for the .sh script... I hope this helps!

Accepting image files via email from any address.

I am trying to build a service where anybody can send an image file from an email address/client and process it. Think about the service a bit like Flickr showing the image in a dashboard that comes via emails
From a usability standpoint this mechanic offers great deal of advantage but I want to understand the security consequences of such an action.Some concerns are:
I need to validate all these files as images
People can probably send a file with an exploit/code that can likely
be a problem. But in my case I am mostly going to do a file open and
save and let the browser show the image
Am I taking the right approach here? Are there serious consequences that I should be of?

Things you should do and take into consideration.
Make sure your mail server is configured for virus scanning, keep it up to date. That'll be the first line of defense.
When the email comes in, attempt to process the image in a known rock solid library.
Be aware that many emails contain multiple images, some of which may have nothing at all to do with the one they are sending. For example, our company emails all include our logo at the bottom. I'm not exactly sure what the solution is here, but you'll want to take it into consideration.
Different email clients handle image attachments, well, differently. Sometimes it's as a normal attachment, sometimes it's embedded in the body. Even within the same client an image might be handled differently depending on if they sent the email as plaint text with attachments or HTML mail.
People will test your system. They'll send .js files, they'll send images whose headers are jacked in order to overflow your image processing library...
Consider enforcing certain email restrictions such as SPF checks.
Be prepared to receive images that are absolutely huge. Today's cameras take very large photos and a lot of people don't know what crop or resize means. You might consider setting a cap of 15MB or larger per email coming into your server. Then, in combination with #2 above, auto resizing images down to something a bit more acceptable.
Determine the mechanism you actually want to use to notify the user of any issues. Bear in mind that this mechanism is subject to abuse. For example, consider a spam message sent to your machine with reply-to headers going to a victim.
If you are using .net, see this for a possible way to confirm a file is an image: How can I determine if a file is an image file in .NET?

I'm not saying this is 100% secure (can you ever be 100% secure?) but here is something that you can try:
Lets say that you have an alias on your postfix (or whatever mail system) that redirects incoming emails to a php/bash/python script for further processing.
The first thing I would do is use an image manipulation library (say imagemagick) and convert all incoming files to a .png format or whatever, and only proceed further with your logic if the conversion is successful.
This way, if someone sends you any malicious attachments (php exploit, jar's, swf's, anything) the conversion will fail, and hence it will be disregarded by your system.
Edit: ImageMagick has the "identify" command which does exactly what you want.

Emails could be easily spoofed as well, which means I can send an email from an email address which doesn't belong to me.
This might help also: Secure way to upload image in PHP ...

automation of tasks - email using web application

I have a web application that monitors farms in certain areas. Right now I am having a problem of performing automation with some of the tasks.
Users of the web application can send reports or checkins using keywords. If the reports or checkins correspond to certain keywords, for example "alert", I need the web application to send an alert to the user via email using that web application. But that alert must be sent two weeks after the date of the report received, and to that particular user only.
Would it be possible to use cron to perform this? If not, can anyone suggest me a workaround?

A possible approach you might consider is to store an entry in a database for each of these reminder emails you need to send, at the time your user does whatever action in your application that determines the need to send that email exists. Include the recipient, the date to be sent, and the email content as content you store for each entry. Schedule a single cron job to run periodically to process these database records by due date, and populate an email template to be sent out. You can then either delete the database records, or a better option, include a column that indicates they were sent and mark them as sent.
It would help to provide which technology stack you're operating on and what the application is developed in. Others might be able to point you to technology specific approaches or pre-built plugins/extensions that already do this for the situation you're in, to help you avoid the need to write your own code for the solution.

Script to check whether all mails replied in Lotus Notes

We use Lotusnotes 6.5 as email client. We wil have around 1600+ mails for 9 hrs. If a mail not checked , we have face serious issues with our client. Can any script can be written to check whether all mails are checked and replied?
Update:
We have already tried moving the mails to another folder.But has this mailbox handled by team of persons, we noticed lot of human error happening like moving a unread mail, sometimes they would have read mail but forget to reply it etc.etc.
So I was looking out for a script solution, will your other options. Also one more thing we do is we cc our mailbox mail id for all outgoing emails to have a track of all replied mails, will this could help in any way to find out which mails was missed?

If you need to track unread marks, I second the aforementioned nsftools solution, which works in Domino 7.x too. However, this is very much Notes ID-dependent. A folder would be better.
Note that 6.5x is well out of support, and that Domino 7.x officially died this week: use something at least vaguely modern!

There's an easier non-programmatic way. Just move the email from the inbox into another folder once the email has been responded to. That is more reliable than any programmatic solution, and keeps your inbox tidy (which will certainly be necessary if you get nearly 200 emails per hour!)
That said, here are some other ideas.
Determining if the document was read
Unread marks are not your friend here, unless you'll be accessing the mail file from the same client. Also they tend to get out of sync and would likely prove unreliable at some point, especially given the number of incoming emails. Instead you'll need to have some information that is saved within the individual mail document, such as the last accessed property or a custom item you manage via scripts/formulas.
You can see if an email has been read by checking the Last Accessed property of the mail document. According to IBM's technote (https://www-304.ibm.com/support/docview.wss?uid=swg21086670), the property will be updated when the document is read.
You could write a script in the QueryOpen event that stamps a value on the document and saves it.
Determining if the email was responded to
First off, I'd suggest you save all sent emails in case you need a record of what was sent to the client. That won't give you a way to see which emails have not been responded to, however.
Instead you could add script to the reply action within the memo form. When someone click's reply it could update the current memo, stamping an item on it to say who replied and at what time, for instance. Then you can create a view to show any emails that don't have that item, and another view to show emails that do grouped by who responded. The second view could even show how many emails each person responded to, something that might be used as a measurement of performance perhaps.

"Unread mark" checking is not exposed in the API.
I did find 2 links, this one is a basic implementation, where as this link does have more robust code and is implemented as an object in LotusScript. It should be compatible with Notes 6.5+.
I found the second link through nsftools website which has lots of great snippets that solve various problems. You should at least be able to detect if a mail has been read or not. Note that it requires making API level calls. You should be able to create a new script library and copy/paste the code into it.

Best Way To Receive Email Website

I am developing a website -- in the prototype stage, soon to be alpha. I will provide an email address to each account that allows the user to deposit stuff -- not a real email account, just an endpoint for sending things to the site. Many sites provide this kind of service nowadays. I think the first one I saw was Photobucket, which let's you send photos as email attachments.
My question is, what is the best way to implement this kind of service?
In my prototype, I have written a POP3 client which fetches all newly delivered mail (currently from a test Gmail account). My service processes each new mail and attachments, and immediately removes it from the email server.
I could certainly outsource to an email service with POP3 and be done with it. The problem is cost. Most services I have seen provide much more than I need, and they charge per account. I expect to have many accounts and low traffic volume.
So I'm leaning towards hosting email receipt myself. I am open to Windows or Linux. The code that processes incoming emails runs on Windows, but I have other services running on Linux. I have seen a number of open source and free email servers, such as hMailServer and MailEnable (Windows) and qmail, Postfix and exim (Linux).
I guess I have a slight preference towards Linux because of lower hosting costs, but if a Windows service can provide cleaner integration, that might be worth it. As far as features, I would like to have some spam filtering, but it's is not a huge priority. POP3 is adequate for retrieval, but a more direct API would be nice. I will need some kind of API for programmatically provisioning new accounts.
All suggestions are appreciated. Do you know how others implement this kind of service?
UPDATE: I ended up using hMailServer, which is a free mail server that runs on Windows. It seems to be quite mature and robust. It has a COM interop library which makes accessing emails, accounts, etc. from my .NET server app very easy indeed.

If you're going the host-your-own-email-server route, I would probably just use POSTFIX and pipe all your email to a PHP script, which processes the email.
Here's a quick'n dirty tutorial on setting up the email pipe if you're using cPanel:
http://kb.siteground.com/article/How_to_pipe_an_email_to_a_PHP_script.html
If not, here's how to do it:
http://answers.google.com/answers/threadview?id=562518

The bottom line is, you need to have an open SMTP connection to accept email. If you have your own server, then you can install a SMTP server on the machine. Usually, you have filesystem access to the location the email files are placed. Be sure to select a SMTP server that allows this, and that the email are in a format that you can parse.
Then, you can just monitor the file location for incoming emails.
If you can't pipe your emails (using the Postfix suggestion), and you don't have your own server (for example, on a shared hosting plan), then you will need to query a POP3 or IMAP mailbox server for your emails, and parse them accordingly.

I wanted to get emails in real time so I worked out my own solution with google app engine. I basically made a small dedicated google app engine app to receive and POST emails to my main site. That way I could avoid having to set up an email server.
You can check out Emailization (a little weekend project I did to do it for you), or you this small GAE app that should do the trick.
I kinda explained it more on another question.
Hope that helps!

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string