How to convert an mbox to a JSON structure? - python-3.x

I am trying to convert an mbox to a JSON structure suitable for import into MongoDB.
I am using the mailbox chapter from Mining the Social Web, Second Edition, but it's not working properly.
import sys
import mailbox
import email
import quopri
import json
import time
from BeautifulSoup import BeautifulSoup
from dateutil.parser import parse

MBOX = 'resources/ch06-mailboxes/data/enron.mbox'
OUT_FILE = MBOX + '.json'

def cleanContent(msg):
    # Decode message from "quoted printable" format, but first
    # re-encode, since decodestring will try to do a decode of its own
    msg = quopri.decodestring(msg.encode('utf-8'))

    # Strip out HTML tags, if any are present.
    # Bail on unknown encodings if errors happen in BeautifulSoup.
    try:
        soup = BeautifulSoup(msg)
    except:
        return ''
    return ''.join(soup.findAll(text=True))

# There's a lot of data to process, and the Pythonic way to do it is with a
# generator. See http://wiki.python.org/moin/Generators.
# Using a generator requires a trivial encoder to be passed to json for object
# serialization.
class Encoder(json.JSONEncoder):
    def default(self, o): return list(o)

# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break
        yield jsonifyMessage(msg)

def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    # The To, Cc, and Bcc fields, if present, could have multiple items.
    # Note that not all of these fields are necessarily defined.
    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r', '')\
                                 .replace(' ', '').decode('utf-8', 'ignore').split(',')

    for part in msg.walk():
        json_part = {}
        if part.get_content_maintype() != 'text':
            print >> sys.stderr, "Skipping MIME content in JSONification ({0})".format(part.get_content_maintype())
            continue
        json_part['contentType'] = part.get_content_type()
        content = part.get_payload(decode=False).decode('utf-8', 'ignore')
        json_part['content'] = cleanContent(content)
        json_msg['parts'].append(json_part)

    # Finally, convert date from asctime to milliseconds since epoch using the
    # $date descriptor so it imports "natively" as an ISODate object in MongoDB
    then = parse(json_msg['Date'])
    millis = int(time.mktime(then.timetuple())*1000 + then.microsecond/1000)
    json_msg['Date'] = {'$date' : millis}

    return json_msg

mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)

# Write each message out as a JSON object on a separate line
# for easy import into MongoDB via mongoimport
f = open(OUT_FILE, 'w')
for msg in gen_json_msgs(mbox):
    if msg != None:
        f.write(json.dumps(msg, cls=Encoder) + '\n')
f.close()
print "All done"
I am getting this error:
80 # for easy import into MongoDB via mongoimport
81
---> 82 f = open(OUT_FILE, 'w')
83 for msg in gen_json_msgs(mbox):
84 if msg != None:
IOError: [Errno 13] Permission denied: 'resources/ch06-mailboxes/data/enron.mbox.json'

The code you mentioned became obsolete in the Third Edition of Mining the Social Web.
I made a workable script that not only converts MBOX to JSON but also extracts the attachments to usable formats.
Link to the repo:
https://github.com/PS1607/mbox-to-json
Read the README file for usage instructions.

It seems that your problem is related to user permissions rather than Python. Line 82 tries to open a file in the "data" folder, but permission was denied. You should try executing your script using the sudo command from a terminal:
sudo python3 <your script name>
This should take care of the error you pointed out.
PS: Python 3 uses print as a function; line 88 should read
print('All done')
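For reference, here is a minimal Python 3 sketch of the same idea using the standard-library mailbox.mbox class (UnixMailbox was removed in Python 3). The paths come from the question; the header and date handling are deliberately simplified, so treat it as a starting point rather than a drop-in replacement:
import json
import mailbox

MBOX = 'resources/ch06-mailboxes/data/enron.mbox'
OUT_FILE = MBOX + '.json'

def message_to_dict(msg):
    # Collect the headers and the decoded text parts into a plain dict.
    # Note: duplicate headers (e.g. Received) collapse to one entry here.
    doc = dict(msg.items())
    doc['parts'] = [
        part.get_payload(decode=True).decode('utf-8', 'ignore')
        for part in msg.walk()
        if part.get_content_maintype() == 'text'
    ]
    return doc

# One JSON object per line, ready for mongoimport.
with open(OUT_FILE, 'w') as f:
    for msg in mailbox.mbox(MBOX):
        f.write(json.dumps(message_to_dict(msg)) + '\n')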

Related

How to use the Google Cloud Translate API for translating bulk data?

I have a csv file of several thousand rows in multiple languages, and I am thinking of using the Google Cloud Translate API to translate the foreign-language text into English. I used a simple piece of code to find out if everything works properly, and it runs smoothly.
from google.cloud import translate_v2 as translate
from time import sleep
from tqdm.notebook import tqdm
import multiprocessing as mp
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "file path.py"
translate_client = translate.Client()
text = "Good Morning, My Name is X."
target ="ja"
output = translate_client.translate(text, target_language=target)
print(output)
I now want to import the csv file (using pandas), translate the text, and save the output as a csv file. But I don't know how I should do that; most of the examples I found stop at translating sample text, just like the above.
Can anyone suggest how I can do this?
To translate the text in a CSV file and save the output to a CSV file using the Google Cloud Translation API, you can use the code below:
import csv
from pathlib import Path

def translate_text(target, text):
    """Translates text into the target language.

    Target must be an ISO 639-1 language code.
    See https://g.co/cloud/translate/v2/translate-reference#supported_languages
    """
    import six
    from google.cloud import translate_v2 as translate

    translate_client = translate.Client()
    if isinstance(text, six.binary_type):
        text = text.decode("utf-8")

    # Text can also be a sequence of strings, in which case this method
    # will return a sequence of results for each text.
    result = translate_client.translate(text, target_language=target)
    # print(u"Text: {}".format(result["input"]))
    # print(u"Translation: {}".format(result["translatedText"]))
    # print(u"Detected source language: {}".format(result["detectedSourceLanguage"]))
    return result["translatedText"]

def main(input_file, translate_to):
    """
    Translate a text file and save as a CSV file
    using Google Cloud Translation API
    """
    input_file_path = Path(input_file)
    target_lang = translate_to
    output_file_path = input_file_path.with_suffix('.csv')

    with open(input_file_path) as f:
        list_lines = f.readlines()
    total_lines = len(list_lines)

    with open(output_file_path, 'w') as csvfile:
        my_writer = csv.writer(csvfile, delimiter=',', quotechar='"')
        my_writer.writerow(['id', 'original_text', 'translated_text'])
        for i, each_line in enumerate(list_lines):
            line_id = f'{i + 1:04}'
            original_text = each_line.strip('\n')  # Strip for the writer(*).
            translated_text = translate_text(target=target_lang, text=each_line)
            my_writer.writerow([line_id, original_text, translated_text])  # (*)
            # Progress monitor, non-essential.
            print(f"""
{line_id}/{total_lines:04}
{original_text}
{translated_text}""")

if __name__ == '__main__':
    origin_file = input('Input text file? >> ')
    output_lang = input('Output language? >> ')
    main(input_file=origin_file, translate_to=output_lang)
Example:
Translating the text in the input file to target language "es"; the output gets stored in the same csv file.
Input:
new.csv
How are you doing,Is everything fine there
Do it today
Output:
new.csv
id,original_text,translated_text
0001,"How are you doing,Is everything fine there",¿Cómo estás? ¿Está todo bien allí?
0002,Do it today,Hazlo hoy
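Since the question mentions pandas, here is a minimal sketch of the same flow with a DataFrame. The file names and the 'text' column name are assumptions for illustration, and it assumes GOOGLE_APPLICATION_CREDENTIALS is already set as in the question:
import pandas as pd
from google.cloud import translate_v2 as translate

translate_client = translate.Client()

df = pd.read_csv('input.csv')  # hypothetical input file with a 'text' column
df['translated_text'] = [
    translate_client.translate(text, target_language='en')['translatedText']
    for text in df['text']
]
df.to_csv('output.csv', index=False)  # hypothetical output file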

How to parse an eml file and extract metadata information

I have an eml file with some attachments. I want to read the text content in the eml file, and I want to extract metadata information like (sender, from, cc, bcc, subject). I also want to download the attachments. With the code below, I am only able to extract the text content of the body of the email.
import email
from email import policy
from email.parser import BytesParser
import glob

file_list = glob.glob('*.eml')  # returns list of files

with open(file_list[2], 'rb') as fp:  # select a specific email file from the list
    msg = BytesParser(policy=policy.default).parse(fp)
text = msg.get_body(preferencelist=('plain')).get_content()
print(text)
There was a module named emaildata, available for Python 2, that did the job.
Extracting Metadata Information
import email
from emaildata.metadata import MetaData

message = email.message_from_file(open('message.eml'))
extractor = MetaData(message)
data = extractor.to_dict()
print data.keys()
Extracting Attachment Information
import email
from emaildata.attachment import Attachment

message = email.message_from_file(open('message.eml'))
for content, filename, mimetype, message in Attachment.extract(message):
    print filename
    with open(filename, 'w') as stream:
        stream.write(content)
    # If message is not None then it is an instance of email.message.Message
    if message:
        print "The file {0} is a message with attachments.".format(filename)
But this library is now deprecated and is of no use. Is there any other library that could extract the metadata and attachment-related information?
Metadata information can be accessed using the code below in Python 3.x:
from email import policy
from email.parser import BytesParser

with open(eml_file, 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)
print('To:', msg['to'])
print('From:', msg['from'])
print('Subject:', msg['subject'])
The remaining header information can be accessed using msg.keys().
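For example, a minimal sketch that collects the fields the question asks about into a dict (eml_file is assumed to be a path to the message):
from email import policy
from email.parser import BytesParser

def extract_metadata(eml_file):
    # Parse the message and keep only the headers that are present.
    with open(eml_file, 'rb') as fp:
        msg = BytesParser(policy=policy.default).parse(fp)
    return {k: str(msg[k]) for k in ('From', 'To', 'Cc', 'Bcc', 'Subject') if msg[k]}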
For downloading attachments from an eml file, you can use the code below:
import sys
import os
import os.path
from collections import defaultdict
from email.parser import Parser

eml_mail = 'your eml file'
output_dir = 'mention the directory where you want the files to be downloaded'

def parse_message(filename):
    with open(filename) as f:
        return Parser().parse(f)

def find_attachments(message):
    """
    Return a tuple of parsed content-disposition dict, message object
    for each attachment found.
    """
    found = []
    for part in message.walk():
        if 'content-disposition' not in part:
            continue
        cdisp = part['content-disposition'].split(';')
        cdisp = [x.strip() for x in cdisp]
        if cdisp[0].lower() != 'attachment':
            continue
        parsed = {}
        for kv in cdisp[1:]:
            key, val = kv.split('=')
            if val.startswith('"'):
                val = val.strip('"')
            elif val.startswith("'"):
                val = val.strip("'")
            parsed[key] = val
        found.append((parsed, part))
    return found

def run(eml_filename, output_dir):
    msg = parse_message(eml_filename)
    attachments = find_attachments(msg)
    print("Found {0} attachments...".format(len(attachments)))
    if not os.path.isdir(output_dir):
        os.mkdir(output_dir)
    for cdisp, part in attachments:
        cdisp_filename = os.path.normpath(cdisp['filename'])
        # prevent malicious crap
        if os.path.isabs(cdisp_filename):
            cdisp_filename = os.path.basename(cdisp_filename)
        towrite = os.path.join(output_dir, cdisp_filename)
        print("Writing " + towrite)
        with open(towrite, 'wb') as fp:
            data = part.get_payload(decode=True)
            fp.write(data)

run(eml_mail, output_dir)
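With policy.default (as in the metadata snippet above), the modern EmailMessage API makes this considerably shorter. A minimal sketch, with the file and directory names as placeholders:
import os
from email import policy
from email.parser import BytesParser

def save_attachments(eml_file, output_dir):
    # Parse with the modern policy so iter_attachments() is available.
    with open(eml_file, 'rb') as fp:
        msg = BytesParser(policy=policy.default).parse(fp)
    os.makedirs(output_dir, exist_ok=True)
    for part in msg.iter_attachments():
        # basename() guards against path components in the attachment name.
        filename = os.path.basename(part.get_filename() or 'unnamed')
        with open(os.path.join(output_dir, filename), 'wb') as out:
            out.write(part.get_payload(decode=True))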
Have a look at ParsEML: it bulk-extracts attachments from all eml files in a directory (originally from Stephan Hügel). I used a modified version of MeIOC to easily extract all metadata in JSON format; if you want, I can share that too.

Unicode characters to Turkish characters

(Edit: my original question is posted here, but the issue has been resolved and the code below is correct.) I am looking for advice on how to convert Unicode characters to Turkish characters. The following code (posted online) scrapes tweets for an individual user and outputs a csv file, but the Turkish characters come out as Unicode escape sequences, e.g. \xc4. I am using Python 3 on a Mac.
import sys

default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)
    sys.setdefaultencoding(default_encoding)

import tweepy  # https://github.com/tweepy/tweepy
import csv
import string

# Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

def get_all_tweets(screen_name):
    # Twitter only allows access to a user's most recent 3240 tweets with this method

    # authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # initialize a list to hold all the tweepy Tweets
    alltweets = []

    # make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)

    # save most recent tweets
    alltweets.extend(new_tweets)

    # save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    # keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        # print "getting tweets before %s" % (oldest)

        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)

        # save most recent tweets
        alltweets.extend(new_tweets)

        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

    # transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text] for tweet in alltweets]

    # write the csv
    with open('%s_tweets.csv' % screen_name, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text"])
        writer.writerows(outtweets)
    pass

if __name__ == '__main__':
    # pass in the username of the account you want to download
    get_all_tweets("")
The csv module docs recommend you specify the encoding when you open the file (and also that you use newline='' so the csv module can do its own handling of newlines). Don't encode Unicode strings yourself when writing rows.
import csv

with open('test.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'created_at', 'text'])
    writer.writerows([[123, 456, 'Äβç']])

How to display proper output when using re.findall() in python?

I made a Python script to get the latest stock price from Yahoo Finance.
import urllib.request
import re
htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG");
htmltext = htmlfile.read();
price = re.findall(b'<span id="yfs_l84_goog">(.+?)</span>',htmltext);
print(price);
It works smoothly, but when I output the price it comes out like this: [b'1,217.04']
This might be a petty issue to ask, but I'm new to Python scripting, so please help me if you can.
I want to get rid of the b. If I remove the b from b'<span id="yfs_l84_goog">' then it shows this error:
File "C:\Python33\lib\re.py", line 201, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
I want the output to be just
1,217.04
b'' is the syntax for bytes literals in Python; it is how you define byte sequences in Python source code.
What you see in the output is the representation of the single bytes object inside the price list returned by re.findall(). You can decode it into a string and print it:
>>> for item in price:
...     print(item.decode())  # assume utf-8
...
1,217.04
You could also write bytes directly to stdout e.g., sys.stdout.buffer.write(price[0]).
You could use an html parser instead of a regex to parse html:
#!/usr/bin/env python3
import cgi
from html.parser import HTMLParser
from urllib.request import urlopen

url = 'http://finance.yahoo.com/q?s=GOOG'

def is_price_tag(tag, attrs):
    return tag == 'span' and dict(attrs).get('id') == 'yfs_l84_goog'

class Parser(HTMLParser):
    """Extract tag's text content from html."""
    def __init__(self, html, starttag_callback):
        HTMLParser.__init__(self)
        self.contents = []
        self.intag = None
        self.starttag_callback = starttag_callback
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        self.intag = self.starttag_callback(tag, attrs)

    def handle_endtag(self, tag):
        self.intag = False

    def handle_data(self, data):
        if self.intag:
            self.contents.append(data)

# download and convert to Unicode
response = urlopen(url)
_, params = cgi.parse_header(response.headers.get('Content-Type', ''))
html = response.read().decode(params['charset'])

# parse html (extract text from the price tag)
content = Parser(html, is_price_tag).contents[0]
print(content)
Check whether Yahoo provides an API that doesn't require web scraping.
Okay, after searching for a while, I found the solution. It works fine for me.
import urllib.request
import re
htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG");
htmltext = htmlfile.read();
pattern = re.compile('<span id="yfs_l84_goog">(.+?)</span>');
price = pattern.findall(str(htmltext));
print(price);
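Note that str(htmltext) stringifies the bytes object (including the b'' wrapper) rather than decoding it. A cleaner variant of the same idea, building on the decode() suggestion above (the utf-8 charset is an assumption):
import re
import urllib.request

htmlbytes = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG").read()
htmltext = htmlbytes.decode('utf-8', 'ignore')  # decode bytes -> str first
price = re.findall('<span id="yfs_l84_goog">(.+?)</span>', htmltext)
print(price[0] if price else 'price tag not found')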

Telnet.read_until function doesn't work

I am trying to telnet to a Brocade router, and the Python script is throwing an error. I am not sure what is wrong here. I have tried debugging but can't get it working. I believe it's a prompt issue. I'd appreciate any suggestion on how to get it to work.
Note: This is Python 3.0
import getpass
import sys
import telnetlib

HOST = "1.1.1.1"
user = "admin"
password = "password"
port = "23"

telnet = telnetlib.Telnet(HOST)
telnet.read_until("sw0 login:,3")
telnet.write(admin + "\r")
if password:
    telnet.read_until("Password: ")
    telnet.write(password + "\n")
tn.write("term len 0" + "\n")
telnet.write("sh ver br\n")
telnet.write("exit\n")
ERROR:
Traceback (most recent call last):
File "C:\Users\milan\Desktop\telnetNew.py", line 13, in <module>
telnet.read_until("Username :,3")
File "C:\Python33\lib\telnetlib.py", line 299, in read_until
return self._read_until_with_select(match, timeout)
File "C:\Python33\lib\telnetlib.py", line 352, in _read_until_with_select
i = self.cookedq.find(match)
TypeError: Type str doesn't support the buffer API
This is my prompt after logging in manually using telnet on port 23, and this is where I expected the command to work.
_________________________________________________________________________________
Network OS (sw0)
xxxxxxxxxx
sw0 login: xxxxx
Password: xxxxx
WARNING: The default password of 'admin' and 'user' accounts have not been changed.
Welcome to the Brocade Network Operating System Software
admin connected from 10.100.131.18 using console on sw0
sw0# sh ver
sw0#
Looking at the docs, it appears that telnetlib wants byte strings, not str. So try this, which should convert everything to bytes as opposed to str:
import sys
import telnetlib

HOST = "1.1.1.1"
user = "admin"
password = "password"
port = "23"

telnet = telnetlib.Telnet(HOST, port)
telnet.read_until(b"login: ")
telnet.write(user.encode('ascii') + b"\n")
telnet.read_until(b"Password: ")
telnet.write(password.encode('ascii') + b"\n")
telnet.write(b"term len 0\n")
telnet.write(b"sh ver br\n")
telnet.write(b"exit\n")
--edit-- I installed Python and tried this against one of my routers. Changing the username/password to my credentials, I was able to log in fine. (I removed the password check and the getpass import, as they were not being used, given that your code has them hardcoded.) It looks like you copied the 2.x example, but 3.x requires the buffer-compatible byte strings. Without the b" prefix I get this:
Traceback (most recent call last):
File "foo.py", line 5, in <module>
telnet.read_until("login: ")
File "/usr/local/Cellar/python32/3.2.5/Frameworks/Python.framework/Versions/3.2/lib/python3.2/telnetlib.py", line 293, in read_until
return self._read_until_with_poll(match, timeout)
File "/usr/local/Cellar/python32/3.2.5/Frameworks/Python.framework/Versions/3.2/lib/python3.2/telnetlib.py", line 308, in _read_until_with_poll
i = self.cookedq.find(match)
TypeError: expected an object with the buffer interface
With the b" prefix I get:
[~] /usr/local/bin/python3.2 foo.py
b'\r\n\r\nThis computer system including all related equipment, network devices\r\n(specifically including Internet access), are provided only for\r\nauthorized use. All computer
which shows it is working. What errors are you getting now?
import telnetlib, socket

class TELNET(object):
    def __init__(self):
        self.tn = None
        self.username = "root"
        self.password = "12345678"
        self.host = "10.1.1.1"
        self.port = 23
        self.timeout = 5
        self.login_prompt = b"login: "
        self.password_prompt = b"Password: "

    def connect(self):
        try:
            self.tn = telnetlib.Telnet(self.host, self.port, self.timeout)
        except socket.timeout:
            print("TELNET.connect() socket.timeout")
        self.tn.read_until(self.login_prompt, self.timeout)
        self.tn.write(self.username.encode('ascii') + b"\n")
        if self.password:
            self.tn.read_until(self.password_prompt, self.timeout)
            self.tn.write(self.password.encode('ascii') + b"\n")

    def write(self, msg):
        self.tn.write(msg.encode('ascii') + b"\n")
        return True

    def read_until(self, value):
        return self.tn.read_until(value)

    def read_all(self):
        try:
            return self.tn.read_all().decode('ascii')
        except socket.timeout:
            print("read_all socket.timeout")
            return False

    def close(self):
        self.tn.close()
        return True

    def request(self, msg):
        self.__init__()
        self.connect()
        if self.write(msg) == True:
            self.close()
            resp = self.read_all()
            return resp
        else:
            return False

telnet = TELNET()

# call request function
resp = telnet.request('ps www')  # it will return the ps output
print(resp)
This code works on Python 3.
Most of these answers are great in-depth explanations, and I found them a little hard to follow. The short answer is that the sample code on the Python 3.x site is wrong.
To fix this, instead of
telnet.read_until("Username :,3")
use
telnet.read_until(b"Username :,3")
Like the other answers say, the problem is that telnetlib expects byte strings. In Python 2 the str type is a byte string (a string of binary data), whereas in Python 3 it is a Unicode string, and a new bytes type represents binary data in the same way str did in Python 2.
The upshot of this is that you need to convert your str data into bytes data when using telnetlib in Python 3. I have some code that works in both Python 2 and Python 3 that I thought worth sharing.
class Telnet(telnetlib.Telnet, object):
    if sys.version > '3':
        def read_until(self, expected, timeout=None):
            expected = bytes(expected, encoding='utf-8')
            received = super(Telnet, self).read_until(expected, timeout)
            return str(received, encoding='utf-8')

        def write(self, buffer):
            buffer = bytes(buffer, encoding='utf-8')
            super(Telnet, self).write(buffer)

        def expect(self, list, timeout=None):
            for index, item in enumerate(list):
                list[index] = bytes(item, encoding='utf-8')
            match_index, match_object, match_text = super(Telnet, self).expect(list, timeout)
            return match_index, match_object, str(match_text, encoding='utf-8')
You then instantiate the new Telnet class instead of telnetlib.Telnet. The new class overrides methods in telnetlib to convert from str to bytes and bytes to str, so the problem just goes away.
Of course, I've only overridden the methods I was using, but it should be clear from the example how to extend the idea to other methods. Basically, you need to use
bytes(str_string, encoding='utf-8')
and
str(bytes_string, encoding='utf-8')
My code is valid Python 2 and 3, so it will work unmodified with either interpreter. In Python 2 the code that overrides the methods is skipped over, effectively ignored. If you don't care about backwards compatibility, you can remove the condition that checks the Python version.
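For example, a hypothetical usage sketch of the wrapper above (the host, credentials, and prompts are placeholders, not taken from the question):
import sys
import telnetlib

# ... Telnet subclass from above ...

tn = Telnet('10.1.1.1', 23, 5)      # inherited telnetlib.Telnet constructor
tn.read_until('login: ', 5)         # plain str works in both Python 2 and 3
tn.write('admin\n')                 # placeholder username
tn.read_until('Password: ', 5)
tn.write('secret\n')                # placeholder password
tn.write('sh ver br\n')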
