Unable to Parse Email in Python - python-3.x

I have a set of .msg files stored on the E:\ drive that I have to read and extract some information from.
For that I am using the below code in Python 3.6.
from email.parser import Parser

with open(r"E:\Downloads\Test1.msg", encoding="ISO-8859-1") as fp:
    headers = Parser().parse(fp)

print('To: %s' % headers['To'])
print('From: %s' % headers['From'])
print('Subject: %s' % headers['Subject'])
The output I get is below.
To: None
From: None
Subject: None
Process finished with exit code 0
I am not getting the actual values in the To, From and Subject fields.
Any thoughts on why it is not printing the actual values?
My sample .msg file looks like this:
From: Bournemouth.wmt@gmail.com
To: Francis.dell@gmail.com
Subject: orderid: ord1234, circtid: cr1234
Charges:
Annual Charge - 10
Excess Charges - 5
From this message I am trying to extract the order id and circuit id from the subject, and the charges from the mail body.
Thanks

This is the body of the file that you posted on pastebin for us.
From: ratankumar.shivratri@TechM.com <ratankumar.shivratri@TechM.com>
Sent: Thursday, January 4, 2018 11:58 AM
To: Ratankumar Shivratri
Subject: Cct Id: ONE211, eCo order No: 1CTRP
Charges:
Annual rental - 2,125.00
Maintenance charge - 0.00
Regards
Ratan.
I've been able to obtain data from the headers using the following code.
>>> from email.parser import Parser
>>> p = Parser()
>>> msg = p.parse(open('ratan.msg'))
>>> msg['To']
'Ratankumar Shivratri'
>>> msg['From']
'ratankumar.shivratri@TechM.com <ratankumar.shivratri@TechM.com>'
>>> msg['Subject']
'Cct Id: ONE211, eCo order No: 1CTRP\n '
So that much works.
The next problem I foresee is that the format of the subject headers seems to be inconsistent across messages. For instance, in the message in your question, the subject header is 'orderid: ord1234, circtid: cr1234' but in this message it's 'Cct Id: ONE211, eCo order No: 1CTRP'. You want to be able to recover 'order id, circuit id' from messages but these items don't appear in every message.
If they did you could probably ferret them out with a regex.
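For instance, a minimal sketch covering the two subject formats seen so far (the label alternatives in the patterns are assumptions; real mail may need more spellings):

import re

subject = 'orderid: ord1234, circtid: cr1234'
body = 'Charges:\nAnnual Charge - 10\nExcess Charges - 5'

# Match loosely on "order ..." / "circt/cct/circuit ..." style labels.
order = re.search(r'order\s*(?:id|no)\s*:\s*([^,\n]+)', subject, re.IGNORECASE)
circuit = re.search(r'(?:circt|cct|circuit)\s*id\s*:\s*([^,\n]+)', subject, re.IGNORECASE)
print(order.group(1) if order else None)      # ord1234
print(circuit.group(1) if circuit else None)  # cr1234

# Charges: every "<label> - <amount>" line after the "Charges:" marker.
charges = re.findall(r'^(.+?)\s*-\s*([\d,.]+)\s*$', body.split('Charges:')[1], re.MULTILINE)
print(charges)  # [('Annual Charge', '10'), ('Excess Charges', '5')]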

Related

Avoiding cartesian when adding unique classifier to a list in python 3

I have 5 .csv files I am importing and all contain emails:
import pandas as pd

Donors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Donors Q1 2021 R12.csv",
                     usecols=["Email Address"])
Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Activists Q1 2021 R12.csv",
                        usecols=["Email"])
Low_Level_Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Low Level Activists Q1 2021 R12.csv",
                                  usecols=["Email"])
Ambassadors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Ambassadors Q1 2021.csv",
                          usecols=["Email Address"])
Volunteers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Volunteers Q1 2021 R12.csv",
                         usecols=["Email Address"])
Followers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Followers Q1 2021 R12.csv",
                        usecols=["Email"])
While I am only importing emails (annoyingly under two different column names because of the systems they originate from), I am adding the import name as a classifier - i.e. Donors, Volunteers, etc.
Donors['Value'] = "Donors"
Activists['Value'] = "Activists"
Low_Level_Activists['Value'] = "Low_Level_Activists"
Ambassadors['Value'] = "Ambassadors"
Volunteers['Value'] = "Volunteers"
Followers['Value'] = "Followers"
I then concatenate all the files and handle the naming issue. I am sure there is a more elegant way to do this but here's what I have:
S1 = pd.concat([Donors, Activists, Low_Level_Activists, Ambassadors, Volunteers, Followers], ignore_index=True)
S1['Handle'] = S1['Email Address'].where(S1['Email Address'].notnull(), S1['Email'])
S1 = S1.drop(['Email', 'Email Address'], axis=1)
print(S1['Handle'].count())  # checks full count
The total on that last line is 166,749
Here is my problem. I need to filter the emails for uniques - easy enough using .nunique() - but I also need to carry the classifier. So if a single email is both a Donor and an Activist, I pull both rows when I try to merge the unique values with the classifier.
I have been at this for many hours (and to the end of the Internet!) and can't seem to find a workable solution. I've tried dictionary for-loops, merges, etc. ad infinitum. The unique email count is 165,923 (verified via both Python and Excel).
Essentially I want to pull the earliest classifier in my list on a match. So if an email is a Donor and an Activist -> call them a Donor. Or if an email is a Volunteer and a Follower -> call them a Volunteer, on one email record.
Any help would be greatly appreciated.
I'll give it a try with some made-up data:
import pandas as pd
fa = pd.DataFrame([['paul@mail.com', 'Donors'], ['max@mail.com', 'Donors']], columns=['Handle', 'Value'])
fb = pd.DataFrame([['paul@mail.com', 'Activists'], ['annie@mail.com', 'Activists']], columns=['Handle', 'Value'])
S1 = pd.concat([fa, fb])
print(S1)
gives
           Handle      Value
0   paul@mail.com     Donors
1    max@mail.com     Donors
0   paul@mail.com  Activists
1  annie@mail.com  Activists
You can group by Handle and then pick any Value you like, e.g. the first:
for handle, group in S1.groupby('Handle'):
    print(handle, group.reset_index().loc[0, 'Value'])
gives
annie@mail.com Activists
max@mail.com Donors
paul@mail.com Donors
or collect all roles of a person:
for handle, group in S1.groupby('Handle'):
    print(handle, group.Value.unique())
gives
annie@mail.com ['Activists']
max@mail.com ['Donors']
paul@mail.com ['Donors' 'Activists']
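Since the original requirement was that the earliest classifier in the list wins, here is one sketch of that (the priority order is an assumption taken from the order in the question): make Value an ordered categorical, sort, and keep the first row per Handle.

# Assumed priority, highest first (from the order listed in the question).
priority = ['Donors', 'Activists', 'Low_Level_Activists',
            'Ambassadors', 'Volunteers', 'Followers']
S1['Value'] = pd.Categorical(S1['Value'], categories=priority, ordered=True)

# Sorting by Value puts the highest-priority role first within each Handle,
# so drop_duplicates keeps exactly that row.
unique = S1.sort_values('Value').drop_duplicates(subset='Handle', keep='first')
print(unique)

With the made-up data above, paul@mail.com comes out as a Donor, not an Activist.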

How to fetch only parts of json file in python3 requests module

So, I am writing a program in Python to fetch data from the Google Classroom API using the requests module. I am getting the full JSON response from the classroom as follows:
{'announcements': [{'courseId': '#############', 'id': '###########', 'text': 'This is a test','state': 'PUBLISHED', 'alternateLink': 'https://classroom.google.com/c/##########/p/###########', 'creationTime': '2021-04-11T10:25:54.135Z', 'updateTime': '2021-04-11T10:25:53.029Z', 'creatorUserId': '###############'}, {'courseId': '############', 'id': '#############', 'text': 'Hello everyone', 'state': 'PUBLISHED', 'alternateLink': 'https://classroom.google.com/c/#############/p/##################', 'creationTime': '2021-04-11T10:24:30.952Z', 'updateTime': '2021-04-11T10:24:48.880Z', 'creatorUserId': '##############'}, {'courseId': '##################', 'id': '############', 'text': 'Hello everyone', 'state': 'PUBLISHED', 'alternateLink': 'https://classroom.google.com/c/##############/p/################', 'creationTime': '2021-04-11T10:23:42.977Z', 'updateTime': '2021-04-11T10:23:42.920Z', 'creatorUserId': '##############'}]}
I was actually unable to convert this into a pretty format, so I am just pasting it as I got it from the HTTP request. What I actually wish to do is request only the first few announcements (say 1, 2, or 3, depending upon the requirement), while what I'm getting is every announcement (as in the sample, 3 announcements) made since the classroom was created. I believe that fetching all the announcements might make the program slower, so I would prefer to get only the required ones. Is there any way to do this by passing some arguments or anything? There are a few direct functions provided by Google Classroom, however I came across those a little later and have already written everything using the requests module; switching would require changing a lot of things, which I would like to avoid. However, if unavoidable, I would go that route as well.
Answer:
Use the pageSize field to limit the number of responses you want in the announcements.list request, with an orderBy parameter of updateTime asc.
More Information:
As per the documentation:
orderBy: string
Optional sort ordering for results. A comma-separated list of fields with an optional sort direction keyword. Supported field is updateTime. Supported direction keywords are asc and desc. If not specified, updateTime desc is the default behavior. Examples: updateTime asc, updateTime
and:
pageSize: integer
Maximum number of items to return. Zero or unspecified indicates that the server may assign a maximum.
So, let's say you want the first 3 announcements for a course, you would use a pageSize of 3, and an orderBy of updateTime asc:
# Copyright 2021 Google LLC.
# SPDX-License-Identifier: Apache-2.0
from googleapiclient.discovery import build

service = build('classroom', 'v1', credentials=creds)
order_by = "updateTime asc"
page_size = 3

# Call the Classroom API (courseId is required; course_id is the course to query)
results = service.courses().announcements().list(
    courseId=course_id, pageSize=page_size, orderBy=order_by).execute()
or an HTTP request example:
GET https://classroom.googleapis.com/v1/courses/[COURSE_ID]/announcements
?orderBy=updateTime%20asc
&pageSize=3
&key=[YOUR_API_KEY] HTTP/1.1
Authorization: Bearer [YOUR_ACCESS_TOKEN]
Accept: application/json
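Either way, the announcements come back under the announcements key of the JSON response (matching the payload you posted), so reading off just the fields you need looks like:

announcements = results.get('announcements', [])
for announcement in announcements:
    print(announcement['id'], announcement['updateTime'], announcement['text'])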
References:
Method: announcements.list | Classroom API | Google Developers

How to reconstruct original text from spaCy tokens, even in cases with complicated whitespacing and punctuation

' '.join(token_list) does not reconstruct the original text in cases with multiple whitespaces and punctuation in a row.
For example:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizerSpaCy = Tokenizer(nlp.vocab)
context_text = 'this is a test \n \n \t\t test for \n testing - ./l \t'
contextSpaCyToksSpaCyObj = tokenizerSpaCy(context_text)
spaCy_toks = [i.text for i in contextSpaCyToksSpaCyObj]
reconstruct = ' '.join(spaCy_toks)
reconstruct == context_text
>False
Is there an established way of reconstructing original text from spaCy tokens?
An established answer should work with this edge-case text:
" UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05795279 Date: 01/07/2016\n\n\n RELEASE IN PART\n B5, B6\n\n\n\n\nFrom: H <hrod17#clintonemail.com>\nSent: Monday, July 23, 2012 7:26 AM\nTo: 'millscd #state.gov'\nCc: 'DanielJJ#state.gov.; 'hanleymr#state.gov'\nSubject Re: S speech this morning\n\n\n\n Waiting to hear if Monica can come by and pick up at 8 to take to Josh. If I don't hear from her, can you send B5\nsomeone else?\n\n Original Message ----\nFrom: Mills, Cheryl D [MillsCD#state.gov]\nSent: Monday, July 23, 2012 07:23 AM\nTo: H\nCc: Daniel, Joshua J <Daniel1.1#state.gov>\nSubject: FW: S speech this morning\n\nSee below\n\n B5\n\ncdm\n\n Original Message\nFrom: Shah, Rajiv (AID/A) B6\nSent: Monday, July 23, 2012 7:19 AM\nTo: Mills, Cheryl D\nCc: Daniel, Joshua.'\nSubject: S speech this morning\n\nHi cheryl,\n\nI look fwd to attending the speech this morning.\n\nI had one last minute request - I understand that in the final version there is no reference to the child survival call to\naction, but their is a reference to family planning efforts. Could you and josh try to make sure there is some specific\nreference to the call to action?\n\nAlso, in terms of acknowledgements it would be good to note torn friedan's leadership as everyone is sensitive to our ghi\ntransition and we want to continue to send the usaid-pepfar-cdc working together public message. I don't know if he is\nthere, but wanted to flag.\n\nLook forward to it.\n\nRaj\n\n\n\n\n UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05795279 Date: 01/07/2016\n\x0c"
You can very easily accomplish this by changing two lines in your code:
spaCy_toks = [i.text + i.whitespace_ for i in contextSpaCyToksSpaCyObj]
reconstruct = ''.join(spaCy_toks)
Basically, each token in spaCy knows whether it is followed by whitespace or not. So you call token.whitespace_ instead of joining them on space by default.
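As an aside, the Doc object you already created keeps the verbatim input string in its text attribute, so the round trip can also be checked without re-joining tokens:

# Doc.text stores the original string, whitespace and all.
contextSpaCyToksSpaCyObj.text == context_text  # True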

Dictionary text file Python

My text file looks like this:
Donald Trump:
791697302519947264,1477604720,Ohio USA,Twitter for iPhone,5251,1895
Join me live in Springfield, Ohio!
Lit
<<<EOT
781619038699094016,1475201875,United States,Twitter for iPhone,31968,17246
While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney...
<<<EOT
def read(text):
    with open(text, 'r') as f:
        for line in f:
            # ... stuck here: how do I separate the information for each candidate?
Is there a way that I can separate the information for each candidate? So, for example, for Donald Trump it should be:
[
    ['Donald Trump'],
    [[791697302519947264, 1477604720, 'Ohio USA', 'Twitter for iPhone', 5251, 1895],
     ['Join me live in Springfield, Ohio! Lit']],
    [[781619038699094016, 1475201875, 'United States', 'Twitter for iPhone', 31968, 17246],
     ['While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney...']]
]
The format of the file is the following:
ID,DATE,LOCATION,SOURCE,FAVORITE_COUNT,RETWEET_COUNT followed by the text (the tweet)
So basically, after the 6 headings, everything that follows is a tweet until '<<<EOT'.
Also, is there a way I can do this for every candidate in the file?
I'm not sure why you need a multi-dimensional list (I would pick tuples and dictionaries if possible) but this seems to produce the output you asked for:
>>> txt = """Donald Trump:
... 791697302519947264,1477604720,Ohio USA,Twitter for iPhone,5251,1895
... Join me live in Springfield, Ohio!
... Lit
... <<<EOT
... 781619038699094016,1475201875,United States,Twitter for iPhone,31968,17246
... While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney...
... <<<EOT
... Another Candidate Name:
... 12312321,123123213,New York USA, Twitter for iPhone,123,123
... This is the tweet text!
... <<<EOT"""
>>>
>>>
>>> buffer = []
>>> tweets = []
>>>
>>> for line in txt.split("\n"):
...     if not line.startswith("<<<EOT"):
...         buffer.append(line)
...     else:
...         if buffer[0].strip().endswith(":"):
...             tweets.append([buffer.pop(0).rstrip().replace(":", "")])
...         metadata = buffer.pop(0).split(",")
...         tweet = [" ".join(line for line in buffer).replace("\n", " ")]
...         tweets.append([metadata, tweet])
...         buffer = []
...
>>>
>>> from pprint import pprint
>>>
>>> pprint(tweets)
[['Donald Trump'],
 [['791697302519947264',
   '1477604720',
   'Ohio USA',
   'Twitter for iPhone',
   '5251',
   '1895'],
  ['Join me live in Springfield, Ohio! Lit']],
 [['781619038699094016',
   '1475201875',
   'United States',
   'Twitter for iPhone',
   '31968',
   '17246'],
  ['While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney... ']],
 ['Another Candidate Name'],
 [['12312321',
   '123123213',
   'New York USA',
   ' Twitter for iPhone',
   '123',
   '123'],
  ['This is the tweet text!']]]
>>>
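To plug this into the read() function from the question, a minimal sketch (assuming the file follows the same layout as the sample above):

def read(text):
    tweets, buffer = [], []
    with open(text, 'r') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line.startswith("<<<EOT"):
                buffer.append(line)
            else:
                # A line ending in ":" starts a new candidate block.
                if buffer[0].strip().endswith(":"):
                    tweets.append([buffer.pop(0).rstrip().replace(":", "")])
                metadata = buffer.pop(0).split(",")
                tweets.append([metadata, [" ".join(buffer)]])
                buffer = []
    return tweets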
I am not quite understanding... but here is my example to read a file line by line, then add each line to a string of text to post to Twitter.
candidates = open("C:\\users\\fox\\desktop\\candidates.txt")  # file path with doubled backslashes
for candidate in candidates:
    candidate = candidate.rstrip('\n')  # removes the newline (this is mandatory)
    # the next line posts to Twitter
    post("propaganda here " + candidate + " more propaganda")
Note: for every line in that file this code will post to Twitter, e.g. 20 lines means twenty Twitter posts.

How to search tweets from an id to another id

I'm trying to get tweets using TwitterSearch in Python3.
So basically I want to get all tweets between these 2 IDs.
748843914254249984 -> 760065085616250880
These two IDs correspond to the dates Fri Jul 01 11:41:16 +0000 2016 and Mon Aug 01 10:50:12 +0000 2016.
So here's the code I made.
crawl.py
#!/usr/bin/python3
# coding: utf-8
from TwitterSearch import *
import datetime

def crawl():
    try:
        tso = TwitterSearchOrder()
        tso.set_keywords(["keyword"])
        tso.set_since_id(748843914254249984)
        tso.set_max_id(760065085616250880)

        ACCESS_TOKEN = xxx
        ACCESS_SECRET = xxx
        CONSUMER_KEY = xxx
        CONSUMER_SECRET = xxx

        ts = TwitterSearch(
            consumer_key=CONSUMER_KEY,
            consumer_secret=CONSUMER_SECRET,
            access_token=ACCESS_TOKEN,
            access_token_secret=ACCESS_SECRET
        )

        for tweet in ts.search_tweets_iterable(tso):
            print(tweet['id_str'], '-', tweet['created_at'])
    except TwitterSearchException as e:
        print(e)

if __name__ == '__main__':
    crawl()
I'm not very familiar with the Twitter API and searching with it, but this code should do the job.
However, it gives:
760058064816988160 - Mon Aug 01 10:22:18 +0000 2016
[...]
760065085616250880 - Mon Aug 01 10:50:12 +0000 2016
Many, many times... like I'm getting the same lines over and over again instead of everything between my two IDs.
So I'm not getting any of the July tweets. Any idea why?
TL;DR
Remove the tso.set_max_id(760065085616250880) line.
Explanation (as far as I understand it)
I have found your problem in the TwitterSearch Docs:
"The only parameter with a default value is count with 100. This is because it is the maximum of tweets returned by this very Twitter API endpoint."
If I check this in your code by creating a search URL, I get:
tso.create_search_url()
#?q=Vuitton&since_id=748843914254249984&count=100&max_id=760065085616250880
which contains count=100 (meaning it will fetch only the first page of 100 tweets). In contrast, removing set_since_id and set_max_id produces a URL that also has count=100, yet the search retrieves many more tweets; with set_max_id present, it stops at 100 tweets.
set_since_id without set_max_id works; the other way around doesn't. So removing max_id=760065085616250880 from the search URL produces the results you want.
If anyone can explain why set_max_id on its own doesn't work, please edit my answer.
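If you still need the hard upper bound after dropping set_max_id, one workaround (a client-side sketch, not a TwitterSearch feature) is to filter inside the loop:

MAX_ID = 760065085616250880

for tweet in ts.search_tweets_iterable(tso):
    # since_id still enforces the lower bound on the API side;
    # skip anything newer than the desired upper bound here.
    if tweet['id'] > MAX_ID:
        continue
    print(tweet['id_str'], '-', tweet['created_at'])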
