I really need your help in solving a problem! Apparently, my knowledge is not sufficient to find a solution.
So, I have some .msg files that I have already created and saved. Now I need to write a function that can help me create PDFs from these .msg files (there will be many of them).
I'd be very grateful for your help!
Posting the solution which worked for me (as asked by Amey P Naik). As mentioned, I tried multiple modules, but only extract_msg worked for the case in hand. I created two functions for importing the Outlook message text and attachments into a Pandas DataFrame: the first creates one folder per email message, and the second imports the data from the messages into a DataFrame. Attachments need to be processed separately with a for loop over the sub-directories in the parent directory. Below are the two functions I created, with comments:
# 1). Import the required modules and set up the working directory
import extract_msg
import os
import pandas as pd

direct = os.getcwd()  # directory to be passed to the function for accessing emails; this is where all the .msg files are stored
ext = '.msg'  # type of files in the folder to be read
# 2). Create a separate folder per email and extract the data
def content_extraction(directory, extension):
    for mail in os.listdir(directory):
        try:
            if mail.endswith(extension):
                msg = extract_msg.Message(os.path.join(directory, mail))  # create a local 'msg' object for each email in the directory
                msg.save()  # creates a separate folder per email inside the parent folder, saves a text file with the email body and downloads all attachments into that folder
        except (UnicodeEncodeError, AttributeError, TypeError) as e:
            pass  # some emails are not processed due to different formats, e.g. emails sent from mobile

content_extraction(direct, ext)
# 3). Import the data into a Pandas DataFrame using the extract_msg module.
# Note: this will not import data from the sub-folders created inside the parent directory;
# it extracts the information from the .msg files themselves. You can use a loop instead
# to import data directly from the files saved in the sub-folders.
def DataImporter(directory, extension):
    my_list = []
    for i in os.listdir(directory):
        try:
            if i.endswith(extension):
                msg = extract_msg.Message(os.path.join(directory, i))
                my_list.append([msg.filename, msg.sender, msg.to, msg.date, msg.subject, msg.body, msg.message_id])  # these are built-in attributes of the extract_msg.Message class
        except (UnicodeEncodeError, AttributeError, TypeError) as e:
            pass
    global df
    df = pd.DataFrame(my_list, columns=['File Name', 'From', 'To', 'Date', 'Subject', 'MailBody Text', 'Message ID'])
    print(df.shape[0], ' rows imported')

DataImporter(direct, ext)
After running these two functions, you will have almost all of the information inside a Pandas DataFrame, which you can use as per your need. If you also need to extract content from attachments, you need to loop over all the sub-directories inside the parent directory and read the attachment files according to their format; in my case the formats were .pdf, .jpg, .png, .csv etc. Getting data from these formats requires different techniques, e.g. getting text out of a scanned PDF will need an OCR module such as pytesseract. A rough sketch of such a loop is shown below.
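This is only a minimal, untested sketch of that idea, not part of the original solution: it assumes the sub-folders were created by msg.save() inside the same parent directory (direct from above), and the per-format handling (pandas for .csv, OCR for images) is just an illustrative assumption.

import os
import pandas as pd

attachment_frames = []
for sub in os.listdir(direct):
    sub_path = os.path.join(direct, sub)
    if not os.path.isdir(sub_path):
        continue  # skip the .msg files themselves; only visit the folders msg.save() created
    for fname in os.listdir(sub_path):
        fpath = os.path.join(sub_path, fname)
        if fname.lower().endswith('.csv'):
            attachment_frames.append(pd.read_csv(fpath))  # tabular attachments go straight into DataFrames
        elif fname.lower().endswith(('.jpg', '.png')):
            # image attachments would need OCR, e.g. pytesseract.image_to_string(Image.open(fpath))
            pass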
If you find an easier way to extract content from attachments, please post your solution here for future reference. If you have any questions, please comment, and if there is any scope for improvement in the above code, please feel free to highlight it.
I am trying to loop through multiple folders and subfolders in an Azure Blob container and read multiple XML files.
E.g. I have files in a YYYY/MM/DD/HH/123.xml format.
Similarly, I have multiple sub-folders under month, date and hour, and multiple XML files at the end.
My intention is to loop through all these folders and read the XML files. I have tried a few Pythonic approaches which did not give me the intended result. Can you please help me with any ideas for implementing this?
import glob, os
for filename in glob.iglob('2022/08/18/08/225.xml'):
    if os.path.isfile(filename):  # code does not enter the for loop
        print(filename)
import os
dir = '2022/08/19/08/'
r = []
for root, dirs, files in os.walk(dir):  # code not moving past this for loop, no exception
    for name in files:
        filepath = root + os.sep + name
        if filepath.endswith(".xml"):
            r.append(os.path.join(root, name))
return r
glob is a Python function and it won't recognize the blob folder path directly when the code runs in PySpark; you have to give the path from the root. Also, make sure to specify recursive=True.
For example, I checked both your PySpark glob code and the os code above in Databricks, and both returned no results, because we need to give the absolute root, i.e. start the path from the root folder.
glob code:
import glob, os
for file in glob.iglob('/path_from_root_to_folder/**/*.xml', recursive=True):
    print(file)
For me in Databricks the root to access is /dbfs, and I used CSV files in my repro.
Using os, the same idea works once the path starts from that root: the blob files from all folders and subfolders get listed.
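A minimal sketch of that os.walk variant, assuming a container mounted under /dbfs/mnt/container_mount (the mount name and year folder are assumptions, not from the original question):

import os

xml_files = []
for root, dirs, files in os.walk('/dbfs/mnt/container_mount/2022'):  # path must start from the root, /dbfs on Databricks
    for name in files:
        if name.endswith('.xml'):
            xml_files.append(os.path.join(root, name))
print(xml_files)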
I used Databricks for my repro after mounting the container. Wherever you are trying this code in PySpark, make sure you are giving the root of the folder in the path, and when using glob, set recursive=True as well.
There is an easier way to solve this problem with PySpark!
The tough part is that all the files have to have the same format. In the Azure Databricks sample directory, there is a /cs100 folder that has a bunch of files that can be read in as text (line by line).
The trick is the option called "recursiveFileLookup". It assumes the directories were created by Spark; you cannot mix and match files.
I added the name of the input file as a column of the dataframe. Last but not least, I converted the dataframe to a temporary view.
Looking at a simple aggregate query, we have 10 unique files, and the biggest has a little more than 1M records.
If you need to cherry pick files for a mixed directory, this method will not work.
However, I think that is an organizational cleanup task, versus easy reading one.
Last but not least, use the correct formatter to read XML.
spark.read.format("com.databricks.spark.xml")
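As a rough sketch of how these pieces could fit together (the mount path is an assumption; the files are read as text here, as described in the answer, and the spark-xml reader mentioned at the end requires the com.databricks.spark.xml library attached to the cluster):

from pyspark.sql.functions import input_file_name

# Recursively read every file under the root, keep the source file name, and expose it as a view.
df = (spark.read
      .option("recursiveFileLookup", "true")   # walk the YYYY/MM/DD/HH sub-folders
      .text("/mnt/container_mount/2022/")      # assumed root of the mounted container
      .withColumn("source_file", input_file_name()))
df.createOrReplaceTempView("xml_files")
spark.sql("SELECT source_file, COUNT(*) AS records FROM xml_files GROUP BY source_file").show()

# For proper XML parsing, the answer suggests the spark-xml reader instead of .text(), e.g.:
# spark.read.format("com.databricks.spark.xml").option("rowTag", "record").load(path)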
I frequent a real estate website that shows recent transactions, from which I will download data to parse within a Pandas dataframe. Everything about this dataset remains identical every time I download it (regarding the column names, that is).
The name of the Excel output may change, though. For example, if I have already downloaded a few of these into my Downloads folder, the exported file may read "Generic_File_(3)" or "Generic_File_(21)" if I already have a few older "Generic_File" exports in that folder from previous exports.
Ideally, I'd like my workflow to look like this: export this Excel file of real estate sales, then run a Python script to read the most recent export into a Pandas dataframe. The catch is, I don't want to have to go in and change the filename in the script to match the appended number of the Excel export every time. I want the pd.read_excel method to simply read the "Generic_File" that is appended with the largest number (which will obviously correspond to the most recent export).
I suppose I could always just delete old exports out of my Downloads folder so the newest, freshest export is always named the same ("Generic_File", in this case), but I'm looking for a way to ensure I don't have to do this. Are wildcards the best path forward, or is there some other method to always read in the most recently downloaded Excel file from my Downloads folder?
I would use the os package and create a method to read the file names in the Downloads folder. By parsing the string filenames you could then find the file that follows your specified format and has the highest copy number. Something like the following might help you get started.
import os
downloads = os.listdir('C:/Users/[username here]/Downloads/')
is_file = [True if '.' in item else False for item in downloads]
files = [item for keep, item in zip(is_file, downloads) if keep]
# ** INSERT CODE HERE TO IDENTIFY THE FILE OF INTEREST **
Regex might be the best way to find matches if you have a diverse listing of files in your downloads folder.
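As a rough, untested sketch of that missing piece (the "Generic_File" base name, the .xlsx extension and the Downloads path are assumptions taken from the question):

import os
import re
import pandas as pd

downloads = 'C:/Users/[username here]/Downloads/'
pattern = re.compile(r'^Generic_File(?:_\((\d+)\))?\.xlsx$')  # matches Generic_File.xlsx and Generic_File_(N).xlsx

best_num, best_file = -1, None
for item in os.listdir(downloads):
    match = pattern.match(item)
    if match:
        num = int(match.group(1)) if match.group(1) else 0  # the bare Generic_File.xlsx counts as 0
        if num > best_num:
            best_num, best_file = num, item

df = pd.read_excel(os.path.join(downloads, best_file))  # highest number, i.e. the most recent export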
I was able to parse the body of the emails present in a particular directory, but the code reads all the threads each email contains. The code I used to read the files from the directory is as follows. How can I get only the top 3 threads present in an email?
# reading multiple .msg files using python
from pathlib import Path
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

# Assuming E:\emails is the directory containing the files
for p in Path(r'E:\emails').iterdir():
    if p.is_file() and p.suffix == '.msg':
        msg = outlook.OpenSharedItem(p)
        print(msg.Body)
        print('-------------------------------')
The Outlook object model doesn't provide any method or property for that. You need to parse the message body on your own. I'd suggest using regular expressions for that.
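A minimal, untested sketch of that idea; the separator pattern below (a line starting with "From:" that Outlook typically inserts before each quoted reply) is an assumption and will usually need tuning for your mail client and language:

import re

def top_threads(body, n=3):
    # Split the body wherever a quoted-reply header starts and keep the first n parts.
    parts = re.split(r'\r?\n(?=From:\s)', body)
    return '\n\n'.join(parts[:n])

# usage inside the loop above:
# print(top_threads(msg.Body))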
I have a list of dataframes (df_cleaned) created from multiple csv files chosen by the user.
My objective is to save each dataframe within the df_cleaned list as a separate csv file locally.
I have the following code, which saves each file with its original title, but it overwrites on every iteration and ends up saving a copy of only the last dataframe.
How can I fix it? With my very basic knowledge, perhaps I could use a break/continue statement in the loop, but I do not know how to implement it correctly.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{}.csv'.format(name))
print('Saving of files as csv is complete.')
You can create a different name for each file; as an example, in the following I append the index to name:
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{0}_{1}.csv'.format(name, i))
print('Saving of files as csv is complete.')
this will create a list of files named <name>_N.csv with N = 0, ..., len(df_cleaned)-1.
A very easy way of solving it. I just figured out the answer myself and am posting it to help someone else.
fileNames is a list I created at the start of the code to store the names of the files chosen by the user.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\TrainData\{}.csv'.format(fileNames[i]))
print('Saving of files as csv is complete.')
Saves a separate copy for each file in the defined directory.
I have a bucket with various files. I am only interested in pulling files that begin with the word 'member' and storing each member file in a list to be concatenated later into a dataframe.
Currently I am pulling data like this:
import io
import boto3
import pandas as pd

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my-bucket')
obj = s3.Object('my-bucket', 'member')
file_content = obj.get()['Body'].read().decode('utf-8')
df = pd.read_csv(io.StringIO(file_content))  # wrap the string so read_csv treats it as file content
However, this only pulls the single member file. I have member files that look like 'member_1229013', 'member_2321903', etc.
How can I read in all the 'member' files and save the data in a list so I can concatenate it later? All column names are the same in all the CSVs.
You can only download/access one object per API call.
I normally recommend downloading the objects to a local directory, and then accessing them as normal local files. Here is an example of how to download an object from Amazon S3:
import boto3
s3 = boto3.client('s3')
s3.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
See: download_file() documentation
If you want to read multiple files, you will first need to obtain a listing of the files (e.g. with list_objects_v2()), and then access each object individually.
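As a rough sketch of that approach (the bucket name and prefix come from the question; this reads each object straight into pandas rather than downloading first, and uses the paginator so listings of more than 1000 objects are handled too):

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
frames = []

# List every object whose key starts with 'member' and read each CSV into a dataframe.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='member'):
    for obj in page.get('Contents', []):
        body = s3.get_object(Bucket='my-bucket', Key=obj['Key'])['Body']
        frames.append(pd.read_csv(io.BytesIO(body.read())))

df = pd.concat(frames, ignore_index=True)  # all files share the same columns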
One tip for boto3... There are two ways to make calls: via a Resource (eg using s3.Object() or s3.Bucket()) or via a Client, which passes everything as parameters.