I would like to read mail from Microsoft Outlook using Python and run the script on a Databricks cluster.
I'm using win32com on my local machine and I am able to read emails. However, when I try to install the same package on Databricks, it throws an error saying:
DRIVER_LIBRARY_INSTALLATION_FAILURE. Error Message: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, pywin32, --disable-pip-version-check) exited with code 1. ERROR: Could not find a version that satisfies the requirement pywin32
ERROR: No matching distribution found for pywin32
Sample code is as follows:
import win32com.client
import pandas as pd

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI").Folders

emails_list = [
    'xyz@outlook.com'
]

subjects = []
categories = []
body_content = []
names = []

for name in emails_list:
    folder = outlook(name)
    #print('Accessing email - ', folder)
    inbox = folder.Folders("Inbox")
    messages = inbox.Items
    message = messages.GetFirst()
    body_content.append(message.Body)
    subjects.append(message.Subject)
    categories.append(message.Categories)
    names.append(name)

df = pd.DataFrame(list(zip(names, subjects, categories, body_content)),
                  columns=['names', 'subjects', 'categories', 'body_content'])
df.head(3)
Databricks clusters run Linux (specifically, Ubuntu), so you can't use a COM library that is designed for Windows. You could potentially access your Office 365 email using the IMAP protocol, or something similar (see the docs). Python has a built-in imaplib library that can be used for that purpose, for example as in the following article.
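For illustration, here is a minimal sketch of that IMAP approach using imaplib; the host name, mailbox, and password below are assumptions (and IMAP must be enabled for the account), so adjust them for your setup:

import email
import imaplib

IMAP_HOST = "outlook.office365.com"  # assumed Office 365 IMAP endpoint
USER = "xyz@outlook.com"             # hypothetical mailbox
PASSWORD = "app-password"            # hypothetical app password / secret

with imaplib.IMAP4_SSL(IMAP_HOST) as conn:
    conn.login(USER, PASSWORD)
    conn.select("INBOX", readonly=True)
    # List all message ids, then fetch the most recent one
    # (roughly what GetFirst() did in the COM version).
    _, data = conn.search(None, "ALL")
    ids = data[0].split()
    if ids:
        _, msg_data = conn.fetch(ids[-1], "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        print(msg["Subject"])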
I am using the pytrends library in Google Colab. The problem is that whenever I restart my runtime, the results are different. Here is my code:
!pip install pytrends
from pytrends.request import TrendReq
wom = 'today 1-m'
geo = 'GB'
key_word = '401k'
pytrend = TrendReq(hl='en-US',tz=-360)
pytrend.build_payload([key_word], timeframe=wom, geo=geo)
wtrends = pytrend.interest_over_time()
print(wtrends)
This code always gives me the same results when I run it on my local machine using Anaconda. You can verify the results by going to the Google Trends website and selecting the region United Kingdom.
Databricks SQL supports downloading the result set of a SQL query to a local document (CSV, Excel, etc.).
I'd like to implement a feature allowing users to run scheduled queries, then plug the result set into predefined Excel templates (containing a bunch of macros) to be sent to users by email.
Unfortunately, I haven't been able to find an API that would allow me to write the custom logic to do something like this. I feel like there might be another implementation using live tables or a custom notebook; however, I haven't been able to put the pieces together.
What implementation could I use to produce this feature?
I am giving an answer here as a workaround, as I don't see a direct solution from a Databricks notebook.
Step-01 : Write your content to any DBFS location. Ref: link
df_spark.write.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .mode("overwrite") \
    .save(anydbfspath)
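Note that a file saved to DBFS at dbfs:/some/path is also visible to driver-side Python under the FUSE mount /dbfs/some/path, which is the form of path that open() in Step-02 expects.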
Step-02 : Read that location and send the file as an email through Python
# Import smtplib for the actual sending function
import smtplib

# Import the email modules we'll need
from email.message import EmailMessage

# Open the plain text file whose name is in textfile for reading.
with open(textfile) as fp:
    # Create a text/plain message
    msg = EmailMessage()
    msg.set_content(fp.read())

# me == the sender's email address
# you == the recipient's email address
msg['Subject'] = f'The contents of {textfile}'
msg['From'] = me
msg['To'] = you

# Send the message via our own SMTP server.
s = smtplib.SMTP('localhost')
s.send_message(msg)
s.quit()
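Since the file written in Step-01 is a binary Excel workbook rather than plain text, a closer fit is to send it as an attachment. A minimal sketch, assuming hypothetical paths, addresses, and SMTP host:

import smtplib
from email.message import EmailMessage

excel_path = "/dbfs/tmp/report.xlsx"       # hypothetical DBFS path from Step-01
msg = EmailMessage()
msg['Subject'] = 'Scheduled query results'
msg['From'] = 'sender@example.com'         # hypothetical sender
msg['To'] = 'recipient@example.com'        # hypothetical recipient
msg.set_content('The scheduled query results are attached.')

# Attach the workbook as a binary payload with the xlsx MIME type.
with open(excel_path, 'rb') as fp:
    msg.add_attachment(
        fp.read(),
        maintype='application',
        subtype='vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        filename='report.xlsx')

with smtplib.SMTP('smtp.example.com') as s:  # hypothetical SMTP relay
    s.send_message(msg)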
Step-03 : Schedule this notebook to run on a job cluster.
I was able to parse the bodies of the emails present in a particular directory, but it reads the entire thread history each email contains. The code I used to read the files from the directory is as follows. How can I get only the top 3 threads present in an email?
# reading multiple .msg files using python
from pathlib import Path
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

# Assuming E:\emails is the directory containing the files
for p in Path(r'E:\emails').iterdir():
    if p.is_file() and p.suffix == '.msg':
        msg = outlook.OpenSharedItem(str(p))
        print(msg.Body)
        print('-------------------------------')
The Outlook object model doesn't provide any method or property for that. You need to parse the message body on your own. I'd suggest using regular expressions for that.
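As a rough sketch of that approach, one can split the body on common reply separators and keep the first three segments; the patterns below are assumptions, since real reply headers vary by mail client and locale:

import re

# Lines that usually start a quoted earlier message:
# "From: ..." headers or "On <date>, <someone> wrote:" lines.
SEPARATOR = re.compile(r"^(?:From:\s.*|On\s.+\swrote:)\s*$", re.MULTILINE)

def top_threads(body, n=3):
    """Return roughly the first n thread segments of an email body."""
    parts = SEPARATOR.split(body)
    return "\n---\n".join(part.strip() for part in parts[:n] if part.strip())

# e.g. inside the loop above: print(top_threads(msg.Body))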
I am trying to retrieve historical financial data from IEX or Morningstar. For this I use the following code:
import pandas as pd
# Compatibility shim needed by older pandas-datareader releases on newer pandas
pd.core.common.is_list_like = pd.api.types.is_list_like
import pandas_datareader.data as web
import datetime

start = datetime.datetime(2019, 1, 1)
end = datetime.datetime(2019, 1, 10)
facebook = web.DataReader("FB", 'morningstar', start, end)
print(facebook.head())
Unfortunately, I get the error message:
NotImplementedError: data_source='morningstar' is not implemented
or
ValueError: The IEX Cloud API key must be provided either through the
api_key variable or through the environment variable IEX_API_KEY
depending on which of the two sources I use.
I ran
pip uninstall pandas-datareader
pip install pandas-datareader
several times and also restarted the kernel, but nothing changed. Was there any change to these APIs, or am I doing something wrong?
From the documentation:
You need to obtain the IEX_API_KEY from IEX and pass it to os.environ["IEX_API_KEY"]. (https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#remote-data-iex)
I don't know if the IEX API still works.
The morningstar source is not implemented. The following data sources are (at the time of writing):
Tiingo
IEX
Alpha Vantage
Enigma
Quandl
St.Louis FED (FRED)
Kenneth French’s data library
World Bank
OECD
Eurostat
Thrift Savings Plan
Nasdaq Trader symbol definitions
Stooq
MOEX
You must provide an API key when using IEX. You can do this using
os.environ["IEX_API_KEY"] = "pk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
or by exporting the key before starting the IPython session.
You can visit iexcloud.io; after creating a student account, you will get an API key for free.
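Putting that together, here is a sketch of the IEX route with the key set in the environment; the key value is a placeholder, and 'FB' assumes the ticker is still listed:

import datetime
import os

os.environ["IEX_API_KEY"] = "pk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # placeholder key

import pandas_datareader.data as web

start = datetime.datetime(2019, 1, 1)
end = datetime.datetime(2019, 1, 10)
facebook = web.DataReader("FB", 'iex', start, end)
print(facebook.head())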
I am having trouble listing blobs from a specific container.
I am using the official Python code to list them:
from azure.storage.blob import BlockBlobService

account_name = 'xxxx'
account_key = 'xxxx'
container_name = 'yyyyyy'

block_blob_service = BlockBlobService(account_name=account_name,
                                      account_key=account_key)

print("\nList blobs in the container")
generator = block_blob_service.list_blobs(container_name)
for blob in generator:
    print("\t Blob name: " + blob.name)
I received the error:
raise AzureException(ex.args[0])
AzureException: can only concatenate str (not "tuple") to str
The versions of the Azure Storage related packages installed are:
azure-mgmt-storage 2.0.0
azure-storage-blob 1.4.0
azure-storage-common 1.4.0
I tried to run the same code as yours with my account, and it works fine without any issue. Then, based on the error information, I tried to reproduce it, as below.
Test 1. When I tried to run the code '123' + ('A', 'B') in Python 3.7, I got a similar issue, as in the figure below.
Test 2. When I ran the same code in Python 3.6, the error message was different.
Test 3. In Python 2 (just on WSL), the issue was the same as in Python 3.7.
So I guess you were using Python 3.7 or Python 2 to run your code, and the issue was caused by using the + operator to concatenate a string with a tuple somewhere else in your code. Please check carefully, or update your post with more details of the debugging information, including the line number and its code, to help with the analysis.
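For reference, a quick way to reproduce the suspected root cause and compare the messages across versions:

# "+" between a str and a tuple raises TypeError; the wording depends
# on the Python version.
try:
    "123" + ("A", "B")
except TypeError as ex:
    print(ex)
    # Python 3.7: can only concatenate str (not "tuple") to str
    # Python 3.6: must be str, not tuple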