Databricks SQL supports downloading the result set of a SQL query to a local file (CSV, Excel, etc.).
I'd like to implement a feature that allows users to run scheduled queries and then plug the result set into a predefined Excel template (containing a bunch of macros) to be sent to users by email.
Unfortunately, I haven't been able to find an API that would allow me to write the custom logic to do something like this. I feel like there might be another implementation using live tables or a custom notebook; however, I haven't been able to put the pieces together.
What implementation could I use to build this feature?
I'm giving an answer here as a workaround, since I don't see a direct solution from a Databricks notebook.
Step 1: Write your content to a DBFS location (ref: link).
# Requires the spark-excel (com.crealytics) library to be installed on the cluster
df_spark.write.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .mode("overwrite") \
    .save(anydbfspath)
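If the spark-excel library isn't installed on the cluster, a common alternative is to collect the result into pandas and write the workbook through the /dbfs FUSE mount. A minimal sketch, assuming openpyxl is available and using a hypothetical output path:
# Alternative sketch (assumed path): write the result via the /dbfs FUSE mount with pandas + openpyxl
output_path = "/dbfs/FileStore/reports/query_result.xlsx"  # hypothetical location
pdf = df_spark.toPandas()  # only suitable for result sets that fit in driver memory
pdf.to_excel(output_path, sheet_name="Sheet1", index=False, engine="openpyxl")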
Step 2: Read from that location and send the file by email with Python.
# Import smtplib for the actual sending function
import smtplib
# Import the email modules we'll need
from email.message import EmailMessage

# Open the plain text file whose name is in textfile for reading.
with open(textfile) as fp:
    # Create a text/plain message
    msg = EmailMessage()
    msg.set_content(fp.read())

# me == the sender's email address
# you == the recipient's email address
msg['Subject'] = f'The contents of {textfile}'
msg['From'] = me
msg['To'] = you

# Send the message via our own SMTP server.
s = smtplib.SMTP('localhost')
s.send_message(msg)
s.quit()
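Since the goal is to send the Excel file rather than a plain-text body, the message can carry the workbook as an attachment instead. A minimal sketch, assuming the hypothetical output path from Step 1 and a placeholder SMTP host:
import smtplib
from email.message import EmailMessage

excel_path = "/dbfs/FileStore/reports/query_result.xlsx"  # hypothetical path from Step 1
me = "sender@example.com"       # placeholder sender
you = "recipient@example.com"   # placeholder recipient

msg = EmailMessage()
msg["Subject"] = "Scheduled query results"
msg["From"] = me
msg["To"] = you
msg.set_content("Please find the latest query results attached.")

# Attach the workbook with the MIME type for .xlsx files
with open(excel_path, "rb") as fp:
    msg.add_attachment(
        fp.read(),
        maintype="application",
        subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        filename="query_result.xlsx",
    )

with smtplib.SMTP("smtp.example.com") as s:  # placeholder SMTP server
    s.send_message(msg)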
Step 3: Run this notebook on a job cluster with a schedule.
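The schedule is easiest to set up in the Jobs UI, but it can also be created programmatically. A rough sketch using the Jobs 2.1 REST API; the workspace URL, token, cluster ID, and notebook path are all placeholders:
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder token

payload = {
    "name": "Scheduled query export",
    "tasks": [
        {
            "task_key": "export_and_email",
            "notebook_task": {"notebook_path": "/Shared/export_and_email"},  # assumed notebook path
            "existing_cluster_id": "<cluster-id>",  # placeholder cluster
        }
    ],
    # Run every day at 06:00 in the given timezone
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}

r = requests.post(f"{host}/api/2.1/jobs/create",
                  headers={"Authorization": f"Bearer {token}"},
                  json=payload)
r.raise_for_status()
print(r.json())  # returns the new job_id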
I would like to read mails from Microsoft Outlook using Python and run the script on a Databricks cluster.
I'm using win32com on my local machine and I'm able to read emails. However, when I try to install the same package on Databricks, it throws an error saying:
DRIVER_LIBRARY_INSTALLATION_FAILURE. Error Message: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, pywin32, --disable-pip-version-check) exited with code 1. ERROR: Could not find a version that satisfies the requirement pywin32
ERROR: No matching distribution found for pywin32
Sample code is as follows:
import win32com.client
import pandas as pd

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI").Folders

emails_list = [
    'xyz@outlook.com'
]

subjects = []
categories = []
body_content = []
names = []

for id, name in enumerate(emails_list):
    folder = outlook(name)
    #print('Accessing email - ', folder)
    inbox = folder.Folders("Inbox")
    message = inbox.Items
    message = message.GetFirst()
    body_content.append(message.Body)
    subjects.append(message.Subject)
    categories.append(message.Categories)
    names.append(name)

df = pd.DataFrame(list(zip(names, subjects, categories, body_content)),
                  columns=['names', 'subjects', 'categories', 'body_content'])
df.head(3)
Databricks clusters run Linux (specifically, Ubuntu), so you can't use a COM library that is designed for Windows. You can potentially access your emails in Office 365 using the IMAP protocol or something similar (see docs). Python has a built-in imaplib library that could be used for that purpose, for example as in the following article.
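For illustration, a minimal imaplib sketch that logs in over IMAP and reads the most recent message in the inbox; the server name, account, and app password are placeholders:
import imaplib
import email
from email.header import decode_header

# Placeholder server and credentials; Office 365 typically exposes IMAP at outlook.office365.com
with imaplib.IMAP4_SSL("outlook.office365.com") as imap:
    imap.login("user@example.com", "app-password")
    imap.select("INBOX")

    # Fetch the most recent message
    _, data = imap.search(None, "ALL")
    latest_id = data[0].split()[-1]
    _, msg_data = imap.fetch(latest_id, "(RFC822)")
    msg = email.message_from_bytes(msg_data[0][1])

    subject, encoding = decode_header(msg["Subject"])[0]
    if isinstance(subject, bytes):
        subject = subject.decode(encoding or "utf-8")
    print(subject)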
I really need your help in solving a problem! Apparently, my knowledge is not sufficient to find a solution.
So, I have some .msg files that I have already created and saved. Now I need to write a function that can help me create PDFs from these .msg files (there will be many of them).
I'd be very grateful for your help!
Posting the solution which worked for me (as asked by Amey P Naik). As mentioned, I tried multiple modules, but only extract_msg worked for the case in hand. I created two functions for importing the Outlook message text and attachments into a pandas DataFrame: the first function creates one folder per email message, and the second imports the data from the messages into a DataFrame. Attachments need to be processed separately, using a for loop over the sub-directories in the parent directory. Below are the two functions I created, with comments:
# 1). Import the required modules and setup working directory
import extract_msg
import os
import pandas as pd
direct = os.getcwd() # directory object to be passed to the function for accessing emails, this is where you will store all .msg files
ext = '.msg' #type of files in the folder to be read
# 2). Create a separate folder for each email and extract the data
def content_extraction(directory, extension):
    for mail in os.listdir(directory):
        try:
            if mail.endswith(extension):
                msg = extract_msg.Message(mail)  # creates a local 'msg' object for each email in the directory
                msg.save()  # creates a separate folder per email inside the parent folder, saves a text file with the body content, and downloads all attachments into that folder
        except (UnicodeEncodeError, AttributeError, TypeError) as e:
            pass  # some emails are not processed due to different formats, e.g. emails sent from mobile

content_extraction(direct, ext)
# 3). Import the data into a pandas DataFrame using the extract_msg module.
# Note: this will not import data from the sub-folders inside the parent directory;
# rather, it extracts the information from the .msg files themselves. You can use a
# loop instead to directly import data from the files saved in the sub-folders.
def DataImporter(directory, extension):
    my_list = []
    for i in os.listdir(directory):
        try:
            if i.endswith(extension):
                msg = extract_msg.Message(i)
                # These are built-in attributes of the extract_msg.Message class
                my_list.append([msg.filename, msg.sender, msg.to, msg.date, msg.subject, msg.body, msg.message_id])
                global df
                df = pd.DataFrame(my_list, columns=['File Name', 'From', 'To', 'Date', 'Subject', 'MailBody Text', 'Message ID'])
                print(df.shape[0], ' rows imported')
        except (UnicodeEncodeError, AttributeError, TypeError) as e:
            pass

DataImporter(direct, ext)
After running these two functions, you will have almost all the information inside a pandas DataFrame, which you can use as needed. If you also need to extract content from the attachments, you need to loop over all the sub-directories inside the parent directory and read the attachment files according to their format; in my case the formats were .pdf, .jpg, .png, .csv, etc. Getting data out of these formats requires different techniques, for example getting text out of a PDF needs the pytesseract OCR module.
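A rough sketch of such an attachment loop, assuming the folder layout produced by msg.save() above; it reads .csv attachments directly and just collects the paths of everything else for format-specific handling:
import os
import pandas as pd

attachment_frames = []   # DataFrames built from CSV attachments
other_attachments = []   # paths of attachments needing other handling (.pdf/.jpg/.png -> OCR or image tools)

for root, dirs, files in os.walk(direct):
    if root == direct:
        continue  # skip the parent directory itself; msg.save() created one sub-folder per email
    for fname in files:
        path = os.path.join(root, fname)
        if fname.lower().endswith('.csv'):
            attachment_frames.append(pd.read_csv(path))
        elif not fname.lower().endswith('.txt'):  # skip the saved message body text file
            other_attachments.append(path)

print(len(attachment_frames), 'CSV attachments read,', len(other_attachments), 'other attachments found')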
If you find an easier way to extract content from attachments, please post your solution here for future reference. If you have any questions, please comment. Also, if there is any scope for improvement in the above code, please feel free to point it out.
I'm trying to build a CSV file with all the information about merge requests merged between two tags. I'm trying to get this kind of information for each merge request:
UID; ID; TITLE OF MR; REPOSITORIES; STATUS; MILESTONE; ASSIGNED; CREATION-DATE; MERGED-DATE; LABEL; URL.
For now I have a command that gets all merge requests merged between two tags with some information and puts it into a CSV file:
git log --merges --first-parent master --pretty=format:"%aD;%an;%H;%s;%b" TagA..TagB --shortstat >> MRList.csv
How can I get the other information? In the git log documentation I only see the options I'm already using in my command, and I can't find any others.
Thank you for your help!
I've written a small Python script to do this. Usage:
git log --pretty="format:%H" <start>..<end> | python collect.py
Script:
#!/usr/bin/env python
import sys
import requests

endpoint = 'https://gitlab.com'
project_id = '4242'

mrs = set()
for line in sys.stdin:
    hash = line.rstrip('\n')
    r = requests.get(endpoint + '/api/v4/projects/' + project_id + '/repository/commits/' + hash + '/merge_requests')
    for mr in r.json():
        if mr['id'] in mrs:
            continue
        mrs.add(mr['id'])
        print('!{} {} ({})'.format(mr['iid'], mr['title'], mr['web_url']))
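To get closer to the columns listed in the question (status, milestone, assignee, dates, labels, URL), the same per-commit endpoint can be written out as CSV instead of printed. A sketch under the assumption that a private token is needed and that the standard merge request fields are wanted; pipe the same git log --pretty="format:%H" TagA..TagB output into it:
#!/usr/bin/env python
import csv
import sys
import requests

endpoint = 'https://gitlab.com'
project_id = '4242'
headers = {'PRIVATE-TOKEN': '<your-access-token>'}  # placeholder token, needed for private projects

seen = set()
with open('MRList.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerow(['UID', 'ID', 'TITLE', 'STATUS', 'MILESTONE', 'ASSIGNED',
                     'CREATION-DATE', 'MERGED-DATE', 'LABELS', 'URL'])
    for line in sys.stdin:
        sha = line.strip()
        url = f'{endpoint}/api/v4/projects/{project_id}/repository/commits/{sha}/merge_requests'
        for mr in requests.get(url, headers=headers).json():
            if mr['id'] in seen:
                continue
            seen.add(mr['id'])
            writer.writerow([
                mr['id'], mr['iid'], mr['title'], mr['state'],
                (mr['milestone'] or {}).get('title', ''),
                (mr['assignee'] or {}).get('name', ''),
                mr['created_at'], mr['merged_at'] or '',
                ','.join(mr['labels']), mr['web_url'],
            ])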
This does not yet work directly between two tags, but you do have, with GitLab 13.6 (November 2020):
Export merge requests as a CSV
Many organizations are required to document changes (merge requests) and the data surrounding those transactions such as who authored the MR, who approved it, and when that change was merged into production. Although not an exhaustive list, it highlights the recurring theme of traceability and the need to export this data from GitLab to serve an audit or other regulatory requirement.
Previously, you would need to use GitLab’s merge requests API to compile this data using custom tooling. Now, you can click one button and receive a CSV file that contains the necessary chain of custody information you need.
See Documentation and Issue.
I was able to parse the bodies of the emails present in a particular directory, but it reads every thread the email contains. The code I used to read the files from the directory is as follows. How can I get only the top 3 threads present in an email?
#reading multiple .msg files using python
from pathlib import Path
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

# Assuming E:\emails is the directory containing the files
for p in Path(r'E:\emails').iterdir():
    if p.is_file() and p.suffix == '.msg':
        msg = outlook.OpenSharedItem(str(p))
        print(msg.Body)
        print('-------------------------------')
The Outlook object model doesn't provide any method or property for that. You need to parse the message body on your own. I'd suggest using regular expressions for that.
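For illustration, a rough sketch of such parsing: it splits the body on common reply separators (assumed here to be "-----Original Message-----" lines or quoted "From:" headers) and keeps only the first three segments. The separators vary by mail client, so the pattern will likely need adjusting:
import re

# Assumed separators between threads in an Outlook-style body; adjust for your messages.
SEPARATOR = re.compile(
    r'^\s*(?:-+\s*Original Message\s*-+|From:\s.+)$',
    flags=re.MULTILINE | re.IGNORECASE,
)

def top_threads(body, n=3):
    """Return the first n thread segments of an email body."""
    parts = SEPARATOR.split(body)
    return [part.strip() for part in parts if part.strip()][:n]

# Usage inside the loop above:
# for thread in top_threads(msg.Body):
#     print(thread)
#     print('-------------------------------')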
I've been trying to find a way to read and write data between pandas and Google Sheets for a while now. I found the library df2gspread, which seems perfect for the job, but I've been struggling to get it to work.
As instructed, I used the Google API console to create my client secrets file and saved it as ~/.gdrive_private. Now, I'm trying to download the contents of a Google spreadsheet as follows:
workbook = [local filepath to workbook in Google Drive folder]
df = g2d.download(workbook, 'Sheet1', col_names = True, row_names = True)
When I run this, it successfully opens a browser window asking me to give my app access to my Google Sheets. However, when I click allow, an IPython error comes up:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/samlilienfeld/.oauth/drive.json'
What is this file supposed to contain? I've tried creating the folder and including my client secrets again there as drive.json, but this does not work.
I worked around this for the time being by passing a pre-authenticated credentials file to the g2d call.
I made a gist here (for Python 2.x, but it should work on 3.x) that saves the credentials file: you pass it the secrets file (basically ~/.gdrive_private) and the filename under which to save the resulting authenticated credentials.
Use the above gist in a standalone script with appropriate filenames and run it from a terminal. A browser window will open to perform the OAuth authentication via Google and should give you a token, which you can copy and paste into the terminal prompt. Here's a quick example:
from gdrive_creds import create_creds
# Copy Paste whatever shows up in the browser in the console.
create_creds('./.gdrive_private', './authenticated_creds')
You can then use the file to authenticate for df2gspread calls.
Once you create the cred file using the gist method, try something like this to get access to your GDrive:
from oauth2client.file import Storage
from df2gspread import gspread2df as g2d
# Read the cred file
creds = Storage('./authenticated_creds').get()
# Pass it to g2df (Trimmed for brevity)
workbook = [local filepath to workbook in Google Drive folder]
df = g2d.download(workbook, 'Sheet1', col_names = True, credentials=creds)
df.head()
This worked for me.
Here are the two working approaches as of 2019:
1. DataFrame data to Google Sheet:
#Import libraries
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
# Connection to googlesheet
import gspread
from oauth2client.service_account import ServiceAccountCredentials
# From dataframe to google sheet
from df2gspread import df2gspread as d2g
# Configure the connection
scope = ['https://spreadsheets.google.com/feeds']
# Add the JSON file you downloaded from Google Cloud to your working directory
# the JSON file in this case is called 'service_account_gs.json' you can rename as you wish
credentials = ServiceAccountCredentials.from_json_keyfile_name('service_account_gs.json', scope)
# Authorise your Notebook with credentials just provided above
gc = gspread.authorize(credentials)
# The spreadsheet ID, you see it in the URL path of your google sheet
spreadsheet_key = '1yr6LwGQzdNnaonn....'
# Create the dataframe within your notebook
df = pd.DataFrame({'number': [1,2,3],'letter': ['a','b','c']})
# Set the sheet name you want to upload data to and the start cell where the upload data begins
wks_name = 'Sheet1'
cell_of_start_df = 'A1'
# upload the dataframe
d2g.upload(df,
           spreadsheet_key,
           wks_name,
           credentials=credentials,
           col_names=True,
           row_names=False,
           start_cell=cell_of_start_df,
           clean=False)
print('Successfully updated')
2. Google Sheet to DataFrame
from df2gspread import gspread2df as g2d
df = g2d.download(gfile='1yr6LwGQzdNnaonn....',
credentials=credentials,
col_names=True,
row_names=False)
df
It seems like this issue arose because the /User/***/.oauth folder wasn't created automatically by the oauth2client package (e.g. issue). One possible solution is to create this folder manually, or you can update df2gspread; the issue should be fixed in the latest version.
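For reference, creating the missing folder manually is a one-liner; a minimal sketch:
import os

# Create the ~/.oauth directory that oauth2client expects but did not create automatically
os.makedirs(os.path.expanduser('~/.oauth'), exist_ok=True)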