I wrote this Python script to search for unseen mail in a mailbox, download any .xlsx attachments, apply some modifications, and then post the results to another service.
Everything works perfectly, with just one issue:
in the original .xlsx file there is a column named "zona" containing the two-letter Italian province code.
If this value is "NA" (the code for the province of Naples), the saved .xlsx file
has a blank cell instead of NA.
Is NA a reserved word, and if so, is there a way to quote it?
import os
import email
import imaplib
import http.client
import requests
import pandas as pd

mail_user = os.environ.get('MAIL_USER')
mail_password = os.environ.get('MAIL_PASS')
mail_server = os.environ.get('MAIL_SERVER')
detach_dir = '.'
url = <removed url>

if mail_user is None or mail_password is None or mail_server is None:
    print('VARIABILI DI AMBIENTE NON DEFINITE')  # environment variables not defined
    exit(1)

try:
    with imaplib.IMAP4_SSL(mail_server) as m:
        try:
            m.login(mail_user, mail_password)
            m.select("INBOX")
            resp, items = m.search(None, "UNSEEN")
            items = items[0].split()
            for emailid in items:
                resp, data = m.fetch(emailid, "(RFC822)")
                email_body = data[0][1]  # getting the raw mail content
                mail = email.message_from_bytes(email_body)  # parsing it into a mail object
                if mail.get_content_maintype() != 'multipart':
                    continue
                for part in mail.walk():
                    if part.get_content_maintype() == 'multipart':
                        continue
                    if part.get('Content-Disposition') is None:
                        continue
                    filename = part.get_filename()
                    if filename.endswith('.xlsx'):
                        att_path = os.path.join(detach_dir, filename)
                        with open(att_path, 'wb') as fp:
                            fp.write(part.get_payload(decode=True))
                        xl = pd.ExcelFile(att_path)
                        df1 = xl.parse(sheet_name=0)
                        df1 = df1.replace({'\'': ''}, regex=True)
                        df1.loc[df1['Prodotto'] == 'SP_TABLETA_SAMSUNG', 'Cod. ID.'] = 'X'
                        df1.loc[df1['Prodotto'] == 'AP_TLC', 'Cod. ID.'] = 'X'
                        df1.loc[df1['Prodotto'] == 'APDCMB00003', 'Cod. ID.'] = 'X'
                        df1.loc[df1['Prodotto'] == 'APDCMB03252', 'Cod. ID.'] = 'X'
                        writer = pd.ExcelWriter(att_path, engine='xlsxwriter')
                        df1.to_excel(writer, sheet_name='Foglio1', index=False)
                        writer.save()
                        uf = {'files': open(att_path, 'rb')}
                        http.client.HTTPConnection.debuglevel = 0
                        r = requests.post(url, files=uf)
                        print(r.text)
        except imaplib.IMAP4_SSL.error as e:
            print(e)
            exit(1)
except imaplib.IMAP4.error:
    print("Errore di connessione al server")  # connection error to the server
    exit(1)
It seems that pandas is treating the "NA" value as NaN and therefore, when writing to Excel, it writes this value as '' by default (see the docs).
You can pass na_rep='NA' to the to_excel() call in order to write it out as a string:
df1.to_excel(writer, sheet_name='Foglio1', index=False, na_rep='NA')
But as a precaution, keep in mind that any other NaN values present in your df will also be written to the Excel file as 'NA'.
Reading the docs link posted by @Matt B., I found this solution:
df1 = xl.parse(sheet_name=0, keep_default_na=False, na_values=['_'])
If I understand correctly, only '_' is now interpreted as "not available".
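To make the effect of keep_default_na concrete, here is a small self-contained sketch. It uses read_csv on an in-memory buffer purely so it runs without an Excel engine installed; xl.parse / read_excel accepts the same keep_default_na parameter, and the sample values are hypothetical stand-ins for the "zona" column:

```python
import io
import pandas as pd

# Hypothetical sample standing in for the attachment's "zona" column.
raw = io.StringIO("zona\nNA\nMI\nRM\n")

# Default parsing turns the literal string "NA" into NaN.
default = pd.read_csv(raw)
raw.seek(0)
# keep_default_na=False keeps "NA" as an ordinary string.
kept = pd.read_csv(raw, keep_default_na=False)

print(default["zona"].isna().tolist())  # [True, False, False]
print(kept["zona"].tolist())            # ['NA', 'MI', 'RM']
```

With keep_default_na=False the NA value survives the round trip, and na_values=['_'] restores a sentinel of your own choosing.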
I wrote the following code to create DataFrames from files saved in ShareFile. It works perfectly for Excel files, but fails for CSV files with the error EmptyDataError: No columns to parse from file.
tblname = 'test'
fPth = r'Z:\Favorites\test10 (Group D - Custom EM&V)\8 PII\16 - Project Selection Plan\QC\Data\test.csv'
sht = 'Gross_Data'
shtStart = 0
fType = 'csv'
fitem = sfsession.get_io_version(fPth)
if fitem is None:
    print(f'Could not create sharefile item for {fPth}')
else:
    try:
        if fType == 'csv':
            df = pd.read_csv(fitem.io_data, header=shtStart)
        elif fType == 'excel':
            df = pd.read_excel(fitem.io_data, sheet_name=sht, header=shtStart)
        else:
            pass
        print(f'Data import COMPLETE for {fPth}: {str(datetime.now())}')
    except Exception:
        print(f'Data import FAILED for {fPth}')
        logging.critical(f'Data import FAILED for {fPth}')
If I replace fitem.io_data with fPth in pd.read_csv, the code works, but I can't use that as a permanent solution. Any suggestions?
For context: sfsession is a ShareFile session, and get_io_version(fPth) gets the token and downloads all the file properties, including its data.
Thanks.
An adaptation of this solution worked for me:
StringIO and pandas read_csv
I added fitem.io_data.seek(0) before the df = ... line
Closing the question.
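A self-contained illustration of why the seek(0) fixes it, using an io.StringIO as a hypothetical stand-in for fitem.io_data:

```python
import io
import pandas as pd

# Hypothetical buffer standing in for fitem.io_data.
buf = io.StringIO("a,b\n1,2\n3,4\n")

# Simulate a prior consumer (e.g. the download step) reading the
# stream to its end and leaving the cursor there.
buf.read()

# Rewind before handing the buffer to pandas; without the seek(0),
# read_csv sees an exhausted stream and raises EmptyDataError.
buf.seek(0)
df = pd.read_csv(buf)
print(df.shape)  # (2, 2)
```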
I'm writing string entries to a tab-delimited file in Python 3. The code that I use to save the content is:
savedir = easygui.diropenbox()
savefile = input("Please type the filename (including extension): ")
file = open(os.path.join(savedir, savefile), "w", encoding="utf-8")
file.write("Number of entities not found: " + str(missing_count) + "\n")
sep = "\t"
for entry in entities:
    file.write(entry[0] + "\t")
    for item in entry:
        file.write(sep.join(item[0]))
        file.write("\t")
    file.write("\n")
file.close()
The file saves properly. There are no errors or warnings sent to the terminal. When I open the file, I find an extra column has been saved to the file.
Query | Extra | Name
Abu-Jamal, Mumia | A | Mumia Abu-Jamal
Anderson, Walter | A | Walter Inglis Anderson
Anderson, Walter | A | Walter Inglis Anderson
I've added vertical bars between the tabs for clarity; they don't normally appear there. As well, I have removed a few columns at the end. The column between the vertical bars is not supposed to be there. The document that is saved to file is longer than three lines. On each line, the extra column is the first letter of the Query column. Hence, we have A's in these three examples.
entry[0] corresponds exactly to the value in the Query column.
sep.join(item[0]) corresponds exactly to columns 3+.
Any idea why I would be getting this extra column?
Edit: I'm adding the full code for this short script.
# =============================================================================
# Code to query DBpedia for named entities.
#
# =============================================================================
import requests
import xml.etree.ElementTree as et
import csv
import os
import easygui
import re
# =============================================================================
# Default return type is XML. Others: json.
# Classes are: Resource (general), Place, Person, Work, Species, Organization
# but don't include resource as one of the
# =============================================================================
def urlBuilder(query, queryClass="unknown", returns=10):
    prefix = 'http://lookup.dbpedia.org/api/search/KeywordSearch?'
    # Selects the appropriate QueryClass for the url
    if queryClass == 'place':
        qClass = 'QueryClass=place'
    elif queryClass == 'person':
        qClass = 'QueryClass=person'
    elif queryClass == 'org':
        qClass = 'QueryClass=organization'
    else:
        qClass = 'QueryClass='
    # Sets the QueryString
    qString = "QueryString=" + str(query)
    # Sets the number of returns
    qHits = "MaxHits=" + str(returns)
    # Full url
    dbpURL = prefix + qClass + "&" + qString + "&" + qHits
    return dbpURL

# Takes an XML doc as a string and returns an array with the name and the URI
def getdbpRecord(xmlpath):
    root = et.fromstring(xmlpath)
    dbpRecord = []
    for child in root:
        temp = []
        temp.append(child[0].text)
        temp.append(child[1].text)
        if child[2].text is None:
            temp.append("Empty")
        else:
            temp.append(findDates(child[2].text))
        dbpRecord.append(temp)
    return dbpRecord

# Looks for a date with pattern: 1900-01-01 OR 01 January 1900 OR 1 January 1900
def findDates(x):
    pattern = re.compile(r'\d{4}-\d{2}-\d{2}|\d{2}\s\w{3,9}\s\d{4}|\d{1}\s\w{3,9}\s\d{4}')
    returns = pattern.findall(x)
    if len(returns) > 0:
        return ";".join(returns)
    else:
        return "None"
#%%
# =============================================================================
# Build and send get requests
# =============================================================================
print("Please select the CSV file that contains your data.")
csvfilename = easygui.fileopenbox("Please select the CSV file that contains your data.")
lookups = []
name_list = csv.reader(open(csvfilename, newline=''), delimiter=",")
for name in name_list:
    lookups.append(name)
#request to get the max number of returns from the user.
temp = input("Specify the maximum number of returns desired: ")
if temp.isdigit():
    maxHits = temp
else:
    maxHits = 10
queries = []
print("Building queries. Please wait.")
for search in lookups:
    if len(search) == 2:
        queries.append([search[0], urlBuilder(query=search[0], queryClass=search[1], returns=maxHits)])
    else:
        queries.append([search, urlBuilder(query=search, returns=maxHits)])
responses = []
print("Gathering responses. Please wait.")
for item in queries:
    response = requests.get(item[1])
    data = response.content.decode("utf-8")
    responses.append([item[0], data])
entities = []
missing_count = 0
for item in responses:
    temp = []
    if len(list(et.fromstring(item[1]))) > 0:
        entities.append([item[0], getdbpRecord(item[1])])
    else:
        missing_count += 1
print("There are " + str(missing_count) + " entities that were not found.")
print("Please select the destination folder for the results of the VIAF lookup.")
savedir = easygui.diropenbox("Please select the destination folder for the results of the VIAF lookup.")
savefile = input("Please type the filename (including extension): ")
file = open(os.path.join(savedir, savefile), "w", encoding="utf-8")
file.write("Number of entities not found: " + str(missing_count) + "\n")
sep = "\t"
for entry in entities:
    file.write(entry[0] + "\t")
    for item in entry:
        file.write(sep.join(item[0]))
        file.write("\t")
    file.write("\n")
file.close()
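One thing worth checking, offered as a hunch rather than a confirmed diagnosis: str.join happily accepts a string and iterates its characters. Because the outer loop iterates over entry itself, the first item it yields is the query string, so item[0] is that string's first character, and joining a one-character string returns just that character. A minimal sketch with a hypothetical entry shaped like those the script builds:

```python
sep = "\t"
# Hypothetical entry shaped like those in `entities`:
# [query_string, [[name, uri, dates], ...]]
entry = ["Abu-Jamal, Mumia", [["Mumia Abu-Jamal", "uri", "1954-04-24"]]]

for item in entry:
    # First pass: item is the query STRING, so item[0] is its first
    # character and sep.join("A") is just "A" -- the mystery column.
    # Second pass: item is the record list and item[0] is a full record.
    print(repr(sep.join(item[0])))
```

That would account for the extra column always being the first letter of the Query value.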
I am new to programming, and probably there is an answer to my question somewhere like here, the closest I found after searching for days. Most of the info deals with existing CSVs or hardcoded data. I am trying to make the program create data every time it runs and then work on that data, so I'm a little stumped.
The problem:
I can't seem to get Python to attach serial numbers to each entry when I run the program I am making to log my study blocks. It has various fields; the following are two of them:
Date Time
12-03-2018 11:30
Following is the code snippet:
d = ''
while d == '':
    d = input('Date:')
    try:
        valid_date = dt.strptime(d, '%Y-%m-%d')
    except ValueError:
        d = ''
        print('Please input date in YYYY-MM-DD format.')
t = ''
while t == '':
    t = input('Time:')
    try:
        valid_time = dt.strptime(t, '%H:%M')
    except ValueError:
        t = ''
        print('Please input time in HH:MM format.')
header = csv.DictWriter(outfile, fieldnames=['UID', 'Date', 'Time', 'Topic', 'Objective', 'Why', 'Summary'], delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
header.writeheader()
log_input = csv.writer(outfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
log_input.writerow([d, t, topic, objective, why, summary])
outfile.close()
df = pd.read_csv('E:\Coursera\HSU\python\pom_blocks_log.csv')
df = pd.read_csv('E:\pom_blocks_log.csv')
df = df.reset_index()
df.columns[0] = 'UID'
df['UID'] = df.index
print (df)
I get the following error when I run the program with the df block:
TypeError: Index does not support mutable operations
I'm new to Python and don't really know how to work with data structures, so I am building small programs to learn. Any help is highly appreciated, and apologies if this is a duplicate; please point me in the right direction.
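For reference, that TypeError comes from a documented property of pandas: an Index object is immutable, so item assignment like df.columns[0] = 'UID' always fails. A minimal sketch (the two-column frame is a hypothetical stand-in for the loaded log):

```python
import pandas as pd

# Minimal frame standing in for the loaded log.
df = pd.DataFrame({"Date": ["12-03-2018"], "Time": ["11:30"]})

# A pandas Index is immutable; item assignment raises the TypeError above.
try:
    df.columns[0] = "UID"
except TypeError as exc:
    print(exc)

# rename() (or assigning a whole new list to df.columns) works instead.
df = df.rename(columns={df.columns[0]: "UID"})
print(df.columns.tolist())  # ['UID', 'Time']
```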
So, I figured it out. Following is the process I followed:
I save the CSV file using the csv module.
I load the CSV file in pandas as a DataFrame.
This allows me to append user entries to the CSV every time the program runs, then load it as a DataFrame and use pandas to manipulate the data accordingly. I also added a generator to clean the lines of the delimiter character ',' so the file can be loaded as a DataFrame even for string columns where ',' is accepted as valid input. Maybe this is a roundabout approach, but it works.
Following is the code:
import csv
from csv import reader
from datetime import datetime
import pandas as pd
import numpy as np
with open(r'E:\Coursera\HSU\08_programming\trLog_df.csv','a', encoding='utf-8') as csvfile:
# Date
d = ''#input("Date:")
while d == '':
d = input('Date: ')
try:
valid_date = datetime.strptime(d, '%Y-%m-%d')
except ValueError:
d = ''
print("Incorrect data format, should be YYYY-MM-DD")
# Time
t = ''#input("Date:")
while t == '':
t = input('Time: ')
try:
valid_date = datetime.strptime(t, '%H:%M')
except ValueError:
t = ''
print("Incorrect data format, should be HH:MM")
log_input = csv.writer(csvfile, delimiter= ',',
quotechar='|', quoting=csv.QUOTE_MINIMAL)
log_input.writerow([d, t])
# Function to clean lines off the delimter ','
def merge_last(file_name, merge_after_col=7, skip_lines=0):
with open(file_name, 'r') as fp:
for i, line in enumerate(fp):
if i < 2:
continue
spl = line.strip().split(',')
yield (*spl[:merge_after_col], ','.join(spl[merge_after_col:2]))
# Generator to clean the lines
gen = merge_last(r'E:\Coursera\HSU\08_programming\trLog_df.csv', 1)
# get the column names
header = next(gen)
# create the data frame
df = pd.DataFrame(gen, columns=header)
df.head()
print(df)
If anybody has a better solution, it would be enlightening to see how to do it with efficiency and elegance.
Thank you for reading.
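One possibly simpler route, sketched here as an assumption rather than a tested drop-in replacement: since QUOTE_MINIMAL already wraps any field containing the delimiter in the quotechar, telling read_csv about the same quotechar may make the clean-up generator unnecessary. The filenames are replaced with an in-memory buffer, and the Summary text is hypothetical:

```python
import csv
import io
import pandas as pd

buf = io.StringIO()
writer = csv.writer(buf, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
writer.writerow(["Date", "Time", "Summary"])
# The summary contains a comma, so QUOTE_MINIMAL wraps it in |...|.
writer.writerow(["2018-03-12", "11:30", "read notes, then reviewed"])

buf.seek(0)
# Passing the same quotechar to read_csv parses the field back intact.
df = pd.read_csv(buf, quotechar='|')
print(df.loc[0, "Summary"])  # read notes, then reviewed
```

The writer and reader agree on the quoting convention, so commas inside string columns survive the round trip without any post-processing.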
I am trying to run some code that will get data from a website (ques) and compare it to a CSV file (ans) to get a result.
def ans(ques):
    ans = False
    csvfile = open('database.csv', 'r')
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        if ques == row[0]:
            ans = row[1]
            break
        if ques == row[1]:
            ans = row[0]
            break
    return ans
CSV file:
QUESTION,ANSWER
1+1=?,It's 2
When I run this code I get this:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to
Unicode - interpreting them as being unequal
However, when I remove the ' it works as expected. Any help?
I have made a program that opens a .csv file and takes a user input to filter it into a new list. What I am having trouble with is saving this new list to a .csv file properly.
This is my code:
#author: Joakim
from pprint import pprint
import csv
with open('Change_in_Unemployment_2008-2014.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)  # Removing header
    result = []
    found = False
    user_input = input("Please enter a full/partial NUTS code to filter by: ")
    for row in readCSV:
        if row[0].startswith(user_input):
            result.append(row)
            found = True
    if found == False:
        print("There are no registered NUTS codes containing your input.. Please try again")
    if found == True:
        print("\n Successfully found ", len(result), "entries!""\n")
        pprint(result)

# Store data in a new csv file
Stored_path = "C:\data_STORED.csv"
file = open(Stored_path, 'w')
writer = csv.writer(file)
writer.writerow(["NUTS CODE", " Municipality", " value"])
for i in range(len(result)):
    new_row = result[i]
    NUTS_CODE = new_row[0]
    Municipality = new_row[1]
    Value = new_row[2]
    writer.writerow([NUTS_CODE, Municipality])
csvfile.close()
If one runs my code with an input of : PL, one gets this list:
[['PL11', 'odzkie', '2.2'],
['PL12', 'Mazowieckie', '1.2'],
['PL21', 'Maopolskie', '2.9'],
['PL22', 'Slaskie', '2'],
['PL31', 'Lubelskie', '1.1'],
['PL32', 'Podkarpackie', '5.8'],
['PL33', 'Swietokrzyskie', '2.6'],
['PL34', 'Podlaskie', '2.7'],
['PL41', 'Wielkopolskie', '1.6'],
['PL42', 'Zachodniopomorskie', '-1.1'],
['PL43', 'Lubuskie', '1.8'],
['PL51', 'Dolnoslaskie', '0'],
['PL52', 'Opolskie', '1.3'],
['PL61', 'Kujawsko-Pomorskie', '1.6'],
['PL62', 'Warminsko-Mazurskie', '2.4'],
['PL63', 'Pomorskie', '3.1']]
Now I would like to store this neatly in a new .csv file, but when I use the code above, I only get a couple of values repeated throughout.
What is my error?
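For comparison, a minimal sketch of the writing step that sidesteps two common pitfalls in code like the above (the sample rows and the output filename are hypothetical; newline='' and the with-block are standard csv-module usage):

```python
import csv

# Two hypothetical filtered rows standing in for `result`.
result = [["PL11", "odzkie", "2.2"], ["PL12", "Mazowieckie", "1.2"]]

# newline='' stops csv.writer from emitting blank rows on Windows,
# and the with-block guarantees the output file is flushed and closed.
with open("data_STORED.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["NUTS CODE", "Municipality", "Value"])
    for nuts_code, municipality, value in result:
        # Write all three fields; note the original writerow passes
        # only [NUTS_CODE, Municipality], silently dropping the value.
        writer.writerow([nuts_code, municipality, value])
```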