Choose encoding when converting to Sqlite database - python-3.x

I am converting Mbox files to an SQLite database, but I cannot manage to get the data encoded as UTF-8.
The Python console displays the following message when converting to db:
Error binding parameter 1 - probably unsupported type.
When I visualize my data on DB Browser for SQlite, special characters don't appear and the � symbol shows up instead.
I first convert .text files to Mbox files with the following function:
def makeMBox(fIn, fOut):
    if not os.path.exists(fIn):
        return False
    if os.path.exists(fOut):
        return False
    out = open(fOut, "w")
    lineNum = 0
    # detect encoding
    readsource = open(fIn, 'rt').__next__
    #fInCodec = tokenize.detect_encoding(readsource)[0]
    fInCodec = 'UTF-8'
    for line in open(fIn, 'rt', encoding=fInCodec, errors="replace"):
        if line.find("From ") == 0:
            if lineNum != 0:
                out.write("\n")
            lineNum += 1
            line = line.replace(" at ", "#")
        out.write(line)
    out.close()
    return True
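Note in passing that the commented-out detection would not have worked as written: tokenize.detect_encoding expects a readline callable from a file opened in binary mode, not the text-mode __next__ above, and it only recognizes a BOM or a PEP 263 coding cookie (defaulting to UTF-8). A minimal sketch of a working call, for what it's worth:

import tokenize

# detect_encoding reads raw bytes, so the probe handle must be binary
with open(fIn, 'rb') as probe:
    fInCodec = tokenize.detect_encoding(probe.readline)[0]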
Then, I convert to sqlite db:
for k in dates:
    db = sqlite_utils.Database("Courriels_Sqlite/Echanges_Discussion.db")
    mbox = mailbox.mbox("Courriels_MBox/" + k + ".mbox")

    def to_insert():
        for message in mbox.values():
            Dionyversite = dict(message.items())
            Dionyversite["payload"] = message.get_payload()
            yield Dionyversite

    try:
        db["Dionyversite"].upsert_all(to_insert(), alter=True, pk="Message-ID")
    except sql.InterfaceError as e:
        print(e)
Thank you for your help.

I found how to fix it:
def to_insert():
    for message in mbox.values():
        Dionyversite = dict(message.items())
        Dionyversite["payload"] = message.get_payload(decode=True)
        yield Dionyversite
As you can see, I added `decode=True` to the `get_payload` call inside the `to_insert` function.
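`get_payload(decode=True)` returns the payload as raw bytes, which sqlite-utils stores as a BLOB. If you would rather store readable text, a possible refinement (just a sketch, assuming single-part messages; the UTF-8 fallback is my assumption) is to decode those bytes with the charset each message declares:

def to_insert():
    for message in mbox.values():
        Dionyversite = dict(message.items())
        raw = message.get_payload(decode=True)  # bytes, or None for multipart messages
        if raw is not None:
            # use the charset declared in the message headers, falling back to UTF-8
            charset = message.get_content_charset() or "utf-8"
            Dionyversite["payload"] = raw.decode(charset, errors="replace")
        else:
            Dionyversite["payload"] = message.get_payload()
        yield Dionyversite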

Related

Why is only half my data being passed into my dictionary?

When I run this script I can verify that it loops through all of the values, but not all of them get passed into my dictionary:
file = open('path', 'rb')
readFile = PyPDF2.PdfFileReader(file)
lineData = {}
totalPages = readFile.numPages
for i in range(totalPages):
    pageObj = readFile.getPage(i)
    pageText = pageObj.extractText()
    newTrans = re.compile(r'Jan \d{2,}')
    for line in pageText.split('\n'):
        if newTrans.match(line):
            newValue = re.split(r'Jan \d{2,}', line)
            newValueStr = ' '.join(newValue)
            newKey = newTrans.findall(line)
            newKeyStr = ' '.join(newKey)
            print(newKeyStr + newValueStr)
            lineData[newKeyStr] = newValueStr
print(len(lineData))
There are 80+ data pairs, but when I run this the dict only ends up with 37.
Well, duplicate keys, maybe? Dictionary keys are unique, so each repeated newKeyStr silently overwrites the earlier entry. Try making lineData = [] and appending there: lineData.append({newKeyStr: newValueStr}), then check how many records you get.
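A tiny demonstration of the overwrite theory (hypothetical data, not taken from the PDF):

lineData = {}
lineData["Jan 01"] = "first"
lineData["Jan 01"] = "second"   # same key: overwrites "first"
print(len(lineData))            # 1

records = []
records.append({"Jan 01": "first"})
records.append({"Jan 01": "second"})
print(len(records))             # 2 -- every record is kept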

Error inserting an email string in sqlite3 and python Tkinter

I am writing a simple program to update a basic db based on data entered on a simple GUI. I'm using string formatting but keep getting an error trying to enter an email address, which I know should be surrounded with double-quotes. I'm sure the solution is simple; I just don't know what it is!
def update_rec():
    # Connect to the db
    conn = sqlite3.connect("address_book.db")
    # create a cursor
    c = conn.cursor()
    fields = ["f_name", "s_name", "mob", "email"]
    # Check which textboxes have data
    update_txt = ""
    update_field = ""
    rec_no = str(id_no.get())
    if len(f_name.get()) > 0:
        update_txt = f_name.get()
        update_field = fields[0]
    elif len(s_name.get()) > 0:
        update_txt = s_name.get()
        update_field = fields[1]
    elif len(mob.get()) > 0:
        update_txt = mob.get()
        update_field = fields[2]
    elif len(email.get()) > 0:
        update_txt = email.get()
        update_field = fields[3]
    else:
        update_txt = ""
        update_field = ""
    c.execute("""UPDATE address_book SET {0} = ? WHERE {1} = ?""".format(update_field, update_txt), rec_no)
    conn.commit()
    conn.close()
I keep getting this error:
c.execute("""UPDATE address_book SET {0} = ? WHERE {1} = ?""".format(update_field, update_txt), rec_no)
sqlite3.OperationalError: near "#gmail": syntax error
What needs to be supplied to .format() is getting confused with what needs to be passed to c.execute().
Do it in two steps so it's easier to understand.
You need to tell us what rec_field should be; it's probably something like id or address_book_id.
rec_field = 'id'  # you know what this should be...
qry = """UPDATE address_book
         SET {0} = ?
         WHERE {1} = ?;""".format(update_field, rec_field)
c.execute(qry, (update_txt, rec_no))
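Only values can go through the ? placeholders; the column names have to be interpolated into the SQL string itself. They come from the fixed fields list here, so this is safe, but a cheap guard documents that assumption (my addition, not part of the original answer):

allowed = {"f_name", "s_name", "mob", "email"}
if update_field not in allowed:
    # refuse to build SQL from an unexpected column name
    raise ValueError("unexpected column: " + repr(update_field))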

How to handle blank lines, junk lines and \n while converting an input file to a csv file

Below is the sample data in the input file. I need to process this file and turn it into a csv file. With some help, I was able to convert it to a csv file, but the conversion is not complete, since I am not able to handle \n, the junk line (2nd line) and the blank line (4th line). I also need help filtering out the "rewrite" transaction_type.
{"transaction_type": "new", "policynum": 4994949}
44uu094u4
{"transaction_type": "renewal", "policynum": 3848848,"reason": "Impressed with \n the Service"}
{"transaction_type": "cancel", "policynum": 49494949, "cancel_table":[{"cancel_cd": "AU"}, {"cancel_cd": "AA"}]}
{"transaction_type": "rewrite", "policynum": 5634549}
Below is the code
import ast
import csv

with open('test_policy', 'r') as in_f, open('test_policy.csv', 'w') as out_f:
    data = in_f.readlines()
    writer = csv.DictWriter(
        out_f,
        fieldnames=['transaction_type', 'policynum', 'cancel_cd', 'reason'],
        lineterminator='\n',
        extrasaction='ignore')
    writer.writeheader()
    for row in data:
        dict_row = ast.literal_eval(row)
        if 'cancel_table' in dict_row:
            cancel_table = dict_row['cancel_table']
            cancel_cd = []
            for cancel_row in cancel_table:
                cancel_cd.append(cancel_row['cancel_cd'])
            dict_row['cancel_cd'] = ','.join(cancel_cd)
        writer.writerow(dict_row)
Below is my output when the junk line, the blank line and the "rewrite" transaction type are not considered.
transaction_type,policynum,cancel_cd,reason
new,4994949,,
renewal,3848848,,"Impressed with
the Service"
cancel,49494949,"AU,AA",
Expected output
transaction_type,policynum,cancel_cd,reason
new,4994949,,
renewal,3848848,,"Impressed with the Service"
cancel,49494949,"AU,AA",
Hmm, I tried to fix this, but I do not know much about how CSV files work; my small bit of knowledge suggests running this code before converting the file.
txt = {"transaction_type": "renewal",
"policynum": 3848848,
"reason": "Impressed with \n the Service"}
newTxt = {}
for i,j in txt.items():
# local var (temporar)
lastX = ""
correctJ = ""
# check if in J is ascii white space "\n" and get it out
if "\n" in f"b'{j}'":
j = j.replace("\n", "")
# for grammar purpose check if
# J have at least one space
if " " in str(j):
# if yes check it closer (one by one)
for x in ([j[y:y+1] for y in range(0, len(j), 1)]):
# if 2 spaces are consecutive pass the last one
if x == " " and lastX == " ":
pass
# if not update correctJ with new values
else:
correctJ += x
# remember what was the last value checked
lastX = x
# at the end make J to be the correctJ (just in case J has not grammar errors)
j = correctJ
# add the corrections to a new dictionary
newTxt[i]=j
# show the resoult
print(f"txt = {txt}\nnewTxt = {newTxt}")
Terminal output:
txt = {'transaction_type': 'renewal', 'policynum': 3848848, 'reason': 'Impressed with \n the Service'}
newTxt = {'transaction_type': 'renewal', 'policynum': 3848848, 'reason': 'Impressed with the Service'}
Process finished with exit code 0
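For the record, the skipping the question asks about can also be folded straight into the original script; here is a sketch (my assumption: every junk line fails ast.literal_eval):

import ast
import csv

with open('test_policy', 'r') as in_f, open('test_policy.csv', 'w') as out_f:
    writer = csv.DictWriter(
        out_f,
        fieldnames=['transaction_type', 'policynum', 'cancel_cd', 'reason'],
        lineterminator='\n',
        extrasaction='ignore')
    writer.writeheader()
    for row in in_f:
        row = row.strip()
        if not row:                        # blank line
            continue
        try:
            dict_row = ast.literal_eval(row)
        except (ValueError, SyntaxError):  # junk line such as 44uu094u4
            continue
        if dict_row.get('transaction_type') == 'rewrite':
            continue                       # filter out "rewrite" records
        if 'reason' in dict_row:
            # collapse the embedded \n and the doubled spaces around it
            dict_row['reason'] = ' '.join(dict_row['reason'].split())
        if 'cancel_table' in dict_row:
            dict_row['cancel_cd'] = ','.join(
                c['cancel_cd'] for c in dict_row['cancel_table'])
        writer.writerow(dict_row)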

load .npy file from google cloud storage with tensorflow

I'm trying to load .npy files from my Google Cloud Storage into my model. I followed this example here: Load numpy array in google-cloud-ml job
But I get this error:
'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
Can you help me please?
Here is a sample from the code.
Here I read the file:
with file_io.FileIO(metadata_filename, 'r') as f:
    self._metadata = [line.strip().split('|') for line in f]
And here I start processing it:
if self._offset >= len(self._metadata):
    self._offset = 0
    random.shuffle(self._metadata)
meta = self._metadata[self._offset]
self._offset += 1

text = meta[3]
if self._cmudict and random.random() < _p_cmudict:
    text = ' '.join([self._maybe_get_arpabet(word) for word in text.split(' ')])

input_data = np.asarray(text_to_sequence(text, self._cleaner_names), dtype=np.int32)
f = StringIO(file_io.read_file_to_string(
    os.path.join('gs://path', meta[0])))
linear_target = tf.Variable(initial_value=np.load(f), name='linear_target')
s = StringIO(file_io.read_file_to_string(
    os.path.join('gs://path', meta[1])))
mel_target = tf.Variable(initial_value=np.load(s), name='mel_target')
return (input_data, mel_target, linear_target, len(linear_target))
And this is a sample from the data.
This is likely because your file doesn't contain utf-8 encoded text.
It's possible you may need to initialize the file_io.FileIO instance as a binary file using mode = 'rb', or set binary_mode = True in the call to read_file_to_string.
This will cause data that is read to be returned as a sequence of bytes, rather than a string.
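A minimal sketch of the binary-mode variant (the gs:// path is a placeholder, and BytesIO replaces StringIO because np.load wants a bytes buffer):

from io import BytesIO

import numpy as np
from tensorflow.python.lib.io import file_io

# read the raw .npy bytes and let numpy parse them from an in-memory buffer
with file_io.FileIO('gs://path/to/file.npy', mode='rb') as f:
    arr = np.load(BytesIO(f.read()))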

Adding binary header

I have a binary data file that I would like to prepend a header to using Python. Below is the code I have to create the header, but I am unsure how to add it to the test.dat file.
import struct
import os
from struct import *
date = 20151027
version = 1
datatype = str.encode('P')
indextype = str.encode('I')
recct = int(os.path.getsize("H:\\test\\test.dat")/16)
delim = str.encode(' ')
filler = str.encode(' ')
delta = 'F'
pdate = pack('l', date)
pversion = pack('h', version)
pdatatype = pack('>s', datatype)
pindextype = pack('>s', indextype)
precct = pack('l', recct)
pdelim = pack('s', delim)
pfiller = pack('<2s', filler)
header = pdate + pversion + pdatatype + pindextype + precct + pdelim + pfiller
Read the file in, then write the file out with the header. Be sure to use binary mode:
with open(r'H:\test\test.dat', 'rb') as f:
    data = f.read()
with open(r'H:\test\test.dat', 'wb') as f:
    f.write(header + data)
Also, you can pack in one statement:
header = struct.pack('lhssls2s', date, version, datatype, indextype, recct, delim, filler)
str.encode('P') is an odd way of saying 'P'.encode() or just b'P'.
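One caveat worth flagging (my note, not the original answer's): in native mode struct may insert alignment padding between fields, so the header size becomes platform-dependent; an explicit byte-order prefix such as '<' packs the fields back-to-back with fixed sizes.

import struct

print(struct.calcsize('lhssls2s'))   # native alignment: platform-dependent size
print(struct.calcsize('<lhssls2s'))  # no padding: always 15 bytes (4+2+1+1+4+1+2)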
