How to handle blank lines, junk lines and \n while converting an input file to a csv file - python-3.x

Below is the sample data in the input file. I need to process this file and turn it into a csv file. With some help, I was able to convert it to a csv file; however, it is not fully converted since I am not able to handle the \n, the junk line (2nd line) and the blank line (4th line). I also need help filtering on transaction_type, i.e., skipping the "rewrite" transaction_type.
{"transaction_type": "new", "policynum": 4994949}
44uu094u4
{"transaction_type": "renewal", "policynum": 3848848,"reason": "Impressed with \n the Service"}
{"transaction_type": "cancel", "policynum": 49494949, "cancel_table":[{"cancel_cd": "AU"}, {"cancel_cd": "AA"}]}
{"transaction_type": "rewrite", "policynum": 5634549}
Below is the code
import ast
import csv

with open('test_policy', 'r') as in_f, open('test_policy.csv', 'w') as out_f:
    data = in_f.readlines()
    writer = csv.DictWriter(
        out_f,
        fieldnames=['transaction_type', 'policynum', 'cancel_cd', 'reason'],
        lineterminator='\n',
        extrasaction='ignore')
    writer.writeheader()
    for row in data:
        dict_row = ast.literal_eval(row)
        if 'cancel_table' in dict_row:
            cancel_table = dict_row['cancel_table']
            cancel_cd = []
            for cancel_row in cancel_table:
                cancel_cd.append(cancel_row['cancel_cd'])
            dict_row['cancel_cd'] = ','.join(cancel_cd)
        writer.writerow(dict_row)
Below is my output, not yet handling the junk line, the blank line, or the "rewrite" transaction type.
transaction_type,policynum,cancel_cd,reason
new,4994949,,
renewal,3848848,,"Impressed with
the Service"
cancel,49494949,"AU,AA",
Expected output
transaction_type,policynum,cancel_cd,reason
new,4994949,,
renewal,3848848,,"Impressed with the Service"
cancel,49494949,"AU,AA",

Hmm, I tried to fix this. I don't know much about how CSV files work, but with my small knowledge I would suggest running this code on each record before converting the file.
txt = {"transaction_type": "renewal",
"policynum": 3848848,
"reason": "Impressed with \n the Service"}
newTxt = {}
for i,j in txt.items():
# local var (temporar)
lastX = ""
correctJ = ""
# check if in J is ascii white space "\n" and get it out
if "\n" in f"b'{j}'":
j = j.replace("\n", "")
# for grammar purpose check if
# J have at least one space
if " " in str(j):
# if yes check it closer (one by one)
for x in ([j[y:y+1] for y in range(0, len(j), 1)]):
# if 2 spaces are consecutive pass the last one
if x == " " and lastX == " ":
pass
# if not update correctJ with new values
else:
correctJ += x
# remember what was the last value checked
lastX = x
# at the end make J to be the correctJ (just in case J has not grammar errors)
j = correctJ
# add the corrections to a new dictionary
newTxt[i]=j
# show the resoult
print(f"txt = {txt}\nnewTxt = {newTxt}")
Terminal:
txt = {'transaction_type': 'renewal', 'policynum': 3848848, 'reason': 'Impressed with \n the Service'}
newTxt = {'transaction_type': 'renewal', 'policynum': 3848848, 'reason': 'Impressed with the Service'}
Process finished with exit code 0
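The snippet above only normalizes the whitespace inside one record. To also skip the junk line and the blank line and drop the "rewrite" records, one option is to parse each line defensively: the sample records are valid JSON, so json.loads works and fails cleanly on the junk line. A minimal sketch under the question's assumptions (input file named test_policy, same four output columns):

import csv
import json

with open('test_policy', 'r') as in_f, open('test_policy.csv', 'w', newline='') as out_f:
    writer = csv.DictWriter(
        out_f,
        fieldnames=['transaction_type', 'policynum', 'cancel_cd', 'reason'],
        lineterminator='\n',
        extrasaction='ignore')
    writer.writeheader()
    for line in in_f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip junk lines that are not valid JSON
        if row.get('transaction_type') == 'rewrite':
            continue  # filter out "rewrite" transactions
        if 'reason' in row:
            # collapse the embedded newline and any doubled spaces
            row['reason'] = ' '.join(row['reason'].split())
        if 'cancel_table' in row:
            row['cancel_cd'] = ','.join(
                d['cancel_cd'] for d in row['cancel_table'])
        writer.writerow(row)

This should yield the expected output above, with "Impressed with the Service" on a single line.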

Related

Why are extra blank lines generated while writing to file?

I want to find specific lines in a file, add a string to the end of each of those lines, and then update the file, but the updated file has extra blank lines between the lines.
import datetime

def Reading_Logging(PacketName, PacketTot, PacketNum):
    try:
        with open("C:\\Users\\Shakib\\Desktop\\test.txt", "r+") as f:
            content = f.read().splitlines()
            for i, l in enumerate(content):
                tnow = datetime.datetime.now()
                linesplit = l.split(',')
                if linesplit[0] == PacketName and linesplit[1] == PacketTot and linesplit[2] == PacketNum:
                    content[i] = content[i] + ',' + str(tnow)
        with open("C:\\Users\\Shakib\\Desktop\\newtest.txt", "w") as f:
            f.write('\n'.join(content))
    except IndexError:
        # assumed handler: skip lines that do not have the expected three fields
        pass
I expect the following output without blank lines, but this is my real output:
ZoYt,97,0,3.394531,2019-07-27 14:40:27.671415,2019-07-27 19:22:48.824541
ZoYt,97,1,3.000977,2019-07-27 14:40:27.701415
ZoYt,97,2,1.879883,2019-07-27 14:40:27.731415
ZoYt,97,3,3.681641,2019-07-27 14:40:27.753415
ZoYt,97,4,1.069336,2019-07-27 14:40:27.760416
ZoYt,97,5,1.094727,2019-07-27 14:40:27.773417
ZoYt,97,6,3.077148,2019-07-27 14:40:27.787417
ZoYt,97,7,1.015625,2019-07-27 14:40:27.798418
ZoYt,97,8,3.765625,2019-07-27 14:40:27.813419
ZoYt,97,9,2.797852,2019-07-27 14:40:27.823419
ZoYt,97,10,3.860352,2019-07-27 14:40:27.837420
ZoYt,97,11,3.179688,2019-07-27 14:40:27.849421
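One way to guard against stray blank lines, wherever they come from, is to drop empty lines when reading and rejoin with exactly one newline per record. A minimal sketch (the paths and field layout are taken from the question; the helper name append_timestamp is mine):

import datetime

def append_timestamp(in_path, out_path, packet_name, packet_tot, packet_num):
    with open(in_path, "r") as f:
        # splitlines() strips the newline characters; the filter drops blank lines
        lines = [l for l in f.read().splitlines() if l.strip()]
    updated = []
    for line in lines:
        fields = line.split(',')
        if fields[:3] == [packet_name, packet_tot, packet_num]:
            line = line + ',' + str(datetime.datetime.now())
        updated.append(line)
    with open(out_path, "w") as f:
        f.write('\n'.join(updated) + '\n')

# example call, with values from the sample output above
append_timestamp("C:\\Users\\Shakib\\Desktop\\test.txt",
                 "C:\\Users\\Shakib\\Desktop\\newtest.txt",
                 "ZoYt", "97", "0")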

I need to clean seismological events from a text file

The question here is related to the same type of file I asked another question about, almost one month ago (I need to split a seismological file so that I have multiple subfiles).
My goal now is to delete events which in their first line contain the string 'RSN 3'. So far I have tried editing the aforementioned question's best answer code like this:
with open(refcatname) as fileContent:
    for l in fileContent:
        check_rsn_3 = l[45:51]
        if check_rsn_3 == "RSN 3":
            line = l[:-1]
            check_event = line[1:15]
            print(check_event, check_rsn_3)
        if not check_rsn_3 == "RSN 3":
            # Strip whitespace to make sure this is an empty line
            if not l.strip():
                subFile.write(eventInfo + "\n")  # Add event to the subfile
                eventInfo = ""  # Reinit event info
                eventCounter += 1
                if eventCounter == 700:
                    subFile.close()
                    fileId += 1
                    subFile = open(
                        os.path.join(
                            catdir,
                            "Paquete_Continental_" + str(fileId) + ".out",
                        ),
                        "w+",
                    )
                    eventCounter = 0
            else:
                eventInfo += l

subFile.close()
Expected results: Event info of earthquakes with 'RSN N' (where N≠3)
Actual results: First line of events with 'RSN 3' is deleted but not the remaining event info.
Thanks in advance for your help :)
I'd advise against checking whether the string sits at an exact position (e.g. l[45:51]), since a single extra character can throw that off; you can instead check whether the entire line contains "RSN 3" with if "RSN 3" in l.
Note that line = l[:-1] only drops the last character of the line (the trailing newline), so line[1:15] still slices the string as expected.
But if you need to delete several lines, you could just check if the current line contains "RSN 3", and then read line after line until one contains "RSN " while skipping the ones in between.
skip = False
for line in fileContent:
    if "RSN 3" in line:
        skip = True
        continue
    if "RSN " in line and "RSN 3" not in line:
        skip = False
        # rest of the logic
    if skip:
        continue
This way you don't even parse the blocks whose first line contains "RSN 3".
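As a quick self-contained check of the skip-flag idea (the sample lines below are made up, not real catalog data):

sample = [
    "RSN 3 event A header",
    "  event A detail",
    "RSN 5 event B header",
    "  event B detail",
]
skip = False
kept = []
for line in sample:
    if "RSN 3" in line:
        skip = True
        continue
    if "RSN " in line:
        skip = False
    if skip:
        continue
    kept.append(line)
print(kept)  # only event B's header and detail survive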

How to swap string and save

I am having problems saving the file I modified. Basically, I need to replace the string switch_data in the original file with DTC_5814_removing and save the result as a separate file. How would I do that? Here is what the program basically does: it opens an eeprom file, searches for a string between two strings and groups it, then counts the data, and with that searches for another string between two strings and modifies its data. The code works; my question is what the best way is to save the result as a separate file. The filesave handle currently has no function.
Here is the code:
import re

# checking the structures, counting
file = open("eeprom", "rb").read().hex()
filesave = open("eepromMOD", "wb")

DTC_data = re.search("ffff30(.*)100077", file)
print(DTC_data.group(1))
# finds the string between two strings in the 2nd line of the eeprom file
switch_data = re.search("010607(.*)313132", file)
print(switch_data.group(1))
# finds the string between two strings in the 3rd line of the eeprom file

DTC_data_lenght = len(DTC_data.group(1))
# length of the whole DTC_data group
DTC_312D = re.search("ffff30(.*)312d", file)
DTC_3036 = re.search("ffff30(.*)3036", file)
DTC_5814 = re.search("ffff30(.*)5814", file)
# searching for DTC 312D
DTC_312D_lenght = len(DTC_312D.group(1)) + 4
DTC_312D_lenght_start = len(DTC_312D.group(1))
DTC_3036_lenght = len(DTC_3036.group(1)) + 4
DTC_3036_lenght_start = len(DTC_3036.group(1))
DTC_5814_lenght = len(DTC_5814.group(1)) + 4
DTC_5814_lenght_start = len(DTC_5814.group(1))
# confirming the length of the DTC table

if DTC_312D_lenght <= DTC_data_lenght and DTC_312D_lenght % 4 == 0:
    # if the DTC length is shorter than the whole table and divisible by 4
    print("Starting DTC removal")
    switch_data_lenght = len(switch_data.group(1))
    # counting the switch data table
    DTC_312D_removing = switch_data.group(1)[:DTC_312D_lenght_start] + "0000" + switch_data.group(1)[DTC_312D_lenght:]
    # read from the data group (data[:start] + "modified value" + data[end:])
    print(DTC_312D_removing)
else:
    print("DTC non existent or incorrect")

if DTC_3036_lenght <= DTC_data_lenght and DTC_3036_lenght % 4 == 0:
    print("Starting DTC removal")
    DTC_3036_removing = DTC_312D_removing[:DTC_3036_lenght_start] + "0000" + switch_data.group(1)[DTC_3036_lenght:]
    print(DTC_3036_removing)
else:
    print("DTC non existent or incorrect")

if DTC_5814_lenght <= DTC_data_lenght and DTC_5814_lenght % 4 == 0:
    print("Starting DTC removal")
    DTC_5814_removing = DTC_3036_removing[:DTC_5814_lenght_start] + "0000" + switch_data.group(1)[DTC_5814_lenght:]
    print(DTC_5814_removing)
else:
    print("DTC non existent or incorrect")
Solved with
import binascii

File_W = file.replace(switch_data.group(1), DTC_5814_removing)
File_WH = binascii.unhexlify(File_W)
filesave.write(File_WH)
filesave.close()
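For what it's worth, the same save step can use a with block so the output file is closed even if unhexlify fails; a sketch built on the variables above (it replaces the filesave handle opened at the top):

import binascii

File_W = file.replace(switch_data.group(1), DTC_5814_removing)
with open("eepromMOD", "wb") as filesave:
    # unhexlify turns the hex string back into raw bytes before writing
    filesave.write(binascii.unhexlify(File_W))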

Unknown column added in user input form

I have a simple data entry form that writes the inputs to a csv file. Everything seems to be working OK, except that extra columns are being added to the file somewhere in the process, seemingly during the user input phase. Here is the code:
import pandas as pd

# adds all spreadsheets into one list
Batteries = ["MAT0001.csv", "MAT0002.csv", "MAT0003.csv", "MAT0004.csv",
             "MAT0005.csv", "MAT0006.csv", "MAT0007.csv", "MAT0008.csv"]

# User selects battery to log
choice = int(input("Which battery? (1-8):"))

def choosebattery(c):
    done = False
    while not done:
        if c in range(1, 9):
            return Batteries[c - 1]  # shift the 1-8 choice to the 0-7 list index
        else:
            print('Sorry, selection must be between 1-8')

cfile = choosebattery(choice)
cbat = pd.read_csv(cfile)

# Collect Cycle input
print("Enter Current Cycle")
response = None
while response not in {"Y", "N", "y", "n"}:
    response = input("Please enter Y or N: ")
cy = response

# Charger input
print("Enter Current Charger")
response = None
while response not in {"SC-G", "QS", "Bosca", "off", "other"}:
    response = input("Please enter one: 'SC-G', 'QS', 'Bosca', 'off', 'other'")
if response == "other":
    explain = input("Please explain")
    ch = response + ":" + explain
else:
    ch = response

# Location
print("Enter Current Location")
response = None
while response not in {"Rack 1", "Rack 2", "Rack 3", "Rack 4", "EV001", "EV002", "EV003", "EV004", "Floor", "other"}:
    response = input("Please enter one: 'Rack 1 - 4', 'EV001 - 004', 'Floor' or 'other'")
if response == "other":
    explain = input("Please explain")
    lo = response + ":" + explain
else:
    lo = response

# Voltage
done = False
while not done:
    choice = float(input("Enter Current Voltage:"))
    modchoice = choice * 10
    if modchoice in range(500, 700):
        vo = choice
        done = True
    else:
        print('Sorry, selection must be between 50 and 70')

# add inputs to current battery dataframe
log = pd.DataFrame([[cy, ch, lo, vo]], columns=["Cycle", "Charger", "Location", "Voltage"])
clog = pd.concat([cbat, log], axis=0)
clog.to_csv(cfile, index=False)
pd.read_csv(cfile)
And I receive:
Out[18]:
  Charger Cycle Location  Unnamed: 0  Voltage
0     off     n    Floor         NaN     50.0
Where is the "Unnamed" column coming from?
There's an 'unnamed' column coming from your csv. The reason most likely is that the lines in your input csv files end with a comma (i.e. your separator), so pandas interprets that as an additional, nameless column. If that's the case, check whether your lines end with your separator and remove it. For example, if your files are comma-separated, turn:
Column1,Column2,Column3,
val_11, val12, val12,
...
Into:
Column1,Column2,Column3
val_11, val12, val12
...
Alternatively, try specifying the index column explicitly, as in this answer. I believe some of the confusion stems from pandas' concat reordering your columns.
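A minimal sketch of the index-column route, reusing the question's variables (and assuming the stray column is a previously saved index in the first position):

import pandas as pd

# treat the first column of the existing file as the index instead of data
cbat = pd.read_csv(cfile, index_col=0)
log = pd.DataFrame([[cy, ch, lo, vo]],
                   columns=["Cycle", "Charger", "Location", "Voltage"])
# sort=False keeps the original column order instead of sorting alphabetically
clog = pd.concat([cbat, log], axis=0, sort=False)
clog.to_csv(cfile, index=False)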

How to convert cmudict-0.7b or cmudict-0.7b.dict in to FST format to use it with phonetisaurus?

I am looking for a simple procedure to generate an FST (finite state transducer) from cmudict-0.7b or cmudict-0.7b.dict, to be used with phonetisaurus.
I tried the following set of commands (phonetisaurus align, Google NGramLibrary and phonetisaurus arpa2wfst) and was able to generate an FST, but it didn't work. I am not sure where I made a mistake or missed a step. I guess the very first command, i.e. phonetisaurus-align, is not correct.
phonetisaurus-align --input=cmudict.dict --ofile=cmudict/cmudict.corpus --seq1_del=false
ngramsymbols < cmudict/cmudict.corpus > cmudict/cmudict.syms
/usr/local/bin/farcompilestrings --symbols=cmudict/cmudict.syms --keep_symbols=1 cmudict/cmudict.corpus > cmudict/cmudict.far
ngramcount --order=8 cmudict/cmudict.far > cmudict/cmudict.cnts
ngrammake --v=2 --bins=3 --method=kneser_ney cmudict/cmudict.cnts > cmudict/cmudict.mod
ngramprint --ARPA cmudict/cmudict.mod > cmudict/cmudict.arpa
phonetisaurus-arpa2wfst-omega --lm=cmudict/cmudict.arpa > cmudict/cmudict.fst
I tried the fst with phonetisaurus-g2p as follows:
phonetisaurus-g2p --model=cmudict/cmudict.fst --nbest=3 --input=HELLO --words
But it didn't return anything....
Appreciate any help on this matter.
It is very important to keep the dictionary in the right format. Phonetisaurus is very sensitive about that: it requires the word and the phonemes to be tab-separated; spaces will not work. It also does not allow the pronunciation-variant numbers CMUSphinx uses, like (2) or (3). You need to clean up the dictionary with a simple Python script before feeding it into phonetisaurus. Here is the one I use:
#!/usr/bin/python
import sys

if len(sys.argv) != 3:
    print("Split the list on train and test sets")
    print()
    print("Usage: traintest.py file split_count")
    exit()

infile = open(sys.argv[1], "r")
outtrain = open(sys.argv[1] + ".train", "w")
outtest = open(sys.argv[1] + ".test", "w")

cnt = 0
split_count = int(sys.argv[2])
for line in infile:
    items = line.split()
    # strip pronunciation-variant markers like "(2)" from the word
    if items[0][-1] == ')':
        items[0] = items[0][:-3]
    # skip entries with underscores in the word
    if items[0].find("_") > 0:
        continue
    # rebuild the line with the word and the phonemes tab-separated
    line = items[0] + '\t' + " ".join(items[1:]) + '\n'
    # send one entry in every split_count to the test set
    if cnt % split_count == 3:
        outtest.write(line)
    else:
        outtrain.write(line)
    cnt = cnt + 1
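For example, to clean cmudict-0.7b.dict and hold out roughly one entry in ten for testing (the split count here is arbitrary):

python traintest.py cmudict-0.7b.dict 10

This writes cmudict-0.7b.dict.train and cmudict-0.7b.dict.test; the .train file can then go into phonetisaurus-align in place of the raw dictionary.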
