I need a faster way, with logging, to parse this special type of data in a CSV file - python-3.x

I have a file with the data format below:
<aqr>a=769 b="United States" c=02/04/2019 d=01:03:23
<aqr>a=798 b="India" c=02/04/2019 d=01:03:23 e="Non existent"
So basically all the lines have multiple columns, but the set of columns is not fixed and there is no header, so the column headers need to be created from the data itself. For the example above, a, b, c, d and e would be the column headers.
I have written code that does the job, but I am looking for a faster way, ideally with a logging facility.
So far my logic is to remove the unwanted data at the beginning of each line, collect the data into a dictionary, and turn that into a DataFrame.
from collections import defaultdict
import shlex

import pandas as pd

result = defaultdict(list)
with open('testfiles/test.csv', 'r') as file:
    for line in file.read().splitlines():
        rule0 = line.strip("<aqr>")           # str.strip removes any of the characters <, a, q, r, > from both ends
        rule0 = '~'.join(shlex.split(rule0))  # shlex keeps quoted values like "United States" intact
        y = rule0.split('~')
        for word in y:
            x = word.split('=', 1)            # split on the first '=' only, so values may contain '='
            result[x[0]].append(x[1])
data = pd.DataFrame.from_dict(result, orient='index')
data = data.T
The result is fine. I just need a faster solution to this.
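One approach that is usually faster is to build one dict per line and construct the DataFrame from the list of dicts in a single call, wiring in the standard logging module along the way. A minimal sketch under the same assumptions as the code above (the file path and the <aqr> prefix are taken from the question):
import logging
import shlex

import pandas as pd

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger(__name__)

def parse_file(path):
    rows = []
    with open(path, 'r') as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if line.startswith('<aqr>'):
                line = line[len('<aqr>'):]
            try:
                # shlex.split honours the quotes around values like "United States"
                row = dict(token.split('=', 1) for token in shlex.split(line))
            except ValueError:
                log.warning('skipping malformed line %d: %r', lineno, line)
                continue
            rows.append(row)
    log.info('parsed %d rows from %s', len(rows), path)
    # Keys missing on some lines (e.g. no e=) simply become NaN
    return pd.DataFrame(rows)

data = parse_file('testfiles/test.csv')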

Related

How to create subrows of a row in Python

I want to insert data into a dataframe like the image below, using only the csv module in Python.
Is there any way to split rows this way?
You should think in terms of what a CSV file is, rather than of the Python csv module.
CSV files are nothing more than text representations of flat tables, therefore your sub-categories and sub-totals require separate rows.
If you want to create an object with a list of <sub-category, sub-total> pairs, you have to parse the rows accordingly.
First you read a category and a total frequency and create the new category object; then, as long as the category stays the same, you can add <sub-category, sub-total> pairs to its sub-categories list.
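For instance, the flat file described above might look like this (values made up for illustration):
category,total,sub-category,sub-total
fruit,10,apple,6
fruit,10,pear,4
dairy,5,milk,5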
Assuming that category is unique and that there is a header row, you could try something like this:
import csv

with open('cats.csv', mode='r') as csv_file:
    fieldnames = ['category', 'total', 'sub-category', 'sub-total']
    csv_reader = csv.DictReader(csv_file, fieldnames=fieldnames)
    lastCat = ""
    nextCat = ""
    row = next(csv_reader)       # I'm skipping the first line
    row = next(csv_reader, '')   # '' acts as an end-of-file sentinel
    while True:
        if row == '':
            break
        nextCat = row['category']
        lastCat = nextCat
        newCategory = Category.fromCSV(row)  # This is just an example
        while nextCat == lastCat:
            newCategory.addData(row)
            row = next(csv_reader, '')
            if row == '':
                break
            nextCat = row['category']
I didn't test my code, so don't treat it as anything more than a suggestion.
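Category.fromCSV and addData are only placeholders in the snippet above; a minimal sketch of what such a hypothetical class could look like:
class Category:
    def __init__(self, name, total):
        self.name = name
        self.total = total
        self.subs = []  # list of (sub-category, sub-total) pairs

    @classmethod
    def fromCSV(cls, row):
        # Build the category from the first row in which it appears
        return cls(row['category'], row['total'])

    def addData(self, row):
        self.subs.append((row['sub-category'], row['sub-total']))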

Compare 2 CSV files (encoded = "utf8") keeping data format

I have 2 stock lists (New and Old). How can I compare them to see which items have been added and which have been removed (happy to write them to 2 different files, added and removed)?
So far I have tried something along the lines of comparing row by row:
import csv

new = "new.csv"
old = "old.csv"
add_file = "add.csv"
remove_file = "remove.csv"

with open(new, encoding="utf8") as new_read, open(old, encoding="utf8") as old_read:
    new_reader = csv.DictReader(new_read)
    old_reader = csv.DictReader(old_read)
    for new_row in new_reader:
        for old_row in old_reader:
            # old_reader is exhausted after the first full pass,
            # so later new_rows find nothing to compare against
            if old_row["STOCK CODE"] == new_row["STOCK CODE"]:
                print("found")
This works for 1 item. If I add an else: it just keeps printing that until the match is found, so it's not an accurate way of comparing the files.
I have around 5k rows.
There must be a better way to write the differences to the 2 different files while keeping the same data structure?
N.B. I have tried this link: Python: Compare two csv files and print out differences
It has 2 minor issues:
1. the data structure is not kept
2. there is no reference to the change of location
You could just read the data into memory and then compare.
I used sets for the codes in this example for faster lookup.
import csv

def get_csv_data(file_name):
    data = []
    codes = set()
    with open(file_name, encoding="utf8") as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            data.append(row)
            codes.add(row['STOCK CODE'])
    return data, codes

def write_csv(file_name, data, codes):
    with open(file_name, 'w', encoding="utf8", newline='') as csv_file:
        headers = list(data[0].keys())
        writer = csv.DictWriter(csv_file, fieldnames=headers)
        writer.writeheader()
        for row in data:
            if row['STOCK CODE'] not in codes:  # keep only rows absent from the other file
                writer.writerow(row)

new_data, new_codes = get_csv_data('new.csv')
old_data, old_codes = get_csv_data('old.csv')

write_csv('add.csv', new_data, old_codes)      # rows of new.csv whose code is not in old.csv
write_csv('remove.csv', old_data, new_codes)   # rows of old.csv whose code is not in new.csv
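As a side note, since the codes are already sets, the added and removed codes themselves fall out of plain set difference, which can be handy for a quick summary:
added_codes = new_codes - old_codes      # codes only in new.csv
removed_codes = old_codes - new_codes    # codes only in old.csv
print(f"{len(added_codes)} added, {len(removed_codes)} removed")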

Write into CSV only if row does not exist

I am saving an array into a csv file, using this code:
import csv

def save_annotations(**kwargs):
    ann = request.get_json()   # 'request' is assumed to be Flask's request object
    print(ann)
    filename = ann[3].split('.')[0]
    run_id = ann[4]
    run_number = ann[4].split('/')[0]
    exp_id = ann[4].split('/')[1]
    ann_type = ann[2]
    if ann_type == 'wrongDetection':
        prefix = 'wrong_annotations'
    else:
        prefix = 'new_detections'
    out_path = f"/code/data/mlruns/{run_number}/{exp_id}/{prefix}_{filename}_{run_id.replace('/', '_')}.csv"
    with open(out_path, 'a') as w_ann:
        writer = csv.writer(w_ann, delimiter=',')
        writer.writerow(ann[0:2])
        # no explicit close() needed: the with block closes the file
However, I don't want repeated rows in my CSV file. I only want to write a row if ann[0] and ann[1] are not in the CSV already.
What would be the best approach to do this?
Kind regards
One way to do this would be to collect the already existing values in a set, and check new values to see if they are in the set before processing. You would need a set for each csv file.
For example:
import csv

def build_set(filename):
    with open(filename, 'r') as f:
        reader = csv.reader(f)
        # Skip header row, if necessary
        next(reader)
        return {tuple(row[0:2]) for row in reader}
Then in your function you could do:
key = tuple(ann[0:2])
if key in set_for_this_file:
    return  # or continue, if this check lives inside a loop
set_for_this_file.add(key)
# write data to file
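Pulled together, a hedged sketch of how these pieces could be wired up for one of the CSV paths (the path and variable names are assumptions, not tested code):
import csv

csv_path = "wrong_annotations_example.csv"   # hypothetical file
set_for_this_file = build_set(csv_path)      # from the helper above

def save_if_new(ann):
    key = tuple(ann[0:2])
    if key in set_for_this_file:
        return                    # row already present, skip the write
    set_for_this_file.add(key)
    with open(csv_path, 'a', newline='') as w_ann:
        csv.writer(w_ann).writerow(ann[0:2])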
Building the sets would require reading through all the CSV files each time the program is executed, which might be inefficient if the files are large and/or numerous.
A more efficient approach might be to store the data in a database table, with columns for ann[0], ann[1], ann_type, exp_id, run_number and run_id. Add a unique constraint over these columns and you would have the same functionality.
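A minimal sketch of that idea using sqlite3 from the standard library (the database path, table name and column names are made up for illustration; INSERT OR IGNORE silently skips rows that would violate the unique constraint):
import sqlite3

def save_unique(db_path, ann, ann_type, exp_id, run_number, run_id):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS annotations (
            ann0 TEXT, ann1 TEXT, ann_type TEXT,
            exp_id TEXT, run_number TEXT, run_id TEXT,
            UNIQUE (ann0, ann1, ann_type, exp_id, run_number, run_id)
        )""")
    with conn:  # the with block commits on success
        conn.execute(
            "INSERT OR IGNORE INTO annotations VALUES (?, ?, ?, ?, ?, ?)",
            (ann[0], ann[1], ann_type, exp_id, run_number, run_id),
        )
    conn.close()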

Python list() vs append()

I'm trying to create a list of lists from a csv file.
Row 1 of CSV is a line describing the data source
Row 2 of CSV is the header
Row 3 of CSV is where the data starts
There are two ways I can go about it but I don't know why they're different.
The first uses list(), and for some reason the result ignores rows 1 and 2 of the CSV.
data = []
with open(datafile, 'rb') as f:
    for line in f:
        data = list(csv.reader(f, delimiter=','))
return (name, data)
Whereas if I use .append(), I'd have to use next() to skip row 2:
data = []
with open(datafile, 'rb') as f:
    file = csv.reader(f, delimiter=',')
    next(file)
    for line in file:
        data.append(line)
return (name, data)
Why does list() ignore the header rows whereas append() doesn't?
Actually, this is not related to Python's list() or append(); it comes from the logic used in the first snippet.
The program is not skipping the header, it is replacing it: inside the loop, csv.reader(f) is handed the already-open file, and list() drains everything left in it, while the for line in f loop itself consumed the first line. So the loop body effectively runs only once, and each assignment rebinds data to a brand-new list, overwriting whatever was there previously.
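A tiny experiment makes this visible; it assumes a hypothetical example.csv with three lines:
import csv

with open('example.csv') as f:
    for line in f:
        rest = list(csv.reader(f))
        print('loop saw:', repr(line))
        print('reader consumed:', rest)
# Prints exactly once: line 1 went to the loop,
# lines 2-3 were consumed by csv.reader.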
Correct code:
data = []
with open(datafile, 'rb') as f:
    next(f)  # skip the description line
    for line in f:
        # note: extend() flattens the fields into one list;
        # use data.append(line.split(",")) if you want a list of lists
        data.extend(line.split(","))
return (name, data)
This just extends the existing list with the new list passed as an argument, and there is no problem with the 2nd snippet.
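For what it's worth, the difference between the two methods in a nutshell (made-up values):
a = [1, 2]
a.extend([3, 4])   # merges the elements in: [1, 2, 3, 4]

b = [1, 2]
b.append([3, 4])   # nests the whole list:   [1, 2, [3, 4]]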

Sort excel worksheet using python

I have an excel sheet like this:
I would like to output the data into an excel file like this:
Basically, for the common elements in columns 2, 3 and 4, I want to concatenate the values in the 5th column.
Please suggest how I could do this.
The easiest way to approach an issue like this is to export the spreadsheet to CSV first, which helps ensure it imports correctly into Python.
Using a defaultdict, you can create a dictionary with unique keys and iterate through the lines, appending the final column's values to a list.
Finally you can write it back out in CSV format:
from collections import defaultdict

results = defaultdict(list)

with open("in_file.csv") as f:
    header = f.readline()
    for line in f:
        cols = line.rstrip("\n").split(",")  # strip the newline so it doesn't leak into the output
        key = ",".join(cols[0:4])
        results[key].append(cols[4])

with open("out_file.csv", "w") as f:
    f.write(header)
    for k, v in results.items():  # iteritems() was Python 2; items() works on Python 3
        line = '{},"{}",\n'.format(k, ", ".join(v))
        f.write(line)
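If pandas is an option, the same grouping can be expressed more compactly. A hedged sketch, assuming (as the code above does) that the first four columns form the key:
import pandas as pd

df = pd.read_csv("in_file.csv")
key_cols = list(df.columns[:4])
value_col = df.columns[4]

# Concatenate the 5th column's values for rows sharing the first four columns
out = (df.groupby(key_cols, sort=False)[value_col]
         .agg(lambda s: ", ".join(s.astype(str)))
         .reset_index())
out.to_csv("out_file.csv", index=False)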
