Compare 2 CSV files (encoded = "utf8") keeping data format - python-3.x

I have 2 stock lists (new and old). How can I compare them to see which items have been added and which have been removed (I'm happy to write them to 2 different files, added and removed)?
So far I have tried something along the lines of comparing the files row by row:
import csv

new = "new.csv"
old = "old.csv"
add_file = "add.csv"
remove_file = "remove.csv"

with open(new, encoding="utf8") as new_read, open(old, encoding="utf8") as old_read:
    new_reader = csv.DictReader(new_read)
    old_reader = csv.DictReader(old_read)
    for new_row in new_reader:
        for old_row in old_reader:
            if old_row["STOCK CODE"] == new_row["STOCK CODE"]:
                print("found")
This works for 1 item. If I add an *else:* it just keeps printing until the match is found, so it's not an accurate way of comparing the files.
I have about 5k rows.
There must be a better way to write the differences to the 2 different files while keeping the same data structure at the same time?
N.B. I have tried this link: Python : Compare two csv files and print out differences
2 minor issues:
1. the data structure is not kept
2. there is no reference to the change of location

You could just read the data into memory and then compare.
I used sets for the codes in this example for faster lookup.
import csv

def get_csv_data(file_name):
    data = []
    codes = set()
    with open(file_name, encoding="utf8") as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            data.append(row)
            codes.add(row['STOCK CODE'])
    return data, codes

def write_csv(file_name, data, codes):
    with open(file_name, 'w', encoding="utf8", newline='') as csv_file:
        headers = list(data[0].keys())
        writer = csv.DictWriter(csv_file, fieldnames=headers)
        writer.writeheader()
        for row in data:
            if row['STOCK CODE'] not in codes:
                writer.writerow(row)

new_data, new_codes = get_csv_data('new.csv')
old_data, old_codes = get_csv_data('old.csv')
write_csv('add.csv', new_data, old_codes)
write_csv('remove.csv', old_data, new_codes)
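If you only need the stock codes themselves rather than the full rows, the same sets give you the differences directly via set subtraction; a minimal sketch reusing the get_csv_data helper above:

_, new_codes = get_csv_data('new.csv')
_, old_codes = get_csv_data('old.csv')
added_codes = new_codes - old_codes      # stock codes present only in new.csv
removed_codes = old_codes - new_codes    # stock codes present only in old.csv
print(sorted(added_codes), sorted(removed_codes))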

Related

How to create subrows of a row in Python

I want to insert data into a dataframe like the image below, using only the csv module in Python.
Is there any way to split rows this way?
You should think in terms of what a CSV file is, rather than of the Python csv module.
CSV files are nothing more than text representations of flat tables, therefore your sub-categories and sub-totals require separate rows.
If you want to create an object with a list of <sub-category, sub-total> pairs you have to parse the rows accordingly.
First you read a category and a total frequency and create the new category object; then, while the category stays the same, you add <sub-category, sub-total> pairs to its sub-categories list.
With the assumptions that each category's rows are consecutive (a category value does not reappear later) and that there is a header row, you could try something like this:
import csv

with open('cats.csv', mode='r') as csv_file:
    fieldnames = ['category', 'total', 'sub-category', 'sub-total']
    csv_reader = csv.DictReader(csv_file, fieldnames=fieldnames)
    lastCat = ""
    nextCat = ""
    row = next(csv_reader)  # I'm skipping the first line
    row = next(csv_reader, '')
    while True:
        if row == '':
            break
        nextCat = row['category']
        lastCat = nextCat
        newCategory = Category.fromCSV(row)  # This is just an example
        while nextCat == lastCat:
            newCategory.addData(row)
            row = next(csv_reader, '')
            if row == '':
                break
            nextCat = row['category']
I didn't test my code, so I don't recommend using it as anything more than a suggestion.
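As an alternative, itertools.groupby can do the consecutive grouping for you. Here is a minimal sketch under the same assumptions, collecting plain dicts instead of the Category class above:

import csv
from itertools import groupby

categories = []
with open('cats.csv', newline='') as csv_file:
    fieldnames = ['category', 'total', 'sub-category', 'sub-total']
    reader = csv.DictReader(csv_file, fieldnames=fieldnames)
    next(reader)  # skip the header row
    # group consecutive rows that share the same 'category' value
    for cat, rows in groupby(reader, key=lambda r: r['category']):
        rows = list(rows)
        categories.append({
            'category': cat,
            'total': rows[0]['total'],
            'sub-categories': [(r['sub-category'], r['sub-total']) for r in rows],
        })
print(categories)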

Write into CSV only if row does not exist

I am saving an array into a csv file, using this code:
def save_annotations(**kwargs):
    ann = request.get_json()
    print(ann)
    filename = ann[3].split('.')[0]
    run_id = ann[4]
    run_number = ann[4].split('/')[0]
    exp_id = ann[4].split('/')[1]
    ann_type = ann[2]
    if ann_type == 'wrongDetection':
        with open(f"/code/data/mlruns/{run_number}/{exp_id}/wrong_annotations_{filename}_{run_id.replace('/', '_')}.csv", 'a') as w_ann:
            writer = csv.writer(w_ann, delimiter=',')
            writer.writerow(ann[0:2])
            w_ann.close()
    else:
        with open(f"/code/data/mlruns/{run_number}/{exp_id}/new_detections_{filename}_{run_id.replace('/', '_')}.csv", 'a') as w_ann:
            writer = csv.writer(w_ann, delimiter=',')
            writer.writerow(ann[0:2])
            w_ann.close()
However, I don't want repeated rows in my csv file. I only want to write to csv if ann[0] and ann[1] are not in the csv already.
What would be the best approach to do this?
kind regards
One way to do this would be to collect the already existing values in a set, and check new values to see if they are in the set before processing. You would need a set for each csv file.
For example:
def build_set(filename):
    with open(filename, 'r') as f:
        reader = csv.reader(f)
        # Skip header row, if necessary
        next(reader)
        return {tuple(row[0:2]) for row in reader}
Then in your function you could do:
if tuple(ann[0:2]) in set_for_this_file:
    return  # row already exists, skip writing
set_for_this_file.add(tuple(ann[0:2]))
# write data to file
Building the sets would require reading through all the csv files each time the program is executed, which might be inefficient if the files were large and/or numerous.
A more efficient approach might be to store the data in a database table, with columns for ann[0], ann[1], ann_type, exp_id, run_number and run_id. Add a unique constraint over these columns and you would have the same functionality.
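A minimal sketch of that idea with the standard-library sqlite3 module; the table, column and function names here are illustrative, not part of the original code:

import sqlite3

def save_annotation(db_path, ann0, ann1, ann_type, exp_id, run_number, run_id):
    conn = sqlite3.connect(db_path)
    # the UNIQUE constraint is what prevents duplicate rows
    conn.execute("""
        CREATE TABLE IF NOT EXISTS annotations (
            ann0 TEXT, ann1 TEXT, ann_type TEXT,
            exp_id TEXT, run_number TEXT, run_id TEXT,
            UNIQUE (ann0, ann1, ann_type, exp_id, run_number, run_id)
        )
    """)
    # INSERT OR IGNORE silently skips rows that would violate the constraint
    conn.execute(
        "INSERT OR IGNORE INTO annotations VALUES (?, ?, ?, ?, ?, ?)",
        (ann0, ann1, ann_type, exp_id, run_number, run_id),
    )
    conn.commit()
    conn.close()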

I read a line on a csv file and want to know the item number of a word

The header line in my csv file is:
Number,Name,Type,Manufacturer,Material,Process,Thickness (mil),Weight (oz),Dk,Orientation,Pullback distance (mil),Description
I can open it and read the line, with no problems:
infile = open('CS_Data/_AD_LayersTest.csv','r')
csv_reader = csv.reader(infile, delimiter=',')
for row in csv_reader:
But I want to find out what the item number is for "Dk".
The problem is that not only can the items be in any order, as decided by the user in a different application, but there can also be up to 25 items in the line.
How do I quickly determine which item is "Dk" so I can write Dk = (row[i]) for it and extract it from all the data after the header?
I have tried the code below on each of the potential 25 items and it works, but it seems like a waste of time, energy and my OCD.
while True:
    try:
        if (row[0]) == "Dk":
            DkColumn = 0
            break
        elif (row[1]) == "Dk":
            DkColumn = 1
            break
        ...
        elif (row[24]) == "Dk":
            DkColumn = 24
            break
        else:
            f.write('Stackup needs a "Dk" column.')
            break
    except:
        print("Exception occurred")
        break
Can't you get the index of the column (using list.index()) that has the value Dk in it? Something like:
infile = open('CS_Data/_AD_LayersTest.csv', 'r')
csv_reader = csv.reader(infile, delimiter=',')

# Store the header
headers = next(csv_reader, None)

# Get the index of the 'Dk' column
dkColumnIndex = headers.index('Dk')

for row in csv_reader:
    # Access values that belong to the 'Dk' column
    rowDkValue = row[dkColumnIndex]
    print(rowDkValue)
In the code above, we store the first line of the CSV in as a list in headers. We then search the list to find the index of the item that has the value of 'Dk'. That will be the column index.
Once we have that column index, we can then use it in each row to access the particular index, which will correspond to the column which Dk is the header of.
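If the 'Dk' column might be missing from the header, list.index() raises a ValueError, so you can fall back to the message from the original code; a small sketch, assuming f is the already-open output file from the question:

try:
    dkColumnIndex = headers.index('Dk')
except ValueError:
    f.write('Stackup needs a "Dk" column.')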
Use the pandas library to keep your column order and access each column by name, e.g. row["column_name"]:
import pandas as pd

dataframe = pd.read_csv(
    "CS_Data/_AD_LayersTest.csv",
    usecols=["Number", "Name", "Type"])  # list whichever columns you need

for index, row in dataframe.iterrows():
    # do something
If I understand your question correctly, and you're not interested in using pandas (as suggested by Mikey - you should really consider his suggestion, however), you should be able to do something like the following:
with open('CS_Data/_AD_LayersTest.csv', 'r') as infile:
    csv_reader = csv.reader(infile, delimiter=',')
    header = next(csv_reader)
    col_map = {col_name: idx for idx, col_name in enumerate(header)}
    for row in csv_reader:
        row_dk = row[col_map['Dk']]
One solution would be to use pandas.
import pandas as pd

df = pd.read_csv('CS_Data/_AD_LayersTest.csv')
Now you can access 'Dk' easily, as long as the file is read correctly:
dk = df['Dk']
and you can access individual values of dk like
for i in range(0, 10):
    temp_var = df.loc[i, 'Dk']
or however you want to access those indexes.

Python CSV not writing data to file

I am new to writing CSV files with Python and have been reading lots of different posts on the topic, but I have run into a wall with this and could use a little help.
import csv

# headers from the read.csv file that I want to parse and write to the new file
headers = ['header1', 'header5', 'header6', 'header7']

# open the write.csv file to write the data to
with open("write.csv", 'wb') as csvWriter:
    writer = csv.writer(csvWriter)

# open the main data file that I want to parse data out of and write to write.csv
with open('reading.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    csvList = list(readCSV)
    # finds the position of the data I want to pull out and write to write.csv
    itemCode = csvList[0].index(headers[0])
    vendorName = csvList[0].index(headers[1])
    supplierID = csvList[0].index(headers[2])
    supplierItemCode = csvList[0].index(headers[3])
    for row in readCSV:
        writer.writerow([row[itemCode], row[vendorName], row[supplierID], row[supplierItemCode]])

csvWriter.close()
---UPDATE---
I made the changes suggested, tried commenting out the following part of the code, and changed 'wb' to 'w', and the program worked. However, I don't understand why, and how do I set this up so that I can list the headers I want to pull out?
csvList = list(readCSV)
itemCode = csvList[0].index(headers[0])
vendorName = csvList[0].index(headers[1])
supplierID = csvList[0].index(headers[2])
supplierItemCode = csvList[0].index(headers[3])
Here is my updated code:
headers = ['header1', 'header5', 'header6', 'header7']

# open the write.csv file to write the data to
with open("write.csv", 'wb') as csvWriter, open('reading.csv') as csvfile:
    writer = csv.writer(csvWriter)
    readCSV = csv.reader(csvfile, delimiter=',')
    """csvList = list(readCSV)
    # finds the position of the data I want to pull out and write to write.csv
    itemCode = csvList[0].index(headers[0])
    vendorName = csvList[0].index(headers[1])
    supplierID = csvList[0].index(headers[2])
    supplierItemCode = csvList[0].index(headers[3])"""
    for row in readCSV:
        writer.writerow([row[0], row[27], row[28], row[29]])
It looks like you want to write a subset of columns to a new file. This problem is simpler with DictReader/DictWriter. Note the correct use of open when using Python 3.x. Your attempt was using the Python 2.x way.
import csv

# headers you want in the order you want
headers = ['header1', 'header5', 'header6', 'header7']

with open('write.csv', 'w', newline='') as csvWriter, open('read.csv', newline='') as csvfile:
    writer = csv.DictWriter(csvWriter, fieldnames=headers, extrasaction='ignore')
    readCSV = csv.DictReader(csvfile)
    writer.writeheader()
    for row in readCSV:
        writer.writerow(row)
Test data:
header1,header2,header3,header4,header5,header6,header7
1,2,3,4,5,6,7
11,22,33,44,55,66,77
Output:
header1,header5,header6,header7
1,5,6,7
11,55,66,77
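The extrasaction='ignore' argument is what makes this work: DictWriter would normally raise a ValueError when a row dictionary contains keys that are not in fieldnames, but with 'ignore' it simply drops those extra columns, so only the four selected headers end up in write.csv.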
If you want to access both files under the same block, you should do something like this:
with open("write.csv", 'wb') as csvWriter,open('reading.csv') as csvfile:
writer = csv.writer(csvWriter)
readCSV = csv.reader(csvfile, delimiter=',' )
csvList = list(readCSV)
#finds where the position of the data I want to pull out and write to write.csv
itemCode = csvList[0].index(headers[0])
vendorName = csvList[0].index(headers[1])
supplierID = csvList[0].index(headers[2])
supplierItemCode = csvList[0].index(headers[3])
for row in readCSV:
writer.writerow([row[itemCode], row[vendorName], row[supplierID], row[supplierItemCode]])
csvWriter.close()
The with open() as csvWriter: construct handles closing of the supplied file once you exit the block. So once you get down to writer.writerow, the file is already closed.
You need to enclose the entire expression in the with open block.
with open("write.csv", 'wb') as csvWriter:
....
#Do all writing within this block
....
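A minimal illustration of the error you get when writing after the block has exited (hypothetical file name):

with open("demo.csv", "w", newline="") as f:
    f.write("inside the block is fine\n")
f.write("outside the block")  # raises ValueError: I/O operation on closed file.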

Can't store the scraped results in third and fourth column in a csv file

I've written a script that scrapes the Address and Phone number of certain shops based on a Name and Lid. It takes the Name and Lid stored in column A and column B respectively from a csv file. After fetching the result of each search, I expected the parser to put the results in column C and column D respectively, as shown in the second image. This is where I'm stuck: I don't know how to manipulate the third and fourth columns, using reading or writing methods, so that the data ends up there. This is what I'm trying now:
import csv
import requests
from lxml import html

Names, Lids = [], []
with open("mytu.csv", "r") as f:
    reader = csv.DictReader(f)
    for line in reader:
        Names.append(line["Name"])
        Lids.append(line["Lid"])

with open("mytu.csv", "r") as f:
    reader = csv.DictReader(f)
    for entry in reader:
        Page = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ", "-"), entry["Lid"])
        response = requests.get(Page)
        tree = html.fromstring(response.text)
        titles = tree.xpath('//article[contains(@class,"business-card")]')
        for title in titles:
            Address = title.xpath('.//p[@class="address"]/span/text()')[0]
            Contact = title.xpath('.//p[@class="phone"]/text()')[0]
            print(Address, Contact)
My csv file currently holds only the Name and Lid columns; the desired output has the scraped Address and Phone added as two further columns alongside them.
You can do it like this. Create a fresh output csv file whose header is based on the input csv, with the addition of the two columns. When you read a csv row it's available as a dictionary, in this case called entry. You can add the new values to this dictionary from the stuff you've gleaned on the 'net. Then write each newly created row out to file.
import csv
import requests
from lxml import html

with open("mytu.csv", "r") as f, open('new_mytu.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = reader.fieldnames + ['Address', 'Phone']
    writer = csv.DictWriter(g, fieldnames=newfieldnames)
    writer.writeheader()
    for entry in reader:
        Page = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ", "-"), entry["Lid"])
        response = requests.get(Page)
        tree = html.fromstring(response.text)
        titles = tree.xpath('//article[contains(@class,"business-card")]')
        #~ for title in titles:
        title = titles[0]
        Address = title.xpath('.//p[@class="address"]/span/text()')[0]
        Contact = title.xpath('.//p[@class="phone"]/text()')[0]
        print(Address, Contact)
        new_row = entry
        new_row['Address'] = Address
        new_row['Phone'] = Contact
        writer.writerow(new_row)
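Note that new_row = entry does not copy anything; both names refer to the same dictionary, so you could just as well add the 'Address' and 'Phone' keys to entry and pass it straight to writer.writerow.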
