Python: How to sum a column in a CSV file while skipping the header row - python-3.x

Trying to sum a column in a CSV file that has a header row at the top. I'm trying to use this for loop, but it just returns zero. Any thoughts?
import csv

CSVFile = open('Data103.csv')
CSVReader = csv.reader(CSVFile)  # you don't pass a file name directly to csv.reader
CSVDataList = list(CSVReader)  # stores the csv file as a list of lists
print(CSVDataList[0][16])
total = 0
for row in CSVReader:
    if CSVReader.line_num == 1:
        continue
        total += int(row[16])
print(total)
Here is what the data sample looks like in txt:
Value,Value,Value, "15,500.05", 00.00, 00.00
So the items are delimited by commas, except when a value itself contains a comma, in which case it's wrapped in double quotes. It's a pretty standard file with a header row and about 1k lines of data across 18 columns.

You might want to use Pandas.
import pandas as pd
df = pd.read_csv('/path/to/file.csv')
column_sum = df['column_name'].sum()
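Note that with data like the sample in the question, the quoted values such as "15,500.05" contain thousands separators and would otherwise be read as strings. A minimal sketch, assuming the target column is the 17th one in Data103.csv (index 16, matching row[16] in the question):
import pandas as pd

# thousands=',' makes pandas parse "15,500.05" as the number 15500.05
df = pd.read_csv('Data103.csv', thousands=',')
column_sum = df[df.columns[16]].sum()  # 17th column, i.e. index 16
print(column_sum)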

It seems that you've over-indented the line that does the sum. It should be like this:
for row in CSVReader:
    if CSVReader.line_num == 1:
        continue
    total += int(row[16])
Otherwise you'll only sum the values for the first row, which is exactly the one you want to skip.
EDIT:
Since you said the previous change doesn't work, I'd suggest working with the excellent Python lib called rows.
With the following CSV (fruits.csv):
id,name,amount
1,apple,3
2,banana,6
3,pineapple,2
4,lemon,5
You can access columns directly by their name instead of their index:
import rows
data = rows.import_from_csv('fruits.csv')
for fruit_data in data:
    print(fruit_data.name, fruit_data.amount)
# output:
# apple 3
# banana 6
# pineapple 2
# lemon 5
NEW EDIT:
After you've provided the data, I believe in your case you could do something like:
import rows
data = rows.import_from_csv('Data103.csv')
print(data.field_names[16]) # prints the field name
total = 0
for row in data:
    value = row.<column_name>
    value = value.replace(',', '')  # remove commas
    total += float(value)
print(total)
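For reference, the zero total in the original code comes from list(CSVReader) consuming the reader before the for loop starts, so the loop body never runs. A minimal standard-library sketch that avoids this, assuming the values live at column index 16 and may contain thousands separators like "15,500.05":
import csv

total = 0.0
with open('Data103.csv', newline='') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row instead of checking line_num
    for row in reader:
        # strip the thousands separator before converting, e.g. "15,500.05" -> 15500.05
        total += float(row[16].replace(',', ''))
print(total)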

Related

Extract numbers only from specific lines within a txt file with certain keywords in Python

I want to extract numbers only from lines in a txt file that have a certain keyword, add them up, compare them, and then print the highest total number and the lowest total number. How should I go about this?
I want to print the highest and the lowest valid total numbers.
I managed to extract the lines with the "valid" keyword in them, but now I want to get the numbers from these lines, add up the numbers on each line, compare those totals with the other lines that have the same keyword, and then print the highest and the lowest valid totals.
my code so far
#get file object reference to the file
file = open("shelfs.txt", "r")
#read content of file to string
data = file.read()
#close file
closefile = file.close()
#get number of occurrences of the substring in the string
totalshelfs = data.count("SHELF")
totalvalid = data.count("VALID")
totalinvalid = data.count("INVALID")
print('Number of total shelfs :', totalshelfs)
print('Number of valid books :', totalvalid)
print('Number of invalid books :', totalinvalid)
txt file
HEADER|
SHELF|2200019605568|
BOOK|20200120000000|4810.1|20210402|VALID|
SHELF|1591024987400|
BOOK|20200215000000|29310.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200229000000|11519.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200329001234|115.0|20210331|INVALID|
SHELF|1300001188124|
BOOK|2020032904567|1144.0|20210401|INVALID|
FOOTER|
What you need is to use the pandas library.
https://pandas.pydata.org/
You can read a csv file like this:
import pandas as pd
data = pd.read_csv('shelfs.txt', sep='|')
It returns a DataFrame object from which you can easily select or sort your data. It will use the first row as the header; then you can select a specific column like a dictionary:
header = data['HEADER']
header is a Series object.
To filter rows you can do:
shelfs = data.loc[data['HEADER'] == 'SHELF', :]
to select only the rows whose first column is 'SHELF'.
I'm just not sure how pandas will handle the fact that you only have one header field but two to five columns per row. Maybe you should try to create one header per column in your csv, and add separators to make each row the same size first.
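If you would rather not edit the file, read_csv can also be given explicit column names so that every row gets the same number of columns; a sketch, assuming at most six fields per row (the column names below are made up):
import pandas as pd

# header=None plus names=... labels every column even though the file has no
# real header row; shorter rows are padded with NaN.
columns = ['record_type', 'field_1', 'field_2', 'field_3', 'field_4', 'field_5']
data = pd.read_csv('shelfs.txt', sep='|', header=None, names=columns)
books = data.loc[data['record_type'] == 'BOOK', :]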
Edit (No External libraries or change in the txt file):
# Split by row
data = data.split('\n')
# Split by col
data = [d.split('|') for d in data]
# Fill empty cells
n_cols = max([len(d) for d in data])
for i in range(len(data)):
    while len(data[i]) < n_cols:
        data[i].append('')
# List VALID rows
valid_rows = [d for d in data if d[4] == 'VALID']
# Get sum, min and max (convert the numeric fields before adding them)
valid_sum = [float(d[1]) + float(d[2]) + float(d[3]) for d in valid_rows]
valid_minimum = min(valid_sum)
valid_maximum = max(valid_sum)
It may not be exactly what you want to do, but it solves part of your problem. I didn't test the code.

How to create subrows of a row in Python

I want to insert data into a dataframe like the image below, using only the csv module in Python.
Is there any way to split rows this way?
You should think in terms of what a CSV file is, rather than of the Python csv module.
CSV files are nothing more than text representations of flat tables; therefore your sub-categories and sub-totals require separate rows.
If you want to create an object with a list of <sub-category, sub-total> pairs you have to parse the rows accordingly.
First you read a category and a total frequency and create the new category object; then, as long as the category stays the same, you keep adding <sub-category, sub-total> pairs to its sub-categories list.
With the assumptions that category is unique and that there is a header row, you could try something like this:
import csv
with open('cats.csv', mode='r') as csv_file:
    fieldnames = ['category', 'total', 'sub-category', 'sub-total']
    csv_reader = csv.DictReader(csv_file, fieldnames=fieldnames)
    lastCat = ""
    nextCat = ""
    row = next(csv_reader)  # I'm skipping the first line
    row = next(csv_reader, '')
    while True:
        if row == '':
            break
        nextCat = row['category']
        lastCat = nextCat
        newCategory = Category.fromCSV(row)  # This is just an example
        while nextCat == lastCat:
            newCategory.addData(row)
            row = next(csv_reader, '')
            if row == '':
                break
            nextCat = row['category']
I didn't test my code, so I don't recommend using it as anything more than a suggestion.
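Since the Category class above is only illustrative, here is a minimal sketch of the same grouping done with a plain dictionary (same assumed column names; sub-totals stay as strings, convert them if you need numbers):
import csv
from collections import defaultdict

# Maps each category to its total and its list of (sub-category, sub-total) pairs.
categories = defaultdict(lambda: {'total': None, 'subs': []})

with open('cats.csv', mode='r') as csv_file:
    fieldnames = ['category', 'total', 'sub-category', 'sub-total']
    csv_reader = csv.DictReader(csv_file, fieldnames=fieldnames)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        entry = categories[row['category']]
        entry['total'] = row['total']
        entry['subs'].append((row['sub-category'], row['sub-total']))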

CSV manipulation problem. A little complex and would like the solution to not be using pandas

CSV file:
Acct,phn_1,phn_2,phn_3,Name,Consent,zipcode
1234,45678,78906,,abc,NCN,10010
3456,48678,,78976,def,NNC,10010
Problem:
Based on the consent value, which has one letter per phone (in the first data row: the first N is for phn_1, C is for phn_2, and so on), I need to retain only the consented phn column and move the remaining columns to the end of the file.
Below is what I have; I feel my approach isn't that great. I'm trying to get the index of the individual Ns and Cs and map it to the corresponding phone column, but I'm unable to iterate through the phn headers and compare the indexes of the Ns and Cs.
import csv

with open('file.csv', 'rU') as infile:
    reader = csv.DictReader(infile)
    data = {}
    for row in reader:
        for header, value in row.items():
            data.setdefault(header, list()).append(value)  # print(data)

Consent = data['Consent']
for i in range(len(Consent)):
    # print(list(Consent[i]))
    for idx, val in enumerate(list(Consent[i])):
        # print(idx, val)
        if val == 'C':
            # print("C")
            print(idx)
        else:
            print("N")
Could someone provide me with a solution for this?
Please note: I do not want the solution to use pandas.
You’ll find my answer in the comments of the code below.
import csv

def parse_csv(file_name):
    """ """
    # Prepare the output. Note that all rows of a CSV file must have the same structure.
    # So it is actually not possible to put the phone numbers with no consent at the end
    # of the file, but what you can do is to put them at the end of the row.
    # To ensure that the structure is the same on all rows, you need to put all phone numbers
    # at the end of the row. That means the phone number with consent is duplicated, and that
    # is not very efficient.
    # I chose to put the result in a string, but you can use other types.
    output = "Acct,phn,Name,Consent,zipcode,phn_1,phn_2,phn_3\n"
    with open(file_name, "r") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # Search the letter “C” in “Consent” and get the position of the first match.
            # Add one to the result because the “phn_×” keys are 1-based and not 0-based.
            first_c_pos = row["Consent"].find("C") + 1
            # If there is no “C”, then the “phn” key is empty.
            if first_c_pos == 0:
                row["phn"] = ""
            # If there is at least one “C”, create a key string that will take the values
            # phn_1, phn_2 or phn_3.
            else:
                key = f"phn_{first_c_pos}"
                row["phn"] = row[key]
            # Add the current row to the result string.
            output += ",".join([
                row["Acct"], row["phn"], row["Name"], row["Consent"],
                row["zipcode"], row["phn_1"], row["phn_2"], row["phn_3"]
            ])
            output += "\n"
    # Return the string.
    return output

if __name__ == "__main__":
    output = parse_csv("file.csv")
    print(output)
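As a follow-up to the remark above that you can use other types for the result: here is a sketch of the same logic writing the reordered rows straight to a new CSV file with csv.DictWriter (the output file name is made up):
import csv

def parse_csv_to_file(in_name, out_name):
    out_fields = ["Acct", "phn", "Name", "Consent", "zipcode", "phn_1", "phn_2", "phn_3"]
    with open(in_name, "r", newline="") as src, open(out_name, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=out_fields)
        writer.writeheader()
        for row in reader:
            first_c_pos = row["Consent"].find("C") + 1
            row["phn"] = row[f"phn_{first_c_pos}"] if first_c_pos else ""
            # row now contains exactly the fields listed in out_fields
            writer.writerow(row)

parse_csv_to_file("file.csv", "file_out.csv")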

Sorting csv data by a column of numerical values

I have a csv file that has 10 columns and about 7,000 rows. I have to sort the data based on the 4th column (0, 1, 2, 3), which I know is column #3 with 0-based counting. The column has a header, and the data in this column is numeric. The largest value in this column is 7548375288960, so that row should be at the top of my results.
My code is below. Something interesting is happening: if I change reverse=True to reverse=False, the 15 rows printed to the screen are the correct ones, based on me manually sorting the csv file in Excel, but when I set reverse=True they are not. Below are the first 4 values that my print statement puts out:
999016759.26
9989694.0
99841428.0
998313048.0
Here is my code:
import csv
import operator
from itertools import islice

def display():
    theFile = open("MyData.csv", "r")
    mycsv = csv.reader(theFile)
    sort = sorted(mycsv, key=operator.itemgetter(3), reverse=True)
    for row in islice(sort, 15):
        print(row)
Appreciate any help!
OK, I got this solved. A couple of things:
The data in the column, while only containing numerical values, was in string format. To overcome this I did the following in the function that was generating the csv file.
concatDf["ColumnName"] = concatDf["ColumnName"].astype(float)
This converted all of the strings to floats. Then in my display function I changed the sort line of code to the following:
sort = sorted(reader, key=lambda x: int(float(x[3])), reverse=True)
Then I got a different error, which I realized came from trying to convert the header from string to float, which is impossible. To overcome that I added the following line:
next(theFile, None)
This is what the function looks like now:
def display():
    theFile = open("MyData.csv", "r")
    next(theFile, None)
    reader = csv.reader(theFile, delimiter=",", quotechar='"')
    sort = sorted(reader, key=lambda x: int(float(x[3])), reverse=True)
    for row in islice(sort, 15):
        print(row)
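As a side note, if you only ever need the top 15 rows, heapq.nlargest can pick them in a single pass without sorting all ~7,000 rows; a sketch under the same assumption that column index 3 holds numeric strings:
import csv
import heapq

def display_top(path="MyData.csv", n=15):
    with open(path, "r", newline="") as the_file:
        reader = csv.reader(the_file)
        next(reader, None)  # skip the header row
        # keeps only the n largest rows while streaming through the file
        top_rows = heapq.nlargest(n, reader, key=lambda x: float(x[3]))
    for row in top_rows:
        print(row)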

I read a line on a csv file and want to know the item number of a word

The header line in my csv file is:
Number,Name,Type,Manufacturer,Material,Process,Thickness (mil),Weight (oz),Dk,Orientation,Pullback distance (mil),Description
I can open it and read the line, with no problems:
infile = open('CS_Data/_AD_LayersTest.csv','r')
csv_reader = csv.reader(infile, delimiter=',')
for row in csv_reader:
But I want to find out what the item number is for "Dk".
The problem is that the items can be in any order, as decided by the user in a different application, and there can be up to 25 items in the line.
How do I quickly determine which item is "Dk" so I can write Dk = (row[i]) for it and extract it from all the data after the header?
I have tried the code below on each of the potential 25 items and it works, but it seems like a waste of time, energy and my OCD.
while True:
    try:
        if (row[0]) == "Dk":
            DkColumn = 0
            break
        elif (row[1]) == "Dk":
            DkColumn = 1
            break
        ...
        elif (row[24]) == "Dk":
            DkColumn = 24
            break
        else:
            f.write('Stackup needs a "Dk" column.')
            break
    except:
        print("Exception occurred")
        break
Can't you get the index of the column (using list.index()) that has the value Dk in it? Something like:
import csv

infile = open('CS_Data/_AD_LayersTest.csv', 'r')
csv_reader = csv.reader(infile, delimiter=',')
# Store the header
headers = next(csv_reader, None)
# Get the index of the 'Dk' column
dkColumnIndex = headers.index('Dk')
for row in csv_reader:
    # Access values that belong to the 'Dk' column
    rowDkValue = row[dkColumnIndex]
    print(rowDkValue)
In the code above, we store the first line of the CSV as a list in headers. We then search the list to find the index of the item with the value 'Dk'. That is the column index.
Once we have that column index, we can use it on each row to access that position, which corresponds to the column that Dk is the header of.
Use the pandas library to keep the column order and get access to each column by name:
row["column_name"]
import pandas as pd
dataframe = pd.read_csv(
    'CS_Data/_AD_LayersTest.csv',
    usecols=["Number", "Name", "Type", "Dk"])  # list whichever columns you need
for index, row in dataframe.iterrows():
    # do something, e.g. with row["Dk"]
If I understand your question correctly, and you're not interested in using pandas (as suggested by Mikey - you should really consider his suggestion, however), you should be able to do something like the following:
import csv

with open('CS_Data/_AD_LayersTest.csv', 'r') as infile:
    csv_reader = csv.reader(infile, delimiter=',')
    header = next(csv_reader)
    col_map = {col_name: idx for idx, col_name in enumerate(header)}
    for row in csv_reader:
        row_dk = row[col_map['Dk']]
One solution would be to use pandas.
import pandas as pd
df=pd.read_csv('CS_Data/_AD_LayersTest.csv')
Now you can access 'Dk' easily as long as the file is read correctly.
dk=df['Dk']
and you can access individual values of dk like this:
for i in range(0, 10):
    temp_var = df.loc[i, 'Dk']
or however you want to access those indexes.
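For completeness, if you need every value in the column rather than specific positions, you can iterate the Series directly or convert it to a list; a small sketch using the same df:
# iterate over all values in the 'Dk' column
for value in df['Dk']:
    print(value)
# or collect them into a plain Python list
dk_values = df['Dk'].tolist()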
