Sorting csv data by a column of numerical values - python-3.x

I have a csv file that has 10 columns and about 7,000 rows. I have to sort the data based on the 4th column, i.e. column index 3 with 0-based counting. The column has a header and the data in this column is numeric. The largest value in this column is 7548375288960, so that row should be at the top of my results.
My code is below. Something interesting is happening. If I change the reverse=True to reverse=False then the 15 rows printed to the screen are the correct ones based on me manually sorting the csv file in Excel. But when I set reverse=True they are not the correct ones. Below are the first 4 that my print statement puts out:
999016759.26
9989694.0
99841428.0
998313048.0
Here is my code:
import csv
import operator
from itertools import islice

def display():
    theFile = open("MyData.csv", "r")
    mycsv = csv.reader(theFile)
    sort = sorted(mycsv, key=operator.itemgetter(3), reverse=True)
    for row in islice(sort, 15):
        print(row)
Appreciate any help!

OK, I got this solved. A couple of things:
The data in the column, while only containing numerical values, was in string format. To overcome this I did the following in the function that was generating the csv file.
concatDf["ColumnName"] = concatDf["ColumnName"].astype(float)
This converted all of the strings to floats. Then in my display function I changed the sort line of code to the following:
sort = sorted(reader, key=lambda x: int(float(x[3])), reverse=True)
Then I got a different error, which I realized came from trying to convert the header from string to float, which is impossible. To overcome that I added the following line:
next(theFile, None)
This is what the function looks like now:
import csv
from itertools import islice

def display():
    theFile = open("MyData.csv", "r")
    next(theFile, None)  # skip the header line
    reader = csv.reader(theFile, delimiter=",", quotechar='"')
    sort = sorted(reader, key=lambda x: int(float(x[3])), reverse=True)
    for row in islice(sort, 15):
        print(row)
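The root cause, for anyone hitting the same thing: csv.reader yields every field as a string, so sorted() compares the fields lexicographically. A minimal demonstration using the values from the question:

```python
vals = ["999016759.26", "9989694.0", "99841428.0"]

# Lexicographic (string) comparison: "998..." vs "999..." is decided by the
# third character, regardless of the numbers' magnitudes
print(sorted(vals, reverse=True))
# → ['999016759.26', '9989694.0', '99841428.0']

# A numeric key function orders by actual value
print(sorted(vals, key=float, reverse=True))
# → ['999016759.26', '99841428.0', '9989694.0']
```

This is exactly the misordering shown in the question's output.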

Related

Extract numbers only from specific lines within a txt file with certain keywords in Python

I want to extract numbers only from lines in a txt file that have a certain keyword, add them up, then compare the totals, and print the highest and the lowest total number. How should I go about this?
I want to print the highest and the lowest valid total numbers
I managed to extract the lines with the "valid" keyword in them, but now I want to get the numbers from these lines, add up the numbers on each line, compare those totals across lines with the same keyword, and print the highest and the lowest valid numbers.
my code so far
# get a file object reference to the file
file = open("shelfs.txt", "r")
# read the content of the file into a string
data = file.read()
# close the file
file.close()
# get the number of occurrences of each substring in the string
totalshelfs = data.count("SHELF")
totalvalid = data.count("VALID")
totalinvalid = data.count("INVALID")
print('Number of total shelfs :', totalshelfs)
print('Number of valid books :', totalvalid)
print('Number of invalid books :', totalinvalid)
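One caveat worth flagging in the count-based code above (not raised in the original post): str.count matches substrings, so every "INVALID" also contains "VALID" and inflates the valid count. A quick check:

```python
data = "BOOK|...|VALID|\nBOOK|...|INVALID|\n"

# "VALID" occurs once on its own and once inside "INVALID"
print(data.count("VALID"))    # → 2
print(data.count("INVALID"))  # → 1
```

Counting the exact field (e.g. "|VALID|") avoids the overlap.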
txt file
HEADER|
SHELF|2200019605568|
BOOK|20200120000000|4810.1|20210402|VALID|
SHELF|1591024987400|
BOOK|20200215000000|29310.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200229000000|11519.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200329001234|115.0|20210331|INVALID|
SHELF|1300001188124|
BOOK|2020032904567|1144.0|20210401|INVALID|
FOOTER|
What you need is to use the pandas library.
https://pandas.pydata.org/
You can read a csv file like this:
import pandas as pd

data = pd.read_csv('shelfs.txt', sep='|')
it returns a DataFrame object that makes it easy to select or sort your data. It will use the first row as the header; then you can select a specific column like a dictionary:
header = data['HEADER']
header is a Series object.
To select columns you can do:
shelfs = data.loc[data['HEADER'] == 'SHELF', :]
to select only the rows where the first column is 'SHELF'.
I'm just not sure how pandas will handle the fact that you only have 1 header but 2 or 5 columns.
Maybe you should try to create one header per column in your csv, and add separators to make each row the same size first.
Edit (No External libraries or change in the txt file):
# Split into rows
data = data.split('\n')
# Split each row into columns
data = [d.split('|') for d in data]
# Pad short rows with empty cells so every row has the same length
n_cols = max([len(d) for d in data])
for i in range(len(data)):
    while len(data[i]) < n_cols:
        data[i].append('')
# List the VALID rows
valid_rows = [d for d in data if d[4] == 'VALID']
# Sum each row's numeric fields, then get the min and max
valid_sum = [float(d[1]) + float(d[2]) + float(d[3]) for d in valid_rows]
valid_minimum = min(valid_sum)
valid_maximum = max(valid_sum)
It's maybe not exactly what you want to do but it solves a part of your problem. I didn't test the code.
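Since the answer above is untested, here is a tested variant of the same idea, run against an in-memory copy of the sample data instead of the file (the totals are simply the sums of each VALID BOOK row's numeric fields):

```python
sample = """HEADER|
SHELF|2200019605568|
BOOK|20200120000000|4810.1|20210402|VALID|
SHELF|1591024987400|
BOOK|20200215000000|29310.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200229000000|11519.0|20210401|VALID|
FOOTER|"""

# Split into rows, then into pipe-separated cells
rows = [line.split("|") for line in sample.splitlines()]
# BOOK rows carry their VALID/INVALID flag in cell 4; the short-circuit
# on r[0] keeps us from indexing past the end of SHELF/HEADER rows
valid_rows = [r for r in rows if r[0] == "BOOK" and r[4] == "VALID"]
# Sum each row's numeric fields, then compare the totals
valid_sums = [float(r[1]) + float(r[2]) + float(r[3]) for r in valid_rows]
print("highest:", max(valid_sums))
print("lowest:", min(valid_sums))
```

Whether "add the numbers up" means all three numeric fields or just the price-like one is the asker's call; the structure is the same either way.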

extract position of char from each row & provide an aggregate of position across a list

I need some python help with this problem. Would appreciate any assistance! Thanks.
I need an extracted matrix of values enclosed between square brackets. A toy example is below:
File Input will be in a txt file as below:
AB_1 Q[A]IHY[P]GVA
AB_2 Q[G][C]HY[R]GVA
AB_3 Q[G][C]HY[R]GV[D]
Answer out.txt: the script extracts the index of each char enclosed in square brackets "[]" for every row of the input, makes an aggregate of the index positions for the entire list, then uses the aggregated index to extract all of those positions from the input file and produce a matrix as below.
Index 2,3,6,9
AB_1 [A],I,[P],A
AB_2 [G],[C],[R],A
AB_3 [G],[C],[R],[D]
Any help would be greatly appreciated! Thanks.
If you want to reduce your table to only those columns in which an entry with square-brackets appears, you can go with this:
import re

def transpose(matrix):
    return [[matrix[j][i] for j in range(len(matrix))] for i in range(len(matrix[0]))]

with open("test_table.txt", "r") as f:
    content = f.read()

rows = [re.findall(r'(\[.\]|.)', line.split()[1]) for line in content.split("\n")]
columns = transpose(rows)
matching_columns = [[str(i + 1)] + column for i, column in enumerate(columns) if "[" in "".join(column)]
matching_rows = transpose(matching_columns)
headline = ["Index {}".format(",".join(matching_rows[0]))]
target_table = headline + ["AB_{0} {1}".format((i + 1), ",".join(line)) for i, line in enumerate(matching_rows[1:])]

with open("out.txt", "w") as f:
    f.write("\n".join(target_table))
First of all you want the content of your .txt file to be represented in arrays. Unfortunately your input data has no separators yet (as in .csv files), so you need to take care of that. To get a string like "Q[A]IHY[P]GVA" sorted out, I would recommend working with regular expressions.
import re
cells = re.findall(r'(\[.\]|.)', "Q[A]IHY[P]GVA")
The pattern within the r'' string matches a single character within square brackets or just a single character. The re.findall() method returns a list of all matching substrings, in this case: ['Q', '[A]', 'I', 'H', 'Y', '[P]', 'G', 'V', 'A']
rows = [re.findall(r'(\[.\]|.)', line.split()[1]) for line in content.split("\n")] applies this method to every line in your file. The line.split()[1] leaves out the row label "AB_X ", as it is not useful.
Having your data sorted in columns is a better fit, because you want to preserve all columns that match a certain condition (contain an entry in brackets). For this you can just transpose rows; this is done by the transpose() function. If you have imported numpy, numpy.transpose(rows) would be the better option, I guess.
Next you want to get all columns that satisfy your condition "[" in "".join(column). All done in one line by: matching_columns = [[str(i + 1)] + column for i, column in enumerate(columns) if "[" in "".join(column)] Here [str(i + 1)] adds the column index that you want to use later.
The rest now is easy: Transpose the columns back to rows, relabel the rows, format the row data into strings that fit your desired output format and then write those strings to the out.txt file.
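To see the whole pipeline end to end without files, here is the same logic run on the toy input as a string (with transpose written via zip for brevity; everything else mirrors the answer above):

```python
import re

content = """AB_1 Q[A]IHY[P]GVA
AB_2 Q[G][C]HY[R]GVA
AB_3 Q[G][C]HY[R]GV[D]"""

def transpose(matrix):
    # Rows become columns; all rows have equal length here, so zip is safe
    return [list(col) for col in zip(*matrix)]

# One cell per "[X]" group or bare character, row label stripped
rows = [re.findall(r'\[.\]|.', line.split()[1]) for line in content.split("\n")]
columns = transpose(rows)
# Keep only columns containing a bracketed entry, prefixed with a 1-based index
matching = [[str(i + 1)] + col for i, col in enumerate(columns) if "[" in "".join(col)]
out_rows = transpose(matching)

print("Index " + ",".join(out_rows[0]))
for i, line in enumerate(out_rows[1:]):
    print("AB_{} {}".format(i + 1, ",".join(line)))
```

This prints the Index 2,3,6,9 header and the three AB_ rows exactly as in the desired out.txt shown in the question.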

From CSV list to XLSX. Numbers recognise as text not as numbers

I am working with CSV datafile.
From this file I took some specific data. The data end up in a list that contains strings of words but also numbers (saved as strings, sigh!).
As this:
data_of_interest = ["string1", "string2", "242", "765", "string3", ...]
I create a new XLSX file (it should have this format) into which this data is pasted.
The script does the work, but in the new XLSX file the numbers (float and int) are pasted in as text.
I could manually convert their format in Excel but it would be time consuming.
Is there a way to do it automatically when writing the new XLSX file?
Here the extract of code I used:
## import the library and create the excel file and the working sheet
import xlsxwriter

workbook = xlsxwriter.Workbook("newfile.xlsx")
sheet = workbook.add_worksheet('Sheet 1')

## take the data from the list (data_of_interest) from the csv file
## and paste them inside the excel file, in rows and columns
column = 0
row = 0
for value in data_of_interest:
    if type(value) is float:
        sheet.write_number(row, column, value)
    elif type(value) is int:
        sheet.write_number(row, column, value)
    else:
        sheet.write(row, column, value)
    column += 1
row += 1
column = 0
workbook.close()
Is the problem related to the fact that the numbers are already str type in the original list, so the code cannot recognise that they are float or int (and so doesn't write them as numbers)?
Thank you for your help!
Try int(value) or float(value) before the if block.
All the data you read are strings; you have to try to convert them to float or int type first.
Example:
for value in data_of_interest:
    try:
        # Treat commas as decimal separators when trying to parse a number
        value = float(value.replace(',', '.'))
    except ValueError:
        pass
    if type(value) is float:
        sheet.write_number(row, column, value)
    else:
        sheet.write(row, column, value)
    column += 1
row += 1
column = 0
workbook.close()
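The conversion step can be factored out and tested on its own, independent of xlsxwriter. A sketch with a hypothetical helper name (coerce), run on the sample list from the question:

```python
def coerce(value):
    """Return a float if the string parses as a number, else the original string."""
    try:
        # Comma treated as a decimal separator, as in the answer above
        return float(value.replace(',', '.'))
    except ValueError:
        return value

data_of_interest = ["string1", "string2", "242", "765", "string3"]
print([coerce(v) for v in data_of_interest])
# → ['string1', 'string2', 242.0, 765.0, 'string3']
```

Writing coerce(value) with sheet.write() then lets xlsxwriter store real numbers for the numeric cells.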
The best way to do this with XlsxWriter is to use the strings_to_numbers constructor option:
import xlsxwriter

workbook = xlsxwriter.Workbook("newfile.xlsx", {'strings_to_numbers': True})
sheet = workbook.add_worksheet('Sheet 1')

data_of_interest = ["string1", "string2", "242", "765", "string3"]

column = 0
row = 0
for value in data_of_interest:
    sheet.write(row, column, value)
    column += 1
workbook.close()
Output: note that there aren't any warnings about numbers stored as strings.

I read a line on a csv file and want to know the item number of a word

The header line in my csv file is:
Number,Name,Type,Manufacturer,Material,Process,Thickness (mil),Weight (oz),Dk,Orientation,Pullback distance (mil),Description
I can open it and read the line, with no problems:
infile = open('CS_Data/_AD_LayersTest.csv', 'r')
csv_reader = csv.reader(infile, delimiter=',')
for row in csv_reader:
    ...
But I want to find out what the item number is for the "Dk".
The problem is that the items can be in any order, as decided by the user in a different application, and there can also be up to 25 items in the line.
How do I quickly determine which item is "Dk", so I can write Dk = row[i] for it and extract it from all the data after the header?
I have tried this below on each of the potential 25 items and it works, but it seems like a waste of time, energy and my ocd.
while True:
    try:
        if row[0] == "Dk":
            DkColumn = 0
            break
        elif row[1] == "Dk":
            DkColumn = 1
            break
        ...
        elif row[24] == "Dk":
            DkColumn = 24
            break
        else:
            f.write('Stackup needs a "Dk" column.')
            break
    except:
        print("Exception occurred")
        break
Can't you get the index of the column (using list.index()) that has the value Dk in it? Something like:
infile = open('CS_Data/_AD_LayersTest.csv', 'r')
csv_reader = csv.reader(infile, delimiter=',')

# Store the header
headers = next(csv_reader, None)

# Get the index of the 'Dk' column
dkColumnIndex = headers.index('Dk')

for row in csv_reader:
    # Access the value that belongs to the 'Dk' column
    rowDkValue = row[dkColumnIndex]
    print(rowDkValue)
In the code above, we store the first line of the CSV as a list in headers. We then search the list to find the index of the item with the value 'Dk'. That is the column index.
Once we have that column index, we can use it on each row to access the position that corresponds to the column Dk is the header of.
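The same approach with an in-memory CSV, so it runs without the file from the question (the three columns shown are a stand-in for the real 25):

```python
import csv
import io

# Stand-in for CS_Data/_AD_LayersTest.csv
infile = io.StringIO("Number,Name,Dk\n1,core,4.2\n2,prepreg,3.9\n")
csv_reader = csv.reader(infile, delimiter=',')

headers = next(csv_reader)
dkColumnIndex = headers.index('Dk')  # works wherever the user put the column

for row in csv_reader:
    print(row[dkColumnIndex])
```

If the 'Dk' column is missing, list.index raises ValueError, which you can catch to emit the 'Stackup needs a "Dk" column.' message from the question.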
Use the pandas library to keep the column order and get access to each column by name:
row["column_name"]
import pandas as pd

dataframe = pd.read_csv(
    "CS_Data/_AD_LayersTest.csv",
    usecols=["Number", "Name", "Type", ...])
for index, row in dataframe.iterrows():
    # do something
If I understand your question correctly, and you're not interested in using pandas (as suggested by Mikey - you should really consider his suggestion, however), you should be able to do something like the following:
import csv

with open('CS_Data/_AD_LayersTest.csv', 'r') as infile:
    csv_reader = csv.reader(infile, delimiter=',')
    header = next(csv_reader)
    col_map = {col_name: idx for idx, col_name in enumerate(header)}
    for row in csv_reader:
        row_dk = row[col_map['Dk']]
One solution would be to use pandas.
import pandas as pd
df = pd.read_csv('CS_Data/_AD_LayersTest.csv')
Now you can access 'Dk' easily as long as the file is read correctly.
dk = df['Dk']
and you can access individual values of dk like
for i in range(0, 10):
    temp_var = df.loc[i, 'Dk']
or however you want to access those indexes.

Python: How to sum a column in a CSV file while skipping the header row

Trying to sum a column in a csv file that has a header row at the top. I'm trying to use this for loop but it just returns zero. Any thoughts?
CSVFile = open('Data103.csv')
CSVReader = csv.reader(CSVFile)  # you don't pass a file name directly to csv.reader
CSVDataList = list(CSVReader)  # stores the csv file as a list of lists
print(CSVDataList[0][16])

total = 0
for row in CSVReader:
    if CSVReader.line_num == 1:
        continue
    total += int(row[16])
print(total)
Here is what the data sample looks like in txt:
Value,Value,Value, "15,500.05", 00.00, 00.00
So the items are delimited by , except where a value needs escaping, in which case it is quoted with "". It's a pretty standard file with a header row and about 1k lines of data across 18 columns.
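For what it's worth, the snippet has two problems beyond formatting: list(CSVReader) consumes the reader, so the later for loop sees no rows at all (hence total stays 0), and quoted values like "15,500.05" won't survive int(). A sketch that iterates the stored list instead, against in-memory data shaped like the sample:

```python
import csv
import io

# In-memory stand-in for Data103.csv, mirroring the quoted-number format
CSVFile = io.StringIO('a,b,c\n1,2,"15,500.05"\n3,4,"1,000.00"\n')
CSVDataList = list(csv.reader(CSVFile))

total = 0
for row in CSVDataList[1:]:                  # skip the header row
    total += float(row[2].replace(',', ''))  # strip thousands separators
print(total)
```

The column index (2 here) would be 16 against the real file.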
You might want to use Pandas.
import pandas as pd
df = pd.read_csv('/path/to/file.csv')
column_sum = df['column_name'].sum()
It seems that you've over-indented the line that does the sum. It should be like this:
for row in CSVReader:
    if CSVReader.line_num == 1:
        continue
    total += int(row[16])
Otherwise you'll only sum the values for the first row, which is exactly the one you want to skip.
EDIT:
Since you said the previous change doesn't work, I'd suggest working with the excellent Python lib called rows.
With the following CSV (fruits.csv):
id,name,amount
1,apple,3
2,banana,6
3,pineapple,2
4,lemon,5
You can access columns directly by their name instead of their index:
import rows

data = rows.import_from_csv('fruits.csv')
for fruit_data in data:
    print(fruit_data.name, fruit_data.amount)

# output:
# apple 3
# banana 6
# pineapple 2
# lemon 5
NEW EDIT:
After you've provided the data, I believe in your case you could do something like:
import rows

data = rows.import_from_csv('Data103.csv')
print(data.field_names[16])  # prints the field name

total = 0
for row in data:
    value = row.<column_name>
    value = value.replace(',', '')  # remove the thousands separators
    total += float(value)
print(total)
