Extract numbers only from specific lines within a txt file with certain keywords in Python - python-3.x

I want to extract numbers only from lines in a txt file that contain a certain keyword, add them up per line, then compare the totals and print the highest and the lowest. How should I go about this?
I want to print the highest and the lowest valid totals.
I managed to extract the lines with the "VALID" keyword in them, but now I want to get the numbers from these lines, add up the numbers on each line, compare those sums across the lines that share the keyword, and print the highest and the lowest valid totals.
My code so far:
# get a file object for the file
file = open("shelfs.txt", "r")
# read the content of the file into a string
data = file.read()
# close the file
file.close()
# count the occurrences of each substring in the string
totalshelfs = data.count("SHELF")
totalvalid = data.count("|VALID|")  # "|VALID|" so that "INVALID" is not counted as well
totalinvalid = data.count("INVALID")
print('Number of total shelves :', totalshelfs)
print('Number of valid books :', totalvalid)
print('Number of invalid books :', totalinvalid)
The txt file:
HEADER|
SHELF|2200019605568|
BOOK|20200120000000|4810.1|20210402|VALID|
SHELF|1591024987400|
BOOK|20200215000000|29310.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200229000000|11519.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200329001234|115.0|20210331|INVALID|
SHELF|1300001188124|
BOOK|2020032904567|1144.0|20210401|INVALID|
FOOTER|

What you need is the pandas library.
https://pandas.pydata.org/
You can read a csv file like this:
data = pd.read_csv('shelfs.txt', sep='|')
It returns a DataFrame object that makes it easy to select or sort your data. It will use the first row as the header; then you can select a specific column like a dictionary:
header = data['HEADER']
header is a Series object.
To select rows you can do:
shelfs = data.loc[data['HEADER'] == 'SHELF', :]
to select only the rows where the first column is 'SHELF'.
I'm just not sure how pandas will handle the fact that you only have one header but rows of 2 or 5 columns.
Maybe you should first create one header per column in your csv and add separators to make every row the same size.
Edit (no external libraries and no changes to the txt file):
# Split into rows
data = data.splitlines()
# Split each row into columns
data = [d.split('|') for d in data]
# Pad short rows with empty cells
n_cols = max(len(d) for d in data)
for i in range(len(data)):
    while len(data[i]) < n_cols:
        data[i].append('')
# Keep only the VALID rows
valid_rows = [d for d in data if d[4] == 'VALID']
# Sum the numeric fields of each row, then take the min and max
valid_sum = [float(d[1]) + float(d[2]) + float(d[3]) for d in valid_rows]
valid_minimum = min(valid_sum)
valid_maximum = max(valid_sum)
It's maybe not exactly what you want to do, but it solves part of your problem. I didn't test the code.
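For illustration, here is a self-contained variant of the same idea that can be run as-is. It embeds a few rows of the sample file and, as above, sums the three numeric fields of each VALID BOOK row before taking the highest and lowest total:

```python
# A minimal sketch: parse a few sample rows, sum the numeric fields of
# each VALID BOOK row, then report the highest and lowest totals.
# Field positions follow the sample file layout shown in the question.
sample = """HEADER|
SHELF|2200019605568|
BOOK|20200120000000|4810.1|20210402|VALID|
SHELF|1591024987400|
BOOK|20200215000000|29310.0|20210401|VALID|
SHELF|1300001188124|
BOOK|20200229000000|11519.0|20210401|VALID|
FOOTER|"""

totals = []
for line in sample.splitlines():
    fields = line.rstrip().split('|')
    if len(fields) > 4 and fields[0] == 'BOOK' and fields[4] == 'VALID':
        # fields 1-3 are the numeric columns of a BOOK row
        totals.append(float(fields[1]) + float(fields[2]) + float(fields[3]))

print(max(totals), min(totals))
```

The length check guards against the short SHELF/HEADER/FOOTER rows, so no padding step is needed here.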

Related

Sorting csv data by a column of numerical values

I have a csv file that has 10 columns and about 7,000 rows. I have to sort the data based on the 4th column (0, 1, 2, 3), which I know is column #3 with 0 based counting. The column has a header and the data in this column is numeric values. The largest value in this column is: 7548375288960, so that row should be at the top of my results.
My code is below. Something interesting is happening: if I change reverse=True to reverse=False, then the 15 rows printed to the screen are the correct ones, based on my manually sorting the csv file in Excel. But when I set reverse=True they are not the correct ones. Below are the first 4 that my print statement puts out:
999016759.26
9989694.0
99841428.0
998313048.0
Here is my code:
import csv
import operator
from itertools import islice

def display():
    theFile = open("MyData.csv", "r")
    mycsv = csv.reader(theFile)
    sort = sorted(mycsv, key=operator.itemgetter(3), reverse=True)
    for row in islice(sort, 15):
        print(row)
Appreciate any help!
OK, I got this solved. A couple of things:
The data in the column, while only containing numerical values, was in string format. To overcome this I did the following in the function that was generating the csv file.
concatDf["ColumnName"] = concatDf["ColumnName"].astype(float)
This converted all of the strings to floats. Then in my display function I changed the sort line of code to the following:
sort = sorted(reader, key=lambda x: int(float(x[3])), reverse=True)
Then I got a different error, which I realized came from trying to convert the header from string to float, which is impossible. To overcome that I added the following line:
next(theFile, None)
This is what the function looks like now:
def display():
    theFile = open("MyData.csv", "r")
    next(theFile, None)
    reader = csv.reader(theFile, delimiter=",", quotechar='"')
    sort = sorted(reader, key=lambda x: int(float(x[3])), reverse=True)
    for row in islice(sort, 15):
        print(row)
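The symptom in the question (999016759.26 sorting above the much larger 7548375288960) is the classic lexicographic-sort trap: csv.reader yields strings, and string comparison goes character by character, so any "9…" string beats any "7…" string regardless of magnitude. A quick illustration using the values from the question:

```python
# Strings from the question: one is three orders of magnitude larger
# than the others, but starts with a smaller digit.
values = ["999016759.26", "9989694.0", "99841428.0", "7548375288960"]

# Lexicographic sort compares character by character.
print(sorted(values, reverse=True))

# Numeric sort needs an explicit conversion in the key.
print(sorted(values, key=float, reverse=True))
```

The first sort reproduces the "wrong" ordering the questioner saw; the second puts 7548375288960 on top, which is why the accepted fix adds `float(...)` to the sort key.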

Stop reading the CSV file after finding empty rows python

I am trying to read a CSV file that has four parts on the same page, distinguished by a few empty rows in the middle of the spreadsheet. I want to somehow ask pandas to stop reading the rest of the file as soon as it finds an empty row.
Edit: I need to elaborate on the problem. I have a CSV file with 4 different sections separated by 3-4 empty rows. I need to extract each of these sections, or at least the first one. In other words, I want read_csv to stop when it finds the first empty row (after skipping the rows with details about the file, of course).
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile
import pandas as pd

url = urlopen("https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/30_Industry_Portfolios_CSV.zip")
zipfile = ZipFile(BytesIO(url.read()))
data = pd.read_csv(zipfile.open('30_Industry_Portfolios.CSV'),
                   header=0, index_col=0,
                   skiprows=11, parse_dates=True)
You could use a generator.
Suppose the csv module is generating rows.
(We might use yield from sheet,
except that we'll change the loop in a moment.)
import csv

def get_rows(csv_fspec, skip_rows=12):
    with open(csv_fspec) as fin:
        sheet = csv.reader(fin)
        for _ in range(skip_rows):
            next(sheet)  # discard initial rows
        for row in sheet:
            yield row

df = pd.DataFrame(get_rows(my_csv))
Now you want to ignore rows after encountering some condition, perhaps when the initial column is empty.
OK, that's simple enough; just change the loop body:
for row in sheet:
    if row[0]:
        yield row
    else:
        break  # Ignore rest of input file.
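The pattern above can be exercised end-to-end with an in-memory file. This sketch (using a made-up two-section toy CSV, and reading from a StringIO instead of a path) shows the generator stopping at the first blank row:

```python
import csv
import io

def rows_until_blank(fobj, skip_rows=0):
    """Yield csv rows until the first row whose leading cell is empty."""
    sheet = csv.reader(fobj)
    for _ in range(skip_rows):
        next(sheet)          # discard header/detail rows
    for row in sheet:
        if row and row[0]:
            yield row
        else:
            break            # stop at the first blank row

# A toy file: one data section, a blank separator row, then a second section.
text = "a,1\nb,2\n\nc,3\n"
section = list(rows_until_blank(io.StringIO(text)))
print(section)
```

Only the rows before the blank line come back; the `c,3` row after the separator is never read.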

From CSV list to XLSX: numbers recognised as text, not as numbers

I am working with a CSV data file.
From this file I took some specific data. These data go into a list that contains strings of words but also numbers (saved as strings, sigh!).
Like this:
data_of_interest = ["string1", "string2", "242", "765", "string3", ...]
I create a new XLSX file (it should have this format) into which these data are pasted.
The script does the work, but in the new XLSX file the numbers (float and int) are pasted in as text.
I could manually convert their format in Excel, but it would be time consuming.
Is there a way to do it automatically when writing the new XLSX file?
Here is the extract of code I used:
## import the library and create the excel file and the working sheet
import xlsxwriter
workbook = xlsxwriter.Workbook("newfile.xlsx")
sheet = workbook.add_worksheet('Sheet 1')
## take the data from the list (data_of_interest) built from the csv file
## and paste them into the excel file, in rows and columns
column = 0
row = 0
for value in data_of_interest:
    if type(value) is float:
        sheet.write_number(row, column, value)
    elif type(value) is int:
        sheet.write_number(row, column, value)
    else:
        sheet.write(row, column, value)
    column += 1
row += 1
column = 0
workbook.close()
Is the problem related with the fact that the numbers are already str type in the original list, so the code cannot recognise that they are float or int (and so it doesn't write them as numbers)?
Thank you for your help!
Try int(value) or float(value) before the if block.
All the data you read are strings; you have to try to convert them into float or int first.
Example:
for value in data_of_interest:
    try:
        # Note: this changes commas to dots even in strings that are not numbers
        value = value.replace(',', '.')
        value = float(value)
    except ValueError:
        pass
    if type(value) is float:
        sheet.write_number(row, column, value)
    else:
        sheet.write(row, column, value)
    column += 1
row += 1
column = 0
workbook.close()
The best way to do this with XlsxWriter is to use the strings_to_numbers constructor option:
import xlsxwriter

workbook = xlsxwriter.Workbook("newfile.xlsx", {'strings_to_numbers': True})
sheet = workbook.add_worksheet('Sheet 1')
data_of_interest = ["string1", "string2", "242", "765", "string3"]
column = 0
row = 0
for value in data_of_interest:
    sheet.write(row, column, value)
    column += 1
workbook.close()
Output: 242 and 765 are stored as numbers, and there aren't any warnings about numbers stored as strings.

Python: How to sum a column in a CSV file while skipping the header row

Trying to sum a column in a csv file that has a header row at the top. I'm trying to use this for loop, but it just returns zero. Any thoughts?
CSVFile = open('Data103.csv')
CSVReader = csv.reader(CSVFile)  # you don't pass a file name directly to csv.reader
CSVDataList = list(CSVReader)    # stores the csv file as a list of lists
print(CSVDataList[0][16])
total = 0
for row in CSVReader:
    if CSVReader.line_num == 1:
        continue
    total += int(row[16])
print(total)
Here is what the data sample looks like in txt:
Value,Value,Value, "15,500.05", 00.00, 00.00
So the items are delimited by `,` except where a field needs escaping, in which case it is wrapped in `""`. It's a pretty standard file with a header row and about 1k lines of data across 18 columns.
You might want to use Pandas.
import pandas as pd
df = pd.read_csv('/path/to/file.csv')
column_sum = df['column_name'].sum()
It seems that you've over-indented the line that does the sum. It should be like this:
for row in CSVReader:
    if CSVReader.line_num == 1:
        continue
    total += int(row[16])
Otherwise you'll only sum the values for the first row, which is exactly the one you want to skip.
EDIT:
Since you said the previous change doesn't work, I'd suggest working with the excellent Python lib called rows.
With the following CSV (fruits.csv):
id,name,amount
1,apple,3
2,banana,6
3,pineapple,2
4,lemon,5
You can access columns directly by their name instead of their index:
import rows

data = rows.import_from_csv('fruits.csv')
for fruit_data in data:
    print(fruit_data.name, fruit_data.amount)

# output:
# apple 3
# banana 6
# pineapple 2
# lemon 5
NEW EDIT:
After you've provided the data, I believe in your case you could do something like:
import rows

data = rows.import_from_csv('Data103.csv')
print(data.field_names[16])  # prints the field name
total = 0
for row in data:
    value = row.<column_name>
    value = value.replace(',', '')  # remove the commas
    total += float(value)
print(total)
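The quoted field in the question's sample row is worth a second look: with the csv module's default dialect, the space before the opening quote keeps the quote from being recognised, and the embedded thousands separator still has to be stripped before float() can parse it. A small runnable sketch (the header names h1…h6 and the column index 3 are made up for the toy row; the real file uses column 16):

```python
import csv
import io

# A toy file shaped like the sample row from the question, including the
# spaces after the delimiters and the quoted "15,500.05" field.
text = 'h1,h2,h3,h4,h5,h6\nValue,Value,Value, "15,500.05", 00.00, 00.00\n'

# skipinitialspace=True lets csv honour the quote despite the leading space.
reader = csv.reader(io.StringIO(text), skipinitialspace=True)
next(reader)  # skip the header row
total = 0.0
for row in reader:
    # strip the thousands separator before converting to float
    total += float(row[3].replace(',', ''))
print(total)
```

Without skipinitialspace the quoted field would come back split in two (` "15` and `500.05"`), which is another way such a sum silently goes wrong.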

How to pick a number that changes position in a list (Python 3)

I have a list with product orders and I need to pick the number from each order_id:TheNumberIwant entry. I have tried some things but none do the trick. I can't access the number by position because it can change location in the list, but it always comes right after order_id:. I have tried using the split method, but it only picks one of the order_ids and I need to pick all of them.
Here is what I'm doing.
I have this string:
{"data":[{"order_id":744152,"pedido_venda":"Z921211","supplier_id":11042,.....
with open("items.txt", "r") as file:
    data = file.readlines()
for line in data:
    word = line.split("order_id:")
    abre_arquivo1 = open("items2.txt", "w")
    abre_arquivo1.write("%s\n" % word)
    abre_arquivo1.close()
This removes the order_id, but I want to save the number that comes after it in the string to "items2.txt".
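Since the number always comes right after "order_id": in the string, one way is a regular expression that captures every such run of digits instead of splitting. A sketch against a made-up line shaped like the sample (the second order entry is invented to show that every id is captured):

```python
import re

# A toy line shaped like the one in the question; the real file holds
# one JSON-like string per line with several "order_id" occurrences.
line = ('{"data":[{"order_id":744152,"pedido_venda":"Z921211","supplier_id":11042},'
        '{"order_id":744153,"supplier_id":11043}]}')

# findall returns every number that follows "order_id":, not just the
# first, which is what a single str.split call made awkward.
order_ids = re.findall(r'"order_id":(\d+)', line)
print(order_ids)
```

The captured ids could then be written out one per line to "items2.txt". If the lines are actually complete JSON documents, json.loads and a walk over data["data"] would be the more robust choice.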
