Stop reading the CSV file after finding empty rows

I am trying to read a CSV file that has four parts on the same sheet, separated by some empty rows. I want pandas to stop reading the rest of the file as soon as it finds an empty row.
Edit: I need to elaborate on the problem. I have a CSV file with 4 different sections that are separated by 3-4 empty rows. I need to extract each of these sections, or at least the first one. In other words, I want read_csv to stop when it finds the first empty row (after skipping the initial rows of details about the file).
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

import pandas as pd

url = urlopen("https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/30_Industry_Portfolios_CSV.zip")
zipfile = ZipFile(BytesIO(url.read()))
data = pd.read_csv(zipfile.open('30_Industry_Portfolios.CSV'),
                   header=0, index_col=0,
                   skiprows=11, parse_dates=True)

You could use a generator.
Suppose the csv module is generating rows.
(We might use yield from sheet,
except that we'll change the loop in a moment.)
import csv

import pandas as pd

def get_rows(csv_fspec, skip_rows=12):
    with open(csv_fspec) as fin:
        sheet = csv.reader(fin)
        for _ in range(skip_rows):
            next(sheet)  # discard initial rows
        for row in sheet:
            yield row

df = pd.DataFrame(get_rows(my_csv))  # my_csv: path to your CSV file
Now you want to ignore rows after encountering some condition,
perhaps once the initial column is empty.
OK, that's simple enough; just change the loop body:
for row in sheet:
    if row[0]:
        yield row
    else:
        break  # ignore the rest of the input file
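Putting the pieces together, here is a minimal sketch (the filename and the skip count are assumptions carried over from the question's snippet; note that this particular file's header row has an empty first cell, so the header is yielded before the blank-row check starts):

import csv

import pandas as pd

def get_section(csv_fspec, skip_rows=11):
    with open(csv_fspec) as fin:
        sheet = csv.reader(fin)
        for _ in range(skip_rows):
            next(sheet)        # discard the descriptive preamble
        yield next(sheet)      # header row (its first cell may be empty)
        for row in sheet:
            if not row or not row[0]:
                break          # blank row: end of the first section
            yield row

rows = list(get_section('30_Industry_Portfolios.CSV'))
df = pd.DataFrame(rows[1:], columns=rows[0])

The "not row" test also covers lines that the csv module returns as completely empty lists.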

Related

Extract numbers only from specific lines within a txt file with certain keywords in Python

I want to extract numbers only from lines in a txt file that have a certain keyword, add them up, compare the totals, and then print the highest and the lowest total. How should I go about this?
I want to print the highest and the lowest valid totals.
I managed to extract the lines with the "valid" keyword in them, but now I want to get the numbers from these lines, add up the numbers on each line, compare those totals across the lines that share the keyword, and print the highest and the lowest valid totals.
My code so far:
# get a file object for the file
file = open("shelfs.txt", "r")
# read the content of the file into a string
data = file.read()
# close the file
file.close()
# count occurrences of each substring in the string;
# "|VALID|" is counted with its delimiters so that "INVALID" does not match too
totalshelfs = data.count("SHELF")
totalvalid = data.count("|VALID|")
totalinvalid = data.count("INVALID")
print('Number of total shelfs :', totalshelfs)
print('Number of valid books :', totalvalid)
print('Number of invalid books :', totalinvalid)
The txt file:
HEADER|<br>
SHELF|2200019605568|<br>
BOOK|20200120000000|4810.1|20210402|VALID|<br>
SHELF|1591024987400|<br>
BOOK|20200215000000|29310.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200229000000|11519.0|20210401|VALID|<br>
SHELF|1300001188124|<br>
BOOK|20200329001234|115.0|20210331|INVALID|<br>
SHELF|1300001188124|<br>
BOOK|2020032904567|1144.0|20210401|INVALID|<br>
FOOTER|
What you need is the pandas library.
https://pandas.pydata.org/
You can read a delimited file like this:
data = pd.read_csv('shelfs.txt', sep='|')
It returns a DataFrame object that makes it easy to select or sort your data. It will use the first row as the header; then you can select a specific column like a dictionary:
header = data['HEADER']
header is a Series object.
To select rows you can do:
shelfs = data.loc[data['HEADER'] == 'SHELF']
to keep only the rows whose first column is 'SHELF'.
I'm just not sure how pandas will handle the fact that you only have one header field but rows of two to six columns.
Maybe you should try to create one header per column in your csv, and add separators to make each row the same size first.
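For what it's worth, a sketch of one way to force a fixed width onto the ragged file (the column names 0-5 are arbitrary; when explicit names are given, read_csv pads short rows with NaN):

import pandas as pd

# six columns covers the widest row: BOOK|id|price|date|status|<br>
data = pd.read_csv('shelfs.txt', sep='|', header=None, names=range(6))
valid = data[data[4] == 'VALID']  # column 4 holds the VALID/INVALID flag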
Edit (no external libraries and no change to the txt file):
# split into rows
data = data.split('<br>\n')
# split each row into columns
data = [d.split('|') for d in data]
# pad short rows with empty cells so every row has the same width
n_cols = max(len(d) for d in data)
for i in range(len(data)):
    while len(data[i]) < n_cols:
        data[i].append('')
# keep only the VALID rows
valid_rows = [d for d in data if d[4] == 'VALID']
# convert the numeric fields before adding them, otherwise + concatenates strings
valid_sum = [float(d[1]) + float(d[2]) + float(d[3]) for d in valid_rows]
valid_minimum = min(valid_sum)
valid_maximum = max(valid_sum)
It's maybe not exactly what you want to do, but it solves a part of your problem. I didn't test the code.
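For the part the question actually asks about, here is a hedged sketch that sums the numeric fields of each VALID line and reports the highest and lowest totals (the field layout is assumed from the sample file: BOOK|id|price|date|status|):

totals = []
with open("shelfs.txt") as f:
    for line in f:
        fields = line.replace("<br>", "").strip().split("|")
        if len(fields) > 4 and fields[4] == "VALID":
            # fields 1-3 are the numeric columns of a BOOK row
            totals.append(sum(float(x) for x in fields[1:4]))

print("highest valid total:", max(totals))
print("lowest valid total:", min(totals))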

Python Ignoring few x number of lines in CSV file

So I'm trying to read from a tab-delimited CSV file saved as a .DAT file.
I need to skip everything until I get to the data portion under the Date header: 05-29-2012 and everything on that row and the rows below it. I found plenty of documentation on how to skip the first few lines, but I don't know how many lines that may be. From the "Data file created" line to the meat of the data, one file may have more lines of text than another: it could be 3 rows, could be 10 rows.
I have thousands of these files I'm trying to extract the data from and plot. It would be easy in Excel just to cut and paste, but I'm going for efficiency here.
This is the code I'm using. I see the data perfectly, but only if I know how many lines to skip. There will be blank lines; I get how to bypass those, but any text above the header adds extra lines I can't bypass.
import pandas as pd

myfile = 'E:\\TTF Data Backup\\1X ARRAY MOD #2.dat'
df = pd.read_csv(myfile, skiprows=3, delimiter='\t')
print(df.head(20))
Data File
Try this code:
import pandas as pd

myfile = 'E:\\TTF Data Backup\\1X ARRAY MOD #2.dat'
skipcnt = 0
with open(myfile) as f:  # auto-closes after the loop
    for row in f:
        skipcnt += 1
        if "Tension" in row and "Elong" in row:  # top of the header
            break
skipcnt += 3  # skip the rest of the header block
df = pd.read_csv(myfile, skiprows=skipcnt, delimiter='\t')
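An alternative sketch, assuming the data rows always start with an MM-DD-YYYY date as in the 05-29-2012 example, is to scan for the first date-like line so the size of the preamble never matters:

import re

import pandas as pd

myfile = 'E:\\TTF Data Backup\\1X ARRAY MOD #2.dat'
with open(myfile) as f:
    for skipcnt, row in enumerate(f):
        if re.match(r'\d{2}-\d{2}-\d{4}', row):
            break  # skipcnt is now the index of the first data row

# the header line is skipped along with the preamble, hence header=None
df = pd.read_csv(myfile, skiprows=skipcnt, delimiter='\t', header=None)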

Pandas Copy Values from Rows to other files without disturbing the existing data

I have 20 csv files pertaining to different individuals.
And I have a Main csv file, which is built from the final row values of specific columns. Below are samples of both kinds of files.
All Individual Files look like this:
alex.csv
name,day,calls,closed,commision($)
alex,25-05-2019,68,6,15
alex,27-05-2019,71,8,20
alex,28-05-2019,65,7,17.5
alex,29-05-2019,68,8,20
stacy.csv
name,day,calls,closed,commision($)
stacy,25-05-2019,82,16,56.00
stacy,27-05-2019,76,13,45.50
stacy,28-05-2019,80,19,66.50
stacy,29-05-2019,79,18,63.00
But the Main File (single-day report), which is the output file, looks like this:
name,day,designation,calls,weekly_avg_calls,closed,commision($)
alex,29-05-2019,rep,68,67,8,20
stacy,29-05-2019,sme,79,81,18,63
madhu,29-05-2019,rep,74,77,16,56
gabrielle,29-05-2019,rep,59,61,6,15
I need to copy the values of the columns (calls, closed, commision($)) from the last line of each file for the end-of-day report, and then populate them into the Main File (a template that already has some columns, like name, day, and designation, filled in).
So how can I write a for or while loop over all the csv files in the "Employee_performance_DB" list?
Employee_performance_DB = ['alex.csv', 'stacy.csv', 'poduzav.csv', 'ankit.csv', ..., 'gabrielle.csv']

for employee_db in Employee_performance_DB:
    read_object = pd.read_csv(employee_db)
    read_object2 = read_object.tail(1)
    read_object2.to_csv("Main_Report.csv", header=False, index=False,
                        columns=["calls", "closed", "commision($)"], mode='a')
How do I copy the values of {calls, closed, commision($)} from the 'Employee_performance_DB' list of files into the exact columns of 'Main_Report.csv' for those exact employees?
Well, as I had no answers for this, it took a while for me to find a solution.
The code below fixed my issue...
# list of all the employee files
employees_list = ['alex.csv', ......, 'stacy.csv']

for employees in employees_list:
    read_object = pd.read_csv(employees)
    read_object2 = read_object.tail(1)
    read_object2.to_csv("Employee_performance_DB.csv", index=False, mode='a', header=False)

removing extra column in a csv file while exporting data using python3

I wrote a function in Python 3 which merges some files in the same directory and returns a CSV file as the output. The problem with the CSV file is that I get one extra column at the beginning which has no header, and whose rows are numbers counting up from 0. How do I write the CSV file without getting the extra column?
You can split each line by ',' and then use slicing to remove the first element.
example:
original = """col1,col2,col3
0,val01,val02,val03
1,val11,val12,val13
2,val21,val22,val23
"""
original_lines = original.splitlines()
result = original_lines[:1]  # copy the header
for line in original_lines[1:]:
    result.append(','.join(line.split(',')[1:]))
print('\n'.join(result))
Output:
col1,col2,col3
val01,val02,val03
val11,val12,val13
val21,val22,val23
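That said, the stray column is almost certainly the DataFrame index being written out by the merging function. If you control that code, passing index=False to to_csv removes the column at the source (the file name here is hypothetical):

import pandas as pd

df = pd.read_csv('merged.csv')        # hypothetical merged output
df.to_csv('merged.csv', index=False)  # index=False suppresses the unnamed index column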

Is it possible to create a new column for each iteration in XlsxWriter

I want to write data into Excel columns using XlsxWriter. One 'set' of data gets written for each iteration. Each set should be written in a separate column. How do I do this?
I have tried playing around with the col value as follows:
At [1] I define i = 0 outside the loop and later increment it by 1 and set col = i. When this is done the output is blank. To me this is the most logical solution, and I don't know why it won't work.
At [2] i is defined inside the loop. When this happens one column gets written.
At [3] I define col the standard way. This works as expected: one column gets written.
My code:
import xlsxwriter

txt_file = open('folder/txt_file', 'r')
lines = txt_file.readlines()

# [1] Define i outside the loop. When this is used the output is blank.
i = 0
for line in lines:
    if condition_a:
        # parse the text file to find a string. reg_name = string_1.
        ...
    elif condition_b:
        # parse the text file for a second string. esto_name = string_2.
        ...
    elif condition_c:
        # parse the text file for a group of strings and
        # use .split() to append them to a list: reg_vars.

        # [2] Define i inside the loop. When this is used one column gets
        # written. Relevant for [1] & [2].
        i += 1   # increment for each loop
        row = 1
        col = i  # increments by one per iteration, changing the column

        # [3] Define col = 1. When this is used one column also gets written.
        col = 1

        # open Excel
        book = xlsxwriter.Workbook('folder/variable_list.xlsx')
        sheet = book.add_worksheet()

        # write reg_name
        sheet.write(row, col, reg_name)
        row += 1
        # write esto_name
        sheet.write(row, col, esto_name)
        row += 1
        # write variables
        for variable in reg_vars:
            row += 1
            sheet.write(row, col, variable)
        book.close()
You can use the XlsxWriter write_column() method to write a list of data as a column.
However, in this particular case the issue seems to be that you are creating a new, duplicate, file via xlsxwriter.Workbook() each time you go through the condition_c part of the loop. Therefore the last value of col is used and the entire file is overwritten the next time through the loop.
You should probably move the creation of the xlsx file outside the loop. Probably to the same place you open() the text file.
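A sketch of that restructuring (condition_c, reg_name, esto_name, and reg_vars are the placeholders from the question):

import xlsxwriter

# create the workbook once, before the loop
book = xlsxwriter.Workbook('folder/variable_list.xlsx')
sheet = book.add_worksheet()

col = 0
for line in lines:
    if condition_c:
        col += 1                              # a fresh column per data set
        sheet.write(1, col, reg_name)
        sheet.write(2, col, esto_name)
        sheet.write_column(3, col, reg_vars)  # writes the list downward

book.close()  # close once, after all columns are written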
