Comparing items in an Excel file with Openpyxl in Python - excel

I am working with a big set of data, which has 9 columns per row (B3:J3 in row 3) and stretches down to B1325:J1325. Using Python and the Openpyxl library, I need to get the biggest and second-biggest value of each row and print those to a new field in the same row. I have already assigned values to single fields manually (headings), but cannot seem to even get the max value in my range automatically written to a new field. My code looks like the following:
for row in ws.rows['B3':'J3']:
    sumup = 0.0
    for cell in row:
        if cell.value != None:
            .........
It throws the error:
for row in ws.rows['B3':'J3']:
TypeError: 'generator' object has no attribute '__getitem__'
How could I get to my goal here?

You can use iter_rows to do what you want.
Try this:
for row in ws.iter_rows('B3:J3'):
    sumup = 0.0
    for cell in row:
        if cell.value is not None:
            ........
Check out this answer for more info:
How we can use iter_rows() in Python openpyxl package?
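Note that newer openpyxl versions take min_row/max_row/min_col/max_col keyword arguments instead of a range string. To actually get the largest and second-largest value of each row and write them back to new cells in the same row, a minimal sketch (assuming the data sit in B3:J1325, the results go to columns K and L, and the file name is hypothetical) could look like this:
from openpyxl import load_workbook

wb = load_workbook('data.xlsx')  # hypothetical file name
ws = wb.active

# Columns B-J are 2-10; write the results to K (11) and L (12).
for row in ws.iter_rows(min_row=3, max_row=1325, min_col=2, max_col=10):
    values = sorted(
        (cell.value for cell in row if isinstance(cell.value, (int, float))),
        reverse=True,
    )
    if len(values) >= 2:
        ws.cell(row=row[0].row, column=11, value=values[0])  # biggest
        ws.cell(row=row[0].row, column=12, value=values[1])  # second biggest

wb.save('data.xlsx')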

Related

Change number format using headers - openpyxl

I have an Excel file in which I want to convert the number formatting from 'General' to 'Date'. I know how to do so for one column when referring to the column letter:
workbook = openpyxl.load_workbook('path\filename.xlsx')
worksheet = workbook['Sheet1']

for row in range(2, worksheet.max_row+1):
    ws["{}{}".format(ColNames['Report_date'], row)].number_format='yyyy-mm-dd;#'
As you can see, I now use the column letter "D" to point out the column that I want to be formatted differently. Now, I would like to use the header in row 1 called "Start_Date" to refer to this column. I tried a method from the following post to achieve this: select a column by its name - openpyxl. However, that resulted in a KeyError: "Start_Date":
# Create a dictionary of column names
ColNames = {}
Current = 0

for COL in worksheet.iter_cols(1, worksheet.max_column):
    ColNames[COL[0].value] = Current
    Current += 1

for row in range(2, worksheet.max_row+1):
    ws["{}{}".format(ColNames['Start_Date'], row)].number_format='yyyy-mm-dd;#'
EDIT
This method results in the following error:
AttributeError: 'tuple' object has no attribute 'number_format'
Additionally, I have more columns from which the number formatting needs to be changed. I have a list with the names of those columns:
DateColumns = ['Start_Date', 'End_Date', 'Birthday']
Is there a way that I can use the list DateColumns so that I can save some lines of code?
Thanks in advance.
Please note that I posted a similar question earlier. The following post was referred to as an answer Python: Simulating CSV.DictReader with OpenPyXL. However, I don't see how the answers in that post can be adjusted to my needs.
You need to know which columns to change the number format on, and you have conveniently put those into a list, so just use that list.
Get the headers in your sheet, check whether each header is in the DateColumns list, and if so update all the entries in that column from row 2 to the max row with the date format you want...
...
DateColumns = ['Start_Date', 'End_Date', 'Birthday']

for COL in worksheet.iter_cols(min_row=1, max_row=1):
    header = COL[0]
    if header.value in DateColumns:
        for row in range(2, worksheet.max_row+1):
            worksheet.cell(row, header.column).number_format = 'yyyy-mm-dd;#'
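If you would rather keep the header-to-column mapping from the question, a possible sketch (using openpyxl's get_column_letter, assuming the same worksheet and DateColumns list; note that cell.column is an integer in openpyxl 2.6+) is to store the column letter instead of a running index:
from openpyxl.utils import get_column_letter

# Map each header in row 1 to its column letter, e.g. {'Start_Date': 'D', ...}
ColNames = {}
for COL in worksheet.iter_cols(min_row=1, max_row=1):
    header = COL[0]
    ColNames[header.value] = get_column_letter(header.column)

for name in DateColumns:
    for row in range(2, worksheet.max_row + 1):
        worksheet["{}{}".format(ColNames[name], row)].number_format = 'yyyy-mm-dd;#'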

Average Excel Columns

I am trying to average out the values within a certain range of a column. I tried listing out the range as a tuple and then for-looping through it to get the cell values. I then created a variable for the average, but get the error TypeError: 'float' object is not iterable.
range1 = ws["A2":"A6]
for cell in range1:
for x in cell:
average = sum(x.value)/len(x.value)
print(average)
Python and the openpyxl API make this kind of thing very easy.
rows = ws.iter_rows(min_row=2, max_row=6, max_col=1, values_only=True)
values = [row[0] for row in rows]
avg = sum(values) / len(values)
But you should probably check that the cells contain numbers, otherwise you'll see an exception.
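For example, a guard like this (a sketch that simply skips anything that is not an int or float) avoids the exception:
rows = ws.iter_rows(min_row=2, max_row=6, max_col=1, values_only=True)
values = [row[0] for row in rows if isinstance(row[0], (int, float))]
avg = sum(values) / len(values) if values else 0.0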
Something like this will get you the mean of the cells.
import openpyxl as op

def main():
    wb = op.load_workbook(filename='C:\\Users\\####\\Desktop\\SO nonsense\\Book1.xlsm')
    range1 = wb['Sheet1']['A2:A6']
    cellsum = 0
    for i, cell in enumerate(range1, 1):
        print(i)
        cellsum += cell[0].value
    print(cellsum / i)

main()

Get the value from another cell if a pattern matches a string in another cell of the same row

Hi,
I have an Excel workbook with the above data. I use Pandas to read through the Excel sheet in my Python script.
My requirement is that if I find YES in the Primary Key column, I need to get the Data Element value of that row.
So from the above sample, I need to get the 3 Data Element values into an array variable.
I tried the below piece of code but couldn't achieve it.
PK_COLUMNS = {}
workbook_sheet = pd.read_excel(excel_file_name, sheet_name=i, keep_default_na=False)
workbook_sheet = workbook_sheet.fillna("NULL", inplace=False)
df = pd.DataFrame(workbook_sheet, columns=[0])
total_rows = len(df.axes[0])
h = 0
while h < total_rows:
    CURRENT_COLUMN = workbook_sheet["Primary Key"].fillna("NULL", inplace=False)
    #print(CURRENT_COLUMN)
    for i in CURRENT_COLUMN:
        if str(i).upper() == 'YES':
            PK_COLUMNS[h] = workbook_sheet.iat[h, 0].strip()
            print(PK_COLUMNS)
            h = h + 1
        else:
            print("NULL")
print(PK_COLUMNS)
Any help on this is highly appreciated. Python 3.7 with Pandas.
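For what it is worth, pandas can do this without an explicit loop. A short sketch (assuming the columns really are named "Primary Key" and "Data Element" as in the question, and reusing the excel_file_name and sheet_name variables from the question's code) uses boolean indexing:
import pandas as pd

workbook_sheet = pd.read_excel(excel_file_name, sheet_name=i, keep_default_na=False)
# Rows whose Primary Key column says YES (in any casing)
mask = workbook_sheet["Primary Key"].astype(str).str.upper() == "YES"
pk_columns = workbook_sheet.loc[mask, "Data Element"].astype(str).str.strip().tolist()
print(pk_columns)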

Pandas read_excel() parses date columns with blank values to NaT

I am trying to read an excel file that has date columns with the below code
src1_df = pd.read_excel("src_file1.xlsx", keep_default_na = False)
Even though I have specified keep_default_na=False, I see that the data frame has NaT values for the corresponding blank cells in the Excel date columns.
Please suggest how to get a blank string instead of NaT while parsing the Excel file.
I am using Python 3.x and Pandas 0.23.4
src1_df = pd.read_excel("src_file1.xlsx", na_filter=False)
Then you will have the empty string ("") as the "na" value.
In my case, I read the Excel data line by line and replace "" and NaT with None:
for line in src1_df.values:
    for index, value in enumerate(line):
        if value == '' or isinstance(value, pd._libs.tslibs.nattype.NaTType):
            line[index] = None
    dostuff_with(line)
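A vectorised alternative is also possible. A sketch (converting to object dtype first so the date columns can hold strings, then blanking out every missing value, NaT included):
import pandas as pd

src1_df = pd.read_excel("src_file1.xlsx")
# Keep real values, replace NaN/NaT with an empty string
src1_df = src1_df.astype(object).where(src1_df.notna(), "")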

Creating new column using for loop returns NaN value in Python/Pandas

I am using Python/Pandas to manipulate a data frame. I have a column 'month' (values from 1.0 to 12.0). Now I want to create another column 'quarter'. When I write -
for x in data['month']:
    print((x-1)//3 + 1)
I get the proper output, that is, the quarter number (1, 2, 3, 4, etc.). But I am not able to assign the output to the new column.
for x in data['month']:
    data['quarter'] = ((x-1)//3 + 1)
This creates the quarter column with missing or NaN values.
My question is: why am I getting missing values while creating the column?
Note: I am using python 3.6 and Anaconda 1.7.0. 'data' is the data frame I am using. Initially I had only the date which I converted to month and year using
data['month'] = pd.DatetimeIndex(data['first_approval']).month
Interestingly, this month column shows dtype: float64. I have read somewhere that "dtype('float64') is equivalent to None", but I didn't understand that statement clearly. Any suggestion or help will be highly appreciated.
This is what I had in the beginning:
This is what I am getting after running the for loop:
The easiest way to get the quarter from the date would be
data['quarter'] = pd.DatetimeIndex(data['first_approval']).quarter
in the same way as you obtained the month information.
The line below would set the entire column to the last value produced by the calculation. (There could have been some value that was not in a proper date format, hence the NaNs.)
data['quarter'] = ((x-1)//3 + 1)
Try with the below:
data['quarter'] = data['month'].apply(lambda x: ((x-1)//3 + 1))
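Alternatively, the same calculation works as a single vectorised expression with no loop or apply at all (a sketch, assuming 'month' holds the float month numbers; rows where 'month' is NaN simply stay NaN):
data['quarter'] = (data['month'] - 1) // 3 + 1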
