Openpyxl: Manipulation of cell values - python-3.x

I'm trying to pull cell values from an excel sheet, do math with them, and write the output to a new sheet. I keep getting an ErrorType. I've run the code successfully before, but just added this aspect of it, thus code has been distilled to below:
import openpyxl
#set up ws from file, and ws_out write to new file
def get_data():
first = 0
second = 0
for x in range (1, 1000):
if ws.cell(row=x, column=1).value == 'string':
for y in range (1, 10): #Only need next ten rows after 'string'
ws_out.cell(row=y, column=1).value = ws.cell(row=x+y, column=1).value
second = first #displaces first -> second
first = ws.cell(row=x+y, column=1).value/100 #new value for first
difference = first - second
ws_out.cell(row=x+y+1, column=1).value = difference #add to output
break
Throws a TypeError message:
first = ws.cell(row=x+y, column=1).value)/100
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
I assume this is referring to the ws.cell value and 100, respectively, so I've also tried:
first = int(ws.cell(row=x, column=1))/100 #also tried with float
Which raises:
TypeError: int() argument must be a string or a number
I've confirmed that every cell in the column is made up of numbers only. Additionally, openpyxl's cell.data_type returns 'n' (presumably for number as far as I can tell by the documentation).
I've also tested more simple math, and have the same error.
All of my searching seems to point to openpyxl normally behaving like this. Am I doing something wrong, or is this simply a limitation of the module? If so, are there any programmatic workarounds?
As a bonus, advice on writing code more succinctly would be much appreciated. I'm just beginning, and feel there must be a cleaner way to write an ideas like this.
Python 3.3, openpyxl-1.6.2, Windows 7
Summary
cfi's answer helped me figure it out, although I used a slightly different workaround. On inspection of the originating file, there was one empty cell (which I had missed earlier). Since I will be re-using this code later on columns with more sporadic empty cells, I used:
if ws.cell(row=x+r, column=40).data_type == 'n':
second = first #displaces first -> second
first = ws.cell(row=x+y, column=1).value/100 #new value for first
difference = first - second
ws_out.cell(row=x+y+1, column=1).value = difference #add to output
Thus, if a specified cell was empty, it was ignored and skipped.

Are you 100% sure (=have verified) that all the cells you are accessing actually hold a value? (Edit: Do a print("dbg> cell value of {}, {} is {}".format(row, 1, ws.cell(row=row, column=1).value)) to verify content)
Instead of going through a fixed range(1,1000) I'd recomment to use openpyxl introspection methods to iterate over existing rows. E.g.:
wb=load_workbook(inputfile)
for ws in wb.worksheets:
for row in ws.rows:
for cell in row: value = cell.value
When getting the values do not forget to extract the .value attribute:
first = ws.cell(row=x+y, column=1).value/100 #new value for first
As a general note: x, and y are useful variable names for 2D coordinates. Don't use them both for rows. It will mislead others who have to read the code. Instead of x you could use start_row or row_offset or something similar. Instead of y you could just use row and you could let it start with the first index being the start_row+1.
Some example code (untested):
def get_data():
first = 0
second = 0
for start_row in range (1, ws.rows):
if ws.cell(row=start_row, column=1).value == 'string':
for row in range (start_row+1, start_row+10):
ws_out.cell(row=start_row, column=1).value = ws.cell(row=row, column=1)
second = first
first = ws.cell(row=row, column=1).value/100
difference = first - second
ws_out.cell(row=row+1, column=1).value = difference
break
Now with this code I still don't understand what you are trying to achieve. Is the break indented correctly? If yes, the first time you are matching string, the outer loop will be quit by the break. Then, what is the point of the variables first and second?
Edit: Also ensure that your reading from and writing into cell().value not just cell().

Related

`pandas iterrows` retaining unsaved changed (how?): replacing blank string value row problem

I recently needed to fill blank string values within a pandas dataframe with an adjacent column for the same row.
I attempted df.apply(lambda x: x['A'].replace(...) as well attempted np.where. Neither worked. There were anomalies with the assignment of "blank string values", I couldn't pick them up via '' or df['A'].replace(r'^\s$',df['B'],regex=True), or replacing df['B'] with e.g. -. The only two things that worked was .isnull() and iterrows where they appeared as nan.
So iterrows worked, but I'm not saving the changes.
How is pandas saving the changes?
mylist = {'A':['fe','fi', 'fo', ''], 'B':['fe1,','fi2', 'fi3', 'thum']}
coffee = pd.DataFrame(mylist)
print ("output1\n",coffee.head())
for index,row in coffee.iterrows():
if str(row['A']) == '':
row['A'] = row['B']
print ("output2\n", coffee.head())
output1
A B
0 fe fe1,
1 fi fi2
2 fo fi3
3 thum
output2
A B
0 fe fe1,
1 fi fi2
2 fo fi3
3 thum thum
Note The dataframe is an object BTW.
About pandas.DataFrame.iterrows, the documentation says :
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
In your case, you can use one of these *solutions (that should work with your real dataset as well):
coffee.loc[coffee["A"].eq(""), "A"] = coffee["B"]
Or :
coffee["A"] = coffee["B"].where(coffee["A"].eq(""), coffee["A"])
Or :
coffee["A"] = coffee["A"].replace("", None).fillna(coffee["B"])
Still a strange behaviour though that your original dataframe got updated within the loop without any re-assign. Also, not to mention that the row/Series is supposed to return a copy and not a view..

Python Pandas, Try to update cell value

I've 2 dataframe, both with a column date:
I need to set in first dataframe the value of specific column found in the second dataframe,
So in first of all I find the correct row of first dataframe with:
id_row = int(dataset.loc[dataset["time"] == str(searchs.index[x])].index[0]) #example: 910
and then I want to update the value of column ['search_volume'] at this row: 910
I will do this with:
dataset['search_volume'][id_row] = searchs[kw_list[0]][x]
but I get back this error:
/root/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
my full code is, but not working and nothing is updated.
for x in range(len(searchs)):
id_row = int(dataset.loc[dataset["time"] == str(searchs.index[x])].index[0])
dataset['search_volume'][id_row] = searchs[kw_list[0]][x]
It work fine if I test manually the update with:
dataset['search_volume'][910] = searchs[kw_list[0]][47]
What's append?!
Use .loc:
dataset.loc[910, 'search_volume'] = searchs.loc[47, kw_list[0]]
For more info about the error message, see this
Also, there are way more efficient methods for doing this. As a rule of thumb, if you are looping over a dataframe, you are generally doing something wrong. Some potential solutions: pd.DataFrame.join, pd.merge, masking, pd.DataFrame.where, etc.

Converting a csv file containing pixel values to it's equivalent images

This is my first time working with such a dataset.
I have a .csv file containing pixel values (48x48 = 2304 columns) of images, with their labels in the first column and the pixels in the subsequent ones, as below:
A glimpse of the dataset
I want to convert these pixels into their images, and store them into different directories corresponding to their respective labels. Now I have tried the solution posted here but it doesn't seem to work for me.
Here's what I've tried to do:
labels = ['Fear', 'Happy', 'Sad']
with open('dataset.csv') as csv_file:
csv_reader = csv.reader(csv_file)
fear = 0
happy = 0
sad = 0
# skip headers
next(csv_reader)
for row in csv_reader:
pixels = row[1:] # without label
pixels = np.array(pixels, dtype='uint8')
pixels = pixels.reshape((48, 48))
image = Image.fromarray(pixels)
if csv_file['emotion'][row] == 'Fear':
image.save('C:\\Users\\name\\data\\fear\\im'+str(fear)+'.jpg')
fear += 1
elif csv_file['emotion'][row] == 'Happy':
image.save('C:\\Users\\name\\data\\happy\\im'+str(happy)+'.jpg')
happy += 1
elif csv_file['emotion'][row] == 'Sad':
image.save('C:\\Users\\name\\data\\sad\\im'+str(sad)+'.jpg')
sad += 1
However, upon running the above block of code, the following is the error message I get:
Traceback (most recent call last):
File "<ipython-input-11-aa928099f061>", line 18, in <module>
if csv_file['emotion'][row] == 'Fear':
TypeError: '_io.TextIOWrapper' object is not subscriptable
I referred to a bunch of posts that solved the above error (like this one), but I found that the people were trying their hand at a relatively different problem than mine, and others I couldn't understand.
This may well be a very trivial question, but as I mentioned earlier, this is my first time working with such a dataset. Kindly tell me what am I doing wrong and how I can fix my code.
Try -
if str(row[0]) == 'Fear':
And in a similar way for the other conditions:
elif str(row[0]) == 'Happy':
elif str(row[0]) == 'Sad':
(a good practice is to just save the first value of the array as a variable)
The first problem that arose was that the first row was just the column names. In order to take care of this, I used the skiprows parameter like so:
raw = pd.read_csv('dataset.csv', skiprows = 1)
Secondly, I moved the labels column to the end due to it being in the first column. For my own convenience.
Thirdly, after all the preparations were done, the dataset won't iterate over the whole row, and instead just took in the value of the first row and first column, which gave an issue in resizing. So I instead used the df.itertuples() like so:
for row in data.itertuples(index = False, name = 'Pandas'):
Lastly, thanks to #HadarM 's suggestions, I was able to get it to work.
Modified code of the above code snippet that was the problematic block:
for row in data.itertuples(index = False, name = 'Pandas'):
pixels = row[:-1] # without label
pixels = np.array(pixels, dtype='uint8')
pixels = pixels.reshape((48, 48))
image = Image.fromarray(pixels)
if str(row[-1]) == 'Fear':
image.save('C:\\Users\\name\\data\\fear\\im'+str(fear)+'.jpg')
fear += 1
elif str(row[-1]) == 'Happy':
image.save('C:\\Users\\name\\data\\happy\\im'+str(happy)+'.jpg')
happy += 1
elif str(row[-1]) == 'Sad':
image.save('C:\\Users\\name\\data\\sad\\im'+str(sad)+'.jpg')
sad += 1
print('done')

Stuck in an if statement in a while loop. How do i exit?

i wrote the following code to iteratively build a matrix depending on the conditions met. But i'm stuck in the 2nd if statement. for some reason it doesnt move to else and prints the contents of my print statements over and over, infinitely.
B=np.empty([u,v])
for i in range(u):
for j in range(v):
B[i][j]=0
vi= df['GroupId'][0]
G_df= pd.DataFrame(index=np.arange(0, v), columns=['j', 'vi'])
G_df['j'][0] = 0
G_df['vi'][0] = 0
j=0
i=0
old_tweet_id=1
new_tweet_id=1
value= df['TFIDF_score'][0]
while i<u:
old_tweet_id=new_tweet_id
while (old_tweet_id == new_tweet_id):
if j < v:
new_js= [G_df.loc[G_df['vi'] == vi, 'j'].item()]
if new_js != 0:
print('new_js',new_js)
new_j= int(''.join(str(x) for x in new_js))
print('new_j', new_j)
B[i][new_j] = value
print('matrix', B)
else:
G_df.loc[-1]=[j,vi]
B[i][j]=value
vi = vi +1
j=j+1
if j>=v:
old_tweet_id = u +10
else:
cc = df['tweet_id'][j:j + 1]
dd = df['TFIDF_score'][j:j + 1]
value = dd[j]
new_tweet_id = cc[j]
i = i + 1
I tried using break and also tried to empty the new_js and new_j variables just before the else line but that didn't work either.
I'm sure I'm missing something but I can't place my finger on it.
EDIT:
I am trying to build a matrix from a dataframe of several columns one of the dataframe columns contains what i will use for the labels of my matrix column and a couple of them are repeating, so i used df.groupy to group the overlapping entries and assign an index to them so that all similar entries would have the same index value. These index values are stored in another dataframe column called GroupId. So while building the matrix, the values of the matrix itself are the df[TFIDF scores] and they will be inputted to the matrix based on a which column and row they belong to. where my problem arises from is while checking to see if a column label has been encountered and this current encounter is an overlap, so we need to use the first occurrence of the column label instead of creating a new column for it. I created a new dataframe (G_df) where it appends all the column labels it encountered and also where it compares the current column label to see if there is an existing one.
I know this is a lot but i've tried everything I know. I've been stuck at this problem for a long time.

Write values from Pandas DataFrame columns into tkinter TreeView/Table Columns

I want to write values from a dataframe into a tkinter treeview/Table, I am not able to do this.
my code:
#Setting up tkinter window.
root = Tk()
tree = ttk.Treeview(root)
#taking file input through a dialog box from the user.
file = filedialog.askopenfile(parent=root,mode='rb',title='Choose a xlsx file')
#readinf the excel file selected by the user and then creating a dataframe of that file.
xls = pd.read_excel(file)
df = pd.DataFrame(xls)
#taking all the columns heading in a variable"df_col".
df_col = df.columns.values
#all the column name are generated dynamically.
tree["columns"]=(df_col)
counter = len(df)
#generating for loop to create columns and give heading to them through df_col var.
for x in range(len(df_col)):
tree.column(x, width=100 )
tree.heading(x, text=df_col[x])
#generating for loop to print values of dataframe in treeview column.
for i in range(counter):
tree.insert('', 0, values=(df[df_col[x]]][i]))
It is not printing the columns and showing the KeyError:0.
Output Required:
The first argument of tree.column() should be the column name, which you assigned with:
tree["columns"]=(df_col)
The problem is that you have named the columns using a string, but you are attempting to access them using integers in:
for x in range(len(df_col)):
tree.column(x, width=100 )
tree.heading(x, text=df_col[x])
Above, you are attempting to access tree.columns(0), instead of tree.columns('Company'), hence the key error.
Try instead:
for x in range(len(df_col)):
tree.column(df_col[x], width=100)
tree.heading(df_col[x], text=df_col[x])
Note that df_col is an ndarray, not a dataframe, which is why df_col[x] works correctly (df[x] would give a key error). This is because df.columns.values returns an ndarray. As a side note, it may be a bit confusing to name an ndarray df_col.
There are also a few issues with your insert. The second argument should correspond to the index of the entry you wish to address. One solution is then to use a row index as the second argument, followed by a row label as text="rowLabel", followed by a list of values for the row:
tree.insert('', i, text=rowLabels[i], values=df.iloc[i,:].tolist())
Where rowLabels should be defined as whatever you want to use in the first column of the table. I would suggest using an index column from the spreadsheet here, if possible. It could be defined by:
rowLabels = df.iloc[:,indexColumn].tolist()
or:
rowLabels = df.index.tolist()
The latter is viable if df has named indices defined by a column during the spreadsheet import. In the former, indexColumn is an int referring to a column number in df that contains unique identifiers.
The option values=df.iloc[i,:].tolist() converts all columns of the ith row into a list, and, since we have passed an index value (the second argument) that gets larger, the call will insert a new row every loop (from the python tkinter docs entry on Treeview --> insert: "if index is greater than or equal to the current number of children, it is inserted at the end").
Finally, I am not sure if you did not post the end of your code, but, in order for the tree to show up, you will also need to use pack, grid, etc.
tree.pack()
or
tree.grid(row=0, column=0)
References:
https://docs.python.org/3/library/tkinter.ttk.html#tkinter.ttk.Treeview
This helpful example makes a few of the steps clear:
https://knowpapa.com/ttk-treeview/
As I was reading over your code. I noticed at the end line you have an extra bracket #:
df[df_col[x]]]
for i in range(counter):
tree.insert('', 0, values=(df[df_col[x]]][i]))
I would assume that would explain the KeyError.

Resources