`pandas iterrows` retaining unsaved changes (how?): replacing blank-string values in a row - python-3.x

I recently needed to fill blank string values within a pandas dataframe with the value from an adjacent column in the same row.
I attempted df.apply(lambda x: x['A'].replace(...)) as well as np.where. Neither worked. There were anomalies with the "blank string values": I couldn't match them via '' or df['A'].replace(r'^\s$', df['B'], regex=True), or by replacing df['B'] with e.g. -. The only two things that worked were .isnull() and iterrows, where they appeared as nan.
So iterrows worked, but I'm not saving the changes.
How is pandas saving the changes?
mylist = {'A': ['fe', 'fi', 'fo', ''], 'B': ['fe1,', 'fi2', 'fi3', 'thum']}
coffee = pd.DataFrame(mylist)
print("output1\n", coffee.head())
for index, row in coffee.iterrows():
    if str(row['A']) == '':
        row['A'] = row['B']
print("output2\n", coffee.head())
output1
       A     B
0     fe  fe1,
1     fi   fi2
2     fo   fi3
3         thum
output2
       A     B
0     fe  fe1,
1     fi   fi2
2     fo   fi3
3   thum  thum
Note: the dataframe's columns are of dtype object, BTW.

About pandas.DataFrame.iterrows, the documentation says:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
In your case, you can use one of these solutions (they should work with your real dataset as well):
coffee.loc[coffee["A"].eq(""), "A"] = coffee["B"]
Or :
coffee["A"] = coffee["B"].where(coffee["A"].eq(""), coffee["A"])
Or :
coffee["A"] = coffee["A"].replace("", None).fillna(coffee["B"])
It's still strange behaviour, though, that your original dataframe got updated within the loop without any re-assignment. Also, not to mention that the row/Series is supposed to be a copy and not a view..
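As a quick sanity check, here is the first of those solutions applied to the toy frame from the question (a minimal sketch; only pandas is assumed):

```python
import pandas as pd

# Toy frame from the question: row 3 has a blank string in column A
coffee = pd.DataFrame({'A': ['fe', 'fi', 'fo', ''],
                       'B': ['fe1,', 'fi2', 'fi3', 'thum']})

# Vectorized fill: wherever A is an empty string, copy the value from B
coffee.loc[coffee['A'].eq(''), 'A'] = coffee['B']

print(coffee['A'].tolist())  # → ['fe', 'fi', 'fo', 'thum']
```

No iteration is needed, and the assignment goes through `.loc` on the original frame, so there is no copy-vs-view ambiguity.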

Related

Pandas object - save to .csv

I have a pandas object df and I would like to save it to .csv:
df.to_csv('output.csv', index = False)
Even though the data frame is displayed correctly in the terminal after printing, in the *.csv some lines are shifted several columns forward. I do not know how to demonstrate that in minimal working code. I tried it with the one problematic column, but the output of that single column was correct in the *.csv. What should I check, please? The whole column contains strings.
After advice:
selected['SpType'] = selected['SpType'].str.replace('\t', '')
I obtained an error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
selected['SpType'] = selected['SpType'].str.replace('\t', '')
If the tabs are the problem, you could just replace all tabs.
If the tabs occur in column column_name you could do something like:
df['column_name'] = df['column_name'].str.replace('\t', '')
If the problem is in several columns, you could loop over all columns, e.g.:
for col in df.columns:
    df[col] = df[col].str.replace('\t', '')
df.to_csv('output.csv', index=False)
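To illustrate the fix end to end, here is a small sketch (the column names are made up, loosely mirroring the question's `SpType`) that strips tabs from every column before writing, so no field can be split at a tab in the output:

```python
import io
import pandas as pd

# A column whose strings contain stray tabs, which can shift CSV fields
df = pd.DataFrame({'SpType': ['G2\tV', 'K0III'], 'mag': ['4.8', '5.9']})

# Strip tabs from every (string) column before writing
for col in df.columns:
    df[col] = df[col].str.replace('\t', '', regex=False)

# Write to an in-memory buffer instead of a file, for demonstration
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```

`regex=False` is passed explicitly because the default for `str.replace` changed across pandas versions; for a literal tab it is also faster.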

split time series dataframe when value change

I have a Dataframe that corresponds to the lat/long of an object in movement.
This object goes from one place to another, and I created a column that records which place it is at, every second.
I want to split that dataframe, so that when the object enters one place and then leaves for another, I'll have two separate dataframes.
'None' means it is between places.
My actual code:
def cut_df2(df):
    df_copy = df.copy()
    # check if change of place
    df_copy['changed'] = df_copy['place'].ne(df_copy['place'].shift().bfill()).astype(int)
    last = 0
    dfs = []
    for num, line in df_copy.iterrows():
        if line.changed:
            dfs.append(df.iloc[last:num, :])
            last = num
    # Check if last line was in a place
    if line.place != 'None':
        dfs.append(df.iloc[last:, :])
    df_outs = []
    # Delete empty dataframes
    for num, dataframe in enumerate(dfs):
        if not dataframe.empty:
            if dataframe.reset_index().place.iloc[0] != 'None':
                df_outs.append(dataframe)
    return df_outs
It works on simple examples, but it won't work on a big dataset and I've no idea why. Can anyone help me?
Try using this instead:
https://www.geeksforgeeks.org/split-pandas-dataframe-by-rows/
iloc can be a good way to split a dataframe
df1 = datasX.iloc[:, :72]
df2 = datasX.iloc[:, 72:]
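A common idiom for splitting on value changes, not shown in the answer above, is to build a group id with shift/cumsum and group on it; a sketch, assuming a 'place' column as in the question:

```python
import pandas as pd

df = pd.DataFrame({'place': ['A', 'A', 'None', 'B', 'B'],
                   'lat': [1, 2, 3, 4, 5]})

# New group id every time 'place' changes from the previous row
group_id = df['place'].ne(df['place'].shift()).cumsum()

# One sub-dataframe per contiguous run, dropping the 'between places' runs
parts = [g for _, g in df.groupby(group_id) if g['place'].iloc[0] != 'None']
print([p['place'].iloc[0] for p in parts])  # → ['A', 'B']
```

This avoids `iterrows` entirely, so it scales to large frames.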

Look up a number inside a list within a pandas cell, and return corresponding string value from a second DF

(I've edited the first column name in the labels_df for clarity)
I have two DataFrames, train_df and labels_df. train_df has integers that map to attribute names in the labels_df. I would like to look up each number within a given train_df cell and return in the adjacent cell, the corresponding attribute name from the labels_df.
So for example, the first observation in train_df has attribute_ids of 147, 616 and 813, which map to (in the labels_df) culture::french, tag::dogs, tag::men. And I would like to place those strings inside one cell on the same row as the corresponding integers.
I've tried variations of the function below, but fear I am way off:
def my_mapping(df1, df2):
    tags = df1['attribute_ids']
    for i in tags.iteritems():
        df1['new_col'] = df2.iloc[i]
    return df1
The data are originally from two csv files:
train.csv
labels.csv
I tried this from #Danny :
sample_train_df['attribute_ids'].apply(
    lambda x: [sample_labels_df[sample_labels_df['attribute_name'] == i]['attribute_id_num']
               for i in x])
*Please note - I am running the above code on samples of each DF due to run times on the original DFs.
which returned:
I hope this is what you are looking for. I am sure there's a much more efficient way using a lookup.
df['new_col'] = df['attribute_ids'].apply(lambda x: [labels_df[labels_df['attribute_id'] == i]['attribute_name'] for i in x])
This is super ugly, and one day, hopefully sooner rather than later, I'll be able to accomplish this task in an elegant fashion. Until then, this is what got me the result I need.
split train_df['attribute_ids'] into their own cell/column
helper_df = train_df['attribute_ids'].str.split(expand=True)
combine train_df with the helper_df so I have the id column (they are photo id's)
train_df2 = pd.concat([train_df, helper_df], axis=1)
drop the original attribute_ids column
train_df2.drop(columns = 'attribute_ids', inplace=True)
rename the new columns
train_df2 = train_df2.rename(columns={0: 'attr1', 1: 'attr2', 2: 'attr3', 3: 'attr4', 4: 'attr5', 5: 'attr6',
                                      6: 'attr7', 7: 'attr8', 8: 'attr9', 9: 'attr10', 10: 'attr11'})
convert the labels_df into a dictionary
def create_file_mapping(df):
    mapping = dict()
    for i in range(len(df)):
        name, tags = df['attribute_id_num'][i], df['attribute_name'][i]
        mapping[str(name)] = tags
    return mapping
map and replace the tag numbers with their corresponding tag names
my_map = create_file_mapping(labels_df)
train_df3 = train_df2.applymap(lambda s: my_map.get(s) if s in my_map else s)
create a new column of the observations tags in a list of concatenated values
helper1['new_col'] = helper1[helper1.columns[0:10]].apply(lambda x: ','.join(x.astype(str)), axis = 1)
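For reference, the whole pipeline above can usually be collapsed into one dictionary lookup per id. A sketch with made-up sample data, assuming the two-column `labels_df` layout described in the question:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames from the question
labels_df = pd.DataFrame({'attribute_id': [147, 616, 813],
                          'attribute_name': ['culture::french', 'tag::dogs', 'tag::men']})
train_df = pd.DataFrame({'attribute_ids': ['147 616 813', '616']})

# Build the id -> name dictionary once, then apply it per row
id_to_name = dict(zip(labels_df['attribute_id'], labels_df['attribute_name']))
train_df['new_col'] = train_df['attribute_ids'].apply(
    lambda s: ', '.join(id_to_name[int(i)] for i in s.split()))

print(train_df['new_col'].iloc[0])  # → culture::french, tag::dogs, tag::men
```

A dict lookup is O(1) per id, so this avoids the per-element dataframe filtering that made the `apply` version slow on the full data.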

Stuck in an if statement in a while loop. How do I exit?

I wrote the following code to iteratively build a matrix depending on the conditions met. But I'm stuck in the 2nd if statement; for some reason it doesn't move to else and prints the contents of my print statements over and over, infinitely.
B = np.empty([u, v])
for i in range(u):
    for j in range(v):
        B[i][j] = 0
vi = df['GroupId'][0]
G_df = pd.DataFrame(index=np.arange(0, v), columns=['j', 'vi'])
G_df['j'][0] = 0
G_df['vi'][0] = 0
j = 0
i = 0
old_tweet_id = 1
new_tweet_id = 1
value = df['TFIDF_score'][0]
while i < u:
    old_tweet_id = new_tweet_id
    while (old_tweet_id == new_tweet_id):
        if j < v:
            new_js = [G_df.loc[G_df['vi'] == vi, 'j'].item()]
            if new_js != 0:
                print('new_js', new_js)
                new_j = int(''.join(str(x) for x in new_js))
                print('new_j', new_j)
                B[i][new_j] = value
                print('matrix', B)
            else:
                G_df.loc[-1] = [j, vi]
                B[i][j] = value
            vi = vi + 1
            j = j + 1
        if j >= v:
            old_tweet_id = u + 10
        else:
            cc = df['tweet_id'][j:j + 1]
            dd = df['TFIDF_score'][j:j + 1]
            value = dd[j]
            new_tweet_id = cc[j]
    i = i + 1
I tried using break, and also tried to empty the new_js and new_j variables just before the else line, but that didn't work either.
I'm sure I'm missing something, but I can't put my finger on it.
EDIT:
I am trying to build a matrix from a dataframe of several columns. One of the dataframe columns contains what I will use as the labels of my matrix columns, and a couple of them are repeating, so I used df.groupby to group the overlapping entries and assign an index to them, so that all similar entries get the same index value. These index values are stored in another dataframe column called GroupId. While building the matrix, the values of the matrix itself are the df['TFIDF_score'] values, and they are inserted into the matrix based on which column and row they belong to. My problem arises while checking whether a column label has been encountered before and the current encounter is an overlap, in which case we need to use the first occurrence of the column label instead of creating a new column for it. I created a new dataframe (G_df) to which I append every column label encountered, and against which I compare the current column label to see if an existing one matches.
I know this is a lot, but I've tried everything I know. I've been stuck on this problem for a long time.
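Based on that description (GroupId as the column label, TFIDF_score as the cell value), this kind of matrix can often be built without explicit loops via a pivot. A sketch with hypothetical sample data, using the column names from the question:

```python
import pandas as pd

# Hypothetical miniature of the question's dataframe
df = pd.DataFrame({'tweet_id': [1, 1, 2],
                   'GroupId': [0, 1, 0],
                   'TFIDF_score': [0.5, 0.2, 0.9]})

# Rows = tweets, columns = group ids, cells = TF-IDF scores (0 where absent)
B = df.pivot_table(index='tweet_id', columns='GroupId',
                   values='TFIDF_score', fill_value=0)
print(B.to_numpy())
```

Repeated labels land in the same column automatically, which removes the need for the G_df bookkeeping and the nested while loops entirely.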

Openpyxl: Manipulation of cell values

I'm trying to pull cell values from an excel sheet, do math with them, and write the output to a new sheet. I keep getting a TypeError. I've run the code successfully before, but just added this aspect of it, thus the code has been distilled to below:
import openpyxl
# set up ws from file, and ws_out to write to a new file

def get_data():
    first = 0
    second = 0
    for x in range(1, 1000):
        if ws.cell(row=x, column=1).value == 'string':
            for y in range(1, 10):  # Only need next ten rows after 'string'
                ws_out.cell(row=y, column=1).value = ws.cell(row=x+y, column=1).value
                second = first  # displaces first -> second
                first = ws.cell(row=x+y, column=1).value/100  # new value for first
                difference = first - second
                ws_out.cell(row=x+y+1, column=1).value = difference  # add to output
            break
This throws a TypeError:
first = ws.cell(row=x+y, column=1).value/100
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
I assume this is referring to the ws.cell value and 100, respectively, so I've also tried:
first = int(ws.cell(row=x, column=1))/100 #also tried with float
Which raises:
TypeError: int() argument must be a string or a number
I've confirmed that every cell in the column is made up of numbers only. Additionally, openpyxl's cell.data_type returns 'n' (presumably for "number", as far as I can tell from the documentation).
I've also tested more simple math, and have the same error.
All of my searching seems to point to openpyxl normally behaving like this. Am I doing something wrong, or is this simply a limitation of the module? If so, are there any programmatic workarounds?
As a bonus, advice on writing code more succinctly would be much appreciated. I'm just beginning, and feel there must be a cleaner way to write an ideas like this.
Python 3.3, openpyxl-1.6.2, Windows 7
Summary
cfi's answer helped me figure it out, although I used a slightly different workaround. On inspection of the originating file, there was one empty cell (which I had missed earlier). Since I will be re-using this code later on columns with more sporadic empty cells, I used:
if ws.cell(row=x+y, column=40).data_type == 'n':
    second = first  # displaces first -> second
    first = ws.cell(row=x+y, column=1).value/100  # new value for first
    difference = first - second
    ws_out.cell(row=x+y+1, column=1).value = difference  # add to output
Thus, if a specified cell was empty, it was ignored and skipped.
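Another common guard for the same problem is to test the value itself rather than data_type, substituting a default when a cell is empty. A plain-Python sketch of the pattern (the helper name is made up; the worksheet access is elided, since the function would simply be called with `ws.cell(...).value`):

```python
def scaled(cell_value, default=0.0):
    """Return cell_value / 100, treating empty (None) cells as `default`."""
    if cell_value is None:
        return default
    return cell_value / 100

print(scaled(250))   # → 2.5
print(scaled(None))  # → 0.0
```

Empty cells come back from openpyxl as None, so checking `is None` before dividing avoids the TypeError without needing data_type at all.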
Are you 100% sure (= have verified) that all the cells you are accessing actually hold a value? (Edit: do a print("dbg> cell value of {}, {} is {}".format(row, 1, ws.cell(row=row, column=1).value)) to verify the content.)
Instead of going through a fixed range(1, 1000), I'd recommend using openpyxl introspection methods to iterate over existing rows. E.g.:
wb = load_workbook(inputfile)
for ws in wb.worksheets:
    for row in ws.rows:
        for cell in row:
            value = cell.value
When getting the values do not forget to extract the .value attribute:
first = ws.cell(row=x+y, column=1).value/100 #new value for first
As a general note: x, and y are useful variable names for 2D coordinates. Don't use them both for rows. It will mislead others who have to read the code. Instead of x you could use start_row or row_offset or something similar. Instead of y you could just use row and you could let it start with the first index being the start_row+1.
Some example code (untested):
def get_data():
    first = 0
    second = 0
    for start_row in range(1, ws.max_row + 1):
        if ws.cell(row=start_row, column=1).value == 'string':
            for row in range(start_row + 1, start_row + 10):
                ws_out.cell(row=start_row, column=1).value = ws.cell(row=row, column=1)
                second = first
                first = ws.cell(row=row, column=1).value/100
                difference = first - second
                ws_out.cell(row=row + 1, column=1).value = difference
            break
Now, with this code, I still don't understand what you are trying to achieve. Is the break indented correctly? If yes, the first time 'string' is matched the outer loop will be exited by the break. Then, what is the point of the variables first and second?
Edit: Also ensure that your reading from and writing into cell().value not just cell().