Replacing NaN with None fills in the value from the previous row in a DataFrame (pandas 1.0.3)

I am not sure whether this is a bug, but I am unable to replace NaN values with None using the latest pandas library. When I use the DataFrame.replace() method to replace NaN with None, the DataFrame takes the value from the previous row instead of None. For example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [10, 20, np.nan], 'y': [30, 40, 50]})
print(df)
Output:
      x   y
0  10.0  30
1  20.0  40
2   NaN  50
If I then apply the replace method:
print(df.replace(np.nan, None))
Output (the cell at row 2, column x should be None instead of 20.0):
      x   y
0  10.0  30
1  20.0  40
2  20.0  50
Any help is appreciated.
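A likely explanation, going by the pandas 1.x documentation of DataFrame.replace: when to_replace is a scalar and value is None, the method argument (default 'pad') is used, so the NaN is forward-filled rather than replaced with None. The sketch below is one commonly suggested workaround, not necessarily the only one. It assumes the column may be cast to object dtype, since a float column cannot store None.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [10, 20, np.nan], 'y': [30, 40, 50]})

# A float column cannot hold None, so cast to object first,
# then keep the non-missing cells and put None everywhere else.
result = df.astype(object).where(df.notna(), None)
print(result)
```

Note that the resulting columns are object dtype, so numeric operations on them will be slower than on the original float column.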

Related

Calculation of percentile and mean

I want to find the 3rd percentile of the following data and then take the average of the data.
The data structure is shown below.
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
96927 NaN
96928 NaN
96929 NaN
96930 NaN
96931 NaN
The relevant data lies in rows 13240:61156.
My code is given below:
import pandas as pd
import numpy as np
load_var=pd.read_excel(r'path\file name.xlsx')
load_var
a=pd.DataFrame(load_var['column whose percentile is to be found'])
print(a)
b=np.nanpercentile(a,3)
print(b)
Please suggest the changes in the code.
Thank you.
Use Series.quantile with mean in Series.agg:
df = pd.DataFrame({
'col':[7,8,9,4,2,3, np.nan],
})
f = lambda x: x.quantile(0.03)
f.__name__ = 'q'
s = df['col'].agg(['mean', f])
print (s)
mean 5.50
q 2.15
Name: col, dtype: float64
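If only a known row range holds data, as in the question (13240:61156), the same two statistics can be taken on a positional slice. This is a minimal sketch with a small stand-in Series; the names col, start and stop are hypothetical and the values are the sample data from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel column: NaN padding around the real values.
col = pd.Series([np.nan, np.nan, 7, 8, 9, 4, 2, 3, np.nan, np.nan])

# Hypothetical positions of the rows that hold data (13240:61156 in the question).
start, stop = 2, 8
window = col.iloc[start:stop]

p3 = window.quantile(0.03)  # 3rd percentile; NaN values are skipped
avg = window.mean()         # mean of the same rows; NaN values are skipped
print(p3, avg)
```

Because both quantile and mean skip NaN by default, slicing mainly matters when rows outside the range contain values that should be excluded.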

How to replace Specific values of a particular column in Pandas Dataframe based on a certain condition?

I have a pandas DataFrame that contains students and the percentage of marks each of them obtained. Some students' marks are shown as greater than 100%. These values are obviously incorrect, and I would like to replace every percentage greater than 100% with NaN.
I have tried some code, but it does not do quite what I want.
import numpy as np
import pandas as pd
new_DF = pd.DataFrame({'Student' : ['S1', 'S2', 'S3', 'S4', 'S5'],
                       'Percentages' : [85, 70, 101, 55, 120]})
# Percentages Student
#0 85 S1
#1 70 S2
#2 101 S3
#3 55 S4
#4 120 S5
new_DF[(new_DF.iloc[:, 0] > 100)] = np.NaN
# Percentages Student
#0 85.0 S1
#1 70.0 S2
#2 NaN NaN
#3 55.0 S4
#4 NaN NaN
As you can see, the code kind of works, but it replaces every value in the rows where Percentages is greater than 100 with NaN. I would like to replace only the value in the Percentages column where it is greater than 100. Is there any way to do that?
Try and use np.where:
new_DF.Percentages=np.where(new_DF.Percentages.gt(100),np.nan,new_DF.Percentages)
or
new_DF.loc[new_DF.Percentages.gt(100),'Percentages']=np.nan
print(new_DF)
Student Percentages
0 S1 85.0
1 S2 70.0
2 S3 NaN
3 S4 55.0
4 S5 NaN
Also,
new_DF.Percentages = new_DF.Percentages.apply(lambda x: np.nan if x > 100 else x)
or,
new_DF.Percentages = new_DF.Percentages.where(new_DF.Percentages <= 100, np.nan)
You can use .loc:
new_DF.loc[new_DF['Percentages']>100, 'Percentages'] = np.NaN
Output:
Student Percentages
0 S1 85.0
1 S2 70.0
2 S3 NaN
3 S4 55.0
4 S5 NaN
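import numpy as np
import pandas as pd
new_DF = pd.DataFrame({'Student' : ['S1', 'S2', 'S3', 'S4', 'S5'],
                       'Percentages' : [85, 70, 101, 55, 120]})
# NaN is a float, so make the column float before writing into it
new_DF['Percentages'] = new_DF['Percentages'].astype(float)
for index, value in enumerate(new_DF['Percentages']):
    if value > 100:
        # write through .loc to avoid chained assignment; use a real NaN, not the string "nan"
        new_DF.loc[index, 'Percentages'] = np.nan
print(new_DF)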
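Another option in the same spirit as the answers above is Series.mask, which replaces values where the condition is True and leaves the rest alone. A minimal sketch using the question's data:

```python
import numpy as np
import pandas as pd

new_DF = pd.DataFrame({'Student': ['S1', 'S2', 'S3', 'S4', 'S5'],
                       'Percentages': [85, 70, 101, 55, 120]})

# mask() puts NaN (the default replacement) wherever the condition is True
new_DF['Percentages'] = new_DF['Percentages'].mask(new_DF['Percentages'] > 100)
print(new_DF)
```

A value of exactly 100 is kept, because the condition is strictly greater than 100. Only the Percentages column is touched, so the Student column stays intact.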

How to Animate multiple columns as dots with matplotlib from pandas dataframe with NaN in python

I have a question that may have been asked before, but I couldn't find any post describing my problem.
I have two pandas dataframes with the same index, one holding x coordinates and the other y coordinates. Each column represents a car that started at a specific timestep, was logged at every step until it arrived, and then stopped being logged.
Every time a car starts on its route, a column is added to each dataframe, and the coordinates of each step are added to each frame (at every step the car moves through space and therefore has new x,y coordinates). See the example of the x coordinates dataframe.
I am trying to animate the tracks of each car by plotting the coordinates in an animated graph, but I cannot get it to work. My code:
%matplotlib notebook
from matplotlib import animation
from JSAnimation import IPython_display
from IPython.display import HTML
fig = plt.figure(figsize=(10,10))
#ax = plt.axes()
#nx.draw_networkx_edges(w_G,nx.get_node_attributes(w_G, 'pos'))
n_steps = simulation.x_df.index
def init():
    graph, = plt.plot([],[],'o')
    return graph,
def get_data_x(i):
    return simulation.x_df.loc[i]
def get_data_y(i):
    return simulation.y_df.loc[i]
def animate(i):
    x = get_data_x(i)
    y = get_data_y(i)
    graph.set_data(x,y)
    return graph,
animation.FuncAnimation(fig, animate, frames=100, init_func = init, repeat=True)
plt.show()
It does not plot anything, so any help would be very much appreciated.
EDIT: Minimal, Complete, and Verifiable example!
So two simple examples of the x and y dataframes that I have. Each has the same index.
import random
import geopandas as gpd
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import math
import pandas as pd
from shapely.geometry import Point
from matplotlib import animation
from JSAnimation import IPython_display
%matplotlib inline
[IN]: df_x = pd.DataFrame(data=np.array([[np.NaN, np.NaN, np.NaN, np.NaN], [4, np.nan, np.NaN,np.NaN], [7, 12, np.NaN,np.NaN], [6, 18, 12,9]]), index= [1, 2, 3, 4], columns=[1, 2, 3, 4])
gives:
[OUT]
1 2 3 4
1 NaN NaN NaN NaN
2 4.0 NaN NaN NaN
3 7.0 12.0 NaN NaN
4 6.0 18.0 12.0 9.0
And the y coordinate dataframe:
[IN] df_y = pd.DataFrame(data=np.array([[np.NaN, np.NaN, np.NaN, np.NaN], [6, np.nan, np.NaN,np.NaN], [19, 2, np.NaN,np.NaN], [4, 3, 1,12]]), index= [1, 2, 3, 4], columns=[1, 2, 3, 4])
gives:
[OUT]
1 2 3 4
1 NaN NaN NaN NaN
2 6.0 NaN NaN NaN
3 19.0 2.0 NaN NaN
4 4.0 3.0 1.0 12.0
Now I want to create an animation in which each frame plots, for one row of both dataframes, the x and y coordinates of every column. In this example, frame 1 should not contain any points. Frame 2 should plot point (4.0, 6.0) (column 1). Frame 3 should plot point (7.0, 19.0) (column 1) and point (12.0, 2.0) (column 2). Frame 4 should plot point (6.0, 4.0) (column 1), point (18.0, 3.0) (column 2), point (12.0, 1.0) (column 3) and point (9.0, 12.0) (column 4).
I tried writing the following code to animate this:
[IN] %matplotlib notebook
from matplotlib import animation
from JSAnimation import IPython_display
from IPython.display import HTML
fig = plt.figure(figsize=(10,10))
#ax = plt.axes()
graph, = plt.plot([],[],'o')
def get_data_x(i):
    return df_x.loc[i]
def get_data_y(i):
    return df_y.loc[i]
def animate(i):
    x = get_data_x(i)
    y = get_data_y(i)
    graph.set_data(x,y)
    return graph,
animation.FuncAnimation(fig, animate, frames=4, repeat=True)
plt.show()
But this does not give any output. Any suggestions?
I've reformatted your code, but I think your main issue is that your dataframes start at index 1, while calling the animation with frames=4 invokes the update function with i = 0, 1, 2, 3. As a result, get_data_x(0) raises KeyError: 'the label [0] is not in the [index]'.
As per the documentation, frames= accepts an iterable instead of an int. Here I pass the intersection of your two dataframe indexes, so update() is called once per shared label, and a label that is present in only one dataframe will not raise an error. If you are guaranteed that the two indexes are identical, you can simply pass frames=df_x.index.
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import animation
x_ = """ 1 2 3 4
1 NaN NaN NaN NaN
2 4.0 NaN NaN NaN
3 7.0 12.0 NaN NaN
4 6.0 18.0 12.0 9.0
"""
y_ = """1 2 3 4
1 NaN NaN NaN NaN
2 6.0 NaN NaN NaN
3 19.0 2.0 NaN NaN
4 4.0 3.0 1.0 12.0"""
# raw string so that \s is passed to the regex engine unchanged
df_x = pd.read_table(StringIO(x_), sep=r'\s+')
df_y = pd.read_table(StringIO(y_), sep=r'\s+')
fig, ax = plt.subplots(figsize=(5, 5))
graph, = ax.plot([],[], 'o')
# either set up sensible limits here that won't change during the animation
# or see the comment in function `update()`
ax.set_xlim(0,20)
ax.set_ylim(0,20)
def get_data_x(i):
    return df_x.loc[i]
def get_data_y(i):
    return df_y.loc[i]
def update(i):
    x = get_data_x(i)
    y = get_data_y(i)
    graph.set_data(x,y)
    # if you don't know the range of your data, you can use the following
    # instructions to rescale the axes.
    #ax.relim()
    #ax.autoscale_view()
    return graph,
# Creating the Animation object; keep a reference so it is not garbage collected
ani = animation.FuncAnimation(fig, update,
                              frames=df_x.index.intersection(df_y.index),
                              interval=500, blit=False)
plt.show()
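To check what each frame receives when the index is passed as frames, the same iteration can be run without any plotting. This is only a sketch for inspection; the helper name points_for_frame is hypothetical, and the data is the example from the question.

```python
import numpy as np
import pandas as pd

df_x = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan],
                     [4, np.nan, np.nan, np.nan],
                     [7, 12, np.nan, np.nan],
                     [6, 18, 12, 9]],
                    index=[1, 2, 3, 4], columns=[1, 2, 3, 4])
df_y = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan],
                     [6, np.nan, np.nan, np.nan],
                     [19, 2, np.nan, np.nan],
                     [4, 3, 1, 12]],
                    index=[1, 2, 3, 4], columns=[1, 2, 3, 4])

def points_for_frame(i):
    """Return the (x, y) pairs of the cars that have coordinates at step i."""
    # Put the x row and the y row side by side, then drop cars without data.
    pts = pd.concat([df_x.loc[i], df_y.loc[i]], axis=1, keys=['x', 'y']).dropna()
    return list(zip(pts['x'], pts['y']))

# Same iterable that is handed to FuncAnimation as frames=...
frames = df_x.index.intersection(df_y.index)
for i in frames:
    print(i, points_for_frame(i))
```

The first frame yields an empty list, which matches the expectation in the question that frame 1 shows no points.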

subselect columns pandas with count

I have the table below.
I am trying to create an additional column that counts how many of Std_1, Std_2 and Std_3 are greater than the row's mean value.
For example, in the ACCMGR row only Std_2 is greater than the average, so the new column should be 1.
I am not sure how to do it.
You need to be a bit careful with how you specify the axes, but you can just use .gt + .mean + .sum
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'APPL': ['ACCMGR', 'ACCOUNTS', 'ADVISOR', 'AUTH', 'TEST'],
                   'Std_1': [106.875, 121.703, np.NaN, 116.8585, 1],
                   'Std_2': [130.1899, 113.4927, np.NaN, 112.4486, 4],
                   'Std_3': [107.186, 114.5418, np.NaN, 115.2699, np.NaN]})
Code
df = df.set_index('APPL')
df['cts'] = df.gt(df.mean(axis=1), axis=0).sum(axis=1)
df = df.reset_index()
Output:
APPL Std_1 Std_2 Std_3 cts
0 ACCMGR 106.8750 130.1899 107.1860 1
1 ACCOUNTS 121.7030 113.4927 114.5418 1
2 ADVISOR NaN NaN NaN 0
3 AUTH 116.8585 112.4486 115.2699 2
4 TEST 1.0000 4.0000 NaN 1
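The care with the axes comes from alignment: df.mean(axis=1) gives one value per row, and gt(..., axis=0) is what lines those values up with the row index instead of the column names. A small sketch with two of the rows from the sample data, split into steps (the intermediate names row_mean and above are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Std_1': [106.875, 1.0],
                   'Std_2': [130.1899, 4.0],
                   'Std_3': [107.186, np.nan]},
                  index=['ACCMGR', 'TEST'])

row_mean = df.mean(axis=1)       # one mean per row; NaN is skipped
above = df.gt(row_mean, axis=0)  # align row_mean on the row index
cts = above.sum(axis=1)          # count the True values in each row
print(cts)
```

A NaN cell compares as False, so it never adds to the count.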
Consider the following DataFrame:
quantity price
0 6 1.45
1 3 1.85
2 2 2.25
Apply a lambda function along axis=1: for each row, select the values greater than the row mean and return the column position of the first one.
df.apply(lambda x:df.columns.get_loc(x[x>np.mean(x)].index[0]),axis=1)
Out:
quantity price > than mean
0 6 1.45 0
1 3 1.85 0
2 2 2.25 1

Filling in nans for numbers in a column-specific way

Given a DataFrame and a list of index labels (one per column), is there an efficient pandas function that puts NaN in every cell that lies vertically above the row with that label in the corresponding column?
For example, suppose we have the list [4,8] and the following DataFrame:
index 0 1
5 1 2
2 9 3
4 3.2 3
8 9 8.7
The desired output is simply:
index 0 1
5 nan nan
2 nan nan
4 3.2 nan
8 9 8.7
Any suggestions for such a function that does this fast?
Here's one NumPy approach based on np.searchsorted -
s = [4,8]
a = df.values
idx = df.index.values
sidx = np.argsort(idx)
matching_row_indx = sidx[np.searchsorted(idx, s, sorter = sidx)]
mask = np.arange(a.shape[0])[:,None] < matching_row_indx
a[mask] = np.nan
Sample run -
In [107]: df
Out[107]:
0 1
index
5 1.0 2.0
2 9.0 3.0
4 3.2 3.0
8 9.0 8.7
In [108]: s = [4,8]
In [109]: a = df.values
...: idx = df.index.values
...: sidx = np.argsort(idx)
...: matching_row_indx = sidx[np.searchsorted(idx, s, sorter = sidx)]
...: mask = np.arange(a.shape[0])[:,None] < matching_row_indx
...: a[mask] = np.nan
...:
In [110]: df
Out[110]:
0 1
index
5 NaN NaN
2 NaN NaN
4 3.2 NaN
8 9.0 8.7
It was a bit tricky to recreate your example, but this should do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'index': [5, 2, 4, 8], 0: [1, 9, 3.2, 9], 1: [2, 3, 3, 8.7]})
df.set_index('index', inplace=True)
# column i is blanked out for every row above the row labelled item
for i, item in enumerate([4, 8]):
    for index in df.index:
        if index == item:
            break
        # write through df.loc; assigning to the row yielded by iterrows() may only change a copy
        df.loc[index, i] = np.nan
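For comparison, the same result can be sketched with a boolean mask and DataFrame.mask, which avoids both the Python loop and writing into df.values. This assumes the index labels are unique and that the list holds exactly one label per column, in column order; the names labels, stops and above are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1, 9, 3.2, 9], 1: [2, 3, 3, 8.7]},
                  index=pd.Index([5, 2, 4, 8], name='index'))
labels = [4, 8]  # one index label per column, in column order

# Row position of each label in the (unique) index.
stops = df.index.get_indexer(labels)
# True for every cell that sits above its column's stop row.
above = np.arange(len(df))[:, None] < stops
# mask() returns a new frame with NaN wherever the condition is True.
result = df.mask(above)
print(result)
```

Unlike the in-place NumPy version, this returns a new DataFrame and leaves df unchanged.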
