I have this DataFrame:
     a    b    c    d    e    f    g    h
o    1    1  nan  nan  nan  nan  nan  nan
o  nan  nan    2    2  nan  nan  nan  nan
o  nan  nan  nan  nan    3    3  nan  nan
o  nan  nan  nan  nan  nan  nan    4    4
I want to convert it into this DataFrame:
   a  b  c  d  e  f  g  h
o  1  1  2  2  3  3  4  4
How can I do this with pandas? I tried numpy's diagonal, but it failed.
Another possible solution:
a = df.values.flatten()
pd.DataFrame(a[~np.isnan(a)].reshape(-1,df.shape[1]), columns=df.columns)
Output:
a b c d e f g h
0 1.0 1.0 2.0 2.0 3.0 3.0 4.0 4.0
Use:
df = df.replace('nan', np.nan).groupby(level=0).first()
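To see both suggestions side by side, here is a minimal sketch that rebuilds the frame from the question, assuming every row carries the same index label "o" and the gaps are real NaN values rather than the string 'nan'. The names collapsed and flattened are only for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Rebuild the question's frame: four rows that all share the index label "o"
df = pd.DataFrame(
    [[1, 1, nan, nan, nan, nan, nan, nan],
     [nan, nan, 2, 2, nan, nan, nan, nan],
     [nan, nan, nan, nan, 3, 3, nan, nan],
     [nan, nan, nan, nan, nan, nan, 4, 4]],
    index=["o"] * 4,
    columns=list("abcdefgh"),
)

# groupby(level=0).first() keeps the first non-NaN value per column in each index group
collapsed = df.groupby(level=0).first()

# Flatten approach: drop NaN from the row-major values and reshape to the original width
values = df.to_numpy().flatten()
flattened = pd.DataFrame(
    values[~np.isnan(values)].reshape(-1, df.shape[1]),
    columns=df.columns,
)

print(collapsed)
print(flattened)
```

The groupby version aligns values by column, while the flatten version relies on the non-NaN values already appearing in column order and on their count being a multiple of the number of columns, so it is the more fragile of the two.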
I'm reading this data from an excel file:
   a     b
0  x  y  x  y
1  0  1  2  3
2  0  1  2  3
3  0  1  2  3
4  0  1  2  3
5  0  1  2  3
For each of the a and b categories (a.k.a. samples), there are two columns of x and y values. I want to convert this Excel data into a DataFrame that looks like this (concatenating the data from samples a and b vertically):
sample x y
0 a 0.0 1.0
1 a 0.0 1.0
2 a 0.0 1.0
3 a 0.0 1.0
4 a 0.0 1.0
5 b 2.0 3.0
6 b 2.0 3.0
7 b 2.0 3.0
8 b 2.0 3.0
9 b 2.0 3.0
I've written the following code:
x = np.arange(0, 4, 2)      # create a variable that allows to select even columns
sample_df = pd.DataFrame()  # create an empty DataFrame
for i in x:                 # looping through the excel data
    sample = pd.read_excel(xls2, usecols=[i,i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i,i+1], nrows=5, header=1)
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)
display(sample_df)
But this is the output I obtain:
sample x y x.1 y.1
0 a 0.0 1.0 NaN NaN
1 a 0.0 1.0 NaN NaN
2 a 0.0 1.0 NaN NaN
3 a 0.0 1.0 NaN NaN
4 a 0.0 1.0 NaN NaN
5 b NaN NaN 2.0 3.0
6 b NaN NaN 2.0 3.0
7 b NaN NaN 2.0 3.0
8 b NaN NaN 2.0 3.0
9 b NaN NaN 2.0 3.0
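One possible way around this, as a sketch: the x.1 and y.1 columns in the output above are the duplicated header names after pandas renamed them, so the two slices no longer share column labels and concat cannot line them up. If the sheet were instead read once with a two-row header (for example pd.read_excel(xls2, header=[0, 1]), which is an assumption about how the file is laid out), each sample could be picked by its top-level label. The frame below is a hand-built stand-in for that result, and the names wide and long_df are hypothetical.

```python
import pandas as pd

# Hand-built stand-in for what a two-level header read is assumed to return:
# top level = sample name (a, b), second level = x / y
columns = pd.MultiIndex.from_product([["a", "b"], ["x", "y"]])
wide = pd.DataFrame([[0.0, 1.0, 2.0, 3.0]] * 5, columns=columns)

# Select each sample's x/y block, tag it with the sample name, and stack the blocks
samples = wide.columns.get_level_values(0).unique()
parts = [wide[s].assign(sample=s) for s in samples]
long_df = pd.concat(parts, ignore_index=True)[["sample", "x", "y"]]

print(long_df)
```

Because every block ends up with the same x and y labels, concat aligns them into two columns instead of four.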
I have a DataFrame with some NaN values. Some of its rows contain only NaN values. When I apply the sum function to these rows, it returns zero instead of NaN. The code is as follows:
df = pd.DataFrame(np.random.randint(10,60,size=(5,3)),
                  index = ['a','c','e','f','h'],
                  columns = ['One','Two','Three'])
df = df.reindex(index=['a','b','c','d','e','f','g','h'])
print(df.loc['b'].sum())
Any suggestions?
By default, the sum function treats NaN values as 0.
If you want the sum of an all-NaN row to be NaN, require at least one non-NaN value with min_count:
df.loc['b'].sum(min_count=1)
Output:
nan
If you apply it to all rows (after using reindex), you will get the following:
df.sum(axis=1,min_count=1)
a 137.0
b NaN
c 79.0
d NaN
e 132.0
f 95.0
g NaN
h 81.0
dtype: float64
If you now replace one NaN value in a row:
df.at['b','One']=0
print(df)
One Two Three
a 54.0 20.0 29.0
b 0.0 NaN NaN
c 13.0 24.0 27.0
d NaN NaN NaN
e 28.0 53.0 25.0
f 46.0 55.0 50.0
g NaN NaN NaN
h 47.0 26.0 48.0
df.sum(axis=1,min_count=1)
a 103.0
b 0.0
c 64.0
d NaN
e 106.0
f 151.0
g NaN
h 121.0
dtype: float64
As you can see, the result for row b is now 0.
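Since the frames above are built from random numbers, here is a small fixed example of the same behaviour, as a sketch with made-up values; the names default_sum and strict_sum are only for illustration.

```python
import numpy as np
import pandas as pd

# Row "b" is entirely NaN; row "c" has one real value (0.0) and one NaN
df = pd.DataFrame(
    {"One": [1.0, np.nan, 0.0], "Two": [2.0, np.nan, np.nan]},
    index=["a", "b", "c"],
)

# Default behaviour: NaN values are skipped, so an all-NaN row sums to 0.0
default_sum = df.sum(axis=1)

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN
strict_sum = df.sum(axis=1, min_count=1)

print(default_sum)
print(strict_sum)
```

Row c still sums to 0.0 with min_count=1 because it contains one valid value, which is what distinguishes a genuine zero from an empty row.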
If I have a DataFrame like this:
A B C
Nan 1.0 0.0
1.0 Nan 1.0
1.0 0.0 Nan
I want to create a new column in the DataFrame that shows which columns in each row contain NaN values.
A B C Col4
Nan 1.0 Nan A,C
1.0 Nan 1.0 B
1.0 Nan Nan B,C
Any help?
Build a boolean mask with DataFrame.isna, multiply it with the column names using DataFrame.dot, and finally remove the trailing comma with Series.str.rstrip:
df['col4'] = df.isna().dot(df.columns + ',').str.rstrip(',')
#if values are strings Nan
#df['col4'] = df.eq('Nan').dot(df.columns + ',').str.rstrip(',')
print (df)
A B C col4
0 NaN 1.0 NaN A,C
1 1.0 NaN 1.0 B
2 1.0 NaN NaN B,C
Naive approach:
def f(r):
    ret=[]
    if(r['A']=='Nan'): ret.append('A')
    if(r['B']=='Nan'): ret.append('B')
    if(r['C']=='Nan'): ret.append('C')
    return ','.join(ret)
df['D'] = df.apply(f, axis=1)
print(df)
A B C
0 Nan 1.0 Nan
1 1.0 Nan 1.0
2 1.0 Nan Nan
A B C D
0 Nan 1.0 Nan A,C
1 1.0 Nan 1.0 B
2 1.0 Nan Nan B,C
I tested this on strings; for real NaN values, swap the string comparison for a NaN check such as pd.isna.
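For real NaN values the equality test has to change, because a comparison like r['A'] == np.nan is always False. A minimal sketch of the same row-wise idea using pd.isna, with a hypothetical helper name nan_columns and data matching the output above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 1.0, 1.0],
    "B": [1.0, np.nan, np.nan],
    "C": [np.nan, 1.0, np.nan],
})

def nan_columns(row):
    # pd.isna detects real NaN values, which == np.nan cannot do
    return ",".join(col for col in row.index if pd.isna(row[col]))

# Apply row by row; the result lists the column names holding NaN
df["D"] = df.apply(nan_columns, axis=1)
print(df)
```

This version also avoids hard-coding the column names, at the cost of being slower than a vectorised approach on large frames.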
I have a couple of toy dataframes I can stack using df.append, but I need to keep the source dataframes as a column, as well. I can't seem to find anything about how to do that. Here's what I do have:
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]})
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
d2005
A B C G
0 1 2 3 7
1 2 4 5 8
2 3 5 7 9
3 4 6 8 10
d2006
A B D F
0 2 3 a 7
1 1 1 c 8
2 4 5 d 10
3 5 6 e 12
Then I can stack them like this:
d_combined = d2005.append(d2006, ignore_index = True, sort = True)
d_combined
A B C D F G
0 1 2 3.0 NaN NaN 7.0
1 2 4 5.0 NaN NaN 8.0
2 3 5 7.0 NaN NaN 9.0
3 4 6 8.0 NaN NaN 10.0
4 2 3 NaN a 7.0 NaN
5 1 1 NaN c 8.0 NaN
6 4 5 NaN d 10.0 NaN
7 5 6 NaN e 12.0 NaN
But what I really need is another column with the source dataframe added to the right end of d_combined. Something like this:
A B C D G F From
0 1 2 3.0 NaN 7.0 NaN d2005
1 2 4 5.0 NaN 8.0 NaN d2005
2 3 5 7.0 NaN 9.0 NaN d2005
3 4 6 8.0 NaN 10.0 NaN d2005
4 2 3 NaN a NaN 7.0 d2006
5 1 1 NaN c NaN 8.0 d2006
6 4 5 NaN d NaN 10.0 d2006
7 5 6 NaN e NaN 12.0 d2006
Hopefully someone has a quick trick they can share.
Thanks.
This gets what you want but there should be a more elegant way:
df_list = [d2005, d2006]
name_list = ['2005', '2006']
for df, name in zip(df_list, name_list):
    df['from'] = name
Then
d_combined = pd.concat([d2005, d2006], ignore_index=True, sort=True)  # DataFrame.append was removed in pandas 2.0
d_combined
A B C D F G from
0 1 2 3.0 NaN NaN 7.0 2005
1 2 4 5.0 NaN NaN 8.0 2005
2 3 5 7.0 NaN NaN 9.0 2005
3 4 6 8.0 NaN NaN 10.0 2005
4 2 3 NaN a 7.0 NaN 2006
5 1 1 NaN c 8.0 NaN 2006
6 4 5 NaN d 10.0 NaN 2006
7 5 6 NaN e 12.0 NaN 2006
Alternatively, you can set df.name at the time of creation of the df and use it in the for loop.
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]} )
d2005.name = 2005
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
d2006.name = 2006
df_list = [d2005, d2006]
for df in df_list:
    df['from'] = df.name
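If you would rather not modify the source frames, one option is to tag each frame while concatenating. This is a sketch with shortened versions of the frames; the dictionary name frames is hypothetical, and pd.concat is used because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Shortened versions of the two source frames
d2005 = pd.DataFrame({"A": [1, 2], "C": [3, 5]})
d2006 = pd.DataFrame({"A": [2, 1], "D": ["a", "c"]})

# Map each label to its frame; assign() returns a tagged copy, leaving the originals untouched
frames = {"d2005": d2005, "d2006": d2006}
d_combined = pd.concat(
    [df.assign(From=name) for name, df in frames.items()],
    ignore_index=True,
    sort=False,
)

# Move the From column to the right end
other_cols = [c for c in d_combined.columns if c != "From"]
d_combined = d_combined[other_cols + ["From"]]
print(d_combined)
```

Keeping the labels in a dictionary means adding another year is a one-line change.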
I believe this can be achieved simply by adding the From column to the original dataframes themselves.
So effectively,
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]})
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
Then,
d2005['From'] = 'd2005'
d2006['From'] = 'd2006'
And then you append,
d_combined = pd.concat([d2005, d2006], ignore_index = True, sort = True)  # DataFrame.append was removed in pandas 2.0
gives you something like this: