I have a pandas dataframe as below:
import numpy as np
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [2, 17, 17],
        'C1': ["C1", np.nan, np.nan],
        'C2': [np.nan, "C2", np.nan]}
# Create DataFrame
df = pd.DataFrame(data)
df
A B C1 C2
0 1 2 C1 NaN
1 2 17 NaN C2
2 3 17 NaN NaN
I want to create a column "C" based on "C1" and "C2" (there could also be "C4", "C5", and so on). If any of the C columns has a value in a row, "C" should take that value. My output in this case should look like this:
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
Try this
df1 = df.filter(regex=r'^C\d+')
df['C'] = df1[df1.isin(df1.columns)].bfill(axis=1).iloc[:,0]
Out[117]:
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
If you want to keep a value only when it matches the name of its own column, use eq instead of isin:
df['C'] = df1[df1.eq(df1.columns, axis=1)].bfill(axis=1).iloc[:,0]
IIUC
df['C']=df.filter(like='C').bfill(axis=1).iloc[:,0]
df
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
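A minimal, self-contained sketch of the back-fill idea, assuming the sample frame from the question. The anchored regex is an assumption added here so that an already existing "C" column is not picked up when the code is run a second time.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [2, 17, 17],
    "C1": ["C1", np.nan, np.nan],
    "C2": [np.nan, "C2", np.nan],
})

# Select only C1, C2, ... (anchored regex, so a plain "C" column is ignored).
c_cols = df.filter(regex=r"^C\d+$")

# Back-fill along each row, then take the first column:
# this leaves the first non-missing value per row, or NaN if there is none.
df["C"] = c_cols.bfill(axis=1).iloc[:, 0]
print(df)
```

If a row holds values in several C columns, this keeps only the leftmost one.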
IIUC,
we can filter your columns with the pattern "C" followed by digits, then join the values of each row with an agg call:
df['C'] = df.filter(regex=r'C\d+').stack().groupby(level=0).agg(','.join)
print(df)
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
I have a dict like this:
A B C D E F G H I J
0 A.1 Data Data 223 52
1 A.2 Data Data Data 12 6
2 A.4 Data 32 365
3 A.5 Data 100 88
4 A.6 Data 654 98
5 A.7 Data 356 56
And my desired output is like this:
A B C D E F G H I J
0 A.1 Data Data 223 52
1 A.2 Data Data Data 12 6
2 A.4 Data 32 365
3 A.5 Data 100 88
4 A.6 Data 654 98
5 A.7 Data 356 56
Only columns A to E should have their null values shifted. My current script uses a lambda, but it shifts the null values to the last column across the whole DataFrame. I need this for certain columns only. Can anyone help me? Thank you!
def shift_null(arr):
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]
df = df.T.apply(lambda arr: shift_null(arr)).T
You can remove the missing values of each row with Series.dropna, add back any columns that ended up entirely missing with DataFrame.reindex, and then set the column names from the list with DataFrame.set_axis:
cols = ['A','B','C','D','E']
df[cols] = (df[cols].apply(lambda x: pd.Series(x.dropna().tolist()), axis=1)
.reindex(range(len(cols)), axis=1)
.set_axis(cols, axis=1))
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
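For readers who want to run this end to end, here is a small sketch of the same idea on made-up data. The positions of the gaps in columns A to E are hypothetical (the question's table does not show them), and `left_align` is an illustrative helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the gaps in columns A-E are placed arbitrarily.
df = pd.DataFrame({
    "A": ["A.1", np.nan, "A.4"],
    "B": [np.nan, "A.2", np.nan],
    "C": ["Data", "Data", np.nan],
    "D": [np.nan, "Data", "Data"],
    "E": ["Data", "Data", np.nan],
    "F": [223, 12, 32],
    "G": [52, 6, 365],
})

cols = ["A", "B", "C", "D", "E"]

def left_align(row):
    """Return the row's non-missing values first, padded with NaN on the right."""
    values = row.dropna().tolist()
    return values + [np.nan] * (len(row) - len(values))

# result_type="expand" turns each returned list into columns 0..4;
# set_axis restores the original names before writing the block back.
shifted = df[cols].apply(left_align, axis=1, result_type="expand")
df[cols] = shifted.set_axis(cols, axis=1)
print(df)
```

Columns F and G are never passed to `apply`, so they stay where they are.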
Your solution also works if you remove the transposing and pass axis=1 and result_type='expand' to DataFrame.apply:
cols = ['A','B','C','D','E']
def shift_null(arr):
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]
df[cols] = df[cols].apply(lambda arr: shift_null(arr), axis=1, result_type='expand')
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
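Another idea is sorting each row with the key parameter, so that missing values go last (kind='mergesort' is a stable sort, so the non-missing values keep their original order):
cols = ['A','B','C','D','E']
df[cols] = df[cols].apply(lambda x: x.sort_values(key=lambda x: x.isna(), kind='mergesort').tolist(),
                          axis=1, result_type='expand')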
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
A solution that reshapes with DataFrame.stack, adds a per-row counter for the new column names, and finally reshapes back with Series.unstack:
s = df[cols].stack().droplevel(1)
s.index = [s.index, s.groupby(level=0).cumcount()]
df[cols] = s.unstack().rename(dict(enumerate(cols)), axis=1).reindex(cols, axis=1)
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
I have two DataFrames d1 and d2.
d1:
category value
0 a 4
1 b 9
2 c 14
3 d 19
4 e 24
5 f 29
d2:
one two
0 NaN a
1 NaN a
2 NaN c
3 NaN d
4 NaN e
5 NaN a
I want to map the values from d1 into the 'one' column of d2, using the category in column 'two' as the key into d1.
This should return:
one two
0 4 a
1 4 a
2 14 c
3 19 d
4 24 e
5 4 a
Try:
d2['one'] = d2['two'].map(d1.set_index('category')['value'])
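A self-contained sketch of this lookup, assuming the frames are named d1 and d2 as in the question; `lookup` is just an illustrative variable name.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({"category": list("abcdef"), "value": [4, 9, 14, 19, 24, 29]})
d2 = pd.DataFrame({"one": np.nan, "two": ["a", "a", "c", "d", "e", "a"]})

# A Series indexed by category acts as a lookup table for map().
lookup = d1.set_index("category")["value"]

# Any category in d2 that is missing from d1 would come out as NaN.
d2["one"] = d2["two"].map(lookup)
print(d2)
```

This relies on the categories in d1 being unique; duplicated categories would make the lookup ambiguous.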
I have a pandas dataframe as below:
data = {'A' : [1,2,3],
'B' : [2,17,17],
'C1' : ["C1", np.nan,np.nan],
'C2' : ["C2", "C2",np.nan]}
# Create DataFrame
df = pd.DataFrame(data)
Dataframe:
A B C1 C2
0 1 2 C1 C2
1 2 17 NaN C2
2 3 17 NaN NaN
I am creating a column "C" with the logic and code below:
If any of the C columns (C1, C2, C3, ...) has a value, "C" takes the values from those columns, joined by commas.
df['C'] = df.filter(regex=r'C\d+').stack().groupby(level=0).agg(','.join)
Result:
A B C1 C2 C
0 1 2 C1 C2 C1,C2
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
Now, I want to perform below logic
If "C" has more than one value (say C1 and C2) for any row, create a new row for each additional value. So I want my output to look like this:
A B C1 C2 C
0 1 2 C1 C2 C1
0 1 2 C1 C2 C2
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
We can do it by using explode and then concat:
s=df.filter(regex=r'C\d+').stack().groupby(level=0).agg(list).explode().to_frame('C').join(df)
s=pd.concat([s,df[~df.index.isin(s.index)]],axis=0,join='outer',ignore_index=True,sort=False)
s
Out[62]:
C A B C1 C2
0 C1 1 2 C1 C2
1 C2 1 2 C1 C2
2 C2 2 17 NaN C2
3 NaN 3 17 NaN NaN
you could do:
df.merge(df.melt(['A','B'],value_name= 'C').dropna().drop('variable',axis = 1),how = "left")
A B C1 C2 C
0 1 2 C1 C2 C1
1 1 2 C1 C2 C2
2 2 17 NaN C2 C2
3 3 17 NaN NaN NaN
You can just use df.explode(...). Try:
# please note that I aggregate into a list, not a string
df['C'] = df.filter(regex=r'C\d+').stack().groupby(level=0).agg(list)
df = df.explode("C")
Outputs:
A B C1 C2 C
0 1 2 C1 C2 C1
0 1 2 C1 C2 C2
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
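A runnable sketch of the list-then-explode approach on the question's data. The explicit `dropna()` is an addition made here, on the assumption that `stack()` may keep NaN entries depending on the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [2, 17, 17],
    "C1": ["C1", np.nan, np.nan],
    "C2": ["C2", "C2", np.nan],
})

# Collect the non-missing C1, C2, ... values of each row into a list.
values = df.filter(regex=r"^C\d+$").stack().dropna().groupby(level=0).agg(list)

# Rows without any value are absent from `values`, so they get NaN here.
df["C"] = values

# explode() emits one row per list element and repeats the original index label.
out = df.explode("C")
print(out)
```

Call `out.reset_index(drop=True)` afterwards if the repeated index labels are not wanted.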
I'm working with thousands of pd.Series, each with a MultiIndex that has two static levels, a dynamic one, and then timestamps:
import numpy as np
import pandas as pd

start = np.concatenate((np.random.rand(3), [np.nan]*3))
end = np.concatenate(([np.nan]*3, np.random.rand(3)))
# note: the `labels` argument of pd.MultiIndex is called `codes` in current pandas
index1 = pd.MultiIndex(levels = [["X"], ["Y"], ["A"], ["d1","d2","d3","d4","d5","d6"]],
                       codes = [[0,0,0,0,0,0], [0,0,0,0,0,0], [0,0,0,0,0,0], [0,1,2,3,4,5]],
                       names = ["static1", "static2", "dynamo", "timestamps"])
i1_start = pd.Series(start, index=index1, name="col1")
i1_end = pd.Series(end, index=index1, name="col2")
index2 = pd.MultiIndex(levels = [["X"], ["Y"], ["B"], ["d1","d2","d3","d4","d5","d6"]],
                       codes = [[0,0,0,0,0,0], [0,0,0,0,0,0], [0,0,0,0,0,0], [0,1,2,3,4,5]],
                       names = ["static1", "static2", "dynamo", "timestamps"])
i2_start = pd.Series(start, index=index2, name="col1")
i2_end = pd.Series(end, index=index2, name="col2")
data = [i1_start, i1_end, i2_start, i2_end]
df = pd.DataFrame(data).T
df
Here are the results of turning it into a dataframe:
col1 col2 col1 col2
static1 static2 dynamo timestamps
X Y A d1 0.248504 NaN NaN NaN
d2 0.424774 NaN NaN NaN
d3 0.333638 NaN NaN NaN
d4 NaN 0.987744 NaN NaN
d5 NaN 0.093231 NaN NaN
d6 NaN 0.918666 NaN NaN
B d1 NaN NaN 0.248504 NaN
d2 NaN NaN 0.424774 NaN
d3 NaN NaN 0.333638 NaN
d4 NaN NaN NaN 0.987744
d5 NaN NaN NaN 0.093231
d6 NaN NaN NaN 0.918666
I'm looking for advice on how to groupby the series with the same series.names and concat/merge/join them so that the columns line up, instead of having an entire triangle of just null values.
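I think you need concat with axis=1, then group the columns by name and aggregate with sum or max. Recent pandas versions no longer accept level= in sum or max, so transpose, group on the index, and transpose back:
data = [i1_start, i1_end, i2_start, i2_end]
# min_count=1 keeps a cell NaN when all grouped values are NaN (the default would give 0)
df = pd.concat(data, axis=1).T.groupby(level=0).sum(min_count=1).T
# alternative
df = pd.concat(data, axis=1).T.groupby(level=0).max().T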
print (df)
col1 col2
static1 static2 dynamo timestamps
X Y A d1 0.771148 NaN
d2 0.074757 NaN
d3 0.526310 NaN
d4 NaN 0.975088
d5 NaN 0.992226
d6 NaN 0.465135
B d1 0.771148 NaN
d2 0.074757 NaN
d3 0.526310 NaN
d4 NaN 0.975088
d5 NaN 0.992226
d6 NaN 0.465135
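The following is a small sketch of grouping equally named columns, assuming current pandas. It uses fixed numbers instead of random ones and only four timestamps, and `make_index` is a hypothetical helper introduced for brevity.

```python
import numpy as np
import pandas as pd

def make_index(dynamo):
    # Same four levels as in the question, with fewer timestamps.
    return pd.MultiIndex.from_product(
        [["X"], ["Y"], [dynamo], ["d1", "d2", "d3", "d4"]],
        names=["static1", "static2", "dynamo", "timestamps"],
    )

start = [0.1, 0.2, np.nan, np.nan]
end = [np.nan, np.nan, 0.3, 0.4]

series = [
    pd.Series(start, index=make_index("A"), name="col1"),
    pd.Series(end, index=make_index("A"), name="col2"),
    pd.Series(start, index=make_index("B"), name="col1"),
    pd.Series(end, index=make_index("B"), name="col2"),
]

# Side-by-side concat gives duplicate column names (col1, col2, col1, col2).
wide = pd.concat(series, axis=1)

# Transpose so the names become the index, group equal names, and sum.
# min_count=1 keeps a cell NaN when every contributing value is NaN.
merged = wide.T.groupby(level=0).sum(min_count=1).T
print(merged)
```

Summing is safe here only because each cell has at most one non-missing contributor; with overlapping values, `max()` or `first()` may be the better choice.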
How about this?
df.fillna(0).sum(axis=1)
That is, replace NaN with zero and sum all the columns for each row.
I have a DataFrame df and want to fill a column based on conditions applied to other columns.
Structure of df (there are some other columns after ID):
ID ...... col1 col2 col3 col4
1 A1 A1 A1 A1
2 G3 D5
3 R6
4 Q3
5 M5 N8
I want to create two new columns called 'final_col' and 'status', where 'final_col' takes the value from col1, col2, col3 or col4, whichever is the first to hold a non-blank (not null/NaN) value.
The column 'status' is just the name of the column that value came from.
Expected Output:
ID ...... col1 col2 col3 col4 final_col status
1 A1 A1 A1 A1 A1 col1
2 G3 D5 G3 col2
3 R6 L4 R6 col1
4 Not_found Not_found
5 M5 N8 M5 col2
I know how to do this in Excel with nested IFs, like so, assuming ID is in cell 'A1':
In the first row of 'final_col':
=IF(A2<>"",A2,IF(B2<>"",B2,IF(C2<>"",C2,IF(D2<>"",D2,"Not_found"))))
For column 'status'
=IF(A2<>"","col1",IF(B2<>"","col2",IF(C2<>"","col3",IF(D2<>"","col4","Not_found"))))
P.S: Please use column names in your solution, and not index because the structure of the data frame may vary (order of the columns).
You can use first_valid_index. If some rows can contain only NaN values in columns col1 to col4, use:
print(df)
ID col1 col2 col3 col4
0 1 A1 A1 A1 A1
1 2 NaN G3 NaN D5
2 3 R6 NaN NaN NaN
3 4 NaN NaN NaN NaN
4 5 NaN M5 N8 NaN
def f1(x):
    if x.first_valid_index() is None:
        return 'Not_found'
    else:
        return str(x.first_valid_index())
def f2(x):
    if x.first_valid_index() is None:
        return 'Not_found'
    else:
        return x[x.first_valid_index()]
# .ix has been removed from pandas; select every column from col1 onward by label
sub = df.loc[:, 'col1':]
df['status'] = sub.apply(f1, axis=1)
df['final_col'] = sub.apply(f2, axis=1)
print(df)
ID col1 col2 col3 col4 status final_col
0 1 A1 A1 A1 A1 col1 A1
1 2 NaN G3 NaN D5 col2 G3
2 3 R6 NaN NaN NaN col1 R6
3 4 NaN NaN NaN NaN Not_found Not_found
4 5 NaN M5 N8 NaN col2 M5
You could use first_valid_index:
In [105]: df
Out[105]:
ID col1 col2 col3 col4
0 1 A1 A1 A1 A1
1 2 NaN G3 NaN D5
2 3 R6 NaN NaN NaN
3 4 NaN NaN NaN NaN
4 5 NaN M5 N8 NaN
df['status'] = df.iloc[:,1:].apply(lambda x: x.first_valid_index(), axis=1)
df['final_col'] = df.iloc[:, 1:].apply(lambda x: x[x['status']] if pd.notna(x['status']) else 'Not found', axis=1)
df['status'] = df['status'].fillna('Not found')
In [129]: df
Out[129]:
ID col1 col2 col3 col4 status final_col
0 1 A1 A1 A1 A1 col1 A1
1 2 NaN G3 NaN D5 col2 G3
2 3 R6 NaN NaN NaN col1 R6
3 4 NaN NaN NaN NaN Not found Not found
4 5 NaN M5 N8 NaN col2 M5
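Here is a compact sketch that computes both columns in one pass and selects the candidate columns by name, as the question asks. It assumes the blanks are stored as NaN; `candidates` and `first_value` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "col1": ["A1", np.nan, "R6", np.nan, np.nan],
    "col2": ["A1", "G3", np.nan, np.nan, "M5"],
    "col3": ["A1", np.nan, np.nan, np.nan, "N8"],
    "col4": ["A1", "D5", np.nan, np.nan, np.nan],
})

# Columns are selected by name, so their position in the frame does not matter.
candidates = ["col1", "col2", "col3", "col4"]

def first_value(row):
    """Return (value, column name) of the first non-missing entry in the row."""
    col = row.first_valid_index()
    if col is None:
        return pd.Series(["Not_found", "Not_found"])
    return pd.Series([row[col], col])

# apply() with a Series-returning function yields a frame with columns 0 and 1.
result = df[candidates].apply(first_value, axis=1)
df["final_col"] = result[0]
df["status"] = result[1]
print(df)
```

If the blanks are empty strings instead, converting them first with `df.replace("", np.nan)` would be one way to make this apply.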