Manipulating a column in pandas dataframe - python-3.x

I have a pandas dataframe as below:
data = {'A' : [1,2,3],
'B' : [2,17,17],
'C1' : ["C1", np.nan,np.nan],
'C2' : ["C2", "C2",np.nan]}
# Create DataFrame
df = pd.DataFrame(data)
Dataframe:
A B C1 C2
0 1 2 C1 C2
1 2 17 NaN C2
2 3 17 NaN NaN
I am creating a variable "C" using the logic and code below:
If any of the C columns (C1, C2, C3, ...) has a value, "C" takes the values from those columns, joined by commas.
df['C'] = df.filter(regex=r'C\d+').stack().groupby(level=0).agg(','.join)
Result:
A B C1 C2 C
0 1 2 C1 C2 C1,C2
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
Now I want to apply the following logic:
If "C" holds more than one value (say C1, C2) for a row, create a new row for each additional value. So I want my output to look like this:
A B C1 C2 C
0 1 2 C1 C2 C1
0 1 2 C1 C2 C2
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN

We can do this with explode, then concat:
s=df.filter(regex=r'C\d+').stack().groupby(level=0).agg(list).explode().to_frame('C').join(df)
s=pd.concat([s,df[~df.index.isin(s.index)]],axis=0,join='outer',ignore_index=True,sort=False)
s
Out[62]:
C A B C1 C2
0 C1 1 2 C1 C2
1 C2 1 2 C1 C2
2 C2 2 17 NaN C2
3 NaN 3 17 NaN NaN

You could use melt and a left merge:
df.merge(df.melt(['A','B'],value_name= 'C').dropna().drop('variable',axis = 1),how = "left")
A B C1 C2 C
0 1 2 C1 C2 C1
1 1 2 C1 C2 C2
2 2 17 NaN C2 C2
3 3 17 NaN NaN NaN

You can just use df.explode(...). Try:
# please note that I aggregate into a list, not a string
df['C'] = df.filter(regex=r'C\d+').stack().groupby(level=0).agg(list)
df=df.explode("C")
Outputs:
A B C1 C2 C
0 1 2 C1 C2 C1
0 1 2 C1 C2 C2
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
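For reference, here is a self-contained sketch of the list-then-explode approach from the answer above, using the question's data. The names c_values and exploded are only illustrative, and the explicit dropna() is my own addition: it assumes you want NaN left out of the lists regardless of whether your pandas version drops them in stack() by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [2, 17, 17],
    "C1": ["C1", np.nan, np.nan],
    "C2": ["C2", "C2", np.nan],
})

# Collect the non-null C* values of each row into a list.
# Rows with no C* value at all get no list and stay NaN after assignment.
c_values = (
    df.filter(regex=r"^C\d+$")
      .stack()
      .dropna()            # explicit, in case stack() keeps NaN
      .groupby(level=0)
      .agg(list)
)
df["C"] = c_values

# One output row per list element; the original index label is repeated.
exploded = df.explode("C")
print(exploded)
```

If you would rather have a fresh 0..n-1 index instead of the repeated labels, exploded.reset_index(drop=True) should do it.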

Related

map data from one column to another

I have two DataFrames d1 and d2.
d1:
category value
0 a 4
1 b 9
2 c 14
3 d 19
4 e 24
5 f 29
d2:
one two
0 NaN a
1 NaN a
2 NaN c
3 NaN d
4 NaN e
5 NaN a
I want to map values from d1 to the 'one' column in d2, using the category marker from d1.
This should return:
one two
0 4 a
1 4 a
2 14 c
3 19 d
4 24 e
5 4 a
Try:
d2['one'] = d2['two'].map(d1.set_index('category')['value'])
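A runnable sketch of the same idea with the question's data rebuilt by hand. The variable name lookup is just illustrative, and this assumes the category values in d1 are unique, since map needs an unambiguous lookup index.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({
    "category": ["a", "b", "c", "d", "e", "f"],
    "value": [4, 9, 14, 19, 24, 29],
})
d2 = pd.DataFrame({"one": np.nan, "two": ["a", "a", "c", "d", "e", "a"]})

# Turn d1 into a Series indexed by category, then look up each 'two' value in it.
lookup = d1.set_index("category")["value"]
d2["one"] = d2["two"].map(lookup)
print(d2)
```

Any value in 'two' that has no matching category in d1 would come back as NaN rather than raising an error.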

replace a column in a dataframe by inserting another dataframe

I have two dataframes df1 and df2:
data1 = {'A':[1,3,2,1.4,2,1,2,4], 'B':[10,30,20,1.4,2,78,2,78],'C':[200,340,20,180,2,201,2,100]}
df1 = pd.DataFrame(data1)
print(df1)
A B C
0 1.0 10.0 200
1 3.0 30.0 340
2 2.0 20.0 20
3 1.4 1.4 180
4 2.0 2.0 2
5 1.0 78.0 201
6 2.0 2.0 2
7 4.0 78.0 100
data2 = {'D':['a1','a2','a3','a4',2,1,'a3',4], 'E':['b1','b2',20,1.4,2,78,2,78],'F':[200,340,'c1',180,2,'c2',2,100]}
df2 = pd.DataFrame(data2)
print(df2)
D E F
0 a1 b1 200
1 a2 b2 340
2 a3 20 c1
3 a4 1.4 180
4 2 2 2
5 1 78 c2
6 a3 2 2
7 4 78 100
I want to insert df2 into df1 in place of column B. How can a dataframe be inserted by replacing a column in another dataframe?
desired result:
A D E F C
0 1.0 a1 b1 200 200
1 3.0 a2 b2 340 340
2 2.0 a3 20 c1 20
3 1.4 a4 1.4 180 180
4 2.0 2 2 2 2
5 1.0 1 78 c2 201
6 2.0 a3 2 2 2
7 4.0 4 78 100 100
The idea is to use concat, selecting columns by position with DataFrame.iloc and Index.get_loc, then remove the original column with DataFrame.drop:
c = 'B'
pos = df1.columns.get_loc(c)
df = pd.concat([df1.iloc[:, :pos], df2, df1.iloc[:, pos:]], axis=1).drop(c, axis=1)
print (df)
A D E F C
0 1.0 a1 b1 200 200
1 3.0 a2 b2 340 340
2 2.0 a3 20 c1 20
3 1.4 a4 1.4 180 180
4 2.0 2 2 2 2
5 1.0 1 78 c2 201
6 2.0 a3 2 2 2
7 4.0 4 78 100 100
Tested with other columns:
c = 'A'
pos = df1.columns.get_loc(c)
df = pd.concat([df1.iloc[:, :pos], df2, df1.iloc[:, pos:]], axis=1).drop(c, axis=1)
print (df)
D E F B C
0 a1 b1 200 10.0 200
1 a2 b2 340 30.0 340
2 a3 20 c1 20.0 20
3 a4 1.4 180 1.4 180
4 2 2 2 2.0 2
5 1 78 c2 78.0 201
6 a3 2 2 2.0 2
7 4 78 100 78.0 100
c = 'C'
pos = df1.columns.get_loc(c)
df = pd.concat([df1.iloc[:, :pos], df2, df1.iloc[:, pos:]], axis=1).drop(c, axis=1)
print (df)
A B D E F
0 1.0 10.0 a1 b1 200
1 3.0 30.0 a2 b2 340
2 2.0 20.0 a3 20 c1
3 1.4 1.4 a4 1.4 180
4 2.0 2.0 2 2 2
5 1.0 78.0 1 78 c2
6 2.0 2.0 a3 2 2
7 4.0 78.0 4 78 100
IIUC, we can use itertools with reindex:
import itertools
l=list(itertools.chain.from_iterable(list(df2) if item == 'B' else [item] for item in list(df1)))
pd.concat([df1,df2], axis=1).reindex(columns=l)
A D E F C
0 1.0 a1 b1 200 200
1 3.0 a2 b2 340 340
2 2.0 a3 20 c1 20
3 1.4 a4 1.4 180 180
4 2.0 2 2 2 2
5 1.0 1 78 c2 201
6 2.0 a3 2 2 2
7 4.0 4 78 100 100
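If you need this more than once, the concat idea can be wrapped in a small helper. This is only a sketch: replace_column is a made-up name, it assumes the column labels of the left frame are unique (so get_loc returns an integer) and that both frames share the same index. Instead of calling drop afterwards it skips the replaced column with pos + 1, which is a small variation on the answer above.

```python
import pandas as pd

def replace_column(left, col, right):
    """Return a new frame in which column `col` of `left` is replaced by all columns of `right`."""
    pos = left.columns.get_loc(col)  # integer position, assuming unique labels
    return pd.concat(
        [left.iloc[:, :pos], right, left.iloc[:, pos + 1:]],
        axis=1,
    )

df1 = pd.DataFrame({"A": [1.0, 3.0], "B": [10.0, 30.0], "C": [200, 340]})
df2 = pd.DataFrame({"D": ["a1", "a2"], "E": ["b1", "b2"], "F": [200, 340]})

result = replace_column(df1, "B", df2)
print(result)
```

Skipping the column by position rather than dropping it by name also avoids accidentally removing a column of the same name that came in from the second frame.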

Python combines multiple column in one

I have a pandas dataframe as below:
data = {'A' :[1,2,3],
'B':[2,17,17],
'C1' :["C1",np.nan,np.nan],
'C2' :[np.nan,"C2",np.nan]}
# Create DataFrame
df = pd.DataFrame(data)
df
A B C1 C2
0 1 2 C1 NaN
1 2 17 NaN C2
2 3 17 NaN NaN
I want to create a variable "C" based on "C1" and "C2" (there could also be "C4", "C5"). If any of the C columns (C1, C2, C3, ...) has a value, "C" should take that value. My output in this case should look like this:
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
Try this
df1 = df.filter(regex=r'^C\d+')
df['C'] = df1[df1.isin(df1.columns)].bfill(axis=1).iloc[:,0]
Out[117]:
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
If you want to strictly compare each value against its own column name, use eq instead of isin, as follows:
df['C'] = df1[df1.eq(df1.columns, axis=1)].bfill(axis=1).iloc[:,0]
IIUC
df['C']=df.filter(like='C').bfill(axis=1).iloc[:,0]
df
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
IIUC,
we can filter your columns by the pattern "C" followed by digits, then aggregate the values with an agg call:
df['C'] = df.filter(regex=r'C\d+').stack().groupby(level=0).agg(','.join)
print(df)
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
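A self-contained sketch of the bfill variant with the question's data. It assumes at most one C column is filled per row, as in the example; if several were filled, this would keep only the leftmost one. The name c_cols is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [2, 17, 17],
    "C1": ["C1", np.nan, np.nan],
    "C2": [np.nan, "C2", np.nan],
})

# Anchor the pattern so only C1, C2, ... match.
c_cols = df.filter(regex=r"^C\d+$")

# Back-fill across columns, then take the first column: the first non-null value per row.
df["C"] = c_cols.bfill(axis=1).iloc[:, 0]
print(df)
```

Note that filter(like='C') matches any column whose name contains "C", so the anchored regex is the safer choice if other columns might contain that letter, or if you rerun the code after "C" itself has been added.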

Merging list of multi-index series in Python

I'm working with thousands of pd.Series, each with a multi-index that has 2 static levels, a dynamic one, and then timestamps:
start = np.concatenate((np.random.rand(3), [np.nan]*3))
end = np.concatenate(([np.nan]*3, np.random.rand(3)))
# note: the 'labels' argument was renamed to 'codes' in pandas 0.24
index1 = pd.MultiIndex(levels = [["X"], ["Y"], ["A"], ["d1","d2","d3","d4","d5","d6"]],
codes = [[0,0,0,0,0,0], [0,0,0,0,0,0], [0,0,0,0,0,0], [0,1,2,3,4,5]],
names = ["static1", "static2", "dynamo", "timestamps"])
i1_start = pd.Series(start, index=index1, name="col1")
i1_end = pd.Series(end, index=index1, name="col2")
index2 = pd.MultiIndex(levels = [["X"], ["Y"], ["B"], ["d1","d2","d3","d4","d5","d6"]],
codes = [[0,0,0,0,0,0], [0,0,0,0,0,0], [0,0,0,0,0,0], [0,1,2,3,4,5]],
names = ["static1", "static2", "dynamo", "timestamps"])
i2_start = pd.Series(start, index=index2, name="col1")
i2_end = pd.Series(end, index=index2, name="col2")
data = [i1_start, i1_end, i2_start, i2_end]
df = pd.DataFrame(data).T
df
Here are the results of turning it into a dataframe:
col1 col2 col1 col2
static1 static2 dynamo timestamps
X Y A d1 0.248504 NaN NaN NaN
d2 0.424774 NaN NaN NaN
d3 0.333638 NaN NaN NaN
d4 NaN 0.987744 NaN NaN
d5 NaN 0.093231 NaN NaN
d6 NaN 0.918666 NaN NaN
B d1 NaN NaN 0.248504 NaN
d2 NaN NaN 0.424774 NaN
d3 NaN NaN 0.333638 NaN
d4 NaN NaN NaN 0.987744
d5 NaN NaN NaN 0.093231
d6 NaN NaN NaN 0.918666
I'm looking for advice on how to group the series that share the same name and concat/merge/join them so that the columns line up, instead of having an entire triangle of null values.
I think you need concat with axis=1, then sum or max grouped by column name:
data = [i1_start, i1_end, i2_start, i2_end]
#sum(axis=1, level=0) was removed in pandas 2.0, so group the transposed frame instead
#min_count=1 keeps all-NaN groups as NaN instead of 0
df = pd.concat(data, axis=1).T.groupby(level=0).sum(min_count=1).T
#alternative
df = pd.concat(data, axis=1).T.groupby(level=0).max().T
print (df)
col1 col2
static1 static2 dynamo timestamps
X Y A d1 0.771148 NaN
d2 0.074757 NaN
d3 0.526310 NaN
d4 NaN 0.975088
d5 NaN 0.992226
d6 NaN 0.465135
B d1 0.771148 NaN
d2 0.074757 NaN
d3 0.526310 NaN
d4 NaN 0.975088
d5 NaN 0.992226
d6 NaN 0.465135
How about this?
df.fillna(0).sum(1)
That is, replace NaN with zero and sum all the columns for each row.
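Another way to look at it, sketched below with smaller made-up data: group the series by name first, stack each group end to end, and only then put the names side by side. This is a different technique from the sum/max answers above and assumes that index entries do not repeat within one name. make_index, by_name and merged are illustrative names, not part of the question.

```python
import numpy as np
import pandas as pd

def make_index(dynamo):
    # Same level layout as in the question, with fewer timestamps.
    return pd.MultiIndex.from_product(
        [["X"], ["Y"], [dynamo], ["d1", "d2", "d3", "d4"]],
        names=["static1", "static2", "dynamo", "timestamps"],
    )

start = [0.1, 0.2, np.nan, np.nan]
end = [np.nan, np.nan, 0.3, 0.4]

series = []
for dynamo in ["A", "B"]:
    idx = make_index(dynamo)
    series.append(pd.Series(start, index=idx, name="col1"))
    series.append(pd.Series(end, index=idx, name="col2"))

# Bucket the series by their name.
by_name = {}
for s in series:
    by_name.setdefault(s.name, []).append(s)

# Concatenate each bucket row-wise, then line the buckets up as columns.
merged = pd.concat(
    {name: pd.concat(parts) for name, parts in by_name.items()},
    axis=1,
)
print(merged)
```

Because nothing is summed, genuine zeros and missing values stay distinguishable, which fillna(0).sum(1) cannot guarantee.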

Nested ifs to get values from different column

I have a dataframe df, I want to fill values in a column, based on the condition applied to other column
Structure of DF, After ID there are some columns:
ID ...... col1 col2 col3 col4
1 A1 A1 A1 A1
2 G3 D5
3 R6
4 Q3
5 M5 N8
I want to create two new columns called 'final_col' and 'status', where 'final_col' takes its value from col1, col2, col3 or col4, whichever is the first to hold a non-blank (not null/NaN) value.
The column 'status' is just the name of the column that value came from.
Expected Output:
ID ...... col1 col2 col3 col4 final_col status
1 A1 A1 A1 A1 A1 col1
2 G3 D5 G3 col2
3 R6 L4 R6 col1
4 Not_found Not_found
5 M5 N8 M5 col2
I know how to do this in excel, with nested ifs as so, assuming ID is cell 'A1'
In the first row of 'final_col':
=IF(A2<>"",A2,IF(B2<>"",B2,IF(C2<>"",C2,IF(D2<>"",D2,"Not_found"))))
For column 'status'
=IF(A2<>"","col1",IF(B2<>"","col2",IF(C2<>"","col3",IF(D2<>"","col4","Not_found"))))
P.S: Please use column names in your solution, and not index because the structure of the data frame may vary (order of the columns).
You can use first_valid_index. If some rows can have all NaN values in columns col1 to col4, use:
print(df)
ID col1 col2 col3 col4
0 1 A1 A1 A1 A1
1 2 NaN G3 NaN D5
2 3 R6 NaN NaN NaN
3 4 NaN NaN NaN NaN
4 5 NaN M5 N8 NaN
def f1(x):
    if x.first_valid_index() is None:
        return 'Not_found'
    else:
        return str(x.first_valid_index())
def f2(x):
    if x.first_valid_index() is None:
        return 'Not_found'
    else:
        return x[x.first_valid_index()]
# df.ix was removed in pandas 1.0; select the columns by name instead
cols = df[['col1', 'col2', 'col3', 'col4']]
df['status'] = cols.apply(f1, axis=1)
df['final_col'] = cols.apply(f2, axis=1)
print(df)
ID col1 col2 col3 col4 status final_col
0 1 A1 A1 A1 A1 col1 A1
1 2 NaN G3 NaN D5 col2 G3
2 3 R6 NaN NaN NaN col1 R6
3 4 NaN NaN NaN NaN Not_found Not_found
4 5 NaN M5 N8 NaN col2 M5
You could use first_valid_index:
In [105]: df
Out[105]:
ID col1 col2 col3 col4
0 1 A1 A1 A1 A1
1 2 NaN G3 NaN D5
2 3 R6 NaN NaN NaN
3 4 NaN NaN NaN NaN
4 5 NaN M5 N8 NaN
df['status'] = df.iloc[:,1:].apply(lambda x: x.first_valid_index(), axis=1)
df['final_col'] = df.iloc[:, 1:].apply(lambda x: x[x['status']] if pd.notna(x['status']) else 'Not found', axis=1)
df['status'] = df['status'].fillna('Not found')
In [129]: df
Out[129]:
ID col1 col2 col3 col4 status final_col
0 1 A1 A1 A1 A1 col1 A1
1 2 NaN G3 NaN D5 col2 G3
2 3 R6 NaN NaN NaN col1 R6
3 4 NaN NaN NaN NaN Not found Not found
4 5 NaN M5 N8 NaN col2 M5
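For larger frames, a sketch without row-wise apply may be worth trying. It selects the columns by name, as the question asks, and relies on idxmax over a boolean mask returning the first column that holds True. The names cols, present and has_value are illustrative, and the blank cells are assumed to be NaN, as in the answers above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "col1": ["A1", np.nan, "R6", np.nan, np.nan],
    "col2": ["A1", "G3", np.nan, np.nan, "M5"],
    "col3": ["A1", np.nan, np.nan, np.nan, "N8"],
    "col4": ["A1", "D5", np.nan, np.nan, np.nan],
})

cols = ["col1", "col2", "col3", "col4"]  # chosen by name, so column order elsewhere does not matter
present = df[cols].notna()
has_value = present.any(axis=1)

# Name of the first column holding a value; rows with no value get the placeholder.
df["status"] = present.idxmax(axis=1).where(has_value, "Not_found")

# First non-null value per row, again with the placeholder for empty rows.
df["final_col"] = df[cols].bfill(axis=1).iloc[:, 0].where(has_value, "Not_found")
print(df)
```

The order of the names in cols decides the priority, so reordering that list is the equivalent of reordering the nested IFs in the Excel formula.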
