I have two DataFrames d1 and d2.
d1:
category value
0 a 4
1 b 9
2 c 14
3 d 19
4 e 24
5 f 29
d2:
one two
0 NaN a
1 NaN a
2 NaN c
3 NaN d
4 NaN e
5 NaN a
I want to map values from d1 onto the 'one' column in d2, using the category column from d1 as the key.
This should return:
one two
0 4 a
1 4 a
2 14 c
3 19 d
4 24 e
5 4 a
Try:
d2['one'] = d2['two'].map(d1.set_index('category')['value'])
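For reference, a self-contained sketch that rebuilds the frames from the question and applies the same map:
import numpy as np
import pandas as pd

d1 = pd.DataFrame({'category': list('abcdef'),
                   'value': [4, 9, 14, 19, 24, 29]})
d2 = pd.DataFrame({'one': np.nan, 'two': list('aacdea')})

# Build a lookup Series indexed by category, then map the 'two' column through it.
lookup = d1.set_index('category')['value']
d2['one'] = d2['two'].map(lookup)
# Any category in d2 that is missing from d1 would come back as NaN;
# chain .fillna(<default>) on the map result if a fallback value is wanted.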
Suppose the simplified dataframe below (the actual df is much, much bigger). How does one assign values to a new column f such that f is a function of another column (e.g. e)?
df = pd.DataFrame([[1,2,3,4], [5,6,7,8], [9,10,11,12], [13,14,15,16]])
df.columns = pd.MultiIndex.from_tuples((("a", "d"), ("a", "e"), ("b", "d"), ("b","e")))
df
a b
d e d e
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Desired Output:
a b
d e f d e f
0 1 2 nan 3 4 nan
1 5 6 1.10 7 8 0.69
2 9 10 0.51 11 12 0.41
3 13 14 0.34 15 16 0.29
where column f is computed within each top-level group as the log of that group's e column, differenced: e.g. np.log(df[('a', 'e')]).diff() for group a.
You could select the e columns across the MultiIndex with df.loc, apply the functions directly to that slice, rename the second level to f, and join the result back to df:
import numpy as np
df = (df.join(np.log(df.loc[:, (slice(None), 'e')])
                .diff().round(2)
                .rename(columns={'e': 'f'}, level=1))
        .sort_index(axis=1))
Output:
a b
d e f d e f
0 1 2 NaN 3 4 NaN
1 5 6 1.10 7 8 0.69
2 9 10 0.51 11 12 0.41
3 13 14 0.34 15 16 0.29
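A side note on the loc selection above: pd.IndexSlice can make the cross-section a little more readable (a small sketch, equivalent to df.loc[:, (slice(None), 'e')]):
idx = pd.IndexSlice
e_cols = df.loc[:, idx[:, 'e']]   # every top-level group, second level 'e'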
An alternative is to build each top-level block separately, adding f as the log-diff of that block's e column, then concatenate the blocks back together:
parts = {c: df[c].assign(f=np.log(df[(c, 'e')]).diff()) for c in df.columns.levels[0]}
df = pd.concat([parts[c] for c in parts], axis=1, keys=parts.keys())
Given a dataframe df as follows:
id building floor_number floor_name
0 1 A 8 5F
1 2 A 4 4F
2 3 A 3 3F
3 4 A 2 2F
4 5 A 1 1F
5 6 B 14 17F
6 7 B 13 16F
7 8 B 20 world
8 9 B 13 hello
9 10 B 13 16F
I need to extract the numeric values from the floor_name column, group by building, and compare each row's floor_number with the maximum value extracted from floor_name within that building. If floor_number is bigger than that maximum, a new column check should contain 'invalid floor number'.
This is expected result:
id building ... floor_name check
0 1 A ... 5F invalid floor number
1 2 A ... 4F NaN
2 3 A ... 3F NaN
3 4 A ... 2F NaN
4 5 A ... 1F NaN
5 6 B ... 17F NaN
6 7 B ... 16F NaN
7 8 B ... world invalid floor number
8 9 B ... hello NaN
9 10 B ... 16F NaN
To extract the values from floor_name, group by building, and get the max of floor_name, I have used:
df['floor_name'] = df['floor_name'].str.extract('(\d*)', expand = False)
df.groupby('building')['floor_name'].max()
Out:
building
A 5
B 17
Name: floor_name, dtype: object
How could I finish the rest of the code? Thanks in advance.
Use groupby().transform(). Also, it's better to convert to a numeric type, since as strings '2' > '17':
numeric_floors = (df['floor_name'].str.extract(r'(\d+)',   # use \d+ instead of \d*
                                               expand=False)
                    .astype(float)                         # convert to numeric type
                    .groupby(df['building'])
                    .transform('max'))
df.loc[df['floor_number'] > numeric_floors, 'check'] = 'invalid floor number'
Output:
id building floor_number floor_name check
0 1 A 8 5F invalid floor number
1 2 A 4 4F NaN
2 3 A 3 3F NaN
3 4 A 2 2F NaN
4 5 A 1 1F NaN
5 6 B 14 17F NaN
6 7 B 13 16F NaN
7 8 B 20 world invalid floor number
8 9 B 13 hello NaN
9 10 B 13 16F NaN
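As a variation on the last line, the same condition can be written with Series.where, so rows that don't match stay NaN without an explicit .loc assignment (a small sketch reusing numeric_floors from above):
df['check'] = (pd.Series('invalid floor number', index=df.index)
                 .where(df['floor_number'] > numeric_floors))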
I need to perform column-rearranging operations that include:
find the difference with respect to a list
create new columns sorted as in the list
keep the MultiIndex
import pandas as pd
cols = ['c1','c2','c3','c4']
df = pd.DataFrame(
    data=[[1, 1, 2, 2, 3, 3], [4, 4, 5, 5, 6, 6], [7, 7, 8, 8, 9, 9]],
    index=['r1', 'r2', 'r3'],
    columns=pd.MultiIndex.from_tuples([('c2', 'mean'), ('c2', 'max'),
                                       ('c1', 'mean'), ('c1', 'max'),
                                       ('c3', 'mean'), ('c3', 'max')]))
df
Out[52]:
c2 c1 c3
mean max mean max mean max
r1 1 1 2 2 3 3
r2 4 4 5 5 6 6
r3 7 7 8 8 9 9
So the final result should be:
df
Out[52]:
c1 c2 c3 c4
mean max mean max mean max mean max
r1 2 2 1 1 3 3 NaN NaN
r2 5 5 4 4 6 6 NaN NaN
r3 8 8 7 7 9 9 NaN NaN
Use DataFrame.reindex to change the order and to add the missing combinations to the original data, using a new MultiIndex created e.g. by MultiIndex.from_product:
mux = pd.MultiIndex.from_product([['c1','c2','c3','c4'], ['mean','max']])
df = df.reindex(mux, axis=1)
print (df)
c1 c2 c3 c4
mean max mean max mean max mean max
r1 2 2 1 1 3 3 NaN NaN
r2 5 5 4 4 6 6 NaN NaN
r3 8 8 7 7 9 9 NaN NaN
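Since the target order is already given by the cols list defined in the question, the same MultiIndex can be built from that variable instead of hard-coding the names (a small sketch reusing cols):
mux = pd.MultiIndex.from_product([cols, ['mean', 'max']])
df = df.reindex(mux, axis=1)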
I have a pandas dataframe as below:
data = {'A' :[1,2,3],
'B':[2,17,17],
'C1' :["C1",np.nan,np.nan],
'C2' :[np.nan,"C2",np.nan]}
# Create DataFrame
df = pd.DataFrame(data)
df
A B C1 C2
0 1 2 C1 NaN
1 2 17 NaN C2
2 3 17 NaN NaN
I want to create a variable "C" based on "C1" and "C2" (there could also be "C3", "C4", "C5", ...). If any of the C columns has a value, "C" should take that value. My output in this case should look like below:
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
Try this
df1 = df.filter(regex=r'^C\d+')
df['C'] = df1[df1.isin(df1.columns)].bfill(axis=1).iloc[:, 0]
Out[117]:
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
If you want to strictly compare each value against its own column name, use eq instead of isin:
df['C'] = df1[df1.eq(df1.columns, axis=1)].bfill(axis=1).iloc[:, 0]
IIUC
df['C']=df.filter(like='C').bfill(axis=1).iloc[:,0]
df
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
IIUC,
we can filter your columns by the C prefix, then aggregate the values with an agg call:
df['C'] = df.filter(regex=r'C\d+').stack().groupby(level=0).agg(','.join)
print(df)
A B C1 C2 C
0 1 2 C1 NaN C1
1 2 17 NaN C2 C2
2 3 17 NaN NaN NaN
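Since the question mentions that more C columns (C4, C5, ...) may appear, here is a small self-contained sketch showing that the filter-plus-bfill pattern extends unchanged; the extra C3 column is a hypothetical addition, not part of the original data:
import numpy as np
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [2, 17, 17],
        'C1': ['C1', np.nan, np.nan],
        'C2': [np.nan, 'C2', np.nan],
        'C3': [np.nan, np.nan, 'C3']}   # hypothetical extra column
df = pd.DataFrame(data)

cs = df.filter(regex=r'^C\d+')          # grab every C<number> column
df['C'] = cs.bfill(axis=1).iloc[:, 0]   # first non-null value from left to right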
I have a dataframe like below. I would like to sum rows 0 to 4 (every 5 rows) and create another column with the summed value ("new column"). My real dataframe has 263 rows and the index repeats from 0 to 12, so the last block of each repeat (rows 10 to 12) is a sum of only three rows. How can I do this using Pandas/Python? I have started to learn Python recently. Thanks for any advice in advance!
My data pattern is more complex: I am using the index as one of my column values, and it repeats like this:
Row Data "new column"
0 5
1 1
2 3
3 3
4 2 14
5 4
6 8
7 1
8 2
9 1 16
10 0
11 2
12 3 5
0 3
1 1
2 2
3 3
4 2 11
5 2
6 6
7 2
8 2
9 1 13
10 1
11 0
12 1 2
...
259 50 89
260 1
261 4
262 5 10
I tried iterrows and groupby but can't make it work so far.
Use this:
# Sum within each block of 5 rows, then keep the result only on each block's last row:
df['new col'] = (df.groupby(df.index // 5)['Data'].transform('sum')
                 [lambda x: ~x.duplicated(keep='last')])
Output:
Data new col
0 5 NaN
1 1 NaN
2 3 NaN
3 3 NaN
4 2 14.0
5 4 NaN
6 8 NaN
7 1 NaN
8 2 NaN
9 1 16.0
Edit to handle updated question:
# Number each repetition of the 0-12 cycle, then group by (repetition, 5-row block) within it:
g = df.groupby(df.Row).cumcount()
df['new col'] = (df.groupby([g, df.Row // 5])['Data'].transform('sum')
                 [lambda x: ~x.duplicated(keep='last')])
Output:
Row Data new col
0 0 5 NaN
1 1 1 NaN
2 2 3 NaN
3 3 3 NaN
4 4 2 14.0
5 5 4 NaN
6 6 8 NaN
7 7 1 NaN
8 8 2 NaN
9 9 1 16.0
10 10 0 NaN
11 11 2 NaN
12 12 3 5.0
13 0 3 NaN
14 1 1 NaN
15 2 2 NaN
16 3 3 NaN
17 4 2 11.0
18 5 2 NaN
19 6 6 NaN
20 7 2 NaN
21 8 2 NaN
22 9 1 13.0
23 10 1 NaN
24 11 0 NaN
25 12 1 2.0
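A variant sketch of the same idea, assuming the 'Row'/'Data' frame from the question: place the block sum explicitly on the last row of each block rather than filtering duplicated sums (which could collide if two different blocks happen to share the same total):
cycle = df.groupby(df['Row']).cumcount()   # which repetition of the 0-12 cycle this row belongs to
block = df['Row'] // 5                     # 5-row block within the cycle (the last block has 3 rows)

sums = df.groupby([cycle, block])['Data'].transform('sum')
last_rows = df.groupby([cycle, block]).tail(1).index   # index label of each block's last row
df['new col'] = sums.where(df.index.isin(last_rows))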