I have a dataframe like this:
df = pd.DataFrame({
    'time': ['01-01-2020', '02-01-2020', '01-01-2020', '02-01-2020'],
    'level': ['q', 'q', 'r', 'r'],
    'a': [1, 2, 3, 4],
    'b': [12, 34, 54, 67],
    'c': [18, 29, 39, 47],
    'a_1': [0.1, 0.2, 0.3, 0.4],
    'a_2': [0, 1, 0, 1],
    'b_1': [0.28, 0.47, 0.02, 0.05],
    'b_2': [1, 1, 0, 1],
    'c_1': [0.18, 0.40, 0.12, 0.01],
    'c_2': [1, 1, 0, 0],
})
>> time level a b c a_1 a_2 b_1 b_2 c_1 c_2
0 01-01-2020 q 1 12 18 0.1 0 0.28 1 0.18 1
1 02-01-2020 q 2 34 29 0.2 1 0.47 1 0.40 1
2 01-01-2020 r 3 54 39 0.3 0 0.02 0 0.12 0
3 02-01-2020 r 4 67 47 0.4 1 0.05 1 0.01 0
I wish to melt the data with time and level as the index and all the other columns as rows, keeping only the rows whose corresponding flag column is 1. E.g. I wish to have the values of a and a_1 listed under values and items only where a_2 is 1.
Desired output:
>> time level column values items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
I can get all the values irrespective of the flags and then filter for flags == 1, but I'm not sure how to "melt"/"unstack" in this case. I tried a lot of ways, but in vain. Please help me out here.
Let's try with melt:
i, c = ['time', 'level'], pd.Index(['a', 'b', 'c'])

# mask the values where the flag is 0
m = df[c + '_1'].mask(df[c + '_2'].eq(0).values)

# melt the dataframe & assign the items column
# (transpose then flatten so the order matches melt's column-by-column stacking)
s = df[[*i, *c]].melt(i, var_name='columns') \
                .assign(items=m.values.T.ravel())

# drop the NaN values and sort the dataframe
s = s.dropna(subset=['items']).sort_values(i, ignore_index=True)
Details:
Mask the values in the columns ending with the suffix _1 wherever the corresponding flag column equals 0:
a_1 b_1 c_1
0 NaN 0.28 0.18
1 0.2 0.47 0.40
2 NaN NaN NaN
3 0.4 0.05 NaN
Melt the dataframe on the columns a, b, c, then flatten the masked values and assign them as a new items column in the melted dataframe:
time level columns value items
0 01-01-2020 q a 1 NaN
1 02-01-2020 q a 2 0.20
2 01-01-2020 r a 3 NaN
3 02-01-2020 r a 4 0.40
4 01-01-2020 q b 12 0.28
5 02-01-2020 q b 34 0.47
6 01-01-2020 r b 54 NaN
7 02-01-2020 r b 67 0.05
8 01-01-2020 q c 18 0.18
9 02-01-2020 q c 29 0.40
10 01-01-2020 r c 39 NaN
11 02-01-2020 r c 47 NaN
Lastly drop the NaN values in items and sort the values on time and level to get the final result:
time level columns value items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
There may be a more elegant way, but this works: extract the data for each column name (a, b, c), select the rows whose flag is set to 1, and concatenate the results.
df.set_index(['time', 'level'], inplace=True)

parts = []
for name in 'a', 'b', 'c':
    # take the value, item and flag columns for this letter
    d = df[[name, f'{name}_1', f'{name}_2']] \
        .rename(columns={name: 'values', f'{name}_1': 'items', f'{name}_2': 'flag'})
    d['column'] = name
    parts.append(d[d.flag == 1])

pd.concat(parts)[['column', 'values', 'items']].reset_index()
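Note that concat leaves the rows grouped by column name (all the a rows, then b, then c); appending .sort_values(['time', 'level'], ignore_index=True) reproduces the row order of the desired output.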
Step 1: rename the columns so that the number comes before the letter:
res = df.copy()
res.columns = ["_".join(entry.split("_")[::-1]) for entry in res]
Step 2: rename the columns again so that "num" is prefixed when the column is one of ("a", "b", "c"):
res.columns = [f"num_{letter}" if letter in ("a", "b", "c")
               else letter
               for letter in res]
res
time level num_a num_b num_c 1_a 2_a 1_b 2_b 1_c 2_c
0 01-01-2020 q 1 12 18 0.1 0 0.28 1 0.18 1
1 02-01-2020 q 2 34 29 0.2 1 0.47 1 0.40 1
2 01-01-2020 r 3 54 39 0.3 0 0.02 0 0.12 0
3 02-01-2020 r 4 67 47 0.4 1 0.05 1 0.01 0
Step 3: use pandas wide_to_long to reshape the data, filter for the flag rows equal to 1, rename the columns, and finally reset the index:
(
    pd.wide_to_long(
        res,
        stubnames=["num", "1", "2"],
        i=["time", "level"],
        j="column",
        sep="_",
        suffix=".",
    )
    # this is where the filter for rows equal to 1 occurs
    .query("`2` == 1")
    .drop(columns="2")
    .set_axis(["values", "items"], axis="columns")
    .reset_index()
)
time level column values items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
This is another way, using the same idea of renaming the columns to make the reshape easy with wide_to_long; the result is identical to the output above:
import re

result = df.rename(
    columns=lambda x: f"values_{x}" if x in ("a", "b", "c")
    else f"items_{x[0]}" if re.search(".1$", x)
    else f"equals1_{x[0]}" if re.search(".2$", x)
    else x
)
(
    pd.wide_to_long(
        result,
        stubnames=["values", "items", "equals1"],
        i=["time", "level"],
        j="column",
        sep="_",
        suffix=".",
    )
    .query("equals1 == 1")
    .iloc[:, :-1]
    .reset_index()
)
Another option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install pyjanitor
import pandas as pd
import janitor

(df
 .pivot_longer(index=['time', 'level'],
               names_to=["column", ".value"],
               names_pattern=r"(.)_?(.?)",
               sort_by_appearance=True)
 .query('`2` == 1')
 .drop(columns='2')
 .rename(columns={'': 'values', '1': 'items'})
)
time level column values items
1 01-01-2020 q b 12 0.28
2 01-01-2020 q c 18 0.18
3 02-01-2020 q a 2 0.20
4 02-01-2020 q b 34 0.47
5 02-01-2020 q c 29 0.40
9 02-01-2020 r a 4 0.40
10 02-01-2020 r b 67 0.05
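The query filter keeps the row labels of the unfiltered long frame, which is why the index is not contiguous here; chaining .reset_index(drop=True) at the end gives the 0-based index shown in the desired output.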
I have a dataframe like this:
A B C D E F G H I J
0 A.1 Data Data 223 52
1 A.2 Data Data Data 12 6
2 A.4 Data 32 365
3 A.5 Data 100 88
4 A.6 Data 654 98
5 A.7 Data 356 56
And my desired output is like this, with the nulls in columns A to E shifted to the end of those columns:
 A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
Only columns A to E should shift their nulls. I have a current script using a lambda, but it shifts the null values to the last column across the whole dataframe. I need it for certain columns only; can anyone help me? Thank you!
import numpy as np

def shift_null(arr):
    # x == x is False only for NaN, so non-null values come first
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]

df = df.T.apply(lambda arr: shift_null(arr)).T
You can remove the missing values per row with Series.dropna, restore any all-missing columns with DataFrame.reindex, and then set the column names with DataFrame.set_axis:
cols = ['A', 'B', 'C', 'D', 'E']
df[cols] = (df[cols].apply(lambda x: pd.Series(x.dropna().tolist()), axis=1)
                    .reindex(range(len(cols)), axis=1)
                    .set_axis(cols, axis=1))
print(df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
Your solution can be kept by removing the transposing and passing result_type='expand' to DataFrame.apply:
cols = ['A', 'B', 'C', 'D', 'E']

def shift_null(arr):
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]

df[cols] = df[cols].apply(lambda arr: shift_null(arr), axis=1, result_type='expand')
print(df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
Another idea is sorting each row with the key parameter of Series.sort_values, pushing the NaN values to the end:
cols = ['A', 'B', 'C', 'D', 'E']
df[cols] = df[cols].apply(lambda x: x.sort_values(key=lambda x: x.isna()).tolist(),
                          axis=1, result_type='expand')
print(df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
A solution that reshapes with DataFrame.stack (which drops the NaN values), adds a per-row counter for the new column names, and reshapes back with Series.unstack:
s = df[cols].stack().droplevel(1)
s.index = [s.index, s.groupby(level=0).cumcount()]
df[cols] = s.unstack().rename(dict(enumerate(cols)), axis=1).reindex(cols, axis=1)
print(df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
Suppose the below simplified dataframe. (The actual df is much, much bigger.) How does one assign values to a new column f such that f is a function of another column (e.g. e)?
df = pd.DataFrame([[1,2,3,4], [5,6,7,8], [9,10,11,12], [13,14,15,16]])
df.columns = pd.MultiIndex.from_tuples((("a", "d"), ("a", "e"), ("b", "d"), ("b","e")))
df
a b
d e d e
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Desired Output:
a b
d e f d e f
0 1 2 nan 3 4 nan
1 5 6 1.10 7 8 0.69
2 9 10 0.51 11 12 0.41
3 13 14 0.34 15 16 0.29
where column f is computed as np.log(df['e']).diff()
You could select the e columns across the MultiIndex with loc, apply the functions directly to the slice, rename the result to f, and join it back to df:
import numpy as np

df = (df.join(np.log(df.loc[:, (slice(None), 'e')])
                .diff().round(2)
                .rename(columns={'e': 'f'}, level=1))
        .sort_index(axis=1))
Output:
a b
d e f d e f
0 1 2 NaN 3 4 NaN
1 5 6 1.10 7 8 0.69
2 9 10 0.51 11 12 0.41
3 13 14 0.34 15 16 0.29
Another option is to rebuild the frame per top-level group with a dict comprehension:
# compute f within each top-level group, then glue the groups back together
df = {c: df[c].assign(f=np.log(df[(c, 'e')]).diff()) for c in df.columns.levels[0]}
df = pd.concat([df[c] for c in df.keys()], axis=1, keys=df.keys())
How can I replace the first row's value of pct with NaN for each group of city and district? Thank you.
city district date pct
0 a b 2019/8/1 0.15
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 0.03
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
I can only get the first row's pct value for the whole dataframe with df['pct'].iloc[0].
My desired output will like this:
city district date pct
0 a b 2019/8/1 NaN
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 NaN
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
Use Series.where + DataFrame.duplicated:
df['pct'] = df['pct'].where(df.duplicated(subset=['city', 'district']))
print(df)
city district date pct
0 a b 2019/8/1 NaN
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 NaN
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
Detail:
df.duplicated(subset = ['city','district'])
0 False
1 True
2 True
3 False
4 True
5 True
dtype: bool
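A more explicit alternative, sketched here with GroupBy.cumcount, builds the same first-row mask per group (Series.mask is the inverse of Series.where, setting NaN where the condition is True):
# the first row of each (city, district) group has a cumulative count of 0
first = df.groupby(['city', 'district']).cumcount().eq(0)
df['pct'] = df['pct'].mask(first)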
I have a pandas data frame dat as below:
0 1 0 1 0 1
0 A 23 0.1 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.2 235 87 5
3 D 13 0.8 567 42 6
4 E 5 0.9 356 12 2
As you can see from above, the column labels are 0,1,0,1,0,1 etc. I want to rename them back to a normal index running 0,1,2,3,4,... and I did the following:
dat = dat.reset_index(drop=True)
The column labels were not changed. How do I get them renamed in this case? Thanks in advance.
dat.columns = range(dat.shape[1])
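reset_index only resets the row index; the duplicated 0,1,0,1 labels live on the columns axis, so they have to be reassigned directly, which is what the line above does. A quick check on a made-up two-row frame:
import pandas as pd

dat = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=[0, 1, 0, 1])
dat.columns = range(dat.shape[1])
print(dat.columns.tolist())  # [0, 1, 2, 3]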
There are quite a few ways, although rename-based approaches (a lambda or a dict built from the old labels) fail here, because the duplicated labels map many columns onto one new name. These work:
dat = dat.set_axis(pd.RangeIndex(dat.shape[1]), axis=1)
Or
dat = dat.T.reset_index(drop=True).T
Or
dat.columns = pd.RangeIndex(dat.shape[1])
dat
0 1 2 3 4 5
0 A 23 0.10 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.20 235 87 5
3 D 13 0.80 567 42 6
4 E 5 0.90 356 12 2
I am doing research and need to delete the rows containing values which are not in a specific range, using Python.
My Dataset in Excel:
I want to replace the out-of-range values of column A (not within range 1-20) with NaN, then the out-of-range values of column B (not within range 21-40), and so on.
Then I want to drop/delete the rows containing the NaN values.
Expected output should be like:
You can try this. Here I simulated your problem and solved it with the code below:
import numpy as np
import pandas as pd

data = pd.read_csv('c.csv')
print(data)

data['A'] = data['A'].apply(lambda x: np.nan if x in range(1, 10) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in range(10, 20) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in range(20, 30) else x)
print(data)

data = data.dropna()
print(data)
Original data:
A B C
0 1 10 20
1 2 11 22
2 4 15 25
3 8 20 30
4 12 25 35
5 18 40 55
6 20 45 60
Output with NaN:
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.0 30.0
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Final Output:
A B C
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Try this for non-integer numbers:
import numpy as np
import pandas as pd

data = pd.read_csv('c.csv')
print(data)

data['A'] = data['A'].apply(lambda x: np.nan if x in (round(y, 2) for y in np.arange(1.00, 10.00, 0.01)) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in (round(y, 2) for y in np.arange(10.00, 20.00, 0.01)) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in (round(y, 2) for y in np.arange(20.00, 30.00, 0.01)) else x)
print(data)

data = data.dropna()
print(data)
Output:
A B C
0 1.25 10.56 20.11
1 2.39 11.19 22.92
2 4.00 15.65 25.27
3 8.89 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
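Enumerating every float with np.arange is fragile, since it depends on rounding to exactly two decimals; if the goal is simply a numeric range check, Series.between handles integers and floats alike. A sketch under that assumption, using the bounds from the question (1-20 for A, 21-40 for B, and an assumed 41-60 for C); note this follows the question's wording and masks the out-of-range values, which is the inverse of the simulations above:
import pandas as pd

data = pd.read_csv('c.csv')
bounds = {'A': (1, 20), 'B': (21, 40), 'C': (41, 60)}  # C's range is assumed
for col, (lo, hi) in bounds.items():
    # keep in-range values (inclusive bounds), replace everything else with NaN
    data[col] = data[col].where(data[col].between(lo, hi))
data = data.dropna()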
Try this:
df = df.drop(df.index[df.idxmax()])
Output:
A B C D
0 1 21 41 61
1 2 22 42 62
2 3 23 43 63
3 4 24 44 64
4 5 25 45 65
5 6 26 46 66
6 7 27 47 67
7 8 28 48 68
8 9 29 49 69
13 14 34 54 74
14 15 35 55 75
15 16 36 56 76
16 17 37 57 77
17 18 38 58 78
18 19 39 59 79
19 20 40 60 80
Use DataFrame.idxmax to find, for each column, the row of its maximum value, and drop the returned index. Note this removes only one row per column: the one holding that column's single largest value.
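For clarity, DataFrame.idxmax returns the row label of each column's maximum, and df.index[...] turns those labels into an index to drop. A minimal sketch of the mechanics on a hypothetical three-row frame:
import pandas as pd

df = pd.DataFrame({'A': [1, 9, 3], 'B': [7, 2, 5]})
print(df.idxmax().tolist())            # [1, 0]: row label of each column's max
print(df.drop(df.index[df.idxmax()]))  # drops rows 1 and 0, leaving row 2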