Melting a dataframe based on a flag - python-3.x

I have a dataframe like this:
df = pd.DataFrame({'time': ['01-01-2020', '02-01-2020', '01-01-2020', '02-01-2020'],
                   'level': ['q', 'q', 'r', 'r'],
                   'a': [1, 2, 3, 4], 'b': [12, 34, 54, 67], 'c': [18, 29, 39, 47],
                   'a_1': [0.1, 0.2, 0.3, 0.4], 'a_2': [0, 1, 0, 1],
                   'b_1': [0.28, 0.47, 0.02, 0.05], 'b_2': [1, 1, 0, 1],
                   'c_1': [0.18, 0.40, 0.12, 0.01], 'c_2': [1, 1, 0, 0]})
>> time level a b c a_1 a_2 b_1 b_2 c_1 c_2
0 01-01-2020 q 1 12 18 0.1 0 0.28 1 0.18 1
1 02-01-2020 q 2 34 29 0.2 1 0.47 1 0.40 1
2 01-01-2020 r 3 54 39 0.3 0 0.02 0 0.12 0
3 02-01-2020 r 4 67 47 0.4 1 0.05 1 0.01 0
I wish to melt the data with time and level as the index and all other columns as rows, keeping only the rows whose corresponding flag column is 1. E.g. I wish to have the values of a and a_1 listed under values and items only where a_2 is 1.
Desired output:
>> time level column values items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
I can get all the values irrespective of the flags and then filter for flag == 1, but I'm not sure how to "melt"/"unstack" in this case. I tried a lot of ways but in vain. Please help me out here.
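For reference, a minimal sketch of that melt-then-filter route (my illustration, not from the post; the intermediate names v, flags and items are hypothetical):
i = ['time', 'level']
v = df.melt(i, ['a', 'b', 'c'], var_name='column')
flags = df.melt(i, ['a_2', 'b_2', 'c_2'], value_name='flag')['flag']
items = df.melt(i, ['a_1', 'b_1', 'c_1'], value_name='items')['items']
# melt emits one block per column in the same row order, so positions align
out = v.assign(items=items.values)[flags.eq(1).values]
out = out.sort_values(i, ignore_index=True)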

Let's try with melt:
i, c = ['time', 'level'], pd.Index(['a', 'b', 'c'])

# mask the values where flag == 0
m = df[c + '_1'].mask(df[c + '_2'].eq(0).values)

# melt the dataframe & assign the items column
s = df[[*i, *c]].melt(i, var_name='columns')\
                .assign(items=m.values.T.reshape(-1))

# drop the NaN values and sort the dataframe
s = s.dropna(subset=['items']).sort_values(i, ignore_index=True)
Details:
mask the values in the columns ending with the suffix _1 where the value in the corresponding flag column equals 0:
a_1 b_1 c_1
0 NaN 0.28 0.18
1 0.2 0.47 0.40
2 NaN NaN NaN
3 0.4 0.05 NaN
melt the dataframe containing the columns a, b, c, then reshape the masked values and assign them as a new items column in the melted dataframe:
time level columns value items
0 01-01-2020 q a 1 NaN
1 02-01-2020 q a 2 0.20
2 01-01-2020 r a 3 NaN
3 02-01-2020 r a 4 0.40
4 01-01-2020 q b 12 0.28
5 02-01-2020 q b 34 0.47
6 01-01-2020 r b 54 NaN
7 02-01-2020 r b 67 0.05
8 01-01-2020 q c 18 0.18
9 02-01-2020 q c 29 0.40
10 01-01-2020 r c 39 NaN
11 02-01-2020 r c 47 NaN
Lastly, drop the NaN values in items and sort on time and level to get the final result:
time level columns value items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
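An equivalent selection can also be sketched with a MultiIndex stack (an alternative of mine, not part of the original answer; assumes the sample df above):
t = df.set_index(['time', 'level'])
t.columns = pd.MultiIndex.from_tuples(
    [(c[0], c[2:] or 'values') for c in t.columns], names=['column', None])
# stacking the letter level leaves the columns 'values', '1' (items), '2' (flag)
s = t.stack('column').rename(columns={'1': 'items', '2': 'flag'})
s = s.loc[s['flag'].eq(1), ['values', 'items']].reset_index()
s = s.sort_values(['time', 'level'], ignore_index=True)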

There may be a more elegant way, but this works. Extract the data for each column name (a, b, c), keep the rows whose flag is set to 1, and concatenate the results.
df.set_index(['time', 'level'], inplace=True)
parts = []
for name in 'a', 'b', 'c':
    d = df[[name, f'{name}_1', f'{name}_2']]\
        .rename(columns={name: 'values', f'{name}_1': 'items', f'{name}_2': 'flag'})
    d['column'] = name
    parts.append(d[d.flag == 1])

pd.concat(parts)[['column', 'values', 'items']].reset_index()

Step 1: rename the columns so that the numeric suffixes come before the letters:
res = df.copy()
res.columns = ["_".join(entry.split("_")[::-1]) for entry in res]
Step 2: rename the columns again so that "num" is prefixed when the column is one of ("a", "b", "c"):
res.columns = [f"num_{letter}" if letter in ("a", "b", "c")
else letter
for letter in res]
res
time level num_a num_b num_c 1_a 2_a 1_b 2_b 1_c 2_c
0 01-01-2020 q 1 12 18 0.1 0 0.28 1 0.18 1
1 02-01-2020 q 2 34 29 0.2 1 0.47 1 0.40 1
2 01-01-2020 r 3 54 39 0.3 0 0.02 0 0.12 0
3 02-01-2020 r 4 67 47 0.4 1 0.05 1 0.01 0
Step 3: use pandas.wide_to_long to reshape the data, filter for rows where the flag equals 1, rename the columns, and finally reset the index:
(
    pd.wide_to_long(
        res,
        stubnames=["num", "1", "2"],
        i=["time", "level"],
        j="column",
        sep="_",
        suffix=".",
    )
    # this is where the filter for rows equal to 1 occurs
    .query("`2` == 1")
    .drop(columns="2")
    .set_axis(["values", "items"], axis="columns")
    .reset_index()
)
time level column values items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
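A note on suffix: wide_to_long treats it as a regular expression (the default is '\d+', which matches only digits), so suffix="." is what lets the single letters a/b/c through. An explicit equivalent, assuming only these three letters, would be:
pd.wide_to_long(res, stubnames=["num", "1", "2"],
                i=["time", "level"], j="column",
                sep="_", suffix="[abc]")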
This is another way, with the same idea of renaming the columns, which makes it easy to reshape with wide_to_long:
import re

result = df.rename(
    columns=lambda x: f"values_{x}"
    if x in ("a", "b", "c")
    else f"items_{x[0]}"
    if re.search(".1$", x)
    else f"equals1_{x[0]}"
    if re.search(".2$", x)
    else x
)
(
    pd.wide_to_long(
        result,
        stubnames=["values", "items", "equals1"],
        i=["time", "level"],
        j="column",
        sep="_",
        suffix=".",
    )
    .query("equals1 == 1")
    .iloc[:, :-1]
    .reset_index()
)
Another option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(index=['time', 'level'],
               names_to=["column", ".value"],
               names_pattern=r"(.)_?(.?)",
               sort_by_appearance=True)
 .query('`2` == 1')
 .drop(columns='2')
 .rename(columns={'': 'values', '1': 'items'})
)
time level column values items
1 01-01-2020 q b 12 0.28
2 01-01-2020 q c 18 0.18
3 02-01-2020 q a 2 0.20
4 02-01-2020 q b 34 0.47
5 02-01-2020 q c 29 0.40
9 02-01-2020 r a 4 0.40
10 02-01-2020 r b 67 0.05
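The names_pattern regex does the routing here: the first capture group becomes the column entry and the second becomes the .value label, with the bare a/b/c columns yielding an empty second group (hence the rename of '' to values). A quick illustrative check of the split:
import pandas as pd

# 'a' -> ('a', ''), 'a_1' -> ('a', '1'), 'a_2' -> ('a', '2')
pd.Series(['a', 'a_1', 'a_2']).str.extract(r"(.)_?(.?)")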

Related

Python pandas shift null in certain columns Only

I have a dataframe like this (blank cells are missing values):
A B C D E F G H I J
0 A.1 Data Data 223 52
1 A.2 Data Data Data 12 6
2 A.4 Data 32 365
3 A.5 Data 100 88
4 A.6 Data 654 98
5 A.7 Data 356 56
And my desired output is like this (the missing values shifted to the end of columns A to E):
A B C D E F G H I J
0 A.1 Data Data 223 52
1 A.2 Data Data Data 12 6
2 A.4 Data 32 365
3 A.5 Data 100 88
4 A.6 Data 654 98
5 A.7 Data 356 56
Only columns A to E should shift their nulls. I have a current script using a lambda, but it shifts the null values to the last columns across the whole dataframe. I need it for certain columns only. Can anyone help me? Thank you!
def shift_null(arr):
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]

df = df.T.apply(lambda arr: shift_null(arr)).T
You can remove missing values per row with Series.dropna, add back possibly all-missing columns with DataFrame.reindex, and then set the column names with DataFrame.set_axis:
cols = ['A', 'B', 'C', 'D', 'E']
df[cols] = (df[cols].apply(lambda x: pd.Series(x.dropna().tolist()), axis=1)
                    .reindex(range(len(cols)), axis=1)
                    .set_axis(cols, axis=1))
print(df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
Your solution can be changed by removing the transposing and passing result_type='expand' to DataFrame.apply:
cols = ['A', 'B', 'C', 'D', 'E']

def shift_null(arr):
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]

df[cols] = df[cols].apply(lambda arr: shift_null(arr), axis=1, result_type='expand')
print(df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
Another idea is sorting with the key parameter:
cols = ['A', 'B', 'C', 'D', 'E']
df[cols] = df[cols].apply(lambda x: x.sort_values(key=lambda x: x.isna()).tolist(),
                          axis=1, result_type='expand')
print(df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
A solution that reshapes with DataFrame.stack, adds a counter for the new column names, and reshapes back with Series.unstack:
s = df[cols].stack().droplevel(1)
s.index = [s.index, s.groupby(level=0).cumcount()]
df[cols] = s.unstack().rename(dict(enumerate(cols)), axis=1).reindex(cols, axis=1)
print (df)
A B C D E F G
0 A.1 Data Data NaN NaN 223 52
1 A.2 Data Data Data NaN 12 6
2 A.4 Data NaN NaN NaN 32 365
3 A.5 Data NaN NaN NaN 100 88
4 A.6 Data NaN NaN NaN 654 98
5 A.7 Data NaN NaN NaN 356 56
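For wider frames, a vectorized alternative can be sketched with NumPy (my addition, assuming cols as above): a stable argsort on the null mask pushes the NaNs to the right while preserving the order of the kept values.
import numpy as np

a = df[cols].to_numpy()
# sorting False (non-null) before True (null), stably, shifts values left
order = pd.isna(a).argsort(axis=1, kind='stable')
df[cols] = np.take_along_axis(a, order, axis=1)
print(df)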

Functional Programming: How does one create a new column in a multi-index data frame that is a function of another column?

Suppose the simplified dataframe below (the actual df is much, much bigger). How does one assign values to a new column f such that f is a function of another column (e.g. e)?
df = pd.DataFrame([[1,2,3,4], [5,6,7,8], [9,10,11,12], [13,14,15,16]])
df.columns = pd.MultiIndex.from_tuples((("a", "d"), ("a", "e"), ("b", "d"), ("b","e")))
df
a b
d e d e
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Desired Output:
a b
d e f d e f
0 1 2 nan 3 4 nan
1 5 6 1.10 7 8 0.69
2 9 10 0.51 11 12 0.41
3 13 14 0.34 15 16 0.29
where column f is computed as np.log(df['e']).diff()
You could access the MultiIndex column using loc, then use the functions directly on the sliced column, then join it back to df:
import numpy as np
df = (df.join(np.log(df.loc[:, (slice(None), 'e')])
                .diff().round(2).rename(columns={'e': 'f'}, level=1))
        .sort_index(axis=1))
Output:
a b
d e f d e f
0 1 2 NaN 3 4 NaN
1 5 6 1.10 7 8 0.69
2 9 10 0.51 11 12 0.41
3 13 14 0.34 15 16 0.29
Or build it per top-level group with a dict comprehension and concat (note f is computed from each group's e column):
df = {c: df[c].assign(f=np.log(df[(c, 'e')]).diff()) for c in df.columns.levels[0]}
df = pd.concat([df[c] for c in df.keys()], axis=1, keys=df.keys())
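A compact sketch of the same idea with DataFrame.xs, grabbing every e column at once (my variant, starting again from the original df in the question):
import numpy as np

f = np.log(df.xs('e', axis=1, level=1)).diff().round(2)
# label the new columns (a, f), (b, f) and slot them in order
f.columns = pd.MultiIndex.from_product([f.columns, ['f']])
df = df.join(f).sort_index(axis=1)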

Replace given column's first row value with NaN for each group in Pandas

How can I replace the first row's pct value with NaN for each group of city and district? Thank you.
city district date pct
0 a b 2019/8/1 0.15
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 0.03
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
I can only get the first row's pct value for the whole dataframe with df['pct'].iloc[0].
My desired output will like this:
city district date pct
0 a b 2019/8/1 NaN
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 NaN
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
Use Series.where + DataFrame.duplicated:
df['pct'] = df['pct'].where(df.duplicated(subset=['city', 'district']))
print(df)
city district date pct
0 a b 2019/8/1 NaN
1 a b 2019/9/1 0.12
2 a b 2019/10/1 0.25
3 c d 2019/7/1 NaN
4 c d 2019/8/1 -0.36
5 c d 2019/9/1 0.57
Detail:
df.duplicated(subset = ['city','district'])
0 False
1 True
2 True
3 False
4 True
5 True
dtype: bool
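An equivalent sketch with GroupBy.cumcount, which flags each group's first row explicitly (an illustrative alternative, not from the answer above):
import numpy as np

# cumcount numbers rows within each group; 0 marks the first row
first = df.groupby(['city', 'district']).cumcount().eq(0)
df.loc[first, 'pct'] = np.nan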

Rename column index from 0 to last column pandas

I have a pandas data frame dat as below:
0 1 0 1 0 1
0 A 23 0.1 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.2 235 87 5
3 D 13 0.8 567 42 6
4 E 5 0.9 356 12 2
As you can see above, the column labels are 0, 1, 0, 1, 0, 1, etc. I want to rename them back to labels starting from 0, 1, 2, 3, 4, ... and I did the following:
dat = dat.reset_index(drop=True)
The labels were not changed. How do I rename the columns in this case? Thanks in advance.
reset_index resets the row index, not the column labels; assign a fresh range to the columns directly:
dat.columns = range(dat.shape[1])
There are quite a few ways:
dat = dat.rename(columns = lambda x: dat.columns.get_loc(x))
Or
dat = dat.rename(columns = dict(zip(dat.columns, range(dat.shape[1]))))
Or
dat = dat.set_axis(pd.RangeIndex(dat.shape[1]), axis=1)
0 1 2 3 4 5
0 A 23 0.10 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.20 235 87 5
3 D 13 0.80 567 42 6
4 E 5 0.90 356 12 2

Data Cleaning Python: Replacing the values of a column not within a range with NaN and then dropping the rows which contain NaN

I am doing some research and need to delete the rows containing values which are not in a specific range, using Python.
My Dataset in Excel:
I want to replace the values of column A not within the range 1-20 with NaN, the values of column B not within the range 21-40 with NaN, and so on. Then I want to drop/delete the rows which contain the NaN values.
Expected output should be like:
You can try this to solve your problem. Here, I simulated your problem and solved it with the code below:
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
data['A'] = data['A'].apply(lambda x: np.nan if x in range(1,10,1) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in range(10,20,1) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in range(20,30,1) else x)
print(data)
data = data.dropna()
print(data)
Original data:
A B C
0 1 10 20
1 2 11 22
2 4 15 25
3 8 20 30
4 12 25 35
5 18 40 55
6 20 45 60
Output with NaN:
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.0 30.0
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Final Output:
A B C
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Try this for non-integer numbers:
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
data['A'] = data['A'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(1.00,10.00,0.01)) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(10.00,20.00,0.01)) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(20.00,30.00,0.01)) else x)
print(data)
data = data.dropna()
print(data)
Output:
A B C
0 1.25 10.56 20.11
1 2.39 11.19 22.92
2 4.00 15.65 25.27
3 8.89 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
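A vectorized sketch with Series.between sidesteps enumerating float values entirely and implements the ranges as stated in the question, keeping in-range values and dropping the rest (reusing data from the snippet above; the bounds dict is my assumption from the question's ranges):
bounds = {'A': (1, 20), 'B': (21, 40), 'C': (41, 60)}
for col, (lo, hi) in bounds.items():
    # keep values inside [lo, hi]; everything else becomes NaN
    data[col] = data[col].where(data[col].between(lo, hi))
data = data.dropna()
print(data)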
Try this:
df = df.drop(df.index[df.idxmax()])
O/P:
A B C D
0 1 21 41 61
1 2 22 42 62
2 3 23 43 63
3 4 24 44 64
4 5 25 45 65
5 6 26 46 66
6 7 27 47 67
7 8 28 48 68
8 9 29 49 69
13 14 34 54 74
14 15 35 55 75
15 16 36 56 76
16 17 37 57 77
17 18 38 58 78
18 19 39 59 79
19 20 40 60 80
Use idxmax and drop the rows at the returned indices.
