My df looks like this,
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 1.0 2019-10-01 11.25
3 XYZ 2.0 2019-08-01 15.71
4 XYZ 3.0 2020-10-01 11.50
5 XYZ NaN NaN NaN
6 MMG 1.0 2021-10-01 11.75
7 MMG 2.0 2014-01-01 14.00
8 MMG 3.0 2021-10-01 12.50
9 MMG 1.0 2014-01-01 15.00
10 LKG NaN NaN NaN
11 LKG NaN NaN NaN
I need my output like this,
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 per_date_3 per_num_3
0 XYZ 1 2018-10-01 11.0 2 2017-08-01 15.25 NaN NaN NaN
1 XYZ 1 2019-10-01 11.25 2 2019-08-01 15.71 3 2020-10-01 11.50
2 XYZ NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 MMG 1 2021-10-01 11.75 2 2014-01-01 14.00 3 2021-10-01 12.50
5 MMG 1 2014-01-01 15.00 NaN NaN NaN NaN NaN NaN
6 LKG NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, the param column has repeating values, and the transposed column names are built from the per values. Also, a new record is created whenever the per values restart at 1. How can I achieve this?
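For reference, a minimal sketch that reconstructs the sample frame so the steps below are runnable (dtypes assumed; per_date kept as plain strings):
import numpy as np
import pandas as pd

# sample data from the question above
df = pd.DataFrame({
    'param': ['XYZ'] * 6 + ['MMG'] * 4 + ['LKG'] * 2,
    'per': [1, 2, 1, 2, 3, np.nan, 1, 2, 3, 1, np.nan, np.nan],
    'per_date': ['2018-10-01', '2017-08-01', '2019-10-01', '2019-08-01',
                 '2020-10-01', np.nan, '2021-10-01', '2014-01-01',
                 '2021-10-01', '2014-01-01', np.nan, np.nan],
    'per_num': [11.0, 15.25, 11.25, 15.71, 11.50, np.nan, 11.75, 14.00,
                12.50, 15.00, np.nan, np.nan],
})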
The main problem here is the NaNs in the last LKG group. First replace the missing values with a counter built from a per-group cumulative sum of the missing-value mask, and assign it to a new column per1:
s = df['per'].isna().groupby(df['param']).cumsum()
df = df.assign(per1=df['per'].fillna(s).astype(int))
print(df)
param per per_date per_num per1
0 XYZ 1.0 2018-10-01 11.00 1
1 XYZ 2.0 2017-08-01 15.25 2
2 XYZ 1.0 2019-10-01 11.25 1
3 XYZ 2.0 2019-08-01 15.71 2
4 XYZ 3.0 2020-10-01 11.50 3
5 XYZ NaN NaN NaN 1
6 MMG 1.0 2021-10-01 11.75 1
7 MMG 2.0 2014-01-01 14.00 2
8 MMG 3.0 2021-10-01 12.50 3
9 MMG 1.0 2014-01-01 15.00 1
10 LKG NaN NaN NaN 1
11 LKG NaN NaN NaN 2
Then create a MultiIndex: form group identifiers by comparing per1 to 1 and taking the cumulative sum, then reshape with DataFrame.unstack:
g = df['per1'].eq(1).cumsum()
df = df.set_index(['param', 'per1', g]).unstack(1).sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1, drop=True).reset_index()
print(df)
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 \
0 LKG NaN NaN NaN NaN NaN NaN NaN
1 MMG 1.0 2021-10-01 11.75 2.0 2014-01-01 14.00 3.0
2 MMG 1.0 2014-01-01 15.00 NaN NaN NaN NaN
3 XYZ 1.0 2018-10-01 11.00 2.0 2017-08-01 15.25 NaN
4 XYZ 1.0 2019-10-01 11.25 2.0 2019-08-01 15.71 3.0
5 XYZ NaN NaN NaN NaN NaN NaN NaN
per_date_3 per_num_3
0 NaN NaN
1 2021-10-01 12.5
2 NaN NaN
3 NaN NaN
4 2020-10-01 11.5
5 NaN NaN
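One caveat: unstack sorts the index, so the groups come out in LKG/MMG/XYZ order rather than first-appearance order. If the original order matters, a sketch to restore it, assuming the original long-format frame was saved beforehand under the (hypothetical) name df0:
# restore first-appearance order of param; df0 is the original long frame
order = {p: i for i, p in enumerate(df0['param'].unique())}
df = df.sort_values('param', key=lambda s: s.map(order), ignore_index=True)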
For the dataframe df1 as follows:
id products black metal non-ferrous metals precious metal
0 M0066350 copper NaN NaN NaN
1 M0066352 aluminum NaN NaN NaN
2 M0066353 gold NaN NaN NaN
3 M0066354 silver NaN NaN NaN
4 S0200837 soybean NaN NaN NaN
5 S0212350 Apple NaN NaN NaN
6 S0212351 iron ore NaN NaN NaN
7 S0212352 coke NaN NaN NaN
8 S0212353 others 1.0 NaN 1.0
and I hope to fill columns cols = ['black metal', 'non-ferrous metals', 'precious metal'] with 1s based on customized_dict:
customized_dict = {
    'black metal': ['iron ore', 'coke'],
    'non-ferrous metals': ['copper', 'aluminum'],
    'precious metal': ['gold', 'silver']
}
Note that the keys come from the column names of df1 and the values come from the products column of df1.
So my question is: how could I get the following output?
id products black metal non-ferrous metals precious metal
0 M0066350 copper NaN 1.0 NaN
1 M0066352 aluminum NaN 1.0 NaN
2 M0066353 gold NaN NaN 1.0
3 M0066354 silver NaN NaN 1.0
4 S0200837 soybean NaN NaN NaN
5 S0212350 Apple NaN NaN NaN
6 S0212351 iron ore 1.0 NaN NaN
7 S0212352 coke 1.0 NaN NaN
8 S0212353 others 1.0 NaN 1.0
EDIT: new data with duplicates in products column.
id products black metal non-ferrous metals precious metal
0 S0212350 Apple NaN NaN NaN
1 M0066352 aluminum NaN 1.0 NaN
2 S0212352 coke 1.0 NaN NaN
3 S0212354 coke 1.0 NaN NaN
4 M0066350 copper NaN 1.0 NaN
5 M0066353 gold NaN NaN 1.0
6 S0212351 iron ore 1.0 NaN NaN
7 S0212353 others 1.0 NaN 1.0
8 M0066354 silver NaN NaN 1.0
9 S0200837 soybean NaN NaN NaN
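For reference, a sketch reconstructing df1 so the answers below are runnable (dtypes assumed):
import numpy as np
import pandas as pd

# the three metal columns start out all-NaN, except the pre-filled 'others' row
df1 = pd.DataFrame({
    'id': ['M0066350', 'M0066352', 'M0066353', 'M0066354', 'S0200837',
           'S0212350', 'S0212351', 'S0212352', 'S0212353'],
    'products': ['copper', 'aluminum', 'gold', 'silver', 'soybean',
                 'Apple', 'iron ore', 'coke', 'others'],
    'black metal': [np.nan] * 8 + [1.0],
    'non-ferrous metals': [np.nan] * 9,
    'precious metal': [np.nan] * 8 + [1.0],
})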
Using a simple loop over the columns (via apply) and update:
customized_dict = {
    'black metal': ['iron ore', 'coke'],
    'non-ferrous metals': ['copper', 'aluminum'],
    'precious metal': ['gold', 'silver']
}
df.update(df.iloc[:, 2:]
            .apply(lambda c: c[df['products'].isin(customized_dict[c.name])]
                             .fillna(1)))
output:
id products black metal non-ferrous metals precious metal
0 M0066350 copper NaN 1.0 NaN
1 M0066352 aluminum NaN 1.0 NaN
2 M0066353 gold NaN NaN 1.0
3 M0066354 silver NaN NaN 1.0
4 S0200837 soybean NaN NaN NaN
5 S0212350 Apple NaN NaN NaN
6 S0212351 iron ore 1.0 NaN NaN
7 S0212352 coke 1.0 NaN NaN
8 S0212353 others 1.0 NaN 1.0
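Since apply here is effectively a per-column loop, an explicit loop sketch of the same idea (equivalent for this data because every masked cell is NaN; note that plain assignment overwrites, while fillna only fills missing values):
# set 1 wherever the product belongs to the column's list
for col in df.columns[2:]:
    df.loc[df['products'].isin(customized_dict[col]), col] = 1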
Use:
# list comprehension for MultiIndex Series with 1
L = [(x, k) for k, v in customized_dict.items() for x in v]
# reshape for DataFrame
df2 = pd.Series(1, index=pd.MultiIndex.from_tuples(L)).unstack()
# fill missing values by aligning on the products column converted to index
df = (df1.set_index('products')
         .combine_first(df2)
         .rename_axis('products')
         .reset_index()
         .reindex(df1.columns, axis=1))
print(df)
id products black metal non-ferrous metals precious metal
0 M0066350 copper NaN 1.0 NaN
1 M0066352 aluminum NaN 1.0 NaN
2 M0066353 gold NaN NaN 1.0
3 M0066354 silver NaN NaN 1.0
4 S0200837 soybean NaN NaN NaN
5 S0212350 Apple NaN NaN NaN
6 S0212351 iron ore 1.0 NaN NaN
7 S0212352 coke 1.0 NaN NaN
8 S0212353 others 1.0 NaN 1.0
Create a reversed dict mapping and use crosstab to build the update array, then fillna:
reversed_dict = {v: k for k, l in customized_dict.items() for v in l}
df1 = df1.fillna(pd.crosstab(df1.index, df1['products'].map(reversed_dict), values=1, aggfunc='mean'))
print(df1)
# Output
id products black metal non-ferrous metals precious metal
0 M0066350 copper NaN 1.0 NaN
1 M0066352 aluminum NaN 1.0 NaN
2 M0066353 gold NaN NaN 1.0
3 M0066354 silver NaN NaN 1.0
4 S0200837 soybean NaN NaN NaN
5 S0212350 Apple NaN NaN NaN
6 S0212351 iron ore 1.0 NaN NaN
7 S0212352 coke 1.0 NaN NaN
8 S0212353 others 1.0 NaN 1.0
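An equivalent sketch using get_dummies instead of crosstab; the zeros have to be masked out so that fillna does not fill them:
reversed_dict = {v: k for k, lst in customized_dict.items() for v in lst}
# one-hot encode the mapped categories, then keep only the 1s
onehot = pd.get_dummies(df1['products'].map(reversed_dict), dtype=float)
df1 = df1.fillna(onehot.where(onehot.eq(1)))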
This may be a slightly different set of requirements.
I have a dataframe:
A B C
1 2 4
2 4 6
8 10 12
1 3 5
and a dynamic list (the length may vary):
lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
I wish to add each value of this list to the C column, generating one new dataframe column per value. How can I do this?
A B C C_1 C_2 .......................... C_11
1 2 4 5 6 15
2 4 6 7 8 17
8 10 12 13 14 23
1 3 5 6 7 16
Thank you for your support
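For reference, a sketch of the sample inputs (the expected dtypes are assumed):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 8, 1], 'B': [2, 4, 10, 3], 'C': [4, 6, 12, 5]})
dynamic_vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]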
You can use a dict comprehension to create a simple dataframe:
dynamic_vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
df2 = pd.concat(
    [df, pd.DataFrame({f'C_{val}': [0] for val in dynamic_vals})],
    axis=1).fillna(0)
print(df2)
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 4 6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 8 10 12 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 1 3 5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Or you could use assign, as suggested by @piRSquared:
df2 = df.assign(**dict((f'C_{i}', np.nan) for i in dynamic_vals))
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 8 10 12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1 3 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Or a better and simpler solution suggested by @piRSquared:
df.join(pd.DataFrame(np.nan, df.index, dynamic_vals).add_prefix('C_'))
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 8 10 12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1 3 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Edit: using df.join with a dictionary comprehension:
df.join(pd.DataFrame({f'C_{val}' : df['C'].values + val for val in dynamic_vals }))
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 5 6 7 8 9 10 11 12 13 14 15
1 2 4 6 7 8 9 10 11 12 13 14 15 16 17
2 8 10 12 13 14 15 16 17 18 19 20 21 22 23
3 1 3 5 6 7 8 9 10 11 12 13 14 15 16
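A broadcast-based sketch of the same result, building all the new columns in one NumPy step (assumes df and dynamic_vals as above; added and out are hypothetical names):
import numpy as np

# column vector of C plus row vector of offsets -> one (rows x offsets) block
added = pd.DataFrame(df['C'].to_numpy()[:, None] + np.array(dynamic_vals),
                     index=df.index,
                     columns=[f'C_{v}' for v in dynamic_vals])
out = df.join(added)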
I have a dataframe like this,
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 3.0 2019-10-01 11.25
3 MMG 1.0 2019-08-01 15.71
4 MMG 2.0 2020-10-01 11.50
5 MMG 3.0 2021-10-01 11.75
6 MMG 4.0 2014-01-01 14.00
I would like to have an output like this,
param per_1 per_2 per_3 per_4 per_date_1 per_date_2 per_date_3 per_date_4 per_num_1 per_num_2 per_num_3 per_num_4
0 XYZ 1 2 3 NaN 2018-10-01 2017-08-01 2019-10-01 NaN 11.0 15.25 11.25 NaN
1 MMG 1 2 3 4 2019-08-01 2020-10-01 2021-10-01 2014-01-01 15.71 11.50 11.75 14.00
I tried the following,
df.vstack().reset_index().drop('level_1',axis=0)
This is not giving me the output I need.
As you can see, the per column has incremental values that can become the column names when I transpose them.
Any suggestion would be great.
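For reference, a sketch reconstructing this sample frame (per_date assumed to be plain strings):
import pandas as pd

df = pd.DataFrame({
    'param': ['XYZ', 'XYZ', 'XYZ', 'MMG', 'MMG', 'MMG', 'MMG'],
    'per': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 4.0],
    'per_date': ['2018-10-01', '2017-08-01', '2019-10-01', '2019-08-01',
                 '2020-10-01', '2021-10-01', '2014-01-01'],
    'per_num': [11.0, 15.25, 11.25, 15.71, 11.50, 11.75, 14.00],
})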
Use GroupBy.cumcount for the counter, reshape with DataFrame.unstack, and finally flatten the column names with f-strings:
df = df.set_index(['param', df.groupby('param').cumcount().add(1)]).unstack()
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print(df)
param per_1 per_2 per_3 per_4 per_date_1 per_date_2 per_date_3 \
0 MMG 1.0 2.0 3.0 4.0 2019-08-01 2020-10-01 2021-10-01
1 XYZ 1.0 2.0 3.0 NaN 2018-10-01 2017-08-01 2019-10-01
per_date_4 per_num_1 per_num_2 per_num_3 per_num_4
0 2014-01-01 15.71 11.50 11.75 14.0
1 NaN 11.00 15.25 11.25 NaN
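An equivalent sketch using DataFrame.pivot instead of set_index/unstack (counter, n and wide are hypothetical names):
# the per-group counter becomes the inner column level, then flatten as before
counter = df.groupby('param').cumcount().add(1)
wide = df.assign(n=counter).pivot(index='param', columns='n')
wide.columns = [f'{a}_{b}' for a, b in wide.columns]
wide = wide.reset_index()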
I am trying to do some transformations and am kind of stuck. Hopefully somebody can help me out here.
l0 a b c d e f
l1 1 2 1 2 1 2 1 2 1 2 1 2
0 NaN NaN NaN NaN 93.4 NaN NaN NaN NaN NaN 19.0 28.9
1 NaN 9.0 NaN NaN 43.5 32.0 NaN NaN NaN NaN NaN 3.4
2 NaN 5.0 NaN NaN 93.3 83.6 NaN NaN NaN NaN 59.5 28.2
3 NaN 19.6 NaN NaN 72.8 47.4 NaN NaN NaN NaN 31.5 67.2
4 NaN NaN NaN NaN NaN 62.5 NaN NaN NaN NaN NaN 1.8
I have a dataframe (shown above), and as you can see there are multiple NaNs under a MultiIndex column. Selecting the columns along level = 0 (i.e. l0),
I would like to drop an entire column if all of its values are NaN. So, in this case the columns
l0 = ['b', 'd', 'e'] # drop-cols
should be dropped from the DataFrame:
l0 a c f
l1 1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
This gives me the dataframe shown above. I would then like to slide values along the rows if all the entries before them are null (or swap values between adjacent columns), e.g. looking at index = 0, i.e. the first row:
l0 a c f
l1 1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
Since all the values in col a are null,
I would like to first slide/swap the values between col a and col c,
and then repeat the same for the columns to its right, i.e. replace the entries in col c with those of col f and set all entries in col f to NaN, giving me:
l0 a c f
l1 1 2 1 2 1 2
0 93.4 NaN 19.0 28.9 NaN NaN
This is really to save memory when processing and storing information, as interchanging the labels ['a', 'b', 'c', ...] does not change the meaning of the data.
EDIT: Any ideas for (2)?
I have managed to solve (1) with the following code:
# (1): drop every level-0 column group whose values are all NaN
for c in df.columns.get_level_values(0).unique():
    if df[c].isna().all().all():
        df = df.drop(columns=[c])
df
You can do this with all:
s = df.isnull().all(level=0, axis=1).all()
df.drop(s.index[s], axis=1, level=0)
Out[55]:
a c f
1 2 1 2 1 2
l1
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
groupby and filter
df.groupby(axis=1, level=0).filter(lambda d: ~d.isna().all().all())
a c f
1 2 1 2 1 2
0 NaN NaN 93.4 NaN 19.0 28.9
1 NaN 9.0 43.5 32.0 NaN 3.4
2 NaN 5.0 93.3 83.6 59.5 28.2
3 NaN 19.6 72.8 47.4 31.5 67.2
4 NaN NaN NaN 62.5 NaN 1.8
A little bit shorter
df.groupby(axis=1, level=0).filter(lambda d: ~np.all(d.isna()))
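Note that the level argument of all and groupby(axis=1) are deprecated in newer pandas; a sketch of the same idea that avoids both by grouping the transpose instead:
# per level-0 label: True if every cell in that column group is NaN
mask = df.isna().T.groupby(level=0).all().all(axis=1)
df.drop(columns=mask.index[mask], level=0)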
I have a DataFrame, and I want to create new columns based on the values of one of its columns; in each of these new columns I want the values to be the count of repetitions of Plate over time.
So I have this DataFrame:
Veh_Tra.head():
Plate EURO
Timestamp
2013-11-01 00:00:00 NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 0.0
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad 6.0
So I have the EURO column, whose value counts look like this:
Veh_Tra.EURO.value_counts():
5 1590144
6 745865
4 625512
0 440834
3 243800
2 40664
7 14207
1 4301
And this is my desired output:
Plate EURO_1 EURO_2 EURO_3 EURO_4 EURO_5 EURO_6 EURO_7
Timestamp
2013-11-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 1.0 NaN NaN NaN NaN NaN NaN
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad NaN NaN 1.0 NaN NaN NaN NaN
So basically, what I want is a count of each time a Plate value repeats itself for a specific EURO type over time.
Any suggestions would be much appreciated, thank you.
This is more of a get_dummies problem:
s = df.dropna().EURO.astype(int).astype(str).str.get_dummies().add_prefix('EURO')
df = pd.concat([df, s], axis=1, sort=True)
df
Out[259]:
Plate EURO EURO0 EURO6
2013-11-01 00:00:00 NaN NaN NaN NaN
2013-11-01 01:00:00 dcc2f657e897ffef752003469c688381 0.0 1.0 0.0
2013-11-01 02:00:00 a5ac0c2f48ea80707621e530780139ad 6.0 0.0 1.0
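The question's desired headers run EURO_1 to EURO_7 with NaN instead of 0 (even though the data also contains EURO 0); a small variant sketch that gets closer to that display, keeping whatever categories exist in the data (out is a hypothetical result name):
# underscore prefix, and 0 -> NaN to match the NaN-style desired output
s = (df.dropna()['EURO'].astype(int).astype(str)
       .str.get_dummies().add_prefix('EURO_'))
s = s.where(s.eq(1))  # keep the 1s, turn the 0s into NaN
out = pd.concat([df['Plate'], s], axis=1, sort=True)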