How to split every row in dataframe into two with some features? [closed] - python-3.x

I have this dataframe:
A C1 C2
a1 c1 c3
a2 c2 c4
Columns C1 and C2 have the same type.
I want to get this:
A C
a1 c1
a1 c3
a2 c2
a2 c4
How can I do this?
UPD:
An answer gave me this:
df_final = df.set_index('A').stack().droplevel(1).rename('C').reset_index()
Out[604]:
A C
0 a1 c1
1 a1 c3
2 a2 c2
3 a2 c4
But what should I do if I want to split a dataframe like this?
A B C1 C2 C3 C4
a1 b1 c1 c2 c3 c4
a2 b2 c5 c6 c7 c8
and get this:
A B C1 C2
a1 b1 c1 c2
a1 b1 c3 c4
a2 b2 c5 c6
a2 b2 c7 c8

Edit 2: If you have an even number of Cx columns, you can use numpy to keep it simple:
import numpy as np
import pandas as pd
cols = ['C1','C2','C3','C4']
# integer division, because repeat() needs an integer count
df1 = df.loc[df.index.repeat(len(cols) // 2), ['A','B']].reset_index(drop=True)
df_final = df1.join(pd.DataFrame(df[cols].to_numpy().reshape(-1,2), columns=['C1','C2']))
Out[698]:
A B C1 C2
0 a1 b1 c1 c2
1 a1 b1 c3 c4
2 a2 b2 c5 c6
3 a2 b2 c7 c8
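The reshape idea above can be wrapped in a small helper. This is only a sketch: the function name split_rows and its parameters are made up for illustration, and it assumes the dataframe has a unique index and that the number of value columns is a multiple of the target width.

```python
import pandas as pd

def split_rows(df, id_cols, value_cols, width, new_names):
    """Split each row into len(value_cols) // width rows of `width` values.

    Hypothetical helper; assumes df has a unique index.
    """
    if len(value_cols) % width != 0:
        raise ValueError("len(value_cols) must be a multiple of width")
    per_row = len(value_cols) // width
    # Repeat the identifier columns once per output row.
    ids = df.loc[df.index.repeat(per_row), id_cols].reset_index(drop=True)
    # Row-major reshape keeps the left-to-right order of value_cols.
    values = df[value_cols].to_numpy().reshape(-1, width)
    return ids.join(pd.DataFrame(values, columns=new_names))

df = pd.DataFrame({
    "A": ["a1", "a2"], "B": ["b1", "b2"],
    "C1": ["c1", "c5"], "C2": ["c2", "c6"],
    "C3": ["c3", "c7"], "C4": ["c4", "c8"],
})
df_final = split_rows(df, ["A", "B"], ["C1", "C2", "C3", "C4"], 2, ["C1", "C2"])
print(df_final)
```

The explicit length check is there because reshape would otherwise fail with a less helpful message, or silently misalign values when the total count happens to divide evenly.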
Edit for updated sample:
To split multiple Cx columns in groups of 2, you need wide_to_long. However, before using it, you need to rename the columns into the stub/suffix format that wide_to_long expects:
df1 = df.set_index(['A','B'])
stub_cols = (np.arange(df1.columns.size) % 2).astype(str)
suff_cols = (np.arange(df1.columns.size) // 2).astype(str)
d = dict(zip(stub_cols, ['C1', 'C2']))
df1.columns = pd.Series(stub_cols) + '_' + suff_cols
df_final = pd.wide_to_long(df1.reset_index(),
                           i=['A','B'],
                           j='num',
                           stubnames=['0','1'],
                           sep='_').droplevel(-1).rename(d, axis=1).reset_index()
Out[680]:
A B C1 C2
0 a1 b1 c1 c2
1 a1 b1 c3 c4
2 a2 b2 c5 c6
3 a2 b2 c7 c8
Give this a try
df_final = df.set_index('A').stack().droplevel(1).rename('C').reset_index()
Out[604]:
A C
0 a1 c1
1 a1 c3
2 a2 c2
3 a2 c4

print(
pd.concat([df.A, df[['C1', 'C2']].apply(list, axis=1)], axis=1).explode(0).rename(columns={0:'C'})
)
Prints:
A C
0 a1 c1
0 a1 c3
1 a2 c2
1 a2 c4
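Another option, not taken from the answers above, is DataFrame.melt. A minimal sketch, assuming pandas 1.1 or later (for the ignore_index parameter) and the two-column sample from the question:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a1", "a2"], "C1": ["c1", "c2"], "C2": ["c3", "c4"]})

long_df = (
    # Keep the original row labels so rows can be regrouped afterwards.
    df.melt(id_vars="A", value_name="C", ignore_index=False)
      # A stable sort keeps C1 before C2 within each original row.
      .sort_index(kind="mergesort")
      # "variable" is melt's default name for the former column labels.
      .drop(columns="variable")
      .reset_index(drop=True)
)
print(long_df)
```

Sorting on the original index rather than on column A means the result does not depend on A being sorted or unique.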

Related

Pandas rename one value as other value in a column and add corresponding values in the other column

So, I have a pandas data frame:
df =
a b c
a1 b1 c1
a2 b2 c1
a2 b3 c2
a2 b4 c2
I want to rename a2 to a1, then group by a and c and add up the corresponding values of b:
df =
a b c
a1 b1+b2 c1
a1 b3+b4 c2
So, something like this
df =
a value c
a1 10 c1
a2 20 c1
a2 50 c2
a2 60 c2
df =
a value c
a1 30 c1
a1 110 c2
How to do this?
What about
>>> res = df.replace({"a": {"a2": "a1"}}).groupby(["a", "c"], as_index=False).sum()
>>> res
a c value
0 a1 c1 30
1 a1 c2 110
This first replaces "a2" with "a1" in column a only, then groups by a and c and sums.
To get the original column order back, we can reindex:
>>> res.reindex(df.columns, axis=1)
a value c
0 a1 30 c1
1 a1 110 c2
Try this:
df.groupby([df['a'].replace({'a2':'a1'}),'c']).sum().reset_index()
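If b really holds strings, as in the first version of the question (b1+b2), the same pattern works with a string join instead of a sum. A small sketch, assuming that literal "b1+b2" text is what is wanted:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a1", "a2", "a2", "a2"],
    "b": ["b1", "b2", "b3", "b4"],
    "c": ["c1", "c1", "c2", "c2"],
})

res = (
    df.replace({"a": {"a2": "a1"}})   # rename a2 to a1 in column a only
      .groupby(["a", "c"])["b"]
      .agg("+".join)                  # concatenate the strings in each group
      .reset_index()
      .reindex(df.columns, axis=1)    # restore the original column order
)
print(res)
```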

Modify Dataframe based on Priority

I have a df like this:
ID  A1  A2  A3  A4  A5
1   1       2       3
2   1   2       3
3   2   1
4   3       1       2
5
For every ID, I have 5 columns A1 to A5 (in reality I have many more), and the values give the priority rank of each column for that ID.
For example, ID 1 has A1, A3 and A5 as priorities, ID 3 has only two (A2 and A1), and ID 5 has no priorities.
Resultant DF
ID  Priority_1  Priority_2  Priority_3
1   A1          A3          A5
2   A1          A2          A4
3   A2          A1
4   A3          A5          A1
5
I have tried to do this with melt and pivot, following this and this_1 and many more, but I cannot get exactly this resultant df.
Any help would be appreciated, and I can clarify anything that is unclear!
Use DataFrame.melt, sort with DataFrame.sort_values and remove missing rows with DataFrame.dropna. Then add a counter column per ID and keep only the first N rows per ID by boolean indexing with Series.le (less than or equal). Finally, use DataFrame.pivot with DataFrame.add_prefix, and DataFrame.reindex to add back only the IDs that have no priorities:
N = 3
df1 = df.melt('ID').sort_values(['ID','value']).dropna(subset=['value'])
df1['new'] = df1.groupby('ID').cumcount().add(1)
df1 = df1[df1['new'].le(N)]
df2 = df1.pivot(index='ID', columns='new', values='variable').add_prefix('Priority_').reindex(df['ID'])
print (df2)
new Priority_1 Priority_2 Priority_3
ID
1           A1         A3         A5
2           A1         A2         A4
3           A2         A1        NaN
4           A3         A5         A1
5          NaN        NaN        NaN
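For anyone who wants to run this end to end, here is a self-contained sketch of the same melt and pivot steps. The input frame is my reconstruction of the question's table (blank cells become NaN), and the function name top_priorities and the column name rank are made up for illustration.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "A1": [1, 1, 2, 3, nan],
    "A2": [nan, 2, 1, nan, nan],
    "A3": [2, nan, nan, 1, nan],
    "A4": [nan, 3, nan, nan, nan],
    "A5": [3, nan, nan, 2, nan],
})

def top_priorities(df, n):
    """Return the n highest-priority column names per ID (hypothetical helper)."""
    long_df = (
        df.melt("ID")
          .dropna(subset=["value"])        # drop columns with no priority
          .sort_values(["ID", "value"])    # lowest value = highest priority
    )
    # Position of each row within its ID, starting at 1.
    long_df["rank"] = long_df.groupby("ID").cumcount().add(1)
    long_df = long_df[long_df["rank"].le(n)]
    return (
        long_df.pivot(index="ID", columns="rank", values="variable")
               .add_prefix("Priority_")
               .reindex(df["ID"])          # bring back IDs with no priorities
    )

out = top_priorities(df, 3)
print(out)
```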

how to convert grouped dataframe multilevel index to datadict

sample dataframe:
          avg
Key1 Key2
a1   b1   v1
     b2   v2
     b3   v3
a2   b4   v4
a3   b5   v5
     b6   v6
a4   b7   v7
How to convert this to a datadict
{a1:v1, a1:v2, a1:v3, a2:v4, a3:v5, a3:v6, a4:v7}
I tried this with no luck
dict(zip(df['ColA'], df['avg']))
Appreciate any help !!
Since it is a MultiIndex, use get_level_values. Level 1 (Key2) is used as the key here, because dict keys must be unique and the Key1 values repeat:
dict(zip(df.index.get_level_values(1), df['avg']))
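A dict cannot hold the same key twice, so {a1:v1, a1:v2, ...} is not possible as written. Below is a sketch of three shapes that are possible. The sample frame is rebuilt from the question, and which shape is actually wanted is an assumption on my part.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("a1", "b1"), ("a1", "b2"), ("a1", "b3"), ("a2", "b4"),
     ("a3", "b5"), ("a3", "b6"), ("a4", "b7")],
    names=["Key1", "Key2"],
)
df = pd.DataFrame({"avg": ["v1", "v2", "v3", "v4", "v5", "v6", "v7"]}, index=index)

# Keyed by the inner level, which is unique in this sample.
by_key2 = dict(zip(df.index.get_level_values("Key2"), df["avg"]))

# Keyed by the outer level, collecting all values of a key into a list.
by_key1 = df.groupby(level="Key1")["avg"].agg(list).to_dict()

# Keyed by (Key1, Key2) tuples.
by_pair = df["avg"].to_dict()

print(by_key2)
print(by_key1)
print(by_pair)
```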

Match two column values from 2 data sets, then find associated values

The issue I'm having was hard to title, and hard to search as well.
Here's some example data.
A B C D E F
B1 04/14/16 746 B1 04/25/16 2
B1 04/15/16 180 B1 04/30/16 4
B1 04/16/16 494 B1 05/01/16 5
B1 04/17/16 726 B2 04/01/16 1
B1 04/18/16 206 B2 04/03/16 1
B1 04/19/16 22 B2 04/04/16 2
B1 04/20/16 193 B2 04/05/16 2
B1 04/21/16 739 B2 04/12/16 8
B1 04/22/16 926 B2 04/13/16 1
B1 04/23/16 748 B2 04/14/16 2
B1 04/24/16 830 B2 04/15/16 1
B1 04/25/16 272 B2 04/18/16 9
B1 04/26/16 0 B2 04/19/16 1
B1 04/27/16 0 B2 04/26/16 9
B1 04/28/16 0 B2 04/27/16 3
B1 04/29/16 0 B2 04/30/16 1
B1 04/30/16 685 B2 05/02/16 5
B1 05/01/16 770 B2 05/03/16 2
B1 05/02/16 701 B3 04/03/16 3
B1 05/03/16 181 B3 04/04/16 1
B2 04/01/16 77 B3 04/06/16 2
B2 04/02/16 182 B3 04/07/16 1
B2 04/03/16 53 B3 04/09/16 1
B2 04/04/16 32 B3 04/16/16 7
What I'm trying to do is check for rows where column A matches column D and column B matches column E. If both match, I would like to divide column F by column C.
If there is no match for a given pair of A and B values, I would like to return those values with a zero.
So for a match:
B1 04/25/16 =2/272
For a non-match:
B1 04/14/16 0
Thank you.
Two INDEX/MATCH functions will do it (judging by the references in the formula, I2 and J2 hold the A and B values being looked up):
=IFERROR(INDEX($F$1:$F$24,MATCH(1,INDEX(($E$1:$E$24=J2)*($D$1:$D$24=I2),),0))/INDEX($C$1:$C$24,MATCH(1,INDEX(($B$1:$B$24=J2)*($A$1:$A$24=I2),),0)),0)
This is an array-type formula. Full column references should be avoided, because every cell in the referenced range is evaluated and the calculation time grows quickly with the range size.
If a more dynamic range is wanted then use this formula:
=IFERROR(INDEX($F$1:INDEX(F:F,MATCH(1E+99,F:F)),MATCH(1,INDEX(($E$1:INDEX(E:E,MATCH(1E+99,F:F))=J2)*($D$1:INDEX(D:D,MATCH(1E+99,F:F))=I2),),0))/INDEX($C$1:INDEX(C:C,MATCH(1E+99,C:C)),MATCH(1,INDEX(($B$1:INDEX(B:B,MATCH(1E+99,C:C))=J2)*($A$1:INDEX(A:A,MATCH(1E+99,C:C))=I2),),0)),0)
This will find the last cell with data and use that to set the extents of the range. So now as the data grows or shrinks it will only look at the data and not iterate through any more or any less than what is needed to cover the entire data set.
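For comparison, the same matching logic can be sketched in pandas with a left merge. This is not part of the Excel answer: it assumes the two tables are loaded as separate dataframes (named left and right here for illustration), that each (D, E) pair appears only once, and it uses just a few rows from the sample data.

```python
import pandas as pd

left = pd.DataFrame({
    "A": ["B1", "B1", "B1", "B1"],
    "B": ["04/14/16", "04/25/16", "04/26/16", "04/30/16"],
    "C": [746, 272, 0, 685],
})
right = pd.DataFrame({
    "D": ["B1", "B1", "B2"],
    "E": ["04/25/16", "04/30/16", "04/01/16"],
    "F": [2, 4, 1],
})

# Left merge keeps every (A, B) row; F is NaN where there is no match.
merged = left.merge(right, how="left", left_on=["A", "B"], right_on=["D", "E"])

# F / C where C is non-zero; non-matches and divide-by-zero become 0,
# mirroring what IFERROR(..., 0) does in the formula.
merged["ratio"] = (
    merged["F"].div(merged["C"]).where(merged["C"].ne(0)).fillna(0)
)
print(merged[["A", "B", "ratio"]])
```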

Sum fields in a column if there is an entry in a corresponding row in another column

Assume the following data:
  | A    B    C
--+------------------------
1 | 2    3    5
2 | 2         3
3 | 4         4
4 |      2    3
5 |      5    6
In cell A6, I want Excel to add cells C1, C2 and C3 on the basis that A1, A2 and A3 contain data. Similarly, I want B6 to add together C1, C4 and C5 because B1, B4 and B5 contain data.
Can someone help?
In A6 enter:
=SUMPRODUCT(($C1:$C5)*(A1:A5<>""))
and then copy it to B6.
A simple SUMIF formula will work
=SUMIF(A$1:A$5,"<>",$C$1:$C$5)
Place that formula in cell A6 and then copy it to B6.
You can create another column, e.g. AValue, with the formula =IF(ISBLANK(A1),0,C1) in it. This returns 0 if the cell in A in the corresponding row is empty, or the value from the cell in C otherwise.
Then you can just sum up the values of the new column.
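If the same data lives in a pandas dataframe rather than a sheet, the conditional sum can be sketched as below. This is an aside, not an Excel solution, and it assumes the blank cells are read in as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [2, 2, 4, None, None],
    "B": [3, None, None, 2, 5],
    "C": [5, 3, 4, 3, 6],
})

# For each of A and B, sum C over the rows where that column has a value.
totals = {col: df.loc[df[col].notna(), "C"].sum() for col in ["A", "B"]}
print(totals)
```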
