Pandas columns to rows transformation not working? - python-3.x

We have the pandas DataFrame below
and we need to convert it into the long-format DataFrame below.
When using pd.wide_to_long we get the following error:
ValueError: stubname can't be identical to a column name
This is the command being used:
pd.wide_to_long(df,['Org','City'],i=['First Name','Middle Name','Last Name','Years'],j='drop').reset_index(level=[0,1])

Your solution works for me; I only added the stubnames parameter name. It may be a bug in an older pandas version, so you can try upgrading pandas to the latest version:
df = pd.wide_to_long(df, stubnames=['Org','City'],
i=['First Name','Middle Name','Last Name','Years'],
j='drop').reset_index().drop(columns='drop')
print (df)
First Name Middle Name Last Name Years Org City
0 aa cc dd 2019 v n
1 aa cc dd 2019 m m
2 aa cc dd 2019 d n
3 aa cc dd 2019 p j
4 zz yy xx 2018 p n
5 zz yy xx 2018 q n
6 zz yy xx 2018 i d
7 zz yy xx 2018 NaN NaN
EDIT: If the data may contain duplicates in the i columns, create a default index with reset_index and add the resulting index column to the i variables:
print (df)
First Name Middle Name Last Name Years Org0 Org1 Org2 Org3 City0 City1 \
0 aa cc dd 2019 v m d p n m
1 zz yy xx 2018 p q i NaN n n
City2 City3
0 n j
1 d NaN
df = pd.wide_to_long(df.reset_index(), stubnames=['Org','City'],
i=['index','First Name','Middle Name','Last Name','Years'],
j='drop').reset_index().drop(columns=['drop', 'index'])
print (df)
First Name Middle Name Last Name Years Org City
0 aa cc dd 2019 v n
1 aa cc dd 2019 m m
2 aa cc dd 2019 d n
3 aa cc dd 2019 p j
4 zz yy xx 2018 p n
5 zz yy xx 2018 q n
6 zz yy xx 2018 i d
7 zz yy xx 2018 NaN NaN
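One possible source of the error in the question: pd.wide_to_long raises this ValueError when a column is named exactly like one of the stubnames (for example a column called Org next to Org1, Org2). The question's frame is not shown, so whether that was the case there is an assumption. A minimal sketch, rebuilding the frame printed above by hand:

```python
import numpy as np
import pandas as pd

# Wide frame rebuilt from the printed example above.
df = pd.DataFrame({
    "First Name": ["aa", "zz"],
    "Middle Name": ["cc", "yy"],
    "Last Name": ["dd", "xx"],
    "Years": [2019, 2018],
    "Org0": ["v", "p"], "Org1": ["m", "q"], "Org2": ["d", "i"], "Org3": ["p", np.nan],
    "City0": ["n", "n"], "City1": ["m", "n"], "City2": ["n", "d"], "City3": ["j", np.nan],
})
ids = ["First Name", "Middle Name", "Last Name", "Years"]

# Works: every wide column is a stub followed by a numeric suffix.
long_df = (
    pd.wide_to_long(df, stubnames=["Org", "City"], i=ids, j="drop")
    .reset_index()
    .drop(columns="drop")
)

# Fails: a column named exactly "Org" collides with the stubname.
bad = df.rename(columns={"Org0": "Org"})
try:
    pd.wide_to_long(bad, stubnames=["Org", "City"], i=ids, j="drop")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

If the real data has such a column, renaming it so that it carries a suffix (or choosing a different stub) should avoid the error.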

Related

Fill 0s with mean values of previous and next in a Pandas DataFrame

Fill the 0s according to the following rules, where A is the value before the run of 0s and B is the value after it:
If there is a single 0, replace it with the mean of A and B.
If there are two consecutive 0s, replace the first with A and the second with the mean of A and B.
If there are three consecutive 0s, replace the first and second with A and the third with the mean of A and B.
Ticker (the ID column) is an identifier shared by every row of a block and can be ignored. The entire table is 1000 rows long and there are never more than three consecutive 0s. I am unable to handle scenarios 2 and 3.
ID    asset
AA    34861000
AA    1607498
AA    0
AA    3530000000
AA    3333000000
AA    3179000000
AA    4053000000
AA    4520000000
AB    15250209
AB    0
AB    14691049
AB    0
AB    5044421
CC    5609212
CC    0
CC    0
CC    3673639
CC    132484747
CC    0
CC    0
CC    0
CC    141652646
You can use interpolate per group, on the reversed Series, with a limit of 1:
df['asset'] = (df
.assign(asset=df['asset'].replace(0, float('nan')))
.groupby('ID')['asset']
.transform(lambda s: s[::-1].interpolate(limit=1).bfill())
)
output:
ID asset
0 AA 3.486100e+07
1 AA 1.607498e+06
2 AA 1.765804e+09
3 AA 3.530000e+09
4 AA 3.333000e+09
5 AA 3.179000e+09
6 AA 4.053000e+09
7 AA 4.520000e+09
8 AB 1.525021e+07
9 AB 1.497063e+07
10 AB 1.469105e+07
11 AB 9.867735e+06
12 AB 5.044421e+06
13 CC 5.609212e+06
14 CC 5.609212e+06
15 CC 4.318830e+06
16 CC 3.673639e+06
17 CC 1.324847e+08 # X
18 CC 1.324847e+08 # filled X
19 CC 1.324847e+08 # filled X
20 CC 1.393607e+08 # linearly interpolated: (X+3Y)/4, not the plain mean (X+Y)/2
21 CC 1.416526e+08 # Y
OK, compiling the answer here with help from @jezrael and @mozway (m and mask are boolean masks whose definitions are not included in this post):
df['asset'] = df['asset'].replace(0, float('nan'))
df.loc[mask, 'asset'] = df.loc[mask | ~m, 'asset'].groupby(df['ID']).transform(lambda x: x.interpolate())
df = df.ffill()
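The three rules from the question can also be written without interpolate. Linear interpolation over a run of several 0s weights the two neighbours unequally (row 20 in the output above is (X+3Y)/4), whereas the question asks for the plain mean. The sketch below is one possible reading of those rules; the helper name fill_zero_runs is hypothetical, and it assumes every run of 0s has a non-zero value before it within the same ID.

```python
import numpy as np
import pandas as pd

def fill_zero_runs(s: pd.Series) -> pd.Series:
    """Replace the last 0 of each run by mean(A, B) and earlier 0s by A."""
    s = s.replace(0, np.nan).astype(float)
    prev_val = s.ffill()   # A: last non-zero value before each position
    next_val = s.bfill()   # B: next non-zero value after each position
    # True only for a missing value whose successor is present.
    last_of_run = s.isna() & s.shift(-1).notna()
    # Keep A everywhere except on the last 0 of a run, which gets the mean.
    return prev_val.where(~last_of_run, (prev_val + next_val) / 2)

# The AB and CC blocks from the question.
df = pd.DataFrame({
    "ID": ["AB"] * 5 + ["CC"] * 9,
    "asset": [15250209, 0, 14691049, 0, 5044421,
              5609212, 0, 0, 3673639, 132484747, 0, 0, 0, 141652646],
})
df["asset"] = df.groupby("ID")["asset"].transform(fill_zero_runs)
```

A run of 0s at the very end of a block has no B, so this sketch simply carries A forward there.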

The `column` utility without the -t option

After reading the man pages for column and trying a few examples, I wonder: what does this command do when it is not supplied the -t option?
It takes the input lines and puts each one in a separate cell of a table, filling each column from top to bottom before moving on to the next one:
$ seq 40 | column
1 4 7 10 13 16 19 22 25 28 31 34 37 40
2 5 8 11 14 17 20 23 26 29 32 35 38
3 6 9 12 15 18 21 24 27 30 33 36 39
It's similar to ls output, but the separator is tab between columns, while ls uses spaces:
$ ls
a ab ad af ah aj al an ap ar at av ax az c e g i k m o q s u w y
aa ac ae ag ai ak am ao aq as au aw ay b d f h j l n p r t v x z
$ printf "%s\n" * | column
a ac af ai al ao ar au ax b e h k n q t w z
aa ad ag aj am ap as av ay c f i l o r u x
ab ae ah ak an aq at aw az d g j m p s v y
If you have some newline-separated data that you want to represent in a condensed form in a nicely indented table, column is the way to go.
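The layout can be approximated in a few lines of Python. This is only a rough sketch of the behaviour shown above, not the utility's actual implementation: the function name columnate and its parameters are hypothetical, it pads with spaces rather than tabs, and it assumes cells are padded to the next multiple of 8 characters.

```python
import math

def columnate(items, width=80, tab=8):
    """Lay out items in columns, filling each column top to bottom."""
    if not items:
        return ""
    # Cell width: longest item rounded up to the next tab stop.
    col_width = (max(len(x) for x in items) // tab + 1) * tab
    ncols = max(1, width // col_width)
    nrows = math.ceil(len(items) / ncols)
    lines = []
    for r in range(nrows):
        # Row r holds items r, r + nrows, r + 2 * nrows, ...
        row = items[r::nrows]
        lines.append("".join(x.ljust(col_width) for x in row).rstrip())
    return "\n".join(lines)

# A width of 112 gives 14 columns, as in the seq 40 example.
lines = columnate([str(i) for i in range(1, 41)], width=112).splitlines()
```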

MELT: multiple values without duplication

This can't be that hard. I have
import pandas as pd
df=pd.DataFrame({'id':[1,2,3],'name':['j','l','m'], 'mnt':['f','p','p'],'nt':['b','w','e'],'cost':[20,30,80],'paid':[12,23,45]})
I need
import numpy as np
df1=pd.DataFrame({'id':[1,2,3,1,2,3],'name':['j','l','m','j','l','m'], 't':['f','p','p','b','w','e'],'paid':[12,23,45,np.nan,np.nan,np.nan],'cost':[20,30,80,np.nan,np.nan,np.nan]})
I have 45 columns to invert.
I tried
(df.set_index(['id', 'name'])
.rename_axis(['paid'], axis=1)
.stack().reset_index())
EDIT: I think the simplest approach here is to use DataFrame.melt and then set the missing values based on the variable column:
df2 = df.melt(['id', 'name','cost','paid'], value_name='t')
df2.loc[df2.pop('variable').eq('nt'), ['cost','paid']] = np.nan
print (df2)
id name cost paid t
0 1 j 20.0 12.0 f
1 2 l 30.0 23.0 p
2 3 m 80.0 45.0 p
3 1 j NaN NaN b
4 2 l NaN NaN w
5 3 m NaN NaN e
Alternatively, use lreshape, which takes a dictionary of lists specifying which columns are 'grouped' together:
df2 = pd.lreshape(df, {'t':['mnt','nt'], 'mon':['cost','paid']})
print (df2)
id name t mon
0 1 j f 20
1 2 l p 30
2 3 m p 80
3 1 j b 12
4 2 l w 23
5 3 m e 45
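With many columns to unpivot, the melt approach can be driven by lists. The following is a sketch under assumptions: the list names id_vars, keep_once and value_vars are hypothetical, and the values of cost and paid are assumed to belong only to the first melted column, as in the desired output of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['j', 'l', 'm'],
                   'mnt': ['f', 'p', 'p'], 'nt': ['b', 'w', 'e'],
                   'cost': [20, 30, 80], 'paid': [12, 23, 45]})

id_vars = ['id', 'name']      # repeated on every output row
keep_once = ['cost', 'paid']  # kept only for the first melted column
value_vars = ['mnt', 'nt']    # columns to stack into 't'

df2 = df.melt(id_vars + keep_once, value_vars=value_vars, value_name='t')
# Cast to float first so that NaN can be stored without changing dtype later.
df2[keep_once] = df2[keep_once].astype(float)
# Blank out cost/paid on rows that came from any column but the first.
df2.loc[df2.pop('variable').ne(value_vars[0]), keep_once] = np.nan
```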

How to rotate rows by index into columns?

I have dataframe that looks like this:
title answer
0 aa zz
1 bb xx
2 cc yy
3 dd ll
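I want to rotate selected rows (chosen by index value) into columns, with title as the column headers and answer as the values, and with a reset index. For only two rows, the result should look like this: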
bb cc
0 xx yy
How do I do this?
I tried Transpose:
df[['title','answer']].T
but that puts integer values in the column headers, and I'm not sure how to select by index number.
IIUC,
we can set title as the index and then transpose. The reason you get integers is that you are transposing on the default integer index.
df_new = df.set_index('title').T
print(df_new)
title aa bb cc dd
answer zz xx yy ll
If you want to get rid of the index as well:
df_new = df.set_index('title').T.reset_index(drop=True)
df_new.columns.name = ''
print(df_new)
aa bb cc dd
0 zz xx yy ll
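To keep only the two rows asked for in the question, the rows can be selected by position before transposing. A minimal sketch, assuming the wanted rows are those at positions 1 and 2:

```python
import pandas as pd

df = pd.DataFrame({'title': ['aa', 'bb', 'cc', 'dd'],
                   'answer': ['zz', 'xx', 'yy', 'll']})

# Select rows 1 and 2 by position, then turn titles into column headers.
df_new = df.iloc[1:3].set_index('title').T.reset_index(drop=True)
df_new.columns.name = None  # drop the leftover 'title' label
print(df_new)
```

df.loc[[1, 2]] would select the same rows by index label instead of by position.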

How to pass 2 or more array criteria in SUMIFS formula?

I have the table below:
A B C
a aa 1
a aa 1
a dd 1
a aa 1
b aa 1
b bb 1
b aa 1
b bb 1
c cc 1
c bb 1
c bb 1
c cc 1
d cc 1
d aa 1
d bb 1
d cc 1
When I enter the formula
=SUMPRODUCT(SUMIFS(C1:C16,A1:A16,{"a","b","c"}))
it returns 12
However, when I enter
=SUMPRODUCT(SUMIFS(C1:C16,A1:A16,{"a","b","c"},B1:B16,{"aa","bb"}))
it returns only 5
Can anyone help me with this? I don't want to use multiple formulas like
=SUMPRODUCT(SUMIFS(C1:C16,A1:A16,{"a","b","c"},B1:B16,"aa")) + SUMPRODUCT(SUMIFS(C1:C16,A1:A16,{"a","b","c"},B1:B16,"bb"))
I got the answer that I was expecting from another site.
Thanks http://www.excelforum.com/members/30486.html
=SUMPRODUCT(SUMIFS(C1:C16,A1:A16,{"a","b","c"},B1:B16,{"aa";"bb"}))
I would use this one:
=SUMPRODUCT(ISNUMBER(MATCH(A1:A16,{"a","b","c"},0)*MATCH(B1:B16,{"aa","bb"},0))*(C1:C16))
or, if C1:C16 always contains only 1, simply:
=SUMPRODUCT(1*ISNUMBER(MATCH(A1:A16,{"a","b","c"},0)*MATCH(B1:B16,{"aa","bb"},0)))
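The difference between the comma and the semicolon version can be illustrated outside Excel. The reported result of 5 is consistent with the two comma-separated (row) arrays being paired element by element, while {"aa";"bb"} is a column array, so every combination of the two criteria is evaluated. The Python sketch below only mimics that arithmetic on the table from the question; it is not a description of Excel's internals, and the helper sumifs is hypothetical.

```python
# The table from the question as (A, B, C) rows.
rows = [
    ("a", "aa", 1), ("a", "aa", 1), ("a", "dd", 1), ("a", "aa", 1),
    ("b", "aa", 1), ("b", "bb", 1), ("b", "aa", 1), ("b", "bb", 1),
    ("c", "cc", 1), ("c", "bb", 1), ("c", "bb", 1), ("c", "cc", 1),
    ("d", "cc", 1), ("d", "aa", 1), ("d", "bb", 1), ("d", "cc", 1),
]

def sumifs(a_val, b_val):
    """Sum C over the rows where A == a_val and B == b_val."""
    return sum(c for a, b, c in rows if a == a_val and b == b_val)

a_crit = ["a", "b", "c"]
b_crit = ["aa", "bb"]

# Same orientation: criteria paired up, (a, aa) and (b, bb) only.
pairwise_total = sum(sumifs(a, b) for a, b in zip(a_crit, b_crit))

# Row array against column array: all 3 x 2 combinations.
cross_total = sum(sumifs(a, b) for a in a_crit for b in b_crit)

print(pairwise_total, cross_total)
```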
