I have a dataframe that looks like this:
title answer
0 aa zz
1 bb xx
2 cc yy
3 dd ll
I want to transpose it so the title values become the column headers and the answer values become a single row, but only for two of the rows, like this:
bb cc
0 xx yy
How do I do this?
I tried transposing:
df[['title','answer']].T
but that puts the integer index values in the column headers, and I'm not sure how to select rows by index number.
IIUC,
we can set title as the index and then transpose. The reason you get integers is that you're transposing on the default integer index:
df_new = df.set_index('title').T
print(df_new)
title aa bb cc dd
answer zz xx yy ll
If you want to get rid of the index labels as well:
df_new = df.set_index('title').T.reset_index(drop=True)
df_new.columns.name = ''
print(df_new)
aa bb cc dd
0 zz xx yy ll
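To keep only the two rows from the question (the rows at index 1 and 2), a minimal sketch is to select them with loc before transposing, applying the same set_index approach:
df_new = df.loc[[1, 2]].set_index('title').T.reset_index(drop=True)
df_new.columns.name = ''
print(df_new)
  bb cc
0 xx yy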
I have a dataframe as follows:
A B
AA BB
-----------
2 1
3 4
5 8
9 7
This dataframe has a multilevel (MultiIndex) column. To save it to Excel I am doing the following:
col1= ["A","B"]
col2= ["AA","BB"]
r1 = [2,1]
r2 =[3,4]
df = pd.DataFrame(columns = [col1,col2])
df.loc[(len(df))] = r1
df.to_excel("name.xlsx")
But in Excel I can see one empty row automatically added between the column names and the data. How can I save the dataframe in xlsx format so that the empty row is not added?
It is a bug in to_excel when processing a MultiIndex.
Just use to_csv:
import pandas as pd
col1= ["A","B"]
col2= ["AA","BB"]
r1 = [2,1]
r2 =[3,4]
df = pd.DataFrame(columns = [col1,col2])
df.loc[(len(df))] = r1
df.loc[(len(df))] = r2
df.to_csv("name.csv")
Here is a workaround - reset the column index:
df.T.reset_index(level=1).T.to_excel("name.xlsx")
A B
level_1 AA BB
0 2 1
To set custom names for the column levels (set the names first, then export to Excel):
only one of the levels:
df.columns.set_names('name_lev_2', level=1, inplace=True)
A B
name_lev_2 AA BB
0 2 1
both levels:
df.rename_axis(columns=['name_lev_1', 'name_lev_2']) # not in place
name_lev_1 A B
name_lev_2 AA BB
0 2 1
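Putting the two together, a minimal sketch (assuming the same small frame as above): name the levels, then apply the transpose workaround before exporting:
import pandas as pd

df = pd.DataFrame([[2, 1]], columns=[["A", "B"], ["AA", "BB"]])

# name both column levels, then move level 1 into the first data row
# so that to_excel does not write the blank spacer row
out = df.rename_axis(columns=['name_lev_1', 'name_lev_2'])
out.T.reset_index(level=1).T.to_excel("name.xlsx")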
Fill the 0s subject to the following conditions:
If there is a single 0, replace it with a simple average of the values before and after the 0. In this scenario the 0 is replaced by the mean of A and B.
If there are two consecutive 0s, fill the first with the value from the previous period, and replace the second with a simple average of the values before and after the run. The first 0 is replaced by A and the second by the mean of A and B.
If there are three consecutive 0s, replace the first and second with A and the third with the mean of A and B.
Ticker (the ID column) is an identifier and is common within each block (it can be ignored). The full table is about 1,000 rows long, and in no case do consecutive 0s exceed three. I am unable to handle scenarios 2 and 3.
ID   asset
AA   34861000
AA   1607498
AA   0
AA   3530000000
AA   3333000000
AA   3179000000
AA   4053000000
AA   4520000000
AB   15250209
AB   0
AB   14691049
AB   0
AB   5044421
CC   5609212
CC   0
CC   0
CC   3673639
CC   132484747
CC   0
CC   0
CC   0
CC   141652646
You can use interpolate per group, on the reversed Series, with a limit of 1:
df['asset'] = (df
    .assign(asset=df['asset'].replace(0, float('nan')))  # treat 0 as missing
    .groupby('ID')['asset']
    # reverse each group: interpolate(limit=1) then fills only the last NaN
    # of each run, and bfill fills the earlier NaNs with the previous value
    .transform(lambda s: s[::-1].interpolate(limit=1).bfill())
)
output:
ID asset
0 AA 3.486100e+07
1 AA 1.607498e+06
2 AA 1.765804e+09
3 AA 3.530000e+09
4 AA 3.333000e+09
5 AA 3.179000e+09
6 AA 4.053000e+09
7 AA 4.520000e+09
8 AB 1.525021e+07
9 AB 1.497063e+07
10 AB 1.469105e+07
11 AB 9.867735e+06
12 AB 5.044421e+06
13 CC 5.609212e+06
14 CC 5.609212e+06
15 CC 4.318830e+06
16 CC 3.673639e+06
17 CC 1.324847e+08 # X
18 CC 1.324847e+08 # filled X
19 CC 1.324847e+08 # filled X
20 CC 1.393607e+08 # linear interpolation between X and Y
21 CC 1.416526e+08 # Y
OK, compiling the answer here with help from @jezrael and @mozway:
df['asset'] = df['asset'].replace(0, float('nan'))
df.loc[mask, 'asset'] = df.loc[mask | ~m, 'asset'].groupby(df['ID']).transform(lambda x: x.interpolate())
df = df.ffill()
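For reference, here is a self-contained sketch of that compiled approach, assuming df holds the ID/asset frame from the question. The definitions of m and mask are my assumption about what the referenced answers used (m marks the former 0s, mask marks the last NaN of each consecutive run):
df['asset'] = df['asset'].replace(0, float('nan'))

m = df['asset'].isna()                      # positions of the former 0s
mask = m & ~m.shift(-1, fill_value=False)   # last NaN of each consecutive run

# Interpolate only the last NaN of each run. Because the earlier NaNs of the
# run are excluded from the selection, linear interpolation gives the mean of
# the values just before and just after the run, as the conditions require.
df.loc[mask, 'asset'] = (
    df.loc[mask | ~m, 'asset']
      .groupby(df['ID'])
      .transform(lambda x: x.interpolate())
)

# The remaining NaNs (the first/second 0s of a run) take the previous value.
df['asset'] = df.groupby('ID')['asset'].ffill()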
I have CSV exports of data for individual pieces in a collection that each have two initial columns: A & B.
Column A has topics and column B has 0+ tags for those topics separated by commas.
There exists a master list of all possible singular combinations of topics/column A and tags/column B, but each CSV export may contain one, several, or none of any particular combination, and there can be many combinations in total.
In the actual master list, there are about 20 topics and anywhere from 2 to maybe 50 tags per topic.
Master list example:
Topics/A Tags/B
AA XX
AA XY
AA XZ
AB VV
AB VW
AB VX
AB VY
AB VZ
AC YY
AC YZ
Individual piece CSV example:
Topics/A Tags/B
AA
AA XZ
AA XZ, XX
AA XZ, XX
AB VV, VY
AB VY
AB VX, VV, VZ
AB VY
AB VZ, VW, VV, VY, VX
AC YY
AC YY
AC YY
AC YY, YZ
I want the final result to be a count of all combinations of topics/A and tags/B.
Final result for individual piece example from above (option 1):
Topics/A Tags/B Count
AA none 1
AA XX 2
AA XY 0
AA XZ 3
AB none 0
AB VV 3
AB VW 1
AB VX 2
AB VY 4
AB VZ 2
AC none 0
AC YY 4
AC YZ 1
Final result for individual piece example from above (option 2a, as seen below, or option 2b with columns and rows swapped):
AA (none) XX XY XZ AB (none) VV VW VX VY VZ AC (none) YY YZ
AA 1 2 0 3
AB 0 3 1 2 4 2
AC 0 4 1
I'm assuming I have to separate out the tags/B column so that it's something like:
AA
AA XZ
AA XZ XX
AA XZ XX
AB VV VY
AB VY
AB VX VV VZ
AB VY
AB VZ VW VV VY VX
AC YY
AC YY
AC YY
AC YY YZ
But, after this, I'm pretty stuck on what to do.
I tried looking up how to unpivot the above, but I would like some sort of formula or method that would be universal when the number of tags applied to a topic in a single instance is generally unknown.
I've seen formulas that "flatten" the data, but I think they're designed for a fixed number of possible "tag" columns, which doesn't work for me.
I don't want to resort to manually entering formulas to flatten out the delimited matrix like:
Topics (flattened) Tags (flattened)
=A2 =B2
=A2 =C2
=A2 =D2
=A2 =E2
=A2 =F2
=A3 =B3
=A3 =C3
=A3 =D3
=A3 =E3
=A3 =F3
=A4 =B4
... ...
Please help.
Thank you.
Here's the formula method. There's one part you might need a VBA script for - let me know.
Replace all blanks in your CSV with "none" (highlight the range > Ctrl+H > replace nothing with none)
Add "none" rows to your master list (let me know if you need me to write a macro that does this)
Use this formula, entered with Ctrl+Shift+Enter (an array formula):
=SUM((A2=$A$17:$A$29)*(ISNUMBER(FIND(B2,$B$17:$B$29))))
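If you're open to doing this in pandas instead of spreadsheet formulas, here is a rough sketch of the same count. The file and column names (master.csv, piece.csv, Topic, Tags) are placeholders for illustration, and it assumes the master list already includes the "none" rows:
import pandas as pd

master = pd.read_csv("master.csv")  # columns: Topic, Tag (including "none" rows)
piece = pd.read_csv("piece.csv")    # columns: Topic, Tags (comma-separated)

# one row per (topic, tag) pair; rows with no tags become "none"
exploded = piece.assign(Tag=piece["Tags"].str.split(",")).explode("Tag")
exploded["Tag"] = exploded["Tag"].str.strip().fillna("none")

# count each combination, then reindex on the master list so that
# unused combinations show up with a count of 0
counts = (exploded.groupby(["Topic", "Tag"]).size()
                  .reindex(pd.MultiIndex.from_frame(master), fill_value=0)
                  .reset_index(name="Count"))
print(counts)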
We have a wide pandas DataFrame (with columns Org0-Org3 and City0-City3, as shown in the edit below) and we need to convert it into a long DataFrame with single Org and City columns.
While using the pd.wide_to_long command we are getting the error below:
ValueError: stubname can't be identical to a column name
This is the command being used:
pd.wide_to_long(df,['Org','City'],i=['First Name','Middle Name','Last Name','Years'],j='drop').reset_index(level=[0,1])
For me your solution is working; I also added the parameter stubnames explicitly. Maybe it is a bug in an older pandas version, so you can try upgrading pandas to the latest version:
df = pd.wide_to_long(df, stubnames=['Org','City'],
                     i=['First Name','Middle Name','Last Name','Years'],
                     j='drop').reset_index().drop('drop', axis=1)
print (df)
First Name Middle Name Last Name Years Org City
0 aa cc dd 2019 v n
1 aa cc dd 2019 m m
2 aa cc dd 2019 d n
3 aa cc dd 2019 p j
4 zz yy xx 2018 p n
5 zz yy xx 2018 q n
6 zz yy xx 2018 i d
7 zz yy xx 2018 NaN NaN
EDIT: If duplicates in the data are possible, create a default index with reset_index and add the index column to the i variables:
print (df)
First Name Middle Name Last Name Years Org0 Org1 Org2 Org3 City0 City1 \
0 aa cc dd 2019 v m d p n m
1 zz yy xx 2018 p q i NaN n n
City2 City3
0 n j
1 d NaN
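For reference, this wide input can be constructed like so (a sketch for reproducibility, with values taken from the printout above):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'First Name': ['aa', 'zz'],
    'Middle Name': ['cc', 'yy'],
    'Last Name': ['dd', 'xx'],
    'Years': [2019, 2018],
    'Org0': ['v', 'p'], 'Org1': ['m', 'q'], 'Org2': ['d', 'i'], 'Org3': ['p', np.nan],
    'City0': ['n', 'n'], 'City1': ['m', 'n'], 'City2': ['n', 'd'], 'City3': ['j', np.nan],
})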
df = pd.wide_to_long(df.reset_index(), stubnames=['Org','City'],
                     i=['index','First Name','Middle Name','Last Name','Years'],
                     j='drop').reset_index().drop(['drop', 'index'], axis=1)
print (df)
First Name Middle Name Last Name Years Org City
0 aa cc dd 2019 v n
1 aa cc dd 2019 m m
2 aa cc dd 2019 d n
3 aa cc dd 2019 p j
4 zz yy xx 2018 p n
5 zz yy xx 2018 q n
6 zz yy xx 2018 i d
7 zz yy xx 2018 NaN NaN
I have a pandas dataframe that essentially looks like this:
type item string
1 0 aa
1 1 bb
1 2 cc
2 0 dd
2 1 ee
2 2 ff
I want to create a new column 'newstring' based on the group's 'string' column:
type item string newstring
1 0 aa aa+bb+cc
1 1 bb aa+bb+cc
1 2 cc aa+bb+cc
2 0 dd dd+ee+ff
2 1 ee dd+ee+ff
2 2 ff dd+ee+ff
I've tried
df.groupby('type').aggregate(lambda x: "+".join(x))
df.groupby('type').apply(lambda x: "+".join(x))
but I keep getting this (literally) as the result in newstring:
type item string newstring
1 0 aa type+item+string+newstring
1 1 bb type+item+string+newstring
1 2 cc type+item+string+newstring
2 0 dd type+item+string+newstring
2 1 ee type+item+string+newstring
2 2 ff type+item+string+newstring
How can I group by a specific column and then append the joined values of one column of that group as a new column?
Thanks in advance!
Sorry, are you after this?
In [14]:
df['new_string'] = df.groupby('type')['string'].transform(lambda x: '+'.join(x))
df
Out[14]:
type item string new_string
0 1 0 aa aa+bb+cc
1 1 1 bb aa+bb+cc
2 1 2 cc aa+bb+cc
3 2 0 dd dd+ee+ff
4 2 1 ee dd+ee+ff
5 2 2 ff dd+ee+ff
The above groups on 'type', then calls transform on the 'string' column with a lambda that joins the string values.
What you tried failed because your function was applied to all the remaining columns rather than specifically to the 'string' column. Also, transform returns a Series whose index is aligned with the original df, which is why it can be assigned directly as a new column.
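An equivalent two-step alternative, in case the broadcasting is clearer this way (a sketch of the same idea, not different behavior): aggregate once per group, then map the result back onto the original rows.
# aggregate once per group, then broadcast back by mapping on 'type'
joined = df.groupby('type')['string'].agg('+'.join)
df['new_string'] = df['type'].map(joined)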