Change columns to rows per student ID - excel

I have data in an Excel sheet that I am reading into a dataframe:

ID  Grade  Course   Q1 Number  Q1 Letter  Q2 Number  Q2 Letter
1   9      English  73         B          69         C
1   9      Math     70         B          52         C
1   9      Science  69         C          80         A
Desired output:

ID  Grade  Course   Semester  Number Grade  Letter Grade
1   9      English  Q1        73            B
1   9      English  Q2        69            C
1   9      Math     Q1        70            B
1   9      Math     Q2        52            C
1   9      Science  Q1        69            C
1   9      Science  Q2        80            A
I'm trying to do df.melt, but it's not working. Any help is appreciated.

Update:
df = pd.read_excel('Downloads/grades_mock+data.xlsx')
dfm = df.set_index(['ID', 'GRADE', 'COURSE'])\
        .rename(columns=lambda x: ' '.join(x.split(' ', 1)[::-1]))\
        .reset_index()
# Eliminating duplicates.
dfm = dfm.groupby(['ID', 'GRADE', 'COURSE', 'DISCIPLINE COURSE'], as_index=False).first()
df_out = pd.wide_to_long(dfm,
                         ['GRADE NUMERIC', 'GRADE LETTER'],
                         ['ID', 'GRADE', 'COURSE', 'DISCIPLINE COURSE'],
                         'Semester', sep=' ', suffix='.*')\
           .reset_index()
print(df_out)
Try this, using pd.wide_to_long, with some column renaming to make it easier:
df = pd.read_clipboard()
dfm = df.set_index(['ID', 'Grade', 'Course'])\
        .rename(columns=lambda x: ' '.join(x.split(' ')[::-1]))\
        .reset_index()
df_out = pd.wide_to_long(dfm,
                         ['Number', 'Letter'],
                         ['ID', 'Grade', 'Course'],
                         'Semester', sep=' ', suffix='.*')\
           .reset_index()
print(df_out)
Output:
ID Grade Course Semester Number Letter
0 1 9 English Q1 73 B
1 1 9 English Q2 69 C
2 1 9 Math Q1 70 B
3 1 9 Math Q2 52 C
4 1 9 Science Q1 69 C
5 1 9 Science Q2 80 A
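Since the question mentions df.melt, here is a hedged sketch of a melt-based route as well (the data is assumed from the tables above): melt the Q* columns, split each variable name into the semester and the grade type, then pivot the types back into columns.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1],
    'Grade': [9, 9, 9],
    'Course': ['English', 'Math', 'Science'],
    'Q1 Number': [73, 70, 69],
    'Q1 Letter': ['B', 'B', 'C'],
    'Q2 Number': [69, 52, 80],
    'Q2 Letter': ['C', 'C', 'A'],
})

# Melt all Q* columns into long form, then split "Q1 Number" -> ("Q1", "Number").
m = df.melt(id_vars=['ID', 'Grade', 'Course'], var_name='var', value_name='val')
m[['Semester', 'kind']] = m['var'].str.split(' ', n=1, expand=True)

# Pivot the Number/Letter kinds back out into separate columns.
out = (m.pivot(index=['ID', 'Grade', 'Course', 'Semester'],
               columns='kind', values='val')
         .reset_index()
         .rename_axis(None, axis=1))
print(out)
```

This needs pandas >= 1.1 (list-valued `index` in `pivot`); note the melted values share one column, so Number comes back with object dtype.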

Related

How to do similar type of columns addition in Pyspark?

I want to add up columns of similar type (there are more than 100 columns in total), as follows:

id    b   c     d     b_apac  c_apac  d_apac
abcd  3   5     null  45      9       1
bcd   13  15    1     45      2       10
cd    32  null  6     45      90      1

The resultant table should look like this:

id    b_sum  c_sum  d_sum
abcd  48     14     1
bcd   58     17     11
cd    77     90     7
Please help me with some generic code, as I have more than 100 columns to do this for.
You can use the built-in sum and check the prefix of each column name:
df.select(
    'id',
    sum([df[col] for col in df.columns if col.startswith('b')]).alias('b_sum'),
    sum([df[col] for col in df.columns if col.startswith('c')]).alias('c_sum'),
    sum([df[col] for col in df.columns if col.startswith('d')]).alias('d_sum'),
).show(10, False)

duplicating pandas dataframe rows n times where n is conditional upon a cell value [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed 1 year ago.
Consider a pandas dataframe where the column value of some rows is a list of values:
df1 = {'Name': ['Jack', 'John', 'Paul', 'Todd', 'Chris'],
       'ID': [62, 47, 55, 74, 31],
       'Subjects': ['A', 'A', ['A', 'B', 'C'], ['A', 'B'], 'C']}
df1 = pd.DataFrame(df1)
becoming
Name ID Subjects
0 Jack 62 A
1 John 47 A
2 Paul 55 [A, B, C]
3 Todd 74 [A, B]
4 Chris 31 C
I need to transform the rows where df1.Subjects is a list, so that the list is exploded and distributed across row copies, and the dataframe becomes something like:
Name ID Subjects
0 Jack 62 A
1 John 47 A
2 Paul 55 A
3 Todd 74 A
4 Chris 31 C
5 Paul 55 B
6 Paul 55 C
7 Todd 74 B
where the index is not so important, but df1.ID should be preserved when making the row copies.
Use explode and merge:
>>> pd.merge(df1.drop(columns='Subjects'),
...          df1['Subjects'].explode(),
...          left_index=True, right_index=True, how='outer') \
...     .reset_index(drop=True)
Name ID Subjects
0 Jack 62 A
1 John 47 A
2 Paul 55 A
3 Paul 55 B
4 Paul 55 C
5 Todd 74 A
6 Todd 74 B
7 Chris 31 C
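On pandas 0.25+, DataFrame.explode does this in one step; a minimal sketch with the same toy data:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Jack', 'John', 'Paul', 'Todd', 'Chris'],
                    'ID': [62, 47, 55, 74, 31],
                    'Subjects': ['A', 'A', ['A', 'B', 'C'], ['A', 'B'], 'C']})

# explode repeats the other columns for every element of a list cell;
# scalar cells pass through unchanged.
out = df1.explode('Subjects').reset_index(drop=True)
print(out)
```

Unlike the merge version, this keeps the list elements of each row adjacent.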

Unstack a dataframe with duplicated index in Pandas

Given a toy dataset as follows, which has duplicated price and quantity:
city item value
0 bj price 12
1 bj quantity 15
2 bj price 12
3 bj quantity 15
4 bj level a
5 sh price 45
6 sh quantity 13
7 sh price 56
8 sh quantity 7
9 sh level b
I want to reshape it into the following dataframe, i.e. prefix the first price/quantity pair with sell_ and the second pair with buy_:
city sell_price sell_quantity buy_price buy_quantity level
0 bj 12 15 13 16 a
1 sh 45 13 56 7 b
I have tried with df.set_index(['city', 'item']).unstack().reset_index(), but it raises an error: ValueError: Index contains duplicate entries, cannot reshape.
How could I get the desired output as above? Thanks.
You can prefix the second duplicates with buy_ and the first with sell_, changing the values in item before applying your solution:
m1 = df.duplicated(['city', 'item'])
m2 = df.duplicated(['city', 'item'], keep=False)
df['item'] = np.where(m1, 'buy_', np.where(m2, 'sell_', '')) + df['item']
df = (df.set_index(['city', 'item'])['value']
        .unstack()
        .reset_index()
        .rename_axis(None, axis=1))
# change the order of the columns
df = df[['city', 'sell_price', 'sell_quantity', 'buy_price', 'buy_quantity', 'level']]
print (df)
city sell_price sell_quantity buy_price buy_quantity level
0 bj 12 15 12 15 a
1 sh 45 13 56 7 b
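An alternative sketch (with the toy data assumed to be built inline) uses groupby.cumcount to number each occurrence instead of two duplicated masks; the occurrence number maps directly to the sell_/buy_ prefix:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['bj'] * 5 + ['sh'] * 5,
    'item': ['price', 'quantity', 'price', 'quantity', 'level'] * 2,
    'value': [12, 15, 12, 15, 'a', 45, 13, 56, 7, 'b']})

# Number each (city, item) occurrence: 0 -> sell_, 1 -> buy_;
# items that occur only once (level) keep their bare name.
occ = df.groupby(['city', 'item']).cumcount()
dup = df.duplicated(['city', 'item'], keep=False)
df['item'] = np.where(dup, occ.map({0: 'sell_', 1: 'buy_'}) + df['item'], df['item'])

out = (df.set_index(['city', 'item'])['value']
         .unstack()
         .reset_index()
         .rename_axis(None, axis=1))
print(out)
```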

Taking all duplicate values in column as single value in pandas

My current dataframe is:
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
I want to get a dataframe like:

  Name  term  Grade
0    A     1     35
           2     40
1    B     1     50
           2     45
Is it possible to get my expected output? If yes, how can I do it?
Use duplicated to build a boolean mask, combined with numpy.where:
mask = df['Name'].duplicated()
#more general
#mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
  Name  term  Grade
0    A     1     35
1          2     40
2    B     1     50
3          2     45
The difference between the two masks can be seen with a modified DataFrame:
print (df)
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
4 A 4 43
5 A 3 46
If there are multiple consecutive groups with the same name, like the two A groups here, the general solution is needed:
mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
  Name  term  Grade
0    A     1     35
1          2     40
2    B     1     50
3          2     45
4    A     4     43
5          3     46
With the simple duplicated mask, every repeated name is blanked, even across groups:
mask = df['Name'].duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
  Name  term  Grade
0    A     1     35
1          2     40
2    B     1     50
3          2     45
4          4     43
5          3     46
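If the goal is only display, a minimal alternative sketch is to move Name and term into the index, since pandas prints repeated outer index labels only once and the data in Name is not destroyed:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B', 'B'],
                   'term': [1, 2, 1, 2],
                   'Grade': [35, 40, 50, 45]})

# A MultiIndex prints each repeated outer label once, giving the same
# visual effect while keeping Name recoverable via the index.
print(df.set_index(['Name', 'term']))
```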

how to add a new column in dataframe which divides multiple columns and finds the maximum value

This may have a really simple solution, but I am new to Python 3. I have a dataframe with multiple columns, and I would like to add a new column that does the following calculation:
New Column = Max((Column A / Column B), (Column C / Column D), (Column E / Column F))
I can take a max with the following code, but I don't know how to do the division along with it.
df['Max'] = df[['Column A','Column B','Column C', 'Column D', 'Column E', 'Column F']].max(axis=1)
Column A Column B Column C Column D Column E Column F Max
3600 36000 22 11 3200 3200 36000
2300 2300 13 26 1100 1200 2300
1300 13000 15 33 1000 1000 13000
Thanks
You can slice the columns in alternating steps, div one slice by the other, and then take the max:
In [105]:
df['Max'] = df.loc[:, df.columns[::2]].div(df.loc[:, df.columns[1::2]].values, axis=1).max(axis=1)
df
Out[105]:
Column A Column B Column C Column D Column E Column F Max
0 3600 36000 22 11 3200 3200 2
1 2300 2300 13 26 1100 1200 1
2 1300 13000 15 33 1000 1000 1
Here are the intermediate values:
In [108]:
df.loc[:, df.columns[::2]].div(df.loc[:, df.columns[1::2]].values, axis=1)
Out[108]:
Column A Column C Column E
0 0.1 2.000000 1.000000
1 1.0 0.500000 0.916667
2 0.1 0.454545 1.000000
You can try something like the following:
df['Max'] = df.apply(lambda v: max(v['A'] / v['B'].astype(float), v['C'] / v['D'].astype(float), v['E'] / v['F'].astype(float)), axis=1)
Example
In [14]: df
Out[14]:
A B C D E F
0 1 11 1 11 12 98
1 2 22 2 22 67 1
2 3 33 3 33 23 4
3 4 44 4 44 11 10
In [15]: df['Max'] = df.apply(lambda v: max(v['A'] / v['B'].astype(float), v['C'] /
v['D'].astype(float), v['E'] / v['F'].astype(float)), axis=1)
In [16]: df
Out[16]:
A B C D E F Max
0 1 11 1 11 12 98 0.122449
1 2 22 2 22 67 1 67.000000
2 3 33 3 33 23 4 5.750000
3 4 44 4 44 11 10 1.100000
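A hedged vectorized sketch of the same idea, which pairs the numerator and denominator columns explicitly (the column names are assumed from the question) and avoids a per-row apply:

```python
import pandas as pd

df = pd.DataFrame({'Column A': [3600, 2300, 1300],
                   'Column B': [36000, 2300, 13000],
                   'Column C': [22, 13, 15],
                   'Column D': [11, 26, 33],
                   'Column E': [3200, 1100, 1000],
                   'Column F': [3200, 1200, 1000]})

# Divide the numerator array by the denominator array elementwise,
# then take the row-wise max of the three ratios.
num = df[['Column A', 'Column C', 'Column E']].to_numpy()
den = df[['Column B', 'Column D', 'Column F']].to_numpy()
df['Max'] = (num / den).max(axis=1)
print(df)
```

Naming the column pairs explicitly is less fragile than relying on positional `[::2]` slicing, which silently breaks if the column order changes.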