Duplicating pandas dataframe rows n times where n is conditional upon a cell value [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed 1 year ago.
Consider a pandas dataframe where the column value of some rows is a list of values:
import pandas as pd

df1 = {'Name': ['Jack', 'John', 'Paul', 'Todd', 'Chris'],
       'ID': [62, 47, 55, 74, 31],
       'Subjects': ['A', 'A', ['A', 'B', 'C'], ['A', 'B'], 'C']}
df1 = pd.DataFrame(df1)
becoming
Name ID Subjects
0 Jack 62 A
1 John 47 A
2 Paul 55 [A, B, C]
3 Todd 74 [A, B]
4 Chris 31 C
I need to transform the rows where df1.Subjects is a list, so that the list is exploded and distributed across copies of the row, and the dataframe becomes something like:
Name ID Subjects
0 Jack 62 A
1 John 47 A
2 Paul 55 A
3 Todd 74 A
4 Chris 31 C
5 Paul 55 B
6 Paul 55 C
7 Todd 74 B
where the index is not so heavily important, but df1.ID should be preserved when making its row copies.

Use explode and merge:
>>> pd.merge(df1.drop(columns='Subjects'),
...          df1['Subjects'].explode(),
...          left_index=True, right_index=True, how='outer') \
...     .reset_index(drop=True)
    Name  ID Subjects
0   Jack  62        A
1   John  47        A
2   Paul  55        A
3   Paul  55        B
4   Paul  55        C
5   Todd  74        A
6   Todd  74        B
7  Chris  31        C
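For comparison, DataFrame.explode (available since pandas 0.25) can be called on the whole frame, which makes the merge unnecessary. A minimal sketch, assuming the same df1 as in the question; scalar entries in Subjects are kept as they are and list entries become one row each:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Jack', 'John', 'Paul', 'Todd', 'Chris'],
    'ID': [62, 47, 55, 74, 31],
    'Subjects': ['A', 'A', ['A', 'B', 'C'], ['A', 'B'], 'C'],
})

# explode() repeats the other columns (Name, ID) for every list element;
# reset_index(drop=True) then renumbers the rows 0..n-1.
exploded = df1.explode('Subjects').reset_index(drop=True)
print(exploded)
```

The row order differs from the one sketched in the question (copies stay next to their original row), but every copy keeps its ID.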

Related

How to do similar type of columns addition in Pyspark?

I want to do addition of similar type of columns (total columns are more than 100) as follows:
id    b   c     d     b_apac  c_apac  d_apac
abcd  3   5     null  45      9       1
bcd   13  15    1     45      2       10
cd    32  null  6     45      90      1
The resultant table should look like this:
id    b_sum  c_sum  d_sum
abcd  48     14     1
bcd   58     17     11
cd    77     90     7
Please help me with some generic code as I have more than 100 columns to do this for.
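You can use Python's built-in sum and select the columns by the prefix of their name. Because null + x is null in Spark, each column is wrapped in coalesce so that nulls count as 0, as the expected output requires:
from pyspark.sql import functions as F

# Note: sum here is Python's built-in sum, not pyspark.sql.functions.sum.
df.select(
    'id',
    sum([F.coalesce(df[col], F.lit(0)) for col in df.columns if col.startswith('b')]).alias('b_sum'),
    sum([F.coalesce(df[col], F.lit(0)) for col in df.columns if col.startswith('c')]).alias('c_sum'),
    sum([F.coalesce(df[col], F.lit(0)) for col in df.columns if col.startswith('d')]).alias('d_sum'),
).show(10, False)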
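With more than 100 columns, writing one line per prefix does not scale. A minimal sketch of deriving the groups from the column names instead, assuming (as in the sample) that every regional column is named `<base>_apac`; `group_columns`, `suffixes` and `skip` are hypothetical names used for illustration, not part of PySpark:

```python
from collections import defaultdict

def group_columns(columns, suffixes=('_apac',), skip=('id',)):
    """Map each base column name to the columns that should be summed together."""
    groups = defaultdict(list)
    for col in columns:
        if col in skip:
            continue
        base = col
        for suffix in suffixes:
            if col.endswith(suffix):
                # Strip the regional suffix to recover the base name.
                base = col[:-len(suffix)]
                break
        groups[base].append(col)
    return dict(groups)

groups = group_columns(['id', 'b', 'c', 'd', 'b_apac', 'c_apac', 'd_apac'])
print(groups)
```

Each entry can then be turned into one expression for df.select, for example `sum(F.coalesce(F.col(c), F.lit(0)) for c in cols).alias(base + '_sum')` with `from pyspark.sql import functions as F`. Matching on the exact base name also avoids the risk that startswith('b') picks up an unrelated column that merely begins with the same letter.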

Change columns to rows per student ID

I have data in excel sheet that I am reading into a dataframe:
ID  Grade  Course   Q1 Number  Q1 Letter  Q2 Number  Q2 Letter
1   9      English  73         B          69         C
1   9      Math     70         B          52         C
1   9      Science  69         C          80         A
desired output:
ID  Grade  Course   Semester  Number Grade  Letter Grade
1   9      English  Q1        73            B
1   9      English  Q2        69            C
1   9      Math     Q1        70            B
1   9      Math     Q2        52            C
1   9      Science  Q1        69            C
1   9      Science  Q2        80            A
I'm trying to do df.melt, but it's not working. Any help is appreciated.
Update:
df = pd.read_excel('Downloads/grades_mock+data.xlsx')
dfm = df.set_index(['ID', 'GRADE', 'COURSE'])\
.rename(columns=lambda x: ' '.join(x.split(' ', 1)[::-1]))\
.reset_index()
#Eliminating duplicates.
dfm = dfm.groupby(['ID', 'GRADE', 'COURSE', 'DISCIPLINE COURSE'], as_index=False).first()
df_out = pd.wide_to_long(dfm,
['GRADE NUMERIC', 'GRADE LETTER'],
['ID', 'GRADE', 'COURSE', 'DISCIPLINE COURSE'],
'Semester', sep=' ', suffix='.*')\
.reset_index()
print(df_out)
Try this, using pd.wide_to_long, with some column renaming to make it easier:
df = pd.read_clipboard()
dfm = df.set_index(['ID', 'Grade', 'Course'])\
.rename(columns=lambda x: ' '.join(x.split(' ')[::-1]))\
.reset_index()
df_out = pd.wide_to_long(dfm,
['Number', 'Letter'],
['ID', 'Grade', 'Course'],
'Semester', sep=' ', suffix='.*')\
.reset_index()
print(df_out)
Output:
ID Grade Course Semester Number Letter
0 1 9 English Q1 73 B
1 1 9 English Q2 69 C
2 1 9 Math Q1 70 B
3 1 9 Math Q2 52 C
4 1 9 Science Q1 69 C
5 1 9 Science Q2 80 A
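The same approach can be tried without the clipboard. A minimal sketch, assuming the sample rows from the question are typed in by hand; the final rename to "Number Grade" and "Letter Grade" is an addition to match the desired output and is not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1],
    'Grade': [9, 9, 9],
    'Course': ['English', 'Math', 'Science'],
    'Q1 Number': [73, 70, 69],
    'Q1 Letter': ['B', 'B', 'C'],
    'Q2 Number': [69, 52, 80],
    'Q2 Letter': ['C', 'C', 'A'],
})

id_cols = ['ID', 'Grade', 'Course']

# wide_to_long expects "<stub><sep><suffix>", so turn 'Q1 Number' into 'Number Q1'.
dfm = df.rename(columns=lambda c: c if c in id_cols else ' '.join(c.split(' ')[::-1]))

df_out = (
    pd.wide_to_long(dfm, stubnames=['Number', 'Letter'], i=id_cols,
                    j='Semester', sep=' ', suffix='.*')
      .reset_index()
      .sort_values(['Course', 'Semester'])
      .reset_index(drop=True)
      .rename(columns={'Number': 'Number Grade', 'Letter': 'Letter Grade'})
)
print(df_out)
```

The explicit sort is only there to make the row order independent of the pandas version.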

How to take values in the column as the columns in the DataFrame in pandas

My current DataFrame is:
Term value
Name
A 1 35
A 2 40
A 3 50
B 1 20
B 2 45
B 3 50
I want to get a dataframe as:
Term 1 2 3
Name
A 35 40 50
B 20 45 50
How can I get it? I've tried using pivot_table but I didn't get my expected output. Is there any way to get it?
Use:
df = df.set_index('Term', append=True)['value'].unstack()
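Or (pivot needs Name as a regular column, and current pandas versions only accept the DataFrame form with keyword arguments):
df = df.reset_index().pivot(index='Name', columns='Term', values='value')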
print (df)
Term 1 2 3
Name
A 35 40 50
B 20 45 50
EDIT: If there are duplicate Name/Term pairs, aggregation is necessary, e.g. sum or mean:
df = df.groupby(['Name','Term'])['value'].sum().unstack(fill_value=0)
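To try these on the sample data, here is a minimal self-contained sketch. It assumes the frame from the question, with Name as the index; the pivot_table variant is shown as an alternative way to aggregate duplicate Name/Term pairs and is not taken from the answer above:

```python
import pandas as pd

df = pd.DataFrame(
    {'Term': [1, 2, 3, 1, 2, 3], 'value': [35, 40, 50, 20, 45, 50]},
    index=pd.Index(['A', 'A', 'A', 'B', 'B', 'B'], name='Name'),
)

# Option 1: move Term into the index, then unstack it into columns.
wide = df.set_index('Term', append=True)['value'].unstack()

# Option 2: pivot_table aggregates duplicate (Name, Term) pairs, here with sum.
wide_agg = df.reset_index().pivot_table(
    index='Name', columns='Term', values='value', aggfunc='sum'
)
print(wide)
```

With no duplicate pairs both results hold the same values; with duplicates, unstack raises an error while pivot_table aggregates.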

pandas df merge avoid duplicate column names

When I merge two dataframes that both have a column called A, the result has the columns A_x and A_y. How can I keep A from one dataframe and discard the other, so that I don't have to rename A_x to A after the merge?
Just filter your dataframe columns before merging.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(0,100,12),'C':list('ABCD')*3})
df2 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(100,1000,12),'C':list('ABCD')*3})
df1.merge(df2[['Key','A']], on='Key')
Output: (Note: C is not duplicated)
A_x C Key A_y
0 60 A 0 440
1 65 B 1 731
2 76 C 2 596
3 67 D 3 580
4 44 A 4 477
5 51 B 5 524
6 7 C 6 572
7 88 D 7 984
8 70 A 8 862
9 13 B 9 158
10 28 C 10 593
11 63 D 11 177
It depends on whether you need the columns with duplicated names appended to the final merged DataFrame.
If you do, add the suffixes parameter to merge:
print (df1.merge(df2, on='Key', suffixes=('', '_')))
If not, use @Scott Boston's solution.
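To keep A only from the left frame, so that no suffixes are produced at all, the overlapping non-key columns can be dropped from the right frame before merging. A minimal sketch with made-up data; the column D and the name `overlap` are assumptions for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'Key': [0, 1, 2, 3], 'A': [60, 65, 76, 67], 'C': list('ABCD')})
df2 = pd.DataFrame({'Key': [0, 1, 2, 3], 'A': [440, 731, 596, 580], 'D': [1.5, 2.5, 3.5, 4.5]})

# Columns present in both frames, excluding the join key.
overlap = df2.columns.intersection(df1.columns).difference(['Key'])

# Drop them from the right frame so that A comes from df1 only.
merged = df1.merge(df2.drop(columns=overlap), on='Key')
print(merged)
```

The merged frame then has the columns Key, A, C and D, with A taken from df1 and no A_x or A_y to rename.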

Grouping and Multiindexing a pandas dataframe

Suppose I have a dataframe as follows
In [6]: df.head()
Out[6]:
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
4 Dragoons 1st Cooze 3 70
I have a dictionary as follows:
army = {'Majors' : 'Nighthawks', 'Captains' : 'Dragoons'}
and I want the dataframe to have a multi-index in the shape of ["army","company"] only.
How will I proceed?
If I understand correctly:
You can use map to find values in a dictionary (using dictionary comprehension to swap key/value pairs since they are backwards):
army = {'Majors': 'Nighthawks', 'Captains': 'Dragoons'}
df.assign(army=df.regiment.map({k:v for v, k in army.items()})).set_index(['army', 'company'], drop=True)
regiment name preTestScore postTestScore
army company
Majors 1st Nighthawks Miller 4 25
1st Nighthawks Jacobson 24 94
2nd Nighthawks Ali 31 57
2nd Nighthawks Milner 2 62
Captains 1st Dragoons Cooze 3 70
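A self-contained version of the same idea, as a minimal sketch. The frame is rebuilt from the df.head() output in the question, and dropping the regiment column at the end is only one reading of "only"; leave that step out to keep the column:

```python
import pandas as pd

df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons'],
    'company': ['1st', '1st', '2nd', '2nd', '1st'],
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
    'preTestScore': [4, 24, 31, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70],
})
army = {'Majors': 'Nighthawks', 'Captains': 'Dragoons'}

# Invert the mapping so that regiment names become the lookup keys.
regiment_to_army = {regiment: rank for rank, regiment in army.items()}

result = (
    df.assign(army=df['regiment'].map(regiment_to_army))
      .set_index(['army', 'company'])
      .drop(columns='regiment')  # assumption: regiment is no longer needed
)
print(result)
```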
