duplicating pandas dataframe rows n times where n is conditional upon a cell value [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed 1 year ago.
Consider a pandas dataframe where the column value of some rows is a list of values:
import pandas as pd
df1 = {'Name': ['Jack', 'John', 'Paul', 'Todd', 'Chris'],
       'ID': [62, 47, 55, 74, 31],
       'Subjects': ['A', 'A', ['A', 'B', 'C'], ['A', 'B'], 'C']}
df1 = pd.DataFrame(df1)
becoming
Name ID Subjects
0 Jack 62 A
1 John 47 A
2 Paul 55 [A, B, C]
3 Todd 74 [A, B]
4 Chris 31 C
I need to transform those rows where df1.Subjects is a list, so that the list is exploded and distributed across row copies, such that dataframe becomes something like:
Name ID Subjects
0 Jack 62 A
1 John 47 A
2 Paul 55 A
3 Todd 74 A
4 Chris 31 C
5 Paul 55 B
6 Paul 55 C
7 Todd 74 B
The index is not important, but df1.ID should be preserved in each row copy.

Use explode and merge:
>>> pd.merge(df1.drop(columns='Subjects'),
df1['Subjects'].explode(),
left_index=True, right_index=True, how='outer') \
.reset_index(drop=True)
Name ID Subjects
0 Jack 62 A
1 John 47 A
2 Paul 55 A
3 Paul 55 B
4 Paul 55 C
5 Todd 74 A
6 Todd 74 B
7 Chris 31 C
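As a possible shortcut, DataFrame.explode can be called on the frame itself, which repeats the other columns for each list element and makes the merge unnecessary. The following is a minimal sketch using the data from the question; the variable name out is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Jack', 'John', 'Paul', 'Todd', 'Chris'],
    'ID': [62, 47, 55, 74, 31],
    'Subjects': ['A', 'A', ['A', 'B', 'C'], ['A', 'B'], 'C'],
})

# explode() leaves scalar cells alone and turns each list element into its own row,
# copying Name and ID along. The original index is repeated, so rebuild it afterwards.
out = df1.explode('Subjects').reset_index(drop=True)
print(out)
```

Rows come out grouped by their original position (all of Paul's rows together), the same order as in the merge-based output above.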

Related

How to do similar type of columns addition in Pyspark?

I want to add together columns of a similar type (there are more than 100 columns in total), as follows:
id    b   c     d     b_apac  c_apac  d_apac
abcd  3   5     null  45      9       1
bcd   13  15    1     45      2       10
cd    32  null  6     45      90      1
The resulting table should look like this:
id    b_sum  c_sum  d_sum
abcd  48     14     1
bcd   58     17     11
cd    77     90     7
Please help me with some generic code, as I have more than 100 columns to do this for.

You can use Python's built-in sum and check the prefix of each column name:
df.select(
'id',
sum([df[col] for col in df.columns if col.startswith('b')]).alias('b_sum'),
sum([df[col] for col in df.columns if col.startswith('c')]).alias('c_sum'),
sum([df[col] for col in df.columns if col.startswith('d')]).alias('d_sum'),
).show(10, False)
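With more than 100 columns, writing one line per prefix does not scale, and adding a null to a number in Spark yields null rather than the totals shown in the question. The sketch below is one possible generalisation, assuming every variant column is named with an "_apac" suffix as in the example; group_columns and add_group_sums are hypothetical helper names, not part of PySpark.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def group_columns(columns, id_col="id", suffix="_apac"):
    """Map each base name (e.g. 'b') to the columns sharing it (e.g. ['b', 'b_apac'])."""
    groups = defaultdict(list)
    for name in columns:
        if name == id_col:
            continue
        # Assumption: variants differ from the base column only by this suffix.
        base = name[:-len(suffix)] if name.endswith(suffix) else name
        groups[base].append(name)
    return dict(groups)


def add_group_sums(df, id_col="id", suffix="_apac"):
    """Return id plus one '<base>_sum' column per group (needs pyspark installed)."""
    # Imported lazily so that group_columns can be used and tested without Spark.
    from pyspark.sql import functions as F

    sums = [
        # coalesce(col, 0) treats null as 0, so 'null + 1' gives 1 instead of null.
        reduce(add, [F.coalesce(F.col(c), F.lit(0)) for c in cols]).alias(f"{base}_sum")
        for base, cols in group_columns(df.columns, id_col, suffix).items()
    ]
    return df.select(id_col, *sums)


print(group_columns(["id", "b", "c", "d", "b_apac", "c_apac", "d_apac"]))
```

Calling add_group_sums(df).show() on the question's DataFrame would then replace the hand-written select. Grouping on the exact base name also avoids the case where startswith('b') accidentally picks up an unrelated column that happens to begin with the same letter.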

Change columns to rows per student ID

I have data in an Excel sheet that I am reading into a dataframe:
ID  Grade  Course   Q1 Number  Q1 Letter  Q2 Number  Q2 Letter
1   9      English  73         B          69         C
1   9      Math     70         B          52         C
1   9      Science  69         C          80         A
Desired output:
ID  Grade  Course   Semester  Number Grade  Letter Grade
1   9      English  Q1        73            B
1   9      English  Q2        69            C
1   9      Math     Q1        70            B
1   9      Math     Q2        52            C
1   9      Science  Q1        69            C
1   9      Science  Q2        80            A
I'm trying to do df.melt, but it's not working. Any help is appreciated.

Update:
df = pd.read_excel('Downloads/grades_mock+data.xlsx')
dfm = df.set_index(['ID', 'GRADE', 'COURSE'])\
.rename(columns=lambda x: ' '.join(x.split(' ', 1)[::-1]))\
.reset_index()
#Eliminating duplicates.
dfm = dfm.groupby(['ID', 'GRADE', 'COURSE', 'DISCIPLINE COURSE'], as_index=False).first()
df_out = pd.wide_to_long(dfm,
['GRADE NUMERIC', 'GRADE LETTER'],
['ID', 'GRADE', 'COURSE', 'DISCIPLINE COURSE'],
'Semester', sep=' ', suffix='.*')\
.reset_index()
print(df_out)
Try this, using pd.wide_to_long, with some column renaming to make it easier:
df = pd.read_clipboard()
dfm = df.set_index(['ID', 'Grade', 'Course'])\
.rename(columns=lambda x: ' '.join(x.split(' ')[::-1]))\
.reset_index()
df_out = pd.wide_to_long(dfm,
['Number', 'Letter'],
['ID', 'Grade', 'Course'],
'Semester', sep=' ', suffix='.*')\
.reset_index()
print(df_out)
Output:
ID Grade Course Semester Number Letter
0 1 9 English Q1 73 B
1 1 9 English Q2 69 C
2 1 9 Math Q1 70 B
3 1 9 Math Q2 52 C
4 1 9 Science Q1 69 C
5 1 9 Science Q2 80 A
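For readers who want to run this without the clipboard, here is a self-contained sketch of the same wide_to_long approach, with the question's table typed in by hand. The final rename to "Number Grade" and "Letter Grade" and the explicit sort are additions to match the desired output; they are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1],
    'Grade': [9, 9, 9],
    'Course': ['English', 'Math', 'Science'],
    'Q1 Number': [73, 70, 69],
    'Q1 Letter': ['B', 'B', 'C'],
    'Q2 Number': [69, 52, 80],
    'Q2 Letter': ['C', 'C', 'A'],
})

id_cols = ['ID', 'Grade', 'Course']

# wide_to_long expects "<stub><sep><suffix>", so turn 'Q1 Number' into 'Number Q1'.
swapped = df.rename(
    columns=lambda c: c if c in id_cols else ' '.join(c.split(' ')[::-1])
)

df_out = (
    pd.wide_to_long(swapped, stubnames=['Number', 'Letter'], i=id_cols,
                    j='Semester', sep=' ', suffix='.+')
    .reset_index()
    .rename(columns={'Number': 'Number Grade', 'Letter': 'Letter Grade'})
    .sort_values(['Course', 'Semester'])   # make the row order explicit
    .reset_index(drop=True)
)
print(df_out)
```

Note that wide_to_long requires the id columns (here ID, Grade and Course) to identify each row uniquely, which is why the update above removes duplicates first.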

How to take values in the column as the columns in the DataFrame in pandas

My current DataFrame is:
Term value
Name
A 1 35
A 2 40
A 3 50
B 1 20
B 2 45
B 3 50
I want to get a dataframe as:
Term 1 2 3
Name
A 35 40 50
B 20 45 50
How can I get it? I've tried using pivot_table, but I didn't get my expected output. Is there any way to get it?

Use:
df = df.set_index('Term', append=True)['value'].unstack()
Or:
df = df.reset_index().pivot(index='Name', columns='Term', values='value')
print (df)
Term 1 2 3
Name
A 35 40 50
B 20 45 50
EDIT: If there are duplicate Name/Term pairs, aggregation is necessary, e.g. sum or mean:
df = df.groupby(['Name','Term'])['value'].sum().unstack(fill_value=0)
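Since the question mentions pivot_table, the following sketch shows how it might be applied to the same data, assuming Name is the index as in the question. The extra duplicated row is made up purely to demonstrate the aggregation case.

```python
import pandas as pd

df = pd.DataFrame(
    {'Term': [1, 2, 3, 1, 2, 3], 'value': [35, 40, 50, 20, 45, 50]},
    index=pd.Index(['A', 'A', 'A', 'B', 'B', 'B'], name='Name'),
)

# Name is the index here, so move it to a column before pivoting.
wide = df.reset_index().pivot(index='Name', columns='Term', values='value')
print(wide)

# Hypothetical duplicate of the (A, 1) row: pivot would raise on it,
# while pivot_table aggregates the duplicates with the given function.
dup = pd.concat([df, df.iloc[[0]]])
wide_sum = dup.reset_index().pivot_table(
    index='Name', columns='Term', values='value', aggfunc='sum', fill_value=0
)
print(wide_sum)
```

A common reason pivot_table appears not to work in this situation is that Name is an index level rather than a column, which reset_index addresses.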

pandas df merge avoid duplicate column names

When merging two dataframes that both have a column called A, the result has columns A_x and A_y. How can I keep A from one dataframe and discard the other, so that I don't have to rename A_x to A after the merge?

Just filter your dataframe columns before merging.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(0,100,12),'C':list('ABCD')*3})
df2 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(100,1000,12),'C':list('ABCD')*3})
df1.merge(df2[['Key','A']], on='Key')
Output: (Note: C is not duplicated)
A_x C Key A_y
0 60 A 0 440
1 65 B 1 731
2 76 C 2 596
3 67 D 3 580
4 44 A 4 477
5 51 B 5 524
6 7 C 6 572
7 88 D 7 984
8 70 A 8 862
9 13 B 9 158
10 28 C 10 593
11 63 D 11 177

It depends on whether the columns with duplicated names need to appear in the final merged DataFrame.
If they do, pass the suffixes parameter to merge:
print (df1.merge(df2, on='Key', suffixes=('', '_')))
If not, use @Scott Boston's solution.
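To keep A only from the first dataframe, one option is to select from the second dataframe just the key and the columns the first one lacks. This is a minimal sketch with small made-up frames; the column D and the name new_cols are illustrative assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({'Key': [0, 1, 2], 'A': [10, 20, 30], 'C': list('ABC')})
df2 = pd.DataFrame({'Key': [0, 1, 2], 'A': [100, 200, 300], 'D': [1.5, 2.5, 3.5]})

# Columns of df2 that df1 does not already have (the join key is added back below).
new_cols = [c for c in df2.columns if c not in df1.columns]

# df2's A is never brought in, so no A_x / A_y suffixes are created.
merged = df1.merge(df2[['Key'] + new_cols], on='Key')
print(merged)
```

The merged result keeps A with df1's values and needs no renaming afterwards.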

Grouping and Multiindexing a pandas dataframe

Suppose I have a dataframe as follows
In [6]: df.head()
Out[6]:
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
4 Dragoons 1st Cooze 3 70
I have a dictionary as follows:
army = {'Majors' : 'Nighthawks', 'Captains' : 'Dragoons'}
and I want the dataframe to have a multi-index of the shape ["army","company"] only.
How will I proceed?

If I understand correctly:
You can use map to find values in a dictionary (using dictionary comprehension to swap key/value pairs since they are backwards):
army = {'Majors': 'Nighthawks', 'Captains': 'Dragoons'}
df.assign(army=df.regiment.map({k:v for v, k in army.items()})).set_index(['army', 'company'], drop=True)
regiment name preTestScore postTestScore
army company
Majors 1st Nighthawks Miller 4 25
1st Nighthawks Jacobson 24 94
2nd Nighthawks Ali 31 57
2nd Nighthawks Milner 2 62
Captains 1st Dragoons Cooze 3 70
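The same idea can be written out step by step as a self-contained sketch built from the five rows shown in the question. Dropping the regiment column is an assumption about what "only" means in the question; the answer above keeps it.

```python
import pandas as pd

df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons'],
    'company': ['1st', '1st', '2nd', '2nd', '1st'],
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
    'preTestScore': [4, 24, 31, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70],
})
army = {'Majors': 'Nighthawks', 'Captains': 'Dragoons'}

# Invert the dictionary so that it maps regiment -> army label.
regiment_to_army = {regiment: label for label, regiment in army.items()}

out = (
    df.assign(army=df['regiment'].map(regiment_to_army))
      .drop(columns='regiment')        # assumption: regiment is no longer needed
      .set_index(['army', 'company'])
)
print(out)
```

Any regiment missing from the dictionary would map to NaN in the army level, so it may be worth checking for that before setting the index.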
