dataframe transformation python - python-3.x

I am new to pandas. I have a dataframe, df, with 3 columns: date, name and count.
For each day, is there an easy way to create a new dataframe from the original one, with one column per unique value in the name column and the matching count values placed in those columns?
date name count
0 2017-08-07 ABC 12
1 2017-08-08 ABC 5
2 2017-08-08 TTT 6
3 2017-08-09 TAC 5
4 2017-08-09 ABC 10
It should now be
date ABC TTT TAC
0 2017-08-07 12 0 0
1 2017-08-08 5 6 0
3 2017-08-09 10 0 5

import pandas as pd
df = pd.DataFrame({"date":["2017-08-07","2017-08-08","2017-08-08","2017-08-09","2017-08-09"],"name":["ABC","ABC","TTT","TAC","ABC"], "count":[12,5,6,5,10]})
# counts are stored as integers so that the filled zeros and the values share one dtype
df = df.pivot(index='date', columns='name', values='count').fillna(0).astype(int).reset_index()
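A note on a possible variation: pivot raises an error when the same date and name pair appears more than once. The sketch below assumes that such duplicates should be summed (that is an assumption, the question does not say) and uses pivot_table with fill_value=0 instead. The variable name wide is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2017-08-07", "2017-08-08", "2017-08-08", "2017-08-09", "2017-08-09"],
    "name": ["ABC", "ABC", "TTT", "TAC", "ABC"],
    "count": [12, 5, 6, 5, 10],
})

# pivot_table aggregates repeated (date, name) pairs; here they would be summed.
# fill_value=0 fills the combinations that never occur.
wide = df.pivot_table(index="date", columns="name", values="count",
                      aggfunc="sum", fill_value=0).reset_index()
wide.columns.name = None  # drop the leftover "name" label on the column axis
print(wide)
```

The name columns come out in alphabetical order (ABC, TAC, TTT), which differs from the order shown in the question.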

Related

Inner merge in python with tables having duplicate values in key column

I am struggling to replicate a SAS (another programming language) inner merge in Python.
The Python inner merge does not match the SAS inner merge when the key column contains duplicate values.
Below is an example :
zw = pd.DataFrame({"ID":[1,0,0,1,0,0,1],
"Name":['Shivansh','Shivansh','Shivansh','Amar','Arpit','Ranjeet','Priyanka'],
"job_profile":['DataS','SWD','DataA','DataA','AndroidD','PythonD','fullstac'],
"salary":[22,15,10,9,16,18,22],
"city":['noida','bangalore','hyderabad','noida','pune','gurugram','bangalore'],
"ant":[10,15,15,10,16,17,18]})
zw1 = pd.DataFrame({"ID-":[1,0,0,1,0,0,1],
"Name":['Shivansh','Shivansh','Swati','Amar','Arpit','Ranjeet','Priyanka'],
"job_profile_":['DataS','SWD','DataA','DataA','AndroidD','PythonD','fullstac'],
"salary_":[2,15,10,9,16,18,22],
"city_":['noida','kochi','hyderabad','noida','pune','gurugram','bangalore'],
"ant_":[1,15,15,10,16,17,18]})
zw and zw1 are the input tables. Both tables need to be inner merged on the key column Name. The issue is that both tables have duplicate values in the Name column.
Python generates all possible combinations of the duplicate rows.
Below is the expected output:
I tried a normal inner merge and then tried dropping duplicate rows on the ID and Name columns, but I still do not get the desired output.
df1=pd.merge(zw,zw1,on=['Name'],how='inner')
df1.drop_duplicates(['Name','ID'])
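Use a combination of DataFrame.combine_first and DataFrame.sort_values. Note that combine_first aligns the two tables on their row index, not on Name, so it relies on the rows of both tables being in matching order: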
df = zw.combine_first(zw1).sort_values('Name')
print(df)
ID ID- Name ant ant_ city city_ job_profile \
3 1 1 Amar 10 10 noida noida DataA
4 0 0 Arpit 16 16 pune pune AndroidD
6 1 1 Priyanka 18 18 bangalore bangalore fullstac
5 0 0 Ranjeet 17 17 gurugram gurugram PythonD
0 1 1 Shivansh 10 1 noida noida DataS
1 0 0 Shivansh 15 15 bangalore kochi SWD
2 0 0 Shivansh 15 15 hyderabad hyderabad DataA
job_profile_ salary salary_
3 DataA 9 9
4 AndroidD 16 16
6 fullstac 22 22
5 PythonD 18 18
0 DataS 22 2
1 SWD 15 15
2 DataA 10 10
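If the goal is to pair the first occurrence of a Name in one table with the first occurrence in the other, the second with the second, and so on, one common approach is to add a per-Name occurrence counter and merge on both columns. This is only a sketch under that assumption, since the expected output is not shown; the occurrence column and the shortened sample tables are illustrative.

```python
import pandas as pd

zw = pd.DataFrame({"Name": ["Shivansh", "Shivansh", "Shivansh", "Amar"],
                   "salary": [22, 15, 10, 9]})
zw1 = pd.DataFrame({"Name": ["Shivansh", "Shivansh", "Swati", "Amar"],
                    "salary_": [2, 15, 10, 9]})

# A plain inner merge pairs every Shivansh row on the left with every one on the right.
plain = pd.merge(zw, zw1, on="Name", how="inner")

# Number the repeats of each Name inside its own table (0, 1, 2, ...).
zw_k = zw.assign(occurrence=zw.groupby("Name").cumcount())
zw1_k = zw1.assign(occurrence=zw1.groupby("Name").cumcount())

# Merging on Name and occurrence pairs the n-th repeat with the n-th repeat only.
paired = (zw_k.merge(zw1_k, on=["Name", "occurrence"], how="inner")
              .drop(columns="occurrence"))
print(len(plain), len(paired))
print(paired)
```

Rows without a partner at the same occurrence number (the third Shivansh in zw, Swati in zw1) are dropped by the inner merge.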

How can I sort 3 columns and assign it to one python pandas

I have a dataframe:
df = pd.DataFrame({'A':[1,1,1], 'B':[2012,3014,3343], 'C':[12,13,45], 'D':[111,222,444]})
but I need to join the last 3 columns horizontally, in consecutive order, and assign the values to the matching entry of the first column, something like this:
df2 = pd.DataFrame({'A':[1,1,1,2,2,2], 'Fusion3':[2012,12,111,3014,13,222]})
I have tried .melt but I am struggling with it. I would be grateful for any ideas or comments.
From the desired output I'm making the assumption that the initial dataframe should have 1,2,3 in the A column rather than 1,1,1
import pandas as pd
df= pd.DataFrame({'A':[1,2,3], 'B':[2012,3014,3343], 'C':[12,13,45], 'D':[111,222,444]})
df = df.set_index('A')
df = df.stack().droplevel(1)
will give you this series:
A
1 2012
1 12
1 111
2 3014
2 13
2 222
3 3343
3 45
3 444
Alternatively, use melt on the original dataframe:
out = df.melt('A').drop(columns='variable')
print(out)
A value
0 1 2012
1 2 3014
2 3 3343
3 1 12
4 2 13
5 3 45
6 1 111
7 2 222
8 3 444
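The melt output above is grouped by source column rather than by A. A small sketch of one way to get the order asked for in the question is to follow the melt with a stable sort on A; the value_name 'Fusion3' is taken from the question, and the 1,2,3 values in A follow the assumption made in the first answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [2012, 3014, 3343],
                   'C': [12, 13, 45], 'D': [111, 222, 444]})

out = (df.melt('A', value_name='Fusion3')
         .drop(columns='variable')
         # mergesort is stable, so within each A the B, C, D order is preserved
         .sort_values('A', kind='mergesort')
         .reset_index(drop=True))
print(out)
```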

Taking different records from groups using group by in pandas

Suppose I have dataframe like this
>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
Now I want all records from each group (grouped by id) except the last 3. That means I want to drop the last 3 records from every group. How can I do it using pandas groupby? This is dummy data.
Use GroupBy.cumcount with ascending=False to number the rows of each group from the end, then keep the rows whose counter is greater than 2 with Series.gt. The threshold is 2 rather than 3 because the counter starts at 0:
df = df[df.groupby('id').cumcount(ascending=False).gt(2)]
print (df)
id value
3 2 1
Details:
print (df.groupby('id').cumcount(ascending=False))
0 2
1 1
2 0
3 3
4 2
5 1
6 0
7 0
8 0
dtype: int64
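The same idea can be wrapped in a small function so the number of trailing rows to drop becomes a parameter. drop_last_n is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def drop_last_n(frame, key, n):
    """Hypothetical helper: drop the last n rows of every group defined by key."""
    # Counter from the end of each group: the last row gets 0, the one before it 1, ...
    from_end = frame.groupby(key).cumcount(ascending=False)
    # Rows with a counter of at least n are not among the last n rows of their group.
    return frame[from_end.ge(n)]

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})
print(drop_last_n(df, 'id', 3))
print(drop_last_n(df, 'id', 1))
```

Groups with n or fewer rows disappear entirely, which matches the behaviour of the one-liner above.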

Pandas: Sort a dataframe based on multiple columns

I know that this question has been asked several times. But none of the answers match my case.
I have a pandas dataframe with the columns Department and Employee_Count. I need to sort the Employee_Count column in descending order, but if two Employee_Count values are tied, those rows should be sorted alphabetically by Department.
Department Employee_Count
0 abc 10
1 adc 10
2 bca 11
3 cde 9
4 xyz 15
required output:
Department Employee_Count
0 xyz 15
1 bca 11
2 abc 10
3 adc 10
4 cde 9
This is what I've tried.
df = df.sort_values(['Department','Employee_Count'],ascending=[True,False])
But this just sorts the departments alphabetically.
I've also tried to sort by Department first and then by Employee_Count. Like this:
df = df.sort_values(['Department'],ascending=[True])
df = df.sort_values(['Employee_Count'],ascending=[False])
This doesn't give me correct output either:
Department Employee_Count
4 xyz 15
2 bca 11
1 adc 10
0 abc 10
3 cde 9
It gives 'adc' first and then 'abc'.
Kindly help me.
You can swap the columns in the list, and swap the values in the ascending parameter to match.
Explanation:
The order of the column names is the sort priority: rows are first sorted by Employee_Count in descending order, and only the rows that tie on Employee_Count are then sorted by Department in ascending order.
df1 = df.sort_values(['Employee_Count', 'Department'], ascending=[False, True])
print (df1)
Department Employee_Count
4 xyz 15
2 bca 11
0 abc 10 <-
1 adc 10 <-
3 cde 9
As a check, if the second value is also False, the tied rows are sorted by Department in descending order:
df2 = df.sort_values(['Employee_Count', 'Department'], ascending=[False, False])
print (df2)
Department Employee_Count
4 xyz 15
2 bca 11
1 adc 10 <-
0 abc 10 <-
3 cde 9
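The required output in the question also has a fresh 0 to 4 index. As a minimal sketch, the same sort followed by reset_index(drop=True) reproduces it; the dataframe below is rebuilt from the sample data in the question.

```python
import pandas as pd

df = pd.DataFrame({'Department': ['abc', 'adc', 'bca', 'cde', 'xyz'],
                   'Employee_Count': [10, 10, 11, 9, 15]})

# Primary key: Employee_Count descending; tie-breaker: Department ascending.
result = (df.sort_values(['Employee_Count', 'Department'], ascending=[False, True])
            .reset_index(drop=True))  # discard the old row labels
print(result)
```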

How to remove the repeated row spanning two dataframe index in python

I have a dataframe as follows:
import pandas as pd
d = {'location1': [1, 2,3,8,6], 'location2':
[2,1,4,6,8]}
df = pd.DataFrame(data=d)
Each row of the dataframe df means there is a road between two locations. It looks like this:
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
The first row means there is a road between location 1 and location 2; however, the second row encodes the same information. The fourth and fifth rows also repeat each other. I am trying to remove those repeats by keeping only one row of each pair. Either row is okay.
For example, my expected output is
location1 location2
0 1 2
2 3 4
4 6 8
Is there an efficient way to do that? I have a large dataframe with lots of repeated rows.
Thanks a lot,
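It looks like you want every other row in your dataframe. This works for the sample data, where taking every other row happens to give the expected output, but it does not detect reversed pairs in general.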
import pandas as pd
d = {'location1': [1, 2,3,8,6], 'location2':
[2,1,4,6,8]}
df = pd.DataFrame(data=d)
print(df)
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
def Every_other_row(a):
    return a[::2]
Every_other_row(df)
location1 location2
0 1 2
2 3 4
4 6 8
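For data where the reversed pairs are not neatly interleaved, a commonly used alternative is to sort the two values within each row so that (2, 1) and (1, 2) become the same key, then drop rows whose key was already seen. This is a sketch of that technique, not the answer's method; the names pairs and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'location1': [1, 2, 3, 8, 6], 'location2': [2, 1, 4, 6, 8]})

# Sort the two values inside each row so the direction of the road no longer matters.
pairs = pd.DataFrame(np.sort(df[['location1', 'location2']].to_numpy(), axis=1),
                     index=df.index)

# duplicated() flags every repeat after the first occurrence of a key.
result = df[~pairs.duplicated()]
print(result)
```

This keeps the first row of each pair (index 0, 2 and 3 here) rather than rows 0, 2 and 4, which the question says is acceptable. The work is vectorised, so it should scale reasonably to large dataframes.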
