Sort pandas dataframe by a column

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
import datetime
# initialise data of lists.
data = {'A' :[1,1,1,1,2,2,2,2],
'B' :[2,3,1,5,7,7,1,6]}
# Create DataFrame
df = pd.DataFrame(data)
df
I want to sort 'B' by each group of 'A'
Expected Output:
A B
0 1 1
1 1 2
2 1 3
3 1 5
4 2 1
5 2 6
6 2 7
7 2 7

You can sort a dataframe using the sort_values method. Passing both columns sorts the dataframe by A first and then by B within each value of A, as requested.
df.sort_values(by=['A', 'B'])
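The expected output in the question shows a renumbered index, while sort_values keeps the original row labels. A minimal sketch, assuming the question's data, that sorts and then renumbers the rows (the variable name result is only for illustration):

```python
import pandas as pd

data = {'A': [1, 1, 1, 1, 2, 2, 2, 2],
        'B': [2, 3, 1, 5, 7, 7, 1, 6]}
df = pd.DataFrame(data)

# Sort by A first, then by B within each value of A.
# reset_index(drop=True) discards the old labels and renumbers from 0.
result = df.sort_values(by=['A', 'B']).reset_index(drop=True)
print(result)
```

sort_values returns a new dataframe, so df itself is unchanged unless the result is assigned back.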

Related

How to access list of list values in columns in dataset

My DataFrame has a column that contains lists of values. For example, it has columns A, B, and C plus an output column. Column A holds a value such as 12, column B a value such as 30, and column C a list of values such as [0.01, 1.234, 2.31]. When I try to compute the mean of these lists, I get an error saying the list object has no attribute mean. How can I convert each list in the column to its mean?
You can transform the column which contains the lists to another DataFrame and calculate the mean.
import pandas as pd
df = ... # Original df
pd.DataFrame(df['column_with_lists'].values.tolist()).mean(1)
This returns a pandas Series with one mean per row, which looks like the following:
0 mean_of_list_row_0
1 mean_of_list_row_1
. .
. .
. .
n mean_of_list_row_n
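As a concrete illustration of this approach, here is a small sketch. The column names A, B, and C come from the question; the second row and the C_mean column name are made up for the example. Lists of different lengths are padded with NaN when expanded, and mean skips NaN by default, so each row's mean is taken over its own values only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [12, 15],
    'B': [30, 40],
    'C': [[0.01, 1.234, 2.31], [1.0, 3.0]],  # lists of different lengths
})

# Expand each list into its own row of a temporary DataFrame;
# shorter lists are padded with NaN on the right.
expanded = pd.DataFrame(df['C'].tolist(), index=df.index)

# Row-wise mean; NaN padding is ignored by default (skipna=True).
df['C_mean'] = expanded.mean(axis=1)
print(df)
```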
You can use apply(np.mean) on the column with the lists in it to get the mean. For example:
Build a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame([[2,4],[4,6]])
df[3] = [[5,7],[8,9,10]]
print(df)
0 1 3
0 2 4 [5, 7]
1 4 6 [8, 9, 10]
Use apply(np.mean)
print(df[3].apply(np.mean))
0 6.0
1 9.0
Name: 3, dtype: float64
If you want to convert that column into the mean of the lists:
df[3] = df[3].apply(np.mean)
print(df)
0 1 3
0 2 4 6.0
1 4 6 9.0

How to change the format for values in a dataframe?

I need to change the format of the values in a dataframe column. My dataframe looks like this:
df =
sector funding_total_usd
1 NaN
2 10,00,000
3 3,90,000
4 34,06,159
5 2,17,50,000
6 20,00,000
How can I change it to this format:
df =
sector funding_total_usd
1 NaN
2 10000.00
3 3900.00
4 34061.59
5 217500.00
6 20000.00
This is my code:
for row in df['funding_total_usd']:
dt1 = row.replace (',','')
print (dt1)
This is the error I got: "AttributeError: 'float' object has no attribute 'replace'"
I would really appreciate help with how to do this.
Here's the way to get the decimal places:
import pandas as pd
import numpy as np
df= pd.DataFrame({'funding_total_usd': [np.nan, 1000000, 390000, 3406159,21750000,2000000]})
print(df)
df['funding_total_usd'] /= 100
print(df)
funding_total_usd
0 NaN
1 1000000.0
2 390000.0
3 3406159.0
4 21750000.0
5 2000000.0
funding_total_usd
0 NaN
1 10000.00
2 3900.00
3 34061.59
4 217500.00
5 20000.00
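To control how floats are displayed, run this as your first command before you print. It formats every float with two decimal places and no thousands separators. It changes only the display, not the stored values, and it does not strip commas from string values; a column holding strings such as '10,00,000' must be converted to numbers first.
pd.options.display.float_format = '{:.2f}'.format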
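The AttributeError in the question arises because the missing value is a float NaN, which has no replace method, while the other entries are strings. A possible sketch, assuming the column really holds strings with commas as shown in the question, uses the vectorised .str accessor, which leaves NaN untouched, and then converts to numbers. The division by 100 follows the answer above and matches the expected output.

```python
import numpy as np
import pandas as pd

# Assumed input: strings with commas plus one missing value, as in the question.
df = pd.DataFrame({'funding_total_usd': [np.nan, '10,00,000', '3,90,000',
                                         '34,06,159', '2,17,50,000', '20,00,000']})

# Remove the commas as literal characters; NaN entries stay NaN.
cleaned = df['funding_total_usd'].str.replace(',', '', regex=False)

# Convert the cleaned strings to floats and scale as in the expected output.
df['funding_total_usd'] = pd.to_numeric(cleaned) / 100
print(df)
```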

Remove a character from a pandas dataframe columns

I have a dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['AA_L8_ZZ', 'AA_L08_YY', 'AA_L800_XX', 'AA_L0008_CC']})
df
col1
0 AA_L8_ZZ
1 AA_L08_YY
2 AA_L800_XX
3 AA_L0008_CC
I want to remove all 0's after character 'L'.
My expected output:
col1
0 AA_L8_ZZ
1 AA_L8_YY
2 AA_L800_XX
3 AA_L8_CC
In [114]: import pandas as pd
...: import numpy as np
...: df = pd.DataFrame({'col1':['AA_L8_ZZ', 'AA_L08_YY', 'AA_L800_XX', 'AA_L0008_CC']})
...: df
Out[114]:
col1
0 AA_L8_ZZ
1 AA_L08_YY
2 AA_L800_XX
3 AA_L0008_CC
In [115]: df.col1.str.replace("L([0]*)", "L", regex=True)
Out[115]:
0 AA_L8_ZZ
1 AA_L8_YY
2 AA_L800_XX
3 AA_L8_CC
Name: col1, dtype: object
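Pandas string replace suffices for this. The code below looks for one or more 0s immediately preceded by L and replaces them with an empty string (regex=True is required in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
df.col1.str.replace(r"(?<=L)0+", "", regex=True)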
0 AA_L8_ZZ
1 AA_L8_YY
2 AA_L800_XX
3 AA_L8_CC
If you need more speed, you could drop down to plain Python with a list comprehension:
import re
df["cleaned"] = [re.sub(r"(?<=L)0+", "", entry) for entry in df.col1]
df
col1 cleaned
0 AA_L8_ZZ AA_L8_ZZ
1 AA_L08_YY AA_L8_YY
2 AA_L800_XX AA_L800_XX
3 AA_L0008_CC AA_L8_CC
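One edge case the patterns above do not address is a value whose number is all zeros, such as AA_L0_ZZ, which would become AA_L_ZZ. The question does not say how that should be handled, so the following is only a sketch under the assumption that one digit should always be kept; the extra rows and the cleaned column are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['AA_L8_ZZ', 'AA_L08_YY', 'AA_L800_XX',
                            'AA_L0008_CC', 'AA_L0_ZZ', 'AA_L000_QQ']})

# (?<=L) : the zeros must come directly after an L
# 0+     : one or more zeros
# (?=\d) : only strip them if at least one digit remains afterwards
df['cleaned'] = df['col1'].str.replace(r'(?<=L)0+(?=\d)', '', regex=True)
print(df)
```

For the four values from the question this gives the same result as the lookbehind-only pattern.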

List of Dataframe

I have two DataFrames that share some columns and differ in one. Both are added to a list with append.
How do I reorganise the list into a new DataFrame with the data arranged in columns?
The two DataFrames look like this and are added with append:
import pandas as pd
a=[]
r1={'date' : ['2003-01-31','2003-01-31'],'name' :['mod','dom'],'fib' :[2,3]}
df1=pd.DataFrame(r1,columns=['date','name','fib'])
r2={'date' : ['2003-01-31','2003-01-31'],'name' :['dom','mod'],'bif' :[5,7]}
df2=pd.DataFrame(r2,columns=['date','name','bif'])
a.append(df1)
a.append(df2)
a
Then I concatenate the list a into a new DataFrame:
z=pd.concat(map(pd.DataFrame,a))
z
How do I reorganise z so that it has only two rows?
The output I expect is:
r3={'date':['2003-01-31','2003-01-31'],'name' :['mod','dom'],'fib':[2,3],'bif':[7,5]}
pd.DataFrame(r3)
For z, I would do:
z=pd.concat([i.set_index(['date','name']) for i in a],axis=1).reset_index()
print(z)
date name fib bif
0 2003-01-31 dom 3 5
1 2003-01-31 mod 2 7
Try using pd.merge
df1.merge(df2, on=['name', 'date'])
Results:
date name fib bif
0 2003-01-31 mod 2 7
1 2003-01-31 dom 3 5
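Since the question starts from a list of DataFrames, the merge can also be applied across the whole list rather than to df1 and df2 by name. A minimal sketch using functools.reduce, assuming every frame in the list has the date and name columns; this is one possible way to generalise the answer, not the only one:

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'date': ['2003-01-31', '2003-01-31'],
                    'name': ['mod', 'dom'], 'fib': [2, 3]})
df2 = pd.DataFrame({'date': ['2003-01-31', '2003-01-31'],
                    'name': ['dom', 'mod'], 'bif': [5, 7]})
a = [df1, df2]

# Merge the frames pairwise, left to right, on the shared key columns.
# The default inner join keeps only (date, name) pairs present in every frame.
z = reduce(lambda left, right: left.merge(right, on=['date', 'name']), a)
print(z)
```

If some frames may be missing keys, passing how='outer' to merge keeps those rows and fills the gaps with NaN.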

Change the bar item name in Pandas

I have a test Excel file like:
df = pd.DataFrame({'name':list('abcdefg'),
'age':[10,20,5,23,58,4,6]})
print (df)
name age
0 a 10
1 b 20
2 c 5
3 d 23
4 e 58
5 f 4
6 g 6
I use Pandas and matplotlib to read and plot it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
excel_file = 'test.xlsx'
df = pd.read_excel(excel_file, sheet_name=0)
df.plot(kind="bar")
plt.show()
The resulting bar chart uses the index number as the label for each bar. How can I change it to the values stored in the name column?
You can specify columns for x and y values in plot.bar:
df.plot(x='name', y='age', kind="bar")
Or create a Series first with DataFrame.set_index and select the age column:
df.set_index('name')['age'].plot(kind="bar")
#if multiple columns
#df.set_index('name').plot(kind="bar")
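To check that the bars are labelled by name, the tick labels can be read back from the Axes object that plot returns. The sketch below builds the data in memory instead of reading test.xlsx and selects the non-interactive Agg backend so it runs without a display; both choices are assumptions made for the example.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'name': list('abcdefg'),
                   'age': [10, 20, 5, 23, 58, 4, 6]})

# x='name' makes the name column supply the bar labels instead of the index.
ax = df.plot(x='name', y='age', kind='bar', rot=0, legend=False)
ax.set_ylabel('age')

# Read the labels back from the x axis to confirm what will be drawn.
labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(labels)
plt.close('all')
```

In an interactive session, plt.show() would replace the final plt.close('all') call.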
