I have a test Excel file like:
df = pd.DataFrame({'name': list('abcdefg'),
                   'age': [10, 20, 5, 23, 58, 4, 6]})
print(df)

  name  age
0    a   10
1    b   20
2    c    5
3    d   23
4    e   58
5    f    4
6    g    6
I use Pandas and matplotlib to read and plot it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
excel_file = 'test.xlsx'
df = pd.read_excel(excel_file, sheet_name=0)
df.plot(kind="bar")
plt.show()
The resulting plot uses the index number as the item label. How can I change it to use the names stored in the name column?
You can specify the columns to use for the x and y values in DataFrame.plot.bar:
df.plot(x='name', y='age', kind="bar")
Or create a Series first with DataFrame.set_index and select the age column:
df.set_index('name')['age'].plot(kind="bar")
#if multiple columns
#df.set_index('name').plot(kind="bar")
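Putting it together, a minimal end-to-end sketch (my addition, assuming test.xlsx holds the name/age data shown in the question):

import pandas as pd
import matplotlib.pyplot as plt

# read the first sheet; the file is expected to contain 'name' and 'age' columns
df = pd.read_excel('test.xlsx', sheet_name=0)

# use the 'name' column for the x axis instead of the default integer index
df.plot(x='name', y='age', kind='bar', legend=False)
plt.show()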
I have a dataframe as shown below.
df:
ID tag
1 pandas
2 numpy
3 matplotlib
4 pandas
5 pandas
6 sns
7 sklearn
8 sklearn
9 pandas
10 pandas
To the above df, I would like to add a column named tag_binary, which indicates whether the tag is pandas or not.
Expected output:
ID tag tag_binary
1 pandas pandas
2 numpy non_pandas
3 matplotlib non_pandas
4 pandas pandas
5 pandas pandas
6 sns non_pandas
7 sklearn non_pandas
8 sklearn non_pandas
9 pandas pandas
10 pandas pandas
I tried the code below using a dictionary and the map function. It works fine, but I am wondering whether there is an easier way that avoids building the complete dictionary.
d = {'pandas':'pandas', 'numpy':'non_pandas', 'matplotlib':'non_pandas',
'sns':'non_pandas', 'sklearn':'non_pandas'}
df["tag_binary"] = df['tag'].map(d)
You can use where with an equality check to keep 'pandas' and fill everything else with 'non_pandas'.
df['tag_binary'] = df['tag'].where(df['tag'].eq('pandas'), 'non_pandas')
ID tag tag_binary
0 1 pandas pandas
1 2 numpy non_pandas
2 3 matplotlib non_pandas
3 4 pandas pandas
4 5 pandas pandas
5 6 sns non_pandas
6 7 sklearn non_pandas
7 8 sklearn non_pandas
8 9 pandas pandas
9 10 pandas pandas
If you want something a little more flexible, so you can also map specific values to some other label, you can leverage the fact that map returns NaN for keys that are not in your dict: only specify the mappings you care about, then use fillna to handle every other case.
# Could be more general like {'pandas': 'pandas', 'geopandas': 'pandas'}
d = {'pandas': 'pandas'}
df['pandas_binary'] = df['tag'].map(d).fillna('non_pandas')
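As a quick sanity check (my addition, using a trimmed copy of the sample data), the where and map/fillna approaches produce the same column; numpy.where is a third equivalent spelling for this simple two-label case:

import numpy as np
import pandas as pd

df = pd.DataFrame({'tag': ['pandas', 'numpy', 'matplotlib', 'pandas']})

out_where = df['tag'].where(df['tag'].eq('pandas'), 'non_pandas')
out_map = df['tag'].map({'pandas': 'pandas'}).fillna('non_pandas')
# numpy.where builds the same labels from a boolean mask
out_np = pd.Series(np.where(df['tag'] == 'pandas', 'pandas', 'non_pandas'),
                   index=df.index)

print((out_where == out_map).all() and (out_where == out_np).all())  # True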
You can use apply:
def is_pandas(name):
    if name == 'pandas':
        return 'pandas'  # or True
    return 'non_pandas'  # or False

df['tag_binary'] = df['tag'].apply(is_pandas)
If you specifically need categorical data, e.g. to assign an ordering hierarchy, to guarantee that only these values are permitted in the column, or simply to reduce memory usage, you can create a CategoricalDtype, make the conversion with astype, and then use fillna to fill the NaN values introduced for values that are not among the Categorical's categories:
cat_dtype = pd.CategoricalDtype(['pandas', 'non_pandas'])
df['tag_binary'] = df['tag'].astype(cat_dtype).fillna('non_pandas')
df:
ID tag tag_binary
0 1 pandas pandas
1 2 numpy non_pandas
2 3 matplotlib non_pandas
3 4 pandas pandas
4 5 pandas pandas
5 6 sns non_pandas
6 7 sklearn non_pandas
7 8 sklearn non_pandas
8 9 pandas pandas
9 10 pandas pandas
Setup Used:
import pandas as pd
df = pd.DataFrame({
'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'tag': ['pandas', 'numpy', 'matplotlib', 'pandas', 'pandas', 'sns',
'sklearn', 'sklearn', 'pandas', 'pandas']
})
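Continuing from that setup, a small illustrative check (my addition) of what the categorical conversion gives you; the values end up stored as small integer codes behind the scenes:

cat_dtype = pd.CategoricalDtype(['pandas', 'non_pandas'])
df['tag_binary'] = df['tag'].astype(cat_dtype).fillna('non_pandas')

print(df['tag_binary'].dtype)               # category
print(df['tag_binary'].cat.categories)      # Index(['pandas', 'non_pandas'], dtype='object')
print(df['tag_binary'].cat.codes.tolist())  # 0 for pandas rows, 1 for everything else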
In my DataFrame I have a column that contains lists of values. For example, I have columns A, B, C and an output column. Column A holds the value 12, column B holds 30, and column C holds a list of values like [0.01, 1.234, 2.31]. When I try to find the mean of these lists, I get an error saying the list object has no attribute mean. How can I convert every list in that column to its mean in the DataFrame?
You can transform the column that contains the lists into another DataFrame and calculate the row-wise mean.
import pandas as pd
df = ... # Original df
pd.DataFrame(df['column_with_lists'].values.tolist()).mean(1)
This results in a pandas Series that looks like the following:
0 mean_of_list_row_0
1 mean_of_list_row_1
. .
. .
. .
n mean_of_list_row_n
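A self-contained sketch of that idea (my addition), using a hypothetical column name C for the list column; the 12/30/[0.01, 1.234, 2.31] row comes from the question, the second row is made up for illustration:

import pandas as pd

df = pd.DataFrame({'A': [12, 15],
                   'B': [30, 40],
                   'C': [[0.01, 1.234, 2.31], [1.0, 2.0]]})

# expand the lists into their own DataFrame, then take the row-wise mean;
# lists of unequal length are padded with NaN, which mean() skips
df['C_mean'] = pd.DataFrame(df['C'].tolist(), index=df.index).mean(axis=1)
print(df)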
You can use apply(np.mean) on the column with the lists in it to get the mean. For example:
Build a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame([[2,4],[4,6]])
df[3] = [[5,7],[8,9,10]]
print(df)
0 1 3
0 2 4 [5, 7]
1 4 6 [8, 9, 10]
Use apply(np.mean):
print(df[3].apply(np.mean))

0    6.0
1    9.0
Name: 3, dtype: float64

If you want to convert that column into the mean of the lists:
df[3] = df[3].apply(np.mean)
print(df)

   0  1    3
0  2  4  6.0
1  4  6  9.0
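One caveat (my addition, not from the answer above): if a cell can hold an empty list, np.mean returns NaN and emits a RuntimeWarning. A small guard applied to the original list column avoids the warning:

df[3] = df[3].apply(lambda lst: np.mean(lst) if len(lst) > 0 else np.nan)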
I have a pandas dataframe as below:
import pandas as pd
import numpy as np
import datetime
# initialise data of lists.
data = {'A' :[1,1,1,1,2,2,2,2],
'B' :[2,3,1,5,7,7,1,6]}
# Create DataFrame
df = pd.DataFrame(data)
df
I want to sort 'B' within each group of 'A'.
Expected Output:
A B
0 1 1
1 1 2
2 1 3
3 1 5
4 2 1
5 2 6
6 2 7
7 2 7
You can sort a DataFrame using the sort_values method. It will sort your DataFrame first by A and then by B, as requested.
df.sort_values(by=['A', 'B'])
Docs
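To reproduce the expected output exactly, with a fresh 0..7 index, you can chain reset_index; a small sketch (my addition) on the data from the question:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                   'B': [2, 3, 1, 5, 7, 7, 1, 6]})

# sort within each value of A, then renumber the rows
print(df.sort_values(by=['A', 'B']).reset_index(drop=True))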
I have a dataframe with 990 rows and 7 columns, and I want to make an X-vs-Y line graph, breaking the line every 22 rows.
I think that splitting the DataFrame and then plotting each piece is a good way to do it, but I am not getting good results.
max_rows = 22
dataframes = []
while len(Co1new) > max_rows:
    top = Co1new[:max_rows]
    dataframes.append(top)
    Co1new = Co1new[max_rows:]
else:
    dataframes.append(Co1new)

for grafico in dataframes:
    AC = plt.plot(grafico)
    AC = plt.xlabel('Frequency (Hz)')
    AC = plt.ylabel("Temperature (K)")
    plt.show()
The code functions but it is not plotting the right columns.
Here is some reduced data; in this case it should be divided every four rows:
df = pd.DataFrame({
'col1':[2.17073,2.14109,2.16052,2.81882,2.29713,2.26273,2.26479,2.7643,2.5444,2.5027,2.52532,2.6778],
'col2':[10,100,1000,10000,10,100,1000,10000,10,100,1000,10000],
'col3':[2.17169E-4,2.15889E-4,2.10526E-4,1.53785E-4,2.09867E-4,2.07583E-4,2.01699E-4,1.56658E-4,1.94864E-4,1.92924E-4,1.87634E-4,1.58252E-4]})
One way I can think of is to add a new column with labels for every 22 records; see below.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
seaborn.set(style='ticks')
"""
Assuming the index is numeric and is from [0-990)
this will return an integer for every 22 records
"""
Co1new['subset'] = 'S' + np.floor_divide(Co1new.index, 22).astype(str)
Out (with the reduced sample data, using 4 instead of 22):
col1 col2 col3 subset
0 2.17073 10 0.000217 S0
1 2.14109 100 0.000216 S0
2 2.16052 1000 0.000211 S0
3 2.81882 10000 0.000154 S0
4 2.29713 10 0.000210 S1
5 2.26273 100 0.000208 S1
6 2.26479 1000 0.000202 S1
7 2.76434 10000 0.000157 S1
8 2.54445 10 0.000195 S2
9 2.50270 100 0.000193 S2
10 2.52532 1000 0.000188 S2
11 2.67780 10000 0.000158 S2
You can then use seaborn.pairplot to plot your data pairwise and use Co1new['subset'] as legend.
seaborn.pairplot(Co1new, hue='subset')
Or, if you absolutely need line charts, you can plot your data one pair of columns at a time; here is col1 vs. col3:
seaborn.lineplot(x='col1', y='col3', hue='subset', data=Co1new)
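If you prefer plain matplotlib instead of seaborn, the same subset column can drive one line per group; a rough sketch (my addition; the col2 vs. col1 pairing and the log scale are assumptions, pick whichever columns you actually need):

import matplotlib.pyplot as plt

for label, grp in Co1new.groupby('subset'):
    plt.plot(grp['col2'], grp['col1'], marker='o', label=label)

plt.xscale('log')  # col2 spans 10..10000 in the sample data
plt.xlabel('Frequency (Hz)')
plt.ylabel('Temperature (K)')
plt.legend()
plt.show()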
Using @SIA's answer (with 4 instead of 3, so the reduced data is split every four rows):
df['groups'] = np.floor_divide(df.index, 4).astype(str)
import plotly.express as px
fig = px.line(df, x="col1", y="col2", color='groups')
fig.show()
Given the following data frame and pivot table:
import pandas as pd
df=pd.DataFrame({'A':['a','a','a','a','a','b','b','b','b'],
'B':['x','y','z','x','y','z','x','y','z'],
'C':['a','b','a','b','a','b','a','b','a'],
'D':[7,5,3,4,1,6,5,3,1]})
table = pd.pivot_table(df, index=['A', 'B','C'],aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I'd like to create a heat map with divisions per the indices A and B. Is it possible?
You can use a Styler in a Jupyter notebook; see the docs and example notebook:
import seaborn as sns
import pandas as pd
df=pd.DataFrame({'A':['a','a','a','a','a','b','b','b','b'],
'B':['x','y','z','x','y','z','x','y','z'],
'C':['a','b','a','b','a','b','a','b','a'],
'D':[7,5,3,4,1,6,5,3,1]})
table = pd.pivot_table(df, index=['A', 'B','C'],aggfunc='sum')
table
cm = sns.light_palette("blue", as_cmap=True)
s = table.style.background_gradient(cmap=cm)
s
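If you want an actual heat map image rather than a styled table (my own alternative, not part of the answer above), seaborn.heatmap works once C is pivoted into the columns, so the A/B divisions stay on the y axis:

import matplotlib.pyplot as plt

# move C into the columns; rows keep the (A, B) MultiIndex
heat = df.pivot_table(index=['A', 'B'], columns='C', values='D', aggfunc='sum')

sns.heatmap(heat, annot=True, cmap=cm)
plt.show()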