Plot number of occurrences in Pandas dataframe (2) - python-3.x

This is a follow-up to the previous question: Plot number of occurrences from Pandas DataFrame
I'm trying to produce a bar chart in descending order from a pandas dataframe grouped by "Issuing Office". The data comes from a CSV file with 3 columns: System (string), Issuing Office (string), Error Type (string). The first four commands work fine: read the file, fix the column headers, strip out the offices I don't need, and reset the index. However, I've never displayed a chart before.
CSV looks like:
System Issuing Office Error Type
East N1 Error1
East N1 Error1
East N2 Error1
West N1 Error3
I'm looking for a simple horizontal bar chart that would show, for the sample rows above, that N1 had a count of 3 and N2 had a count of 1.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('mydatafile.csv',index_col=None, header=0) #ok
df.columns = [c.replace(' ','_') for c in df.columns] #ok
df = df[df['Issuing_Office'].str.contains("^(?:N|M|V|R)")] #ok
df = df.reset_index(drop=True) #ok
# produce chart that shows how many times an office came up (Descending)
df.groupby([df.index, 'Issuing_Office']).count().plot(kind='bar')
plt.show()
# produce chart that shows how many error types per Issuing Office (Descending).
There are no date fields in this which makes it different than the original question. Any help is greatly appreciated :)

JohnE's solution worked. Used the code:
# produce chart that shows how many times an office came up (Descending)
df['Issuing_Office'].value_counts().plot(kind='barh') #--JohnE
plt.gca().invert_yaxis()
plt.show()
# produce chart that shows how many error types per Issuing Office N1 (Descending).
dfN1 = df[df['Issuing_Office'].str.contains('N1')]
dfN1['Error_Type'].value_counts().plot(kind='barh')
plt.gca().invert_yaxis()
plt.show()
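For anyone who wants to try this without the original file, here is a minimal sketch built on the four sample rows from the question. The inline CSV text, the Agg backend, the output file name and the use of pd.crosstab for the per-office error breakdown are my own assumptions, not part of JohnE's answer.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove to get an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for mydatafile.csv: the four sample rows from the question.
csv_text = """System,Issuing Office,Error Type
East,N1,Error1
East,N1,Error1
East,N2,Error1
West,N1,Error3
"""

df = pd.read_csv(io.StringIO(csv_text))
df.columns = [c.replace(" ", "_") for c in df.columns]

# Occurrences per office; value_counts() already sorts in descending order.
office_counts = df["Issuing_Office"].value_counts()

# Error types per office in one table: rows are offices, columns are error types.
errors_by_office = pd.crosstab(df["Issuing_Office"], df["Error_Type"])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
office_counts.plot(kind="barh", ax=ax1, title="Occurrences per office")
ax1.invert_yaxis()  # largest count at the top
errors_by_office.plot(kind="barh", stacked=True, ax=ax2, title="Error types per office")
ax2.invert_yaxis()
fig.tight_layout()
fig.savefig("office_counts.png")  # hypothetical output file name
```

The crosstab variant shows every office at once instead of filtering on one office such as N1, which may or may not be what you want.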

Related

Graphing three database in one graph Python

How can I plot the graph by:
getting the data from those 3 sources;
using only the first letters and last digits of the first column on the X-axis, as in the Excel graph;
showing first-column labels only at intervals of 20 (aa010, aa030, aa050, etc.)?
I have three different data sets from a source. Each of them has 2 columns. Some names in the first columns appear in more than one of the 3 sources, but each source has different data for them in the second column.
I need to use Python to plot those 3 data sets on one graph.
The X-axis should be the combination of the first columns of the three data sets. The names have the format aa001 (sometimes up to aa400) and ab001 (sometimes up to ab400).
So the X-axis should start with aa001 and end with ab400. Since labelling every name would overfill the X-axis and make it unreadable at a normal size, I want to show only aa020, aa040, ... (using the number in the string, label only every 20th name: aa0(+20) or ab0(+20)).
The Y-axis should just be numbers from 0 to 10000 (the upper limit may need to change if any of the data has a maximum above 10000).
I will add the sample graph I created using Excel.
My sample data is below (note: the data is not sorted by any column, and I would prefer to sort it as stated above: aa001 ... ab400):
Data1
Name Number
aa001 123
aa032 4211
ab400 1241
ab331 33
Data2
Name Number
aa002 1213
aa032 41
ab378 4231
ab331 63
aa163 999
Data3
Name Number
aa209 9876
ab132 5432
ab378 4124
aa031 754
aa378 44
ab344 1346
aa222 73
aa163 414
ab331 61
I searched up Matplotlib, found a sample example where it plots as I want (with dots for each x-y point) but does not apply to my question.
This is the similar code I found:
import matplotlib.pyplot as plt
x = range(100)
y = range(100, 200)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(x[:4], y[:4], s=10, c='b', marker="s", label='first')
ax1.scatter(x[40:], y[40:], s=10, c='r', marker="o", label='second')
plt.legend(loc='upper left')
plt.show()
Sample graphs (in these, the X-axis uses bc instead of aa, and mc instead of ab):
I expect a graph like the following, but with an X-axis label only every 20 names. I want the data drawn with symbols, as in the second graph, on an X-axis like that of the first graph.
First graph: I want an X-axis like this one, but labelled only at intervals of 20 rather than at every data point.
Second graph: I want symbols instead of lines, as in this one.
Please, let me know if I need to provide any other information or clarify/correct myself. Any help is appreciated!
The answer is as follows, although the code still has some errors. The final answer will be posted once a complete answer has been received at the following link:
Using sorted file to plot X-axis with corresponding Y-values from the original file
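from matplotlib import pyplot as plt
import numpy as np
import csv
# hostnum.csv: the sorted file (assumed: names in the first column, no header row)
csv_file = []
with open('hostnum.csv', 'r') as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        csv_file.append(line)
# unsorted.csv: the original rows (assumed: Name, Number and no header row)
us_csv_file = []
with open('unsorted.csv', 'r') as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        us_csv_file.append(line)
# order the original rows by the position of their name in the sorted file
csv_list = [row[0] for row in csv_file]
us_csv_file.sort(key=lambda x: csv_list.index(x[0]))
plt.plot([int(item[1]) for item in us_csv_file], 'o-')
plt.xticks(np.arange(len(us_csv_file)), [item[0] for item in us_csv_file])
plt.show()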
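A different way to get the axis described in the question, sketched here without the CSV files: build the full ordered list of names, place each point at its position in that list, and label only every 20th name. The dictionaries hold the sample rows from the question; the upper bound of 400 per prefix, the marker styles, the Agg backend and the output file name are assumptions of mine.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove to get an interactive window
import matplotlib.pyplot as plt

# Sample rows from the question (Name -> Number).
data1 = {"aa001": 123, "aa032": 4211, "ab400": 1241, "ab331": 33}
data2 = {"aa002": 1213, "aa032": 41, "ab378": 4231, "ab331": 63, "aa163": 999}
data3 = {"aa209": 9876, "ab132": 5432, "ab378": 4124, "aa031": 754, "aa378": 44,
         "ab344": 1346, "aa222": 73, "aa163": 414, "ab331": 61}
datasets = (data1, data2, data3)

# Full ordered axis aa001..aa400, ab001..ab400 (assumed upper bound of 400 per prefix).
names = [f"{prefix}{n:03d}" for prefix in ("aa", "ab") for n in range(1, 401)]
position = {name: i for i, name in enumerate(names)}

# Label only the names whose numeric part is a multiple of 20.
tick_names = [name for name in names if int(name[2:]) % 20 == 0]
tick_positions = [position[name] for name in tick_names]

fig, ax = plt.subplots(figsize=(14, 4))
styles = [("first", "b", "s"), ("second", "r", "o"), ("third", "g", "^")]
for data, (label, color, marker) in zip(datasets, styles):
    ordered = sorted(data)  # zero-padded names sort correctly as plain strings
    ax.scatter([position[n] for n in ordered], [data[n] for n in ordered],
               s=10, c=color, marker=marker, label=label)

# Keep 0-10000 unless some value is larger, as the question asks.
max_value = max(max(d.values()) for d in datasets)
ax.set_xticks(tick_positions)
ax.set_xticklabels(tick_names, rotation=90)
ax.set_ylim(0, max(10000, max_value))
ax.legend(loc="upper left")
fig.tight_layout()
fig.savefig("three_sources.png")  # hypothetical output file name
```

Because every name has a fixed position, the three sources do not need to be merged or sorted against each other first.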

How to force plotly plots to correct starting point on x axis?

I'm plotting the sales numbers (amount) per week YYYYWW per product product_name.
All the data appears on the graph, however some of the products are showing incorrectly. If product A only started having sales figures from year 2019 (ie no sales figures for the whole of 2018); then I want the line for that product to be zero in 2018 and begin showing values from 2019.
What's happening instead is Product A is showing the line graph from the origin of the graph. So week 1 of sales is at YYYYWW 201801 instead.
Is there a more efficient way to solve this than to place zero values for the product with a list comprehension?
import plotly.graph_objs as go
import plotly.offline as pyo
data = [go.Scatter(x=sorted(df.YYYYWW.unique().astype(str)),
                   y=list(df.loc[df.product_name == 'Product A',
                                 ['amount', 'YYYYWW']].groupby('YYYYWW').sum().amount),
                   mode='lines+markers',
                   )
        ]
pyo.plot(data)
The values in x are: 201801, 201802, ... 201920
The values in y are:
YYYYWW amount
2019/15 454.32
2019/16 1131.15
2019/17 1152.96
2019/18 2822.77
2019/19 3580.86
2019/20 2265.06
Solved it!
My x values should be taken from the same subset of the dataframe as my y values:
x = df.loc[df.product_name == i].YYYYWW.unique().astype(str)
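If you do want the line to sit at zero for the weeks before the product's first sale, as the question originally asked, one option is to reindex the weekly sums against the full list of weeks. This is a minimal pandas-only sketch; the sample frame and its values are made up for illustration, and the resulting x and y lists would then be passed to go.Scatter as before.

```python
import pandas as pd

# Hypothetical data: Product B sells in every week, Product A only from 2019.
df = pd.DataFrame({
    "YYYYWW": [201851, 201852, 201901, 201902, 201901, 201902],
    "product_name": ["Product B"] * 4 + ["Product A"] * 2,
    "amount": [10.0, 20.0, 30.0, 40.0, 5.0, 7.5],
})

# Every week that appears anywhere in the data, in order.
all_weeks = sorted(df["YYYYWW"].unique())

# Weekly totals for one product, with missing weeks filled in as 0.
weekly_a = (
    df.loc[df["product_name"] == "Product A"]
      .groupby("YYYYWW")["amount"].sum()
      .reindex(all_weeks, fill_value=0)
)

x = [str(week) for week in weekly_a.index]
y = weekly_a.tolist()
```

Here x and y always have the same length, so the mismatch that pushed the line to the origin cannot occur.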

Difficulty grouping barchart using Python, Pandas and Matplotlib

I am having difficulty getting plot.bar to group the bars together the way I have them grouped in the dataframe. The dataframe returns the grouped data correctly; however, the bar graph shows a separate bar for every line in the dataframe. Ideally, everything in my code below should group 3-6 bars together for each department (Dept X should have bars grouped together for each type, with the count of true/false on the Y axis).
Dataframe:
dname   Type  purchased
Dept X  0     False         141
              True          270
        1     False        2020
              True         2604
        2     False        2023
              True         1047
Code:
import psycopg2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
##connection and query data removed
df = pd.merge(df_departments[["id", "dname"]], df_widgets[["department", "widgetid", "purchased","Type"]], how='inner', left_on='id', right_on='department')
df.set_index(['dname'], inplace=True)
dx=df.groupby(['dname', 'Type','purchased'])['widgetid'].size()
dx.plot.bar(x='dname', y='widgetid', rot=90)
I can't be sure without a more reproducible example, but try unstacking the innermost level of the MultiIndex of dx before plotting:
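dx.unstack().plot.bar(rot=90)
I expect this to work because when plotting a DataFrame, each column becomes a legend entry and each row becomes a category on the horizontal axis. The x and y arguments are dropped because, after unstacking, dname is an index level and widgetid is not a column of the resulting frame.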
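To make the effect of unstack concrete, here is a small sketch that rebuilds the Series from the numbers shown in the question (the connection and merge steps are skipped, so the construction of dx is an assumption about what the groupby returns). The Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove to get an interactive window
import pandas as pd

# Rebuild the grouped result from the question by hand.
index = pd.MultiIndex.from_product(
    [["Dept X"], [0, 1, 2], [False, True]],
    names=["dname", "Type", "purchased"],
)
dx = pd.Series([141, 270, 2020, 2604, 2023, 1047], index=index, name="widgetid")

# Innermost level ("purchased") becomes the columns: one row per (dname, Type).
wide = dx.unstack()
ax = wide.plot.bar(rot=90)  # two bars (False/True) side by side per row

# Alternative: one row per department, one bar per (Type, purchased) pair.
by_dept = dx.unstack(["Type", "purchased"])
```

With more than one department, by_dept gives one cluster of up to six bars per department, which looks closer to what the question describes.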

P-value normal test for multiple rows

I got the following simple code to calculate normality over an array:
import numpy as np
import pandas as pd
import scipy.stats as stats
df = pd.read_excel(r"directory\file.xlsx")  # raw string, so "\f" is not read as a form feed
x = df.iloc[:, 1:].values.flatten()
stats.normaltest(x, axis=None)
This gives me a p-value and a statistic, as expected.
The only thing I want now is to:
Add 2 columns to the file with this p-value and statistic and, if I have multiple rows, do it for all of them (calculate the p-value and statistic for each row and add 2 columns containing these values).
Can someone help?
If you want to calculate normaltest row-wise, do not flatten your data into x; pass the 2D data and use axis=1 instead, such as:
df = pd.DataFrame(np.random.random(105).reshape(5,21)) # to generate data
# calculate normaltest row-wise without the first column like you
df['stat'] ,df['p'] = stats.normaltest(df.iloc[:,1:],axis=1)
Then df contains two columns, 'stat' and 'p', with the values you are looking for, IIUC.
Note: in my experience normaltest needs at least 8 values, so df.iloc[:,1:] needs at least 8 columns or the test cannot be computed. It is better still to have 20 or more values in each row.
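The same idea wrapped in a helper, as a rough sketch: it checks the 8-value minimum up front and returns a copy with the two new columns. The function name add_normaltest_columns and its first_value_col parameter are hypothetical, and the random frame stands in for the Excel data.

```python
import numpy as np
import pandas as pd
from scipy import stats


def add_normaltest_columns(df, first_value_col=1):
    """Return a copy of df with row-wise normaltest 'stat' and 'p' columns.

    Hypothetical helper: columns from first_value_col onwards are treated
    as the values of each row.
    """
    values = df.iloc[:, first_value_col:]
    if values.shape[1] < 8:
        raise ValueError("normaltest needs at least 8 values per row")
    result = stats.normaltest(values.to_numpy(dtype=float), axis=1)
    out = df.copy()
    out["stat"] = result.statistic
    out["p"] = result.pvalue
    return out


# Made-up data: 5 rows, 21 columns, the first column is skipped as in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 21)))
with_tests = add_normaltest_columns(df)
```

To get the columns back into a file, something like with_tests.to_excel("out.xlsx", index=False) should do, where the file name is a placeholder and an Excel writer engine such as openpyxl has to be installed.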

With Python 3, need to draw a count plot showing the number of each type of crime discovered each year

I need to draw a count plot showing the number of each type of crime discovered each year, using some columns from a CSV file.
Only 2 columns are involved in the processing (primary type and date).
Any help implementing this in Python would be appreciated.
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('FileName.csv')
df1 = df[['ColumnName1', 'ColumnName2']]
print(df1)
plt.xlabel('ColumnName1')
plt.ylabel('ColumnName2')
a = plt.bar(df1['ColumnName1'], df1['ColumnName2'])
plt.show()
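Since the question asks for counts per crime type per year, a sketch along the following lines may be closer to what is needed: extract the year from the date, count rows per (year, type), and plot the resulting table. The column names "Primary Type" and "Date" are guessed from the question, and the sample rows, the date format and the Agg backend are assumptions.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove to get an interactive window
import pandas as pd

# Made-up sample rows; the real file and its date format may differ.
csv_text = """Primary Type,Date
THEFT,01/05/2015 10:00:00 AM
THEFT,03/12/2015 08:30:00 PM
BATTERY,07/22/2015 01:15:00 PM
THEFT,02/02/2016 09:45:00 AM
BATTERY,05/30/2016 11:00:00 PM
BATTERY,11/11/2016 06:20:00 AM
"""

df = pd.read_csv(io.StringIO(csv_text))

# Pull the year out of the date column (assumed format: MM/DD/YYYY hh:mm:ss AM/PM).
df["Year"] = pd.to_datetime(df["Date"], format="%m/%d/%Y %I:%M:%S %p").dt.year

# Rows are years, columns are crime types, cells are the number of records.
counts = df.groupby(["Year", "Primary Type"]).size().unstack(fill_value=0)

ax = counts.plot(kind="bar", rot=0)  # one group of bars per year
ax.set_xlabel("Year")
ax.set_ylabel("Number of crimes")
ax.figure.savefig("crimes_per_year.png")  # hypothetical output file name
```

Each year gets one cluster of bars, with one bar per crime type and the legend naming the types.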
