Python. creating Pie chart using existing Object? - python-3.x

I'm working on a dataset called 'Crime Against Women in India'.
I got the dataset from the website and tidied up the data using Excel.
For data manipulation and visualization I'm using Python (3.0) in Jupyter Notebook (version 5.0.0). Here's the code I have so far.
# importing Libraries
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
# Reading CSV File and naming the object called crime
crime=pd.read_csv("C:\\Users\\aneeq\\Documents\\python assignment\\crime.csv",index_col = None, skipinitialspace = True)
print(crime)
Now I can see my data. What I want to do next is find out which type of crime against women in India had the highest count in 2013. That's simple, and I did it using the following code:
Type = crime.loc[(crime.AreaName.isin(['All-India'])) & (crime.Year.isin([2013])) , ['Year', 'AreaName', 'Rape', 'Kidnapping', 'DowryDeaths', 'Assault', 'Insult', 'Cruelty']]
print(Type)
The result looks like this:
Year AreaName Rape Kidnapping DowryDeaths Assault Insult Cruelty
2013 All-India 33707 51881 8083 70739 12589 118866
The next part is where I'm struggling at the moment. I want to make a pie chart of the crime types to show which has the highest values. You can see that Cruelty ('Cruelty by Husband or his relatives') has a higher count than the others.
I want to display 'Rape', 'Kidnapping', 'DowryDeaths', 'Assault', 'Insult' and 'Cruelty' on the pie chart (using matplotlib), but not 'Year' and 'AreaName'.
This is my code so far:
exp_val = Type.Rape, Type.Kidnapping, Type.DowryDeaths, Type.Assault, Type.Insult, Type.Cruelty
plt.pie(exp_val)
I'm not sure if my code is right. In any case, I got an error saying 'KeyError: 0'.
Can anyone help me with this? What is the right code for displaying a pie chart using an existing object?
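One possible way to approach this, offered as a sketch rather than a confirmed answer: the 'KeyError: 0' most likely arises because each of Type.Rape, Type.Kidnapping, etc. is a one-row Series whose only index label is the original row number, so a positional lookup of element 0 fails. Flattening the single row into one Series of counts sidesteps that. The stand-in frame below reuses the values printed in the question; the index label 12 is an assumption made only to mimic a filtered row.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the filtered frame `Type` (values copied from the question's output).
# The index label 12 is arbitrary; it only mimics a row taken from the middle of the file.
Type = pd.DataFrame(
    {
        "Year": [2013],
        "AreaName": ["All-India"],
        "Rape": [33707],
        "Kidnapping": [51881],
        "DowryDeaths": [8083],
        "Assault": [70739],
        "Insult": [12589],
        "Cruelty": [118866],
    },
    index=[12],
)

crime_cols = ["Rape", "Kidnapping", "DowryDeaths", "Assault", "Insult", "Cruelty"]

# Take the single matching row as a flat Series: crime type -> count.
values = Type[crime_cols].iloc[0]

fig, ax = plt.subplots()
# Pass plain numbers plus labels, so no pandas index lookup is involved.
wedges, texts, autotexts = ax.pie(values.values, labels=crime_cols, autopct="%1.1f%%")
ax.set_title("Crimes against women, All-India, 2013")
ax.axis("equal")  # keep the pie circular
plt.show()
```

Selecting only crime_cols also keeps 'Year' and 'AreaName' out of the chart, and values.idxmax() would report which category is largest.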

Related

Plotting graphs with Altair from a Pandas Dataframe

I am trying to read table values from a spreadsheet and plot different charts using Altair.
The spreadsheet can be found here
import pandas as pd
xls_file = pd.ExcelFile('PET_PRI_SPT_S1_D.xls')
xls_file
crude_df = xls_file.parse('Data 1')
crude_df
I am setting the second row values as column headers of the data frame.
crude_df.columns = crude_df.iloc[1]
crude_df.columns
Index(['Date', 'Cushing, OK WTI Spot Price FOB (Dollars per Barrel)',
'Europe Brent Spot Price FOB (Dollars per Barrel)'],
dtype='object', name=1)
The following is a modified version of Altair code taken from the documentation examples:
crude_df_header = crude_df.head(100)
import altair as alt
alt.Chart(crude_df_header).mark_circle().encode(
# Mapping the WTI column to y-axis
y='Cushing, OK WTI Spot Price FOB (Dollars per Barrel)'
)
This does not work. The error shown is:
TypeError: Object of type datetime is not JSON serializable
How can I make 2-D plots with this data?
Also, how can I make plots when the number of values exceeds 5000 in Altair? Even this results in errors.
Your error is due to the way you parsed the file. You set the column names but forgot to remove the first two rows, including the one that now supplies the column names. The presence of these string values results in the error.
The proper way of achieving what you are looking for is as follows:
import pandas as pd
import altair as alt
crude_df = pd.read_excel(open('PET_PRI_SPT_S1_D.xls', 'rb'),
sheet_name='Data 1',index_col=None, header=2)
alt.Chart(crude_df.head(100)).mark_circle().encode(
x ='Date',
y='Cushing, OK WTI Spot Price FOB (Dollars per Barrel)'
)
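If the sheet has already been parsed the way the question does it, a rough alternative is to repair that frame instead of re-reading the file: drop the two leading rows and convert the columns to proper dtypes. The sketch below uses a small made-up frame standing in for the 'Data 1' sheet; the cell values and the shortened column names 'WTI' and 'Brent' are assumptions, not taken from the real file.

```python
import datetime as dt

import pandas as pd

# Made-up stand-in for the raw sheet: a description row, a header row, then data rows.
raw = pd.DataFrame(
    [
        ["Back to Contents", "Data 1: Crude Oil", None],
        ["Date", "WTI", "Brent"],
        [dt.datetime(1986, 1, 2), 25.56, None],
        [dt.datetime(1986, 1, 3), 26.00, None],
    ]
)

# Same step as in the question: promote the second row to column headers.
raw.columns = raw.iloc[1]

# The missing step: drop the two leading rows so only data rows remain.
clean = raw.iloc[2:].reset_index(drop=True)
clean.columns = list(clean.columns)  # discard the leftover columns name

# Convert the object columns to real datetime and numeric dtypes.
clean["Date"] = pd.to_datetime(clean["Date"])
for col in ["WTI", "Brent"]:
    clean[col] = pd.to_numeric(clean[col])

print(clean.dtypes)
```

With uniform dtypes the frame no longer mixes strings and datetime objects in one column, which is the condition the answer identifies as the cause of the serialization error.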
For the max rows issue, you can use the following
alt.data_transformers.disable_max_rows()
But be mindful of the official warning
If you choose this route, please be careful: if you are making multiple plots with the dataset in a particular notebook, the notebook will grow very large and performance may suffer.

Reading data from NG.L CSV File - Japanese Candlestick Chart

I've recently had to physically download a CSV file (NG.L stock) from the Yahoo Finance website as I can no longer pull data from Yahoo directly which I could do no problem with my original financial Python scripts.
My program almost works and displays my NG.L stock chart, but the dates at the bottom of the chart are completely wrong. They should display only the dates from 02/06/2021 to 09/07/2021 from my NG.L CSV file.
Instead my chart displays dates from 23/01/2021 to 19/11/2021, which is very weird. Is there a quick code fix? I only want the dates extracted from my CSV file to be displayed.
NG.L Python code:
import matplotlib.pyplot as plt
from mplfinance.original_flavor import candlestick_ohlc
import pandas as pd
import matplotlib.dates as mpl_dates
import datetime as dt
plt.style.use('ggplot')
# Extracting Data for plotting
data = pd.read_csv('NG.L.csv')
ohlc = data.loc[:, ['Date', 'Open', 'High', 'Low', 'Close', ]]
ohlc['Date'] = pd.to_datetime(ohlc['Date'])
ohlc['Date'] = ohlc['Date'].apply(mpl_dates.date2num)
ohlc = ohlc.astype(float)
# Creating Subplots
fig, ax = plt.subplots()
candlestick_ohlc(ax, ohlc.values, width=0.8, colorup='green', colordown='red', alpha=0.8)
# Setting labels & titles
ax.set_xlabel('TIMELINE of NG.L')
ax.set_ylabel('PRICE IN GBP POUND STERLING')
fig.suptitle('NATIONAL GRID PLC - 2 JUNE 2021 - 9 JULY 2021')
# Formatting Date
date_format = mpl_dates.DateFormatter('%d-%m-%Y')
ax.xaxis.set_major_formatter(date_format)
fig.autofmt_xdate()
fig.tight_layout()
plt.show()
NG.L Stock Chart:
NG.L CSV file
I downloaded the CSV file, copied the above code into a .py file on my machine, ran it, and got the following result. It looks fine to me. What version of mplfinance do you have installed?
Note also: if you look directly at your CSV file, what format are the dates in?
dino@DINO:~/code/mplfinance/examples/scratch_pad$ head NG.L.csv
Date,Open,High,Low,Close,Adj Close,Volume
2021-06-01,938.799988,957.755981,938.799988,950.099976,918.291504,5162873
2021-06-02,950.500000,965.599976,948.500000,960.599976,928.439941,9110791
2021-06-03,937.099976,937.099976,909.099976,921.299988,921.299988,9609182
2021-06-04,921.700012,927.799988,912.700012,914.299988,914.299988,7607690
2021-06-07,918.900024,919.229980,913.000000,916.000000,916.000000,5240943
2021-06-08,919.099976,926.200012,914.239990,914.700012,914.700012,12657157
2021-06-09,913.599976,916.299988,909.799988,913.000000,913.000000,6334877
2021-06-10,918.099976,925.000000,911.900024,914.000000,914.000000,7530470
2021-06-11,915.799988,921.809998,914.400024,918.599976,918.599976,8630006
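The date format matters because of how ambiguous strings get parsed. One possible explanation, which this thread does not confirm, is that the asker's local copy stored dates day-first (for example after being opened and re-saved in a spreadsheet). In that case pd.to_datetime would read them month-first by default and spread the points across the year. A minimal sketch with two hypothetical date strings:

```python
import pandas as pd

# Hypothetical day-first strings, as a spreadsheet re-save might produce them.
raw_dates = pd.Series(["02/06/2021", "09/07/2021"])

# Default parsing reads these month-first: 6 February and 7 September.
month_first = pd.to_datetime(raw_dates)

# Stating the format removes the ambiguity: 2 June and 9 July.
day_first = pd.to_datetime(raw_dates, format="%d/%m/%Y")

print(month_first.dt.strftime("%d-%m-%Y").tolist())
print(day_first.dt.strftime("%d-%m-%Y").tolist())
```

If the file uses ISO dates like the head output above, none of this applies and the default parsing is already correct.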
You may want to consider using the new mplfinance API. You can accomplish the same thing with much less code:
import mplfinance as mpf
import pandas as pd
# Extracting Data for plotting
data = pd.read_csv('NG.L.csv',index_col=0,parse_dates=True)
mpf.plot(data,type='candle',style='yahoo',
ylabel='PRICE IN GBP POUND STERLING',
title='NATIONAL GRID PLC - 2 JUNE 2021 - 9 JULY 2021',
datetime_format='%d-%m-%Y')
Full disclosure: I am the maintainer of the mplfinance library. For now the old API remains available via from mplfinance.original_flavor import ..., for those who are used to it, but I still try to encourage people to use the new API, which was designed to be simpler.

Bokeh plot line not updating after checking CheckboxGroup in server mode (python callback)

I have just started learning the Bokeh library and I would like to add interactivity to my dashboard. To do so, I want to use a CheckboxGroup widget to select which columns of a pandas DataFrame to plot.
I have followed tutorials, but I must have misunderstood the use of ColumnDataSource, as I cannot make a simple example work...
I am aware of previous questions on the matter, and one that seems relevant on Stack Overflow is the following:
Bokeh not updating plot line update from CheckboxGroup
Sadly I did not succeed in reproducing the right behavior.
I have tried to reproduce an example following the same updating structure presented in 'Bokeh Server plot not updating as wanted, also it keeps shifting and axis information vanishes' by @bigreddot, without success.
import numpy as np
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.palettes import Spectral
from bokeh.layouts import row
from bokeh.models.widgets import CheckboxGroup
from bokeh.io import curdoc
# UPDATE FUNCTION ------------------------------------------------
# make update function
def update(attr, old, new):
    feature_selected_test = [feature_checkbox.labels[i] for i in feature_checkbox.active]
    # add index to plot
    feature_selected_test.insert(0, 'index')
    # create new DataFrame
    new_df = dummy_df.filter(feature_selected_test)
    plot_src.data = ColumnDataSource.from_df(data=new_df)
# CREATE DATA SOURCE ------------------------------------------------
# create dummy data for debugging purpose
index = list(range(0, 890))
index.extend(list(range(2376, 3618)))
feature_1 = np.random.rand(len(index))
feature_2 = np.random.rand(len(index))
feature_3 = np.random.rand(len(index))
feature_4 = np.random.rand(len(index))
dummy_df = pd.DataFrame(dict(index=index, feature_1=feature_1, feature_2=feature_2, feature_3=feature_3,feature_4=feature_4))
# CREATE CONTROL ------------------------------------------------------
# list available data to plot
available_feature = list(dummy_df.columns[1:])
# initialize control
feature_checkbox = CheckboxGroup(labels=available_feature, active=[0, 1], name='checkbox')
feature_checkbox.on_change('active', update)
# INITIALIZE DASHBOARD ---------------------------------------------------
# initialize ColumnDataSource object
plot_src = ColumnDataSource(dummy_df)
# create figure
line_fig = figure()
feature_selected = [feature_checkbox.labels[i] for i in feature_checkbox.active]
# feature_selected = ['feature_1', 'feature_2', 'feature_3', 'feature_4']
for index_int, col_name_str in enumerate(feature_selected):
    line_fig.line(x='index', y=col_name_str, line_width=2, color=Spectral[11][index_int % 11], source=plot_src)
curdoc().add_root(row(feature_checkbox, line_fig))
The program should work with a copy/paste... well, without interactivity...
Would someone please help me? Thanks a lot in advance.
You are only adding glyphs for the initial subset of selected features:
for index_int, col_name_str in enumerate(feature_selected):
    line_fig.line(x='index', y=col_name_str, line_width=2, color=Spectral[11][index_int % 11], source=plot_src)
So that is all that is ever going to show.
Adding new columns to the CDS does not automatically make anything in particular happen; it's just extra data that is available for glyphs or hover tools to potentially use. To actually show it, there have to be glyphs configured to display those columns. You could add and remove glyphs dynamically, but it would be far simpler to add everything once up front and use the checkbox to toggle only the visibility. There is an example of just this in the repo:
https://github.com/bokeh/bokeh/blob/master/examples/app/line_on_off.py
That example passes the data as literals to the glyph function, but you could put all the data in a CDS up front, too.
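Applied to the question's dummy data, the toggle-visibility approach might look roughly like the sketch below. It is not the linked repo example: the names lines and update are illustrative, and the source is built from a plain dict of lists purely to keep the example simple.

```python
import numpy as np
import pandas as pd
from bokeh.io import curdoc
from bokeh.layouts import row
from bokeh.models import CheckboxGroup, ColumnDataSource
from bokeh.palettes import Spectral
from bokeh.plotting import figure

# Dummy data shaped like the question's example.
index = list(range(0, 890)) + list(range(2376, 3618))
dummy_df = pd.DataFrame(
    {"index": index, **{f"feature_{i}": np.random.rand(len(index)) for i in range(1, 5)}}
)
features = list(dummy_df.columns[1:])

# Put every column in the data source once, up front.
plot_src = ColumnDataSource(data=dummy_df.to_dict(orient="list"))

# Add one line glyph per feature, whether or not it starts out selected.
line_fig = figure()
lines = [
    line_fig.line(x="index", y=name, line_width=2,
                  color=Spectral[11][i % 11], source=plot_src)
    for i, name in enumerate(features)
]

def update(attr, old, new):
    # `new` is the list of active checkbox positions; show only those lines.
    for i, renderer in enumerate(lines):
        renderer.visible = i in new

feature_checkbox = CheckboxGroup(labels=features, active=[0, 1])
feature_checkbox.on_change("active", update)

# Apply the initial selection so unchecked features start hidden.
update("active", [], feature_checkbox.active)

curdoc().add_root(row(feature_checkbox, line_fig))
```

Run under bokeh serve, the callback only flips the visible flag on existing renderers, so the data source itself never needs to be rebuilt.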

how to use pandas to organize sales data into 12 months and find the 10 most profitable products for those 12 months?

I need to write a program that organizes the data in the provided spreadsheet to find the top 10 most profitable products for each month. The program needs to take input from the user to specify the year for which to compile the data.
I've gotten as far as printing all of the products sold in each month, ordered by profitability, but I don't know how to make it print only the top 10 for each month.
I'm also lost on how to take input from the user so that the program compiles data for only a certain year.
Please help.
The link to download the files for my project: https://drive.google.com/drive/folders/1VkzTWydV7Qae7hOn6WUjDQutQGmhRaDH?usp=sharing
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
xl = pd.ExcelFile("SalesDataFull.xlsx")
OrdersOnlyData = xl.parse("Orders")
df_year = OrdersOnlyData["Order Date"].dt.year
OrdersOnlyData["Year"] = df_year
df_month = OrdersOnlyData["Order Date"].dt.month
OrdersOnlyData["Month"] = df_month
dataframe = OrdersOnlyData[["Year","Month","Product Name","Profit"]]
month_profit = dataframe.groupby(["Year","Month","Product Name"]).Profit.sum().sort_values(ascending=False)
month_profit = month_profit.reset_index()
month_profit = month_profit.sort_values(["Year","Month","Profit"],ascending=[True,True,False])
print(month_profit)
As @Franco pointed out, it is difficult to recommend a proper solution since you did not provide a data sample with your question. In any case, the function you are looking for is most likely nth().
This is probably what it should look like:
month_profit = month_profit.sort_values('Profit', ascending=False).groupby(['Year', 'Month']).nth(list(range(10))).sort_values(by=['Year', 'Month', 'Profit'], ascending=[True, True, False])
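For the year-selection part of the question, one possible shape is a small function that filters on the year before grouping. This is only a sketch: the function name top_products_by_month and the tiny sample frame are made up for illustration, and it uses groupby(...).head(n) as a swapped-in alternative to nth().

```python
import pandas as pd

def top_products_by_month(orders, year, n=10):
    """Return the n most profitable products for each month of one year."""
    # Keep only the requested year.
    orders = orders[orders["Order Date"].dt.year == year]
    monthly = (
        orders.assign(Month=orders["Order Date"].dt.month)
        .groupby(["Month", "Product Name"], as_index=False)["Profit"]
        .sum()
        .sort_values(["Month", "Profit"], ascending=[True, False])
    )
    # After sorting, the first n rows of each month are its top n products.
    return monthly.groupby("Month").head(n).reset_index(drop=True)

# Tiny made-up sample with the column names used in the question.
sample = pd.DataFrame(
    {
        "Order Date": pd.to_datetime(
            ["2017-01-05", "2017-01-09", "2017-01-20", "2017-02-03", "2018-01-07"]
        ),
        "Product Name": ["Desk", "Lamp", "Desk", "Chair", "Desk"],
        "Profit": [50.0, 20.0, 30.0, 10.0, 99.0],
    }
)

result = top_products_by_month(sample, year=2017, n=1)
print(result)
```

In the real program the year could come from something like year = int(input("Year: ")), and orders would be the parsed "Orders" sheet rather than the sample frame.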

Problems with graphing excel data off an internet source with dates

This is my first post on Stack Overflow and I'm pretty new to programming, especially Python. I'm in engineering and am learning Python to complement that going forward, mostly for math and graphing applications.
Basically, my question is how to download CSV data from a source (in my case stock data from Google) and plot only certain columns against the date. In my case I want the date against the close value.
Right now the error message I'm getting is: time data '5-Jul-17' does not match '%d-%m-%Y'
Previously I was also getting: tuple data does not match
The description of the opened CSV data in Excel is:
7 columns (Date, Open, High, Low, Close, AdjClose, Volume), and the date is organized as 2017-05-30
I'm sure there are other errors as well, unfortunately.
I would really be grateful for any help on this.
Thank you in advance!
--edit--
Upon fiddling some more, I don't think names and dtypes are necessary. When I check the matrix dimensions without those identifiers I get (250L, 6L), which seems right. Now my main problem is converting the dates to something usable. My error now is that strptime only accepts strings, so I'm not sure what to use (see updated code below).
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
def graph_data(stock):
    # getting the data off google finance
    data = np.genfromtxt('urlgoeshere'+stock+'forthecsvdata', delimiter=',',
                         skip_header=1)
    # checking format of matrix
    print(data.shape)  # returns (250L, 6L)
    time_format = '%d-%m-%Y'
    # I only want the 1st column (dates) and 5th column (close), all rows
    date = data[:,0][:,]
    close = data[:,4][:,]
    dates = [datetime.strptime(date, time_format)]
    # plotting section
    plt.plot_date(dates,close, '-')
    plt.legend()
    plt.show()
graph_data('stockhere')
Assuming the dates in the csv file are in the format '5-Jul-17', the proper format string to use is %d-%b-%y.
In [6]: datetime.strptime('5-Jul-17','%d-%m-%Y')
ValueError: time data '5-Jul-17' does not match format '%d-%m-%Y'
In [7]: datetime.strptime('5-Jul-17','%d-%b-%y')
Out[7]: datetime.datetime(2017, 7, 5, 0, 0)
See the Python documentation on strptime() behavior.
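Beyond the format string, two other things in the question's code probably need attention: np.genfromtxt with its default float dtype turns non-numeric date text into nan, and strptime parses one string at a time rather than a whole column. A minimal sketch of one way around both, using the standard csv module on a made-up in-memory sample (the prices and the helper name load_dates_and_close are assumptions for illustration):

```python
import csv
import io
from datetime import datetime

import matplotlib.pyplot as plt

# Made-up sample in the date layout the answer assumes; a real script would
# read the downloaded file instead of this in-memory string.
SAMPLE_CSV = """Date,Open,High,Low,Close,Volume
5-Jul-17,10.0,10.5,9.8,10.2,1000
6-Jul-17,10.2,10.9,10.1,10.7,1200
7-Jul-17,10.7,10.8,10.3,10.4,900
"""

def load_dates_and_close(text, time_format="%d-%b-%y"):
    """Parse the Date and Close columns row by row."""
    dates, closes = [], []
    for record in csv.DictReader(io.StringIO(text)):
        # strptime handles a single string, so it is applied per row.
        dates.append(datetime.strptime(record["Date"], time_format))
        closes.append(float(record["Close"]))
    return dates, closes

dates, closes = load_dates_and_close(SAMPLE_CSV)

fig, ax = plt.subplots()
ax.plot(dates, closes, "-", label="Close")
ax.legend()
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
plt.show()
```

If the file actually stores dates as 2017-05-30, as the question's description suggests, the format string would be '%Y-%m-%d' instead.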
