Pandas grouping and resampling for a bar plot - python-3.x

I have a dataframe that records concentrations for several different locations in different years, at a high temporal frequency (<1 hour). I am trying to make a bar/multibar plot showing mean concentrations at different locations in different years.
To calculate mean concentration, I have to apply quality control filters to daily and monthly data.
My approach is to first apply filters and resample per year and then do the grouping by location and year.
Also, out of all the locations (in the column titled locations) I have to choose only a few rows. So, I am slicing the original dataframe and creating a new dataframe with selected rows.
I am not able to achieve this using the following code:
date=df['date']
location = df['location']
df.date = pd.to_datetime(df.date)
year=df.date.dt.year
df=df.set_index(date)
df['Year'] = df['date'].map(lambda x: x.year )
#Location name selection/correction in each city:
#Changing all stations:
df['location'] = df['location'].map(lambda x: "M" if x == "mm" else x)
#New dataframe:
df_new = df[(df['location'].isin(['K', 'L', 'M']))]
#Data filtering:
df_new = df_new[df_new['value'] >= 0]
df_new.drop(df_new[df_new['value'] > 400].index, inplace = True)
df_new.drop(df_new[df_new['value'] <2].index, inplace = True)
diurnal = df_new[df_new['value']].resample('12h')
diurnal_mean = diurnal.mean()[diurnal.count() >= 9]
daily_mean=diurnal_mean.resample('d').mean()
df_month=daily_mean.resample('m').mean()
df_yearly=df_month[df_month['value']].resample('y')
#For plotting:
df_grouped = df_new.groupby(['location', 'Year']).agg({'value':'sum'}).reset_index()
sns.barplot(x='location',y='value',hue='Year',data= df_grouped)
This is one of the many errors that cropped up:
"None of [Float64Index([22.73, 64.81, 8.67, 19.98, 33.12, 37.81, 39.87, 42.29, 37.81,\n 36.51,\n ...\n 11.0, 40.0, 23.0, 80.0, 50.0, 60.0, 40.0, 80.0, 80.0,\n 17.0],\n dtype='float64', length=63846)] are in the [columns]"
ERROR:root:Invalid alias: The name clear can't be aliased because it is another magic command.
This is a sample dataframe, showing what I need to plot; value column should ideally represent resampled values, after performing the quality control operations and resampling.
                          Unnamed: 0 location  value
date
2017-10-21 08:45:00+05:30 8335 M 339.3
2017-08-18 17:45:00+05:30 8344 M 45.1
2017-11-08 13:15:00+05:30 8347 L 594.4
2017-10-21 13:15:00+05:30 8659 N 189.9
2017-08-18 15:45:00+05:30 8662 N 46.5
This is how a part of the actual data should look after selecting the chosen locations. I am a new user so cannot attach a screenshot of the graph I require. This query is an extension of a query I posted earlier (Iteration over years to plot different group values as bar plot in pandas), with the additional requirement of plotting resampled data instead of simple value counts.
Any help will be much appreciated.

Fundamentally, your errors come from unclear indexing: you are passing the continuous float values of one column for row-wise selection against an index that is of datetime type.
df_new[df_new['value']] # INDEXING DATETIME USING FLOAT VALUES
...
df_month[df_month['value']] # COLUMN value DOES NOT EXIST
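A minimal, self-contained reproduction of that first error (illustrative values only): indexing a frame with a float Series makes pandas look those floats up as column labels.
import pandas as pd

demo = pd.DataFrame({'value': [22.73, 64.81]},
                    index=pd.to_datetime(['2017-10-21 08:45', '2017-08-18 17:45']))
try:
    demo[demo['value']]  # floats are treated as COLUMN labels, not row positions
except KeyError as e:
    print(e)  # None of [Float64Index([22.73, 64.81], ...)] are in the [columns]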
Possibly, you meant to select the column value (out of the others) during resampling.
diurnal = df_new['value'].resample('12h')
diurnal_mean = diurnal.mean()[diurnal.count() >= 9]
daily_mean = diurnal_mean.resample('d').mean()
df_month = daily_mean.resample('m').mean()   # no ['value'] needed: the underlying object is already the value Series
df_yearly = df_month.resample('y')
However, nowhere above do you retain location for plotting. Hence, instead of resample, use groupby(pd.Grouper(...)):
# AGGREGATE TO KEEP LOCATION AND 12h
diurnal = (df_new.groupby(["location", pd.Grouper(freq='12h')])["value"]
                 .agg(["count", "mean"])
                 .reset_index().set_index(['date'])
          )
# FILTER
diurnal_sub = diurnal[diurnal["count"] >= 9]
# MULTIPLE DATE TIME LEVEL MEANS
daily_mean = diurnal_sub.groupby(["location", pd.Grouper(freq='d')])["mean"].mean()
df_month = diurnal_sub.groupby(["location", pd.Grouper(freq='m')])["mean"].mean()
df_yearly = diurnal_sub.groupby(["location", pd.Grouper(freq='y')])["mean"].mean()
print(df_yearly)
To demonstrate with random, reproducible data:
Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(242020)
random_df = pd.DataFrame({'date': (np.random.choice(pd.date_range('2017-01-01', '2019-12-31'), 5000) +
                                   pd.to_timedelta(np.random.randint(60*60, 60*60*24, 5000), unit='s')),
                          'location': np.random.choice(list("KLM"), 5000),
                          'value': np.random.uniform(10, 1000, 5000)
                         })
Aggregation
loc_list = list("KLM")
# NEW DATA FRAME WITH DATA FILTERING
df = (random_df.set_index(random_df['date'])
               .assign(Year = lambda x: x['date'].dt.year,
                       location = lambda x: x['location'].where(x["location"] != "mm", "M"))
               .query('(location == @loc_list) and (value >= 2 and value <= 400)')
     )
# 12h AGGREGATION
diurnal = (df.groupby(["location", pd.Grouper(freq='12h')])["value"]
             .agg(["count", "mean"])
             .reset_index().set_index(['date'])
             .query("count >= 2")
          )
# d, m, y AGGREGATION
daily_mean = diurnal.groupby(["location", pd.Grouper(freq='d')])["mean"].mean()
df_month = diurnal.groupby(["location", pd.Grouper(freq='m')])["mean"].mean()
df_yearly = (diurnal.groupby(["location", pd.Grouper(freq='y')])["mean"].mean()
                    .reset_index()
                    .assign(Year = lambda x: x["date"].dt.year)
            )
print(df_yearly)
# location date mean Year
# 0 K 2017-12-31 188.984592 2017
# 1 K 2018-12-31 199.521702 2018
# 2 K 2019-12-31 216.497268 2019
# 3 L 2017-12-31 214.347873 2017
# 4 L 2018-12-31 199.232711 2018
# 5 L 2019-12-31 177.689221 2019
# 6 M 2017-12-31 222.412711 2017
# 7 M 2018-12-31 241.597977 2018
# 8 M 2019-12-31 215.554228 2019
Plotting
sns.set()
fig, axs = plt.subplots(figsize=(12,5))
sns.barplot(x='location', y='mean', hue='Year', data= df_yearly, ax=axs)
plt.title("Location Value Yearly Aggregation", weight="bold", size=16)
plt.show()
plt.clf()
plt.close()
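As a quick sanity check on the chain above, a sketch (reusing df from the Aggregation step; it skips the count >= 2 quality filter, so the numbers will differ slightly) that computes yearly means per location directly:
check = (df.groupby(["location", df.index.year])["value"].mean()
           .rename_axis(["location", "Year"])
           .reset_index(name="mean"))
print(check)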

Related

Python Pandas apply function not being applied to every row when using variables from a DataFrame

I have this weird Pandas problem: when I use the apply function with values pulled from a data frame, it only gets applied to the first row:
import pandas as pd
# main data frame - to be edited
headerData = [['dataA', 'dataB']]
valuesData = [[10, 20], [10, 20]]
dfData = pd.DataFrame(valuesData, columns = headerData)
dfData.to_csv('MainData.csv', index=False)
readMainDataCSV = pd.read_csv('MainData.csv')
print(readMainDataCSV)
#variable data frame - pull values from this to edit main data frame
headerVariables = [['varA', 'varB']]
valuesVariables = [[2, 10]]
dfVariables = pd.DataFrame(valuesVariables, columns = headerVariables)
dfVariables.to_csv('Variables.csv', index=False)
readVariablesCSV = pd.read_csv('Variables.csv')
readVarA = readVariablesCSV['varA']
readVarB = readVariablesCSV['varB']
def formula(x):
    return (x / readVarA) * readVarB
dfFormulaApplied = readMainDataCSV.apply(lambda x: formula(x))
print('\n', dfFormulaApplied)
Output:
dataA dataB
0 50.0 100.0
1 NaN NaN
But when I just use regular variables (not being called from a data frame), it functions just fine:
import pandas as pd
# main data frame - to be edited
headerData = [['dataA', 'dataB']]
valuesData = [[10, 20], [20, 40]]
dfData = pd.DataFrame(valuesData, columns = headerData)
dfData.to_csv('MainData.csv', index=False)
readMainDataCSV = pd.read_csv('MainData.csv')
print(readMainDataCSV)
# variables
readVarA = 2
readVarB = 10
def formula(x):
    return (x / readVarA) * readVarB
dfFormulaApplied = readMainDataCSV.apply(lambda x: formula(x))
print('\n', dfFormulaApplied)
Output:
dataA dataB
0 50.0 100.0
1 100.0 200.0
Help please I'm pulling my hair out.
If you take readVarA and readVarB from the dataframe by selecting the column, each is a pandas Series with its own index, which causes a problem in the calculation (dividing a Series by another Series with a different index doesn't align).
You can take the first value from the series to get the value like this:
def formula(x):
    return (x / readVarA[0]) * readVarB[0]
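An equivalent sketch (same illustrative column names as the question): pulling plain scalars out with .iloc[0] before applying sidesteps the index-alignment issue entirely.
import pandas as pd

main_df = pd.DataFrame({'dataA': [10, 20], 'dataB': [20, 40]})
vars_df = pd.DataFrame({'varA': [2], 'varB': [10]})

# Extract scalar values from the one-row variables frame up front
var_a = vars_df['varA'].iloc[0]
var_b = vars_df['varB'].iloc[0]

print(main_df.apply(lambda x: (x / var_a) * var_b))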

Horizontal grouped Barplot with Bokeh using dataframe

I have a dataframe and I want to group by Type and then Flag, and plot a graph for the count of ID, and another graph grouped by Type and Flag for the sum of the Total column, in Bokeh.
p.hbar(df,
       plot_width=800,
       plot_height=800,
       label='Type',
       values='ID',
       bar_width=0.4,
       group=['Type', 'Flag'],
       legend='top_right')
(Expected graph: image omitted.)
If it's not possible with Bokeh, what other package can I use to get a good-looking graph (vibrant colours with a white background)?
You can do this with the holoviews library, which uses bokeh as a backend.
import pandas as pd
import holoviews as hv
from holoviews import opts
hv.extension("bokeh")
df = pd.DataFrame({
    "type": list("ABABCCAD"),
    "flag": list("YYNNNYNY"),
    "id": list("DEFGHIJK"),
    "total": [40, 100, 20, 60, 77, 300, 60, 50]
})
print(df)
  type flag id  total
0    A    Y  D     40
1    B    Y  E    100
2    A    N  F     20
3    B    N  G     60
4    C    N  H     77
5    C    Y  I    300
6    A    N  J     60
7    D    Y  K     50
# Duplicate the dataframe so each group's count is above one
df = pd.concat([df] * 2)
Now that we have our data, let's work on plotting it:
def mainplot_hook(plot, element):
    plot.state.text(
        y="xoffsets",
        x="total",
        text="total",
        source=plot.handles["source"],
        text_align="left",
        y_offset=9,
        x_offset=5
    )
def sideplot_hook(plot, element):
    plot.state.text(
        y="xoffsets",
        x="count",
        text="count",
        source=plot.handles["source"],
        text_align="left",
        y_offset=9,
        x_offset=5
    )
# Create single bar plot for sum of the total column
total_sum = df.groupby(["type", "flag"])["total"].sum().reset_index()
total_sum_bars = hv.Bars(total_sum, kdims=["type", "flag"], vdims="total")
# Create our multi-dimensional bar plot
all_ids = sorted(df["id"].unique())
counts = df.groupby(["type", "flag"])["id"].value_counts().rename("count").reset_index()
id_counts_hmap = hv.Bars(counts, kdims=["type", "flag", "id"], vdims="count").groupby("type")
main_plot = (total_sum_bars
             .opts(hooks=[mainplot_hook],
                   title="Total Sum",
                   invert_axes=True)
            )
side_plots = (
    id_counts_hmap
    .redim.values(id=all_ids, flag=["Y", "N"])
    .redim.range(count=(0, 3))
    .opts(
        opts.NdLayout(title="Counts of ID"),
        opts.Bars(color="#1F77B4", height=250, width=250, invert_axes=True, hooks=[sideplot_hook]))
    .layout("type")
    .cols(2)
)
final_plot = main_plot + side_plots
# Save combined output as html
hv.save(final_plot, "my_plot.html")
# Save just the main_plot as html
hv.save(main_plot, "main_plot.html")
As you can see, the code to make plots in holoviews can be a little tricky but it's definitely a tool I would recommend you pick up. Especially if you deal with high dimensional data regularly, it makes plotting it a breeze once you get the syntax down.

How to write from loop to dataframe

I'm trying to calculate 33 stock betas and write them to a dataframe.
Unfortunately, I have an error in my code:
cannot concatenate object of type "<class 'float'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
import pandas as pd
import numpy as np
stock1=pd.read_excel(r"C:\Users\Кир\Desktop\Uni\Master\Nasdaq\Financials 11.05\Nasdaq last\clean data\01.xlsx", '1') #read second sheet of excel file
stock2=pd.read_excel(r"C:\Users\Кир\Desktop\Uni\Master\Nasdaq\Financials 11.05\Nasdaq last\clean data\01.xlsx", '2') #read second sheet of excel file
stock2['stockreturn']=np.log(stock2.AdjCloseStock / stock2.AdjCloseStock.shift(1)) #stock ln return
stock2['SP500return']=np.log(stock2.AdjCloseSP500 / stock2.AdjCloseSP500.shift(1)) #SP500 ln return
stock2 = stock2.iloc[1:] #delete first row in dataframe
betas = pd.DataFrame()
for i in range(0, (len(stock2.AdjCloseStock)//52) - 1):
    betas = betas.append(stock2.stockreturn.iloc[i*52:(i+1)*52].cov(stock2.SP500return.iloc[i*52:(i+1)*52])
                         / stock2.SP500return.iloc[i*52:(i+1)*52].cov(stock2.SP500return.iloc[i*52:(i+1)*52]))
My data looks like weekly stock and S&P index return for 33 years. So the output should have 33 betas.
I tried simplifying your code and creating an example. I think the problem is that your calculation returns a float. You want to make it a pd.Series. DataFrame.append takes:
DataFrame or Series/dict-like object, or list of these
np.random.seed(20)
df = pd.DataFrame(np.random.randn(33*53, 2),
                  columns=['a', 'b'])
betas = pd.DataFrame()
for year in range(len(df['a'])//52 - 1):
    # Take some data
    in_slice = pd.IndexSlice[year*52:(year+1)*52]
    numerator = df['a'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    denominator = df['b'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    # Do some calculations and create a pd.Series from the result
    data = pd.Series(numerator / denominator, name=year)
    # Append to the DataFrame
    betas = betas.append(data)
betas.index.name = 'years'
betas.columns = ['beta']
betas.head():
beta
years
0 0.107669
1 -0.009302
2 -0.063200
3 0.025681
4 -0.000813
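On newer pandas (2.0+), where DataFrame.append was removed, a minimal sketch of the same loop that collects the betas in a plain list and builds the frame once at the end:
import numpy as np
import pandas as pd

np.random.seed(20)
df = pd.DataFrame(np.random.randn(33 * 53, 2), columns=['a', 'b'])

beta_values = []
for year in range(len(df['a']) // 52 - 1):
    in_slice = pd.IndexSlice[year * 52:(year + 1) * 52]
    numerator = df['a'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    denominator = df['b'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    beta_values.append(numerator / denominator)

betas = pd.DataFrame({'beta': beta_values})
betas.index.name = 'years'
print(betas.head())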

Identifying groups of two rows that satisfy three conditions in a dataframe

I have the df below and want to identify any two orders that satisfy all the following conditions:
Distance between pickups less than X miles
Distance between dropoffs less than Y miles
Difference between order creation times less than Z minutes
I would use haversine (from haversine import haversine) to calculate the difference in pickups for each row and the difference in dropoffs for each row or order.
The df I currently have looks like the following:
DAY  Order pickup_lat pickup_long dropoff_lat dropoff_long created_time
1/3/19 234e 32.69 -117.1 32.63 -117.08 3/1/19 19:00
1/3/19 235d 40.73 -73.98 40.73 -73.99 3/1/19 23:21
1/3/19 253w 40.76 -73.99 40.76 -73.99 3/1/19 15:26
2/3/19 231y 36.08 -94.2 36.07 -94.21 3/2/19 0:14
3/3/19 305g 36.01 -78.92 36.01 -78.95 3/2/19 0:09
3/3/19 328s 36.76 -119.83 36.74 -119.79 3/2/19 4:33
3/3/19 286n 35.76 -78.78 35.78 -78.74 3/2/19 0:43
I want my output df to be any 2 orders or rows that satisfy the above conditions. What I am not sure of is how to calculate that for each row in the dataframe to return any two rows that satisfy those conditions.
I hope I am explaining my desired output correctly. Thanks for looking!
I don't know if it is an optimal solution, but it is what I came up with. What I have done:
created a dataframe with all possible order combinations,
computed all the needed measures for each combination and added those measure columns to the dataframe,
found the indices of the rows which fulfill the mentioned conditions.
The code:
#create dataframe with all combinations
from itertools import combinations
index_comb = list(combinations(trips.index, 2))  # trips: your dataframe
col_names = trips.columns
orders1 = pd.DataFrame([trips.loc[c[0], :].values for c in index_comb], columns=trips.columns, index=index_comb)
orders2 = pd.DataFrame([trips.loc[c[1], :].values for c in index_comb], columns=trips.columns, index=index_comb)
orders2 = orders2.add_suffix('_1')
combined = pd.concat([orders1, orders2], axis=1)
from haversine import haversine
def distance(row):
    loc_0 = (row[0], row[1])  # (lat, lon)
    loc_1 = (row[2], row[3])
    return haversine(loc_0, loc_1, unit='mi')
#pickup diff (latitude first, matching haversine's expected (lat, lon) order)
pickup_cols = ["pickup_lat", "pickup_long", "pickup_lat_1", "pickup_long_1"]
combined[pickup_cols] = combined[pickup_cols].astype(float)
combined["pickup_dist_mi"] = combined[pickup_cols].apply(distance, axis=1)
#dropoff diff
dropoff_cols = ["dropoff_lat","dropoff_long","dropoff_lat_1","dropoff_long_1"]
combined[dropoff_cols] = combined[dropoff_cols].astype(float)
combined["dropoff_dist_mi"] = combined[dropoff_cols].apply(distance,axis=1)
#creation time diff
combined["time_diff_min"] = abs(pd.to_datetime(combined["created_time"])-pd.to_datetime(combined["created_time_1"])).astype('timedelta64[m]')
#Thresholds
Z = 600
Y = 400
X = 400
#find orders with below conditions
diff_time_Z = combined["time_diff_min"] < Z
pickup_dist_X = combined["pickup_dist_mi"]<X
dropoff_dist_Y = combined["dropoff_dist_mi"]<Y
conditions_idx = diff_time_Z & pickup_dist_X & dropoff_dist_Y
out = combined.loc[conditions_idx, ["Order", "Order_1", "time_diff_min", "dropoff_dist_mi", "pickup_dist_mi"]]
The output for your data:
Order Order_1 time_diff_min dropoff_dist_mi pickup_dist_mi
(0, 5) 234e 328s 573.0 322.988195 231.300179
(1, 2) 235d 253w 475.0 2.072803 0.896893
(4, 6) 305g 286n 34.0 19.766096 10.233550
Hope I understood you correctly and that this helps.
Using your dataframe as above, drop the index. I'm presuming your created_time column is in datetime format.
import pandas as pd
from geopy.distance import geodesic
Cross merge the dataframe to get all possible combinations of 'Order'.
df_all = pd.merge(df.assign(key=0), df.assign(key=0), on='key').drop('key', axis=1)
Remove all the rows where the orders are equal.
df_all = df_all[~(df_all['Order_x'] == df_all['Order_y'])].copy()
Drop duplicate rows where Order_x, Order_y == [a, b] and [b, a]
# drop duplicate rows
# first combine Order_x and Order_y into a sorted list, and combine into a string
df_all['dup_order'] = df_all[['Order_x', 'Order_y']].values.tolist()
df_all['dup_order'] = df_all['dup_order'].apply(lambda x: "".join(sorted(x)))
# drop the duplicates and reset the index
df_all = df_all.drop_duplicates(subset=['dup_order'], keep='first')
df_all = df_all.reset_index(drop=True)
Create a column and calculate the time difference in minutes.
df_all['time'] = (df_all['created_time_x'] - df_all['created_time_y']).abs().astype('timedelta64[m]')
Create a column and calculate the distance between drop offs.
df_all['dropoff'] = df_all.apply(
    (lambda row: geodesic(
        (row['dropoff_lat_x'], row['dropoff_long_x']),
        (row['dropoff_lat_y'], row['dropoff_long_y'])
    ).miles),
    axis=1
)
Create a column and calculate the distance between pickups.
df_all['pickup'] = df_all.apply(
    (lambda row: geodesic(
        (row['pickup_lat_x'], row['pickup_long_x']),
        (row['pickup_lat_y'], row['pickup_long_y'])
    ).miles),
    axis=1
)
Filter the results as desired.
X = 1500
Y = 2000
Z = 100
mask_pickups = df_all['pickup'] < X
mask_dropoff = df_all['dropoff'] < Y
mask_time = df_all['time'] < Z
print(df_all[mask_pickups & mask_dropoff & mask_time][['Order_x', 'Order_y', 'time', 'dropoff', 'pickup']])
Order_x Order_y time dropoff pickup
10 235d 231y 53.0 1059.026620 1059.026620
11 235d 305g 48.0 260.325370 259.275948
13 235d 286n 82.0 249.306279 251.929905
25 231y 305g 5.0 853.308110 854.315567
27 231y 286n 29.0 865.026077 862.126593
34 305g 286n 34.0 11.763787 7.842526
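As an aside, on pandas 1.2+ the dummy-key trick can be replaced with a built-in cross merge; a small sketch assuming the same df as in the question:
import pandas as pd

# how='cross' builds the full cartesian product directly (pandas >= 1.2)
df_all = df.merge(df, how='cross', suffixes=('_x', '_y'))
df_all = df_all[df_all['Order_x'] != df_all['Order_y']].copy()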

How to chart multiple columns continuously by iterating through a dataframe with matplotlib

BACKGROUND INFORMATION:
I have a dataframe of x many stocks with y price sets each (currently x is 5 and y is 2: one is the closing price, the other is a 3 day Simple Moving Average (SMA)).
The current output is [2781 rows x 10 columns], with the data set ranging from start_date = '2006-01-01' till end_date = '2016-12-31'. The output of print(df) is as follows:
CURRENT OUTPUT:
ANZ Price ANZ 3 day SMA CBA Price CBA 3 day SMA MQG Price MQG 3 day SMA NAB Price NAB 3 day SMA WBC Price WBC 3 day SMA
Date
2006-01-02 23.910000 NaN 42.569401 NaN 66.558502 NaN 30.792999 NaN 22.566401 NaN
2006-01-03 24.040001 NaN 42.619099 NaN 66.086403 NaN 30.935699 NaN 22.705400 NaN
2006-01-04 24.180000 24.043334 42.738400 42.642300 66.587997 66.410967 31.078400 30.935699 22.784901 22.685567
2006-01-05 24.219999 24.146667 42.708599 42.688699 66.558502 66.410967 30.964300 30.992800 22.794800 22.761700
... ... ... ... ... ... ... ... ... ... ...
2016-12-27 87.346667 30.670000 30.706666 32.869999 32.729999 87.346667 30.670000 30.706666 32.869999 32.729999
2016-12-28 87.456667 31.000000 30.773333 32.980000 32.829999 87.456667 31.000000 30.773333 32.980000 32.829999
2016-12-29 87.520002 30.670000 30.780000 32.599998 32.816666 87.520002 30.670000 30.780000 32.599998 32.816666
MY WORKING CODE:
#!/usr/bin/python3
from pandas_datareader import data
import pandas as pd
import itertools as it
import os
import numpy as np
import fix_yahoo_finance as yf
import matplotlib.pyplot as plt
yf.pdr_override()
stock_list = sorted(["ANZ.AX", "WBC.AX", "MQG.AX", "CBA.AX", "NAB.AX"])
number_of_decimal_places = 8
moving_average_period = 3
def get_moving_average(df, stock_name):
    df2 = df.rolling(window=moving_average_period).mean()
    df2.rename(columns={stock_name: stock_name.replace("Price", str(moving_average_period) + " day SMA")}, inplace=True)
    df = pd.concat([df, df2], axis=1, join_axes=[df.index])
    return df
# Function to get the closing price of the individual stocks
# from the stock_list list
def get_closing_price(stock_name, specific_close):
    symbol = stock_name
    start_date = '2006-01-01'
    end_date = '2016-12-31'
    df = data.get_data_yahoo(symbol, start_date, end_date)
    sym = symbol + " "
    print(sym * 10)
    df = df.drop(['Open', 'High', 'Low', 'Adj Close', 'Volume'], axis=1)
    df = df.rename(columns={'Close': specific_close})
    # https://stackoverflow.com/questions/16729483/converting-strings-to-floats-in-a-dataframe
    # df[specific_close] = df[specific_close].astype('float64')
    # print(type(df[specific_close]))
    return df
# Creates a big DataFrame with all the stock's Closing
# Price returns the DataFrame
def get_all_close_prices(directory):
    count = 0
    for stock_name in stock_list:
        specific_close = stock_name.replace(".AX", "") + " Price"
        if not count:
            prev_df = get_closing_price(stock_name, specific_close)
            prev_df = get_moving_average(prev_df, specific_close)
        else:
            new_df = get_closing_price(stock_name, specific_close)
            new_df = get_moving_average(new_df, specific_close)
            # https://stackoverflow.com/questions/11637384/pandas-join-merge-concat-two-dataframes
            prev_df = prev_df.join(new_df)
        count += 1
    # prev_df.to_csv(directory)
    df = pd.DataFrame(prev_df, columns=list(prev_df))
    df = df.apply(pd.to_numeric)
    convert_df_to_csv(df, directory)
    return df
def convert_df_to_csv(df, directory):
    df.to_csv(directory)
def main():
    # FINDS THE CURRENT DIRECTORY AND CREATES THE CSV TO DUMP THE DF
    csv_in_current_directory = os.getcwd() + "/stock_output.csv"
    csv_in_current_directory_dow_distribution = os.getcwd() + "/dow_distribution.csv"
    # FUNCTION THAT GETS ALL THE CLOSING PRICES OF THE STOCKS
    # AND RETURNS IT AS ONE COMPLETE DATAFRAME
    df = get_all_close_prices(csv_in_current_directory)
    print(df)
# Main line of code
if __name__ == "__main__":
    main()
QUESTION:
From this df I want to create x many line graphs (one graph per stock) with y many lines (price and SMAs). How can I do this with matplotlib? Could this be done with a for loop that saves the individual plots as the loop iterates? If so, how?
First, import matplotlib.pyplot as plt.
Then it depends whether you want x many individual plots or one plot with x many subplots:
Individual plots
df.plot(y=[0,1])
df.plot(y=[2,3])
df.plot(y=[4,5])
df.plot(y=[6,7])
df.plot(y=[8,9])
plt.show()
You can also save the individual plots in a loop:
for i in range(0, 9, 2):
    df.plot(y=[i, i+1])
    plt.savefig('{}.png'.format(i))
Subplots
fig, axes = plt.subplots(nrows=2, ncols=3)
df.plot(ax=axes[0,0],y=[0,1])
df.plot(ax=axes[0,1],y=[2,3])
df.plot(ax=axes[0,2],y=[4,5])
df.plot(ax=axes[1,0],y=[6,7])
df.plot(ax=axes[1,1],y=[8,9])
plt.show()
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html for options to customize your plot(s).
The best approach is to make a function that adapts to the sizes of your lists x and y. The function would be as follows:
def generate_SMA_graphs(df):
    columnNames = list(df.head(0))
    print("CN:\t", columnNames)
    print(len(columnNames))
    count = 0
    for stock in stock_list:
        stock_iter = count * (len(moving_average_period_list) + 1)
        sma_iter = stock_iter + 1
        for moving_average_period in moving_average_period_list:
            fig = plt.figure()
            df.plot(y=[columnNames[stock_iter], columnNames[sma_iter]])
            plt.xlabel('Time')
            plt.ylabel('Price ($)')
            graph_title = columnNames[stock_iter] + " vs. " + columnNames[sma_iter]
            plt.title(graph_title)
            plt.grid(True)
            plt.savefig(graph_title.replace(" ", "") + ".png")
            print("\t\t\t\tCompleted: ", graph_title)
            plt.close(fig)
            sma_iter += 1
        count += 1
With the code above, irrespective of how long either list is (the stock list or the SMA list), the function will generate a graph comparing the original price with every SMA for that given stock.
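A hypothetical call site (moving_average_period_list is referenced as a global inside the function, alongside stock_list, so it needs to exist before the call):
moving_average_period_list = [3]  # e.g. [3, 9, 20] to compare several SMA windows
generate_SMA_graphs(df)           # df as built by get_all_close_prices(...)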
