How to force plotly plots to correct starting point on x axis? - python-3.x

I'm plotting the sales numbers (amount) per week YYYYWW per product product_name.
All the data appears on the graph, but some of the products are plotted incorrectly. If Product A only started having sales figures in 2019 (i.e. no sales figures for the whole of 2018), then I want the line for that product to be zero in 2018 and begin showing values from 2019.
What's happening instead is Product A is showing the line graph from the origin of the graph. So week 1 of sales is at YYYYWW 201801 instead.
Is there a more efficient way to solve this than to place zero values for the product with a list comprehension?
import plotly.graph_objs as go
import plotly.offline as pyo
data = [go.Scatter(x=sorted(df.YYYYWW.unique().astype(str)),
                   y=list(df.loc[df.product_name == 'Product A',
                                 ['amount', 'YYYYWW']].groupby('YYYYWW').sum().amount),
                   mode='lines+markers',
                   )
        ]
pyo.plot(data)
The values in x are: 201801, 201802, ... 201920
The values in y are:
YYYYWW amount
2019/15 454.32
2019/16 1131.15
2019/17 1152.96
2019/18 2822.77
2019/19 3580.86
2019/20 2265.06

Solved it!
My x values should be taken from the same subset of the dataframe as my y values (here i is the name of the product being plotted):
x = df.loc[df.product_name == i].YYYYWW.unique().astype(str)
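
A minimal sketch of both options, assuming a small made-up dataframe that only reuses the column names from the question. The first pair takes x from the product's own subset, as in the fix above. The second pair reindexes the weekly totals against every week in the data and fills the gaps with zero, which gives the flat-at-zero line the question originally asked for without a list comprehension.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question
df = pd.DataFrame({
    'YYYYWW': [201801, 201802, 201915, 201915, 201916],
    'product_name': ['Product B', 'Product B', 'Product A', 'Product B', 'Product A'],
    'amount': [10.0, 20.0, 454.32, 5.0, 1131.15],
})

# Weekly totals for one product, indexed by week
weekly = df[df['product_name'] == 'Product A'].groupby('YYYYWW')['amount'].sum()

# Option 1: x and y both come from the same subset
x_subset = weekly.index.astype(str).tolist()
y_subset = weekly.tolist()

# Option 2: keep every week on the axis and plot zero where there were no sales
all_weeks = sorted(df['YYYYWW'].unique())
padded = weekly.reindex(all_weeks, fill_value=0)
x_padded = padded.index.astype(str).tolist()
y_padded = padded.tolist()
```

Either pair can then be passed to go.Scatter(x=..., y=...) in place of the original arguments.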

Related

How to build a simple moving average measure

I want to build a measure to get the simple moving average for each day. I have a cube with a single table containing stock market information.
Schema
The expected result is that for each date, this measure shows the closing price average of the X previous days to that date.
For example, for the date 2013-02-28 and for X = 5 days, this measure would show the average closing price for the days 2013-02-28, 2013-02-27, 2013-02-26, 2013-02-25, 2013-02-22. The closing values of those days are summed, and then divided by 5.
The same would be done for each of the rows.
Example dashboard
Maybe it could be achieved with just the function tt.agg.mean(), by indicating those X previous days in the scope parameter.
The problem is that I am not sure how to obtain the X previous days for each of the dates dynamically so that I can use them in a measure.
To compute a sliding average, you can use the cumulative scope described in the atoti documentation: https://docs.atoti.io/latest/lib/atoti.scope.html#atoti.scope.cumulative.
By passing a tuple containing the date range, in your case ("-5D", None), you can calculate a sliding average over the past 5 days for each date in your data.
The resulting Python code would be:
import atoti as tt
# session setup
...
m, l = cube.measures, cube.levels
# measure setup
...
tt.agg.mean(m["ClosingPrice"], scope=tt.scope.cumulative(l["date"], window=("-5D", None)))
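
To sanity-check the measure against the expected result described in the question (the average over the 5 most recent rows for each date), the same numbers can be computed in plain pandas. This is only an illustrative sketch outside atoti: the closing prices below are made up, and only the dates come from the question's example.

```python
import pandas as pd

# Hypothetical closing prices; the dates follow the example in the question
prices = pd.DataFrame({
    "date": pd.to_datetime([
        "2013-02-20", "2013-02-21", "2013-02-22", "2013-02-25",
        "2013-02-26", "2013-02-27", "2013-02-28",
    ]),
    "ClosingPrice": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0],
}).set_index("date")

# Row-based window: each date averages itself and the 4 rows before it.
# Dates with fewer than 5 rows of history get NaN.
prices["SMA5"] = prices["ClosingPrice"].rolling(window=5).mean()
print(prices)
```

For 2013-02-28 the window covers 2013-02-22 through 2013-02-28, the same five trading days listed in the question.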

Use KDTree/KNN Return Closest Neighbors

I have two python pandas dataframes. One contains all NFL Quarterbacks' College Football statistics since 2007 and a label on the type of player they are (Elite, Average, Below Average). The other dataframe contains all of the college football qbs' data from this season along with a prediction label.
I want to run some sort of analysis to determine the two closest NFL comparisons for every college football qb based on their labels. I'd like to add the two comparable qbs as two new columns to the second dataframe.
The feature names in both dataframes are the same. Here is what the dataframes look like:
Player    Year  Team  GP  Comp %  YDS   TD  INT  Label
Player A  2020  ASU   12  65.5    3053  25  6    Average
For the example above, I'd like to find the two closest neighbors to Player A that also have the label "Average" from the first dataframe.
The way I thought of doing this was to use Scipy's KDTree and run a query tree:
tree = KDTree(nfl[features], leafsize=nfl[features].shape[0]+1)
closest = []
for row in college.iterrows():
    distances, ndx = tree.query(row[features], k=2)
    closest.append(ndx)
print(closest)
However, the print statement returned an empty list. Is this the right way to solve my problem?
.iterrows() yields plain (index, Series) tuples, where index is the row's index label and the Series holds that row's values, indexed by the column names.
As you have it, row holds that whole tuple, so row[features] does not select the feature values. What you're really after is the Series with the features and their values, i.e. row[1]. You can either index it directly, or unpack the tuple in your loop with for idx, row in df.iterrows():. Then row is the Series itself.
scikit-learn is a good package to use here (it builds on SciPy, so the syntax will look familiar). You'll have to adapt the code to your needs (for example, filter to only the "Average" players, or add one-hot encoded category columns to the features if you use them). To give you an idea, I made up the dataframes below as an example (the nfl one is accurate, the college one is completely made up). The code builds the KD-tree and then takes each row in the college dataframe to find the 2 rows it is closest to in the nfl dataframe. It prints the names, but as print(closest) shows, the raw index arrays are available too.
import pandas as pd
from sklearn.neighbors import KDTree

nfl = pd.DataFrame([['Tom Brady','1999','Michigan',11,61.0,2217,16,6,'Average'],
                    ['Aaron Rodgers','2004','California',12,66.1,2566,24,8,'Average'],
                    ['Peyton Manning','1997','Tennessee',12,60.2,3819,36,11,'Average'],
                    ['Drew Brees','2000','Purdue',12,60.4,3668,26,12,'Average'],
                    ['Dan Marino','1982','Pitt',12,58.5,2432,17,23,'Average'],
                    ['Joe Montana','1978','Notre Dame',11,54.2,2010,10,9,'Average']],
                   columns = ['Player','Year','Team','GP','Comp %','YDS','TD','INT','Label'])
college = pd.DataFrame([['Joe Smith','2019','Illinois',11,55.6,1045,15,7,'Average'],
                        ['Mike Thomas','2019','Wisconsin',11,67,2045,19,11,'Average'],
                        ['Steve Johnson','2019','Nebraska',12,57.3,2345,9,19,'Average']],
                       columns = ['Player','Year','Team','GP','Comp %','YDS','TD','INT','Label'])
features = ['GP','Comp %','YDS','TD','INT']
# build the tree on the NFL feature columns only
tree = KDTree(nfl[features], leaf_size=nfl[features].shape[0]+1)
closest = []
for idx, row in college.iterrows():
    # query() expects a 2D array: one row per query point
    X = row[features].values.reshape(1, -1)
    distances, ndx = tree.query(X, k=2, return_distance=True)
    closest.append(ndx)
    collegePlayer = college.loc[idx,'Player']
    closestPlayers = [ nfl.loc[x,'Player'] for x in ndx[0] ]
    print ('%s closest to: %s' %(collegePlayer, closestPlayers))
print(closest)
Output:
Joe Smith closest to: ['Joe Montana', 'Tom Brady']
Mike Thomas closest to: ['Joe Montana', 'Tom Brady']
Steve Johnson closest to: ['Dan Marino', 'Tom Brady']
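
The question also asked for neighbors with the same label and for the results to be stored as two new columns. A possible sketch of that step follows. It swaps in SciPy's cKDTree rather than the scikit-learn class used above, builds one tree per label, and queries all rows of a label at once. The player names, the numbers, and the column names 'Comp 1' and 'Comp 2' are made up for illustration.

```python
import pandas as pd
from scipy.spatial import cKDTree

features = ['GP', 'Comp %', 'YDS', 'TD', 'INT']
columns = ['Player', 'Label'] + features

# Hypothetical data, reduced to the columns needed here
nfl = pd.DataFrame([
    ['NFL A', 'Average', 11, 61.0, 2217, 16, 6],
    ['NFL B', 'Average', 12, 66.1, 2566, 24, 8],
    ['NFL C', 'Average', 12, 58.5, 2432, 17, 23],
    ['NFL D', 'Elite', 12, 60.2, 3819, 36, 11],
    ['NFL E', 'Elite', 12, 60.4, 3668, 26, 12],
], columns=columns)
college = pd.DataFrame([
    ['College X', 'Average', 11, 55.6, 1045, 15, 7],
    ['College Y', 'Elite', 12, 57.3, 2345, 9, 19],
], columns=columns)

# 'Comp 1' / 'Comp 2' are illustrative names for the two new columns
college['Comp 1'] = None
college['Comp 2'] = None

for label, group in college.groupby('Label'):
    pool = nfl[nfl['Label'] == label]
    if len(pool) < 2:
        continue  # not enough NFL players with this label to return two neighbors
    tree = cKDTree(pool[features].to_numpy(dtype=float))
    # One query for the whole group; ndx has shape (len(group), 2)
    _, ndx = tree.query(group[features].to_numpy(dtype=float), k=2)
    names = pool['Player'].to_numpy()[ndx]
    college.loc[group.index, 'Comp 1'] = names[:, 0]
    college.loc[group.index, 'Comp 2'] = names[:, 1]

print(college[['Player', 'Label', 'Comp 1', 'Comp 2']])
```

With raw features the Euclidean distance is dominated by YDS, which is on a much larger scale than the other columns, so standardizing the features before building the trees may give more balanced comparisons.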

How to make a categorical count bar plot with time on x-axis

I want to count the number of occurrences of categories in a variable and plot it against time.
The data looks like following:
Date_column Categorical_variable
20-01-2019 A
20-01-2019 B
20-01-2019 C
21-01-2019 A
21-02-2019 A
22-02-2019 B
........................
23-04-2020 A
I want to show that in January I had 1 occurrence each of B and C and 2 occurrences of A, that in February I had 1 occurrence each of A and B, and so on. The bars can be stacked to show the total number of occurrences.
I've come very close, but haven't been able to draw a plot from the result.
df['Date_column'].groupby([df.Date_column.dt.year, df.Date_column.dt.month]).agg('count')
The other way is to change the dates to the 1st of every month and then group by to count the occurrences. But I'm unable to draw a plot from that either.
df.groupby(df['Date_column'], df['Categorical_variable']).count()
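Use crosstab with Series.dt.to_period:
import pandas as pd
# the dates are day-first (dd-mm-yyyy), so state the format explicitly
df['Date_column'] = pd.to_datetime(df['Date_column'], format='%d-%m-%Y')
df = pd.crosstab(df['Date_column'].dt.to_period('M'), df['Categorical_variable'])
df.plot.bar()  # pass stacked=True for stacked bars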
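
As a quick check of what the crosstab produces, here is a small self-contained sketch using the first six rows shown in the question. The variable names dates and counts are just for illustration, and plotting is left out so the snippet only needs pandas.

```python
import pandas as pd

# The first six sample rows from the question
df = pd.DataFrame({
    "Date_column": ["20-01-2019", "20-01-2019", "20-01-2019",
                    "21-01-2019", "21-02-2019", "22-02-2019"],
    "Categorical_variable": ["A", "B", "C", "A", "A", "B"],
})

# Parse day-first dates, then count categories per calendar month
dates = pd.to_datetime(df["Date_column"], format="%d-%m-%Y")
counts = pd.crosstab(dates.dt.to_period("M"), df["Categorical_variable"])
print(counts)
```

The result has one row per month and one column per category, with 0 where a category did not occur, so counts.plot.bar(stacked=True) can draw it directly.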

With Python 3, draw a count plot showing the number of each type of crime discovered each year

I need to draw a count plot showing the number of each type of crime discovered each year, using some columns from a CSV file.
Only 2 columns are involved (primary type and date).
Any help implementing this in Python would be appreciated.
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('FileName.csv')
# keep only the two columns of interest
df1 = df[['ColumnName1','ColumnName2']]
print(df1)
plt.xlabel('ColumnName1')
plt.ylabel('ColumnName2')
a = plt.bar(df1['ColumnName1'], df1['ColumnName2'])
plt.show()
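
The snippet above plots one column against the other, so the per-year counting still has to happen first. A possible sketch of that step follows, assuming the two columns are called "Date" and "Primary Type" (guessed from the question's wording) and using made-up rows. The real file's column names and date format may differ.

```python
import pandas as pd

# Hypothetical rows; column names are assumptions based on the question
crimes = pd.DataFrame({
    "Date": ["2018-01-05", "2018-03-10", "2018-07-21",
             "2019-02-14", "2019-05-01", "2019-08-30", "2019-11-11"],
    "Primary Type": ["THEFT", "THEFT", "BATTERY",
                     "THEFT", "BATTERY", "BATTERY", "BATTERY"],
})

# Extract the year, then count rows per (year, crime type)
years = pd.to_datetime(crimes["Date"]).dt.year.rename("Year")
counts = crimes.groupby([years, "Primary Type"]).size().unstack(fill_value=0)
print(counts)
```

Each row of counts is a year and each column a crime type, so counts.plot.bar() followed by plt.show() would draw one group of bars per year.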

Plot number of occurrences in Pandas dataframe (2)

This is a follow-up to the previous question: Plot number of occurrences from Pandas DataFrame
I'm trying to produce a bar chart in descending order from the results of a pandas dataframe that is grouped by "Issuing Office." The data comes from a csv file which has 3 columns: System (string), Issuing Office (string), Error Type (string). The first four commands work fine - read, fix the column headers, strip out the offices I don't need, and reset the index. However I've never displayed a chart before.
CSV looks like:
System  Issuing Office  Error Type
East    N1              Error1
East    N1              Error1
East    N2              Error1
West    N1              Error3
Looking for a simple horizontal bar chart that would show, for the sample above, that N1 had a count of 3 and N2 had a count of 1.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('mydatafile.csv',index_col=None, header=0) #ok
df.columns = [c.replace(' ','_') for c in df.columns] #ok
df = df[df['Issuing_Office'].str.contains("^(?:N|M|V|R)")] #ok
df = df.reset_index(drop=True) #ok
# produce chart that shows how many times an office came up (descending)
df.groupby([df.index, 'Issuing_Office']).count().plot(kind='bar')
plt.show()
# produce chart that shows how many error types per Issuing Office (Descending).
There are no date fields in this which makes it different than the original question. Any help is greatly appreciated :)
JohnE's solution worked. Used the code:
# produce chart that shows how many times an office came up (descending)
df['Issuing_Office'].value_counts().plot(kind='barh') #--JohnE
plt.gca().invert_yaxis()
plt.show()
# produce chart that shows how many error types per Issuing Office N1 (Descending).
dfN1 = df[df['Issuing_Office'].str.contains('N1')]
dfN1['Error_Type'].value_counts().plot(kind='barh')
plt.gca().invert_yaxis()
plt.show()
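
To see the numbers behind those two charts, here is a small sketch using the four sample rows from the question, with column names already converted to underscores. Plotting is omitted so the snippet only needs pandas.

```python
import pandas as pd

# The four sample rows shown in the question
df = pd.DataFrame({
    "System": ["East", "East", "East", "West"],
    "Issuing_Office": ["N1", "N1", "N2", "N1"],
    "Error_Type": ["Error1", "Error1", "Error1", "Error3"],
})

# value_counts() sorts in descending order by default
office_counts = df["Issuing_Office"].value_counts()
print(office_counts)

# Error types for a single office, matched exactly rather than by substring
n1_errors = df.loc[df["Issuing_Office"] == "N1", "Error_Type"].value_counts()
print(n1_errors)
```

Because value_counts() already returns the counts in descending order, barh draws the largest bar at the bottom, which is why the answer calls plt.gca().invert_yaxis() to put it on top.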
