Histogram bin size equal to 1 day - pyplot - python-3.x

I have a list of delivery times in days for cars that are 0 years old. The list contains nearly 20,000 delivery days, with many days repeated. My question is: how do I get the histogram to show a bin size of 1 day? I have set the number of bins to the number of unique delivery days with:
len(set(list))
but when I generate the histogram, the frequency of 0 delivery days is over 5000, whereas list.count(0) returns 4500.

As you pointed out, len(set(list)) is the number of unique values for the "delivery days" variable. This is not the same thing as the bin size; it's the number of distinct bins. I would use "bin size" to describe the number of items in one bin; "bin count" would be a better name for the number of bins.
If you want to generate a histogram, supposing the original list of days is called days_list, a quick high-level approach is:
Make a new set unique_days = set(days_list).
Iterate over each value day in unique_days.
For the current day, set the height of the bar (or size of the bin) in the histogram equal to days_list.count(day). That count is the number of times the current day value appeared in the days_list list of delivery times.
Does this make sense? A quick sketch of this is below.
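Assuming matplotlib and the days_list name above, that could look like the following (a bar chart of per-day counts, which is equivalent to a histogram with 1-day bins):
import matplotlib.pyplot as plt

# Count how many times each delivery-day value occurs, then draw one bar per day.
unique_days = sorted(set(days_list))
counts = [days_list.count(day) for day in unique_days]

plt.bar(unique_days, counts, width=1.0)
plt.xlabel("Delivery time (days)")
plt.ylabel("Number of cars")
plt.show()
Using plt.bar with explicit counts sidesteps plt.hist's binning entirely, so each bar corresponds to exactly one day.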
If the problem is not that you're manually calculating the histogram wrong but that pyplot is doing something wrong, it would help if you included some code for how you are using pyplot.

The number of bins should be determined by the number of days, up to the maximum delivery time in the data. (Passing bins=len(set(list)) most likely merges several days into each bin, which would explain why the bar at 0 is higher than list.count(0).)
Say daylist is the list you want to histogram (never call a list list, because that shadows the built-in of the same name). You would take the maximum of that list and build one bin edge per day, like
import matplotlib.pyplot as plt

maxi = max(daylist)
bins = range(0, maxi + 2)  # edges 0, 1, ..., maxi + 1 give one bin per integer day
plt.hist(daylist, bins=bins)
or, if you want to use numpy,
import numpy as np

bins = np.arange(0, np.max(daylist) + 2)
plt.hist(daylist, bins=bins)

Related

Pandas - Sum of variant number of products

I have two dataframes:
The first one contains information about the number of daily arrivals of people relative to some particular day of interest, e.g. the '-2' column header refers to the number of people arriving 2 days prior to my day of interest, etc. These arrivals are split between two groups ('Rest of US' and 'International') via a column MultiIndex. Each row of the dataframe shows a different expected number of arrivals based on the number of tickets owned. To sum up, a single value of that dataframe is the number of people arriving on a certain day, for a particular group and number of tickets held.
import pandas as pd
import numpy as np
arrivals_US=np.array(
[[0.00000000e+00, 1.32853314e+03, 7.92624282e+03, 2.36446211e+04,
4.70225402e+04, 7.01360136e+04, 8.36885523e+04, 8.32165654e+04,
7.09262060e+04, 5.28946471e+04, 3.50642212e+04, 2.09198795e+04,
1.13464885e+04, 5.64124825e+03, 2.58896895e+03, 1.10330043e+03,
4.38831214e+02, 1.63633610e+02, 5.74273229e+01, 1.90344814e+01,
5.97698863e+00, 1.78298388e+00],
[0.00000000e+00, 5.95406456e+03, 2.65918120e+04, 5.93816593e+04,
8.84026873e+04, 9.87051626e+04, 8.81666331e+04, 6.56277324e+04,
4.18720044e+04, 2.33758901e+04, 1.16000548e+04, 5.18077143e+03,
2.10346912e+03, 7.82869343e+02, 2.68955436e+02, 8.57998806e+01,
2.55464064e+01, 7.13089804e+00, 1.87339647e+00, 4.64827256e-01,
1.09262813e-01, 2.43992667e-02]])
arrivals_Int = np.array(
[[1.80595142e+02, 1.01558052e+03, 2.85556902e+03, 5.35278380e+03,
7.52537259e+03, 8.46381767e+03, 7.93274228e+03, 6.37284860e+03,
4.47973074e+03, 2.79909538e+03, 1.57407707e+03, 8.04714088e+02,
3.77110526e+02, 1.63129911e+02, 6.55260304e+01, 2.45657991e+01,
8.63414246e+00, 2.85613404e+00, 8.92307156e-01, 2.64100407e-01,
7.42587051e-02, 1.98854935e-02],
[3.33606865e+03, 1.20883137e+04, 2.19011273e+04, 2.64530626e+04,
2.39633048e+04, 1.73663061e+04, 1.04878615e+04, 5.42899758e+03,
2.45901062e+03, 9.90030648e+02, 3.58739652e+02, 1.18172777e+02,
3.56834584e+01, 9.94613456e+00, 2.57428744e+00, 6.21865585e-01,
1.40833925e-01, 3.00185083e-02, 6.04292696e-03, 1.15245636e-03,
2.08797472e-04, 3.60277122e-05]])
arrivals = [arrivals_US, arrivals_Int]
days=list(range(-5,17))
tickets=[1,2]
arrivals_df = pd.concat([pd.DataFrame(dict(zip(days,arrival_type.T)), index=tickets) for arrival_type in arrivals], axis=1, keys=["Rest of US", "International"])
The second one has a similar structure to the first one, but instead of the number of arrivals it represents a distribution of the number of days stayed, given the group type and number of tickets owned, e.g. on average 24% of people who are part of the 'Rest of US' group and hold a single ticket will stay for 3 days, etc.
stays_dist_US=np.array(
[[4.43244820e-03, 5.39982734e-02, 2.42003472e-01, 3.98996272e-01,
2.42003472e-01, 5.39982734e-02, 4.43244820e-03, 1.33848338e-04,
1.48692072e-06, 6.07670514e-09, 9.13595666e-12, 5.05295484e-15,
1.02811648e-18, 7.69563998e-23, 2.11910601e-27, 2.14667422e-32,
7.99991029e-38, 1.09675497e-43, 5.53145805e-50, 1.02630195e-56,
7.00513005e-64, 1.75898756e-71],
[1.33830425e-04, 4.43185500e-03, 5.39910468e-02, 2.41971084e-01,
3.98942874e-01, 2.41971084e-01, 5.39910468e-02, 4.43185500e-03,
1.33830425e-04, 1.48672173e-06, 6.07589189e-09, 9.13473400e-12,
5.05227860e-15, 1.02797889e-18, 7.69461007e-23, 2.11882241e-27,
2.14638693e-32, 7.99883965e-38, 1.09660819e-43, 5.53071778e-50,
1.02616460e-56, 7.00419255e-64]])
stays_dist_Int = np.array(
[[5.05227106e-15, 9.13472036e-12, 6.07588282e-09, 1.48671951e-06,
1.33830225e-04, 4.43184839e-03, 5.39909662e-02, 2.41970723e-01,
3.98942278e-01, 2.41970723e-01, 5.39909662e-02, 4.43184839e-03,
1.33830225e-04, 1.48671951e-06, 6.07588282e-09, 9.13472036e-12,
5.05227106e-15, 1.02797735e-18, 7.69459859e-23, 2.11881924e-27,
2.14638372e-32, 7.99882771e-38],
[1.02797735e-18, 5.05227106e-15, 9.13472036e-12, 6.07588282e-09,
1.48671951e-06, 1.33830225e-04, 4.43184839e-03, 5.39909662e-02,
2.41970723e-01, 3.98942278e-01, 2.41970723e-01, 5.39909662e-02,
4.43184839e-03, 1.33830225e-04, 1.48671951e-06, 6.07588282e-09,
9.13472036e-12, 5.05227106e-15, 1.02797735e-18, 7.69459859e-23,
2.11881924e-27, 2.14638372e-32]])
stays_dist = [stays_dist_US, stays_dist_Int]
len_of_stay = list(range(1,23))
tickets=[1,2]
stays_df = pd.concat([pd.DataFrame(dict(zip(len_of_stay,stays_dist_type.T)), index=tickets) for stays_dist_type in stays_dist], axis=1, keys=["Rest of US", "International"])
I would like to create a 3rd dataframe which would show the number of departures on each day (having exactly the same column and index structure as the 1st one), by doing some sort of 'variant sum of products' (not sure what to name it). For example:
the number of 'Rest of US' people leaving on day '-4', given that they have k tickets, should be the product of the number of such people arriving on day '-5' (value from the 1st dataframe) and the percentage of such people leaving after a single day (value from my length-of-stay distribution, the 2nd dataframe);
the number of 'Rest of US' people leaving on day '-3', given that they have k tickets, should be the sum of two products:
Number of people arriving on day '-5' times the percentage of people staying for 2 days,
Number of people arriving on day '-4' times the percentage of people staying for 1 day;
the number of 'Rest of US' people leaving on day '-2', given that they have k tickets, should be the sum of three products:
Number of people arriving on day '-5' times the percentage of people staying for 3 days,
Number of people arriving on day '-4' times the percentage of people staying for 2 days,
Number of people arriving on day '-3' times the percentage of people staying for 1 day.
And so on until all days are covered for a given group, and then the same is repeated for the next group (a sketch of this pattern is shown right after this list).
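The pattern described above is a discrete convolution of the arrivals with the stay-length distribution, per group and per ticket count. A rough numpy/pandas sketch of that idea, assuming the arrivals_df, stays_df, days and tickets objects defined above (departures_df is just an illustrative name):
import numpy as np
import pandas as pd

dep_blocks = []
for group in ["Rest of US", "International"]:
    rows = []
    for k in tickets:
        arr = arrivals_df[group].loc[k].to_numpy()   # arrivals on days -5..16
        stay = stays_df[group].loc[k].to_numpy()     # share staying 1..22 days
        conv = np.convolve(arr, stay)                # conv[m] = departures on day -4 + m
        dep = np.zeros(len(days))                    # realign to days -5..16
        dep[1:] = conv[:len(days) - 1]               # nobody can leave on day -5 itself
        rows.append(dep)
    dep_blocks.append(pd.DataFrame(rows, index=tickets, columns=days))
departures_df = pd.concat(dep_blocks, axis=1, keys=["Rest of US", "International"])
np.convolve handles the shifting-and-summing, so no explicit loop over days is needed.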
So far I've done it in Excel by manually expanding the formula with each new day I was considering:
[figure: example of the calculation in Excel for day '-2']
I tried to replicate it in Python with some for loops but it's messy and probably not efficient.
Is there any clever way one can do this with some pandas/numpy methods?
Thank you for any help and suggestions!

How to generate a random number in Google Sheets / Excel through a discrete list of percentage of influence in random outcome?

Let's say I'm randomly picking up a number 1, 2, 3, and I take notes of how many times they were picked out of 10 times I did this. After this experiment, and taking the notes of the percentage of the times these numbers were picked in this 10 randomly generated picks, I want to randomly pick a number but this time having the weight of the percentage of times that I just took note from the original procedure.
For instance, if 3 was picked 20% of times, then the random generator tool will have it 20% of the times in consideration instead of going equally ~33% for each number 1,2 and 3.
The thing I'm missing is if there is any way to (either in Excel or Google Sheets) give this "weight" of the percentages a random picker.
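For reference, the weighted-pick idea itself, sketched in Python (illustrative only; the weights below assume 1 and 2 split the remaining 80% evenly):
import random

# Assumed observed frequencies: 1 -> 40%, 2 -> 40%, 3 -> 20%.
weights = [0.4, 0.4, 0.2]
new_picks = random.choices([1, 2, 3], weights=weights, k=10)
print(new_picks)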
to generate 10 numbers from fixed set (1, 2, 3) you can use:
=INDEX(ROUND(RANDARRAY(10)*(3-1))+1)
if this gives you distribution like:
1
2
1
2
1
2
3
2
3
1
where number 3 is picked 20% of the time, you can work out the distribution like:
=INDEX(QUERY({A2:A11, COUNTIFS(A2:A11, A2:A11)},
"select Col1,count(Col2)/10 group by Col1 label count(Col2)/10''"))
now to assign a weight we can reuse it like:
=INDEX(ROUND(RANDARRAY(10)*(MAX(A2:A11)-A2:A11))+MIN(A2:A11))
where you can notice that the % distribution of number 3 is always significantly lower or none:
for more precision and to avoid ghost values you can use:
=INDEX(SORTN(SORT(FLATTEN(SPLIT(QUERY(REPT(SORT(UNIQUE(A2:A11))&"×",
QUERY({A2:A11, COUNTIFS(A2:A11, A2:A11)},
"select count(Col2)*10 group by Col1 label count(Col2)*10''")),,9^9),
"×"))), 10, 1, RANDARRAY(100), 1))
if you wish to freeze the random generation follow the white fox into the forest of ice

How to build a simple moving average measure

I want to build a measure to get the simple moving average for each day. I have a cube with a single table containing stock market information.
[figure: schema]
The expected result is that for each date, this measure shows the closing price average of the X previous days to that date.
For example, for the date 2013-02-28 and for X = 5 days, this measure would show the average closing price for the days 2013-02-28, 2013-02-27, 2013-02-26, 2013-02-25, 2013-02-22. The closing values of those days are summed, and then divided by 5.
The same would be done for each of the rows.
[figure: example dashboard]
Maybe it could be achieved just with the function tt.agg.mean(), but indicating those X previous days in the scope parameter.
The problem is that I am not sure how to obtain the X previous days for each of the dates dynamically so that I can use them in a measure.
To compute a sliding average you can use the cumulative scope, as described in the atoti documentation: https://docs.atoti.io/latest/lib/atoti.scope.html#atoti.scope.cumulative.
By passing a tuple containing the date range, in your case ("-5D", None), you can calculate a sliding average over the past 5 days for each date in your data.
The resulting Python code would be:
import atoti as tt
# session setup
...
m, l = cube.measures, cube.levels
# measure setup
...
tt.agg.mean(m["ClosingPrice"], scope=tt.scope.cumulative(l["date"], window=("-5D", None)))
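To expose the result in the cube, you would typically assign that expression to a new measure name (the name here is just an example):
m["ClosingPrice SMA 5D"] = tt.agg.mean(
    m["ClosingPrice"],
    scope=tt.scope.cumulative(l["date"], window=("-5D", None)),
)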

Print the first value of a dataframe based on condition, then iterate to the next sequence

I'm looking to perform data analysis on 100 years of climatological data for select U.S. locations (8 in particular), for each day spanning the 100 years. I have a pandas DataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, Precip Total, and then Day, Year, and Month values (plus an index based on a date-time value). Right now, I want to set up a for loop to print the first Maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually, I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work.
Experimented with various iterations of a for loop.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
Unsurprisingly, the output of the loop I provided prints the first 90-degree day (from the year 1919, the beginning of my data frame) and breaks:
1919-06-12 00:00:00
That's fine. If I take out the break statement, all of the 90-degree days print, including multiple in the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the year? If I explicitly state the year, as below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out-of-bounds index. I know this logic is incorrect.
count = 1919
while count < 2019:
    for year in range(len(climate['Year'])):
        if (climate[climate['Year'] == count]['Max'][year] >= 90).all():
            print(climate.index[year])
            count = count + 1
Any input is sincerely appreciated.
You can achieve this without having a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:
current_year = None
for i in range(climate.shape[0]):
    if climate['Max'][i] >= 90 and climate['Year'][i] != current_year:
        print(climate.index[i])
        current_year = climate['Year'][i]
Notice that we're using the current_year variable to keep track of the latest year that we've already printed the result for. Then, in the if check, we're checking if we've already printed a result for the year of the current row in the loop.
That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe that contains the first >=90 max days per year with the following (again assuming climate is ordered chronologically):
climate[climate.Max >= 90].groupby('Year').first()
This just filters the dataframe to only contain the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same except per location per year:
climate[climate.Max >= 90].groupby(['Location', 'Year']).first()

Find a growth rate that creates values adding to a determined total

I am trying to create a forecast tool that shows a smooth growth rate over a determined number of steps while adding up to a determined value. We have variables tied to certain sales values and want to illustrate different growth patterns. I am looking for a formula that would help us to determine the values of each individual step.
As an example: say we wanted to illustrate 100 units sold, starting with sales of 19 units, over 4 months with an even growth rate; we would need individual month sales of 19, 23, 27 and 31. We can find these values with a lot of trial and error, but I am hoping there is a formula I could use to calculate the values automatically.
We will have a starting value (current or last month sales), a total amount of sales that we want to illustrate, and a period of time that we want to evaluate -- so all I am missing is a way to determine the change needed between individual values.
This basically is a problem in sequences and series. If the starting sales number is a, the difference in sales numbers between consecutive months is d, and the number of months is n, then the total sales is
S = n/2 * [2*a + (n-1) * d]
In your example, a=19, n=4, and S=100, with d unknown. That equation is easy to solve for d, and we get
d = 2 * (S - a * n) / (n * (n - 1))
There are other ways to write that, of course. If you substitute your example values into that expression, you get d=4, so the sales values increase by 4 each month.
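To make that concrete, here is a small Python sketch of the same formula (the function name is just for illustration):
def monthly_sales(a, n, S):
    # Arithmetic series: S = n/2 * (2*a + (n-1)*d), solved for the common difference d.
    d = 2 * (S - a * n) / (n * (n - 1))
    return [a + i * d for i in range(n)]

print(monthly_sales(a=19, n=4, S=100))  # [19.0, 23.0, 27.0, 31.0]
The returned values sum to S by construction, since they form exactly the arithmetic series the formula describes.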
For Excel you can use this formula:
=IF(D1<>"",(D1-1)*($B$1-$B$2*$B$3)/SUMPRODUCT(ROW($A$1:INDEX(A:A,$B$3-1)))+$B$2,"")
I would recommend using Excel.
This is simply a y = mx + b equation.
Assuming you want a steady growth rate over a span of x periods, you can use this formula to determine the slope of your line (the growth rate, designated 'm'). As long as you have your two data points (starting sales value and ending sales value) you can find 'm' using
m = (y2 - y1) / (x2 - x1)
That will calculate the slope. y2 represents your final sales goal and y1 your current sales level. x2 is the number of periods in the period of performance (how many months you are allowing to reach the goal), and x1 = 0 since it represents today, which is time period 0.
Once you solve for 'm', it plugs into the formula y = mx + b. Your 'b' in this scenario is always your current sales level (the y-intercept).
Then all you have to do to calculate the new 'y' (the sales level at any period) is plug in whichever x value you choose: if you are in the first month, x = 1; in the second month, x = 2. The 'm' and 'b' stay the same.
See the Excel template below which serves as a rudimentary model. The yellow boxes can be filled in by the user and the white boxes should be left as formulas.
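A minimal Python sketch of that linear interpolation, assuming a current sales level, a final-period sales goal and a number of periods (names and example values are illustrative; note this targets an ending value rather than a total):
def linear_steps(y1, y2, periods):
    # y = m*x + b with b = y1 (current level) and m = (y2 - y1) / periods.
    m = (y2 - y1) / periods
    return [y1 + m * x for x in range(1, periods + 1)]

print(linear_steps(y1=19, y2=31, periods=4))  # [22.0, 25.0, 28.0, 31.0]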
