Reverse a rolling total based on historic data - excel

Say I have a list of rolling x-day page view totals. That is, each data point is the sum of the previous x days of page views, but I do not have each individual day's page view total. Would it be possible to get the individual values?
For example, say someone gathers the following page view metrics:
{4 days before Day 1: {1,2,3,8}, Day 1: 4, Day 2: 2, Day 3: 5, Day 4: 2, Day 5: 9, Day 6: 8, Day 7: 10, Day 8: 10, Day 9: 7, Day 10: 6}
They provide me with the following list of 5-day running totals:
{Day 1: 18 (1+2+3+8+4), Day 2: 19 (2+3+8+4+2), Day 3: 22 (3+8+4+2+5), Day 4: 21 (etc.), Day 5: 22, Day 6: 26, Day 7: 34, Day 8: 39, Day 9: 44, Day 10: 41}
Would it be possible for me to take only the second dataset and determine at least some of the values in the first dataset?

In your example, the history
{1, 2, 3, 8, 4, 2, 5, 2, 9, 8, 10, 10, 7, 6}
gives the following 5-day running totals:
{18, 19, 22, 21, 22, 26, 34, 39, 44, 41}
But so would the history:
{3, 8, 1, 3, 3, 4, 11, 0, 4, 7, 12, 16, 5, 1}
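So no, in general you can't reconstruct any of the values.
The exception is five days in a row with no views, which puts a zero in the list of running totals. Page views can't be negative, so a zero total means each of those five days was zero. Consecutive totals differ by the value entering the window minus the value leaving it, so from that known window you can reconstruct the entire history before and after.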
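To make the argument concrete, here is a minimal Python sketch (the function names `rolling_totals` and `reconstruct_forward` are illustrative, not from the question). It shows that adding any 5-day repeating offset that sums to zero leaves the running totals unchanged, which is exactly how the second history above differs from the first, and that once a single full window is known the rest follows from the differences between consecutive totals.

```python
def rolling_totals(history, window=5):
    """Sum of each run of `window` consecutive values."""
    return [sum(history[i:i + window]) for i in range(len(history) - window + 1)]


def reconstruct_forward(totals, known_window):
    """Rebuild later values, assuming the first window is fully known."""
    history = list(known_window)
    window = len(known_window)
    for prev, cur in zip(totals, totals[1:]):
        # cur - prev = (value entering the window) - (value leaving it)
        history.append(cur - prev + history[-window])
    return history


original = [1, 2, 3, 8, 4, 2, 5, 2, 9, 8, 10, 10, 7, 6]
offset = [2, 6, -2, -5, -1]  # repeats every 5 days and sums to zero
alternative = [v + offset[i % 5] for i, v in enumerate(original)]

totals = rolling_totals(original)
print(totals)                                  # the 5-day running totals
print(alternative)                             # a different history...
print(rolling_totals(alternative) == totals)   # ...with identical totals
print(reconstruct_forward(totals, original[:5]) == original)
```

Any other zero-sum offset would work just as well, as long as no value becomes negative, which is why the totals alone do not pin down the daily values.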

Related

print all possible routes of a conditional binary tree by python3

I want to print all routes through a conditional binary tree.
For example, take five lists a, b, c, d, e:
1
2, 4
3, 5, 7
6, 8
9
The condition is that each number must be larger than the one before it, so 1, 4, 3, 6, 9 is not a valid route.
The desired result is:
1, 2, 3, 6, 9
1, 2, 5, 6, 9
1, 4, 5, 8, 9
1, 4, 7, 8, 9
How can I get those lists in Python 3?
Thank you very much.

create new dataframe based upon max value in one column and corresponding value in a second column

I have a dataframe created by extracting data from a source (a network wireless controller).
The dataframe is built from a dictionary. This is basically what I am doing (a sample to show the structure, not the actual dataframe):
import pandas as pd
df = pd.DataFrame({'AP-1': [30, 32, 34, 31, 33, 35, 36, 38, 37],
                   'AP-2': [30, 32, 34, 80, 33, 35, 36, 38, 37],
                   'AP-3': [30, 32, 81, 31, 33, 101, 36, 38, 37],
                   'AP-4': [30, 32, 34, 95, 33, 35, 103, 38, 121],
                   'AP-5': [30, 32, 34, 31, 33, 144, 36, 38, 37],
                   'AP-6': [30, 32, 34, 31, 33, 35, 36, 110, 37],
                   'AP-7': [30, 87, 34, 31, 111, 35, 36, 38, 122],
                   'AP-8': [30, 32, 99, 31, 33, 35, 36, 38, 37],
                   'AP-9': [30, 32, 34, 31, 33, 99, 88, 38, 37]},
                  index=['1', '2', '3', '4', '5', '6', '7', '8', '9'])
# one row per AP, columns '1' to '9'
df1 = df.transpose()
This works fine.
A note about the data: columns 1, 2, 3 are related and go together. The same holds for columns 4, 5, 6 and for columns 7, 8, 9.
Columns 1, 4, 7 are client counts. Columns 2, 5, 8 are channel utilization on the 5 GHz spectrum. Columns 3, 6, 9 are channel utilization on the 2.4 GHz spectrum.
I take a reading every 5 minutes, so the sample above represents three consecutive readings.
What I want is two new dataframes, two columns each, constructed as follows:
Examine the 5 GHz columns (here 2, 5, 8). Whichever has the highest value becomes column 1 in the new dataframe. Column 2 is the client count related to that 5 GHz column. In other words, if column 2 is the highest of columns 2, 5, 8, then the second column of the new dataframe takes the value from column 1; if column 8 is the highest, it takes the value from column 7. The index of the new dataframes should be the same as in the original: the AP name.
I want to do this for every row of the main dataframe, once for the 5 GHz columns and once for the 2.4 GHz columns (3, 6, 9), each time also taking the client count that corresponds to the highest utilization as the second column.
What I have tried:
First I broke the main dataframe into three: df_cc has all the client count columns, df_5Ghz has the 5 GHz columns, and df_24Ghz has the 2.4 GHz columns, using this:
# create client count only dataframe (columns 1, 4, 7)
df_cc = df1[df1.columns[::3]]
print(df_cc)
print()
# create 5 GHz channel utilization only dataframe (columns 2, 5, 8)
df_5Ghz = df1[df1.columns[1::3]]
print(df_5Ghz)
print()
# create 2.4 GHz channel utilization only dataframe (columns 3, 6, 9)
df_24Ghz = df1[df1.columns[2::3]]
print(df_24Ghz)
print()
This works.
I thought I could then reference the main dataframe, but I don't know how.
Then I found this:
extract column value based on another column pandas dataframe
The query option looked great, but I don't know the value in advance. I first need to discover the max value of the 2.4 GHz and 5 GHz columns respectively, then grab the corresponding client count value. That is why I first created dataframes containing only the 2.4 GHz and 5 GHz values, thinking I could get the max value of each row and then do a lookup on the main dataframe (or on the client-count-only dataframe I created), but I do not know how to realize this idea.
Any assistance would be greatly appreciated.
You can get what you want in 3 steps:
# connection between columns
mapping = {'2': '1', '5': '4', '8': '7'}
# 1. column with highest value among 5GHz values (pandas series)
df2 = df1.loc[:, ['2', '5', '8']].idxmax(axis=1)
df2.name = 'highest value'
# 2. column with client count corresponding to the highest value (pandas series)
df3 = df2.apply(lambda x: mapping[x])
df3.name = 'client count'
# 3. build result using 2 lists of columns (pandas dataframe)
df4 = pd.DataFrame(
    {series.name: [
        df1.loc[idx, col]
        for idx, col in zip(series.index, series.values)]
     for series in [df2, df3]},
    index=df1.index)
print(df4)
Output:
      highest value  client count
AP-1             38            36
AP-2             38            36
AP-3             38            36
AP-4             38           103
AP-5             38            36
AP-6            110            36
AP-7            111            31
AP-8             38            36
AP-9             38            88
I suspect, though I'm not sure, that it would be easier (and faster to compute) to solve this without pandas, using only built-in Python data types: dictionaries and lists.
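As a rough illustration of that last remark, here is a minimal sketch using only dictionaries and lists. The function name `max_util_with_clients` and its `util_offset` parameter are hypothetical; the sketch assumes every row is a sequence of (client count, 5 GHz utilization, 2.4 GHz utilization) triples, as described in the question.

```python
# Sample readings from the question, keyed by AP name.
# Position 0 corresponds to column '1', position 8 to column '9'.
readings = {
    'AP-1': [30, 32, 34, 31, 33, 35, 36, 38, 37],
    'AP-2': [30, 32, 34, 80, 33, 35, 36, 38, 37],
    'AP-3': [30, 32, 81, 31, 33, 101, 36, 38, 37],
    'AP-4': [30, 32, 34, 95, 33, 35, 103, 38, 121],
    'AP-5': [30, 32, 34, 31, 33, 144, 36, 38, 37],
    'AP-6': [30, 32, 34, 31, 33, 35, 36, 110, 37],
    'AP-7': [30, 87, 34, 31, 111, 35, 36, 38, 122],
    'AP-8': [30, 32, 99, 31, 33, 35, 36, 38, 37],
    'AP-9': [30, 32, 34, 31, 33, 99, 88, 38, 37],
}


def max_util_with_clients(readings, util_offset):
    """Map each AP to (highest utilization, client count of that reading).

    util_offset is 1 for the 5 GHz columns and 2 for the 2.4 GHz columns.
    """
    result = {}
    for ap, row in readings.items():
        # Split the row into (client count, 5 GHz, 2.4 GHz) triples.
        groups = [row[i:i + 3] for i in range(0, len(row), 3)]
        # On ties, max() keeps the earliest reading.
        best = max(groups, key=lambda g: g[util_offset])
        result[ap] = (best[util_offset], best[0])
    return result


five_ghz = max_util_with_clients(readings, 1)
two_four_ghz = max_util_with_clients(readings, 2)
for ap in readings:
    print(ap, five_ghz[ap], two_four_ghz[ap])
```

If a dataframe is still needed at the end, each resulting dictionary can be turned into one, with the AP names as the index.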

Airflow schedule a task to run on the Monday before the 15th of the month

Is it possible to schedule an airflow DAG to run at a specific time on the Monday directly before the 15th of each month? I think this cron string might do it but I'm not sure that I have understood correctly
0 10 8-14 * MON
So I think that this should run at 10:00 on a Monday only between the 8th and the 14th of each month. As there can only be one Monday between the 8th and the 14th, this should run only once a month and it will be the Monday preceding the 15th of the month.
Is that correct?
The croniter module (which Airflow uses for the execution date/time calculations) supports the hash symbol in the day-of-week field, which I believe lets you schedule a run on the second Monday of each month.
For example, "30 7 * * 1#2" says to run at 7:30 AM on the second Monday of every month. Using this code to test it:
from croniter import croniter
from datetime import datetime
cron = croniter("30 7 * * 1#2")
for i in range(10):
    print(cron.get_next(datetime))
yields:
datetime.datetime(2018, 10, 8, 7, 30)
datetime.datetime(2018, 11, 12, 7, 30)
datetime.datetime(2018, 12, 10, 7, 30)
datetime.datetime(2019, 1, 14, 7, 30)
datetime.datetime(2019, 2, 11, 7, 30)
datetime.datetime(2019, 3, 11, 7, 30)
datetime.datetime(2019, 4, 8, 7, 30)
datetime.datetime(2019, 5, 13, 7, 30)
datetime.datetime(2019, 6, 10, 7, 30)
datetime.datetime(2019, 7, 8, 7, 30)
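One caveat about the expression in the question: in traditional cron implementations, when both the day-of-month and day-of-week fields are restricted, a job runs when either field matches, so "0 10 8-14 * MON" may fire on every day from the 8th to the 14th and on every Monday. Whether a given scheduler follows that convention is worth checking. The following standard-library sketch (the helper names are illustrative) checks the underlying calendar claim, namely that the second Monday always falls on the 8th to the 14th and is therefore the Monday directly before the 15th.

```python
from datetime import date, timedelta


def second_monday(year, month):
    """Date of the second Monday of the given month."""
    first = date(year, month, 1)
    # weekday() is 0 for Monday; step forward to the first Monday.
    first_monday = first + timedelta(days=(0 - first.weekday()) % 7)
    return first_monday + timedelta(days=7)


def monday_before_15th(year, month):
    """Latest Monday strictly before the 15th of the given month."""
    day = date(year, month, 14)
    while day.weekday() != 0:
        day -= timedelta(days=1)
    return day


# Collect any month where the two definitions disagree.
mismatches = [
    (year, month)
    for year in range(2018, 2030)
    for month in range(1, 13)
    if second_monday(year, month) != monday_before_15th(year, month)
    or not 8 <= second_monday(year, month).day <= 14
]
print(mismatches)
print(second_monday(2018, 10), second_monday(2019, 1))
```

The empty mismatch list confirms that "second Monday" and "Monday before the 15th" pick the same day over the years checked.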

Creating a vector containing the next 10 row-column values for each pandas row

I am trying to create a vector of the previous 10 values from a pandas column and insert it back into the dataframe as a list in a cell.
The code below works, but I need to do this for a dataframe of over 30 million rows, so a loop will take too long.
Can someone please help me convert this to a numpy function that I can apply? I would also like to be able to apply this function in a groupby.
import pandas as pd
df = pd.DataFrame(list(range(1,20)),columns = ['A'])
df.insert(0,'Vector','')
df['Vector'] = df['Vector'].astype(object)
for index, row in df.iterrows():
    df['Vector'].iloc[index] = list(df['A'].iloc[(index-10):index])
I have tried in multiple ways but have not been able to get it to work. Any help would be appreciated.
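If I understand correctly:
# convert the column to a list once, not on every iteration
values = df['A'].tolist()
df['New'] = [values[max(0, x - 10):x] for x in range(len(df))]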
df
Out[123]:
A New
0 1 []
1 2 [1]
2 3 [1, 2]
3 4 [1, 2, 3]
4 5 [1, 2, 3, 4]
5 6 [1, 2, 3, 4, 5]
6 7 [1, 2, 3, 4, 5, 6]
7 8 [1, 2, 3, 4, 5, 6, 7]
8 9 [1, 2, 3, 4, 5, 6, 7, 8]
9 10 [1, 2, 3, 4, 5, 6, 7, 8, 9]
10 11 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
11 12 [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
12 13 [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
13 14 [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
14 15 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
15 16 [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
16 17 [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
17 18 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
18 19 [9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
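The same windows can also be built in a single pass without repeated slicing. The sketch below is plain Python rather than the numpy function asked for, and the name `previous_windows` is illustrative; it keeps the last 10 values in a bounded `collections.deque` and copies it out for each row.

```python
from collections import deque


def previous_windows(values, size=10):
    """For each position, list up to `size` values that came before it."""
    window = deque(maxlen=size)  # oldest value drops out automatically
    out = []
    for value in values:
        out.append(list(window))  # snapshot before adding the current value
        window.append(value)
    return out


vectors = previous_windows(range(1, 20))
print(vectors[0])    # no earlier values for the first row
print(vectors[11])   # ten values preceding 12
print(vectors[18])   # ten values preceding 19
```

For grouped data, the function could be called on each group's values separately so that windows never cross group boundaries.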

How to return the number of values greater than X with multiple criteria

I am looking for a formula that returns the count of values greater than 20 after applying two criteria.
I have a table with 3 fields:
Field A: 18, 18, 19, 19, 21, 21, 44, 55, 55, 56, 61, 61, 75, 76, 86
Field B: 1, 4, 1, 5, 1, 6, 3, 1, 2, 1, 1, 3, 1, 1, 1
Field C: 5, 2, 14, 7, 38, 1, 100, 76, 32, 65, 83, 20, 17, 41, 88
I have two criteria:
Criteria1: 18, 55, 61, 75, 86 (this is an array)
Criteria2: 1
Steps:
Step 1 - Apply Criteria_1 to Field_A
Step 2 - Apply Criteria_2 to Field_B
Step 3 - Return number of values greater than 20
Regards,
Elio Fernandes
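=SUM(ISNUMBER(MATCH(A1:A15, {18,55,61,75,86}, 0)) * (B1:B15 = 1) * (C1:C15 > 20))
Enter this as an array formula with Ctrl+Shift+Enter. It assumes the three fields are in A1:A15, B1:B15 and C1:C15.
This works because multiplying the conditions turns TRUE into 1 and FALSE into 0, so only rows that meet all three conditions contribute 1 to the sum.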
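As a cross-check of the formula's logic, here is a small Python sketch over the sample data from the question (the variable names are illustrative). It multiplies the three conditions row by row in the same way, relying on Python also treating True as 1 and False as 0.

```python
field_a = [18, 18, 19, 19, 21, 21, 44, 55, 55, 56, 61, 61, 75, 76, 86]
field_b = [1, 4, 1, 5, 1, 6, 3, 1, 2, 1, 1, 3, 1, 1, 1]
field_c = [5, 2, 14, 7, 38, 1, 100, 76, 32, 65, 83, 20, 17, 41, 88]

criteria_1 = {18, 55, 61, 75, 86}
criteria_2 = 1

# Each product is 1 only when all three conditions hold for that row.
count = sum(
    (a in criteria_1) * (b == criteria_2) * (c > 20)
    for a, b, c in zip(field_a, field_b, field_c)
)
print(count)  # 3
```

The three matching rows are the ones with Field A equal to 55, 61 and 86 (Field C values 76, 83 and 88).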
