How do I partition by more than one variable to retrieve only the last order for a customer for each month?

starting data source:
consumer, order #, month 2
consumer, order #, month 2
consumer, order #, month 3
resulting data source I want to produce:
only one row per customer per month
month 2, consumer, max order #
month 3, consumer, max order #
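For reference, the same one-row-per-customer-per-month reduction can be sketched in pandas (the column names and data here are illustrative, not from the original source):

```python
import pandas as pd

# Hypothetical data matching the shape described above
orders = pd.DataFrame({
    "Customer": ["A", "A", "A", "B"],
    "OrderNum": [101, 102, 103, 201],
    "Month":    [2,   2,   2,   3]})
orders["Month"] = [2, 2, 3, 2]

# Partition by both Customer and Month, then keep the max order number per group
last_orders = orders.groupby(["Customer", "Month"], as_index=False)["OrderNum"].max()
print(last_orders)
```

Grouping by a list of columns is the pandas equivalent of partitioning by more than one variable: each (Customer, Month) pair becomes one output row.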

Related

Compare successive rows Alteryx

I have a spreadsheet with order item in column A, order quantity in column B, order start date in column C, and order finish date in column D. What I would like to do is treat orders on consecutive start dates for the same item as one single order. So until there is at least one day's break between order start dates for an order item, treat it as one single order.

Then I need to count the orders, sum the order quantities, and calculate the average gap in days between orders (the gap between an order finish date and the next order start date). So if an order item was ordered on the 1st, 2nd, 3rd and 4th of March, and then again on the 10th and 11th of March, and then again on the 20th of March (with all orders having the same start and finish date), there would be 2 gaps, with the average gap being 7.5 days ((6+9)/2). So the input and output will look like this;
Any help would be much appreciated. Many thanks!
The fields I've defined are OrderItem, OrderQty, OrderStartDate, and OrderEndDate, plugging in values identical to those you provided.
The Select tool just forces OrderQty to Int32
A Multi-Row Formula tool creates a new Int32 variable, Gap, using this expression:
IIF(IsNull([Row-1:OrderStartDate]), 1, DateTimeDiff([OrderStartDate], [Row-1:OrderStartDate],"Days"))
First Summary tool:
Group By OrderItem ...
Group By Gap ...
Sum OrderQty to new output field OrdersPerGap
a. Top branch, Summary tool:
Group By OrderItem ...
Sum OrdersPerGap to output field name OrderQty ...
Count OrderItem to output field name NumOrders
b. Bottom branch, a simple Filter on Gap > 1 and then another Summary:
Group By OrderItem ...
Avg Gap to new output field AvgGap
Join the two streams back together on OrderItem and exclude Right_OrderItem from the output (uncheck its checkbox).
In Alteryx, this provides the output requested. There may be other ways, but this is straightforward without too much going on in any step.
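For readers more comfortable in pandas, the Alteryx workflow above translates roughly as follows. This is a sketch using toy data built from the question's example (same-day start and finish dates, a single item named X for illustration):

```python
import pandas as pd

# Toy data from the example: same-day orders on 1-4, 10-11, and 20 March
df = pd.DataFrame({
    "OrderItem": ["X"] * 7,
    "OrderQty": [10, 20, 30, 40, 50, 60, 70],
    "OrderStartDate": pd.to_datetime(["2019-03-01", "2019-03-02", "2019-03-03",
                                      "2019-03-04", "2019-03-10", "2019-03-11",
                                      "2019-03-20"])})
df["OrderEndDate"] = df["OrderStartDate"]
df = df.sort_values(["OrderItem", "OrderStartDate"])

# Equivalent of the Multi-Row Formula: day gap to the previous row per item
gap = df.groupby("OrderItem")["OrderStartDate"].diff().dt.days.fillna(1)
# A new order "block" starts whenever the gap exceeds one day
block = gap.gt(1).groupby(df["OrderItem"]).cumsum().rename("Block")

# First Summary: one row per contiguous block of consecutive-day orders
per_block = df.groupby(["OrderItem", block]).agg(
    OrderQty=("OrderQty", "sum"),
    BlockStart=("OrderStartDate", "first"),
    BlockEnd=("OrderEndDate", "last"))

# Top branch: count blocks and total quantity per item
summary = per_block.groupby(level="OrderItem").agg(
    NumOrders=("OrderQty", "size"), OrderQty=("OrderQty", "sum"))

# Bottom branch: gap from each block's finish to the next block's start
between = (per_block["BlockStart"].groupby(level="OrderItem").shift(-1)
           - per_block["BlockEnd"]).dt.days.dropna()
summary["AvgGap"] = between.groupby(level="OrderItem").mean()
print(summary)
```

On the example data this yields 3 consolidated orders, 280 total units, and an average gap of 7.5 days, matching the arithmetic in the question.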

How to use assertions in python 3 using a pandas Data Frame?

I have two columns in a pandas DataFrame, and I want to perform an assertion to see whether they're equal, or one is greater than the other, or some other boolean test passes on the two columns.
Right now I'm doing something like this:
# Roll the fields up so we can compare both reports.
# Goal: Show that `Gross Sales per Bar` is equal to `Gross Sales per Category`
#
# Do a GROUP BY of all the service bars and sum their Gross Sales per Bar
# Since the same value should be in this field for every 'Gross Sales per Bar' field,
# grab the first one, so we can compare them below
df_bar_sum = sbbac.groupby(['Bar'], as_index=False)['Gross Sales per Bar'].first()
df_bar_sum2 = sbbac.groupby(['Bar'], as_index=False)['Gross Sales per Category'].sum()
# Rename the 'Gross Sales per Category' column to 'Summed Gross Sales per Category'
df_bar_sum2.rename(columns={'Gross Sales per Category':'Summed Gross Sales per Category'}, inplace=True)
# Add the 'Gross Sales per Bar' column to the df_bar_sum2 Data Frame.
df_bar_sum2['Gross Sales per Bar'] = df_bar_sum['Gross Sales per Bar']
# See if they match...they should since the value of 'Gross Sales per Bar' should be equal to 'Gross Sales per Category' summed.
df_bar_sum2['GrossSalesPerCat_GrossSalesPerBar_eq'] = df_bar_sum2.apply(lambda row: 1 if row['Summed Gross Sales per Category'] == row['Gross Sales per Bar'] else 0, axis=1)
# Print the result
df_bar_sum2
And I just end up with a column that equals 1 if it matches and 0 if it doesn't.
I'd like to use assertions here to test if they match or not, since that'll cause the whole thing to crap out when doing tests if they don't match with some sort of an error displayed; maybe that's not a good way to do it for tabular data, I'm not sure, but if it is a good idea, I'd rather use assertions to compare them.
It may also be harder to read with assertions, which would be bad, I'm not sure...
import numpy as np

assert np.allclose(your_df['Summed Gross Sales per Category'],
                   your_df['Gross Sales per Bar'])
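As a minimal self-contained sketch (toy data; the real DataFrame is `df_bar_sum2` from the question), the same check can also be written with pandas' own testing helper, which prints a detailed diff on failure:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_bar_sum2 with the two columns being compared
df = pd.DataFrame({
    "Summed Gross Sales per Category": [100.0, 250.0, 75.0],
    "Gross Sales per Bar": [100.0, 250.0, 75.0]})

# Tolerance-based check, as in the answer above
assert np.allclose(df["Summed Gross Sales per Category"],
                   df["Gross Sales per Bar"])

# pandas' testing helper raises with a detailed report when values differ
pd.testing.assert_series_equal(
    df["Summed Gross Sales per Category"],
    df["Gross Sales per Bar"],
    check_names=False)
print("columns match")
```

Either one will stop a test run dead when the columns diverge, which is the behavior the question is after.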

Print the first value of a dataframe based on condition, then iterate to the next sequence

I'm looking to perform data analysis on 100-years of climatological data for select U.S. locations (8 in particular), for each day spanning the 100-years. I have a pandas dataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, Precip Total, and then Day, Year, and Month values (then, I have an index also based on a date-time value). Right now, I want to set up a for loop to print the first Maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually, I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work.
Experimented with various iterations of a for loop.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
Unsurprisingly, the output of the loop I provided prints the first 90 degree day period (from the year 1919, the beginning of my data frame) and breaks.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break

1919-06-12 00:00:00
That's fine. If I take out the break statement, all of the 90 degree days print, including multiple in the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the year? If I explicitly state the year, ala below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out of bounds index. I know this logic is incorrect.
count = 1919
while count < 2019:
    for year in range(len(climate['Year'])):
        if (climate[climate['Year'] == count]['Max'][year] >= 90).all():
            print(climate.index[year])
    count = count + 1
Any input is sincerely appreciated.
You can achieve this without having a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:
current_year = None
for i in range(climate.shape[0]):
    if climate['Max'][i] >= 90 and climate['Year'][i] != current_year:
        print(climate.index[i])
        current_year = climate['Year'][i]
Notice that we're using the current_year variable to keep track of the latest year that we've already printed the result for. Then, in the if check, we're checking if we've already printed a result for the year of the current row in the loop.
That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe that contains the first >=90 max days per year with the following (again assuming climate is ordered chronologically):
climate[climate.Max >= 90].groupby('Year').first()
This just filters the dataframe to only contain the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same except per location per year:
climate[climate.Max >= 90].groupby(['Location', 'Year']).first()
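One caveat worth a toy example: `.first()` after `groupby('Year')` drops the DataFrame's datetime index, so if you want the dates of those first hot days back, reset the index into a column first (the data below is made up; only the column names follow the question):

```python
import pandas as pd

# Made-up stand-in for the 100-year dataset, with a date-time index
idx = pd.to_datetime(["1919-06-10", "1919-06-12", "1919-07-01",
                      "1920-05-03", "1920-08-15"])
climate = pd.DataFrame({"Year": idx.year,
                        "Max": [85, 92, 95, 91, 99]}, index=idx)

# reset_index() turns the dates into a column so .first() keeps them
first_hot = climate[climate.Max >= 90].reset_index().groupby("Year").first()
print(first_hot)
```

Each year now contributes exactly one row: the date and Max of its first 90-degree day.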

Count number of sporadic events matched to recurring events [excel]

I have two tables of different length:
a table of monthly data (e.g.: value of inventory at start of month)
a table of sporadic events which happened at random points throughout the year (e.g.: truck delivery to storage)
In table one, I would like to count, in an extra column, the number of events from table 2 that occurred in that month. The table with the value of the inventory would show a count per row of how many trucks were unloaded.
I've been fighting with countifs but I just cannot get it to work due to different table lengths, the weird way to enter criteria etc.
I've tried to match the month and year of a truck delivery with the period in the inventory table.
=COUNTIFS(
    <range: Dates of Truck deliveries from Table2>, "=MONTH(" & <cellOfInventoryDate> & ")",
    <range: Dates of Truck deliveries from Table2>, "=YEAR(" & <cellOfInventoryDate> & ")")
I have a feeling there is a simple solution to this and I just hit a wall.
Thanks
Table 1 - Inventory at start of month
01/01/2015 1000
01/02/2015 1200
01/03/2015 1100
01/04/2015 900
...
Table 2 - Date of Truck Delivery
01/01/2015
04/02/2015
07/02/2015
03/04/2015
11/07/2015
OK, so here's the answer.
I created a helper column in Table 2 which normalises the dates to the first of the month
=EOMONTH(cellwithdate,-1)+1
then I used countif (not countifs) to count when the date of the inventory matches the helper column.
=COUNTIF(helpercolumn, dateOfInventory)
This then counts how many deliveries were made in that inventory row's month.
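The same normalise-then-count idea carries over directly to pandas, shown here as a sketch on the sample tables above:

```python
import pandas as pd

# Table 1: inventory at the start of each month
inventory = pd.DataFrame({
    "Month": pd.to_datetime(["2015-01-01", "2015-02-01",
                             "2015-03-01", "2015-04-01"]),
    "Value": [1000, 1200, 1100, 900]})

# Table 2: sporadic truck delivery dates
deliveries = pd.to_datetime(["2015-01-01", "2015-02-04", "2015-02-07",
                             "2015-04-03", "2015-07-11"])

# Same as the helper column: snap each delivery to the first of its month
delivery_month = pd.Series(deliveries).dt.to_period("M").dt.to_timestamp()

# Same as the COUNTIF: count deliveries per normalised month, then look up
counts = delivery_month.value_counts()
inventory["Deliveries"] = inventory["Month"].map(counts).fillna(0).astype(int)
print(inventory)
```

Months with no deliveries (March here) get an explicit 0 via `fillna`, just as COUNTIF would return 0 for an unmatched month.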

Dax Rolling Average with multiple records per day

Take a simple table
SalesTime
Product
UnitsSold
There is one row per sale. So there are multiple rows per day. I need a chart that will show the average units sold per sale over 7 days and average units sold per day over 7 days.
The examples that I found all used DATESBETWEEN or DATESINPERIOD and those throw an error if the table has multiple records per date.
I will name this table Sales and assume that Sales[SalesTime] is a date type rather than a datetime type. If not, create a new calculated column
Sales[SalesDate] = Sales[SalesTime].[Date]
and work with that instead.
Your rolling average units per sale can be calculated something like this:
AvgUnitsPerSale =
VAR CurrDay = MIN(Sales[SalesTime])
RETURN
    CALCULATE(
        AVERAGE(Sales[UnitsSold]),
        DATESBETWEEN(Sales[SalesTime], CurrDay - 7, CurrDay)
    )
You can get an average count of sales per day by using COUNT instead of AVERAGE. To get the average units sold per day, multiply the average count of sales and the average units per sale.
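Outside of DAX, the two measures are easy to sanity-check in pandas on toy data. The trick for multiple records per day is the same as in the answer: aggregate to one row per day first, then apply a time-based rolling window:

```python
import pandas as pd

# Toy sales table: several sales per day, as in the question
sales = pd.DataFrame({
    "SalesTime": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02",
                                 "2021-01-02", "2021-01-03"]),
    "UnitsSold": [2, 4, 6, 2, 1]})

# One row per day: mean = units per sale that day, sum = units sold that day
daily = sales.groupby(sales["SalesTime"].dt.normalize())["UnitsSold"].agg(["mean", "sum"])

# Trailing 7-day rolling average of both measures
rolling = daily.rolling("7D").mean()
print(rolling)
```

Because `rolling("7D")` windows by the datetime index rather than by row count, it tolerates gaps in the calendar, and the daily pre-aggregation removes the multiple-records-per-date problem entirely.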
