NetSuite Saved Search Running Total Amount Based On Running Average Cost * Transaction Quantity - netsuite

I am building a saved search in NetSuite (it is currently an Inventory Detail search). Here is what I am trying to do, in simple terms:
Quantity per transaction = pulled from the transaction
Running total quantity = based on quantity per transaction (add/subtract)
Rate per transaction = pulled from the transaction
Running average cost = calculated by running total amount / running total quantity
Amount per transaction = IF transaction amount = 0 THEN quantity per transaction * running average cost ELSE pull transaction amount
Running total amount = calculated by quantity per transaction * (rate per transaction OR running average cost if rate is null)
I am able to get all of the above fields except for the running total amount. I am using a trick I found online: 'SUM/* comment */(x)...'
Here is what I'm trying and failing at:
SUM/* comment */(
  CASE WHEN {transaction.amount} = 0
       THEN {itemcount} * NVL(ROUND(
              (SUM/* comment */({transaction.amount}) OVER(PARTITION BY {item} ORDER BY {transaction.datecreated} ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
              / (SUM/* comment */({itemcount}) OVER(PARTITION BY {item} ORDER BY {transaction.datecreated} ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)), 5), 0)
       ELSE {transaction.amount}
  END
) OVER(PARTITION BY {item} ORDER BY {transaction.datecreated} ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
I believe I know why it's failing: you can't nest a SUM inside a SUM. But I am not sure of another way. I had the idea of declaring a variable to store the average cost and amount in and then calling them, but I don't think you can declare variables in a NetSuite saved search.
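To make the circular dependency concrete, here is a minimal Python sketch (made-up numbers, not NetSuite formula syntax) of the row-by-row logic described above: each row's amount can depend on the running average cost built from all prior rows, which is why a single nested window SUM is not enough.

transactions = [
    {"qty": 10, "rate": 5.0,  "amount": 50.0},   # receipt at a known rate
    {"qty": -4, "rate": None, "amount": 0.0},    # issue costed at the running average
    {"qty": 6,  "rate": 6.0,  "amount": 36.0},
]

running_qty = 0.0
running_amount = 0.0
for t in transactions:
    avg_cost = running_amount / running_qty if running_qty else 0.0
    # Amount per transaction: fall back to qty * running average cost when the amount is 0
    amount = t["qty"] * avg_cost if t["amount"] == 0 else t["amount"]
    # Running total amount: qty * (rate, or the running average cost when rate is null)
    running_amount += t["qty"] * (t["rate"] if t["rate"] is not None else avg_cost)
    running_qty += t["qty"]
    print(round(avg_cost, 5), round(amount, 2), round(running_qty, 2), round(running_amount, 2))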

Related

How to track number of distinct values incrementally from a spark table?

Suppose we have a very large table that we'd like to process statistics for incrementally.
Date        Amount  Customer
2022-12-20  30      Mary
2022-12-21  12      Mary
2022-12-20  12      Bob
2022-12-21  15      Bob
2022-12-22  15      Alice
We'd like to be able to calculate incrementally how much we made per distinct customer for a date range. So from 12-20 to 12-22 (inclusive) we'd have 3 distinct customers, but from 12-20 to 12-21 there are only 2 distinct customers.
If we want to run this pipeline once a day and there are many customers, how can we keep a rolling count of distinct customers for an arbitrary date range? Is there a way to do this without storing a huge list of customer names for each day?
We'd like to support a frontend that has a date range filter and can quickly calculate results for that date range. For example:
Start Date  End Date    Average Income Per Customer
2022-12-20  2022-12-21  (30+12+12+15)/2 = 34.5
2022-12-20  2022-12-22  (30+12+12+15+15)/3 = 28
The only approach I can think of is to store a set of customer names for each day, and when viewing the results compute the size of the union of those sets to get the distinct customer count. This seems inefficient. In this case we'd store the following table, with the Customers column being extremely large.
Date        Total Income  Customers
2022-12-20  42            set(Mary, Bob)
2022-12-21  27            set(Mary, Bob)
2022-12-22  15            set(Alice)
For me the best solution is to do some pre-calculation on the existing data; then, for the new data that comes in every day, do the calculation only on the new data and add the results to the previously calculated data. Also partition on the date column, since we filter on dates: this triggers Spark's push-down filters and accelerates the queries.
There are two parts: one to get the sum of amounts between two dates, and another for the distinct customers between two dates:
For the amount, use a prefix sum by adding the sum of all previous days to the last day; then, to get the difference between two dates, you can simply subtract those two days without looping over all the dates in between.
For distinct customers, the best approach I can think of is to save the Date and Customer columns to a new file, partitioned by date (which helps optimize the queries), and then use the fast approx_count_distinct.
Here's some code:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, lag, lit, approx_count_distinct, sum as spark_sum

spark = SparkSession.builder.master("local[*]").getOrCreate()
data = [
    ["2022-12-20", 30, "Mary"],
    ["2022-12-21", 12, "Mary"],
    ["2022-12-20", 12, "Bob"],
    ["2022-12-21", 15, "Bob"],
    ["2022-12-22", 15, "Alice"],
]
df = spark.createDataFrame(data).toDF("Date", "Amount", "Customer")

def init_amount_data(df):
    # prefix (running) sum of the daily amounts, plus the previous day's running sum
    w = Window.orderBy(col("Date"))
    amount_sum_df = df.groupby("Date").agg(spark_sum("Amount").alias("Amount")) \
        .withColumn("amount_sum", spark_sum(col("Amount")).over(w)) \
        .withColumn("prev_amount_sum", lag("amount_sum", 1, 0).over(w)) \
        .select("Date", "amount_sum", "prev_amount_sum")
    amount_sum_df.write.mode("overwrite").partitionBy("Date").parquet("./path/amount_data_df")
    amount_sum_df.show(truncate=False)

# keep only the customer data to avoid unnecessary data when querying; partitioning by Date
# makes the query faster thanks to Spark's filter push-down mechanism
def init_customers_data(df):
    df.select("Date", "Customer").write.mode("overwrite").partitionBy("Date").parquet("./path/customers_data_df")

# each day (for example at midnight) update the amount data with yesterday's data only:
# take the last amount_sum and add yesterday's amount to it
def update_amount_data(last_partition):
    amount_data_df = spark.read.parquet("./path/amount_data_df")
    max_date = getMaxDate("./path/amount_data_df")  # implement a Hadoop FS method to get the last partition date
    last_max_partition = amount_data_df.filter(col("Date") == max_date)
    last_partition_amount_sum = last_max_partition.select("amount_sum").first()[0]
    yesterday_amount_sum = last_partition.groupby("Date").agg(spark_sum("Amount").alias("amount_sum"))
    new_partition = yesterday_amount_sum.withColumn("amount_sum", col("amount_sum") + last_partition_amount_sum) \
        .withColumn("prev_amount_sum", lit(last_partition_amount_sum))
    new_partition.write.mode("append").partitionBy("Date").parquet("./path/amount_data_df")

def update_customers_data(last_partition):
    last_partition.select("Date", "Customer").write.mode("append").partitionBy("Date").parquet("./path/customers_data_df")

def query_amount_date(beginDate, endDate):
    amount_data_df = spark.read.parquet("./path/amount_data_df")
    end_date_amount = amount_data_df.filter(col("Date") == endDate).select("amount_sum").first()[0]
    begin_date_prev_sum = amount_data_df.filter(col("Date") == beginDate).select("prev_amount_sum").first()[0]
    return end_date_amount - begin_date_prev_sum

def query_customers_date(beginDate, endDate):
    customers_data_df = spark.read.parquet("./path/customers_data_df")
    return customers_data_df.filter(col("Date").between(lit(beginDate), lit(endDate))) \
        .agg(approx_count_distinct(col("Customer")).alias("distinct_customers")).first()[0]

# This should be executed the first time only
init_amount_data(df)
init_customers_data(df)

# This should be executed every day at midnight, with the last day's data only
last_day_partition = df.filter(col("Date") == yesterday_date)  # yesterday_date must be supplied by the scheduler
update_amount_data(last_day_partition)
update_customers_data(last_day_partition)

# Optimized queries for an arbitrary date range
beginDate = "2022-12-20"
endDate = "2022-12-22"
answer = query_amount_date(beginDate, endDate) / query_customers_date(beginDate, endDate)
print(answer)
If calculating the distinct customers is not fast enough, there is another approach: keep the same kind of pre-computed running count of distinct customers, plus a second table of the customers already seen. Each day, if a new customer appears, increment the running count and add that customer to the second table; otherwise do nothing.
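A rough PySpark sketch of that idea (the seen-customers path and helper name are assumptions, not part of the original answer; it maintains one cumulative count from the start of the data):

# Rough sketch: maintain a table of customers already seen, and count only the new ones each day.
def update_distinct_customers(last_partition):
    seen = spark.read.parquet("./path/seen_customers_df")              # customers observed so far (assumed path, initialized once)
    todays = last_partition.select("Customer").distinct()
    new_customers = todays.join(seen, on="Customer", how="left_anti")  # customers never seen before
    new_count = new_customers.count()
    new_customers.write.mode("append").parquet("./path/seen_customers_df")
    return new_count  # add this to the stored running total of distinct customers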
Finally, there are some tricks for optimizing the groupBy or window functions, such as salting or extended partitioning.
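For illustration, here is a minimal sketch of the salting trick for a skewed groupBy, reusing the df and column names from the code above (the bucket count of 10 is arbitrary):

from pyspark.sql import functions as F

salted_daily_sum = (df
    .withColumn("salt", (F.rand() * 10).cast("int"))                    # spread each Date key over ~10 random buckets
    .groupBy("Date", "salt").agg(F.sum("Amount").alias("partial_sum"))  # stage 1: partial sums per bucket
    .groupBy("Date").agg(F.sum("partial_sum").alias("Amount")))         # stage 2: combine the buckets
salted_daily_sum.show()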
You can achieve this by filtering rows with dates between start_date and end_date, then grouping by customer and calculating the sum of amounts, and then taking the average of those sums. This approach works for only one start_date/end_date pair, so you would run the code with different parameters for different date ranges:
from pyspark.sql import functions as F

start_date = '2022-12-20'
end_date = '2022-12-21'
(
    df
    .withColumn('isInRange', F.col('Date').between(start_date, end_date))
    .filter(F.col('isInRange'))
    .groupby('Customer')
    .agg(F.sum('Amount').alias('sum'))
    .agg(F.avg('sum').alias('avg income'))
).show()

How do I group datetimes with a sqlite windowing function?

Let's say I have a table with the following fields:
customerid, transactiontime, transactiontype
I want to group a customer's transactions by time, and select the customerid and the count of those transactions. But rather than simply grouping all transaction times into fixed increments (15 min, 30 min, etc.), for which I've seen various solutions here, I'd like to group a customer's transactions based on how soon each transaction occurs after the previous one.
In other words, if any transaction occurs more than 15 minutes after a previous transaction, I'd like it to be grouped separately.
I expect the customer to generate a few transactions close together, and potentially generate a few more later in the day. So if those two sets of transactions occur more than 15 or 30 minutes apart, they'll be grouped into separate windows. Is this possible?
Yes, you can do this using a window function in SQLite. This syntax is a bit new to me, but this is how it would start:
select customer_id,
       event_start_minute,
       sum(subgroup_start) over (order by customer_id, event_start_minute) as subgroup
from (
    select customer_id,
           event_start_minute,
           case
               when event_start_minute - lag(event_start_minute) over win > 15
               then 1
               else 0
           end as subgroup_start
    from t1
    window win as (
        partition by customer_id
        order by event_start_minute
    )
) as groups
order by customer_id, event_start_minute
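Since the question has raw datetimes rather than a precomputed minute column, here is a runnable sketch of the same gap-based grouping using Python's built-in sqlite3 module; the table name, sample rows, and the julianday arithmetic for the 15-minute gap are illustrative assumptions (requires SQLite 3.25+ for window functions):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table transactions (customerid int, transactiontime text, transactiontype text);
insert into transactions values
  (1, '2023-01-05 09:00:00', 'purchase'),
  (1, '2023-01-05 09:05:00', 'refund'),
  (1, '2023-01-05 13:00:00', 'purchase'),
  (2, '2023-01-05 09:20:00', 'purchase');
""")

rows = con.execute("""
select customerid, count(*) as transactions_in_group
from (
    select customerid,
           -- running sum of 'new group' flags gives each burst of activity its own id
           sum(new_group) over (partition by customerid order by transactiontime) as group_id
    from (
        select customerid, transactiontime,
               -- flag a new group when the gap to the previous transaction exceeds 15 minutes
               case when (julianday(transactiontime) - julianday(prev_time)) * 24 * 60 > 15
                    then 1 else 0 end as new_group
        from (
            select customerid, transactiontime,
                   lag(transactiontime) over (partition by customerid order by transactiontime) as prev_time
            from transactions
        )
    )
)
group by customerid, group_id
""").fetchall()
print(rows)  # e.g. [(1, 2), (1, 1), (2, 1)]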

Azure Stream Analytics : remove duplicates while aggregating

I'm working on a system of temperature and pressure sensors, where my data flows through a Stream Analytics job. There may be duplicate messages sent because acknowledgements were not received, among various other reasons. So my data could be of the format:
DeviceID  TimeStamp  MeasurementName  Value
1         1          temperature      50
1         1          temperature      50
1         2          temperature      60
Note that the 2nd record is a duplicate of the 1st, since DeviceID, TimeStamp, and MeasurementName are all the same.
I wish to take an average over 5 min tumbling window for this data in the stream analytics job. So I have this query
SELECT
AVG(Value)
FROM
SensorData
GROUP BY
DeviceId,
MeasurementName,
TumblingWindow(minute, 5)
This query is expected to give me the average temperature and pressure measurements for each device over 5 minutes.
In doing this average I need to eliminate duplicates. The actual average is (50+60)/2 = 55.
But the average given by this query will be (50+50+60)/3 = 53.33.
How do I tweak this query for the right output?
Thanks in advance.
According to the Query Language Elements in ASA, it seems that DISTINCT is not supported directly. However, it can be used with COUNT.
So maybe you could refer to my SQL below to get the average of Value without duplicate data.
with temp as
(
    select count(distinct DeviceID) AS device,
           count(distinct TimeStamp) AS time,
           count(distinct MeasurementName) AS name,
           Value as v
    from jsoninput
    group by Value, TumblingWindow(minute, 5)
)
select avg(v) from temp
group by TumblingWindow(minute, 5)
Output with your sample data:
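For comparison, here is a small Python sketch (outside ASA, using the sample rows from the question) of the dedup-then-average result being targeted, i.e. collapsing exact (DeviceID, TimeStamp, MeasurementName) duplicates before averaging:

# Drop exact duplicates on the identifying columns before averaging per (device, measurement).
readings = [
    (1, 1, "temperature", 50),
    (1, 1, "temperature", 50),   # duplicate of the row above
    (1, 2, "temperature", 60),
]

deduped = {}
for device_id, ts, name, value in readings:
    deduped[(device_id, ts, name)] = value   # duplicates collapse onto the same key

by_key = {}
for (device_id, ts, name), value in deduped.items():
    by_key.setdefault((device_id, name), []).append(value)

for key, values in by_key.items():
    print(key, sum(values) / len(values))    # (1, 'temperature') 55.0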

DAX Measure: IF sum of all users = Max THEN return value of individual user

Hi Everyone,
I am new to DAX measures and I am trying to get my measure to have the following logic:
When sum of all users = Max, THEN return value of individual user
The Data Model has the following columns: CustomerID \ Usage \ Interval (DATETIME). What is tripping me up is that DATETIME is in 15-minute increments. I have approximately 700 unique CustomerIDs and I need to be able to return the usage of each CustomerID during the MAXSUM of all the CustomerIDs.
I am not sure if this would be an IF-THEN statement or if I need to use a time function. I am writing this DAX measure in Power Pivot to send to a Pivot Table within Excel.
Thanks in advance,
This may not be the simplest solution, but it works for me (I've named the table Load).
CustomerUsage =
VAR Summarized =
    CALCULATETABLE(
        SUMMARIZE(Load, Load[Interval], "TotalUsage", SUM(Load[Usage])),
        ALLSELECTED(Load)
    )
VAR MaxUsage = MAXX(Summarized, [TotalUsage])
VAR MaxInterval = MAXX(FILTER(Summarized, [TotalUsage] = MaxUsage), [Interval])
RETURN
    CALCULATE(SUM(Load[Usage]), Load[Interval] = MaxInterval)
First, you generate a table that finds the total usage per interval by summing over all selected users.
The MaxUsage is simply the largest TotalUsage over all of the intervals.
The MaxInterval is the interval for which the TotalUsage = MaxUsage.
Then we can find the Usage for a customer for the interval with maximum usage.
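To illustrate the same logic outside DAX, here is a small pandas sketch on made-up data mirroring the CustomerID / Usage / Interval columns from the question:

import pandas as pd

# Hypothetical 15-minute interval data standing in for the Load table.
load = pd.DataFrame({
    "CustomerID": ["A", "B", "A", "B"],
    "Interval":   ["2023-01-01 00:00", "2023-01-01 00:00", "2023-01-01 00:15", "2023-01-01 00:15"],
    "Usage":      [5, 7, 9, 2],
})

# Total usage per interval over all customers (the SUMMARIZE step).
totals = load.groupby("Interval")["Usage"].sum()

# The interval where the combined usage peaks (MaxUsage / MaxInterval).
max_interval = totals.idxmax()

# Each customer's usage during that peak interval (the final CALCULATE).
print(load[load["Interval"] == max_interval].groupby("CustomerID")["Usage"].sum())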

Level of Detail on the primary data sources in tableau

I have one Excel file that contains the demand for each part by city:
e.g., the demand for part a is 100 for New York and 1+7=8 for Atlanta.
I have another Excel file containing the inventory level for two warehouses, rural and urban:
e.g., warehouse "Rural" stocks 50 of part a and warehouse "Urban" stocks zero of part c.
First I joined these two Excel files, with the demand file as the primary data source.
I googled LOD (level of detail) expressions in order to work out the inventory fulfillment for each warehouse by city:
-- count the number of unique parts by each city for the demand:
calculated field [a] = { fixed [City]: countd([Part Number demand]) }
-- count the number of parts that are in stock (inventory level>0) by each warehouse:
calculated field [b] = { fixed [City],[Warehouse Location],[Part Number volume]: countd (if [Inventory Level] > 0 then [Part Number demand] end )}
-- calculate the inventory fulfillment %:
calculated field [c] = calculated field [a] / calculated field [b]
and I got the following table, and I think it shows the correct fulfillment % by warehouse for each city: e.g., warehouse "Rural" stocks 33% of the unique parts needed by Atlanta.
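As a cross-check of that fulfillment logic, here is a small pandas sketch on made-up demand and inventory data (the column names and values are illustrative, not the actual Tableau fields); note the ratio here is computed as in-stock parts over demanded parts, to match the 33% example:

import pandas as pd

# Hypothetical demand (primary source) and inventory (secondary source) tables.
demand = pd.DataFrame({
    "City":   ["Atlanta", "Atlanta", "Atlanta", "New York"],
    "Part":   ["a", "b", "c", "a"],
    "Volume": [1, 7, 2, 100],
})
inventory = pd.DataFrame({
    "Warehouse":       ["Rural", "Rural", "Urban"],
    "Part":            ["a", "b", "c"],
    "Inventory Level": [50, 0, 0],
})

joined = demand.merge(inventory, on="Part", how="left")

# [a]: unique parts demanded per city
a = joined.groupby("City")["Part"].nunique()
# [b]: unique demanded parts that are actually in stock, per city and warehouse
b = joined[joined["Inventory Level"] > 0].groupby(["City", "Warehouse"])["Part"].nunique()
# fulfillment %: in-stock parts over demanded parts (Atlanta / Rural comes out to 1/3)
print(b.div(a, level="City"))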
Question 1: as I include more part numbers in the Excel file, I only want to consider the top 10 parts by volume needed for each city. I was trying to do the same thing with LOD to first find the total quantity needed per part per city:
{fixed [City], [Part Number demand]: sum([Part Number volume]) }
But it counts the quantity from both Excel files, and I am wondering if it is possible to count the quantity from the primary file only (demand, not inventory).
Question 2: once I can count the total quantity needed, how do I turn it into a filter so that I can select only the top 10 parts by demand?
Apologies if these questions are dumb, and I appreciate any advice!
