I'm trying to understand the best practices around storing aggregated time-series data.
For instance, if I am building a weather service application that ingests lots of weather metrics from sensors around the world and stores that data as the weather for the day, the week, and the month, what's a good way to model that?
Would the day level, week level, and month level each have their own column family?
Then there's the factor of location. Each location would have its own weather data, so would partitioning by, say, a zip code or geohash for a specific area make sense?
The access patterns would be querying for the daily or weekly or monthly weather in a city.
Let's say every 5 minutes. Would that have an impact on the design?
Yes. Sensor updates every 5 minutes work out to 12 writes per hour, or 288 per day.
That also makes 2,016 per week and 8,640 per month (30 days). The reason this is important is that Cassandra has a hard limit of 2 billion cells per partition (and partitions anywhere near 2GB are already unworkable), which means that storing time series data by city alone would eventually hit those limits (although things would likely grind to a halt long before that).
But the general idea is that you want to model your tables around:
How you're going to query your data.
Avoiding unlimited partition growth.
So if we're just talking about temperatures and maybe a few other data points (precipitation, etc), partitioning by month and city should work just fine.
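As a rough sanity check with the 5-minute cadence above: 288 readings per day × 31 days ≈ 8,928 rows per (city, month) partition, and with a handful of columns per row that's on the order of tens of thousands of cells, comfortably below the per-partition limits.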
CREATE TABLE weather_sensor_data (
    city TEXT,
    month INT,
    temp FLOAT,
    recorded_time TIMESTAMP,
    PRIMARY KEY ((city, month), recorded_time))
WITH CLUSTERING ORDER BY (recorded_time DESC);
Now, I could query for weather sensor data since 8AM, like this:
> SELECT * FROM weather_sensor_data
WHERE city='Minneapolis, MN'
AND month=202111
AND recorded_time > '2021-11-01 08:00';
city | month | recorded_time | temp
-----------------+--------+---------------------------------+------
Minneapolis, MN | 202111 | 2021-11-01 08:35:00.000000+0000 | 3
Minneapolis, MN | 202111 | 2021-11-01 08:30:00.000000+0000 | 3
Minneapolis, MN | 202111 | 2021-11-01 08:25:00.000000+0000 | 2
Minneapolis, MN | 202111 | 2021-11-01 08:20:00.000000+0000 | 2
Minneapolis, MN | 202111 | 2021-11-01 08:15:00.000000+0000 | 2
(5 rows)
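The weekly and monthly access patterns run against this same table using a range on the clustering column; a sketch (note that a week spanning two months would need one query per month partition):
> SELECT * FROM weather_sensor_data
  WHERE city='Minneapolis, MN'
  AND month=202111
  AND recorded_time >= '2021-11-01 00:00'
  AND recorded_time < '2021-11-08 00:00';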
This should help you get started.
@dipen, you could also refer to this documentation, which walks developers through various data models by use case. @AlexOtt has great starting questions for the data models for your use case, and @aaron has a great example demonstration.
Here is an example; you could very much customize it for your weather use case. For a given access-pattern requirement like in the example below, we would design a Cassandra table as follows to answer it.
I have a Table of homework submission hours from multiple schools.
The table:
| Student | School | Homework submission hours (timestamp) |
| ------- | ------ | ------------------------------------- |
How can I get, for each school, the percentage of students who took more than an hour to submit their homework?
Since it is homework (possibly), I don't want to give a full concrete example, but I will explain some ways to get where you want to go.
The countif aggregation function can be used to count rows that meet a condition.
Is hours in your table a duration or a literal timestamp? The question says "more than an hour", but does it mean "more than an hour ago"? And the grid says "submission hours" but then says it is a timestamp; it probably isn't both.
So you could have
summarize countif(timestamp < ago(1h)), which would count the number of items that occurred more than an hour ago (the timestamp is before an hour ago),
or possibly summarize countif(hours > 1h) if the hours value is a timespan, not a timestamp.
You can use that example to add a by to the summarize, e.g. by School.
To calculate a percent, you need the full count and the matching count, so you'd probably have something like
| summarize total=count(), subset=countif(somecriteria) by something
| project something, percent = subset / (total * 1.0) * 100
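Putting those pieces together, a complete query could look something like this (the table name Submissions and the column names are assumptions about your schema):
Submissions
| summarize total=count(), over_an_hour=countif(SubmissionDuration > 1h) by School
| project School, percent = over_an_hour * 100.0 / total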
Things to think about:
Can the same student go to more than one school?
Can the same student submit hours more than once?
If so, you might also need to look at dcount instead of count, or different summarizations like avg as well.
That will really depend on what's in the rows.
When using the summary function "Group" in Saved Searches (SS), is it possible to show the total count inline or even use it inline? For example, I have a SS that counts the number of cases closed in a certain date range and it groups by the assigned employee and the total is listed at the bottom, as per usual. However, when trying to calculate the percent of the total each employee closed, they all show as 100%.
Here is a picture of the results and I have also added the formulas I am currently using. Here is what it looks like when I'm editing the search. The right most columns were my attempt at getting the total inline.
I'm fairly certain this is because I am grouping by the employees (or else there would be almost 3k lines in the report), but I don't think there is a better way to solve that problem other than grouping by the employee.
We have tried doing an actual report in NetSuite (as opposed to saved search), however, the report times out quickly and we are hoping for a quicker solution. We also considered a KPI scorecard, but the issue would be that we would need to make a SS for each employee which isn't a good long-term solution due to team changes.
Is there a way of calculating the percent of the total when using grouping? Sorry for the long post, I was trying to be as descriptive as possible. The goal is to see how much (percent wise) each employee contributed to the total cases closed.
Take your SUM column that counts the closed cases and duplicate it, but add the function "% of Total".
So:
| Field           | Summary Type | Function   | Formula                                              |
| --------------- | ------------ | ---------- | ---------------------------------------------------- |
| Formula Numeric | Sum          |            | Case When {status} like 'Closed%' then 1 else 0 End  |
| Formula Numeric | Sum          | % of Total | Case When {status} like 'Closed%' then 1 else 0 End  |
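With both columns in place, each grouped row shows its own closed count next to its share of the grand total, for example (hypothetical numbers):
Employee A | 120 | 40%
Employee B | 180 | 60%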
I have rankings by country and month, and I need to reset the ranking at the beginning of the month: the 1st day at 00:00.
I have a cron task every 15 minutes that checks whether, in the timezone of the country, it is day 1 of the month at 0:00 (because some timezones have offsets like +x:30 or +x:45). But what is the criterion when the country has multiple timezones?
For example, when is the start time of Black Friday in a country like the USA that has multiple timezones?
Welcome to the wonderful, horrible world of time! ;-)
In the USA we don't have a unique start time for "Black Friday". We don't even all celebrate New Year's at the same moment. Instead we follow our local time zone rules. Not only are there multiple time zone rules covering this single country; some individual states even observe more than one time zone (e.g. Tennessee).
Here is a global map of time zones.
For example, when is the start time of Black Friday in a country like the USA that has multiple timezones?
Assuming you mean the shopping holiday that follows Thanksgiving Day, which falls this year on Friday, November 23rd 2018:
Time Zone | Local Start Time | UTC Equivalent
----------------------------+------------------------------|------------------------
Guam & Northern Mariana Is. | 2018-11-23T00:00:00+10:00 | 2018-11-22T14:00:00Z
Atlantic (PR, VI) | 2018-11-23T00:00:00-04:00 | 2018-11-23T04:00:00Z
Eastern | 2018-11-23T00:00:00-05:00 | 2018-11-23T05:00:00Z
Central | 2018-11-23T00:00:00-06:00 | 2018-11-23T06:00:00Z
Mountain | 2018-11-23T00:00:00-07:00 | 2018-11-23T07:00:00Z
Pacific | 2018-11-23T00:00:00-08:00 | 2018-11-23T08:00:00Z
Alaska | 2018-11-23T00:00:00-09:00 | 2018-11-23T09:00:00Z
Hawaii | 2018-11-23T00:00:00-10:00 | 2018-11-23T10:00:00Z
American Samoa | 2018-11-23T00:00:00-11:00 | 2018-11-23T11:00:00Z
I have no idea if Black Friday is actually observed in all US territories or not.
As you can see, a single calendar date cannot be mapped to a single range of absolute time, especially when applied to a country with multiple time zones.
BTW, things are even more complicated if you were talking about a date that fell during daylight saving time. For example, much of Arizona doesn't observe DST while the rest of Mountain Time does.
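If you need to compute those instants programmatically, here is a minimal sketch in Python (3.9+ for zoneinfo); the zone list is just a sample:
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Local midnight on 2018-11-23 in a sample of US time zones, mapped to UTC
for zone in ["America/New_York", "America/Chicago", "America/Denver",
             "America/Los_Angeles", "Pacific/Honolulu"]:
    local_midnight = datetime(2018, 11, 23, tzinfo=ZoneInfo(zone))
    print(zone, "->", local_midnight.astimezone(timezone.utc))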
I am fairly new to DAX / PowerPivot but have built a nifty Sales Dashboard which is working well. Once combined with the PowerUpdate utility and Office365 PowerBI, we will have an awesome solution. I'm excited!
One final thing that I am struggling to get my head around, is how we handle changing targets/budgets when a Sales Rep is promoted, as we assign budget by Job Level.
Let's say we have Sales Rep A, a Junior Sales Rep with a target/budget of 30K per quarter, who gets promoted in Q2 to Mid-Level with a target/budget of 45K per quarter.
In my current dashboard, as soon as the Job Title is changed, this changes historical budget to the new target which warps any historic performance.
We have too many Sales Reps to create a quarterly budget table row for each Sales Rep and each quarter, which seems to be how all the examples out there handle it, i.e.:
Sales Rep A | 2015 | Q1 | 30000
Sales Rep A | 2015 | Q2 | 45000
Sales Rep A | 2015 | Q3 | 45000
My initial thought was to add a new row to the User Table with the same name and a "Budget Valid To" date, and calculate the target based on those dates.
Sales Rep A | Junior Sales Rep | 30000 | 01/01/2015
Sales Rep A | Mid - Sales Rep | 45000 | 01/04/2015
I just can't get my head around how I would handle this in a DAX calculation and I can't help thinking that there must be a more elegant way of handling this anyway! I'm so close to delivering this it tingles.
Let me know if you need an example or more detail.
Cheers,
Ben
Not sure of the granularity of your data (daily | weekly | monthly), but one idea that could work well is to divide the target by the granularity so you're effectively pro-rating it. Then you can use basic SUM functions at each quarter boundary to show accurate targets, since the sum of the 'daily' targets would equal your hard-coded targets. You'd need to be careful if your periods vary in number of days; dividing into months would make it uniform, but mid-month promotions would still be hard. HTH.
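To make the pro-rating concrete with the numbers above (assuming a 90-day quarter for simplicity): 30K per quarter is roughly 333 per day and 45K is 500 per day, so a rep promoted 30 days into a quarter gets about 30 × 333 + 60 × 500 ≈ 40K for that quarter, and the same daily rows sum correctly over any other date range.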
I have to implement a set of formulas in Excel whose requirements seem to defy any logical or succinct way to do them with just formulas.
Short Description:
determine (and display) a timeline of a product's "effective" internal cost each day, based on 1) the price paid for the product as a series of bulk purchases, and 2) the company's per-day consumption of the product, using the price paid for each purchase until the purchased amount is consumed and then moving on to the next purchase.
Simplified Details:
We have a sheet "PurchSums" with the product's bulk purchase history: (this is actually a pivot table)
| A | B | C | D |
1: Product Name Purch Date Rate Amount
...
23: Prod-Q 12/18/2012 $3.15 123.7
24: 1/24/2013 $3.05 748.2
25: 2/27/2013 $2.86 650.0
... ...
So initially, the internal price should be $3.15/pound until the first 123.7 pounds are used up, then it should switch to $3.05, and so on.
We have a sheet "Volumes" with consumption by day: (product consumption starts on March 1st)
| A | B | C | D |
1: Product Name 3/1/2013 3/2/2013 3/3/2013 ...
...
12: Prod-Q 87.2 77.1 101.5 ...
... ...
And I need to fill in formulas for the daily price sheet "Prices", which has the product's effective internal price for each day (filled-in by hand now):
| A | B | C | D |
1: Product Name 3/1/2013 3/2/2013 3/3/2013 ...
...
32: Prod-Q 3.15 3.15 3.05 ...
... ...
Note here that the price for "Prod-Q" changed to 3.05 on March 3rd. This is because its cumulative consumption reached 265.8 pounds (87.2 + 77.1 + 101.5) which exceeded the 12/18 purchase of 123.7 pounds and so switched from its price of 3.15 to the price of the next purchase from 1/24.
More Details:
The first sheet "PurchSums" is actually a pivot table, if that helps. Unfortunately, I don't think that it does, nor that GETPIVOTDATA(..) is going to be useful here (but I would be happy to be wrong). The information in it comes from another sheet "Purchases" which is just a list of all of the purchases:
| A | B | C | D |
1: Product Name Purch Date Rate Amount
...
11: Prod-Q 12/18/2012 $3.15 123.7
...
34: Prod-Q 1/24/2013 $3.05 748.2
...
67: Prod-Q 2/27/2013 $2.86 650.0
...
Additional:
First, I would like to do this as much as possible with Formulas in the existing cells/sheets. Plus I would like to minimize the number of additional sheets that have to be added to support this, as the workbook is already pretty substantial.
Secondly, yes I know that I could do this with VBA functions, and I will do that if there's no better way, but I would like to avoid that if possible because in my experience, populating the formulas of large sheets with massive VBA function calls can lead to some significant performance issues.
I would be grateful for any ideas on how to effectively approach this.
So what you are describing is a FIFO inventory accounting system. First In First Out. You might want to read up on it here:
http://en.wikipedia.org/wiki/FIFO_and_LIFO_accounting
and here:
http://accountingexplained.com/financial/inventories/fifo-method
Basically, under FIFO you are making the assumption (whether right or wrong) that the first items you add to your inventory are the first ones you are going to use up in your production, as opposed to LIFO, which assumes that the last items you add to your inventory are the first ones you are going to use up. That would only require a small change in my solution, but let's focus on FIFO since that is your question.
So for every day that you operate, you need to figure out how much of your inventory you used up and how much you paid for that inventory. If you bought 20 items at 50 dollars and then another 20 items at 100 dollars the next day, and on the third day you used up 30 of the items in production, you used up (20 * $50) plus (10 * $100) = $2,000 worth of inventory. $2,000 / 30 = your effective cost of the product for the day ($66.67). This effective cost is purely informational for the day - you probably wouldn't want to use it for any accounting - although you could with some rounding corrections - it would just be an indirect way to go about it.
So my solution uses a really ugly table. Basically for every day that you operate you need to figure out how much of your inventory you are using up and how much you paid for that inventory - and then divide it by the total amount used up in the day. Something like this should work for you:
=(IF(AND($E8>G$4,$E8>G$5),G$3,0)+IF(AND($E8>G$4,$E8<G$5),($E8-G$4),0))-SUM(G$7:G7)
=(IF(AND($E9>G$4,$E9>G$5),G$3,0)+IF(AND($E9>G$4,$E9<G$5),($E9-G$4),0))-SUM(G$7:G8)
=G8*$C8
=SUM(G13:G15)/G3
These formulas are all auto-fillable, and you could extend this table with as much space as you need (and probably improve it quite a bit). A small program in almost any language could accomplish this much more easily and intuitively than Excel - but this works fine (a sketch of such a program follows below).
In practice, oftentimes nobody really knows which pieces of inventory are being used up and thus how much you paid for them. But the system persists nonetheless - usually the reason many accounting practices are done are the same as why the ancient peoples of the earth sacrificed animals - because that's how it's always been done - tradition and folklore. And for the most part it serves them quite well - routine sources of protein are healthy. I could think of some better ways to do it but - alas. Good Luck.
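For what it's worth, here is a minimal sketch of that small program in Python, implementing the blended daily-cost calculation described above with the purchase and consumption figures from the question (the variable and function names are just illustrative):
# FIFO blended effective cost per day: draw each day's consumption from the
# oldest remaining purchase lots, then divide total spend by total used.
purchases = [(3.15, 123.7), (3.05, 748.2), (2.86, 650.0)]  # (rate, pounds)
daily_volumes = [87.2, 77.1, 101.5]                        # pounds used per day

def daily_effective_costs(purchases, daily_volumes):
    lots = [[rate, amount] for rate, amount in purchases]  # mutable copies
    costs = []
    for volume in daily_volumes:
        spent = used = 0.0
        while volume > 0 and lots:
            rate, remaining = lots[0]
            take = min(volume, remaining)
            spent += take * rate
            used += take
            volume -= take
            lots[0][1] -= take
            if lots[0][1] <= 1e-9:  # lot exhausted (allow for float error)
                lots.pop(0)
        costs.append(spent / used if used else None)
    return costs

print(daily_effective_costs(purchases, daily_volumes))
# [3.15, 3.097..., 3.05]: day 2 blends the tail of the $3.15 lot with the $3.05 lot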
Assuming you know how to:
display product names on every row from the pivot table
extract lower- and upper-bound interval points from the Amount values from the pivot table
sum Volumes from all previous days from the correct row for each product
cover edge cases like volumes outside defined amount intervals
use Office 2007+ tables
use named ranges
...
For the first day, you want the maximum price = the first row for a given product name:
=INDEX(PurchSums[Rate], MATCH(this_row_product_name, PurchSums[Product Name], 0))
For every other day, you want to sum Volumes from all previous days and extract the rate from matching interval of Amounts for a given product name:
=SUMIFS(PurchSums[Rate],
PurchSums[Product Name], this_row_product_name,
PurchSums[Amount above], TEXT(sum_of_previous_volumes_for_product, "<=General"),
PurchSums[Amount up to], TEXT(sum_of_previous_volumes_for_product, ">General"))
Note that the TEXT(value, "<=General") calls build criteria strings such as "<=265.8" on the fly, and the conditions (PurchSums[Amount above] <= sum_of_previous_volumes_for_product < PurchSums[Amount up to]) should be met in exactly 1 row, so we sum exactly 1 row.
Example:
For cell E16 the named ranges would be replaced / point to:
this_row_product_name = Prices[[#This Row],[Product Name]] = Prices[#Product Name] = $B16
sum_of_previous_volumes_for_product = =SUM(OFFSET(Volumes[[#Headers],[2013-03-01]:[2013-03-02]], MATCH(this_row_product_name, Volumes[Product Name], 0), 0)) = SUM($C12:D12)
So basically you want to summarize your "PurchSums" and "Volumes" sheets into one that combines the date fields?
If this is the case....
In your summary sheet make a row filled with the dates
Make a column for Purchase amount
Make a column for Volume amount
Under the Purchase amount use a match / index formula like this:
=index({Select the column C from "PurchSums"},match( {Select date from third sheet},{Select date column from "PurchSums"},0))
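A concrete instance of that pattern could be (the sheet, cell, and column references are assumptions about your layout):
=INDEX(PurchSums!C:C, MATCH(B$1, PurchSums!B:B, 0))
where B$1 holds the date in the summary sheet, column B of "PurchSums" holds the purchase dates, and column C holds the rates.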
Similarly do the same thing for the Volume column.
Hope this gets you on your way if I am not far off on what you are trying to accomplish.