I'm learning Cassandra, and as a practice data set, I'm grabbing historical stock data from Yahoo. There is going to be one record for each trading day.
Obviously, I need to make the stock symbol part of the partition key. I'm seeing conflicting information on whether I should make the date part of the partition key or make it a clustering column.
Realistically, the stock market is open ~253 days per year. So a single stock will have ~253 records per year. I'm not building a full scale database, but would like to design it to accommodate / correctly.
If I make the date part of the partition key, won't the data possibly be spread across nodes? Won't that make a date range query slow?
If I make the date part of the partition key, won't the data possibly be spread across nodes? Won't that make a date range query slow?
Yes, correct on both counts. That modeling approach is called "time bucketing," and its primary use case is time/event data that grows over time. The good news is that you don't need to do that unless your partitions are projected to get big. With your current projection of 253 rows written per partition per year, that's only going to be < 40 KB each year (see the calculation with nodetool tablehistograms below).
For your purposes I think partitioning by symbol and clustering by day should suffice.
CREATE TABLE stockquotes (
symbol text,
day date,
price decimal,
PRIMARY KEY(symbol, day))
WITH CLUSTERING ORDER BY (day DESC);
With most time-based use cases, we tend to care about recent data more (which may or may not be true with your case). If so, then writing the data in descending order by day will improve the performance of those queries.
Then (after writing some data), date range queries like this will work:
SELECT * FROM stockquotes
WHERE symbol='AAPL'
AND day >= '2020-08-01' AND day < '2020-08-08';
symbol | day | price
--------+------------+--------
AAPL | 2020-08-07 | 444.45
AAPL | 2020-08-06 | 455.61
AAPL | 2020-08-05 | 440.25
AAPL | 2020-08-04 | 438.66
AAPL | 2020-08-03 | 435.75
(5 rows)
To verify the partition sizes, you can use nodetool tablehistograms (once the data is flushed to disk).
bin/nodetool tablehistograms stackoverflow.stockquotes
stackoverflow/stockquotes histograms
Percentile  Read Latency  Write Latency  SSTables  Partition Size  Cell Count
                (micros)       (micros)                   (bytes)
50%                 0.00           0.00      0.00             124           5
75%                 0.00           0.00      0.00             124           5
95%                 0.00           0.00      0.00             124           5
98%                 0.00           0.00      0.00             124           5
99%                 0.00           0.00      0.00             124           5
Min                 0.00           0.00      0.00             104           5
Max                 0.00           0.00      0.00             124           5
Partition size each year = 124 bytes x 253 = ~31 KB
Given the tiny partition size, this model would probably be good for at least 30 years of data before any slow-down (I recommend keeping partitions <= 1 MB). Perhaps bucketing on something like a quarter century would suffice? Regardless, in the short term, it'll be fine.
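As a rough check on that arithmetic, here is a minimal sketch in Python. It assumes, as the estimate above does, about 124 bytes per row and 253 rows per year, and it takes 1 MB to mean 1,000,000 bytes. The constant names are illustrative only.

```python
# Rough partition-growth projection for one symbol (figures from the answer above).
BYTES_PER_ROW = 124        # assumed per-row size, read off the tablehistograms output
ROWS_PER_YEAR = 253        # approximate trading days per year
TARGET_BYTES = 1_000_000   # the suggested ~1 MB ceiling per partition (assumed decimal MB)

bytes_per_year = BYTES_PER_ROW * ROWS_PER_YEAR
years_until_target = TARGET_BYTES / bytes_per_year

print(f"{bytes_per_year} bytes per year")
print(f"{years_until_target:.1f} years to reach the target")
```

Re-running this with your own measured row size is an easy way to decide whether a time bucket is needed at all.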
Edit:
Seems like any date portion used in the PK would spread the data across nodes, no?
Yes, a date portion used in the partition key would spread the data across nodes. That's actually the point of doing it. You don't want to end up with the anti-pattern of unbounded row growth, because the partitions will eventually get so large that they'll be unusable. This idea is all about ensuring adequate data distribution.
lets say 1/sec and I need to query across years, etc. How would that bucketing work?
So the trick with time bucketing, is to find a "happy medium" between data distribution and query flexibility. Unfortunately, there will likely be edge cases where queries will hit more than one partition (node). But the idea is to build a model to handle most of them well.
The example question here of 1/sec for a year, is a bit extreme. But the idea to solve it is the same. There are 86400 seconds in a day. Depending on row size, that may even be too much to bucket by day. But for sake of argument, say we can. If we bucket on day, the PK looks like this:
PRIMARY KEY ((symbol,day),timestamp)
And the WHERE clause starts to look like this:
WHERE symbol='AAPL' AND day IN ('2020-08-06','2020-08-07');
On the flip side of that, a few days is fine but querying for an entire year would be cumbersome. Additionally, we wouldn't want to build an IN clause of 253 days. In fact, I don't recommend folks exceed single digits on an IN.
A possible approach here would be to fire 253 asynchronous queries (one for each day) from the application, and then assemble and sort the result set there. Using Spark (to do everything in an RDD) is a good option here, too. In reality, Cassandra isn't a great DB for a reporting API, so there is value in exploring some additional tools.
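The fan-out idea can be sketched in Python without committing to any particular driver. In this sketch, fetch_bucket is a hypothetical stand-in that returns fake rows; in a real application it would run one single-partition query per (symbol, day) bucket. The function names and the thread-pool approach are assumptions for illustration, not the only way to do it.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def day_buckets(start, end):
    """Return every day from start (inclusive) to end (exclusive)."""
    return [start + timedelta(days=i) for i in range((end - start).days)]


def fetch_bucket(symbol, day):
    """Hypothetical stand-in for one single-partition query
    (WHERE symbol=? AND day=?); returns fake (symbol, day, second) rows."""
    return [(symbol, day, second) for second in (0, 1, 2)]


def query_range(symbol, start, end, max_workers=8):
    """Query each day bucket concurrently, then merge and sort client-side."""
    days = day_buckets(start, end)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_day = pool.map(lambda d: fetch_bucket(symbol, d), days)
        rows = [row for chunk in per_day for row in chunk]
    return sorted(rows, key=lambda r: (r[1], r[2]))


rows = query_range("AAPL", date(2020, 8, 3), date(2020, 8, 8))
print(len(rows), rows[0], rows[-1])
```

Each bucket maps to exactly one partition, so no single query has to touch more than one partition, and the application pays only for the merge and sort.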
I'm trying to understand the best practices around storing aggregated time series based data.
For instance if I am building a weather service application that's ingesting lots of weather metrics from sensors around the world and storing that weather data in the form of the weather for today, the week, for the month, what's a good way to model that?
Would the day level, week level, and month level each have their own column family?
Then there's the factor of location. Each location would have its own weather data, so would partitioning by, say, some zipcode or geohash for a specific area make sense?
The access patterns would be querying for the daily or weekly or monthly weather in a city.
let's say every 5 minutes. Would that have an impact on the design?
Yes. So sensor updates every 5 minutes happen at 12x per hour or 288x per day.
The access patterns would be querying for the daily or weekly or monthly weather in a city.
That also makes for 2016x per week and 8640x per month (30 days). The reason this is important, is because Cassandra has hard limits of storing 2GB and 2 billion cells per partition. This means that storing time series data by city only, would eventually hit this limit (although things would likely grind to a halt long before that).
But the general idea is that you want to model your tables around:
How you're going to query your data.
Avoiding unlimited partition growth.
So if we're just talking about temperatures and maybe a few other data points (precipitation, etc), partitioning by month and city should work just fine.
CREATE TABLE weather_sensor_data (
city TEXT,
month INT,
temp FLOAT,
recorded_time TIMESTAMP,
PRIMARY KEY ((city,month),recorded_time))
WITH CLUSTERING ORDER BY (recorded_time DESC);
Now, I could query for weather sensor data since 8AM, like this:
> SELECT * FROM weather_sensor_data
WHERE city='Minneapolis, MN'
AND month=202111
AND recorded_time > '2021-11-01 08:00';
city | month | recorded_time | temp
-----------------+--------+---------------------------------+------
Minneapolis, MN | 202111 | 2021-11-01 08:35:00.000000+0000 | 3
Minneapolis, MN | 202111 | 2021-11-01 08:30:00.000000+0000 | 3
Minneapolis, MN | 202111 | 2021-11-01 08:25:00.000000+0000 | 2
Minneapolis, MN | 202111 | 2021-11-01 08:20:00.000000+0000 | 2
Minneapolis, MN | 202111 | 2021-11-01 08:15:00.000000+0000 | 2
(5 rows)
This should help you get started.
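With a (city, month) partition key, the application has to work out which month buckets a query touches. A minimal Python sketch of that step follows, assuming the YYYYMM integer encoding shown in the query above; the helper names are hypothetical.

```python
from datetime import datetime


def month_bucket(ts):
    """Encode a timestamp as a YYYYMM integer, matching the month column above."""
    return ts.year * 100 + ts.month


def month_buckets(start, end):
    """List every YYYYMM bucket touched by the range start..end (inclusive)."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


print(month_bucket(datetime(2021, 11, 1, 8, 35)))                   # 202111
print(month_buckets(datetime(2021, 11, 20), datetime(2022, 1, 5)))  # [202111, 202112, 202201]
```

A daily or weekly query usually lands in one bucket (two when it straddles a month boundary), so each bucket can be queried separately and the results combined in the application.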
#dipen, you could also refer to this documentation where it walks developers through various data models by their use case. #AlexOtt has great questions to begin with the data models for your use case and #aaron has a great example demonstration.
Here is an example. You could very much customize it for your weather use case. For a given access pattern requirement like in the below example,
we would go ahead and design a Cassandra table as follows to answer them,
In the book Designing Data-Intensive Applications, there is this sentence:
For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more.
The confusing part is the statement that 95 of these requests will take less than 1.5 seconds. Shouldn't it be that 95 of the requests take 1.5 seconds or less, and the remaining 5 take more than 1.5 seconds? Or does the one percent at the 95th percentile take exactly 1.5 seconds, the 94th percentile and below take less than 1.5, and the 96th percentile and above take more than 1.5? What is the correct reading of these numbers?
I have done some research on this and found several articles. The interesting part is that some say what I say and some don't.
Some of the links that read the percentile similar to 95 of the requests take 1.5 or less:
average 90th percentile response time and average response time
90% percentile is a statistical measurement, in case of JMeter it means that 90% of the sampler response times were smaller than or equal to this time
https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/
so 90 percent of the requests are processed in 3.0 seconds or less
https://www.adfpm.com/adf-performance-monitor-monitoring-with-percentiles
If the 90th percentile of the same transaction is at 1000ms it means that 90% are as fast or faster and only 10% are slower.
Other links that read the percentile similar to 95 of the requests take less than 1.5:
https://www.elastic.co/blog/averages-can-dangerous-use-percentile
In contrast, the 99th percentile says “99% of your values are less than 850ms”, which is a very different picture.
I got the answer from this website and, according to them, both readings are true. It just depends on how the percentile rank is calculated:
The word “percentile” is used informally in the above definition. In common use, the percentile usually indicates that a certain percentage falls below that percentile. For example, if you score in the 25th percentile, then 25% of test takers are below your score. The “25” is called the percentile rank. In statistics, it can get a little more complicated as there are actually three definitions of “percentile.” Here are the first two (see below for definition 3), based on an arbitrary “25th percentile”:
Definition 1: The nth percentile is the lowest score that is greater than a certain percentage (“n”) of the scores. In this example, our n is 25, so we’re looking for the lowest score that is greater than 25% of the scores.
Definition 2: The nth percentile is the smallest score that is greater than or equal to a certain percentage of the scores. To rephrase this, it’s the percentage of data that falls at or below a certain observation. This is the definition used in AP statistics. In this example, the 25th percentile is the score that’s greater or equal to 25% of the scores.
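To make the difference between the two definitions concrete, here is a small Python sketch. It encodes one reading of each definition ("greater than" as strictly below the candidate, "greater than or equal to" as at or below it); the function names are illustrative, and statistics libraries typically use interpolation methods that differ from both.

```python
def percentile_strict(scores, n):
    """Definition 1 (as read here): lowest score that is strictly greater
    than at least n percent of the scores."""
    total = len(scores)
    for candidate in sorted(scores):
        below = sum(1 for s in scores if s < candidate)
        if below * 100 >= n * total:
            return candidate
    return None  # no score is strictly greater than n percent of the scores


def percentile_inclusive(scores, n):
    """Definition 2 (as read here): lowest score that is greater than or
    equal to at least n percent of the scores."""
    total = len(scores)
    for candidate in sorted(scores):
        at_or_below = sum(1 for s in scores if s <= candidate)
        if at_or_below * 100 >= n * total:
            return candidate
    return None


scores = list(range(1, 11))  # ten distinct scores: 1, 2, ..., 10
print(percentile_strict(scores, 30))     # 4
print(percentile_inclusive(scores, 30))  # 3
```

On the same ten scores the 30th percentile comes out as 4 under the first reading and 3 under the second, which is the same "less than" versus "less than or equal to" split seen in the quoted articles.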
I've been playing with the predict-appointment-noshow notebook tutorial and I'm confused by the output of the PERCENT_TRUE primitive.
My understanding is that after feature generation, a column like locations.PERCENT_TRUE(appointments.sms_received) gives the percent of rows for which sms_received is True, given a single location, which was defined as its own Entity earlier. I'd expect that column to be the same for all rows of a single location, because that's what it was conditioned on, but I'm not finding that to be the case. Any ideas why?
Here's an example from that notebook data to demonstrate:
>>> fm.loc[fm.neighborhood == 'HORTO', 'locations.PERCENT_TRUE(appointments.sms_received)'].describe()
count 144.00
mean 0.20
std 0.09
min 0.00
25% 0.20
50% 0.23
75% 0.26
max 0.31
Name: locations.PERCENT_TRUE(appointments.sms_received), dtype: float64
Even though the location is restricted to just 'HORTO', the column ranges from 0.00-0.31. How is this being calculated?
This is a result of using cutoff times when calculating this feature matrix.
In this example, we are making predictions for every appointment at the time the appointment is scheduled. The feature locations.PERCENT_TRUE(appointments.sms_received) is therefore calculated at a specific time given by the cutoff times. For each appointment, it calculates "the percentage of appointments at this location that received an SMS prior to the scheduled_time".
That construction is necessary to prevent the leakage of future information into the prediction for that row at that time. If we calculated PERCENT_TRUE using the whole dataset, we'd necessarily be using information from appointments that hadn't yet happened, which isn't valid for predictive modeling.
If you instead want to make the predictions after all of the data is known, all you have to do is remove the cutoff_time argument to the ft.dfs call:
fm, features = ft.dfs(entityset=es,
                      target_entity='appointments',
                      agg_primitives=['count', 'percent_true'],
                      trans_primitives=['weekend', 'weekday', 'day', 'month', 'year'],
                      max_depth=3,
                      approximate='6h',
                      # cutoff_time=cutoff_times[20000:],
                      verbose=True)
Now you can see that the feature is the same when we condition on a specific location
fm.loc[fm.neighborhood == 'HORTO', 'locations.PERCENT_TRUE(appointments.sms_received)'].describe()
count 175.00
mean 0.32
std 0.00
min 0.32
25% 0.32
50% 0.32
75% 0.32
max 0.32
You can read more about how Featuretools handles time in the documentation.
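The effect of a cutoff time can be illustrated in plain Python, without Featuretools. This is a simplified sketch and not how the library is implemented: the appointment data is made up, and "prior to" is modeled as a strict comparison on scheduled_time.

```python
from datetime import datetime

# Hypothetical appointments at one location: (scheduled_time, sms_received)
appointments = [
    (datetime(2016, 5, 1), False),
    (datetime(2016, 5, 2), True),
    (datetime(2016, 5, 3), False),
    (datetime(2016, 5, 4), True),
]


def percent_true(rows):
    """Fraction of rows whose flag is True (0.0 when there are no rows)."""
    return sum(flag for _, flag in rows) / len(rows) if rows else 0.0


# With a cutoff: each appointment only sees appointments scheduled before it.
with_cutoff = [
    percent_true([r for r in appointments if r[0] < scheduled])
    for scheduled, _ in appointments
]

# Without a cutoff: every appointment sees the whole location history.
without_cutoff = [percent_true(appointments) for _ in appointments]

print(with_cutoff)     # varies per row: 0.0, 0.0, 0.5, then about 0.33
print(without_cutoff)  # [0.5, 0.5, 0.5, 0.5]
```

The per-row values differ under a cutoff because each row sees a different slice of history, which is the spread seen in the first describe() output above.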
I have a basic data set with a ton of slicers that roughly looks like this:
Hours    SpreadPerHr    Spread
 5.00           5.00     25.00
10.00           2.00     20.00
 8.00          10.00     80.00
Where Spread is a calculated value where Spread = Hours*SpreadPerHour. The problem is, the totals for these columns follow this formula too, so it looks like this:
          Hours    SpreadPerHr    Spread
           5.00           5.00     25.00
          10.00           2.00     20.00
           8.00          10.00     80.00
Total:    23.00          17.00    391.00
And while the hours total up just fine, SpreadPerHour is dynamic and so Spread is as well. It is incorrect to say Total Spread = Total Hours * Total SpreadPerHour. Totals should be:
Total:    23.00          17.00    125.00
Is there a way I can make Excel leave the totals for Hours as-is, but sum the column for Spread instead of multiplying the totals?
Here is what I think you have in your Power Pivot Model:
You have a calculated measure for Spread, which I have labeled SpreadCalc1. The problem with this is that it does the aggregation before it does the multiplication. You need this operation to be done on a row-by-row basis and then aggregated. So instead of a calculated measure, you need to create a calculated column and then sum that column.
The column I have labeled as SpreadCalc has the formula =[Hours] * [SpreadPerHr].
The calculated measure I called Spread is just Sum([SpreadCalc]). You can see there that the total is 125 as desired instead of 391.
I know this might be a bit redundant now, but I would suggest a slightly different approach.
Adding calculated columns in "small" tables is fine, but it can cause serious performance issues with large databases.
So to solve your problem, I believe the "correct" way is to use the SUMX function.
It calculates the expression specifically for each row, which is exactly what you need. And it is smart as far as performance goes (no need to add calculated columns or perform any source-data manipulations).
If you use this formula (correct the name of the table / measures), you should get the desired results:
SUMX(YourTable, [Sum Hour] * [Sum SpreadPerHr])
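The difference between multiplying the totals and summing the row-by-row products can be checked with the question's own numbers. The following is a plain Python sketch of the arithmetic only, not DAX; the dictionary keys are illustrative.

```python
rows = [
    {"hours": 5.0, "spread_per_hr": 5.0},
    {"hours": 10.0, "spread_per_hr": 2.0},
    {"hours": 8.0, "spread_per_hr": 10.0},
]

total_hours = sum(r["hours"] for r in rows)
total_rate = sum(r["spread_per_hr"] for r in rows)

# Aggregate first, then multiply: what the original measure does on the total row.
product_of_totals = total_hours * total_rate

# Multiply per row, then aggregate: what a row-by-row iterator such as SUMX does.
sum_of_row_products = sum(r["hours"] * r["spread_per_hr"] for r in rows)

print(product_of_totals)    # 391.0
print(sum_of_row_products)  # 125.0
```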
I have a data set of over 15,000 records in Excel from a measurement tool that finds trends over large areas. I'm not interested in trends within the data as a whole, but rather in how much each record varies from its neighbors, to get a sense of how noisy the data is. It's almost as if I want the average standard deviation when looking at the 15,000 or so records only 20 at a time. The hope is that the data values trend gradually rather than changing suddenly from record to record and thus looking noisy. If I add a chart and use the "Moving Average" trendline, it gives a visual sense of how noisy the data is across the 15,000+ records. However, I was hoping to get a numeric value to rate how noisy the data is compared with other datasets. Any ideas on what I could do here with formulas built into Excel or with an add-in? Let me know if I need to explain this any better.
Could you calculate your moving average for your 20 sample window, then use the difference between each point and the expected value to calculate a variance?
Hard to do tables here, but here is a sample of what I mean
Actual   Measured   Expected   Variance
     5       5.44       4.49       0.91
     6       4.34       5.84       2.26
     7       8.45       7.07       1.90
     8       6.18       7.84       2.75
     9       8.89       9.10       0.04
    10      11.98      10.01       3.89
The "measured" values were determined as
measured = actual + (rand() - 0.5) * 4
The "expected" values were calculated from a moving average (the table was pulled from the middle of the data set).
The variance is simply the square of expected minus measured.
Then you could calculate an average variance as a summary statistic.
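The same idea can be prototyped outside Excel. The Python sketch below uses a centered moving average as the "expected" value and the mean squared difference as the summary statistic. The synthetic data, the function names, and the shorter windows at the edges are all assumptions made for illustration.

```python
import random


def centered_moving_average(values, window):
    """Average each point with window // 2 neighbours on each side;
    points near the edges use a shorter window."""
    half = window // 2
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


def noise_score(values, window=20):
    """Mean squared difference between each value and its moving average."""
    expected = centered_moving_average(values, window)
    squared = [(v - e) ** 2 for v, e in zip(values, expected)]
    return sum(squared) / len(squared)


random.seed(0)  # fixed seed so the comparison is repeatable
trend = [i * 0.01 for i in range(1000)]                      # smooth ramp
quiet = [t + (random.random() - 0.5) * 0.5 for t in trend]   # small noise
noisy = [t + (random.random() - 0.5) * 4 for t in trend]     # larger noise

print(noise_score(quiet) < noise_score(noisy))
```

Because the score is a single number per dataset, it can be used to rank datasets by noisiness, provided the same window size is used for each.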
Moving average is the correct approach, but you need a critical element: order. Do you have a date/time variable or a sequence number?
Use the OFFSET function to set up your window. If you want a window of about 20, your formula will look something like AVERAGE(OFFSET(C15,-10,0,21)), which averages the 21 cells from C5 to C25, centered on C15. This is your moving average.
Relate that to C15, whether additively or multiplicatively, and you'll have your distance. All we need now is your tolerance.