Database Application - Store or compute on-the-fly? - database-agnostic

I have a purchase list table with the fields ItemName, Quantity, UnitPrice, and Amount. Note that Amount is equal to Quantity * UnitPrice.
My simple problem is: should I STORE the amount or COMPUTE it when retrieving the data? Which should I be more concerned about, storage or processing?

Storage is cheap, so I'd just pre-compute it for convenience. It probably won't matter much for most applications unless you are searching or reporting on the Amount column. If you are reporting on that column, I'd definitely pre-compute. I'd be more worried about joins and sub-queries.
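To make the two options concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, view, and index names are made up for illustration, and SQLite only stands in for whatever database you actually use. One table derives Amount through a view on every read; the other stores it at write time so it can be indexed for reporting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Option A: compute on the fly. Amount is not stored; a view derives it on read.
conn.execute(
    "CREATE TABLE purchase_computed (item_name TEXT, quantity INTEGER, unit_price REAL)"
)
conn.execute(
    "CREATE VIEW purchase_computed_v AS "
    "SELECT item_name, quantity, unit_price, quantity * unit_price AS amount "
    "FROM purchase_computed"
)

# Option B: store the amount at write time so it can be indexed and reported on.
conn.execute(
    "CREATE TABLE purchase_stored "
    "(item_name TEXT, quantity INTEGER, unit_price REAL, amount REAL)"
)
conn.execute("CREATE INDEX idx_purchase_stored_amount ON purchase_stored (amount)")

items = [("pen", 10, 1.5), ("notebook", 3, 4.0)]
conn.executemany("INSERT INTO purchase_computed VALUES (?, ?, ?)", items)
conn.executemany(
    "INSERT INTO purchase_stored VALUES (?, ?, ?, ?)",
    [(name, qty, price, qty * price) for name, qty, price in items],
)

def amounts(connection, source):
    """Return {item_name: amount} from the given table or view."""
    rows = connection.execute(f"SELECT item_name, amount FROM {source}")
    return dict(rows.fetchall())

computed = amounts(conn, "purchase_computed_v")
stored = amounts(conn, "purchase_stored")
```

Both reads return the same values. The practical difference is that the stored column can carry an index, while the computed one can never drift out of sync with Quantity and UnitPrice.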

Related

How best to design MongoDB for multiple time series data

I have time series data, for example myData, which I display in my UI by Day, Week, Month, or Year.
What is the better way to store it in MongoDB? Should I create separate collections for this, like:
myDataDay
myDataWeek
...
or is it better to store it in one collection with Day, Week, Month, Year keys?
How would this impact performance?
You will need to answer the following questions:
What number and type of parallel queries will you send to the database?
Are there other fields that the data will be searched on?
Do 90% of all queries fall within the last year/month/day/hour, or some other range?
If you split the data across many collections, the logic on the app side becomes more complex. On the other hand, if you keep everything in the same collection, at some point your database will become bigger and more difficult to maintain...
You may take a look at the special collection type dedicated to time series data, but in general it depends heavily on the amount and distribution of data you expect...

Querying split partitions on Cassandra in a single request

I am in the process of learning Cassandra as an alternative to SQL databases for one of the projects I am working on, which involves Big Data.
For the purpose of learning, I've been watching the videos offered by DataStax, more specifically DS220 which covers modeling data in Cassandra.
While watching one of the videos in the course series I was introduced to the concept of splitting partitions to manage partition size.
My current understanding is that Cassandra has a max logical capacity of 2B entries per partition, but a suggested max of a few hundred MB per partition.
I'm currently dealing with large amounts of real-time financial data that I must store (time series), meaning I can easily fill out GBs worth of data in a day.
The video course talks about introducing an additional partition key in order to split a partition, with the purpose of reducing the size of each partition.
The video suggested using either a time-based key or an arbitrary "bucket" key that gets incremented when the number of manageable rows has been reached.
With that in mind, this led me to the following problem: given that partition keys are only used as equality criteria (i.e. they point to the partition in which to find records), how do I find all the records that end up being spread across multiple partitions without having to specify either the bucket or timestamp key?
For example, I may receive 1M records in a single day, which would likely go over the 100-500 MB partition limit, so I wouldn't be able to partition on a per-date basis. That means my daily data would be broken down into hourly partitions or, alternatively, into "bucketed" partitions (for balanced partition sizes). Either way, all my daily data would be spread across multiple partition splits.
Given this scenario, how do I go about querying for all records for a given day? (additional clustering keys could include a symbol for which I want to have the results for, or I want all the records for that specific day)
Any help would be greatly appreciated.
Thank you.
Basically this comes down to choosing the right resolution for your data. I would say the first step is to determine what fits your data best. For the sake of example, let's take 1 hour as a good resolution; the question is then how to fetch all records for a particular date.
Your application logic will be slightly more complicated, since you are trading simplicity for the ability to store large amounts of data in a distributed fashion. You take the date you need, issue 24 queries in a loop, and glue the data together at the application level. However, the glued result can be huge (I do not know your presentation or export requirements, so this could pull 1M records into memory).
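The 24-queries-in-a-loop idea can be sketched in plain Python. This is only an illustration: the PARTITIONS dict is a hypothetical in-memory stand-in for a Cassandra table partitioned by (symbol, hour bucket), and in a real application each dict lookup would be one single-partition query issued through your driver. Using a generator means the day's rows are streamed bucket by bucket rather than all held in memory at once.

```python
from datetime import date, datetime, timedelta

# Hypothetical in-memory stand-in for a table whose partition key is
# (symbol, hour bucket); each dict entry plays the role of one partition.
PARTITIONS = {}

def bucket_for(ts):
    """Hour-resolution bucket, e.g. '2015-10-29T09'."""
    return ts.strftime("%Y-%m-%dT%H")

def write(symbol, ts, price):
    PARTITIONS.setdefault((symbol, bucket_for(ts)), []).append((ts, price))

def hourly_buckets(day):
    """The 24 bucket keys that together cover one calendar day."""
    start = datetime(day.year, day.month, day.day)
    return [bucket_for(start + timedelta(hours=h)) for h in range(24)]

def read_day(symbol, day):
    """Yield a day's rows, one single-partition lookup per hour bucket."""
    for bucket in hourly_buckets(day):
        yield from sorted(PARTITIONS.get((symbol, bucket), []))

write("ABC", datetime(2015, 10, 29, 9, 30), 10.0)
write("ABC", datetime(2015, 10, 29, 23, 59), 11.0)
write("ABC", datetime(2015, 10, 30, 0, 0), 12.0)  # next day, different bucket
prices = [price for _, price in read_day("ABC", date(2015, 10, 29))]
```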
Another idea is to have a simple lookup table keyed by date, whose values are the partition keys holding financial data for that date. Then, when you read, you go to the lookup table first to get the keys and then to the partitions holding the results. You can also store a counter of values per partition key so you know how much data to expect.
All in all, it is best to figure out some natural bucket in your data set and add it to the date (organization, zip code, etc.), and you can use the trick with the additional lookup table. This approach can be used for the symbol you mentioned. You can have symbols as partition keys, cluster by date, and store as values the partitions that hold results for that date. Then you query for symbol # on 29-10-2015, see that partitions A, D and Z have results, go to those partitions to get the financial data, and glue it together at the application level.
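A rough sketch of the lookup-table trick, again with in-memory dicts standing in for the two Cassandra tables. LOOKUP, DATA, and MAX_ROWS_PER_BUCKET are hypothetical names, and the bucket size is tiny just to make the rollover visible. The lookup entry for a (symbol, date) records which buckets exist and how many rows each holds, so a reader knows exactly which partitions to visit.

```python
from collections import defaultdict

LOOKUP = defaultdict(dict)  # (symbol, day) -> {bucket number: row count}
DATA = defaultdict(list)    # (symbol, day, bucket number) -> rows
MAX_ROWS_PER_BUCKET = 2     # deliberately tiny for the demo

def write(symbol, day, row):
    """Append to the newest bucket, opening a new one when it is full."""
    buckets = LOOKUP[(symbol, day)]
    bucket = max(buckets) if buckets else 0
    if buckets and buckets[bucket] >= MAX_ROWS_PER_BUCKET:
        bucket += 1
    DATA[(symbol, day, bucket)].append(row)
    buckets[bucket] = buckets.get(bucket, 0) + 1

def read_day(symbol, day):
    """Consult the lookup table first, then visit only the listed buckets."""
    rows = []
    for bucket in sorted(LOOKUP.get((symbol, day), {})):
        rows.extend(DATA[(symbol, day, bucket)])
    return rows

for price in [10.0, 10.5, 11.0, 10.8, 11.2]:
    write("ABC", "2015-10-29", price)

bucket_counts = dict(LOOKUP[("ABC", "2015-10-29")])
day_rows = read_day("ABC", "2015-10-29")
```

The per-bucket counts double as the "counter of values per partition key" mentioned above, so the reader can estimate the result size before fetching anything.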

Azure Table Storage Delete where Row Key is Between two Row Key Values

Is there a good way to delete entities that are in the same partition given a row key range? It looks like the only way to do this would be to do a range lookup and then batch the deletes after looking them up. I'll know my range at the time that entities will be deleted so I'd rather skip the lookup.
I want to be able to delete things to keep my partitions from getting too big. As far as I know a single partition cannot be scaled across multiple servers. Each partition is going to represent a type of message that a user sends. There will probably be less than 50 types. I need a way to show all the messages of each type that were sent (ex: show recent messages regardless of who sent it of type 0). This is why I plan to make the type the partition key. Since the types don't scale with the number of users/messages though I don't want to let each partition grow indefinitely.
Unfortunately, you need to know the precise Partition Keys and Row Keys in order to issue deletes. You do not need to retrieve the entities from storage if you know their precise RowKeys, but you do need the keys in order to issue a batch delete. There is no magic "Delete from table where partitionkey = 10" command like there is in SQL.
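If your row keys are predictable, you can generate them instead of looking them up. The sketch below assumes a hypothetical scheme where row keys are zero-padded sequence numbers, and it only builds the batches; actually submitting them is left to whichever storage client you use. Batches are capped at 100 because a Table storage batch operation accepts at most 100 entities, all within one partition.

```python
def row_keys_in_range(first, last, width=6):
    """Hypothetical key scheme: zero-padded sequence numbers, inclusive range."""
    return [str(n).zfill(width) for n in range(first, last + 1)]

def delete_batches(partition_key, row_keys, batch_size=100):
    """Group (PartitionKey, RowKey) pairs into batches of at most batch_size."""
    for i in range(0, len(row_keys), batch_size):
        yield [(partition_key, rk) for rk in row_keys[i:i + batch_size]]

# 250 known keys in partition "type0" become three batches: 100, 100, 50.
batches = list(delete_batches("type0", row_keys_in_range(1, 250)))
```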
However, consider breaking your data up into tables that represent archivable time units. For example, in AzureWatch we store all of the metric data in tables that each represent one month of data, e.g. Metrics201401, Metrics201402, etc. Thus, when it comes time to archive, a full table is purged for a particular month.
The obvious downside of this approach is the need to "union" data from multiple tables if your queries span wide time ranges. However, if you keep your time ranges narrow, the number of unions will stay small. Basically, this approach lets you use the table name as another partitioning opportunity.
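To see how many tables a query has to "union", you can derive the table names from the query's date range. A small sketch, assuming the MetricsYYYYMM naming from the example above (the function name is made up):

```python
from datetime import date

def monthly_tables(start, end, prefix="Metrics"):
    """List the per-month table names covering start..end, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

# A query from mid-November 2013 to early February 2014 touches four tables.
tables = monthly_tables(date(2013, 11, 15), date(2014, 2, 3))
```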

Azure Tables - Partition Key and Row Key - Correct Choice

I am new to Azure tables and have read a lot of articles, but would like some reassurance on the above given it's fundamental.
I have data which is similar to this:
CustomerId, GUID
TripId, GUID
JourneyStep, GUID
Time, DateTime
AverageSpeed, int
Based on what I have read, is CustomerId a good PartitionKey? Where I get stuck is that the combination of CustomerId and TripId does not make a unique row. My justification for TripId as the Row Key is that every query will be a dataset based on CustomerId and TripId.
Just for context, the CustomerId is clearly unique, the TripId represents one journey in a vehicle and within that journey the JourneyStep represents a unit within that Trip which may be 10 steps or 1000.
The intention is aggregate the data into further tables with each level being used for a different purpose. At the most aggregated level, the customer will be given some scores.
The amount of data will obviously be huge, so I need to think about query performance from the outset.
Updated:
As requested, the solution is for Vehicle Telematics, so think of yourself in your own car: a black box ships data to a server, which in turn passes it to Azure Tables. In relational DB terms, I would have a Customer table and a Trip table with a foreign key back to the Customer table.
The TripId is auto-generated by the black box. TripId does not need to be stored by date/time from a query point of view, but that may be relevant from a query performance point of view.
Queries will be split into two:
Display a map of a single journey for each customer, so filter by customer and then Trip to then iterate each row (journeystep) to a map.
Per customer, I will score each trip and then retrieve trips for, let's say, the last month to aggregate a score. I do have SQL Database to enrich data with client records etc but for the volume data (the trip data) I wish to use Azure Tables.
The aggregates from the second query will probably be stored in a separate table, so if someone made 10 trips in one month, I would run the second query which would score each trip, then produce a score for all trips that month and store both answers so potentially a table of trip aggregates and a table of monthly aggregates.
The thing about the Partition Key is that it represents a logical grouping; you cannot, for example, insert data spanning multiple partition keys in a single batch. Similarly, rows with the same partition key are likely to be stored on the same server, making it quick to retrieve all the data for a given partition key.
As such, it is important to look at your domain and figure out which aggregate you are likely to work with.
If I understand your domain model correctly, I would actually be tempted to use the TripId as the Partition Key and the JourneyStep as the Row Key.
You will need, separately, a table that lists all the Trip IDs that belong to a given Customer, which makes sense as you probably want to store some data, such as "trip name", in such a table anyway.
Your design has to follow from your queries. You can filter your data efficiently on two columns, PartitionKey and RowKey. PartitionKey is the most important one, since your queries will hit that column first.
In your case CustomerId should be your PartitionKey, since most of the time you will try to reach your data by customer. (You may also need to keep another table for your client list.)
Now, RowKey can be your tripId or the time. If I were you, I would probably use a RowKey in yyyyMMddHHmm|tripId format, which will let you query by time prefix or range (startsWith-style filters).
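A quick sketch of why that RowKey format works: because the timestamp is zero-padded and comes first, plain string comparison orders the keys chronologically, so "all rows for a day" becomes a lower/upper bound on RowKey. The trip ids here are hypothetical placeholders (the real ones would be GUIDs), and the list filter stands in for a RowKey range filter in an actual query.

```python
from datetime import date, datetime, timedelta

def row_key(ts, trip_id):
    """Build a yyyyMMddHHmm|tripId row key."""
    return f"{ts:%Y%m%d%H%M}|{trip_id}"

def day_bounds(day):
    """Bounds such that lower <= RowKey < upper selects exactly one day."""
    next_day = day + timedelta(days=1)
    return f"{day:%Y%m%d}", f"{next_day:%Y%m%d}"

def rows_for_day(row_keys, day):
    lower, upper = day_bounds(day)
    return [k for k in sorted(row_keys) if lower <= k < upper]

keys = [
    row_key(datetime(2014, 1, 5, 8, 15), "trip-a"),
    row_key(datetime(2014, 1, 5, 17, 40), "trip-b"),
    row_key(datetime(2014, 1, 6, 0, 0), "trip-c"),
]
jan_5 = rows_for_day(keys, date(2014, 1, 5))
```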
Adding to @Frans's answer:
One thing you could do is create a separate table for each customer, so you could have a table named like Customer. That way each customer's data is nicely segregated into different tables. Then you could use TripId as PartitionKey and JourneyStep as RowKey, as suggested by @Frans. For storing metadata about the trip, instead of going to a separate table, I would still use the same table, but keep the RowKey empty for that entity and put the other information about the trip there.
I would suggest considering the following approach to your PK/RK design. I believe it would yield the best performance for your outlined queries:
PartitionKey: combination of CustomerId and TripId.
string.Format("{0}_{1}", customerId.ToString(), tripId.ToString())
RowKey: combination of DateTime.MaxValue.Ticks - Time.Ticks, formatted as a large zero-padded string, and the JourneyStep.
string.Format("{0}_{1}", (DateTime.MaxValue.Ticks - Time.Ticks).ToString("D19"), JourneyStep.ToString())
Such combination will allow you to do the following queries "quickly".
Get data by CustomerId only. Example: context.Trips.Where(n => string.Compare(id + "_00000000-0000-0000-0000-000000000000", n.PartitionKey) <= 0 && string.Compare(id + "_zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz", n.PartitionKey) >= 0).AsTableServiceQuery(context);
Get data by CustomerId and TripId. Example: context.Trips.Where(n => n.PartitionKey == string.Format("{0}_{1}", customerId, tripId)).AsTableServiceQuery(context);
Get last X amount of journey steps if you were to search by either CustomerId or CustomerId/TripId by using the "Take" function
Get data via date-range queries by translating timestamps into Ticks
Save data into a trip with a single storage transaction (assuming you have less than 100 steps)
If you can guarantee uniqueness of Times of Steps within each Trip, you don't even have to put JourneyStep into the RowKey as it is somewhat inconvenient
The only downside to this schema is not being able to retrieve a particular single journey step without knowing its Time and Id. However, unless you have very specific use cases, downloading all of the steps inside a trip and then picking a particular one from the list should not be so bad.
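The reverse-ticks trick is easier to see with numbers. Below is a small Python sketch of the same arithmetic; the helper names are made up. .NET ticks are 100 ns intervals since 0001-01-01, and DateTime.MaxValue.Ticks is 3155378975999999999 (19 digits), so padding to 19 digits keeps every key the same length. Since entities come back in ascending RowKey order, the newest step comes first, which is what makes "Take" return the latest X steps.

```python
from datetime import datetime

DOTNET_EPOCH = datetime(1, 1, 1)
MAX_TICKS = 3155378975999999999  # DateTime.MaxValue.Ticks in .NET

def ticks(ts):
    """Convert a naive datetime to .NET-style ticks (100 ns units)."""
    delta = ts - DOTNET_EPOCH
    return (delta.days * 86400 + delta.seconds) * 10_000_000 + delta.microseconds * 10

def reverse_row_key(ts, journey_step):
    """Zero-padded (MaxValue - ticks) so that newer rows sort first."""
    return f"{MAX_TICKS - ticks(ts):019d}_{journey_step}"

older = reverse_row_key(datetime(2014, 1, 5, 8, 0), "step-1")
newer = reverse_row_key(datetime(2014, 1, 5, 8, 5), "step-2")
newest_first = sorted([older, newer])
```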
HTH
Table storage design is about optimizing for two major capabilities of Azure Tables:
Scalability
Search performance
As @Frans already pointed out, Azure Tables uses the partition key to decide how to scale out your data across multiple storage server nodes. Because of this, I would advise against having unique partition keys, since in theory you would have Azure spinning up storage nodes that each serve only one customer. I say "in theory" because, in practice, Azure uses smart algorithms to identify patterns in your partition keys and group them (for example, if your ids are consecutive numbers). You don't want to fall into this scenario, because the scalability of your storage will be unpredictable and in the hands of obscure algorithms making those decisions. See HERE for more information about scalability.
Regarding performance, the fastest way to search is to hit both partition key and row key in your queries. Unlike Amazon DynamoDB, Azure Tables does not support secondary indexes on other columns. If your queries search on attributes stored in columns other than those two, Azure will need to do a full table scan.
I faced a situation similar to yours, where the design of the partition/row keys was not trivial. In the end, we expanded our data model to include more information so we could design our table in such a way that ~80% of all search queries can be matched to partition+row keys, while the remaining 20% require a table scan. We decided to include the user's location, so our partition key is the user's country and the rowkey is a customer unique ID. This means our data model had to be expanded to include the user's country, which was not a big deal. Maybe you can do the same thing? Group your customers by segment, or by location, or by email address SMTP domain?

What's a better counting algorithm for Azure Table Storage log data?

I'm using Windows Azure and venturing into Azure Table Storage for the first time in order to make my application scalable to high density traffic loads.
My goal is simple, log every incoming request against a set of parameters and for reporting count or sum the data from the log. In this I have come up with 2 options and I'd like to know what more experienced people think is the better option.
Option 1: Use Boolean Values and Count the "True" rows
Because each row is written once and never updated, store each count parameter as a bool and in the summation thread, pull the rows in a query and perform a count against each set of true values to get the totals for each parameter.
This would save space if there are a lot of parameters because I imagine Azure Tables store bool as a single bit value.
Option 2: Use Int Values and Sum the rows
Each row is written as above, but instead each parameter column is added as a value of 0 or 1. Summation would occur by querying all of the rows and using a Sum operation for each column. This would be quicker because Summation could happen in a single query, but am I losing something in storing 32 bit integers for a Boolean value?
I think at this point for query speed, Option 2 is best, but I want to ask out loud to get opinions on the storage and retrieval aspect because I don't know Azure Tables that well (and I'm hoping this helps other people down the road).
Table storage doesn't do aggregation server-side, so for both options, you'd end up pulling all the rows (with all their properties) locally and counting/summing. That makes them both equally terrible for performance. :-)
I think you're better off keeping a running total, instead of re-summing everything every time. We talked about a few patterns for that on Cloud Cover Episode 43: http://channel9.msdn.com/Shows/Cloud+Cover/Cloud-Cover-Episode-43-Scalable-Counters-with-Windows-Azure
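As one illustration of a running total (a generic sharded-counter sketch, not necessarily one of the patterns from the episode), each increment updates one of a small, fixed number of counter rows, and reading the total sums only those rows instead of every log row. The class and key names are hypothetical, and a dict stands in for the counter table.

```python
import random

class ShardedCounter:
    """Running total spread over a few counter rows to reduce write contention."""

    def __init__(self, name, shards=10):
        self.name = name
        self.shards = shards
        self.rows = {}  # (PartitionKey, RowKey) -> count; stand-in for a table

    def increment(self, amount=1):
        # Pick a shard at random so concurrent writers rarely hit the same row.
        shard = random.randrange(self.shards)
        key = (self.name, f"shard-{shard:02d}")
        self.rows[key] = self.rows.get(key, 0) + amount

    def total(self):
        # Reads touch at most `shards` rows, however many requests were logged.
        return sum(self.rows.values())

counter = ShardedCounter("requests")
for _ in range(1000):
    counter.increment()
```

In a real table each increment would be a read-modify-write on one entity, so you would still need optimistic concurrency (ETag checks) or a single writer per shard.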

Resources