I have many questions on whether to store my data into SQL or Table Storage and the best way to store them for efficiency.
Use Case:
I have around 5 million rows of objects that are currently stored in a MySQL database. Currently only the metadata (Lat, Long, ID, Timestamp) is stored in the database. The other 150 columns about each object, which are not used much, were moved into Table Storage.
In Table Storage, should each object be stored as one row, with all 150 rarely used columns packed into a single column, instead of being spread across multiple rows?
For each of these 5 million objects in the database, there is certain information about them (temperature readings, trajectories, etc.). The trajectory data used to be stored in SQL (~300 rows / object) but was moved to Table Storage to be cost effective. Currently it is stored in Table Storage in a relational manner, where each row looks like (PK: ID, RK: ID-Depth-Date, X, Y, Z).
Currently it takes a long time to grab many of the trajectories. Table Storage seems to be pretty slow in our case. I want to improve the performance of the gets. Should the data be stored so that each object has 1 row for its trajectory, with all the XYZ values stored in 1 column in JSON format? Instead of getting 300 rows, it would only need to get 1 row.
Is Table Storage the best place to store all of this data? If I wanted to get an X, Y, Z at a certain Measured Depth, I would have to get the whole row and parse through the JSON. This is probably a trade-off.
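To make the trade-off concrete, here is a minimal sketch of the one-row-per-object layout in plain Python, with no storage SDK involved. The entity shape, the `"trajectory"` row key, and the helper names `pack_trajectory` and `point_at_depth` are assumptions made up for illustration, not anything from the original setup.

```python
import json
from bisect import bisect_left

def pack_trajectory(object_id, points):
    """Collapse many (depth, x, y, z) points into a single entity dict."""
    ordered = sorted(points)  # tuples sort by their first field, the measured depth
    return {
        "PartitionKey": str(object_id),
        "RowKey": "trajectory",          # hypothetical fixed row key
        "Points": json.dumps(ordered),   # all points packed into one JSON column
    }

def point_at_depth(entity, depth):
    """Parse the whole JSON column, then binary-search for an exact depth."""
    points = json.loads(entity["Points"])
    depths = [p[0] for p in points]
    i = bisect_left(depths, depth)
    if i < len(points) and depths[i] == depth:
        return tuple(points[i][1:])
    return None

entity = pack_trajectory(42, [(200.0, 1.5, 2.5, 3.5), (100.0, 1.0, 2.0, 3.0)])
```

One read returns the whole trajectory, but every depth lookup pays for parsing all of the points. Table Storage also limits the size of a single property and of a whole entity, so a much larger payload (such as the 20,000-row case) would likely need to be split across several properties or entities, or moved to a blob.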
Is it feasible to keep the trajectory data, readings, etc. in a SQL database, where the trajectory data alone could reach 5,000,000 x 300 rows? There is also some information about the objects that could reach 5,000,000 x 20,000 rows. This is probably too much for a SQL database and would have to go into Azure Cloud Storage. If so, would the JSON option be the best one? The trade-off is that if I want a portion of 1,000 rows, I would have to get the whole table; however, isn't that faster than querying through 20,000 rows? I could probably split the data into sets of 1,000 rows and use SQL as metadata for finding out which sets of data I need from Cloud Storage.
Pretty much I'm having trouble understanding how to group data and format it into Azure Cloud Tables to be efficient and fast when grabbing data for my application.
Here's an example of my data and how I am getting it: http://pastebin.com/CAyH4kHu
As an alternative to table storage, you can consider using Azure SQL DB Elastic Scale to spread trajectory data (and associated object metadata) across multiple Azure SQL DBs. This allows you to overcome capacity (and compute) limits of a single database. You would be able to perform object-specific queries or inserts efficiently, and have options to perform queries across multiple databases -- assuming you are working with a .Net application tier. You can find out more by looking at http://azure.microsoft.com/en-us/documentation/articles/sql-database-elastic-scale-get-started/
I am in the process of learning Cassandra as an alternative to SQL databases for one of the projects I am working for, that involves Big Data.
For the purpose of learning, I've been watching the videos offered by DataStax, more specifically DS220 which covers modeling data in Cassandra.
While watching one of the videos in the course series I was introduced to the concept of splitting partitions to manage partition size.
My current understanding is that Cassandra has a logical maximum of 2B entries per partition, but a suggested maximum of a few hundred MB per partition.
I'm currently dealing with large amounts of real-time financial data that I must store (time series), meaning I can easily fill out GBs worth of data in a day.
The video course talks about introducing an additional partition key in order to split a partition, with the purpose of reducing the size of each partition.
The video suggested using either a time-based key or an arbitrary "bucket" key that gets incremented when the number of manageable rows has been reached.
With that in mind, this led me to the following problem: given that partition keys are only used as equality criteria (ie. point to the partition to find records), how do I find all the records that end up being spread across multiple partitions without having to specify either the bucket or timestamp key?
For example, I may receive 1M records in a single day, which would likely go over the 100-500 MB partition limit, so I wouldn't be able to partition on a per-date basis. That means my daily data would be broken down into hourly partitions or, alternatively, into "bucketed" partitions (for balanced partition sizes). Either way, all my daily data would be spread across multiple partition splits.
Given this scenario, how do I go about querying for all records for a given day? (additional clustering keys could include a symbol for which I want to have the results for, or I want all the records for that specific day)
Any help would be greatly appreciated.
Thank you.
Basically this comes down to choosing the right resolution for your data. I would say the first step is to determine what the best fit for your data is. For the sake of example, let's take 1 hour as a good resolution; the question is then how to fetch all records for a particular date.
Your application logic will be slightly more complicated, since you are trading simplicity for the ability to store large amounts of data in a distributed fashion. You take the date you need, issue 24 queries in a loop, and glue the data together at the application level. However, the glued result can be huge (I do not know your presentation or export requirements, so this could pull 1M records into memory).
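As a rough sketch of the 24-queries-in-a-loop idea, in plain Python: the `YYYY-MM-DD-HH` key format and the `fetch_partition` callback are assumptions for illustration (in practice the callback would run one query per partition through your driver). Yielding records partition by partition avoids holding the whole day in memory at once.

```python
from datetime import date

def hourly_partition_keys(day):
    """Return the 24 partition keys (hypothetical 'YYYY-MM-DD-HH' format) for one day."""
    return [f"{day.isoformat()}-{hour:02d}" for hour in range(24)]

def records_for_day(day, fetch_partition):
    """Yield records one partition at a time instead of gluing everything up front."""
    for key in hourly_partition_keys(day):
        yield from fetch_partition(key)

# Stand-in for the real store: partition key -> list of records.
fake_store = {"2015-10-29-09": ["tick-a", "tick-b"], "2015-10-29-14": ["tick-c"]}
rows = list(records_for_day(date(2015, 10, 29), lambda k: fake_store.get(k, [])))
```

Calling `list(...)` here materializes everything only because the sample is tiny; an export job could consume the generator directly.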
Another idea is to have a simple lookup table whose key is the date and whose values are the partition keys holding financial data for that date. Then, when you read, you go first to the lookup table to get the keys and then to the partitions holding the results. You can also store a counter of values per partition key so you know how much data to expect.
All in all, it is best to figure out some natural bucket in your data set (organization, zip code, etc.) and add it to the date; you can then use the trick with the additional lookup table. This approach can be used for the symbol you mentioned. You can have symbols as partition keys, cluster by date, and store as values the partitions holding results for that date. Then you query for symbol # on 29-10-2015, see that partitions A, D and Z have results, go to those partitions, get the financial data from them, and glue it together at the application level.
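The lookup-table trick could look roughly like this. Both "tables" are in-memory dictionaries standing in for Cassandra tables, and the names `lookup`, `data`, `write` and `read` are hypothetical; the point is only the two-step read (lookup first, then the listed partitions).

```python
from collections import defaultdict

# In-memory stand-ins for the two tables described above.
lookup = defaultdict(set)   # (symbol, day) -> set of data partition keys
data = defaultdict(list)    # data partition key -> list of (symbol, day, record)

def write(symbol, day, bucket, record):
    """Store the record in its bucket and note the bucket in the lookup table."""
    data[bucket].append((symbol, day, record))
    lookup[(symbol, day)].add(bucket)

def read(symbol, day):
    """Step 1: ask the lookup table for partitions. Step 2: read only those."""
    result = []
    for bucket in sorted(lookup.get((symbol, day), ())):
        result.extend(rec for sym, d, rec in data[bucket] if sym == symbol and d == day)
    return result

write("ABC", "2015-10-29", "A", 101.5)
write("ABC", "2015-10-29", "D", 101.7)
write("XYZ", "2015-10-29", "A", 55.0)
```

The cost is one extra write per record and one extra read per query, in exchange for never having to guess which buckets hold a given day.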
Is there a good way to delete entities that are in the same partition given a row key range? It looks like the only way to do this would be to do a range lookup and then batch the deletes after looking them up. I'll know my range at the time that entities will be deleted so I'd rather skip the lookup.
I want to be able to delete things to keep my partitions from getting too big. As far as I know, a single partition cannot be scaled across multiple servers. Each partition is going to represent a type of message that a user sends. There will probably be fewer than 50 types. I need a way to show all the messages of each type that were sent (e.g. show recent messages of type 0, regardless of who sent them). This is why I plan to make the type the partition key. Since the types don't scale with the number of users/messages, though, I don't want to let each partition grow indefinitely.
Unfortunately, you need to know the precise Partition Keys and Row Keys in order to issue deletes. You do not need to retrieve the entities from storage if you know the precise RowKeys, but you do need to have them in order to issue a batch delete. There is no magic "Delete from table where partitionkey = 10" command like there is in SQL.
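If the row keys in the range can be regenerated without a lookup, the deletes can be prepared up front. The sketch below assumes, purely for illustration, zero-padded sequential row keys and a partition named `type-0`; it only builds the batches and does not call any storage SDK. The chunk size of 100 reflects the Table service limit on entities per batch within one partition, as far as I know.

```python
def chunked(items, size=100):
    """Split a list into consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def delete_batches(partition_key, row_keys, size=100):
    """Build one list of delete operations per batch, all in the same partition."""
    return [
        [("delete", partition_key, rk) for rk in chunk]
        for chunk in chunked(row_keys, size)
    ]

# Hypothetical key scheme: zero-padded sequence numbers 1..250.
row_keys = [f"{n:08d}" for n in range(1, 251)]
batches = delete_batches("type-0", row_keys)
```

Each inner list would then be submitted as one batch. If the keys cannot be regenerated (for example, they embed a GUID), the range lookup is unavoidable.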
However, consider breaking your data up into tables that represent archivable time units. For example in AzureWatch we store all of the metric data into tables that represent one month of data. IE: Metrics201401, Metrics201402, etc. Thus, when it comes time to archive, a full table is purged for a particular month.
The obvious downside of this approach is the need to "union" data from multiple tables if your queries span wide time ranges. However, if you keep your time ranges short, the number of unions will stay small. Basically, this approach allows you to use the table name as another partitioning opportunity.
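Working out which monthly tables a query has to "union" is simple date arithmetic. A small sketch, reusing the `Metrics201401` naming from above (the function name is made up for illustration):

```python
from datetime import date

def monthly_table_names(start, end, prefix="Metrics"):
    """List the per-month table names covering the inclusive range start..end."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

tables = monthly_table_names(date(2013, 12, 15), date(2014, 2, 3))
```

A query spanning mid-December to early February touches three tables, and archiving a month means dropping exactly one of them.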
We've got a Windows Azure table storage system going where various entity types report values during the day, so we've got the following partition and row key scenario:
There are about 4,000 - 5,000 entities. There are 6 entity types, roughly evenly distributed, so around 800 of each.
PartitionKey: entityType-Date
RowKey: entityId
Each row records the values for an entity for that particular day. This is currently JSON serialized.
The data is quite verbose.
We will periodically want to look back at the values in these partitions over a month or two months depending on what our website users want to look at.
We are having a problem in that if we want to query a month of data for one entity we find that we have to query 31 partition keys by entityId.
This is very slow initially but after the first call the result is cached.
Unfortunately the nature of the site is that there will be a varying number of different queries so it's unlikely the data will benefit much from caching.
We could obviously make the partitions bigger i.e. perhaps a whole week of data and expand the rowKeys to entityId and date.
What other options are open to me, or is it simply the case that Windows Azure tables suffer from fairly high latency?
Some options include
Make the 31 queries in parallel
Make a single query on a partition key range, that is
Partition key >= entityType-StartDate and Partition key <= entityType-EndDate and Row key = entityId.
It is possible that depending on your data, this query may have less latency than your current query.
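Both options can be sketched in Python. The key format `entityType-YYYY-MM-DD`, the function names, and the `fetch_one` callback are assumptions for illustration; `fetch_one` stands in for whatever point query your storage client offers. The filter string uses the `ge`, `le` and `eq` comparison operators that Table service queries accept. Note that the range comparison is lexicographic, so it only works if the date part of the key sorts correctly as a string (ISO dates do).

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def partition_keys(entity_type, start, end):
    """One partition key per day in the inclusive range start..end."""
    days = (end - start).days + 1
    return [f"{entity_type}-{(start + timedelta(days=d)).isoformat()}" for d in range(days)]

def range_filter(entity_type, start, end, entity_id):
    """Option 2: a single query over a partition key range."""
    return (
        f"PartitionKey ge '{entity_type}-{start.isoformat()}' and "
        f"PartitionKey le '{entity_type}-{end.isoformat()}' and "
        f"RowKey eq '{entity_id}'"
    )

def fetch_parallel(keys, entity_id, fetch_one, workers=8):
    """Option 1: run the per-day point queries concurrently, keeping day order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda key: fetch_one(key, entity_id), keys)
        return [r for r in results if r is not None]

# Stand-in for the table: (PartitionKey, RowKey) -> entity.
fake_table = {("pump-2014-01-30", "e17"): {"v": 1}, ("pump-2014-02-01", "e17"): {"v": 3}}
keys = partition_keys("pump", date(2014, 1, 30), date(2014, 2, 1))
found = fetch_parallel(keys, "e17", lambda pk, rk: fake_table.get((pk, rk)))
```

Which option wins likely depends on the data: the parallel version issues many cheap point lookups, while the range query is a single request that may scan a lot of rows across partitions.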
I am looking at creating a Cassandra timeseries database for storing millions of series of daily data that can potentially have altogether up to 100B data points.
I looked at this article:
http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
This design is very sound. So essentially I can put the daily timestamps as columns and if necessary shard the columns by appending the day to the row.
Two questions I have:
I am looking at storing up to 20,000 timestamped (daily) columns. Is it even necessary to shard rows by, e.g., year with this number of columns? Is there any advantage or disadvantage to sharding rows to reduce the number of columns to 365 per year?
Another idea I have, rather than sharding columns by row, is to create a column family per year. This way, when accessing data from multiple years, I would have to query multiple column families rather than one and join the results on the client side. Would this approach speed things up or slow everything down?
If you are ever going to manage huge quantities of writes there is one problem with your approach.
Writing always to 1 key means that all writes for that key will go to one node. Basically you will use one node per day out of your cluster, so you might as well have one huge instance of Cassandra rather than bother setting up a cluster.
If your write frequency gets really high you might bring down the nodes responsible for that day/key.
My advice is to bucket one day into multiple rows that are used simultaneously. Time bucketing could be dangerous, as a sudden surge during one bucket could bring everything down.
You could create your bucket (row key) like this:
[ROW_BASE_NAME] + [DAY] + someHashFunction(timestamp) % 10
[ROW_BASE_NAME] + [DAY] + random.nextInt(10)
[ROW_BASE_NAME] + [DAY] + nextbucket <--- that is if you have a secure way to rotate the bucket yourself
There are many ways to do it. You could also use some element of the column being saved to do that.
But I think it is important to do so in order to leverage the whole Cassandra cluster at all times.
My answer is only valid for write-heavy applications or functionality, since you will have to use a multi_get (whole-row reads across multiple keys) to read all the data and reconstitute the whole timeline for that day.
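A sketch of the hash-based variant in Python. The `:` separator, the function names, and the choice of CRC32 as `someHashFunction` are assumptions; any stable hash would do. A stable hash is used rather than Python's built-in `hash()` so that every writer process picks the same bucket for the same timestamp.

```python
import zlib

BUCKETS = 10  # matches the "% 10" in the row key examples above

def bucket_row_key(base, day, timestamp, buckets=BUCKETS):
    """Row key = base + day + (stable hash of the timestamp) modulo the bucket count."""
    bucket = zlib.crc32(str(timestamp).encode("utf-8")) % buckets
    return f"{base}:{day}:{bucket}"

def all_row_keys(base, day, buckets=BUCKETS):
    """Every key a reader must pass to a multi-get to rebuild the day's timeline."""
    return [f"{base}:{day}:{b}" for b in range(buckets)]

key = bucket_row_key("ticks", "2015-10-29", 1446112800)
read_keys = all_row_keys("ticks", "2015-10-29")
```

Writes for one day now spread over up to ten rows, and a reader fetches all ten keys and merges the columns by timestamp on the client.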
I have a need to query a store of 200 million entities in Windows Azure. Ideally, I would like to use the Table Service, rather than SQL Azure, for this task.
The use case is this: a POST containing a new entity will be incoming from a web-facing API. We must query about 200 million entities to determine whether or not we may accept the new entity.
With the entity limit of 1,000: does this apply to this type of query, i.e. I have to query 1,000 at a time and perform my comparisons / business rules, or can I query all 200 million entities in one shot? I think I would hit a timeout in the latter case.
Ideas?
Expanding on Shiraz's comment about Table storage: Tables are organized into partitions, and then your entities are indexed by a Row key. So, each row can be found extremely fast using the combination of partition key + row key. The trick is to choose the best possible partition key and row key for your particular application.
For your example above, where you're searching by telephone number, you can make TelephoneNumber the partition key. You could very easily find all rows related to that telephone number (though, not knowing your application, I don't know just how many rows you'd be expecting). To refine things further, you'd want to define a row key that you can index into, within the partition key. This would give you a very fast response to let you know whether a record exists.
Table storage (actually Azure Storage in general: tables, blobs, queues) has a well-known SLA. You can execute up to 500 transactions per second on a given partition. With the example above, the query for rows for a given telephone number would equate to one transaction (unless you exceed 1,000 rows returned; to see all rows, you'd need additional fetches). Adding a row key to narrow the search would, indeed, yield a single transaction. So would inserting a new row. You can also batch up multiple row inserts, within a single partition, and save them in a single transaction.
For a nice overview of Azure Table Storage, with some good labs, check out the Platform Training Kit.
For more info about transactions within tables, see this msdn blog post.
The limit of 1000 is the number of rows returned from a query, not the number of rows queried.
Pulling all of the 200 million rows into the web server to check them will not work.
The trick is to store the rows with a key that can be used to check if the record should be accepted.
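To illustrate the idea of checking by key instead of scanning, here is a toy Python sketch. `FakeTable` is an in-memory stand-in for a table, and `may_accept`, the telephone number as partition key, and the fixed `"entity"` row key are all assumptions for illustration; the real business rule will decide what goes into the keys.

```python
class FakeTable:
    """In-memory stand-in for a table indexed by (PartitionKey, RowKey)."""

    def __init__(self):
        self._rows = {}

    def insert(self, partition_key, row_key, entity):
        self._rows[(partition_key, row_key)] = entity

    def get(self, partition_key, row_key):
        """Point lookup: one key pair in, one entity (or None) out."""
        return self._rows.get((partition_key, row_key))

def may_accept(table, telephone_number, row_key="entity"):
    """Accept the new entity only if no row already exists under its key."""
    return table.get(telephone_number, row_key) is None

table = FakeTable()
table.insert("555-0100", "entity", {"name": "existing"})
```

The check touches one row regardless of whether the table holds 200 rows or 200 million, which is why the 1,000-row result limit never comes into play.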