What is the maximum load Azure Table Storage can handle for a single account? For example, can it handle 2,000 reads/sec with responses in under a second? (Requests would be made from many different machines, and the payload of one entity is around 500 KB on average.) What are the practices to accommodate such a load: how many tables and partitions, given that there is only one type of entity and in principle there could be any number of tables/partitions? The RowKeys are uniformly distributed 32-character hash strings, and the PartitionKeys are also uniformly distributed.
Check the Azure Storage Scalability and Performance Targets documentation page. That should answer part of your question.
http://msdn.microsoft.com/en-us/library/azure/dn249410.aspx
I would suggest reading the best practices here: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx
The following are the scalability targets for a single storage account:
• Transactions – up to 5,000 entities/messages/blobs per second
• Single table partition – a table partition is all of the entities in a table with the same partition key value, and most tables have many partitions. The throughput target for a single partition is:
◦ Up to 500 entities per second
◦ Note that this is for a single partition, not a single table. Therefore, a table with good partitioning can process up to a few thousand requests per second (up to the storage account target).
As long as you correctly partition your data so you don't have a bunch of data all going to one machine, one table should be fine. Also keep in mind how you will query the data: if you don't use the index (PartitionKey + RowKey), the query will do a full table scan, which is very expensive on a large dataset.
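For illustration, a minimal sketch of a point write/read against such a layout, using the Python azure-data-tables SDK (the table name, connection string, and property names are hypothetical):

```python
from azure.data.tables import TableServiceClient

# Hypothetical connection string and table name; with uniformly distributed
# keys, load spreads naturally across many partition servers.
service = TableServiceClient.from_connection_string("<connection-string>")
table = service.get_table_client("entities")

entity = {
    "PartitionKey": "3f2b9c0a...",  # uniformly distributed hash (truncated here)
    "RowKey": "a1d4e8f2c6b0...",    # 32-character hash string
    "Payload": "...",               # entity body; note per-property and per-entity size limits
}
table.upsert_entity(entity)

# A point read on (PartitionKey, RowKey) uses the index; any other filter
# falls back to a scan, which is very expensive on a large dataset.
fetched = table.get_entity(
    partition_key=entity["PartitionKey"], row_key=entity["RowKey"]
)
```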
When I worked at Google, I used Bigtable a lot, and ever since I left I've been missing it. One of the nice things about Bigtable is that its data model has time-based versioning built in. In other words, your key structure was:
(row:string, column:string, time:int64) -> string
This meant that when you issued a query against Bigtable, you could get the most recent value, and you usually would. But you could also transparently wind the clock back to a specific timestamp. This is especially useful for tasks that involve repeatedly processing the same key:column combo.
Does any technology out there support the same timestamp-based data model? I see plenty of similar technologies that support extremely scalable key:column, value lookups, but I can't seem to find any support for this sort of time-based rewind feature.
I believe DynamoDB can satisfy your need here. It supports two different kinds of keys. The second one below is the one you'd want to use.
Partition key – A simple primary key, consists of a single attribute.
Partition key and sort key – Referred to as a composite primary key, which is composed of two attributes. The first attribute is the partition key (PK), and the second attribute is the sort key (SK).
The PK value is used as input to an internal hash function which determines the partition (physical storage internal to DynamoDB) where a record will be stored. All records with the same PK value are stored together, in sorted order by SK value.
Therefore, for your scenario you would make your SK a number attribute and populate it with timestamps. This allows you to write efficient queries using the comparison operators (<, >) and the between operator.
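As a minimal sketch of that pattern (assuming a hypothetical table named kv_versions with a string PK row_col and a numeric SK ts holding epoch timestamps), the Bigtable-style "latest value" and "rewind" reads might look like this with boto3:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table: PK "row_col" (string), SK "ts" (number, epoch timestamp).
table = boto3.resource("dynamodb").Table("kv_versions")

def latest(row_col):
    # Most recent version: sort descending on the SK, take one item.
    resp = table.query(
        KeyConditionExpression=Key("row_col").eq(row_col),
        ScanIndexForward=False,  # newest first
        Limit=1,
    )
    return resp["Items"][0] if resp["Items"] else None

def as_of(row_col, ts):
    # "Wind the clock back": newest version at or before the given timestamp.
    resp = table.query(
        KeyConditionExpression=Key("row_col").eq(row_col) & Key("ts").lte(ts),
        ScanIndexForward=False,
        Limit=1,
    )
    return resp["Items"][0] if resp["Items"] else None
```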
How does one implement live like/dislike (or view) counts in CouchDB/Couchbase in the most efficient way?
Yes, one can use a reduce to calculate the count each time, and on the front end just increment or decrement it, with a single API call to fetch results.
But every post may have, say, millions of views, likes, and dislikes.
If there are millions of such posts [on a social networking site], the index will simply be too big.
In terms of Cloudant, the described use case requires a bit of care:
Fast writes
Ever-growing data set
Potentially global queries with aggregations
The key here is to use an immutable data model: don't update any existing documents, only create new ones. This means that you won't have to suffer update conflicts as the load increases.
So a post is its own document in one database, and the likes are stored separately. For likes, you have a few options. The classic CouchDB solution would be to have a separate database with "likes" documents containing the post id of the post they refer to, with a view emitting the post id, aggregated by the built-in _count. This would be a pretty efficient solution in this case, but yes, indexes do occupy space on Couch-like databases (just as with any other database).
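As a hedged sketch of that classic approach (database, document, and view names are hypothetical), created and queried over CouchDB's plain HTTP API:

```python
import requests

COUCH = "http://localhost:5984"  # assumed local CouchDB; adjust host/credentials

# One immutable "like" document per like, referencing the post it belongs to
# (assumes a "likes" database already exists).
requests.post(f"{COUCH}/likes", json={"post_id": "post-123", "user": "alice"})

# Design doc: emit the post id, aggregate with the built-in _count reduce.
ddoc = {
    "views": {
        "by_post": {
            "map": "function (doc) { emit(doc.post_id, null); }",
            "reduce": "_count",
        }
    }
}
requests.put(f"{COUCH}/likes/_design/likes", json=ddoc)

# Like count for a single post: group by key, query one key.
resp = requests.get(
    f"{COUCH}/likes/_design/likes/_view/by_post",
    params={"group": "true", "key": '"post-123"'},
)
print(resp.json())  # {'rows': [{'key': 'post-123', 'value': <count>}]}
```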
A second option would be to exploit the _id field, as this is an index you get for free. If you prefix the like documents' ids with the liked post's id, you can do an _all_docs query with a start and end key to get all the likes for that post.
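A sketch of that second option (the id prefix scheme is hypothetical): if like documents get ids like post-123:like:<uuid>, the free _id index answers the range query directly:

```python
import requests

COUCH = "http://localhost:5984"  # assumed local CouchDB

# All likes for post-123, using only the built-in _id index.
# "\ufff0" is the conventional high-Unicode sentinel for prefix range queries.
resp = requests.get(
    f"{COUCH}/likes/_all_docs",
    params={"startkey": '"post-123:"', "endkey": '"post-123:\ufff0"'},
)
print(len(resp.json()["rows"]), "likes")
```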
Third, recent CouchDBs and Cloudant have the concept of partitioned databases, which, very loosely speaking, can be viewed as a formalised version of option two above: you nominate a partition key which is used to ensure a degree of storage locality behind the scenes, so all documents within the same partition are stored in the same shard. This means that retrieval is faster, and on Cloudant also cheaper. In your case you'd create a partitioned "likes" database with the partition key being the post id. Glynn Bird wrote up a great intro to partitioned DBs here.
Your remaining issue is that of ever-growth. At Cloudant, we'd expect to get to know you well once your data volume goes beyond single-digit TBs. If you expect to reach that kind of volume, it's worth tackling it up front. Any of the likes schemes above could most likely be time-boxed and aggregated once a quarter, month, or week, whatever suits your model.
Does anyone have an idea of the best way to implement continuous, incremental replication of some DB tables from Azure SQL DB to Azure SQL DB (PaaS)?
I have tried Data Sync preview (the schema would not load even after a couple of hours), and
Data Factory (Copy Data), which is fast but always copies the entire data set (duplicate records) rather than working incrementally.
Please suggest.
What is the business requirement behind this request?
1 - Do you have some reference data in database 1 and want to replicate that data to database 2?
If so, then use cross database querying if you are in the same logical server. See my article on this for details.
2 - Can you have a duplicate copy of the database in a different region? If so, use active geo-replication to keep the database in sync. See my article on this for details.
3 - If you just need a couple of tables replicated and the data volume is low, then just write a simple PowerShell program (workflow) to trickle-load the target from the source.
Schedule the program in Azure Automation on a timing of your choice. I would use a flag to indicate which records have been replicated.
Place the insert into the target and the update of the source flag in a transaction to guarantee consistency. This is a row-by-agonizing-row (RBAR) pattern.
You can even batch the records. Look into using SqlBulkCopy in the System.Data.SqlClient library of .NET. (A rough sketch of this pattern follows after this list.)
4 - Last but not least, Azure SQL Database now supports the OPENROWSET command. Unfortunately, in the cloud this feature is a read-only pattern, reading from a file in Blob Storage. Older versions of the on-premises command allow you to write to a file.
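As a rough sketch of the trickle-load pattern from option 3 (table, column, and flag names are hypothetical, and Python with pyodbc stands in for the suggested PowerShell):

```python
import pyodbc

# Hypothetical source and target connections; fill in real connection strings.
src = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=source.database.windows.net;DATABASE=db1;UID=...;PWD=...")
dst = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=target.database.windows.net;DATABASE=db2;UID=...;PWD=...")

# Pull only rows not yet replicated, marked by a hypothetical IsReplicated flag.
rows = src.execute("SELECT Id, Payload FROM dbo.Orders WHERE IsReplicated = 0").fetchall()

for row in rows:
    # Insert into the target, then flip the source flag. This is the
    # row-by-agonizing-row pattern; a true cross-database transaction needs
    # extra machinery, so the flag is flipped only after the target commit.
    dst.execute("INSERT INTO dbo.Orders (Id, Payload) VALUES (?, ?)", row.Id, row.Payload)
    dst.commit()
    src.execute("UPDATE dbo.Orders SET IsReplicated = 1 WHERE Id = ?", row.Id)
    src.commit()
```

For larger batches, SqlBulkCopy (or executemany here) avoids the per-row round trips.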
I hope these suggestions help.
Happy Coding.
John
The Crafty DBA
If you wanted to use Azure Data Factory, then in order to do incremental updates you would need to change your query to look at a created/modified date on the source table. You can then take that data and put it into a "staging table" on the destination side, use a stored proc activity to do your insert/update into the "real" table, and finally truncate the staging table.
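A hedged sketch of that watermark-plus-staging pattern (table, column, and object names are hypothetical; in Azure Data Factory the copy activity and stored proc activity would run the equivalent steps):

```python
import pyodbc

src = pyodbc.connect("...")  # source Azure SQL DB; fill in a real connection string
dst = pyodbc.connect("...")  # destination Azure SQL DB

# Watermark: the newest ModifiedDate already in the target
# (assumes the target has been seeded at least once).
last_watermark = dst.execute("SELECT MAX(ModifiedDate) FROM dbo.Orders").fetchval()

# Incremental source query: only rows created/modified since the last run.
changed = src.execute(
    "SELECT Id, Payload, ModifiedDate FROM dbo.Orders WHERE ModifiedDate > ?",
    last_watermark,
).fetchall()

# Land the delta in a staging table, merge into the real table, then truncate.
cur = dst.cursor()
cur.executemany(
    "INSERT INTO dbo.Orders_Staging (Id, Payload, ModifiedDate) VALUES (?, ?, ?)",
    [tuple(r) for r in changed],
)
cur.execute("""
    MERGE dbo.Orders AS t
    USING dbo.Orders_Staging AS s ON t.Id = s.Id
    WHEN MATCHED THEN UPDATE SET t.Payload = s.Payload, t.ModifiedDate = s.ModifiedDate
    WHEN NOT MATCHED THEN INSERT (Id, Payload, ModifiedDate)
        VALUES (s.Id, s.Payload, s.ModifiedDate);
""")
cur.execute("TRUNCATE TABLE dbo.Orders_Staging")
dst.commit()
```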
Hope this helps.
I was able to achieve cloud-to-cloud migration using Data Sync Preview from the Azure ASM portal.
Below are the limitations:
Maximum number of sync groups any database can belong to : 5
Characters that cannot be used in object names : The names of objects (databases, tables, columns) cannot contain the printable characters period (.), left square bracket ([) or right square bracket (]).
Supported limits on DB Dimensions
Reference: http://download.microsoft.com/download/4/E/3/4E394315-A4CB-4C59-9696-B25215A19CEF/SQL_Data_Sync_Preview.pdf
I can't help thinking that there aren't many use cases that Cassandra can serve more effectively than Druid. As a time-series store or a key-value store, Druid queries can be written to extract data however needed.
The argument here is more around justifying Druid than Cassandra.
Apart from Cassandra's fast writes, is there really anything else? Especially given Druid's real-time aggregation and querying capabilities, does it not outweigh Cassandra?
For a more straightforward question that can be answered: doesn't Druid provide a superset of features compared to Cassandra, and wouldn't one be better off using Druid right away, for all use cases?
Not at all; they are not comparable. We are talking about two very different technologies here. An easy way to see it is that Cassandra is a distributed storage solution, but Druid is a distributed aggregator (i.e. an awesome open-source OLAP-like tool (: ). The post you are referring to is, in my opinion, a bit misleading in the sense that it compares the two projects in the world of data mining, which is not Cassandra's focus.
Druid is not good at point lookups, at all. It loves time series, and its partitioning is mainly based on date-based segments (e.g. hourly/monthly segments that may be further sharded based on size).
Druid pre-aggregates your data based on pre-defined aggregators, which are numeric (e.g. sum the number of click events on your website with daily granularity). If one wants to store a key lookup from a string to, say, another string or an exact number, Druid is the worst solution one can look for.
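For illustration, this is roughly what such a pre-aggregated query looks like against Druid's native HTTP API (the broker host, datasource, and metric names are hypothetical):

```python
import requests

# Daily sum of click events over a hypothetical "website_events" datasource.
query = {
    "queryType": "timeseries",
    "dataSource": "website_events",
    "granularity": "day",
    "aggregations": [{"type": "longSum", "name": "total_clicks", "fieldName": "clicks"}],
    "intervals": ["2015-01-01/2015-02-01"],
}
resp = requests.post("http://broker:8082/druid/v2", json=query)
print(resp.json())  # one row per day: [{'timestamp': ..., 'result': {'total_clicks': ...}}, ...]
```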
Not sure this is really an SO type of question, but the easy answer is that it's a matter of use case. Simply put, Druid shines when it facilitates very fast ad hoc queries against data that has been ingested in real time. It's read-consistent now, and you are not limited to pre-computed queries to get speed. On the other hand, you can't write to the data it holds; you can only overwrite.
Cassandra (from what I've read; I haven't used it) is more of an eventually consistent data store that supports writes and does very nicely with pre-computation. It's not intended to continuously ingest data while providing real-time access to ad hoc queries on that same data.
In fact, the two could work together, as has been proposed on planetcassandra.org in "Cassandra as a Deep Storage Mechanism for Druid Real-Time Analytics Engine!".
It depends on the use case. For example, I was using Cassandra for aggregation purposes, i.e. stats like the aggregated number of domains with respect to users, departments, etc., and event trends (bandwidth, users, apps, etc.) with configurable time windows. Replacing Cassandra with Druid worked out very well for me, because Druid is super efficient with aggregations. On the other hand, if you need time-series data with eventual consistency, Cassandra is better; there you can get the details of the events.
A combination of Druid and Elasticsearch worked out very well to remove Cassandra from our big data infrastructure.
I am working with a system that takes in a write stream of 50 writes/sec at ~10 KB each, running 24 hours a day. The data is ingested via a messaging system into a SQL database, and then used in an overnight aggregation that takes around 15 hours to produce queryable data for an application.
This is currently all in sql, but we are moving to a new architecture.
The plan is to move the ingested writes into a distributed database like Cassandra or DynamoDB, and then perform the aggregation in Hadoop. This makes those parts of the system scalable.
My question is: when people have this architecture, where do they put the data after the writes and the aggregation have been performed, so that it can be queried?
In more detail:
The query model our application uses is quite complicated. To make the data queryable in Cassandra, we would have to denormalise it for every query; this is possible, but would mean a massive growth in data size. Is this normal practice? Or would you prefer to move the data back into SQL?
We could move the data into Redshift, but this seems to be more for ad hoc data analytics; its purpose is not to be the backend for a data analytics application. I also think the queries are too complicated in their current form to be written in an ORM, which is what would be required for Redshift.
Does this mean that I still need to put the data into SQL Server?
I am looking for examples of what people are doing at the moment.
I am sorry this question is a bit abstract, please do not close it, I will add more detail. I have read lots on big data, but most articles are about the ingestion of data using messaging / workers and distributed databases, but I have not found any that show what they do with this ingested data and how it is queried from the application.
*Answer to JosefN's comment: Yes, we are not planning to denormalise into a SQL DB. The choice is: denormalise into Cassandra for all clients and queries, which would probably mean 100x the current data size, as there will be so much duplication in the denormalised model. The other option is to store it as it is now, so that it is queryable; but then, is my only option a SQL DB?
*After more research, I have more information. The best options at the moment seem to be:
store back in SQL
denormalise in Cassandra
use one of the real-time SQL engines on top of Hadoop/HDFS, like Impala
DRPC with Storm
I do not have any experience with Impala or DRPC with Storm, so if anyone has any info on latency and the type of queries that can be performed with these, that would be great.
Please do not refer to documentation or blog posts, I know how these technologies work, I only want to know if someone has used them in production and has their own information on this subject. thanks
I would suggest moving the aggregated data into HDFS. Using Hive, which provides a relational view over data stored inside HDFS, you can very well use ad hoc SQL-like queries. At the same time, you will benefit from the parallelism of the MapReduce jobs that get invoked when you use Hive. This would help you decrease the query latencies you would have with an RDBMS. Also think about doing the aggregation jobs in Hadoop itself.
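A small sketch of such an ad hoc aggregation over a Hive table backed by HDFS (the host and table/column names are hypothetical), using the PyHive client:

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Hypothetical Hive table "events" over the data stored in HDFS.
conn = hive.connect(host="hive-server.example.com", port=10000)
cur = conn.cursor()

# Ad hoc SQL-like query; Hive compiles this into parallel MapReduce jobs.
cur.execute("""
    SELECT state, dt, COUNT(*) AS event_count
    FROM events
    GROUP BY state, dt
""")
for row in cur.fetchall():
    print(row)
```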
Since the data after aggregation is small and you are looking for good latency, keeping it in HDFS and querying it using Hive is not preferable.
I have seen people using HBase to store aggregated data and query it, but as you mentioned earlier, you will have to denormalize the data. For this case I would recommend writing the aggregated data back to MySQL and querying it there, if the aggregated data is not big.
I think one traditional approach is to run your Hadoop/Hive jobs to aggregate across all possible dimensions, then store the results in a key/value store like HBase, and look them up at runtime with a key based on the aggregation done (i.e. /state=NJ/dt=20131225/). This can cause an explosion in size, especially if there are many dimensions to roll up.
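A hedged sketch of that runtime lookup (the table, column family, and key layout are hypothetical), using the happybase HBase client:

```python
import happybase

# Hypothetical "rollups" table keyed by the pre-aggregated dimension path.
conn = happybase.Connection("hbase-thrift.example.com")
table = conn.table("rollups")

# The row key encodes the dimensions of a pre-computed aggregate,
# so the query is a single key/value lookup.
row = table.row(b"/state=NJ/dt=20131225/")
print(row)  # e.g. {b'agg:count': b'...', b'agg:sum': b'...'}
```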
If you want/need a more real-time solution as well, take a look at Twitter's Summingbird.