When I worked at Google, I used Bigtable a lot, and ever since I left I've been missing it. One of the nice things about Bigtable is that its data model has time-based versioning built in. In other words, the key structure is:
(row:string, column:string, time:int64) -> string
This meant that when you issued a query against Bigtable, you could get the most recent value, and usually you would. But you could also transparently wind the clock back to a specific timestamp. This is especially useful for tasks that involve repeatedly processing the same key:column combination.
Does any technology out there support the same timestamp-based data model? I see plenty of similar technologies that support extremely scalable key:column, value lookups, but I can't seem to find any support for this sort of time-based rewind feature.
I believe DynamoDB can satisfy your need here. It supports two different kinds of keys. The second one below is the one you'd want to use.
Partition key – A simple primary key, consists of a single attribute.
Partition key and sort key – Referred to as a composite primary key, which is composed of two attributes. The first attribute is the partition key (PK), and the second attribute is the sort key (SK).
The PK value is used as input to an internal hash function which determines the partition (physical storage internal to DynamoDB) where a record will be stored. All records with the same PK value are stored together, in sorted order by SK value.
Therefore, for your scenario you would make your SK a Number attribute and populate it with timestamps. This allows you to write efficient queries using the comparison operators (<, >) and the BETWEEN operator.
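As an illustration (my own sketch, not part of the original answer), the rewind pattern might look like this with the AWS SDK for Java v2; the table name versioned_cells, the partition key pk (holding a row#column string) and the numeric sort key ts are assumed names:

import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class VersionedLookup {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // "Rewind the clock": newest item at or before a given timestamp for one row:column pair.
        QueryRequest asOf = QueryRequest.builder()
                .tableName("versioned_cells")
                .keyConditionExpression("pk = :pk AND ts <= :asOf")
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder().s("row1#col1").build(),
                        ":asOf", AttributeValue.builder().n("1672531200").build()))
                .scanIndexForward(false)   // newest first
                .limit(1)                  // drop or raise the limit to get older versions too
                .build();

        QueryResponse response = ddb.query(asOf);
        response.items().forEach(System.out::println);
    }
}

Dropping the ts condition (while keeping scanIndexForward(false) and limit(1)) gives you the most recent value, which mirrors the default Bigtable read.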
How does one implement live likes and dislikes (or, say, a view count) in CouchDB/Couchbase in the most efficient way?
Yes, one can use a reduce to calculate the count each time, and on the front end just increment or decrement it so that only one API call is needed to get results.
But every post may have, say, millions of views, likes and dislikes.
If we have millions of such posts (on a social networking site), the index will simply be too big.
In terms of Cloudant, the described use case requires a bit of care:
Fast writes
Ever-growing data set
Potentially global queries with aggregations
The key here is to use an immutable data model--don't update any existing documents, only create new ones. This means that you won't have to suffer update conflicts as the load increases.
So a post is its own document in one database, and the likes are stored separately. For likes, you have a few options. The classic CouchDB solution would be to have a separate database with "likes" documents containing the post id of the post they refer to, with a view emitting the post id, aggregated by the built-in _count. This would be a pretty efficient solution in this case, but yes, indexes do occupy space on Couch-like databases (just as with any other database).
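As an illustration only (my own sketch, with assumed database name likes and document fields type and post_id), such a design document could look like:

{
  "_id": "_design/likes",
  "views": {
    "by_post": {
      "map": "function (doc) { if (doc.type === 'like') { emit(doc.post_id, null); } }",
      "reduce": "_count"
    }
  }
}

Querying GET /likes/_design/likes/_view/by_post?key="post-123" (with the key URL-encoded) then returns the aggregated like count for that post.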
Second option would be to exploit the _id field, as this is an index you get for free. If you prefix the like-documents' ids with the liked post's id, you can do an _all_docs query with a start and end key to get all the likes for that post.
Third - recent CouchDBs and Cloudant have the concept of partitioned databases, which, very loosely speaking, can be viewed as a formalised version of option two above: you nominate a partition key which is used to ensure a degree of storage locality behind the scenes -- all documents within the same partition are stored in the same shard. This means that it's faster to retrieve -- and on Cloudant, also cheaper. In your case you'd create a partitioned "likes" database with the partition key being the post id. Glynn Bird wrote up a great intro to partitioned DBs here.
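As a rough sketch of options two and three (my own examples with made-up ids, so treat the exact paths as illustrative): for option two, name the like documents something like post-123:like-<uuid> and fetch a post's likes with

GET /likes/_all_docs?startkey="post-123:"&endkey="post-123:\ufff0"

For option three, with post-123 as the partition key, the same view from above can be queried per partition:

GET /likes/_partition/post-123/_design/likes/_view/by_post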
Your remaining issue is that of ever-growth. At Cloudant, we'd expect to get to know you well once your data volume goes beyond single digit TBs. If you'd expect to reach this kind of volume, it's worth tackling that up-front. Any of the likes schemes above could most likely be time-boxed and aggregated once a quarter/month/week or whatever suits your model.
I need to know which one is better for performance:
TRIGGERS or TTL?
Or which amongst them is more convenient to use?
So while "better" is definitely subjective, what I'll do here is talk about what you need to do to enable both a TTL and a trigger in Cassandra. Then you can decide for yourself.
To implement a TTL, you need to modify the default_time_to_live property on your table to a value (in seconds), counted from write time, after which the data should be deleted. It defaults to zero (0), which effectively disables TTL for that table. The following CQL will set a TTL of 7 days:
ALTER TABLE my_keyspace_name.my_table_name WITH default_time_to_live = 604800;
Do note that TTL'd data still creates tombstones, so be careful of that. It works best for time series use cases, where data is clustered by a timestamp/timeUUID in DESCending order. The descending order part is key, as that keeps your tombstones at the "bottom" of the partition, so queries for recent data should never encounter them.
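For example, a time series table along those lines (an illustrative schema, not from the original answer) could combine descending clustering with the table-level TTL:

CREATE TABLE my_keyspace_name.sensor_readings (
    sensor_id    text,
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((sensor_id), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC)
  AND default_time_to_live = 604800;

With this layout, a query for the most recent readings of a sensor reads from the top of the partition and should not have to step over expired (tombstoned) rows.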
Implementing a trigger is somewhat more complex. For starters, you'll need to write a class which implements the ITrigger interface. That class will also need to override the augment method (note that this is the older, Cassandra 2.x form of the trigger API), like this:
import java.nio.ByteBuffer;
import java.util.Collection;

import org.apache.cassandra.db.ColumnFamily;
import org.apache.cassandra.db.RowMutation;
import org.apache.cassandra.triggers.ITrigger;

public class MyTrigger implements ITrigger {
    @Override
    public Collection<RowMutation> augment(ByteBuffer key, ColumnFamily update)
Inside that method, you'll want to grab the table and keyspace names from the update metadata and build your DELETE statement:
    {
        // Keyspace and table that the incoming write targets
        String keyspace = update.metadata().ksName;
        String table = update.metadata().cfName;

        // Start building a DELETE statement against that table
        StringBuilder strCQLHeader = new StringBuilder("DELETE FROM ")
                .append(keyspace)
                .append(".")
                .append(table)
                .append(" WHERE ");
...As well as some logic to grab the key values to delete/move and anything else you'd need to do here.
Next, you'll need to build that class into a JAR and copy it to the $CASSANDRA_HOME/lib/triggers/ dir.
Then (yep, not done yet) you'll need to add that trigger class to your table:
CREATE TRIGGER myTrigger ON my_table_name USING 'com.yourcompany.packagename.MyTrigger';
While you're free to choose whichever works best for you, I highly recommend using a TTL. Implementing triggers in Cassandra is one of those things that tends to be on a "crazy" level of difficulty.
I can't help but think that there aren't many use cases that can be served more effectively by Cassandra than by Druid. As a time series store or a key-value store, Druid queries can be written to extract data however needed.
The argument here is more around justifying Druid than Cassandra.
Apart from the fast writes in Cassandra, is there really anything else? Especially given the real-time aggregation and querying capabilities of Druid, does it not outweigh Cassandra?
For a more direct question that can be answered: doesn't Druid provide a superset of features compared to Cassandra, and wouldn't one be better off using Druid right away, for all use cases?
Not at all; they are not comparable. We are talking about two very different technologies here. An easy way to think about it is to see Cassandra as a distributed storage solution, but Druid as a distributed aggregator (i.e. an awesome open-source OLAP-like tool). The post you are referring to is, in my opinion, a bit misleading in the sense that it compares the two projects in the world of data mining, which is not Cassandra's focus.
Druid is not good at point lookups, at all. It loves time series, and its partitioning is mainly based on date-based segments (e.g. hourly/monthly segments that may be further sharded based on size).
Druid pre-aggregates your data based on pre-defined aggregators, which are numeric (e.g. sum the number of click events on your website with a daily granularity, etc.). If one wants to store a key lookup from a string to, say, another string or an exact number, Druid is the worst solution one could look for.
Not sure this is really an SO type of question, but the easy answer is that it's a matter of use case. Simply put, Druid shines when it facilitates very fast ad-hoc queries against data that has been ingested in real time. It's read-consistent now, and you are not limited to pre-computed queries to get speed. On the other hand, you can't update the data it holds; you can only overwrite it.
Cassandra (from what I've read; I haven't used it) is more of an eventually consistent data store that supports writes and does very nicely with pre-computation. It's not intended to continuously ingest data while also providing real-time access to ad-hoc queries against that same data.
In fact, the two could work together, as has been proposed on planetcassandra.org in "Cassandra as a Deep Storage Mechanism for Druid Real-Time Analytics Engine!".
It depends on the use case. For example, I was using Cassandra for aggregation purposes, i.e. stats like the aggregated number of domains with respect to users, departments, etc., and event trends (bandwidth, users, apps, etc.) with configurable time windows. Replacing Cassandra with Druid worked out very well for me, because Druid is super efficient with aggregations. On the other hand, if you need time series data with eventual consistency, where you can get the details of the events, Cassandra is better.
The combination of Druid and Elasticsearch worked out very well for removing Cassandra from our Big Data infrastructure.
What load can Azure Table Storage handle at most (with one account)? For example, can it handle 2,000 reads/sec where the response must come in less than a second (requests would be made from many different machines, and the payload of one entity is around 500 KB on average)? What are the practices to accommodate such a load (how many tables and partitions, given that there is only one type of entity and in principle there could be any number of tables/partitions)? Also, the RowKeys are uniformly distributed 32-character hash strings, and the PartitionKeys are also uniformly distributed.
Check the Azure Storage Scalability and Performance Targets documentation page. That should answer part of your question.
http://msdn.microsoft.com/en-us/library/azure/dn249410.aspx
I would suggest reading the best practices here: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx
The following are the scalability targets for a single storage account:
Transactions – up to 5,000 entities/messages/blobs per second
Single Table Partition – a table partition is all of the entities in a table with the same partition key value, and most tables have many partitions. The throughput target for a single partition is:
Up to 500 entities per second
Note that this is for a single partition, not a single table. Therefore, a table with good partitioning can process up to a few thousand requests per second (up to the storage account target).
As long as you correctly partition your data so you don't have a bunch of data all going to one machine, one table should be fine. Also keep in mind how you will query the data; if you don't use the index (PartitionKey + RowKey), it will have to do a full table scan, which is very expensive with a large dataset.
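To make the index point concrete, here is a minimal sketch (my own, using the classic azure-storage Java SDK with made-up table and key names) of a point query that hits the PartitionKey/RowKey index directly instead of scanning:

import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.table.CloudTable;
import com.microsoft.azure.storage.table.CloudTableClient;
import com.microsoft.azure.storage.table.TableOperation;
import com.microsoft.azure.storage.table.TableServiceEntity;

public class PointLookup {
    // Hypothetical entity class; real code would add the properties you actually store.
    public static class MyEntity extends TableServiceEntity {
        public MyEntity() { }
    }

    public static void main(String[] args) throws Exception {
        CloudStorageAccount account =
                CloudStorageAccount.parse(System.getenv("AZURE_STORAGE_CONNECTION_STRING"));
        CloudTableClient client = account.createCloudTableClient();
        CloudTable table = client.getTableReference("mytable");

        // Point query: both PartitionKey and RowKey are supplied, so no table scan is needed.
        TableOperation retrieve =
                TableOperation.retrieve("somePartitionHash", "someRowHash", MyEntity.class);
        MyEntity entity = table.execute(retrieve).getResultAsType();
        System.out.println(entity != null ? entity.getRowKey() : "not found");
    }
}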
Besides the well-known data structures which can be used in multi-threaded programs, such as the concurrent stack, concurrent queue, concurrent list and concurrent hash map:
Are there any other lesser-known but useful data structures which can be used in parallel/multi-threaded programming?
Even if they are just variants of the data structures listed above with some optimization, kindly do share.
Please do include some references.
Edit: Will keep listing what I find
1) ConcurrentCuckooHashing (Optimized version of ConcurrentHashing)
2) ConcurrentSkipList
I will try to answer with examples from JDK, if you do not mind:
Lists:
CopyOnWriteArrayList is a list that achieves thread-safe usage by recreating the backing array each time the list is modified;
Lists returned by Collections.synchronizedList() are thread-safe as they use exclusive locking for most operations (iteration is an exception);
Queues and stacks:
ArrayBlockingQueue. A queue that has a fixed size and blocks when there's nothing to pull out or no space to push in;
ConcurrentLinkedQueue is a lock-free queue based on Michael-Scott algorithm;
A concurrent stack, based on Treiber's algorithm. Surprisingly, I didn't find one in the JDK (a minimal sketch is given below);
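Since the JDK has no ready-made Treiber stack, here is a minimal sketch of one (my own illustration) built on a single AtomicReference to the top node:

import java.util.concurrent.atomic.AtomicReference;

public class TreiberStack<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    public void push(T value) {
        Node<T> newHead = new Node<>(value);
        Node<T> oldHead;
        do {
            oldHead = top.get();
            newHead.next = oldHead;
        } while (!top.compareAndSet(oldHead, newHead)); // retry if another thread won the race
    }

    public T pop() {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = top.get();
            if (oldHead == null) {
                return null; // empty stack
            }
            newHead = oldHead.next;
        } while (!top.compareAndSet(oldHead, newHead));
        return oldHead.value;
    }
}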
Sets:
Sets returned by the factory Collections.newSetFromMap() with a backing ConcurrentHashMap. With these sets you can be sure that their iterators are not prone to ConcurrentModificationException, and they use a striping technique for locking, so locking the whole set is not necessary to perform most operations. For example, when you want to add an element, only the part of the set determined by the element's hashCode() will be locked;
ConcurrentSkipListSet. The thread-safe set based on a Skip List data structure;
Sets, returned by Collections.synchronizedSet(). All points written about similar lists are applicable here.
Maps:
ConcurrentHashMap which I already mentioned and explained. Striping is based on item keys;
ConcurrentSkipListMap. Thread-safe map, based on skip list;
Maps, returned by Collections.synchronizedMap(). All points written about similar lists and sets are applicable here.
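As a quick usage sketch (my own, standard JDK only) showing how a few of the structures above are typically created:

import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class ConcurrentCollectionsDemo {
    public static void main(String[] args) throws InterruptedException {
        // Copy-on-write list: cheap lock-free reads, each write copies the backing array.
        List<String> cowList = new CopyOnWriteArrayList<>();
        cowList.add("a");

        // Bounded blocking queue: put() blocks when full, take() blocks when empty.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        queue.put("task-1");
        System.out.println(queue.take());

        // Concurrent set backed by a ConcurrentHashMap.
        Set<String> set = Collections.newSetFromMap(new ConcurrentHashMap<>());
        set.add("x");

        // Sorted, thread-safe map based on a skip list.
        ConcurrentMap<String, Integer> skipListMap = new ConcurrentSkipListMap<>();
        skipListMap.put("k", 1);

        System.out.println(cowList + " " + set + " " + skipListMap);
    }
}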
These were more or less standard data structures intended for multithreaded usage, which should be enough for most practical tasks. I also found some links you may find useful:
Wait-free red-black tree;
A huge article about concurrent data structures in general;
Concurrent structures and synchronization primitives used in .NET;
Articles about transactional memory - not really a data structure, but since your request serves academic purposes, worth reading;
One more article about transactional memory. Easy to read, but it is in Russian; if you can read it, it is definitely worth reading;