Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I need to know which one is better for performance: triggers or TTL?
Or which of them is more convenient to use?
So while "better" is definitely subjective, what I'll do here is talk about what you need to do to enable both a TTL and a trigger in Cassandra. Then you can decide for yourself.
To implement a TTL, you need to modify the default_time_to_live property on your table to a value (in seconds) starting from write-time to when that data should be deleted. It defaults to zero (0), which effectively disables TTL for that table. The following CQL will set a TTL of 7 days:
ALTER TABLE my_keyspace_name.my_table_name WITH default_time_to_live = 604800;
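As a quick sanity check on that figure, the value is simply seven days expressed in seconds. A minimal Python sketch follows; the helper name ttl_seconds is hypothetical and not part of any Cassandra driver:

```python
from datetime import timedelta


def ttl_seconds(days: int) -> int:
    """Convert a retention period in days to the seconds value CQL expects."""
    return int(timedelta(days=days).total_seconds())


print(ttl_seconds(7))  # 604800
```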
Do note that TTL'd data still creates tombstones, so be careful of that. It works best for time series use cases, where data is clustered by a timestamp/timeUUID in DESCending order. The descending order part is key, as that keeps your tombstones at the "bottom" of the partition, so queries for recent data should never encounter them.
Implementing a trigger is somewhat more complex. For starters, you'll need to write a class which implements the ITrigger interface. That class will also need to implement the augment method, like this:
public class MyTrigger implements ITrigger {
public Collection<RowMutation> augment(ByteBuffer key, ColumnFamily update)
Inside that method, you'll want to grab the table and keyspace names from the update metadata and start building your DELETE statement:
{
String keyspace = update.metadata().ksName;
String table = update.metadata().cfName;
StringBuilder strCQLHeader = new StringBuilder("DELETE FROM ")
.append(keyspace)
.append(".")
.append(table)
.append(" WHERE ");
...As well as some logic to grab the key values to delete/move and anything else you'd need to do here.
Next, you'll need to build that class into a JAR and copy it to the $CASSANDRA_HOME/lib/triggers/ dir.
Then (yep, not done yet) you'll need to add that trigger class to your table:
CREATE TRIGGER myTrigger ON my_table_name USING 'com.yourcompany.packagename.MyTrigger';
While you're free to choose whichever works best for you, I highly recommend using a TTL. Implementing triggers in Cassandra is one of those things that tends to be on a "crazy" level of difficulty.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 days ago.
When I worked at Google, I used Bigtable a lot, and ever since I left I've been missing it. One of the nice things about Bigtable is that its data model has time-based versioning built in. In other words, your key structure was:
(row:string, column:string, time:int64) -> string
This meant that when you issued a query against bt, you could get the most recent value, and you usually would. But you could also transparently wind the clock back to a specific timestamp. This is especially useful for tasks which involve repeatedly processing the same key:column combo.
Does any technology out there support the same timestamp-based data model? I see plenty of similar technologies that support extremely scalable key:column, value lookups, but I can't seem to find any support for this sort of time-based rewind feature.
I believe DynamoDB can satisfy your need here. It supports two different kinds of keys. The second one below is the one you'd want to use.
Partition key – A simple primary key, consists of a single attribute.
Partition key and sort key – Referred to as a composite primary key, which is composed of two attributes. The first attribute is the partition key (PK), and the second attribute is the sort key (SK).
The PK value is used as input to an internal hash function which determines the partition (physical storage internal to DynamoDB) where a record will be stored. All records with the same PK value are stored together, in sorted order by SK value.
Therefore, for your scenario you would define your SK as a number attribute and populate it with timestamps. This will allow you to write efficient queries using the comparison operators (<, >) and the between operator.
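To make the "wind the clock back" lookup concrete, here is a toy in-memory model of that layout, not DynamoDB code: each partition key maps to versions kept in timestamp order, and a read returns the newest version at or before a cutoff. The class and method names are hypothetical. Against a real table, the equivalent would presumably be a Query with a key condition on the sort key, descending order, and a limit of one item.

```python
from bisect import bisect_right, insort


class VersionedStore:
    """Toy model: partition key -> versions ordered by a numeric timestamp sort key."""

    def __init__(self):
        # pk -> (sorted list of timestamps, {timestamp: value})
        self._partitions = {}

    def put(self, pk, timestamp, value):
        timestamps, values = self._partitions.setdefault(pk, ([], {}))
        if timestamp not in values:
            insort(timestamps, timestamp)  # keep the sort key ordered
        values[timestamp] = value

    def get_as_of(self, pk, timestamp):
        """Return the newest value written at or before `timestamp`, or None."""
        timestamps, values = self._partitions.get(pk, ([], {}))
        idx = bisect_right(timestamps, timestamp)  # entries before idx are <= cutoff
        if idx == 0:
            return None
        return values[timestamps[idx - 1]]


store = VersionedStore()
store.put("row1#colA", 100, "v1")
store.put("row1#colA", 200, "v2")

print(store.get_as_of("row1#colA", 150))  # v1
print(store.get_as_of("row1#colA", 250))  # v2
print(store.get_as_of("row1#colA", 50))   # None
```

Combining row and column into one partition key ("row1#colA") is just one possible encoding chosen for this sketch.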
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 days ago.
Context:
I have a large number of CSV files containing logged information.
Some of the log files may "overlap" in that the same information may exist in two or more CSV files.
Not all of the information in the log files is required, so a Python script is used to extract only the relevant information for insertion into a SQLite database for easy querying.
The information in the log file that is "relevant" is a serial number, error ID, timestamp of event start, timestamp of event end, error description, and the longitude and latitude at the time of the error event.
Problem:
I want to ensure that information duplicated in the CSV files is not fed into the SQL database.
Since the data has a serial number, timestamp and location, this should aid in filtering repeated events.
I did think that I could create a hash for the relevant information from the CSV file, when Python parsed the file, and use this to determine if the "same" record already existed within the SQL database being added to but maybe this isn't very efficient?
I guess the alternative is for SQL to only add the information if it doesn't already exist, but I'm not entirely sure how to do this.
Which would be the most efficient way of achieving this?
I know how to hash the data (by putting it into a tuple) in Python and to not add a record if a hash already exists but I'm not sure whether SQL can already do this for me.
If you have a unique identifier in your various CSV files that can help you filter out duplicated information, it's quite easy to build a table with this ID as the primary key and use the ON CONFLICT clause in your insert query so the same row is not inserted several times. Here is an example of the table; of course you need other columns for the remaining data:
CREATE TABLE data (
id TEXT PRIMARY KEY
);
Then you can safely deduplicate the data with an insert statement like this:
INSERT INTO data (id)
VALUES (?)
ON CONFLICT DO NOTHING;
Duplicated data will just be ignored.
You can read this sqlite documentation page on the insert query type: https://www.sqlite.org/lang_insert.html
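Putting the two ideas from the question and this answer together, a minimal Python sketch might look like the following. It assumes the ID is a SHA-256 hash over the serial number, error ID, and the two timestamps; the column names, sample rows, and the record_id helper are all made up for illustration. Note that the ON CONFLICT ... DO NOTHING form needs SQLite 3.24.0 or newer; on older versions, INSERT OR IGNORE is the usual substitute.

```python
import hashlib
import sqlite3


def record_id(serial, error_id, start_ts, end_ts):
    """Build a deterministic ID from the fields that define 'the same event'."""
    raw = "|".join([serial, error_id, start_ts, end_ts])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (id TEXT PRIMARY KEY, serial TEXT, error_id TEXT, "
    "start_ts TEXT, end_ts TEXT)"
)

# Hypothetical rows as they might come out of overlapping CSV files.
rows = [
    ("SN001", "E42", "2020-01-01T10:00:00", "2020-01-01T10:05:00"),
    ("SN001", "E42", "2020-01-01T10:00:00", "2020-01-01T10:05:00"),  # duplicate
    ("SN002", "E07", "2020-01-02T08:30:00", "2020-01-02T08:31:00"),
]

with conn:  # commits on success
    conn.executemany(
        "INSERT INTO data (id, serial, error_id, start_ts, end_ts) "
        "VALUES (?, ?, ?, ?, ?) ON CONFLICT(id) DO NOTHING",
        [(record_id(*row), *row) for row in rows],
    )

count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(count)  # 2
```

If hashing feels like overkill, a composite PRIMARY KEY over the same columns should give the same effect without a separate id column.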
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
How does one implement live likes and dislikes [or, say, view counts] in CouchDB/Couchbase in the most efficient way?
Yes, one can use a reduce to calculate the count each time, and on the front end only increment or decrement locally after a single API call to fetch the result.
But every post will have, say, millions of views, likes and dislikes.
If we have millions of such posts [in a social networking site], the index will simply be too big.
In terms of Cloudant, the described use case requires a bit of care:
Fast writes
Ever-growing data set
Potentially global queries with aggregations
The key here is to use an immutable data model--don't update any existing documents, only create new ones. This means that you won't have to suffer update conflicts as the load increases.
So a post is its own document in one database, and the likes are stored separately. For likes, you have a few options. The classic CouchDB solution would be to have a separate database with "likes" documents containing the post id of the post they refer to, with a view emitting the post id, aggregated by the built-in _count. This would be a pretty efficient solution in this case, but yes, indexes do occupy space on Couch-like databases (just as with any other database).
Second option would be to exploit the _id field, as this is an index you get for free. If you prefix the like-documents' ids with the liked post's id, you can do an _all_docs query with a start and end key to get all the likes for that post.
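The range lookup in this second option can be illustrated with a small in-memory stand-in for the _id index. This is a sketch of the idea only, not a CouchDB client call: the ids, the ":" separator, and the likes_for helper are hypothetical, and Python's code-point ordering is only an approximation of CouchDB's collation. The high end key uses the common "\ufff0" suffix convention.

```python
from bisect import bisect_left, bisect_right

# Hypothetical like-document ids of the form "<post id>:<user id>", kept sorted
# the way a primary index keeps _id values sorted.
doc_ids = sorted([
    "post-1:alice",
    "post-1:bob",
    "post-10:carol",
    "post-2:alice",
    "post-2:dave",
    "post-2:erin",
])


def likes_for(post_id, ids):
    """Return every id that falls between the start key and end key for post_id."""
    start_key = post_id + ":"
    end_key = post_id + ":\ufff0"  # high code point, so any suffix sorts before it
    lo = bisect_left(ids, start_key)
    hi = bisect_right(ids, end_key)
    return ids[lo:hi]


print(likes_for("post-1", doc_ids))       # ['post-1:alice', 'post-1:bob']
print(len(likes_for("post-2", doc_ids)))  # 3
```

Including the separator in the start key is what keeps "post-10" from being counted as a like for "post-1".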
Third - recent versions of CouchDB and Cloudant have the concept of partitioned databases, which very loosely speaking can be viewed as a formalised version of option two above, where you nominate a partition key which is used to ensure a degree of storage locality behind the scenes -- all documents within the same partition are stored in the same shard. This means that it's faster to retrieve -- and on Cloudant, also cheaper. In your case you'd create a partitioned "likes" database with the partition key being the post-id. Glynn Bird wrote up a great intro to partitioned DBs here.
Your remaining issue is that of ever-growth. At Cloudant, we'd expect to get to know you well once your data volume goes beyond single digit TBs. If you'd expect to reach this kind of volume, it's worth tackling that up-front. Any of the likes schemes above could most likely be time-boxed and aggregated once a quarter/month/week or whatever suits your model.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 5 years ago.
Assume that I have the following table CQL (well a fragment of the table):
CREATE TABLE order (
order_id UUID PRIMARY KEY,
placed timestamp,
status text,
)
Now if status could be one of PLACED, SHIPPED, or DELIVERED as an enum, I want to find all of the orders that are in PLACED status to process them. Given there are millions of orders and all orders are ultimately ending up in DELIVERED status a materialized view doesn't feel like the right solution to the problem. I am wondering what ideas there are to solve the problem of this low cardinality index without passing through the whole data set. Ideas?
I would recommend a table like
CREATE TABLE order_by_status (
order_id UUID,
placed timestamp,
status text,
PRIMARY KEY ((status), order_id)
)
Then you can iterate through the results of SELECT * FROM order_by_status WHERE status = 'PLACED';. Millions shouldn't be too much of an issue, but it would be good to prevent the partition from getting too large by also partitioning by some date window.
CREATE TABLE order_by_status (
order_id UUID,
placed timestamp,
bucket text,
status text,
PRIMARY KEY ((status, bucket), order_id)
)
Here bucket is a string derived from the placed timestamp in YYYY-MM form, such as 2017-10. You might want to stay away from MVs for a little while yet, as they have some bugs in the current version. I would also recommend against secondary indexes for this model; using a second table and issuing inserts to both is going to be your best solution.
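For illustration, the bucket value could be derived on the application side roughly like this. It is a minimal sketch assuming monthly buckets; the function names bucket_for and buckets_between are hypothetical. The second helper lists the buckets you would query one by one when scanning PLACED orders across a date range.

```python
from datetime import datetime


def bucket_for(placed: datetime) -> str:
    """Return the YYYY-MM bucket for an order's placed timestamp."""
    return placed.strftime("%Y-%m")


def buckets_between(start: datetime, end: datetime) -> list:
    """List every monthly bucket from start to end, inclusive."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return buckets


print(bucket_for(datetime(2017, 10, 5)))
# 2017-10
print(buckets_between(datetime(2017, 11, 1), datetime(2018, 2, 1)))
# ['2017-11', '2017-12', '2018-01', '2018-02']
```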
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
What load can Azure Table storage handle at most (one account)? For example, can it handle 2000 reads/sec, where the response must come in less than a second (requests would be made from many different machines, and the payload of one entity is something like 500Kb on average)? What are the practices for accommodating such a load (how many tables and partitions), given that there is only one type of entity and in principle there could be any number of tables/partitions? Also, the RowKeys are uniformly distributed 32-character hash strings, and the PartitionKeys are also uniformly distributed.
Check the Azure Storage Scalability and Performance Targets documentation page. That should answer part of your question.
http://msdn.microsoft.com/en-us/library/azure/dn249410.aspx
I would suggest reading the best practices here: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx
The following are the scalability targets for a single storage account:
•Transactions – Up to 5,000 entities/messages/blobs per second
•Single Table Partition – a table partition is all of the entities in a table with the same partition key value, and most tables have many partitions. The throughput target for a single partition is:
◦Up to 500 entities per second
◦Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to a few thousand requests per second (up to the storage account target).
As long as you correctly partition your data so you don't have a bunch of data all going to one machine, one table should be fine. Also keep in mind how you will query the data: if you don't use the index (PartitionKey|RowKey), it will have to do a full table scan, which is very expensive with a large dataset.
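As a rough back-of-the-envelope check using only the targets quoted above (5,000 entities/sec per account, 500 entities/sec per partition), the 2000 reads/sec from the question would need the load spread evenly over at least four partitions. The sketch below assumes perfectly uniform access and ignores payload size and bandwidth limits; the function name min_partitions is hypothetical.

```python
import math

ACCOUNT_TARGET = 5000    # entities/sec per storage account (figure quoted above)
PARTITION_TARGET = 500   # entities/sec per table partition (figure quoted above)


def min_partitions(required_rate: int) -> int:
    """Smallest number of evenly loaded partitions needed to carry required_rate."""
    if required_rate > ACCOUNT_TARGET:
        raise ValueError("required rate exceeds the single-account target")
    return math.ceil(required_rate / PARTITION_TARGET)


print(min_partitions(2000))  # 4
```

With uniformly distributed PartitionKeys, as described in the question, the real number of partitions would be far higher than this minimum, so the per-partition target is unlikely to be the bottleneck.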