OK, we all know that in traditional SQL databases you have to escape data when inserting it, so that there is no SQL injection. In Cassandra NoSQL databases, are there any problems like that? Do we need to escape any data before we insert it into Cassandra? Are there any security-related things I need to know?
An injection attack is much less of a concern with CQL for a number of reasons. For one, Cassandra will only execute one complete statement per query so any attack that, for example, attempted to concatenate a DROP, DELETE, or INSERT onto a SELECT, would fail. And, with the exception of BATCH (which requires a collection of complete INSERT, UPDATE, and DELETE statements), there are no nested queries.
That said, you should always sanitize your input, and you should make use of prepared statements, rather than constructing complete query statements in code.
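To make the advice about prepared statements concrete, here is a minimal sketch in Python. It only builds statement text, so it runs without a cluster. The table name users_by_email, the email column, and the sample input are assumptions for illustration. With the DataStax Python driver, the fixed statement text would typically go to session.prepare() and the values to session.execute().

```python
# Sketch: statement text built by concatenation vs. fixed text with bound values.
# The table and column names below are hypothetical.

SAFE_CQL = "SELECT * FROM users_by_email WHERE email = ?"

# Hypothetical user input that tries to change the statement.
HOSTILE = "x' ALLOW FILTERING --"


def unsafe_statement(email: str) -> str:
    """Splice user input directly into the CQL text (what to avoid)."""
    return "SELECT * FROM users_by_email WHERE email = '" + email + "'"


def bound_statement(email: str) -> tuple:
    """Keep the CQL text fixed and pass the input separately as a bound value."""
    return SAFE_CQL, (email,)


if __name__ == "__main__":
    # The concatenated text now contains tokens supplied by the user.
    print(unsafe_statement(HOSTILE))
    # The bound form leaves the statement text untouched.
    print(bound_statement(HOSTILE))
```

In the concatenated version the user's input becomes part of the statement that the server parses. In the bound version the server only ever sees the same statement text, and the input travels as data.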
Related
I am designing the data model of our Scylla database. For example, I created a table, intraday_history, with fields:
CREATE TABLE intraday_history (
    id bigint,
    timestamp_seconds bigint,
    timestamp timestamp,
    sec_code text,
    open float,
    high float,
    low float,
    close float,
    volume float,
    trade int,
    PRIMARY KEY ((id, sec_code), timestamp_seconds, timestamp)
);
My id is a 64-bit integer generated in the Twitter Snowflake style. My problem is how to use WHERE without always providing the id (most of the time I will filter on the bigint timestamp). I also encounter this problem in other tables. Because the id is unique, I cannot query a range of timestamps.
Is it okay if, say, for a bunch of tables on my one node, I use an ID like cluster1, so that when I query I can just use id=cluster1? But then the id loses its uniqueness.
ALLOW FILTERING comes up as an option here. But I keep reading that it is bad practice, especially when dealing with millions of queries.
I'm using ScyllaDB, a C++ implementation that is compatible with Apache Cassandra.
In Cassandra, as you've probably already read, the queries drive the table design, not the other way around. So in your situation, where you want to query by a different filter, you would ideally create another Cassandra table. That's the optimal way. Partition keys are required in filters unless you add ALLOW FILTERING, but that isn't recommended, as it will perform a DC-wide (possibly cluster-wide) search, and you're still subject to timeouts. You could consider using indexes or materialized views, which are basically Cassandra-maintained tables populated by the base table's changes. That would save you the trouble of having the application populate multiple tables (Cassandra would do it for you). We've had some luck with materialized views, but with either of these components there can be side effects, as with any other Cassandra table (inconsistencies, latencies, additional rules, etc.). I would say do a bit of research to determine the best approach, but most likely ALLOW FILTERING isn't the best choice (especially for high-volume and frequent queries, or with tables containing large amounts of data). You could also investigate Solr if that's an option, depending on what you're filtering.
Hope that helps.
-Jim
I'm learning Cassandra now and I understand I should make a table for each query. I'm not sure when I should make separate tables or materialized views. For example, I have the following queries for users and posts:
users_by_id
users_by_email
users_by_session_key
posts_by_id
posts_by_category
posts_by_user
Should I always use materialized views?
It seems to me that if I want to keep Posts or Users consistent across queries, then I have to use materialized views. However, I read that materialized views have a read-before-write latency.
On the other hand, if I use different tables, am I supposed to make 3 inserts every time a new post is created? I noticed that I get the error "batch with conditions cannot span multiple tables", which means I have to insert into each separate table one at a time, which can cause consistency problems if one of the queries fails. (A batch statement would fail all 3 if one of them failed.)
So, since it makes sense to have consistency, then it seems to me that I will always want to use materialized views, and have to take the read before write penalty.
I guess my other question is when would it ever be okay for data to be inconsistent?
So I'm hoping someone can provide more clarity on how to handle multiple queries in Cassandra for a "theoretical model" like Users or Posts. Should I be using materialized views? If I use 3 different tables for each model, how do I keep them consistent? Just hope that none of the 3 inserts fails? That doesn't seem right.
Read my deep dive blog post for all the trade-offs when using materialized views. Once you understand the trade-offs, choose wisely: http://www.doanduyhai.com/blog/?p=1930
No, you shouldn't always use materialized views. The perfect solution is an interface application in front of your database, in which you handle all your different tables. But there are also some use cases for materialized views: if you don't have the time to build that application but you need this feature, use materialized views. You have a performance trade-off, but in this scenario time is more important. If you also need real updates instead of upserts on all tables, use materialized views.
Batch is useful for buffering, or for putting data sets with the same partition key together. For example, you have a high-throughput application. Between your heartbeats, or while another query with QUORUM is executing, you get 10 other events with the same partition key. But you don't execute them yet, because you're waiting for a successful response. If a success comes back, you execute them as one batch query. But please keep in mind: only use a batch for the same partition key.
Generally, remember one important thing: Cassandra has an eventual consistency model. That means that if you use QUORUM, you will get consistency, but not every time. If your application needs full consistency, not just eventual consistency, use another solution, e.g. SQL with sharding. Cassandra is optimized for writes, and you will only be happy with it when you use Cassandra's features.
Some performance tips:
If you need better consistency: use QUORUM, never use ALL. And, generally, write your queries as standalone statements. Sometimes batch is useful. Don't execute queries with ALLOW FILTERING. Don't use token ranges or the IN operator on partition keys :)
How do I write subqueries/nested queries in Cassandra? Is this facility provided in CQL?
Example I tried:
cqlsh:testdb> select itemname from item where itemid = (select itemid from orders where customerid=1);
It just throws the following error -
Bad Request: line 1:87 no viable alternative at input ';'
Because of its distributed nature, Cassandra has no support for RDBMS-style joins. You have a few options for when you want something like a join.
One option is to perform separate queries and then have your application join the data itself. This makes sense if the data is relatively small and you only have to perform a small number of queries. Based on the example you gave above, this would probably be a good solution for you.
For more complicated joins, the usual strategy is to denormalize the data and store a materialized view of the join. The advantage is that fetching this data will be much faster than having to build the join in your application every time you need it. The cost is that you now have multiple places where you are storing the same data, and you will need to keep it all in sync. You can either update all your views when new data comes into the system, or you can have a periodic batch job that rebuilds them.
You might find this article useful: Do You Really Need SQL to Do It All in Cassandra? It's a bit old, but its principles still apply.
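A minimal sketch of that application-side join, using in-memory stand-ins instead of a live cluster. The sample rows and the helper names are made up for illustration; in a real application each helper would run the CQL statement shown in its comment.

```python
# In-memory stand-ins for the orders and item tables from the question.
ORDERS = [
    {"customerid": 1, "itemid": 10},
    {"customerid": 1, "itemid": 11},
    {"customerid": 2, "itemid": 12},
]
ITEMS = {10: "keyboard", 11: "mouse", 12: "monitor"}


def item_ids_for_customer(customer_id: int) -> list:
    """Stand-in for: SELECT itemid FROM orders WHERE customerid = ?"""
    return [row["itemid"] for row in ORDERS if row["customerid"] == customer_id]


def item_names(item_ids: list) -> list:
    """Stand-in for one lookup per id: SELECT itemname FROM item WHERE itemid = ?"""
    return [ITEMS[item_id] for item_id in item_ids if item_id in ITEMS]


def item_names_for_customer(customer_id: int) -> list:
    """The 'join': the result of the first query feeds the second."""
    return item_names(item_ids_for_customer(customer_id))


if __name__ == "__main__":
    print(item_names_for_customer(1))
```

The first query replaces the inner SELECT from the failing statement, and the second replaces the outer one. This stays cheap only while the first query returns a small number of ids.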
I am looking for Cassandra/CQL's cousin of the common SQL idiom of INSERT INTO ... SELECT ... FROM ... and have been unable to find anything to do such an operation programmatically or in CQL. Is it just not supported?
My use case is to do a reasonably bulky copy from one table to another. I don't need any particular concurrent guarantees, but it's a lot of data so I'd like to avoid the additional network overhead of writing a client that retrieves data from one table, then issues batches of inserts into the other table. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (Another alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.
A lot of people know that it is important to use parameterized queries to prevent SQL injection attacks.
Parameterized queries are also much faster in SQLite and Oracle when doing online transaction processing, because the query optimizer doesn't have to reparse every parameterized SQL statement before executing it. I've seen SQLite become 3 times faster with parameterized queries, and Oracle can become 10 times faster in some extreme cases with a lot of concurrency.
How about other databases like MySQL, MS SQL, DB2 and PostgreSQL?
Is there an equal difference in performance between parameterized queries and literal queries?
With respect to MySQL, MySQLPerformanceBlog reported some benchmarks of queries per second with non-prepared statements, prepared statements, and query-cached statements. Their conclusion is that prepared statements are actually 14.5% faster than non-prepared statements on MySQL. Follow the link for details.
Of course the ratio varies based on the query.
Some people suppose that there's some overhead because you're making an extra round-trip from the client to the RDBMS -- one to prepare the query, the second to pass parameters and execute the query.
But the reality is that these are false assumptions made without actually measuring. I've never heard of prepared statements being slower in any brand of database.
I've nearly always seen an increase in speed - but only the first time generally. After the plans are loaded and cached I would have surmised that the various db engines will behave the same for either type.
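Since the answers stress measuring rather than assuming, here is a small sketch for trying the comparison locally with SQLite through Python's standard sqlite3 module. The table t, the helper names, and the row count are arbitrary choices for illustration. The timings it prints depend on the machine and the workload, so no particular ratio is implied.

```python
import sqlite3
import time


def insert_literal(conn, values):
    """Build a new SQL text for every row, so each statement is parsed separately."""
    for v in values:
        conn.execute(f"INSERT INTO t (n) VALUES ({int(v)})")


def insert_parameterized(conn, values):
    """Reuse one SQL text; only the bound value changes between rows."""
    conn.executemany("INSERT INTO t (n) VALUES (?)", ((v,) for v in values))


def timed(insert_fn, values):
    """Run insert_fn against a fresh in-memory database; return (seconds, row count)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (n INTEGER)")
    start = time.perf_counter()
    insert_fn(conn, values)
    conn.commit()
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, count


if __name__ == "__main__":
    rows = range(50_000)
    for fn in (insert_literal, insert_parameterized):
        seconds, count = timed(fn, rows)
        print(f"{fn.__name__}: {count} rows in {seconds:.3f}s")
```

Both functions insert the same rows, so any difference in elapsed time comes from how the statements are submitted. The same structure could be adapted to another database driver to check the claims for that engine.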