I've started learning Cassandra, and I want to begin with its data model, but I don't know where to start. I have read many web pages and the Cassandra documentation (http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html),
but I am really confused. The documentation only gives examples that look very similar to a relational database, without covering super columns or the other concepts that appear on other sites.
I need a straightforward, step-by-step tutorial for data modeling.
Regards
Although CQL looks similar to SQL, they are very different. CQL is very limited compared to SQL and you need to understand how data is stored and retrieved in Cassandra based on the partition key and clustering columns. Until you understand how the keys work, you will be lost.
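To make the role of the keys a bit more concrete, here is a minimal in-memory sketch in Python. It is not Cassandra code; it only imitates the idea that the partition key selects a partition and the clustering column orders the rows inside it. The table layout (`user_id` as partition key, `posted_at` as clustering column) and the `ToyTable` class are assumptions made up for illustration.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class ToyTable:
    """Toy model of a table declared with PRIMARY KEY ((user_id), posted_at)."""

    def __init__(self):
        # partition key -> rows kept sorted by the clustering column
        self.partitions = defaultdict(list)

    def insert(self, user_id, posted_at, body):
        # The partition key picks the partition; the clustering column
        # decides where the row sits inside that partition.
        insort(self.partitions[user_id], (posted_at, body))

    def select(self, user_id, since=None):
        """Read one partition, optionally from a clustering value onward."""
        rows = self.partitions.get(user_id, [])
        start = 0 if since is None else bisect_left(rows, (since,))
        return rows[start:]


table = ToyTable()
table.insert("alice", 3, "third post")
table.insert("alice", 1, "first post")
table.insert("bob", 2, "hello")

print(table.select("alice"))           # rows come back ordered by posted_at
print(table.select("alice", since=2))  # range read inside one partition
```

Note what the sketch cannot do cheaply: there is no way to ask for "all posts since 2" across every user without visiting every partition, which is roughly why queries that do not name the partition key are restricted in CQL.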
I haven't seen a very good overview of Cassandra on the web, but if you're willing to spring for a book, a good introduction to Cassandra and how it works is called "Apache Cassandra Hands-On Training Level One".
I am developing an ERP system using AWS Lambda (Node.js) with DynamoDB. I am new to DynamoDB. We are using multiple tables.
One small use case looks like this:
I have a company table that stores each company's address. The address has city, state, and country fields, which hold the primary keys of the corresponding master tables (City, State, Country).
So I need to fetch the city, state, and country as well when I fetch the company details, all in a single query. Filtering and sorting should also work on fields from the master tables.
Can anyone tell me whether there is a way to implement this? If possible, please also provide some Node.js query code for reference.
If possible, also point me to some official documentation so I can review it.
Thanks for any help :)
DynamoDB is a fantastic technology. However, it is completely different than SQL databases. As such, you'll need to approach the entire process differently than you would with a SQL database.
Entire books have been written on the subject, so I won't try to cover all the differences here. However, I'll pass along a few pointers to get you in the right direction.
Single table design is considered a best practice in DynamoDB. Having multiple tables isn't necessarily an anti-pattern, but spend some time investigating why this is the case. A good place to start: DynamoDB does not have a SQL join operation!
The DynamoDB Book is the best book on this topic, hands down. Spending just a few hours reading this book will drastically reduce the learning curve.
Watch AWS Re:Invent talks on DynamoDB. This talk from 2019 is a great place to start. (This talk is from Alex DeBrie, the author of the DynamoDB Book).
The rules of SQL databases do not apply when working with NoSQL databases. In other words, forget everything you know about SQL databases when working with NoSQL. It's a paradigm shift, and I find it best to set aside any preconceived notions about databases and approach the topic with an open mind. Easier said than done, but it's my advice anyway :)
DynamoDB requires that you build your data model to match the use cases of your application, aka your application "access patterns". Throwing your data into DynamoDB before thinking about how you need to access the data is the wrong approach. Instead, ask yourself "what are all the ways my application needs to get data?" and design your data model accordingly.
NoSQL data modeling has a steep learning curve, but it's awesome when you get it right! Good luck!
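As a rough illustration of designing around access patterns, here is a small Python sketch that uses a plain dictionary as a stand-in for a single DynamoDB table. It does not call AWS at all. The key names (`PK`, `SK`), the `COMPANY#<id>` key format, the attribute names, and the sample companies are all assumptions invented for this example. It shows one common approach to the company/address use case: copy the city, state, and country names onto the company item when writing, so that reading needs no join.

```python
# In-memory stand-in for a single DynamoDB table: (PK, SK) -> item.
table = {}


def put_item(item):
    table[(item["PK"], item["SK"])] = item


def get_company(company_id):
    """Access pattern 1: fetch a company with its address in one lookup."""
    return table.get((f"COMPANY#{company_id}", "PROFILE"))


def companies_in_country(country):
    """Access pattern 2: filter by country, sorted by city.

    A real table would typically serve this from a secondary index;
    the scan below only imitates the result.
    """
    matches = [i for i in table.values() if i.get("country") == country]
    return sorted(matches, key=lambda i: i["city"])


# City, state and country names are copied onto the company item at write
# time, so no join against master tables is needed at read time.
put_item({"PK": "COMPANY#1", "SK": "PROFILE", "name": "Acme",
          "city": "Pune", "state": "Maharashtra", "country": "India"})
put_item({"PK": "COMPANY#2", "SK": "PROFILE", "name": "Globex",
          "city": "Delhi", "state": "Delhi", "country": "India"})
put_item({"PK": "COMPANY#3", "SK": "PROFILE", "name": "Initech",
          "city": "Austin", "state": "Texas", "country": "USA"})

print(get_company("1")["city"])
print([c["name"] for c in companies_in_country("India")])
```

The trade-off is that a renamed city has to be updated on every company item that carries it, which is usually acceptable for data that changes as rarely as place names.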
I was watching one of the Cassandra videos on DataStax Academy. One concept they talk a lot about is query-driven modelling. This makes sense when you know your queries upfront, as in the KillrVideo example.
However, in big data cases, I hope I am not the only one to think that we barely know what kind of queries analysts will perform on the data 5 months or one year down the road.
If this is the case, what are the best practices for storing your data? My guess is that for advanced querying of such data, you likely will end up loading your data into Spark. But what do I have to consider at storage time to avoid operational troubles and troubles at retrieval time? What retrieval approaches are less problematic?
Cassandra is also a database for analytics use cases, but not always for ad hoc analytics (one-off reports whose queries will never be run again).
For those use cases a Hadoop cluster is a better option for you (maybe Parquet on Hadoop). If you see that queries will be run over and over again, Cassandra is your friend. Generally you can use Cassandra for 50 to 70% of your use cases. With column keys and secondary indexes you can run a really wide spectrum of queries. Go to your analytics people and ask them what they need. Then create your tables :)
DataStax has a course on doing analysis on Cassandra with Apache Spark.
I'm trying to migrate our Postgres database, which contains millions of clicks (a few years of click history), to a better-performing system. Our current analytic queries running on Postgres take forever to complete and degrade the performance of the whole database. I've been investigating possible solutions and have decided to look closely at two options:
HBase with Hadoop (mapreduce)
Cassandra with Spark
I have worked with NoSQL before, but never for analytical purposes. At first I was a bit disappointed by how few analytical query options those databases provide (no groupBy, count, ...). After reading many articles and presentations I found out that I need to design my schema according to how I intend to read my data, and that the storage layer is separated from the query layer. This adds more redundant data, but in the NoSQL world that is not an issue.
Eventually I found a nice Grails plugin, cassandra-orm, which internally encapsulates an orderBy feature using Cassandra counters. However, I'm still worried about how to make this design extensible. What about queries that will come in the future, which I have no clue about today? How can I design my schema to be prepared for them?
One option would be to use Spark, but Spark doesn't provide data in real time.
Could you give me some insight or advice on the best options for big data analysis? Should I use a combination of real-time queries and pre-aggregated ones?
Thanks,
If you are looking at near real time data analysis, Spark + HBase combination is one of the solutions.
If you can compromise on throughput, the Solr + Cassandra combination from DataStax can be used.
I am using Solr + Cassandra from DataStax for my use case, which does not require real-time processing. Search performance is not that great with this combo, but I am OK with the throughput.
The Spark + HBase combination seems promising. Depending on your business requirements and expertise, you can choose the right combination.
If you want the ability to analyse data in near-real-time with complete flexibility in query structure, I think your best bet would be to throw a scalable indexing engine such as Elasticsearch or Solr into your polyglot persistence mix. You could still use Cassandra as the primary data store and then index those fields you're interested in querying and/or aggregating.
Have a look at Datastax Enterprise which bundles together Cassandra and Solr. Also have a look at Solr's Stats component and its faceting capabilities. These, combined with the indexing engine's rich query language, are handy for implementing many analytics use cases.
If your data set consists of a few million records 'only', I think you'll be able to get some good response times from Solr or ES on a reasonably spec'ed cluster.
We currently use MongoDB as the primary store for a big online sales site, and we are now focusing on scaling out across multiple machines.
The site backend is written in Node.js and we use Mongoose as the ODM.
I see many blog posts praising Cassandra, and I am starting to think about switching to it. But I am still not sure whether this is a good decision, because I haven't found any good ODM/ORM library for Cassandra and Node.js (writing raw queries can be a pain, and writing a well-tested ORM/ODM can be a time-consuming task). So I am not sure how much benefit I would get from the switch. We use Elasticsearch as our search engine, and it works excellently in combination with MongoDB; I am asking myself whether it would work as well with Cassandra.
If you have any experience with this, it would be very helpful.
Thank you!
Cassandra is a very nicely designed database that can handle a lot of scenarios. MongoDB is also a really good DB engine. So let me just compare a couple of main points for you.
Always on system
Cassandra is really great when you need to provide 24x7 operations in multiple data centers. If you have more than one data center with multiple servers in each of them, then Cassandra is great for you. Cassandra can sync writes to more than one data center and maintain the desired data consistency across complex setups. Recovery and re-sync are also quite easy.
On the other hand, MongoDB is easy to operate. If you have one data center and only a couple of servers it might be a perfect fit (although the global write lock might be a pain over time). In simple deployments it's easy to maintain and monitor.
Scalability
To continue the above statements - Cassandra is linearly scalable. There is, literally, no limit on how big the cluster can be. Your writes will always stay fast, while reads might become more complicated over time - depending on the structure of your data.
Denormalization of data
With Cassandra your writes and reads can be extremely fast if you create a structure that reflects what you need to get from your data. There is no query language (well, there is, but it's not exactly SQL) that you can use to reorganize your result set using aggregates, groupings, etc. Some things are doable and some are not - that is very specific to the Cassandra data model. You will have to implement a lot of things on your own and write the results to the DB - e.g. counters for aggregation, different groupings, etc.
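A small Python sketch of what "implement it on your own and write the result to the DB" can look like in practice: every write also updates the aggregates that later reads will need, so a read becomes a simple key lookup. The three in-memory structures stand in for three hypothetical tables, and the order fields (`day`, `product`, `cents`) are assumptions for illustration only.

```python
from collections import Counter

raw_orders = []                  # the detailed records
orders_per_day = Counter()       # pre-aggregated: day -> number of orders
revenue_per_product = Counter()  # pre-aggregated: product -> total cents


def record_order(day, product, cents):
    """Store the order and update every aggregate that reads will need."""
    raw_orders.append((day, product, cents))
    orders_per_day[day] += 1
    revenue_per_product[product] += cents


record_order("2014-03-01", "book", 1500)
record_order("2014-03-01", "pen", 200)
record_order("2014-03-02", "book", 1500)

# Reads are now plain lookups instead of GROUP BY queries.
print(orders_per_day["2014-03-01"])
print(revenue_per_product["book"])
```

The cost is that a new kind of report needs a new aggregate, plus a backfill from the raw records if history matters.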
In comparison, MongoDB is easy to use, easier to learn and more flexible - both for development (in terms of learning curve and effort) and for implementing business logic (in terms of time and effort). That is - kind of - the reason why there are ORM engines for MongoDB and only a couple (very limited) for Cassandra.
To summarize - both DBs are really good... if you embrace their limitations. If you have only 100GB of data and you need a flexible, easy-to-implement DB engine, I would stick with MongoDB. Alternatively, take a look at RethinkDB, which has a very similar model and a way better (in my personal opinion) clustering/data center replication implementation.
Cassandra is a great option for you if you will soon need to store TBs of data and deploy your apps across multiple data centers, and you accept the cost of the additional effort needed to implement the same features and maintain similar capabilities.
Don't take it personally that I have used the word only while describing your data set. Yes, it's not big - my company stores more than 20 TB these days... so yeah, 100GB is really not that much...
To stop everyone from pointing that I should compare some other features or point out some other differences between those two - it's just a rough, high level overview on the things I consider relevant to the problem, not a full comparison or analysis of the problem. But feel free to point out what I have missed and I will be happy to include new stuff in this answer...
I am working on a Java web application that uses NoSQL (the target is Cassandra). I use Astyanax as the Cassandra client, since it is said to be the best Cassandra client for now. I have only been working with Cassandra for 2 weeks, so many things still seem strange to me.
While working, I ran into some problems that I do not know how to overcome:
Is a table created from CQL like a column family created by the Thrift API? I feel they are similar, but maybe there are some differences behind the scenes. For example:
a table created by a CQL command cannot be accessed through the Thrift API
Thrift-based APIs cannot work with tables created by CQL, but CQL methods can access a column family created by the Thrift API!
Does the primary key in a table correspond to the row key in a column family?
In CQL I can declare a table that contains a collection (such as a set or map). Can I do the same thing with the Thrift API?
If my application needs both of them (column families and tables), how can they work with each other?
I have noticed one thing: I cannot use the Thrift API to manipulate data in tables created by CQL, and vice versa. So how can I keep track of which table/column family was created in which way, so that I can use the correct API to process its data? For the time being, we don't have a general way to handle both of them, do we? AFAIK, the Thrift API and CQL do not share an interface, so they cannot understand each other?!
Could you please help explain these things to me? Thank you so much.
Yes. It's impossible to update the Thrift APIs to be CQL-aware without breaking existing applications. So if you use CQL you are committing to using CQL clients only like the Java driver, and not Astyanax, Hector, et al. But this is no great sacrifice since CQL is much more usable.
For a simple PK (i.e., single column), yes. For a compound PK, it's a bit more complicated.
No. The Thrift API operates at a lower level, by design. (So you'd see the individual storage cells that make up the Map, for instance.)
I don't understand the question. With CQL you can do everything you could do with Thrift, but more easily.
Simple; don't mix the two. Stick with one or the other.
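The answers about primary keys and collections can be pictured with a short Python sketch. It is a simplified model of how a CQL row was laid out as storage cells in the Thrift-era storage engine, as far as I understand it, and it deliberately ignores serialization details and marker cells. The function `to_cells` and the sample table (partition key `user`, clustering column `post_id`, a `votes` map) are hypothetical names used only for illustration.

```python
def to_cells(partition_key, clustering, columns):
    """Flatten one CQL-style row into (row_key, cell_name, value) triples.

    partition_key -> becomes the storage row key
    clustering    -> tuple of clustering values, prefixed to every cell name
    columns       -> regular columns; a dict value is treated as a CQL map
    """
    cells = []
    for name, value in columns.items():
        if isinstance(value, dict):
            # A map is stored as one cell per entry, with the map key
            # appended to the cell name.
            for map_key, map_value in value.items():
                cells.append((partition_key, (*clustering, name, map_key), map_value))
        else:
            cells.append((partition_key, (*clustering, name), value))
    return cells


cells = to_cells("alice", ("post-1",),
                 {"title": "Hello", "votes": {"bob": 1, "carol": -1}})
for cell in cells:
    print(cell)
```

With a single-column primary key the clustering tuple is empty and the primary key is simply the row key. With a compound key only the first part becomes the row key, which is why the mapping is "a bit more complicated", and a Thrift client would see the individual `votes` cells rather than one map value.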
In my opinion, the focus is shifting towards making Cassandra look like an RDBMS with SQL queries in order to gain wider adoption.
But given the inconsistencies between work done using Hector/Astyanax (Thrift) and CQL, I think it will hurt adoption. It's almost a U-turn from Hector/Astyanax to CQL in the middle of the journey.
At least CQL should have been planned in such a way that the Thrift API (and the high-level Java APIs on top of it) would have no problem transitioning.