I have seen some questions about using Cassandra to store RDF data, but they are really old (2013).
I have time-annotated RDF data (an RDF stream) and I'd like to store it in a Cassandra cluster. I know it's probably not the best idea, but I have a research interest in it.
I'm looking for a way to do it, but I'm not sure it is really possible. For now I'm quite confused about how to organise the database.
Does anyone have experience with this, or any hints? Is it possible to do?
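To make the question more concrete, here is a minimal Python sketch of one possible layout, with no Cassandra driver involved. It assumes each statement in the stream is a (timestamp, subject, predicate, object) tuple, and that rows are partitioned by a hypothetical (stream id, hour bucket) key and ordered by timestamp inside the partition. The names `TimedTriple`, `partition_key`, `store` and `read_window`, as well as the hourly bucket, are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class TimedTriple(NamedTuple):
    ts: datetime     # time annotation of the statement
    subject: str
    predicate: str
    obj: str


def partition_key(stream_id, ts):
    """Hypothetical partition key: (stream id, hour bucket)."""
    return (stream_id, ts.strftime("%Y-%m-%dT%H"))


def store(table, stream_id, triple):
    """Add a triple to its partition, keeping the partition sorted by timestamp."""
    rows = table[partition_key(stream_id, triple.ts)]
    rows.append(triple)
    rows.sort()  # stands in for the ordering given by clustering columns


def read_window(table, stream_id, start, end):
    """Return triples with start <= ts < end by visiting each hour bucket."""
    out = []
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        rows = table.get(partition_key(stream_id, hour), [])
        out.extend(t for t in rows if start <= t.ts < end)
        hour += timedelta(hours=1)
    return out


if __name__ == "__main__":
    table = defaultdict(list)
    store(table, "sensors", TimedTriple(datetime(2020, 1, 1, 10, 5), "ex:s1", "ex:temp", "21"))
    store(table, "sensors", TimedTriple(datetime(2020, 1, 1, 11, 30), "ex:s1", "ex:temp", "22"))
    for t in read_window(table, "sensors", datetime(2020, 1, 1, 10), datetime(2020, 1, 1, 12)):
        print(t)
```

With a layout like this, a time-window read only touches the partitions for the hours it covers. Whether an hourly bucket is the right size depends on the stream rate.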
In a POC, we are using Cassandra to store (among other things) parsed Apache access logs, and we use it together with Apache Spark and Zeppelin. We have managed to get things working, BUT we are very uncertain about how to model the data correctly.
Edit: Our queries will span months and years rather than weeks and days. In production, jobs will probably be executed daily (at least for now), and we will use a smaller dataset during development.
Since this will be used for analytics ONLY, the queries can be pretty much anything but of course we could consider a handful of queries in advance.
For example:
latency percentiles
geo distribution
sum of requests
popular REST resources
... etc
Partition key + primary key: this is really difficult. The only thing that I can think of is something like ((userid, [webresource]), timestamp).
At least this would give a fairly even distribution. Otherwise we would have to use a checksum or something which feels wrong.
Or should I have different tables for different types, like latency, geo, etc.? Or is this a good use case for materialized views?
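To illustrate the key idea above, here is a small Python sketch that computes a composite partition key for a parsed log entry and counts how many rows would land in each partition. The month bucket is an extra assumption and not part of the key proposed above. It is added here only because the queries span months and years, and without some time bucket a ((userid, webresource)) partition keeps growing forever. The function names and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime


def log_partition_key(user_id, resource, ts):
    """Partition key for one parsed access-log entry.

    The month bucket is an assumption added for illustration.
    """
    return (user_id, resource, ts.strftime("%Y-%m"))


def rows_per_partition(entries):
    """Count how many log rows would land in each partition."""
    return Counter(log_partition_key(u, r, ts) for u, r, ts in entries)


if __name__ == "__main__":
    # Hypothetical sample entries: (userid, webresource, timestamp)
    entries = [
        ("u1", "/api/orders", datetime(2016, 1, 3)),
        ("u1", "/api/orders", datetime(2016, 1, 20)),
        ("u1", "/api/orders", datetime(2016, 2, 1)),
        ("u2", "/api/orders", datetime(2016, 1, 5)),
    ]
    for key, count in sorted(rows_per_partition(entries).items()):
        print(key, count)
```

Running something like this over a sample of real logs gives a rough picture of how evenly a candidate key spreads the data before committing to a table definition.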
I have googled for something like this without any luck, so perhaps Cassandra is a poor solution for this, BUT we would still really like to see how far we can get.
Anyway, any input is highly appreciated!
Regards /Johan
I am new to NoSQL databases. I am trying to build a project and am stuck on whether to choose an SQL or a NoSQL database for it.
The requirements are as follows: a legal firm has many clients, each client can have different matter types such as Immigration, Conveyancing, Family, etc., and each matter type can have different fields, which are never constant and may well change in the future.
Because of this, I thought a NoSQL database might be a good choice: it is document based, so I can add new fields to the document structure instead of dynamically adding new columns to an SQL table, which is not a good approach (at least I think so).
Can anyone please suggest an approach, or refer me to an article that can help me decide?
To clarify my question, let me explain with an example:
For a client named xyz with matter type Immigration, I have fields such as firstName, lastName and Dob at the moment, but later on I might have to add Dependants and their details for the same client.
For a client named def with matter type Conveyancing, I would have different fields, but those fields should also be added dynamically depending on the matter type.
Thank you in advance
Regards
Anand
In my opinion, you shouldn't consider only this feature in order to decide between NoSQL and an RDBMS.
In fact, this flexibility sounds very good, but it can be dangerous: systems tend to grow, and then things can get out of hand.
I have a system where I use MongoDB, but even so, I decided to create a schema for my collections.
I would suggest you finish the modeling first, and only then decide whether NoSQL is really necessary.
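As a rough illustration of keeping a schema even with a document store, here is a minimal Python sketch that checks a document against the fields required for its matter type while still allowing extra fields. The field lists are assumptions (only firstName, lastName and Dob come from the question, and `propertyAddress` is made up), and the check runs in application code rather than in any particular database.

```python
# Required fields per matter type. Hypothetical except for the
# Immigration fields, which are taken from the question.
REQUIRED_FIELDS = {
    "Immigration": {"firstName", "lastName", "Dob"},
    "Conveyancing": {"propertyAddress"},
}


def missing_fields(matter_type, document):
    """Return the required fields that the document does not contain."""
    required = REQUIRED_FIELDS.get(matter_type)
    if required is None:
        raise ValueError(f"unknown matter type: {matter_type}")
    return sorted(required - set(document))


if __name__ == "__main__":
    doc = {"firstName": "Jane", "lastName": "Doe", "Dependants": []}
    # Extra fields such as Dependants are allowed; Dob is reported as missing.
    print(missing_fields("Immigration", doc))
```

The point is that "fields can change" does not have to mean "no schema at all": the required core can be enforced and the rest left open.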
I would suggest looking into PostgreSQL if you are expecting large datasets. It offers advantages of NoSQL databases, such as support for key-value pairs, while also keeping a rigid data structure like SQL databases. The following links point to a few articles that may help you decide which approach to choose:
NoSql vs Sql
postgres vs mongodb
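The hybrid idea from this answer (fixed columns for what is stable, key-value data for what changes) can be sketched roughly as follows. To keep the example self-contained it uses Python's built-in sqlite3 module as a stand-in and stores the variable fields as JSON text. In PostgreSQL the `extra` column would more likely be a JSONB or hstore column. The table and column names are hypothetical.

```python
import json
import sqlite3

# In-memory SQLite database used only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matter ("
    " id INTEGER PRIMARY KEY,"
    " client TEXT NOT NULL,"           # fixed, always present
    " matter_type TEXT NOT NULL,"      # fixed, always present
    " extra TEXT NOT NULL DEFAULT '{}'"  # variable fields as JSON text
    ")"
)


def add_matter(client, matter_type, **extra):
    """Insert a matter; any keyword arguments go into the JSON column."""
    cur = conn.execute(
        "INSERT INTO matter (client, matter_type, extra) VALUES (?, ?, ?)",
        (client, matter_type, json.dumps(extra)),
    )
    return cur.lastrowid


def get_matter(matter_id):
    """Load a matter and merge fixed columns with the variable fields."""
    client, matter_type, extra = conn.execute(
        "SELECT client, matter_type, extra FROM matter WHERE id = ?",
        (matter_id,),
    ).fetchone()
    return {"client": client, "matter_type": matter_type, **json.loads(extra)}


if __name__ == "__main__":
    matter_id = add_matter("xyz", "Immigration", firstName="Jane", lastName="Doe")
    print(get_matter(matter_id))
```

New fields such as Dependants can then be added per matter without altering the table, while client and matter type stay under normal relational constraints.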
I was watching one of the Cassandra videos on DataStax Academy. One concept they talk about a lot is query-driven modelling. This makes sense when you know your queries upfront, as in the KillrVideo example.
However, in big data cases, I hope I am not the only one to think that we barely know what kind of queries analysts will perform on the data 5 months or one year down the road.
If this is the case, what are the best practices for storing your data? My guess is that for advanced querying of such data, you likely will end up loading your data into Spark. But what do I have to consider at storage time to avoid operational troubles and troubles at retrieval time? What retrieval approaches are less problematic?
Cassandra can also serve analytics use cases, but it is not always a good fit for ad-hoc analytics (one-off reports whose query will never be run again).
For those use cases a Hadoop cluster is a better option for you (maybe Parquet on Hadoop). If you see that queries will be run over and over again, Cassandra is your friend. Generally you can use Cassandra for 50 to 70% of your use cases. With column keys and secondary indexes you can run a really wide spectrum of queries. Go to your analytics people and ask them what they need. Then: create your tables :)
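As a rough picture of what "ask what they need, then create your tables" means in practice, here is a small Python sketch of query-driven modelling: the same event is written to one table per known query, so each query becomes a single lookup. The two queries (events by user, events by day) and all names are assumptions for illustration, and plain dictionaries stand in for Cassandra tables.

```python
from collections import defaultdict


class QueryTables:
    """One 'table' per known query; every write goes to all of them."""

    def __init__(self):
        self.by_user = defaultdict(list)  # query 1: all events of a user
        self.by_day = defaultdict(list)   # query 2: all events on a day

    def insert(self, user, day, payload):
        event = {"user": user, "day": day, "payload": payload}
        # Denormalize: duplicate the event into each query-specific table.
        self.by_user[user].append(event)
        self.by_day[day].append(event)

    def events_for_user(self, user):
        return list(self.by_user.get(user, []))

    def events_on_day(self, day):
        return list(self.by_day.get(day, []))


if __name__ == "__main__":
    tables = QueryTables()
    tables.insert("alice", "2016-05-01", "login")
    tables.insert("bob", "2016-05-01", "upload")
    print(len(tables.events_on_day("2016-05-01")))
    print(tables.events_for_user("alice"))
```

The trade-off is extra storage and write work in exchange for cheap reads. A query nobody planned for has no table, which is where Spark or Hadoop comes in.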
DataStax has a course on doing analysis on Cassandra with Apache Spark.
I've started to learn Cassandra, and first I want to learn the Cassandra data model, but I don't know where to start. I have looked at many web pages and the Cassandra documentation (http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html), but I am really confused.
The documentation only gives some examples that are very similar to a relational DB, without talking about the super column concept or other concepts that can be found at other URLs.
I need a straightforward, step-by-step tutorial for data modeling.
Regards
Although CQL looks similar to SQL, they are very different. CQL is very limited compared to SQL and you need to understand how data is stored and retrieved in Cassandra based on the partition key and clustering columns. Until you understand how the keys work, you will be lost.
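A toy Python model may help with the role of the two kinds of keys. It is a simplification, not how Cassandra is implemented: a hash of the partition key picks the node (here an MD5 hash and a modulo, whereas Cassandra uses its own partitioner and a token ring), and rows inside a partition are kept sorted by the clustering value so that range slices are cheap. The node names and the `Table` class are hypothetical.

```python
import bisect
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def token(partition_key):
    """Stand-in hash of the partition key (not Cassandra's partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for(partition_key):
    """Pick the node that would own this partition (simplified)."""
    return NODES[token(partition_key) % len(NODES)]


class Table:
    def __init__(self):
        self.partitions = {}  # partition key -> rows sorted by clustering value

    def insert(self, partition_key, clustering, value):
        rows = self.partitions.setdefault(partition_key, [])
        bisect.insort(rows, (clustering, value))

    def slice(self, partition_key, lo, hi):
        """Rows of one partition with lo <= clustering < hi."""
        rows = self.partitions.get(partition_key, [])
        start = bisect.bisect_left(rows, (lo,))
        end = bisect.bisect_left(rows, (hi,))
        return rows[start:end]


if __name__ == "__main__":
    t = Table()
    t.insert("user-1", 3, "c")
    t.insert("user-1", 1, "a")
    t.insert("user-1", 2, "b")
    print(node_for("user-1"), t.slice("user-1", 1, 3))
```

In this model, a query that names the partition key goes to one node and reads a contiguous range. A query that does not name it would have to ask every node, which is why so much SQL has no efficient CQL equivalent.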
I haven't seen a very good overview of Cassandra on the web, but if you're willing to spring for a book, a good introduction to Cassandra and how it works is called "Apache Cassandra Hands-On Training Level One".
I'm looking at using Cassandra for an enterprise website I'm working on, which could be used by up to 250 million users. Cassandra seems like an obvious choice because of the way it scales, although I was a little sad not to be able to use a schema-less database like Couch (for political reasons I won't go into).
I've read that you can still use Cassandra like a schema-less database, using either a super-column or simply serializing objects into normal columns. At the moment I'm using .NET for my front-end.
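To show what "serializing objects into normal columns" could look like, here is a language-neutral sketch written in Python rather than .NET. It flattens an object into a map of column name to JSON-encoded value and rebuilds the object from such a map. The `UserProfile` class and the helper names are hypothetical, and the dictionary only stands in for a row. No Cassandra client library is used.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class UserProfile:
    user_id: str
    name: str
    tags: list = field(default_factory=list)


def to_columns(obj):
    """Flatten a dataclass into {column name: JSON-encoded value}."""
    return {name: json.dumps(value) for name, value in asdict(obj).items()}


def from_columns(cls, columns):
    """Rebuild an object from its column map."""
    return cls(**{name: json.loads(raw) for name, raw in columns.items()})


if __name__ == "__main__":
    profile = UserProfile("42", "Steve", ["admin"])
    columns = to_columns(profile)
    print(columns)
    print(from_columns(UserProfile, columns) == profile)
```

Adding a new attribute to the class then just means one more column in newly written rows. Older rows would lack it, so the class needs defaults for anything added later.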
Are there any libraries out there already that help with using Cassandra in this way?
Has anyone done anything like this already using .NET? Any tips?
Any advice gratefully received!
Thanks,
Steve.
Datomic is schemaless. Attributes are modeled, and generic objects can be created, saved, and queried with any combination of attributes.
http://www.datomic.com
http://docs.datomic.com/storage.html#cassandra