How to represent spatial data in Cassandra

Can someone tell me how to represent spatial data (coming from PostGIS) in Cassandra?

This presentation on spatial data in Cassandra was pretty interesting and may help:
http://www.readwriteweb.com/cloud/2011/02/video-simplegeo-cassandra.php

Please provide a bit more detail on what you are trying to achieve.
This is particularly important for Cassandra (as opposed to a relational database), because you need to model the data to support the specific queries you need, rather than modelling the domain in a fairly generic way and using SQL to define queries afterwards.
Are you just trying to look up lat/longs for entities with unique identifiers, or do you have more complex shapes associated with your entities - or what?

Responding to Mr. Roland (and hopefully the OP):
You'd need to come up with your own indexing scheme, and store the indexes in Cassandra.
For example, you could subdivide the space into squares (perhaps using a hierarchical structure such as a quadtree) and store each square in a Cassandra row, with the columns storing the objects that fall within the square. Your client code would need to determine the correct square for each (lat, long), look up the objects in the square (or squares) that cover the radius you want, then do a final client-side filter to remove any objects that fall just outside the radius, since the squares only approximate the circle.
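
To make this concrete, here is a minimal Python sketch of the flat-grid variant (a quadtree would subdivide these squares hierarchically). The dict stands in for Cassandra rows keyed by grid square; the cell size and all names are illustrative, and distances are computed in raw degrees, which is only acceptable for a sketch.

import math

CELL_DEG = 0.01  # grid square size in degrees; tune to your data density

def cell_key(lat, lon):
    # Row key of the grid square containing (lat, lon).
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

index = {}  # square key -> list of (object_id, lat, lon); stands in for rows

def insert(obj_id, lat, lon):
    index.setdefault(cell_key(lat, lon), []).append((obj_id, lat, lon))

def query_radius(lat, lon, radius_deg):
    # Fetch every square overlapping the bounding box of the circle,
    # then filter out candidates outside the radius on the client side.
    span = int(math.ceil(radius_deg / CELL_DEG))
    ci, cj = cell_key(lat, lon)
    hits = []
    for i in range(ci - span, ci + span + 1):
        for j in range(cj - span, cj + span + 1):
            for obj_id, olat, olon in index.get((i, j), []):
                # Planar distance in degrees: fine for a sketch, wrong for
                # real geodesy (longitude degrees shrink with latitude).
                if math.hypot(olat - lat, olon - lon) <= radius_deg:
                    hits.append(obj_id)
    return hits

insert('cafe-1', 52.52, 13.40)
print(query_radius(52.5, 13.4, 0.05))  # -> ['cafe-1']

In Cassandra, cell_key would become the row key and each object a column in that row; the double loop over candidate squares becomes a multi-get over the corresponding row keys.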

Related

Spotfire for cross fact table joins using conformed dimensions

A few years ago I ascertained that Spotfire cannot perform multi-fact-table queries using conformed dimensions à la Ralph Kimball (much like Tableau, where this is still the case).
Is this still so? Most people I speak to are not aware of this. I am not in a position to quickly verify it, hence my question.
If you are reading from a DB, you can create custom information links using SQL (or what Spotfire calls SQL; it's a little different) that can certainly join multiple fact tables together through conformed dimensions. These may perform well or poorly depending on the amount of data and the structure of the tables in question.
You can also 'join' fact tables across dimensions (or directly to each other if you have the right keys) within the tool itself. These are called relations and work under the same principles, but don't kick off joined SQL statements.
If you create a view in the DB that does the joins as you have said, Spotfire can read those as well into an information link.

question about data model design in arangodb

Update 2:
The original question is long, so here is a short version:
In the City Graph, how do I query the cities that can be reached directly from Berlin via germanHighway edges? I don't want the internationalHighway edges.
Original Question:
I am now using ArangoDB to store a graph, and I have a question about the data model design.
Take the knows_graph example (a social graph).
My original idea was to design two collections: a document collection person, and one edge collection per relationship, such as marriedWith or friendWith.
But when I want to query the people someone is marriedWith, I can't filter out the unwanted friendWith edges (I'm not very familiar with AQL, so maybe this is not true).
In contrast, the examples in the AQL documentation define a more generic edge collection, such as relation in the social_graph, and put the more specific type in an attribute, for example "type": "married" on a relation edge.
In AQL I can then use FILTER p.edges[0].type == 'married' to filter out the unwanted relations.
My question is:
Which method of data model design is better, or any suggestions for this?
Now I think that storing married as a type attribute may be more flexible: it is easy to extend to student, neighbour, ... with a single relation edge collection.
Otherwise, many edge collections (isStudent, neighbourWith, ...) would have to be created.
Can AQL filter nodes by edge collection rather than by attributes? Maybe something like:
FILTER 'isStudent' edge
Update:
I just tried it: one edge collection can only be used between the two vertex collections defined for it.
For example, if an isFriend edge collection is defined between person and dog vertices, then you can't use isFriend between dog and cat!
So many edge collections are indeed needed (see the sketch below).
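
To illustrate the constraint, a hedged sketch with the python-arango driver (the graph and collection names here are made up): the edge definition is what pins isFriend to person -> dog.

from arango import ArangoClient

# All names ('pets', 'isFriend', 'person', 'dog') are made up for this example.
db = ArangoClient(hosts='http://localhost:8529').db(
    '_system', username='root', password='passwd')

graph = db.create_graph('pets')
graph.create_edge_definition(
    edge_collection='isFriend',
    from_vertex_collections=['person'],
    to_vertex_collections=['dog'],  # an isFriend edge from dog to cat would
)                                   # violate this definition and be rejected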
For the original question:
If you have a finite, well-defined number of edge types, then using multiple edge collections is fine, especially if you expect a large number of edges of each type. If, on the other hand, you foresee a large number of relationship types (friend, best friend, wife, etc.) and the number of relationships of each type is not huge, then a single edge collection with a type indicator is fine and may simplify things.
The ways I can think of to filter edges in a traversal are:
The IS_SAME_COLLECTION function. This tells you whether a document belongs to a particular collection. Keep an eye on performance if you use this on a big dataset, though.
Adding a type attribute to each edge collection that indicates which collection the edge belongs to. Yes, it is basically a static field and a bit of a waste of space, but it works, and space is cheap nowadays.
Anonymous graph traversals, where you define explicitly which edge collections to use.
Having said that, Arango is a multi-model DB, and as such you could just ignore the traversal syntax, and just join the tables that you need, which would work just fine as well. It is the great thing about multi-model DBs, you use them in any way you need them.
In terms of your last update, you could check the edge collection by doing something like:
FILTER IS_SAME_COLLECTION('internationalHighway', e._id) == false
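
Putting it together, a sketch using the python-arango driver and the collection/graph names from ArangoDB's City Graph example dataset (germanCity, germanHighway, graph routeplanner); adjust names and credentials to your setup. The second query uses the anonymous-traversal approach from the list above, naming the edge collection directly so no FILTER is needed.

from arango import ArangoClient

db = ArangoClient(hosts='http://localhost:8529').db(
    '_system', username='root', password='passwd')

# Named-graph traversal, keeping only germanHighway edges:
aql = """
FOR v, e IN 1..1 OUTBOUND 'germanCity/Berlin' GRAPH 'routeplanner'
    FILTER IS_SAME_COLLECTION('germanHighway', e._id)
    RETURN v._key
"""
print(list(db.aql.execute(aql)))

# Anonymous traversal: name the edge collection directly, no FILTER needed.
aql2 = """
FOR v IN 1..1 OUTBOUND 'germanCity/Berlin' germanHighway
    RETURN v._key
"""
print(list(db.aql.execute(aql2)))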
I think the way to design the data model depends on your business. If your model is more or less stable and does not have many edge types, you can choose the many-edge-collections way, since the set of edge collections stays finite.
But I don't know how to filter by edge collection names :-)
Otherwise, I think fewer edge collections and more attributes will be better.

Using Cassandra to store immutable data?

We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per second (but the rate will increase). Each event is small, about 1 KB.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern, and since Cassandra is a schema-bound DB I don't suppose that is even possible when the events come in many different forms? Would Cassandra be a good fit for this? If so, is there something one should be aware of?
I had exactly the same requirements for a "project" (rather, a tool) a year ago; I used Cassandra and didn't regret it. In general it fits very well: you can fit quite a lot of data in a Cassandra cluster, the performance is impressive (although you might need some tweaking), and the natural ordering is a nice thing to have.
Rather than expressing the benefits of using it, I'll rather concentrate on possible pitfalls you might not consider before starting.
You have to think about your schema. The data is naturally ordered within one row by the clustering key; in your case that will be the timestamp. However, you cannot order data between different rows: it might come back ordered after a query, but that is not guaranteed in any way, so don't rely on it. There was some way to write such a query before 2.1, I believe (using ORDER BY, disabling paging and allowing filtering), but it performed badly and I don't think it is even possible now. So you should order data across rows on the querying side.
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time and you put them in different rows. You have to fetch the rows for the different variable types, then re-sort on the querying side. The other way is to put all variable types in one row, but then filtering for only a subset becomes an issue to solve.
Row length is limited to 2 billion elements, and although that seems like a lot, it is really not unreachable with time-series data. In fact you don't want to get anywhere near those two billion; keep it in the hundreds of millions at most. If you introduce some parameter to split the rows on (an increasing index, or rounding by day/month/year), you will have to implement that in your query logic as well, as in the sketch below.
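
As an illustration of the bucketing and ordering discussed above, a hedged sketch with the DataStax Python driver; the keyspace, table and day-based bucketing scheme are assumptions for the example, not a prescription.

import json
import uuid
from datetime import date, timedelta
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('eventstore')  # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        day      text,      -- partition key: bounds the row size
        event_id timeuuid,  -- clustering key: time order within the day
        payload  text,
        PRIMARY KEY (day, event_id)
    ) WITH CLUSTERING ORDER BY (event_id ASC)
""")

def append(event):
    session.execute(
        "INSERT INTO events (day, event_id, payload) VALUES (%s, %s, %s)",
        (date.today().isoformat(), uuid.uuid1(), json.dumps(event)))

def replay(start, end):
    # The bucket-stepping logic lives in the client, as noted above:
    # walk the day partitions in order, each one internally sorted.
    day = start
    while day <= end:
        rows = session.execute(
            "SELECT payload FROM events WHERE day = %s", (day.isoformat(),))
        for row in rows:
            yield json.loads(row.payload)
        day += timedelta(days=1)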
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries; there are specific rules in CQL about filtering and the WHERE clause.
All in all these things might seem important, but they are really not too much of a hassle when you get to know Cassandra a bit. I'm underlining them just to give you a heads up. If something is not logical at first just fall back to understanding why it is like that and the whole theory about data distribution and the ring topology.
Don't expect too much from the collections within the columns, their length is limited to ~65000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) )
Based on the requirements you expressed, Cassandra could be a good fit, as it is a write-optimized data store. Time series are quite a common pattern, and you can define a clustering order, for example on the timestamp of the events, in order to retrieve all the events in time order. I found this article on DataStax Academy very useful when I wanted to learn about time series.
A variable data structure is not a problem: you can store the data in a BLOB and parse it inside your application (i.e. store it as JSON and map it onto your model), or you could even store the data in a map, although collections in Cassandra have some caveats that it is good to be aware of. Here you can find docs about collections in Cassandra 2.0/2.1.
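
For the map variant mentioned above, a small hedged sketch with the DataStax Python driver (keyspace and table names are illustrative):

from datetime import date
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('eventstore')  # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS events_map (
        day      text,
        event_id timeuuid,
        fields   map<text, text>,  -- per-event variable structure
        PRIMARY KEY (day, event_id)
    )
""")

# now() asks Cassandra to generate the timeuuid server-side.
session.execute(
    "INSERT INTO events_map (day, event_id, fields) VALUES (%s, now(), %s)",
    (date.today().isoformat(), {'type': 'click', 'page': '/home'}))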
Cassandra is quite different from a SQL database, and although CQL has some similarities there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to pursue efficiency - a great article from Datastax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.

Do we need to denormalize model in Cassandra?

We usually store a graph of objects in databases. In an RDBMS, we need to make joins to retrieve the relationships between objects. In Cassandra, denormalizing the model to fit the queries is what is promoted. But by doing this, we make updates to the model more complex and more specialized.
Cassandra has complex data types such as set, map, list and tuple. These types make it possible to store the relationships between objects in a straightforward manner (association, aggregation, composition of objects), for instance by storing the ids of the connected objects in a list.
The only drawback is then having to split one complex SQL join into several requests.
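
For instance, something like the following is what I have in mind (a sketch with the DataStax Python driver; the order/item schema is purely illustrative):

from cassandra.cluster import Cluster
from cassandra.query import ValueSequence

session = Cluster(['127.0.0.1']).connect('shop')  # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id uuid PRIMARY KEY,
        item_ids set<uuid>        -- the association, stored inline
    )
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS items (
        item_id uuid PRIMARY KEY,
        name    text
    )
""")

def load_order_items(order_id):
    # Step 1: read the parent row and its embedded ids.
    order = session.execute(
        "SELECT item_ids FROM orders WHERE order_id = %s", (order_id,)).one()
    if order is None or not order.item_ids:
        return []
    # Step 2: resolve the ids; the single SQL join becomes two requests.
    return list(session.execute(
        "SELECT item_id, name FROM items WHERE item_id IN %s",
        (ValueSequence(list(order.item_ids)),)))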
I have not seen papers on Cassandra dealing with this kind of solution. Does someone know why this solution is not promoted?
Cassandra is a highly write-optimized database, so writes are cheap: an extra three or four writes will hardly matter, considering the difficulties you would face otherwise.
Regarding graphs of objects, the answer is: no. Cassandra isn't meant to store graphs of objects; it is meant to store data for queries. The closest RDBMS equivalent would be views in PostgreSQL. Data has to be stored in a way that a query can be serviced easily, the main reason being that reads are slow. The goal of data modeling in Cassandra is to make sure a read almost always comes from a single partition.
If the data were normalized, a query would need to hit a minimum of two partitions, and worst-case scenarios would create latencies that render the application unusable for any practical purpose.
Hence data modeling in Cassandra is always centered on queries and not the relationship between objects.
More on these basic rules can be found in Datastax's documentation
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling

How to perform intersection operation on two datasets in Key-Value store?

Let's say I have 2 datasets, one for rules, and the other for values.
I need to filter the values based on rules.
I am using a key-value store (Couchbase, Cassandra, etc.). I can use a multi-get to retrieve all the values from one table and all the rules from the other, and perform the validation in a loop.
However, I find this very inefficient: I move a massive volume of data (the values) over the network, and the client is kept busy filtering it.
What is the common pattern for finding the intersection of two tables in a key-value store?
The idea behind the NoSQL data model is to write data in a denormalized way so that a table can answer one precise query. As an example, imagine you have reviews made by customers on shops. You need to know the reviews made by a user on shops, and also the reviews received by a shop. This would be modeled using two tables:
ShopReviews
UserReviews
In the first table you query by shop id, in the second by user id; the data is written twice, but each access is a direct key lookup, as sketched below.
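
A minimal sketch of those two tables with the DataStax Python driver (all names are illustrative); the point is that the same review is written twice so that each access pattern becomes a single key lookup:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('reviews')  # assumed keyspace

for ddl in ("""
    CREATE TABLE IF NOT EXISTS shop_reviews (
        shop_id text, review_id timeuuid, user_id text, body text,
        PRIMARY KEY (shop_id, review_id))""",
            """
    CREATE TABLE IF NOT EXISTS user_reviews (
        user_id text, review_id timeuuid, shop_id text, body text,
        PRIMARY KEY (user_id, review_id))"""):
    session.execute(ddl)

def add_review(shop_id, user_id, review_id, body):
    # Denormalized on purpose: one write per table, one table per query.
    session.execute(
        "INSERT INTO shop_reviews (shop_id, review_id, user_id, body) "
        "VALUES (%s, %s, %s, %s)", (shop_id, review_id, user_id, body))
    session.execute(
        "INSERT INTO user_reviews (user_id, review_id, shop_id, body) "
        "VALUES (%s, %s, %s, %s)", (user_id, review_id, shop_id, body))

# Either access pattern is now a single-key lookup, no client-side join:
# session.execute("SELECT * FROM shop_reviews WHERE shop_id = %s", ('shop-42',))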
In the same way, you should organize the values by rules (I can't be more precise without knowing the relation between them), and so on. One more consideration: newer versions of NoSQL DBs support collections, which might help to model one-to-many relations.
HTH, Carlo
