So I have two tables in the query I am using:
SELECT
R.dst_ap, B.name
FROM airports as A, airports as B, routes as R
WHERE R.src_ap = A.iata
AND R.dst_ap = B.iata;
However it is throwing the error:
mismatched input 'as' expecting EOF (..., B.name FROM airports [as] A...)
Is there any way I can do what I am attempting to do (which is how it works relationally) in Cassandra CQL?
The short answer is that there are no joins in Cassandra. Period. So using SQL-based JOIN syntax will yield an error similar to what you posted above.
The idea with Cassandra (or any distributed database) is to ensure that your queries can be served by a single node (cutting down on network time). There really isn't a way to guarantee that data from different tables could be queried from a single node. For this reason, distributed joins are typically seen as an anti-pattern. To that end, Cassandra simply doesn't allow them.
In Cassandra you need to take a query-based modeling approach. So you could solve this by building a table from your post-join result set, consisting of desired combinations of dst_ap and name. You would have to find an appropriate way to partition this table, but ultimately you would want to build it based on A) the result set you expect to see and B) the properties you expect to filter on in your WHERE clause.
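As a rough illustration of the query-based modeling approach described above, the following sketch precomputes the post-join rows in application code. The in-memory lists stand in for the airports and routes tables, and the sample rows, the helper name build_routes_by_source, the dst_name column, and the choice of src_ap as the partition key are all assumptions made for the example, not part of the original schema.

```python
# Hypothetical in-memory stand-ins for the airports and routes tables.
airports = [
    {"iata": "JFK", "name": "John F Kennedy International Airport"},
    {"iata": "LHR", "name": "London Heathrow Airport"},
]
routes = [
    {"src_ap": "JFK", "dst_ap": "LHR"},
    {"src_ap": "LHR", "dst_ap": "JFK"},
]


def build_routes_by_source(airports, routes):
    """Precompute the join result: one row per route, with the destination name."""
    name_by_iata = {a["iata"]: a["name"] for a in airports}
    rows = []
    for r in routes:
        # Mirror the WHERE clause: both endpoints must exist in airports.
        if r["src_ap"] in name_by_iata and r["dst_ap"] in name_by_iata:
            rows.append({
                "src_ap": r["src_ap"],  # assumed partition key of the new table
                "dst_ap": r["dst_ap"],
                "dst_name": name_by_iata[r["dst_ap"]],
            })
    return rows


# These rows are what would be written to the denormalized table.
denormalized_rows = build_routes_by_source(airports, routes)
```

Each resulting row would then be inserted into a single table, so that the read path needs only one query against one partition.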
Related
I am working with graphframes, pyspark, and hive to work with graph data. As I process data I will be building a graph and eventually will be persisting this data into a Hive table, where I will not update it ever again.
Subsequent runs may have relationships to nodes from previous runs, so I will want to ensure I don't duplicate data.
For example, run #1 might find nodes: A, B, C. Run #2 might re-find node A, and also find new nodes X, Y, Z. I do not want A to appear twice in my table.
I am looking for the best way to handle this and would like to address the following issues:
I will need to track the status of the node as I process metadata associated with it. I will only want to persist the node's data to Hive after I have finished this processing.
I want to ensure that I don't create duplicate data when I encounter the same node (e.g. when I re-find A node above, I don't want to insert another row into Hive)
I am currently tinkering with the best way to do this. I know Hive supports ACID transactions now, but it does not appear as though PySpark currently supports CRUD-type operations. So here is what I'm planning:
On each run, create a dataframe to store the nodes I have found.
When a new node is found: Check if the node already exists in Hive (e.g. sqlContext.sql("SELECT * FROM existingTable WHERE name='<NAME>'")). If it does not exist, update the dataframe with x = vertices.withColumn("name", F.when(F.col("id")=="a", "<THE-NEW-NAME>").otherwise(F.col("name"))) to add it to our dataframe.
Once all the nodes have finished processing, create a temporary view: x.createOrReplaceTempView("myTmpView")
Finally, insert data from my temporary view into an existing table with sqlContext.sql("INSERT INTO TABLE existingTable SELECT * FROM myTmpView")
I think this will work, but it seems extremely hacky. I'm not sure if this is a function of my lack of understanding of Hive/Spark, or if this is just the nature of the tech stack. Is there a better way to do this? Is there a performance cost to handling it in this way?
The Delta Lake API supports upserts (MERGE) from both Scala and Python, which is exactly what you are trying to implement.
https://docs.delta.io/latest/delta-update.html#merge-examples
Here is an alternate solution:
Add an updated_time timestamp column to your table.
Union prev_run_results and current_run_results.
Group by node and select the row with the latest timestamp.
Save the results.
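The steps of this alternate solution can be sketched in plain Python as follows. This shows only the union / group-by / latest-timestamp logic on in-memory rows; it is not PySpark code, and the function name merge_runs and the sample data are assumptions for illustration.

```python
from datetime import datetime


def merge_runs(prev_run_results, current_run_results):
    """Union both runs and keep, per node, the row with the latest updated_time."""
    latest = {}
    for row in prev_run_results + current_run_results:  # union of both runs
        node = row["node"]
        # Group by node and keep the row with the most recent timestamp.
        if node not in latest or row["updated_time"] > latest[node]["updated_time"]:
            latest[node] = row
    return sorted(latest.values(), key=lambda r: r["node"])


# Hypothetical sample data: run #1 found A, B, C; run #2 re-found A and found X, Y, Z.
prev_run_results = [
    {"node": n, "updated_time": datetime(2019, 1, 1)} for n in ("A", "B", "C")
]
current_run_results = [
    {"node": n, "updated_time": datetime(2019, 1, 2)} for n in ("A", "X", "Y", "Z")
]

merged = merge_runs(prev_run_results, current_run_results)
```

With this data, node A appears once in merged, carrying the timestamp from the later run.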
I want to use the result of a SELECT query as input to another query's condition, like this:
DELETE FROM message_user WHERE id = 8a81de70-1991-11e9-a38f-9e0aa7c9f25f and group = e5b04c50-1982-11e9-abf3-b17ecbb80329 and receiver in (SELECT member FROM chat_group_member WHERE id = e5b04c50-1982-11e9-abf3-b17ecbb80329)
Cassandra is a distributed database, and nested queries are a type of join. In Cassandra, data may be stored on multiple hosts. To perform a join, a large amount of data might need to be downloaded to a single node. This could cause performance issues, as all nodes run on commodity hardware (peer to peer). Hence, I think it is not supported.
I have two questions about query results in Cassandra.
When I make a "full" select of a table in Cassandra (i.e. select * from table), is it guaranteed that the results will be returned in increasing order of partition tokens?
For instance, having the following table:
create table users(id int, name text, primary key(id));
Is it guaranteed that the following query will return the results with increasing values in the token column?
select token(id), id from users;
If so, is it also guaranteed if the data is distributed to multiple nodes in the cluster?
If the answer to the above question is 'yes', is it still valid if we use a secondary index? For instance, if we had the following index:
create index on users(name);
and we query the table by using the index:
select token(id), id from users where name = 'xyz';
is there any guarantee regarding the order of results?
The motivation for the above questions is whether the token is the right thing to use to implement paging and/or resuming of interrupted long-running "data exports".
EDIT: There are multiple resources on the net that state that the order matches the token order (e.g. in the description of partitioner results or this Datastax page):
Without a partition key specified in the WHERE clause, the actual order of the result set then becomes dependent on the hashed values of userid.
However, the order of results is not specified in the official Cassandra documentation, e.g. that of the SELECT statement.
Is it guaranteed that the following query will return the results with increasing values in the token column?
Yes, it is.
If so, is it also guaranteed if the data is distributed to multiple nodes in the cluster?
Data distribution is orthogonal to the ordering of the retrieved data; there is no relationship between the two.
If the answer to the above question is 'yes', is it still valid if we use a secondary index?
Yes, even if you query data using a secondary index (be it SASI or the native implementation), the returned results will always be sorted by token order. Why? The technical explanation is given in my blog post here: http://www.doanduyhai.com/blog/?p=13191#cluster_read_path
That is the main reason why SASI is not a good fit if you want the search to return data ordered by some column values. Only a real search engine integration (like Datastax Enterprise Search) can give you the correct ordering, because it bypasses the cluster read path layer.
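If results do come back in token order, as stated above, a broken export could be resumed by remembering the last token seen. The sketch below simulates that idea in memory only: fake_token is a stand-in hash (it is not Cassandra's Murmur3 partitioner), and fetch_page imitates a query of the form SELECT token(id), id FROM users WHERE token(id) > ? LIMIT ?. All function names are hypothetical, and the sketch assumes one row per partition and no token collisions.

```python
import hashlib


def fake_token(key):
    """Stand-in for the partitioner hash; NOT Cassandra's real Murmur3 token."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def fetch_page(ids, after_token, page_size):
    """Simulates: SELECT token(id), id FROM users WHERE token(id) > ? LIMIT ?"""
    ordered = sorted((fake_token(i), i) for i in ids)  # rows in token order
    matching = [(t, i) for t, i in ordered if after_token is None or t > after_token]
    return matching[:page_size]


def export_all(ids, page_size, resume_after=None):
    """Export ids page by page, checkpointing on the last token of each page."""
    exported, last_token = [], resume_after
    while True:
        page = fetch_page(ids, last_token, page_size)
        if not page:
            return exported
        exported.extend(i for _, i in page)
        last_token = page[-1][0]  # persist this value to resume after a failure
```

For tables with several rows per partition, a plain "token greater than the last one" condition would skip the remaining rows of the last partition, so this sketch applies only to the single-row-per-partition case shown in the question.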
I have a structure in which I want a user to see other users' feeds.
One way of doing it is to fan out an action to all interested parties' feeds.
That would result in a query like select from feeds where userid=
Otherwise, I could avoid writing so much data and, since I am already doing a read, I could do:
select from feeds where userid IN (list of friends).
Is the second one slower? I don't have the application yet to test this with a lot of data/clustering. As the application is big, writing code to test a single node is not worth it, so I am asking for your knowledge.
If your title is correct, and userid is a secondary index, then running a SELECT/WHERE/IN is not even possible. The WHERE/IN clause only works with primary key values. When you use it on a column with a secondary index, you will see something like this:
Bad Request: IN predicates on non-primary-key columns (columnName) is not yet supported
Also, the DataStax CQL3 documentation for SELECT has a section worth reading about using IN:
When not to use IN
The recommendations about when not to use an index apply to using IN
in the WHERE clause. Under most conditions, using IN in the WHERE
clause is not recommended. Using IN can degrade performance because
usually many nodes must be queried. For example, in a single, local
data center cluster with 30 nodes, a replication factor of 3, and a
consistency level of LOCAL_QUORUM, a single key query goes out to two
nodes, but if the query uses the IN condition, the number of nodes
being queried are most likely even higher, up to 20 nodes depending on
where the keys fall in the token range.
As for your first query, it's hard to speculate about performance without knowing about the cardinality of userid in the feeds table. If userid is unique or has a very high number of possible values, then that query will not perform well. On the other hand, if each userid can have several "feeds," then it might do ok.
Remember, Cassandra data modeling is about building your data structures for the expected queries. Sometimes, if you have 3 different queries for the same data, the best plan may be to store that same, redundant data in 3 different tables. And that's ok to do.
I would tackle this problem by writing a table geared toward that specific query. Based on what you have mentioned, I would build it like this:
CREATE TABLE feedsByUserId (
userid UUID,
feedid UUID,
action text,
PRIMARY KEY (userid, feedid));
With a composite primary key that uses userid as the partitioning key, you will then be able to run your SELECT/WHERE/IN query mentioned above, and achieve the expected results. Of course, I am assuming that the addition of feedid will make the entire key unique. If that is not the case, then you may need to add an additional field to the PRIMARY KEY. My example is also assuming that userid and feedid are version-4 UUIDs. If that is not the case, adjust their types accordingly.
How do I write subqueries/nested queries in Cassandra? Is this facility provided in CQL?
Example I tried:
cqlsh:testdb> select itemname from item where itemid = (select itemid from orders where customerid=1);
It just throws the following error -
Bad Request: line 1:87 no viable alternative at input ';'
Because of its distributed nature, Cassandra has no support for RDBMS style joins. You have a few options for when you want something like a join.
One option is to perform separate queries and then have your application join the data itself. This makes sense if the data is relatively small and you only have to perform a small number of queries. Based on the example you gave above, this would probably be a good solution for you.
For more complicated joins, the usual strategy is to denormalize the data and store a materialized view of the join. The advantage of this is that fetching the data will be much faster than having to build the join in your application every time you need it. The cost is that you now have multiple places where you are storing the same data, and you will need to keep it all in sync. You can either update all your views when new data comes into the system, or you can have a periodic batch job that rebuilds them.
You might find this article useful: Do You Really Need SQL to Do It All in Cassandra? It's a bit old, but its principles still apply.
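A minimal sketch of that application-side join, using the item and orders tables from the question. The in-memory lists and the helper functions below are hypothetical stand-ins for the two separate CQL queries; with a real driver, each helper would issue one SELECT.

```python
# Hypothetical in-memory stand-ins for the orders and item tables.
orders = [
    {"customerid": 1, "itemid": 10},
    {"customerid": 1, "itemid": 11},
    {"customerid": 2, "itemid": 12},
]
items = [
    {"itemid": 10, "itemname": "keyboard"},
    {"itemid": 11, "itemname": "mouse"},
    {"itemid": 12, "itemname": "monitor"},
]


def select_itemids_for_customer(orders, customerid):
    """Simulates: SELECT itemid FROM orders WHERE customerid = ?"""
    return [o["itemid"] for o in orders if o["customerid"] == customerid]


def select_itemname(items, itemid):
    """Simulates: SELECT itemname FROM item WHERE itemid = ?"""
    for item in items:
        if item["itemid"] == itemid:
            return item["itemname"]
    return None


def item_names_for_customer(orders, items, customerid):
    """Run the inner query first, then one outer query per returned itemid."""
    names = []
    for itemid in select_itemids_for_customer(orders, customerid):
        name = select_itemname(items, itemid)
        if name is not None:
            names.append(name)
    return names


names = item_names_for_customer(orders, items, customerid=1)
```

The first query replaces the subquery, and its results drive the second set of lookups, so the join happens in the application rather than in Cassandra.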