Nested query not working in Cassandra

USE users_tracking;
SELECT user_name FROM visits
where port_name IN
(SELECT port_name FROM ports where location = 'NY' )//as temp;
It gives an error
mismatched input 'SELECT' expecting RULE_T_R_PAREN
Is there any way I can store the inner query in a variable and then use that?
I tried using set#varname := query but it does not recognize the set command.

Nested queries are not allowed in Cassandra CQL. For this kind of complex querying you'll need to use Hive or Spark SQL.
Here is the full CQL reference:
http://cassandra.apache.org/doc/cql3/CQL.html
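Since CQL cannot nest queries, the usual client-side workaround is to run the inner query first and feed its results into the outer one. A minimal sketch with the DataStax Python driver, reusing the keyspace, tables and columns from the question (the contact point is an assumption, and the WHERE clauses still have to be valid for your schema, e.g. location and port_name being partition-key or indexed columns):
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # assumption: a local node
session = cluster.connect('users_tracking')

# Inner query first: collect the port names on the client side.
ports = [row.port_name for row in
         session.execute("SELECT port_name FROM ports WHERE location = %s", ['NY'])]

if ports:
    # Outer query: bind the collected values into an IN clause.
    placeholders = ", ".join(["%s"] * len(ports))
    rows = session.execute(
        f"SELECT user_name FROM visits WHERE port_name IN ({placeholders})", ports)
    for row in rows:
        print(row.user_name)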

Related

Spark SQL persistent view over jdbc data source

I want to create a persistent (global) view in Spark SQL that gets its data from an underlying JDBC database connection. It works fine when I use a temporary (session-scoped) view, as shown below, but it fails when I try to create a regular (persistent and global) view.
I don't understand why the latter should not work, and I couldn't find any docs or hints; all examples use temporary views. Technically I cannot see why it shouldn't work either: the data is properly retrieved from the JDBC source in the temporary view, so it should not matter whether I "store" the query in a persistent view that retrieves the data directly from the JDBC source whenever the view is called.
Config.
tbl_in = 'myjdbctable'
tbl_out = 'myview'
db_user = 'myuser'
db_pw = 'mypw'
jdbc_url = 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
This works.
query = f"""
create or replace temporary view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> DataFrame[]
This does not work.
query = f"""
create or replace view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> ParseException:
Error.
ParseException:
mismatched input 'using' expecting {'(', 'UP_TO_DATE', 'AS', 'COMMENT', 'PARTITIONED', 'TBLPROPERTIES'}(line 3, pos 0)
== SQL ==
create or replace view myview
using jdbc
^^^
options(
dbtable 'myjdbctable',
user 'myuser',
password '[REDACTED]',
url 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
)
TL;DR: A spark sql table over jdbc source behaves like a view and so can be used like one.
It seems my assumptions about JDBC tables in Spark SQL were flawed. It turns out that a SQL table with a JDBC source (i.e. created via using jdbc) is actually a live query against the JDBC source (and not a one-off JDBC query during table creation, as I had assumed). In my mind it therefore behaves like a view. That means if the underlying JDBC source changes (e.g. new entries in a column), this is reflected in the Spark SQL table on read (e.g. select from) without having to re-create the table.
It follows that a Spark SQL table over a JDBC source satisfies my requirement of having an always up-to-date reflection of the underlying table/SQL object in the JDBC source. Usually I would use a view for that. Maybe this is the reason why there is no persistent view over a JDBC source but only temporary views (which of course still make sense, as they are session-scoped). It should be noted that the Spark SQL JDBC table does not behave like a view in every respect, which may be surprising, in particular:
if you add a column in underlying jdbc table, it will not show up in spark sql table
if you remove a column from underlying jdbc table, an error will occur when spark sql table is accessed (assuming the removed column was present during spark sql table creation)
if you remove the underlying jdbc table, an error will occur when spark sql table is accessed
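A minimal sketch of the table-based alternative described above, reusing the variables from the question (the exact behaviour of such jdbc-backed tables may vary across Spark versions, so treat this as an illustration rather than a guarantee):
query = f"""
create table if not exists {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
# Reading the table later issues a live query against the jdbc source:
spark.sql(f"select * from {tbl_out}").show()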
The input of spark.sql should be DML (Data Manipulation Language). Its output is a DataFrame.
In terms of best practices, you should avoid using DDL (Data Definition Language) with spark.sql. Even if some statements may work, they are not meant to be used this way.
If you want to use DDL, simply connect to your DB using Python packages.
If you want to create a temp view in Spark, do it using the Spark syntax createTempView.
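A minimal sketch of that DataFrame-based route, reusing the connection variables from the question:
# Load the jdbc table into a DataFrame ...
df = (spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", tbl_in)
    .option("user", db_user)
    .option("password", db_pw)
    .load())
# ... and register it as a session-scoped view, equivalent to the temporary-view DDL above.
df.createOrReplaceTempView(tbl_out)
spark.sql(f"select * from {tbl_out}").show()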

Apache Cassandra "no viable alternative at input 'OR' "

My table looks like:
CREATE TABLE prod_cust (
pid bigint,
cid bigint,
effective_date date,
expiry_date date,
PRIMARY KEY ((pid, cid))
);
The query below gives a no viable alternative at input 'OR' error:
SELECT * FROM prod_cust
where
pid=101 and cid=201
OR
pid=102 and cid=202;
Does Cassandra not support the OR operator? If not, is there any alternative way to achieve my result?
CQL does not support the OR operator. Sometimes you can get around that by using IN. But even IN won't let you do what you're attempting.
I see two options:
Submit each side of your OR as individual queries (see the sketch after this list).
Restructure the table to better suit what you're trying to do. Doing a "port-over" from an RDBMS to Cassandra almost never works as intended.
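A minimal sketch of the first option with the DataStax Python driver, running one query per primary key and combining the results on the client (the contact point and keyspace name are assumptions):
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # assumption: a local node
session = cluster.connect('my_keyspace')  # assumption: your keyspace name

keys = [(101, 201), (102, 202)]
rows = []
for pid, cid in keys:
    # Each query targets exactly one partition, which is efficient in Cassandra.
    rows.extend(session.execute(
        "SELECT * FROM prod_cust WHERE pid = %s AND cid = %s", (pid, cid)))

for row in rows:
    print(row)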

How to expose a native SQL function as a predicate

I have a table in my database which stores a list of string values as a jsonb field.
create table rabbits_json (
rabbit_id bigserial primary key,
name text,
info jsonb not null
);
insert into rabbits_json (name, info) values
('Henry','["lettuce","carrots"]'),
('Herald','["carrots","zucchini"]'),
('Helen','["lettuce","cheese"]');
I want to filter my rows, checking whether info contains a given value.
In SQL, I would use the ? operator:
select * from rabbits_json where info ? 'carrots';
If my googling skills are fine today, I believe that this is not implemented yet in JOOQ:
https://github.com/jOOQ/jOOQ/issues/9997
How can I use a native predicate to write an equivalent query in jOOQ?
For anything that's not supported natively in jOOQ, you should use plain SQL templating, e.g.
Condition condition = DSL.condition("{0} ? {1}", RABBITS_JSON.INFO, DSL.val("carrots"));
Unfortunately, in this specific case, you will run into this issue here. With JDBC PreparedStatement, you still cannot use ? for other usages than bind variables. As a workaround, you can:
Use Settings.statementType == STATIC_STATEMENT to prevent using a PreparedStatement in this case
Use the jsonb_exists_any function (not indexable) instead of ?, see https://stackoverflow.com/a/38370973/521799

Jdbc update statement in spark

I am connected to a database using JDBC and I am trying to run an update query. First I write the query, then I execute it (in the same way I do a SELECT, which works perfectly fine).
caseoutputUpdateQuery = "(UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES') alias_output "
spark.read.jdbc(url=jdbcUrl, table=caseoutputUpdateQuery, properties=connectionProperties)
When I run this, I get the following error:
A nested INSERT, UPDATE, DELETE, or MERGE statement must have an OUTPUT clause.
I tried to fix this in different ways but there is always another error. For example, I tried to rewrite the query in the following way:
caseoutputUpdateQuery = "(UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES' OUTPUT DELETED.*, INSERTED.* FROM dbo.CASEOUTPUT_TEST) alias_output "
but I encounter this error:
A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed in a SELECT statement that is not the immediate source of rows for an INSERT statement.
The other way I tried to rewrite it was:
caseoutputUpdateQuery = "(INSERT INTO dbo.UpdateOutput(OldCaseID,NotifiedOld) SELECT * FROM( UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES' OUTPUT deleted.OldCaseID,DELETED.NotifiedOld ) AS tbl) alias_output "
but I've got this error:
A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed inside another nested INSERT, UPDATE, DELETE, or MERGE statement.
I've literally tried everything I found on the internet, but without luck. Do you have any suggestions on how I can fix this and run my update statement?
I think Spark is not designed for that UPDATE statement use case. That's not a scenario where Spark can help you deal with an RDBMS. I suggest using a direct JDBC connection from the code you are writing (I mean calling JDBC directly). If you are using Scala you can do it as suggested here (for example, but there are multiple other ways), or from Python as explained here. Those samples reach an Oracle engine, but please change the driver/connector if you are using MySQL, SQL Server, Postgres or any other RDBMS.
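A minimal sketch of such a direct connection from Python, here using pyodbc against the SQL Server instance suggested by the error messages (driver name, server, database and credentials are assumptions):
import pyodbc

# Assumed connection details; adjust driver, server, database and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.domain,1433;DATABASE=mydb;UID=myuser;PWD=mypw"
)
cursor = conn.cursor()
cursor.execute("UPDATE dbo.CASEOUTPUT_TEST SET NOTIFIED = 'YES'")
conn.commit()   # the UPDATE is only persisted after commit
cursor.close()
conn.close()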
spark.read under the covers does a select * from the source jdbc table. If you pass a query, Spark translates it to something like
select * from (your query) alias
SQL then complains because you are trying to run an UPDATE inside that wrapping SELECT, i.e. against a derived table ("select * from ...").
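For comparison, a minimal sketch of the read path that does work, reusing jdbcUrl and connectionProperties from the question (the WHERE clause is just an illustrative assumption):
# A plain SELECT wrapped as a derived table is what spark.read.jdbc expects:
caseoutputSelectQuery = "(SELECT * FROM dbo.CASEOUTPUT_TEST WHERE NOTIFIED = 'YES') alias_output"
df = spark.read.jdbc(url=jdbcUrl, table=caseoutputSelectQuery, properties=connectionProperties)
df.show()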

Composite key in Cassandra with Pig

We have a CQL table that looks something like this:
CREATE table data (
occurday text,
seqnumber int,
occurtimems bigint,
unique bigint,
fields map<text, text>,
primary key ((occurday, seqnumber), occurtimems, unique)
)
I can query this table from cqlsh like this:
select * from data where seqnumber = 10 AND occurday = '2013-10-01';
This query works and returns the expected data.
If I execute this query as part of a LOAD from within Pig, however, things don't work.
-- Need to URL encode the query
data = LOAD 'cql://ks/data?where_clause=seqnumber%3D10%20AND%20occurday%3D%272013-10-01%27' USING CqlStorage();
gives
InvalidRequestException(why:seqnumber cannot be restricted by more than one relation if it includes an Equal)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result.read(Cassandra.java:39567)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_prepare_cql3_query(Cassandra.java:1625)
at org.apache.cassandra.thrift.Cassandra$Client.prepare_cql3_query(Cassandra.java:1611)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.prepareQuery(CqlPagingRecordReader.java:591)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:621)
Shouldn't these behave the same? Why is the version through Pig failing where the straight cqlsh command works?
Hadoop is using CqlPagingRecordReader to try to load your data. This is leading to queries that are not identical to what you have entered. The paging record reader is trying to obtain small slices of Cassandra data at a time to avoid timeouts.
This means that your query is executed as
SELECT * FROM "data" WHERE token("occurday","seqnumber") > ? AND
token("occurday","seqnumber") <= ? AND occurday='A Great Day'
AND seqnumber=1 LIMIT 1000 ALLOW FILTERING
And this is why you are seeing your repeated key error. I'll submit a bug to the Cassandra Project.
Jira:
https://issues.apache.org/jira/browse/CASSANDRA-6151

Resources