How to convert a Teradata recursive query to Spark SQL - apache-spark

I am trying to convert the Teradata SQL below to Spark SQL but have not been able to. Can someone suggest a solution?
create multiset table test1 as
(
WITH RECURSIVE test1 (col1, col2, col3) AS
(
sel col11, col2, col3
from
test2 root
where
col3 = 1
UNION ALL
SELECT
indirect.col11,
indirect.col2 || ',' || direct.col2 as col2,
indirect.col3
FROM
test1 direct,
test2 indirect
WHERE
direct.col1 = indirect.col11
and direct.col3 + 1 = indirect.col3
)
sel col1 as col11,
col2
from
test1 QUALIFY ROW_NUMBER() OVER(PARTITION BY col1
ORDER BY
col3 DESC) = 1
)
with data primary index (col11) ;
Thanks.

I tried the approach described at http://sqlandhadoop.com/how-to-implement-recursive-queries-in-spark/ myself some time ago.
I cannot find my simplified version, but as far as I know this approach is currently the only way to do it. Spark SQL may add support for recursive queries in the future, although I am not sure it will.
On a further note: I have seen requirements to develop KPIs using this while-loop approach. I would suggest that neither recursive SQL nor a while loop for KPI generation be considered a use case for Spark. Do that work in a fully ANSI-compliant database instead and Sqoop the result into Hadoop, if required.
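For what it's worth, the while-loop idea can be sketched in plain Python, with a list of dicts standing in for the test2 table. This is only an illustration of the iteration (seed rows, repeated join against the previous level, then "latest row per key"), not Spark code; the function name and the sample rows are made up for the example.

```python
def expand_recursively(test2):
    """Emulate the recursive query with a loop.

    test2 is a list of dicts with keys col11, col2, col3 (hypothetical sample shape).
    """
    # Seed step: SELECT col11, col2, col3 FROM test2 WHERE col3 = 1
    frontier = [(r["col11"], r["col2"], r["col3"]) for r in test2 if r["col3"] == 1]
    result = list(frontier)

    # Recursive step: join the previous level with test2 until no new rows appear.
    # col3 grows by 1 per level, so the loop stops once the deepest level is reached.
    while frontier:
        next_frontier = []
        for col1, col2, col3 in frontier:
            for r in test2:
                if r["col11"] == col1 and r["col3"] == col3 + 1:
                    next_frontier.append((r["col11"], r["col2"] + "," + col2, r["col3"]))
        result.extend(next_frontier)
        frontier = next_frontier

    # QUALIFY ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col3 DESC) = 1
    best = {}
    for col1, col2, col3 in result:
        if col1 not in best or col3 > best[col1][1]:
            best[col1] = (col2, col3)
    return {col1: col2 for col1, (col2, _) in best.items()}


sample = [
    {"col11": "a", "col2": "x", "col3": 1},
    {"col11": "a", "col2": "y", "col3": 2},
    {"col11": "a", "col2": "z", "col3": 3},
    {"col11": "b", "col2": "p", "col3": 1},
]
print(expand_recursively(sample))
```

In Spark the same shape would apply: each pass of the loop is one join of the previous level's DataFrame against test2, and the loop ends when a pass returns no rows.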

Related

Hive: How to use conditional statements to execute a different query based on a result

I have the query select col1, col2 from view1 and I want to execute it only when (select columnvalue from table1) > 0; otherwise do nothing.
if (select columnvalue from table1) > 0
select col1, col2 from view1
else
do nothing
How can I achieve this in a single Hive query?
If the check query returns a scalar value (a single row), you can cross join with the check result and filter using a > 0 condition:
with check_query as (
select count (*) cnt
from table1
)
select *
from view1 t
cross join check_query c
where c.cnt>0
;
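To see the guard behave, here is a small sketch that runs the same cross-join pattern against an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here, and the table contents and the run_guarded helper are invented for the demonstration.

```python
import sqlite3

# Same shape as the query above: the one-row check result gates every row of view1.
GUARDED_QUERY = """
WITH check_query AS (
    SELECT COUNT(*) AS cnt FROM table1
)
SELECT t.col1, t.col2
FROM view1 t
CROSS JOIN check_query c
WHERE c.cnt > 0
"""


def run_guarded(table1_values):
    """Load table1 with the given values and return the guarded query's rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (columnvalue INTEGER)")
    # A plain table stands in for the view (assumption made for the demo).
    conn.execute("CREATE TABLE view1 (col1 TEXT, col2 TEXT)")
    conn.executemany("INSERT INTO table1 VALUES (?)", [(v,) for v in table1_values])
    conn.executemany("INSERT INTO view1 VALUES (?, ?)", [("a", "b"), ("c", "d")])
    rows = conn.execute(GUARDED_QUERY).fetchall()
    conn.close()
    return rows


print(run_guarded([]))   # check fails: no rows come back
print(run_guarded([7]))  # check passes: all rows of view1 come back
```

Note that this checks the row count of table1, as in the answer; if the condition should be on the value of columnvalue itself, the check query would select that value instead.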

Creating SQL Queries Dynamically in Scala and Apache Spark Dataframe

I have two tables, Table1 and Table2, with the structure below
Table1
PkCol1
PkCol2
Col3
Col4
Col5
Table2
PkCol1
PkCol2
Col3
Col4
Col5
I receive the primary key information as input. For example, I receive it as PkCol1,PkCol2. I may receive more primary key columns as input too.
How do I dynamically add my where condition to Spark SQL?
Below is my code:
df.createOrReplaceTempView("Table1")
df2.createOrReplaceTempView("Table2")
val primaryKeyString = ar(1)
val df3 = spark.sql("Select * from table1 where "+primaryKeyString+" not in (select "+primaryKeyString+" from table2)").toDF()
If there is a better way to do it with DataFrames, let me know.
I am able to achieve my purpose by concatenating the key columns in Spark SQL as below:
val df3 = spark.sql("Select * from table1 where CONCAT("+primaryKeyString+") not in (select CONCAT("+primaryKeyString+") from table2)").toDF()
I am trying to find out whether there is a better way to achieve this in Scala.
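One possible alternative, sketched under assumptions: build one equality condition per key column and use a LEFT ANTI JOIN instead of CONCAT with NOT IN. Concatenating keys without a separator can make different key pairs look equal ('ab' + 'c' versus 'a' + 'bc'), and NOT IN behaves differently from an anti join when keys contain NULLs, so check which semantics you need. Python is used only to keep the sketch short (the same string building works in Scala), and the build_anti_join_sql helper is hypothetical.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_anti_join_sql(primary_key_string, left="table1", right="table2"):
    """Build a Spark SQL anti join from a comma-separated list of key columns."""
    keys = [k.strip() for k in primary_key_string.split(",") if k.strip()]
    if not keys:
        raise ValueError("at least one primary key column is required")
    for key in keys:
        # The column names come from input, so refuse anything that is not a plain identifier.
        if not IDENTIFIER.fullmatch(key):
            raise ValueError("unexpected column name: " + repr(key))
    on_clause = " AND ".join(f"a.{k} = b.{k}" for k in keys)
    return f"SELECT a.* FROM {left} a LEFT ANTI JOIN {right} b ON {on_clause}"


print(build_anti_join_sql("PkCol1,PkCol2"))
```

The resulting string can be passed to spark.sql in the same way as the query in the question.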

Error in Cassandra while exporting data

I have a column family (eventData) in a Cassandra keyspace, with the following definition:
CREATE TABLE eventData (
col1 text,
col2 text,
col3 text,
col4 int,
col5 int,
col6 int,
col7 timestamp,
PRIMARY KEY (col7, col1, col2)
);
I have scheduled a cron job that exports the data from Cassandra to a text file. It is defined like this:
dd=$(date --date yesterday "+%Y-%m-%d")
echo "select * FROM keyspacename.eventData where col7 = '$dd' ;" | /home/cassuser/Desktop/cassandra212/bin/cqlsh > /home/cassuser/Desktop/cassandra212/Dump/output-$dd.txt
The above statement gives the following error every day when the cron job runs. But if I run the same query manually from cqlsh, it exports the data without any error. Can anyone tell me the reason for this?
Error :
<stdin>:2:errors={}, last_host=localhost
I have read many posts on SO, and most of them say this error might be caused by read_timeout. But my question is: why don't I get the same error when running the same query manually from cqlsh?

ERROR CASSANDRA: 'ascii' codec can't decode byte 0xe1 in position 27: ordinal not in range(128) cqlsh

I'm new to Cassandra and I'm having trouble inserting some rows into a database; I get the error in the title.
I use Cassandra 1.0.8 and cqlsh to make changes to my database.
Below are the steps I took before getting the error:
CREATE A COLUMN FAMILY
CREATE TABLE test (
col1 int PRIMARY KEY,
col2 bigint,
col3 boolean,
col4 timestamp
);
INSERT SEVERAL ROWS WITHOUT SPECIFYING ALL OF THE COLUMNS OF THE TABLE
insert into test (col1, col2, col3) values (1, 100, true);
insert into test (col1, col2, col3) values (2, 200, false);
SELECT FOR CHECKING THAT ROWS HAVE BEEN INSERTED CORRECTLY
select * from test;
The result is the following:
INSERT A ROW SPECIFYING A VALUE FOR col4 (NOT SPECIFIED BEFORE)
insert into test (col1, col2, col3, col4) values (3, 100, true, '2011-02-03');
SELECT FOR CHECKING THAT ROW HAS BEEN INSERTED CORRECTLY
select * from test;
In this SELECT is the error. The result is the following:
SELECT EACH COLUMN OF THE TABLE SEPARATELY
select col1 from test;
select col2 from test;
select col3 from test;
select col4 from test;
it works fine and shows the right values:
So my question is: what is the problem with the SELECT * that fails? What's wrong?
Thanks in advance!!
NOTE:
If I define col4 as an Integer rather than a timestamp, it works. I have also tried inserting col4 in the normalized format yyyy-mm-dd HH:mm (I tried '2011-02-03 01:05' and '2011-02-03 01:05:10'), but that doesn't work either.
Cassandra 1.0.8 shipped with CQL2 and that's where your problem is coming from. I managed to recreate this in 1.0.8 but it works fine with 1.2.x so my advice is upgrade if you can.
In C* 1.2.10
cqlsh> update db.user set date='2011-02-03 01:05' where user='JCTYpjJlM';
cqlsh> SELECT * from db.user ;
user | date | password
-----------+--------------------------+----------
xvkYQKerQ | null | 765
JCTYpjJlM | 2011-02-03 01:05:00+0200 | 391
Weird. Try to insert col4 as an Integer (convert to milliseconds first) or use the normalized format: yyyy-mm-dd HH:mm
According to the doc, you can omit the time and just input the date, but it seems that breaks something in your case.
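If you go the "convert to milliseconds first" route, the conversion can be sketched like this in Python. The sketch assumes the date string should be interpreted as UTC, which may not match your cluster or client settings; to_epoch_millis is just an illustrative name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text, fmt="%Y-%m-%d"):
    """Parse a date string and return milliseconds since the Unix epoch.

    Assumption: the input is interpreted as UTC.
    """
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


print(to_epoch_millis("2011-02-03"))
print(to_epoch_millis("2011-02-03 01:05", "%Y-%m-%d %H:%M"))
```

The resulting integer can then be used as the col4 value in the insert statement.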

Cassandra: Query with where clause containing greater- or lesser-than (< and >)

I'm using Cassandra 1.1.2 and I'm trying to convert an RDBMS application to Cassandra. In my RDBMS application I have the following table, called table1:
| Col1 | Col2 | Col3 | Col4 |
Col1: String (primary key)
Col2: String (primary key)
Col3: Bigint (index)
Col4: Bigint
This table holds over 200 million records. The most frequently used query is something like:
Select * from table1 where col3 < 100 and col3 > 50;
In Cassandra I used following statement to create the table:
create table table1 (primary_key varchar, col1 varchar,
col2 varchar, col3 bigint, col4 bigint, primary key (primary_key));
create index on table1(col3);
I changed the primary key to an extra column (I calculate the key inside my application).
After importing a few records I tried to execute the following CQL:
select * from table1 where col3 < 100 and col3 > 50;
The result is:
Bad Request: No indexed columns present in by-columns clause with Equal operator
The query select col1,col2,col3,col4 from table1 where col3 = 67 works.
Google says there is no way to execute this kind of query. Is that right? Any advice on how to write such a query?
Cassandra indexes don't actually support sequential access; see http://www.datastax.com/docs/1.1/ddl/indexes for a good quick explanation of where they are useful. But don't despair; the more classical way of using Cassandra (and many other NoSQL systems) is to denormalize, denormalize, denormalize.
It may be a good idea in your case to use the classic bucket-range pattern, which lets you use the recommended RandomPartitioner and keep your rows well distributed around your cluster, while still allowing sequential access to your values. The idea in this case is that you would make a second dynamic columnfamily mapping (bucketed and ordered) col3 values back to the related primary_key values. As an example, if your col3 values range from 0 to 10^9 and are fairly evenly distributed, you might want to put them in 1000 buckets of range 10^6 each (the best level of granularity will depend on the sort of queries you need, the sort of data you have, query round-trip time, etc). Example schema for cql3:
CREATE TABLE indexotron (
rangestart int,
col3val int,
table1key varchar,
PRIMARY KEY (rangestart, col3val, table1key)
);
When inserting into table1, you should insert a corresponding row in indexotron, with rangestart = int(col3val / 1000000). Then when you need to enumerate all rows in table1 with col3 > X, you need to query up to 1000 buckets of indexotron, but all the col3vals within each bucket will be sorted. Example queries to find all table1.primary_key values for which table1.col3 < 4021 (note that, to keep the example short, these use a bucket width of 1000, with rangestart holding the lower bound of each bucket):
SELECT * FROM indexotron WHERE rangestart = 0 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 1000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 2000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 3000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 4000 AND col3val < 4021 ORDER BY col3val;
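The per-bucket queries above can be generated on the client. Below is a minimal Python sketch, assuming non-negative col3 values and a bucket whose rangestart is the lower bound of the bucket (width 1000 to match the queries listed above, not the 10^6 width mentioned earlier). The function names are made up for the example.

```python
def bucket_start(col3val, width):
    """Lower bound of the bucket that holds col3val."""
    return (col3val // width) * width


def queries_below(upper, width=1000):
    """CQL statements that together return every indexotron row with col3val < upper."""
    queries = []
    # Bucket holding the largest value that still qualifies (upper - 1).
    last = bucket_start(upper - 1, width)
    for start in range(0, last + 1, width):
        if start + width <= upper:
            # The whole bucket lies below the limit, so no extra filter is needed.
            queries.append(
                f"SELECT * FROM indexotron WHERE rangestart = {start} ORDER BY col3val;"
            )
        else:
            # Only the last bucket needs the col3val restriction.
            queries.append(
                f"SELECT * FROM indexotron WHERE rangestart = {start} "
                f"AND col3val < {upper} ORDER BY col3val;"
            )
    return queries


for statement in queries_below(4021):
    print(statement)
```

Each statement hits exactly one partition, so the client can run them one after another or in parallel and concatenate the results in bucket order.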
If col3 only ever takes a small, known set of values or ranges, you may be able to get away with a simpler table that also maps back to the initial table, e.g.:
create table table2 (col3val int, table1key varchar,
primary key (col3val, table1key));
and use
insert into table2 (col3val, table1key) values (55, 'foreign_key');
insert into table2 (col3val, table1key) values (55, 'foreign_key3');
select * from table2 where col3val = 51;
select * from table2 where col3val = 52;
...
Or
select * from table2 where col3val in (51, 52, ...);
This may be OK if your ranges are not too large. (You could get the same effect with your secondary index as well, but secondary indexes are not highly recommended, as far as I know.) You could, in theory, also parallelize the queries locally on the client side.
It seems the "Cassandra way" is to have some key like "userid" and use it as the first part of all your queries, so you may need to rethink your data model. You can then run queries like select * from table1 where userid='X' and col3val > 3, and they will work (assuming a clustering key on col3val).
