Creating SQL Queries Dynamically in Scala and Apache Spark Dataframe - apache-spark

I have two tables, Table1 and Table2, with the structure below:
Table1
PkCol1
PkCol2
Col3
Col4
Col5
Table2
PkCol1
PkCol2
Col3
Col4
Col5
But I receive the primary key information as input. For example, I might receive it as PkCol1,PkCol2, and I may receive more primary key columns as input too.
How do I dynamically add my WHERE condition to the Spark SQL query?
Below is my code:
df.createOrReplaceTempView("Table1")
df2.createOrReplaceTempView("Table2")
val primaryKeyString = ar(1)  // comma-separated primary key columns received as input, e.g. "PkCol1,PkCol2"
val df3 = spark.sql("Select * from table1 where "+primaryKeyString+" not in (select "+primaryKeyString+" from table2)").toDF()
If there is a better way to do it with DataFrames, let me know.
I am able to achieve my purpose by concatenating in Spark SQL as below:
val df3 = spark.sql("Select * from table1 where CONCAT("+primaryKeyString+") not in (select CONCAT("+primaryKeyString+") from table2)").toDF()
I am trying to find out whether there is a better way to achieve this in Scala.
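One DataFrame-based alternative (a sketch, not a tested solution) is to skip the SQL string building entirely and use a left anti join keyed on the parsed primary key columns. The helper name antiJoinOnKeys is my own; df, df2 and primaryKeyString are the names from the code above. Note that the null-safe anti join handles NULL keys slightly differently from NOT IN.
import org.apache.spark.sql.DataFrame

// Sketch: keep the rows of df (Table1) whose key combination does not appear in df2 (Table2).
def antiJoinOnKeys(df: DataFrame, df2: DataFrame, primaryKeyString: String): DataFrame = {
  val keyCols = primaryKeyString.split(",").map(_.trim)
  // Build PkCol1 <=> PkCol1 AND PkCol2 <=> PkCol2 ... (null-safe equality on each key column)
  val joinCondition = keyCols.map(c => df(c) <=> df2(c)).reduce(_ && _)
  df.join(df2, joinCondition, "left_anti")
}

val df3 = antiJoinOnKeys(df, df2, primaryKeyString)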

Related

Is there any alternative to merge multiple rows into a single row without using groupBy() & collect_list() in Spark?

I am trying to merge multiple rows into a single row after grouping data on a different column.
col1 col2
A 1
A 2
B 1
B 3
to
col1 col2
A 1,2
B 1,3
By using the below code:
import pyspark.sql.functions as psf  # psf is used below for concat_ws / collect_list
df = spark.sql("select col1, col2, col3,...., colN from tablename where col3 = 'ABCD' limit 1000")
df.select('col1','col2').groupby('col1').agg(psf.concat_ws(', ', psf.collect_list(df.col2))).display()
This works fine when there is little data.
But if I try to increase the number of rows to 1 million, the code fails with the exception:
java.lang.Exception: Results too large
Is there any alternative to merge multiple rows into a single row in Spark without using the combination of groupBy() & collect_list()?
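For reference, the same aggregation written in the Scala DataFrame API (the language of the main question) looks roughly like the sketch below; df, col1 and col2 are the question's own names, and the same scaling caveat applies because collect_list still gathers every value of a group into one row.
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

// Group on col1 and collapse each group's col2 values into one comma-separated string.
val merged = df
  .select("col1", "col2")
  .groupBy("col1")
  .agg(concat_ws(",", collect_list(col("col2"))).as("col2"))
merged.show(false)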

How to delete data from a Delta table?

I am trying to delete data from a Delta table.
When I run the query below, I get back around 500 or 1000 records.
SELECT * FROM table1 inv
join (SELECT col1, col2, col3, min(Date) minDate, max(Date) maxDate FROM table2 a GROUP BY col1, col2, col3) aux
on aux.col1 = inv.col1 and aux.col2 = inv.col2 and aux.col3 = inv.col3
WHERE Date between aux.minDate and aux.maxDate
But when I try to delete those 500 records with the query below, I get a syntax error.
DELETE FROM table1 inv
join (SELECT col1, col2, col3, min(Date) minDate, max(Date) maxDate FROM table2 a GROUP BY col1, col2, col3) aux
on aux.col1 = inv.col1 and aux.col2 = inv.col2 and aux.col3 = inv.col3
WHERE Date between aux.minDate and aux.maxDate
Can someone please help me here?
Thanks in advance :).
Here is the SQL reference:
DELETE FROM table_identifier [AS alias] [WHERE predicate]
You can't use JOIN here, so expand your WHERE clause according to your needs.
Here are some examples:
DELETE FROM table1
WHERE EXISTS (SELECT ... FROM table2 ...)
DELETE FROM table1
WHERE table1.col1 IN (SELECT ... FROM table2 WHERE ...)
DELETE FROM table1
WHERE table1.col1 NOT IN (SELECT ... FROM table2 WHERE ...)
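Applied to the query in the question, a sketch of the EXISTS variant run from Spark Scala could look like this (it assumes table1 is a Delta table, that your Spark/Delta version accepts subqueries in the DELETE predicate, and it reuses the question's column names):
spark.sql("""
  DELETE FROM table1 AS inv
  WHERE EXISTS (
    SELECT 1
    FROM (SELECT col1, col2, col3, min(Date) AS minDate, max(Date) AS maxDate
          FROM table2 GROUP BY col1, col2, col3) aux
    WHERE aux.col1 = inv.col1
      AND aux.col2 = inv.col2
      AND aux.col3 = inv.col3
      AND inv.Date BETWEEN aux.minDate AND aux.maxDate
  )
""")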

Cassandra select query failure

We have a table:
CREATE TABLE table (
col1 text,
col2 text,
col3 timestamp,
col4 int,
col5 timestamp,
PRIMARY KEY (col1, col2, col3, col4)
) WITH CLUSTERING ORDER BY (col2 DESC, col3 DESC,col4 DESC)
When I try querying from this table like:
select * from table where col1 = 'something' and col3 < 'something'
and col4= 12 limit 5 ALLOW FILTERING;
select * from table where col1 = 'something' and col4 < 23
and col3 >= 'something' ALLOW FILTERING;
I always get the error: Clustering column "col4" cannot be restricted (preceding column "col3" is restricted by a non-EQ relation).
I tried changing the table definition so the clustering columns are ordered col4, col3, col2, but then the second query doesn't work and throws a similar error.
Any suggestion/advice to solve this problem?
We are on Cassandra 3.0.17.7.
You can use a non-equality condition only on the last clustering column restricted in the query.
For example, you can use col1 = val and col2 <= ..., or col1 = val and col2 = val2 and col3 <= ..., or col1 = val and col2 = val2 and col3 = val3 and col4 <= ..., but you can't put non-equality conditions on several clustering columns - that's how Cassandra reads data.
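To make the rule concrete, here is a sketch of allowed and disallowed query shapes for the table above, written as plain CQL strings in Scala (the literal values 'a', 'b', '2018-01-01' and 12 are invented placeholders):
// Allowed: equality on the partition key and on every clustering column before the ranged one.
val ok1 = "SELECT * FROM table WHERE col1 = 'a' AND col2 <= 'b'"
val ok2 = "SELECT * FROM table WHERE col1 = 'a' AND col2 = 'b' AND col3 <= '2018-01-01'"
val ok3 = "SELECT * FROM table WHERE col1 = 'a' AND col2 = 'b' AND col3 = '2018-01-01' AND col4 <= 12"
// Not allowed: a range on col3 followed by a restriction on col4, as in the question's first query.
val bad = "SELECT * FROM table WHERE col1 = 'a' AND col3 < '2018-01-01' AND col4 = 12"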

How to convert a Teradata recursive query to Spark SQL

I am trying to convert the Teradata SQL below to Spark SQL, but I am unable to. Can someone suggest a solution?
create multiset table test1 as
(
WITH RECURSIVE test1 (col1, col2, col3) AS
(
sel col11, col2, col3
from
test2 root
where
col3 = 1
UNION ALL
SELECT
indirect.col11,
indirect.col2 || ',' || direct.col2 as col2,
indirect.col3
FROM
test1 direct,
test2 indirect
WHERE
direct.col1 = indirect.col11
and direct.col3 + 1 = indirect.col3
)
sel col1 as col11,
col2
from
test1 QUALIFY ROW_NUMBER() OVER(PARTITION BY col1
ORDER BY
col3 DESC) = 1
)
with data primary index (col11) ;
Thanks.
I tried the approach set out here http://sqlandhadoop.com/how-to-implement-recursive-queries-in-spark/ myself some time ago.
I cannot find my simplified version, but this approach is currently the only way to do it. I assume that Spark SQL support for recursive queries will be added in the future, although that is not certain.
On a further note: I have myself seen the requirement to develop KPIs along this while-loop approach. I would suggest that recursive SQL, as well as while loops for KPI generation, not be considered a use case for Spark, and hence be done in a fully ANSI-compliant database, with the result sqooped into Hadoop if required.
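For completeness, the general shape of that iterative approach in Scala is sketched below: replace the recursive CTE with a loop that joins the previous level back onto test2 and unions the results until no new rows appear. Table and column names follow the question; the iteration cap and the omitted final dedup step are my own assumptions.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws}

val test2 = spark.table("test2")

// Seed level: rows of test2 with col3 = 1 (the anchor part of the recursive CTE).
var level: DataFrame = test2.filter(col("col3") === 1)
  .select(col("col11").as("col1"), col("col2"), col("col3"))
var result = level
var keepGoing = true
var i = 0
val maxIterations = 100  // assumed safety cap

while (keepGoing && i < maxIterations) {
  // Recursive part: join the previous level back onto test2 and append the col2 values.
  val next = level.as("direct")
    .join(test2.as("indirect"),
      col("direct.col1") === col("indirect.col11") &&
      col("direct.col3") + 1 === col("indirect.col3"))
    .select(
      col("indirect.col11").as("col1"),
      concat_ws(",", col("indirect.col2"), col("direct.col2")).as("col2"),
      col("indirect.col3"))
  keepGoing = !next.isEmpty
  if (keepGoing) {
    result = result.union(next)   // in practice, cache/checkpoint here to keep the plan from growing
    level = next
  }
  i += 1
}
// The QUALIFY ROW_NUMBER() step of the original (keep the row with the highest col3
// per col1) can then be done with a row_number window over result; omitted for brevity.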

Cassandra: Query with where clause containing greater-than or less-than (< and >)

I'm using Cassandra 1.1.2. I'm trying to convert an RDBMS application to Cassandra. In my RDBMS application I have the following table, called table1:
| Col1 | Col2 | Col3 | Col4 |
Col1: String (primary key)
Col2: String (primary key)
Col3: Bigint (index)
Col4: Bigint
This table contains over 200 million records. The most frequently used query is something like:
Select * from table where col3 < 100 and col3 > 50;
In Cassandra I used the following statement to create the table:
create table table1 (primary_key varchar, col1 varchar,
col2 varchar, col3 bigint, col4 bigint, primary key (primary_key));
create index on table1(col3);
I changed the primary key to an extra column (I calculate the key inside my application).
After importing a few records I tried to execute the following CQL:
select * from table1 where col3 < 100 and col3 > 50;
The result is:
Bad Request: No indexed columns present in by-columns clause with Equal operator
The query select col1,col2,col3,col4 from table1 where col3 = 67 works.
Google says there is no way to execute that kind of query. Is that right? Any advice on how to create such a query?
Cassandra indexes don't actually support sequential access; see http://www.datastax.com/docs/1.1/ddl/indexes for a good quick explanation of where they are useful. But don't despair; the more classical way of using Cassandra (and many other NoSQL systems) is to denormalize, denormalize, denormalize.
It may be a good idea in your case to use the classic bucket-range pattern, which lets you use the recommended RandomPartitioner and keep your rows well distributed around your cluster, while still allowing sequential access to your values. The idea in this case is that you would make a second dynamic columnfamily mapping (bucketed and ordered) col3 values back to the related primary_key values. As an example, if your col3 values range from 0 to 10^9 and are fairly evenly distributed, you might want to put them in 1000 buckets of range 10^6 each (the best level of granularity will depend on the sort of queries you need, the sort of data you have, query round-trip time, etc). Example schema for cql3:
CREATE TABLE indexotron (
rangestart int,
col3val int,
table1key varchar,
PRIMARY KEY (rangestart, col3val, table1key)
);
When inserting into table1, you should insert a corresponding row in indexotron, with rangestart = int(col3val / 1000000). Then when you need to enumerate all rows in table1 with col3 > X, you need to query up to 1000 buckets of indexotron, but all the col3vals within will be sorted. Example query to find all table1.primary_key values for which table1.col3 < 4021:
SELECT * FROM indexotron WHERE rangestart = 0 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 1000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 2000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 3000 ORDER BY col3val;
SELECT * FROM indexotron WHERE rangestart = 4000 AND col3val < 4021 ORDER BY col3val;
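To make the bucket arithmetic explicit, here is a small Scala helper (my own addition, not part of the original answer) that generates those statements for a given upper bound; a bucket width of 1000 reproduces the five queries above, whereas the insert rule quoted earlier uses a width of 1,000,000.
// Hypothetical helper: generate the bucketed indexotron queries for "col3 < upperBound".
def bucketedQueries(upperBound: Int, bucketWidth: Int): Seq[String] = {
  val lastBucket = (upperBound / bucketWidth) * bucketWidth
  val fullBuckets = (0 until lastBucket by bucketWidth).map { start =>
    s"SELECT * FROM indexotron WHERE rangestart = $start ORDER BY col3val;"
  }
  fullBuckets :+ s"SELECT * FROM indexotron WHERE rangestart = $lastBucket AND col3val < $upperBound ORDER BY col3val;"
}

bucketedQueries(4021, 1000).foreach(println)  // prints the five statements listed above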
If col3 is always limited to small, known values/ranges, you may be able to get away with a simpler table that also maps back to the initial table, e.g.:
create table table2 (col3val int, table1key varchar,
primary key (col3val, table1key));
and use
insert into table2 (col3val, table1key) values (55, 'foreign_key');
insert into table2 (col3val, table1key) values (55, 'foreign_key3');
select * from table2 where col3val = 51;
select * from table2 where col3val = 52;
...
Or
select * from table2 where col3val in (51, 52, ...);
This may be OK if your ranges are not too large. (You could get the same effect with your secondary index as well, but secondary indexes aren't highly recommended.) You could theoretically parallelize it "locally on the client side" as well.
It seems the "Cassandra way" is to have some key like "userid" that you use as the first part of "all your queries", so you may need to rethink your data model. Then you can have queries like select * from table1 where userid='X' and col3val > 3, and it can work (assuming a clustering key on col3val).
