Databricks DELTA CTAS with LOCATION using %sql - databricks

DELTA does not have CREATE TABLE LIKE. It does have CTAS.
I want to copy the definition of a table only, but also specify the LOCATION.
E.g. this does not work:
CREATE TABLE IF NOT EXISTS NEW_CUSTOMER_FEED
AS SELECT * from NEW_CUSTOMER_FEED WHERE 1 = 0
LOCATION '/atRest/data'
What am I missing?

If you check the syntax for CREATE TABLE USING, you will see that AS SELECT ... should be the last clause, after all other options. So in your case it should be:
CREATE TABLE IF NOT EXISTS NEW_CUSTOMER_FEED
LOCATION '/atRest/data'
AS SELECT * from NEW_CUSTOMER_FEED WHERE 1 = 0
P.S. Instead of WHERE 1=0, you can just do LIMIT 0...
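Putting both corrections together, a minimal sketch (the target name NEW_CUSTOMER_FEED_COPY is made up here so the source and destination differ):
CREATE TABLE IF NOT EXISTS NEW_CUSTOMER_FEED_COPY
LOCATION '/atRest/data'
AS SELECT * FROM NEW_CUSTOMER_FEED LIMIT 0
This creates an empty table at the given path with the source table's column definitions.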

Related

How to optimize a delete on table which doesn't have any primary key but has a column which has TimeStamp?

My process does an INSERT INTO a backup table 'B' from a table 'A' that gets updated daily [truncate and load] in an Azure SQL DB.
A column 'TSP' [e.g. value = 2022-12-19T22:06:01.950994] is present in both tables, and TSP is the same for all rows inserted on a given day.
Later in the day, I'm supposed to delete the older data.
Currently I'm using DELETE FROM B WHERE TSP < 'today - 1 day' logic.
Is there a way to optimize this delete using an index or something similar?
SSMS suggested creating a nonclustered index on the TSP column.
I tested it, but there doesn't seem to be much difference.
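For reference, the index I tested was along these lines (the index name here is arbitrary):
CREATE NONCLUSTERED INDEX IX_B_TSP ON B (TSP);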
If this was the data:
50mil TSP1
50mil TSP2
50mil TSP3
My expectation was that it would skip scanning the TSP2 and TSP3 rows and delete only the TSP1 rows, whereas if the table doesn't have an index it would need to scan all 150mil rows.
The batched delete operation utilizes a view to simplify the execution plan; this is the Fast Ordered Delete technique. It works by referencing the table only once, which in turn reduces the amount of I/O required.
Below are sample queries:
CREATE TABLE tableA
(
id int,
TSP Datetime DEFAULT GETDATE(),
[Log] NVARCHAR(250)
)
DECLARE @I INT = 1;
WHILE @I <= 1000
BEGIN
    INSERT INTO tableA VALUES (@I, GETDATE() - 1, CONCAT('Log message ', @I));
    SET @I = @I + 1;
END
Option 1: using a CTE
;WITH DeleteData
AS
(SELECT id, TSP, [Log] FROM tableA
WHERE CAST(tsp AS DATE) = CAST(GETDATE() - 1 AS DATE))
DELETE FROM DeleteData
Option 2: using a SQL view
CREATE VIEW VW_tableA AS SELECT * FROM tableA WHERE CAST(tsp AS DATE) = CAST(GETDATE() - 1 AS DATE)
DELETE FROM VW_tableA
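On a large table the delete itself can also be broken into small batches so each transaction stays small; a minimal sketch using the view above (the batch size of 10000 is an arbitrary assumption):
WHILE 1 = 1
BEGIN
    -- delete one small batch per iteration until no rows remain
    DELETE TOP (10000) FROM VW_tableA;
    IF @@ROWCOUNT = 0 BREAK;
END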
Reference 1: An article by John Sansom on fast-sql-server-delete.
Reference 2: Similar SO thread.

Pyspark refer to table created using sql

When I create a table using SQL in Spark, for example:
sql('CREATE TABLE example AS SELECT a, b FROM c')
How can I pull that table into the python namespace (I can't think of a better term) so that I can update it? Let's say I want to replace NaN values in the table like so:
import pyspark.sql.functions as F
table = sql('SELECT * FROM example')
for column in table.columns:
    table = table.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))
Does this operation update the original example table created with SQL? If I were to run sql('SELECT * FROM example').show(), would I see the updated results? When the original CREATE TABLE example ... SQL runs, is example automatically added to the Python namespace?
The sql function returns a new DataFrame, so the table is not modified. If you want to write a DataFrame's contents into a table created in spark, do it like this:
table.write.mode("append").saveAsTable("example")
But what you are doing actually changes the schema of the table; in that case:
table.createOrReplaceTempView("mytempTable")
sql("create table example2 as select * from mytempTable");

Create a "temporary" table in spark-SQL, not spark-scala?

I have loaded a table from an input file.
CREATE TABLE MyTable (
ID INT,
VALUE FLOAT,
RATE INT
...
);
LOAD DATA LOCAL INPATH 'MYPATH' INTO TABLE MyTable;
Now I'd like to create a new table based on this one:
DerivedTable =
SELECT ID, VALUE*RATE AS Total
FROM MyTable
WHERE VALUE IS NOT NULL;
Then I'm going to use this table as a source for other tables and for outputs.
What is the correct SQL (or Hive) way to create this "temporary" table? It should work in spark-sql.
PS: I know how to do that in spark-shell, but that is not what I'm looking for.
You can:
CREATE TEMPORARY VIEW DerivedTable AS (
SELECT ID, VALUE*RATE AS Total
FROM MyTable
WHERE VALUE IS NOT NULL);
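For the rest of the session the view then works as a source like any other table, e.g. (the filter value is arbitrary):
SELECT ID, Total
FROM DerivedTable
WHERE Total > 100;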

WHERE variable = ( subquery ) in OpenSQL

I'm trying to retrieve rows from a table where a subquery matches a variable. However, it seems as if the WHERE clause only lets me compare fields of the selected tables against a constant, variable or subquery.
I would expect to write something like this:
DATA(lv_expected_lines) = 5.
SELECT partner contract_account
INTO TABLE lt_bp_ca
FROM table1 AS tab1
WHERE lv_expected_lines = (
SELECT COUNT(*)
FROM table2
WHERE partner = tab1~partner
AND contract_account = tab1~contract_account ).
But obviously this select treats my local variable as a field name and gives me the error "Unknown column name 'lv_expected_lines' until runtime, you cannot specify a field list."
But in standard SQL this is perfectly possible:
SELECT PARTNER, CONTRACT_ACCOUNT
FROM TABLE1 AS TAB1
WHERE 5 = (
SELECT COUNT(*)
FROM TABLE2
WHERE PARTNER = TAB1.PARTNER
AND CONTRACT_ACCOUNT = TAB1.CONTRACT_ACCOUNT );
So how can I replicate this logic in RSQL / Open SQL?
If there's no way I'll probably just write native SQL and be done with it.
The program below might lead you to an Open SQL solution. It uses the SAP demo tables to determine the plane types that are used on a specific number of flights.
REPORT zgertest_sub_query.
DATA: lt_planetypes TYPE STANDARD TABLE OF s_planetpp.
PARAMETERS: p_numf TYPE i DEFAULT 62.
START-OF-SELECTION.
  SELECT planetype
    INTO TABLE lt_planetypes
    FROM sflight
    GROUP BY planetype
    HAVING COUNT( * ) EQ p_numf.
  LOOP AT lt_planetypes INTO DATA(planetype).
    WRITE: / planetype.
  ENDLOOP.
It only works if you don't need to read fields from TAB1. If you do, you will have to gather those with additional selects while looping over your results.
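Translated back to the question's tables, the same GROUP BY/HAVING trick in standard SQL would look roughly like this (a sketch, with the same caveat that no TAB1 fields are read):
SELECT PARTNER, CONTRACT_ACCOUNT
FROM TABLE2
GROUP BY PARTNER, CONTRACT_ACCOUNT
HAVING COUNT(*) = 5;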
For those who find this question in 2020 or later: this construction has been supported since ABAP 7.50, so no workarounds are needed:
SELECT kunnr, vkorg
  FROM vbak AS v
  WHERE 5 = ( SELECT COUNT(*)
              FROM vbap
              WHERE kunnr = v~kunnr
              AND vkorg = v~vkorg )
  INTO TABLE @DATA(customers).
This selects all customers who made 5 sales orders within some sales organization.
In ABAP there is no way to write the query exactly as in native SQL. Still, I would advise against native SQL; instead, give the SELECT/ENDSELECT statement a try.
DATA: ls_table1 TYPE table1,
      lt_table1 TYPE TABLE OF table1,
      lv_count  TYPE i.
SELECT partner contract_account
  INTO CORRESPONDING FIELDS OF ls_table1
  FROM table1.
  SELECT COUNT( * )
    INTO lv_count
    FROM table2
    WHERE partner = ls_table1-partner
      AND contract_account = ls_table1-contract_account.
  CHECK lv_count EQ 5.
  APPEND ls_table1 TO lt_table1.
ENDSELECT.
Here ls_table1 is appended to lt_table1 only for those rows where the count in table2 equals 5. Hope it helps.

cassandra 2.0.9: query for undefined column

Using Cassandra 2.0.9 CQL, how does one query for rows that don't have a particular column defined? For example:
create table testtable ( id int primary key, thing int );
create index on testtable ( thing );
-- can now select rows by thing
insert into testtable( id, thing ) values ( 100, 100 );
-- row values will be persistent
update testtable using TTL 30 set thing=1 where id=100;
-- wait 30 seconds, thing column will go away for the row
select * from testtable;
Ideally I'd like to be able to do something like this:
select * from testtable where NOT DEFINED thing;
or some such and have the row with id == 100 returned. Is there any way to search for rows that do not have a particular column value assigned?
I'm afraid I've been through the DataStax 2.0 manual, as well as the cqlsh help, with no luck trying to find an operator or syntax for this. Thanks.
Doesn't appear to be available yet
https://issues.apache.org/jira/browse/CASSANDRA-3783
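Until that ticket lands, about the only option is to fetch the candidate rows and filter client-side; the CQL half of that workaround is simply:
-- rows whose 'thing' has expired come back with that column null;
-- keeping or dropping them has to happen in application code
select id, thing from testtable;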
