Pure SQL way to save query output in Spark SQL

Is there any way to save query result to file using pure SQL form?
I understand that we can do it easily using the Java or Scala API, but I am looking for a pure SQL solution that can be executed via the spark-sql CLI directly.

CREATE TABLE is the closest thing:
CREATE TABLE table
USING format -- some format
LOCATION '/path/to/location'
AS SELECT * FROM some_view -- some query
It is not fully equivalent to the simple save methods, as it registers the table in the metastore.
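If you want to skip the metastore entirely, there is also INSERT OVERWRITE DIRECTORY (supported by Spark SQL for file-based sources since around version 2.3). A minimal sketch, where the path, format, and view name are placeholders:
INSERT OVERWRITE DIRECTORY '/path/to/location'
USING parquet -- or csv, json, orc, ...
SELECT * FROM some_view
Unlike CREATE TABLE ... AS SELECT, this writes the files without registering anything.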

Related

Does Databricks offer something like Oracle's dblink?

I am aware that I can load anything into a DataFrame using JDBC; that works well for Oracle sources. Is there an equivalent in Spark SQL, so I can combine datasets as well?
Basically something like this - you get the idea...
select
lt.field1,
rt.field2
from localTable lt
join remoteTable#serverLink rt
on rt.id = lt.id
Thanks
dblink does not exist in Spark. You can instead create two tables with JDBC sources and then join them. It is a little more to write, but you'll get the correct result.
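A minimal sketch of that in pure Spark SQL, where the JDBC URLs, table names, and credentials are all placeholders:
CREATE TABLE lt
USING org.apache.spark.sql.jdbc
OPTIONS (url 'jdbc:oracle:thin:@//host1:1521/svc1', dbtable 'SCHEMA1.TABLE1', user 'user1', password 'pass1');

CREATE TABLE rt
USING org.apache.spark.sql.jdbc
OPTIONS (url 'jdbc:oracle:thin:@//host2:1521/svc2', dbtable 'SCHEMA2.TABLE2', user 'user2', password 'pass2');

SELECT lt.field1, rt.field2 FROM lt JOIN rt ON rt.id = lt.id;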
In Python, you can perhaps do it more easily with something like:
spark.read.jdbc(url1, "table1", properties=props1).join(spark.read.jdbc(url2, "table2", properties=props2), "id", "inner")
There is also upcoming Query Federation functionality that will allow access to tables in other databases by registering them in Databricks SQL.

Databricks SQL nondeterministic expressions using DELETE FROM

I am trying to execute the following SQL clause using Databricks SQL:
DELETE FROM prod_gbs_gpdi.bronze_data.sapex_ap_posted AS HISTORICAL_DATA
WHERE
HISTORICAL_DATA._JOB_SOURCE_FILE = (SELECT MAX(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
The intention of the query is to delete a set of rows in the historical data table based on a value present in a column of the new data table.
For reasons that I cannot understand, it raises an error like:
Error in SQL statement: AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate, Window, or Generate, but found:
(HISTORICAL_DATA._JOB_SOURCE_FILE IN (listquery()))
in operator DeleteCommandEdge
It seems a subquery is not accepted inside the WHERE clause. That is odd to me, as the Databricks documentation says it is acceptable.
I even tried other types of predicates, like:
(SELECT FIRST(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
(SELECT DISTINCT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
IN (SELECT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
None of them resulted in the query executing successfully.
What is even odder to me is that I was able to accomplish a similar case with a slightly different query, as can be seen in this link.
I created demo_table1 and demo_table2 for testing purposes and wrote the following query with a similar purpose. I did not use double aliases and gave a straightforward query using a subquery. It may also depend on the data frame in use; I used a normal pandas data frame. It works fine for me:
delete from demo_table1 as t1 where t1.age = (select min(t2.age) from demo_table2 as t2);
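If the subquery form keeps being rejected on your table (the DeleteCommandEdge in the error suggests a Delta table), a workaround worth trying, not from the answer above, is to rewrite the DELETE as a MERGE, which Databricks SQL supports on Delta tables. A sketch using the question's table names:
MERGE INTO prod_gbs_gpdi.bronze_data.sapex_ap_posted AS HISTORICAL_DATA
USING (SELECT MAX(_JOB_SOURCE_FILE) AS max_file FROM temp_sapex_posted) AS NEW_DATA
ON HISTORICAL_DATA._JOB_SOURCE_FILE = NEW_DATA.max_file
WHEN MATCHED THEN DELETE;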

Can we run complex multi line SQL queries using Blueprism?

I am new to SQL stuff in Blue Prism. I am able to configure the SQL object and execute simple queries, but I am facing trouble while trying to run multi-line complex SQL queries.
When I try to execute the below query in Blue Prism, I get an error message saying "Incorrect Syntax near Database2".
"select top 10 * from [Database1].[dbo].[Table1]
join [Database1].[dbo].[Table2] on [Database1].[dbo].[Table2].Fieldname1=[Database1].[dbo].[Table1].Fieldname2
join [Database2].[dbo].[Table1] on [Database2].[dbo].[Table1].Fieldname1=[Database1].[dbo].[Table2].Fieldname2"
Can somebody please help me with what is wrong in the above query?
I found the answer myself: there should not be any additional whitespace characters in the query; the entire query should be on one continuous line, as shown below. The beauty of Blue Prism is that it can execute complex queries of any level without constraints, but you need to adjust the syntax accordingly. Always mention the database, table, and field names in the following format: [databasename].[dbo].[tablename].[fieldname]
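For example, the query from the question collapses to this single line (the same query, just with the line breaks removed):
"select top 10 * from [Database1].[dbo].[Table1] join [Database1].[dbo].[Table2] on [Database1].[dbo].[Table2].Fieldname1=[Database1].[dbo].[Table1].Fieldname2 join [Database2].[dbo].[Table1] on [Database2].[dbo].[Table1].Fieldname1=[Database1].[dbo].[Table2].Fieldname2"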

Parameterized SQL in Presto on Presto CLI

Is there any option to provide parameters on the Presto CLI?
I am trying to convert my impala-shell command to Presto, where my HQL file gets parameters from the Impala command line like below:
impala-shell -f ot_full.hql --var=date_next=${date_next_para} --var=yrmth=${yrmth_para} --var=yrmth_L12=${yrmth_L12_para} --var=pyrmth=${pyrmth_para}
The WITH clause in Presto is not much help here.
How can I convert this to the Presto command line?
I did not find any documentation/example on this in https://prestodb.io/docs/current/
The Presto CLI doesn't support this, so you'll need to substitute the variables into the SQL query before passing it to the CLI. One way is to do this directly in the shell:
presto --execute "SELECT * FROM table WHERE ds >= '${date_next_para}'"
For longer queries, using a here document is a good option.
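A minimal sketch of that approach, reusing the variable names from the impala-shell command above (the query body and the temp-file path are placeholders; the Presto CLI's -f/--file flag takes a file of SQL statements):
cat > /tmp/ot_full.sql <<EOF
SELECT *
FROM some_table
WHERE ds >= '${date_next_para}'
  AND yrmth = ${yrmth_para}
EOF
presto -f /tmp/ot_full.sql
The unquoted EOF delimiter lets the shell expand the variables before Presto ever sees the query.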

Spark SQL version of EXEC()

Does anyone know of a way in Spark SQL to execute a string variable like the following?
INSERT TableA (Col1,Col2) SELECT Col1,Col2 FROM TableB
I understand that I can obviously write this statement directly. However, I am using a workflow engine where my INSERT/SELECT statement is in a string variable. If not, I assume I should use spark-submit. I was looking for other options.
I'm not sure what environment you're in, but if this is a Spark application or the Spark shell, you always provide queries as strings:
val query = "INSERT TableA (Col1,Col2) SELECT Col1,Col2 FROM TableB"
sqlContext.sql(query)
(See http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically.)
Spark SQL also supports Hive queries, for example:
insert overwrite table usautomobiles select * from sourcedata
