spark update a column with other dataframe column


How do I write this update in Spark Scala and in Spark SQL?
update a
set
a.value='1',
a.name=b.old-name
from tbl1 a , tbl2 b
where a.f=b.f

Create temp views as shown below:
df.createOrReplaceTempView("tbl1")
df.createOrReplaceTempView("tbl2")
Then use spark.sql to run any SQL query:
df = spark.sql("update a set a.value='1', a.name=b.old-name from tbl1 a , tbl2 b where a.f=b.f")

First, create a new DataFrame with the updated data:
val joinDf = spark.sql("select a.<rest-fields>, '1' as value, b.`old-name` as name from tbl1 a inner join tbl2 b on a.f = b.f")
If you are not using a Delta table, you need to overwrite the table:
joinDf.write.format("csv").mode(SaveMode.Overwrite).saveAsTable("a")
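The join-based rewrite above can be sketched in PySpark as follows. This is a minimal sketch under stated assumptions, not the answerer's code: the helper names `build_update_select` and `updated_rows` are made up, `keep_columns` stands in for the `<rest-fields>` placeholder, and `tbl2.f` is assumed to be unique. It uses a left join with CASE expressions so that rows of tbl1 without a match in tbl2 are kept unchanged, as an UPDATE would leave them, whereas an inner join drops them.

```python
def build_update_select(keep_columns):
    """Build a SELECT that emulates the UPDATE ... FROM in the question.

    keep_columns: tbl1 columns carried over unchanged (hypothetical helper;
    it stands in for the <rest-fields> placeholder).
    """
    kept = ", ".join("a.`%s`" % c for c in keep_columns)
    return (
        "select %s, "
        # Only rows with a match in tbl2 get the new values.
        "case when b.f is not null then '1' else a.value end as value, "
        "case when b.f is not null then b.`old-name` else a.name end as name "
        "from tbl1 a left join tbl2 b on a.f = b.f" % kept
    )


def updated_rows(spark, keep_columns):
    # spark: an existing SparkSession with tbl1 and tbl2 registered as views.
    return spark.sql(build_update_select(keep_columns))


print(build_update_select(["f"]))
```

The resulting DataFrame still has to be written back, as in the answer above, because temp views cannot be updated in place.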

Related

How to run an insert statement in Spark SQL to insert into a timestamp column?

I created the table below with Spark SQL and inserted a value using spark.sql:
create_table=""" create table tbl1 (tran int,count int) partitioned by (year string) """
spark.sql(create_table)
insert_query="insert into tbl1 partition(year='2022') values (101,500)"
spark.sql(insert_query)
Now I want to insert values into a timestamp column using Spark SQL:
create_table="create table tbl2 (tran int,trandate timestamp) partitioned by (year string)"
spark.sql(create_table)
But the insert statement below does not work and throws an error:
insert_query="insert into tbl2 partition(year='2022') values (101,to_timestamp('2019-06-13 13:22:30.521000000', 'yyyy-mm-dd hh24:mi:ss.ff'))"
spark.sql(insert_query)
How can I insert a timestamp value into the table using Spark SQL? Please help.

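Try the following, which uses a Java-style datetime pattern (yyyy-MM-dd HH:mm:ss.SSS) instead of the Oracle-style one: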
create_table="create table tbl5 (tran int,trandate timestamp) partitioned by (year string)"
spark.sql(create_table)
insert_query="insert into tbl5 partition(year='2022') values (101,cast(date_format('2019-06-13 13:22:30.521000000', 'yyyy-MM-dd HH:mm:ss.SSS') as timestamp))"
spark.sql(insert_query)
spark.sql("select * from tbl5").show(100,False)
+----+-----------------------+----+
|tran|trandate               |year|
+----+-----------------------+----+
|101 |2019-06-13 13:22:30.521|2022|
+----+-----------------------+----+
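Spark datetime patterns use Java-style letters (MM for month, mm for minutes, HH for the 24-hour clock, SSS for fractional seconds), so Oracle-style tokens such as hh24, mi and ff are not recognised. A simpler variant of the insert above is a plain CAST of a millisecond-precision string. The sketch below is an illustration only: `timestamp_insert_sql` is a hypothetical helper, and the text is checked in Python before the statement is built.

```python
from datetime import datetime


def timestamp_insert_sql(table, year, tran, ts_text):
    """Build an INSERT that casts a 'yyyy-MM-dd HH:mm:ss.SSS' string to timestamp."""
    # Validate the text in Python first; %f accepts 1 to 6 fractional digits.
    datetime.strptime(ts_text, "%Y-%m-%d %H:%M:%S.%f")
    return (
        "insert into %s partition(year='%s') "
        "values (%d, cast('%s' as timestamp))" % (table, year, tran, ts_text)
    )


query = timestamp_insert_sql("tbl5", "2022", 101, "2019-06-13 13:22:30.521")
print(query)
# With a SparkSession named spark, as in the answer above: spark.sql(query)
```

A malformed timestamp raises ValueError in Python before any statement reaches Spark.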

Pyspark sql to create hive partitioned table

I am trying to create a Hive partitioned table from a PySpark DataFrame using Spark SQL. Below is the command I am executing, followed by the error I get.
df.createOrReplaceTempView("df_view")
spark.sql("create table if not exists tablename PARTITION (date) AS select * from df_view")
Error: pyspark.sql.utils.ParseException:u"\nmismatched input 'PARTITION' expecting <EOF>
When I run the statement without PARTITION (date), it works fine. However, I am unable to create the table with a partition.
How can I create a partitioned table and insert data from a PySpark DataFrame into Hive?

To address this, I created the table first:
spark.sql("create table if not exists table_name (name STRING,age INT) partitioned by (date_column STRING)")
Then I set the dynamic partition mode to nonstrict and ran the insert:
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("insert into table table_name PARTITION (date_column) select *, '%s' from df_view" % current_date)
Here current_date is a variable holding today's date.
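The three statements above can be wrapped in one function. This is a minimal sketch: the function name `insert_with_partition` and its parameters are made up for illustration, and `spark` is assumed to be an existing SparkSession with Hive support.

```python
from datetime import date


def insert_with_partition(spark, view="df_view", table="table_name", day=None):
    """Append all rows of `view` to `table`, using `day` as the partition value."""
    # Default to today's date in ISO form (yyyy-MM-dd).
    day = day or date.today().isoformat()
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
    # The partition value must be the last expression in the SELECT list.
    query = (
        "insert into table %s PARTITION (date_column) "
        "select *, '%s' from %s" % (table, day, view)
    )
    spark.sql(query)
    return query
```

Calling `insert_with_partition(spark)` after `df.createOrReplaceTempView("df_view")` writes the rows into the partition for the current day.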

ValidationFailureSemanticException: Partition spec contains non-partition columns

I am trying a simple use case: inserting into a Hive partitioned table on S3. I am running my code in a Zeppelin notebook on EMR; the code is below, along with a screenshot of the command output. I checked the schemas of the Hive table and the DataFrame, and the column names do not differ in case. I am getting the exception shown after the code.
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
System.setProperty("hive.metastore.uris","thrift://datalake-hive-server2.com:9083")
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
spark.sql("""CREATE EXTERNAL TABLE employee_table (Emp_Id STRING, First_Name STRING, Salary STRING) PARTITIONED BY (Month STRING) LOCATION 's3n://dev-emr-jupyter/anup/'
TBLPROPERTIES ("skip.header.line.count"="1") """)
val csv_df = spark.read
.format("csv")
.option("header", "true").load("s3n://dev-emr-jupyter/anup/test_data.csv")
import org.apache.spark.sql.SaveMode
csv_df.registerTempTable("csv")
spark.sql(""" INSERT OVERWRITE TABLE employee_table PARTITION(Month) select Emp_Id, First_Name, Salary, Month from csv""")
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {month=, Month=May} contains non-partition columns;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)

You need to run a command before your insert statement to be able to populate partitions at runtime. By default, the dynamic partition mode is set to strict.
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
Add that line and run again.
Edit 1:
I saw in your attached image that csv_df.show() lists the Salary column last instead of the Month column. Try naming your columns explicitly in the insert statement, like: insert into table_name partition(month) (column1, column2..)..
Florin
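The column-ordering advice can be sketched as a small helper that always puts the partition column last in the SELECT list. The helper names are hypothetical. Hive stores column names in lower case, and the spec in the error message contains both `month` and `Month`, so the sketch also lower-cases the name in the PARTITION clause; whether that resolves the exception is an assumption, not something this thread confirms.

```python
def partition_last(columns, partition_col):
    """Return `columns` with `partition_col` moved to the end (case-insensitive)."""
    wanted = partition_col.lower()
    data_cols = [c for c in columns if c.lower() != wanted]
    part_cols = [c for c in columns if c.lower() == wanted]
    if len(part_cols) != 1:
        raise ValueError("expected exactly one column named %r" % partition_col)
    return data_cols + part_cols


def insert_overwrite_sql(table, view, columns, partition_col):
    """Build an INSERT OVERWRITE whose SELECT ends with the partition column."""
    ordered = partition_last(columns, partition_col)
    return "INSERT OVERWRITE TABLE %s PARTITION(%s) select %s from %s" % (
        table, partition_col.lower(), ", ".join(ordered), view)


# Column order as described for csv_df.show() in the answer: Salary last.
query = insert_overwrite_sql(
    "employee_table", "csv", ["Emp_Id", "First_Name", "Month", "Salary"], "Month")
print(query)
```

In practice `columns` would come from `csv_df.columns`, so the statement follows whatever order the CSV header happens to have.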

Output Hive table is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive

I have an Apache Spark (v2.4.2) DataFrame that I want to insert into a Hive table.
df = spark.sparkContext.parallelize([["c1",21, 3], ["c1",32,4], ["c2",4,40089], ["c2",439,6889]]).toDF(["c", "n", "v"])
df.createOrReplaceTempView("df")
And I created a hive table:
spark.sql("""create table if not exists sample_bucket(n INT, v INT)
partitioned by (c STRING) CLUSTERED BY(n) INTO 3 BUCKETS""")
And then I tried to insert data from dataframe df into sample_bucket table:
spark.sql("INSERT OVERWRITE table SAMPLE_BUCKET PARTITION(c) select n, v, c from df")
Which gives me an error, saying:
Output Hive table `default`.`sample_bucket` is bucketed but Spark currently
does NOT populate bucketed output which is compatible with Hive.;
I tried a couple of approaches that didn't work. One of them is:
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.enforce.bucketing=true")
spark.sql("INSERT OVERWRITE table SAMPLE_BUCKET PARTITION(c) select n, v, c from df cluster by n")
But no luck. Can anyone help me?

Spark (latest release at the time of writing: 2.4.5) does not fully support Hive bucketed tables.
You can read bucketed tables (without any benefit from bucketing) and even insert into them (in which case the buckets are ignored and later Hive reads can behave unpredictably).
See https://issues.apache.org/jira/browse/SPARK-19256
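If Hive-compatible buckets are not required, one alternative is Spark's own bucketing through DataFrameWriter.bucketBy, which only works together with saveAsTable. Note that this is a different technique from the Hive CLUSTERED BY table in the question: the table is bucketed for Spark only, and Hive should not be expected to treat the files as Hive buckets. The function name and the table name `sample_bucket_spark` below are made up for illustration.

```python
def save_spark_bucketed(df, table="sample_bucket_spark", buckets=3):
    """Write df as a Spark-managed table partitioned by c and bucketed by n."""
    (df.write
       .partitionBy("c")          # one directory per value of c
       .bucketBy(buckets, "n")    # Spark-native bucketing, not Hive's
       .sortBy("n")
       .mode("overwrite")
       .saveAsTable(table))
    return table
```

With the `df` from the question, `save_spark_bucketed(df)` replaces the CREATE TABLE and INSERT OVERWRITE pair with a single write.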

Perform a Correlated Scalar SubQuery in Spark Dataframe Java API (spark v2.3.0)

I have read that in spark you can easily do a correlated scalar subquery like so:
select
column1,
(select column2 from table2 where table2.some_key = table1.id)
from table1
What I have not figured out is how to do this in the DataFrame API. The best I can come up with is a join. The problem is that, in my specific case, I am joining with an enum-like lookup table that applies to more than one column.
Below is an example of the DataFrame code.
Dataset<Row> table1 = getTable1FromSomewhere();
Dataset<Row> table2 = getTable2FromSomewhere();
table1
.as("table1")
.join(table2.as("table2"),
col("table1.first_key").equalTo(col("table2.key")), "left")
.join(table2.as("table3"),
col("table1.second_key").equalTo(col("table3.key")), "left")
.select(col("table1.*"),
col("table2.description").as("first_key_description"),
col("table3.description").as("second_key_description"))
.show();
Any help would be greatly appreciated on figuring out how to do this in the DataFrame API.

What I have not figured out is how to do this in the DataFrame API.
That is because there is simply no DataFrame API that can express it directly (without an explicit JOIN). This may change in the future:
https://issues.apache.org/jira/browse/SPARK-23945
https://issues.apache.org/jira/browse/SPARK-18455
Does SparkSQL support subquery?
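Until such an API exists, one workaround is to register the DataFrames as temp views and run the subquery through SQL. The sketch below is in PySpark rather than Java, and the helper names are hypothetical. It builds one correlated scalar subquery per key column against the lookup table; it wraps the looked-up column in max() because Spark expects correlated scalar subqueries to be aggregated.

```python
def lookup_sql(key_columns, base="table1", lookup="table2"):
    """Build one correlated scalar subquery per key column."""
    subqueries = [
        # max() makes the subquery an aggregate that returns a single value.
        "(select max(l.description) from %s l where l.key = t.%s) "
        "as %s_description" % (lookup, k, k)
        for k in key_columns
    ]
    return "select t.*, %s from %s t" % (", ".join(subqueries), base)


def with_descriptions(spark, table1, table2, key_columns):
    # table1, table2: DataFrames as in the question; spark: a SparkSession.
    table1.createOrReplaceTempView("table1")
    table2.createOrReplaceTempView("table2")
    return spark.sql(lookup_sql(key_columns))


print(lookup_sql(["first_key", "second_key"]))
```

The same SQL string can be passed to `spark.sql(...)` from Java, which returns a `Dataset<Row>` like the join version in the question.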
