How to run an insert statement in Spark SQL to insert into a timestamp column? - apache-spark

I created the table below using Spark SQL and inserted a value with spark.sql:
create_table=""" create table tbl1 (tran int,count int) partitioned by (year string) """
spark.sql(create_table)
insert_query="insert into tbl1 partition(year='2022') values (101,500)"
spark.sql(insert_query)
But I want to insert values into a timestamp column using Spark SQL:
create_table="create table tbl2 (tran int,trandate timestamp) partitioned by (year string)"
spark.sql(create_table)
But the insert statement below does not work and throws an error:
insert_query="insert into tbl2 partition(year='2022') values (101,to_timestamp('2019-06-13 13:22:30.521000000', 'yyyy-mm-dd hh24:mi:ss.ff'))"
spark.sql(insert_query)
How do I insert a timestamp value into a table using Spark SQL? Please help.

Try below:
create_table="create table tbl5 (tran int,trandate timestamp) partitioned by (year string)"
spark.sql(create_table)
insert_query="insert into tbl5 partition(year='2022') values (101,cast(date_format('2019-06-13 13:22:30.521000000', 'yyyy-MM-dd HH:mm:ss.SSS') as timestamp))"
spark.sql(insert_query)
spark.sql("select * from tbl5").show(100,False)
+----+-----------------------+----+
|tran|trandate               |year|
+----+-----------------------+----+
|101 |2019-06-13 13:22:30.521|2022|
+----+-----------------------+----+
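The pattern in the question ('yyyy-mm-dd hh24:mi:ss.ff') uses Oracle-style tokens, while Spark expects Java-style pattern letters (MM for month, HH for 24-hour, mm for minutes, SSS for fractional seconds), which is probably why to_timestamp failed. A minimal sketch of a simpler route, assuming the value is already available as a Python datetime: format it as a 'yyyy-MM-dd HH:mm:ss.ffffff' string and cast that string to timestamp. The helper name timestamp_insert_sql is hypothetical, and the spark.sql call is left commented because it needs an active SparkSession.

```python
from datetime import datetime


def timestamp_insert_sql(table, year, tran, ts):
    """Build an INSERT that passes `ts` as a string cast to TIMESTAMP."""
    # Format the datetime as 'yyyy-MM-dd HH:mm:ss.ffffff' (microsecond precision).
    literal = ts.strftime("%Y-%m-%d %H:%M:%S.%f")
    return (
        f"insert into {table} partition(year='{year}') "
        f"values ({tran}, cast('{literal}' as timestamp))"
    )


insert_query = timestamp_insert_sql(
    "tbl2", "2022", 101, datetime(2019, 6, 13, 13, 22, 30, 521000)
)
print(insert_query)
# With an active SparkSession named `spark` (as in the question):
# spark.sql(insert_query)
```

Building SQL by string formatting is only reasonable for trusted values like these; it is not safe for user-supplied input.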

Related

Pyspark sql to create hive partitioned table

I am trying to create a Hive partitioned table from a PySpark dataframe using Spark SQL. Below is the command I am executing, but I am getting an error. The error message is below.
df.createOrReplaceTempView(df_view)
spark.sql("create table if not exists tablename PARTITION (date) AS select * from df_view")
Error: pyspark.sql.utils.ParseException:u"\nmismatched input 'PARTITION' expecting <EOF>
When I run the line above without PARTITION (date), it works fine. However, I am unable to create the table with a partition.
How do I create a table with a partition and insert data from a PySpark dataframe into Hive?
To address this, I created the table first:
spark.sql("create table if not exists table_name (name STRING,age INT) partitioned by (date_column STRING)")
Then I set the dynamic partition mode to nonstrict using the statements below:
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("insert into table table_name PARTITION (date_column) select *, '%s' from df_view" % current_date)
Here current_date is a variable holding today's date.
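As a rough sketch of the dynamic-partition insert described above, the statement can be assembled from today's date in plain Python before it is handed to Spark. The helper name dynamic_partition_insert is hypothetical; the table, view, and column names are the ones used in this answer, and the spark.sql calls are commented out because they need an active SparkSession.

```python
from datetime import date


def dynamic_partition_insert(table, partition_col, view, run_date):
    """Build an INSERT ... SELECT that appends the run date as the partition value."""
    # In a dynamic-partition insert the partition column must come last in the SELECT.
    return (
        f"insert into table {table} partition ({partition_col}) "
        f"select *, '{run_date.isoformat()}' as {partition_col} from {view}"
    )


current_date = date.today()
insert_sql = dynamic_partition_insert("table_name", "date_column", "df_view", current_date)
print(insert_sql)
# With an active SparkSession named `spark`:
# spark.sql("SET hive.exec.dynamic.partition = true")
# spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
# spark.sql(insert_sql)
```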

Spark: update a column with another dataframe's column

How do I write this code in Spark Scala and Spark SQL?
update a
set
a.value='1',
a.name=b.old-name
from tbl1 a , tbl2 b
where a.f=b.f
Create temp views using the code below:
df.createOrReplaceTempView("tbl1")
df.createOrReplaceTempView("tbl2")
Then use spark.sql to run the SQL query:
df = spark.sql("update a set a.value='1', a.name=b.old-name from tbl1 a , tbl2 b where a.f=b.f")
Spark SQL does not support UPDATE on temp views, so first you need to create a new dataframe with the updated data:
val joinDf = spark.sql("select a.<rest-fields>, '1' as value, b.`old-name` as name from tbl1 a inner join tbl2 b on a.f = b.f")
If you are not using a Delta table, you need to replace the table:
joinDf.write.format("csv").mode(SaveMode.Overwrite).saveAsTable("a")
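To make the "update as a join" idea concrete, here is a small sketch that generates the SELECT for an arbitrary column list, replacing only the updated columns. The helper name update_as_select and the column list ["f", "value", "name"] are assumptions for illustration, since the question does not show the full schema. Backticks are used because a name like old-name would otherwise be parsed as a subtraction.

```python
def update_as_select(columns, updates, target="tbl1", source="tbl2", key="f"):
    """Build a SELECT that returns `target` rows with some columns replaced."""
    select_list = []
    for col in columns:
        # Use the replacement expression when one is given, else keep a.<col>.
        expr = updates.get(col, f"a.`{col}`")
        select_list.append(f"{expr} as `{col}`")
    return (
        f"select {', '.join(select_list)} "
        f"from {target} a inner join {source} b on a.`{key}` = b.`{key}`"
    )


query = update_as_select(
    columns=["f", "value", "name"],
    updates={"value": "'1'", "name": "b.`old-name`"},
)
print(query)
# With an active SparkSession named `spark`:
# updated_df = spark.sql(query)
```

Note that an inner join drops rows of tbl1 that have no match in tbl2, whereas the original UPDATE would leave them unchanged; a left join with coalesce may be closer if those rows must be kept.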

Write PySpark dataframe into Partitioned Hive table

I am learning Spark. I have a dataframe ts with the structure below.
ts.show()
+--------------------+--------------------+
|                 UTC|                 PST|
+--------------------+--------------------+
|2020-11-04 02:24:...|2020-11-03 18:24:...|
+--------------------+--------------------+
I need to insert ts into a partitioned table in Hive with the structure below:
spark.sql(""" create table db.ts_part
(
UTC timestamp,
PST timestamp
)
PARTITIONED BY( bkup_dt DATE )
STORED AS ORC""")
How do I dynamically pass the system run date in the insert statement so that the data is partitioned on bkup_dt in the table based on that date?
I tried something like the code below, but it didn't work:
ts.write.partitionBy(current_date()).insertInto("db.ts_part",overwrite=False)
How should I do it? Can someone please help?
Try creating a new column with current_date() and then writing to the partitioned Hive table.
Example:
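from pyspark.sql.functions import current_date

# insertInto() cannot be combined with partitionBy(): the table already defines
# its partitioning. Columns are matched by position, so bkup_dt must come last.
df.\
withColumn("bkup_dt",current_date()).\
write.\
insertInto("db.ts_part",overwrite=False)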
UPDATE:
Try creating a temp view and then running an insert statement:
df.createOrReplaceTempView("tmp")
spark.sql("insert into table <table_name> partition (bkup_dt) select *,current_date bkup_dt from tmp")

ValidationFailureSemanticException: Partition spec contains non-partition columns

I am trying a simple use case of inserting into a Hive partitioned table on S3. I am running my code in a Zeppelin notebook on EMR; below is my code along with the screenshot of the output of the commands. I checked the schema of the Hive table and the dataframe, and there is no case difference in the column names. I am getting the exception shown below.
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
System.setProperty("hive.metastore.uris","thrift://datalake-hive-server2.com:9083")
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
spark.sql("""CREATE EXTERNAL TABLE employee_table (Emp_Id STRING, First_Name STRING, Salary STRING) PARTITIONED BY (Month STRING) LOCATION 's3n://dev-emr-jupyter/anup/'
TBLPROPERTIES ("skip.header.line.count"="1") """)
val csv_df = spark.read
.format("csv")
.option("header", "true").load("s3n://dev-emr-jupyter/anup/test_data.csv")
import org.apache.spark.sql.SaveMode
csv_df.registerTempTable("csv")
spark.sql(""" INSERT OVERWRITE TABLE employee_table PARTITION(Month) select Emp_Id, First_Name, Salary, Month from csv""")
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {month=, Month=May} contains non-partition columns;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
You need to run a command before your insert statement in order to populate a partition at runtime. By default, the dynamic partition mode is set to strict.
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
Try adding that line and running again.
Edit 1:
I saw in your attached image that when you run csv_df.show(), the salary column comes last instead of the month column. Try listing your columns in the insert statement, like: insert into table_name partition(month) (column1, column2..)..
Florin
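The point in Edit 1 is that the insert matches columns by position, so the partition column has to come last. A minimal sketch of reordering a column list that way before building the statement: the helper name select_in_table_order and the CSV column order shown are assumptions for illustration (the screenshot is not available), while the table and view names come from the question.

```python
def select_in_table_order(data_columns, partition_columns):
    """Order columns for a positional INSERT: data columns first, partition columns last."""
    partition_lower = {c.lower() for c in partition_columns}
    ordered = [c for c in data_columns if c.lower() not in partition_lower]
    return ordered + list(partition_columns)


# Hypothetical column order of the dataframe (Month is not last).
csv_columns = ["Emp_Id", "First_Name", "Month", "Salary"]
select_list = select_in_table_order(csv_columns, ["Month"])
insert_sql = (
    "INSERT OVERWRITE TABLE employee_table PARTITION(Month) "
    f"select {', '.join(select_list)} from csv"
)
print(insert_sql)
```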

How can I insert data into HBase through a Hive table?

I can create a Hive table with this query
CREATE TABLE hbtable(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
And I used this query to insert data into the table, but it's not working:
insert overwrite table hbtable select * from hbtable s where s:hive fiels="value"
How can I insert values into an HBase table through a Hive table?
Follow these steps:
Step 1 :
bin/hive --auxpath /hadoop/projects/hive-0.9.0/lib/hive-hbase-handler-0.9.0.jar,/hadoop/projects/hive-0.9.0/lib/hbase-0.92.0.jar,/hadoop/projects/hive-0.9.0/lib/zookeeper-3.3.4.jar,/hadoop/projects/hive-0.9.0/lib/guava-r09.jar -hiveconf hbase.master=localhost:60000
Step 2 :
hive> CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
Step 3 :
hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM xyz WHERE key=1;
Note : I am running hive-0.9.0 and hbase-0.94.4 on a single Ubuntu box.
