I am trying a simple use case of inserting into a Hive partitioned table on S3. I am running my code in a Zeppelin notebook on EMR; below is my code along with a screenshot of the command output. I checked the schema of the Hive table and the dataframe, and there is no case difference in the column names. I am getting the exception mentioned below.
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
System.setProperty("hive.metastore.uris","thrift://datalake-hive-server2.com:9083")
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
spark.sql("""CREATE EXTERNAL TABLE employee_table (Emp_Id STRING, First_Name STRING, Salary STRING) PARTITIONED BY (Month STRING) LOCATION 's3n://dev-emr-jupyter/anup/'
TBLPROPERTIES ("skip.header.line.count"="1") """)
val csv_df = spark.read
.format("csv")
.option("header", "true").load("s3n://dev-emr-jupyter/anup/test_data.csv")
import org.apache.spark.sql.SaveMode
csv_df.registerTempTable("csv")
spark.sql(""" INSERT OVERWRITE TABLE employee_table PARTITION(Month) select Emp_Id, First_Name, Salary, Month from csv""")
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {month=, Month=May} contains non-partition columns;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
You need to set a property before your insert statement in order to populate partitions dynamically at runtime. By default, the dynamic partition mode is set to strict.
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
Try adding that line and running again.
Edit 1:
I saw in your attached image that when you do csv_df.show(), your Salary column comes last instead of the Month column. Try referencing your columns explicitly in the insert statement, like: insert into table_name partition(month) (column1, column2..)..
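For example, a minimal sketch (column names taken from the question, with the select list kept in the table's column order and the partition column last):
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""INSERT OVERWRITE TABLE employee_table PARTITION (Month)
             SELECT Emp_Id, First_Name, Salary, Month FROM csv""")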
Florin
Related
I am trying to create a Hive partitioned table from a PySpark dataframe using Spark SQL. Below is the command I am executing, but I am getting an error. The error message is below.
df.createOrReplaceTempView(df_view)
spark.sql("create table if not exists tablename PARTITION (date) AS select * from df_view")
Error: pyspark.sql.utils.ParseException:u"\nmismatched input 'PARTITION' expecting <EOF>
When I run without PARTITION (date) in the above line it works fine, but I am unable to create the table with the partition.
How do I create a table with a partition and insert the date from a PySpark dataframe into Hive?
To address this I created the table first:
spark.sql("create table if not exists table_name (name STRING,age INT) partitioned by (date_column STRING)")
Then I set dynamic partitioning to nonstrict using the following:
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("insert into table table_name PARTITION (date_column) select *,'%s from df_view" % current_date))
Where current date is a variable with today's date.
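Putting it together, a minimal sketch (current_date is an assumed Python variable holding today's date as a string; table_name, date_column and df_view come from the question):
import datetime

current_date = datetime.date.today().isoformat()  # e.g. '2020-01-31'
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("insert into table table_name PARTITION (date_column) "
          "select *, '%s' as date_column from df_view" % current_date)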
I want to convert a pandas df to a Spark df and save it to Hive.
#create spark df from pandas dataframe
df = self.ss.createDataFrame(dataframe)
df.createOrReplaceTempView("table_Template")
self.ss.sql("create table IF NOT EXISTS database."+ table_name +" STORED AS PARQUET as select * from table_Template")
ERROR:
pyspark.sql.utils.AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Parquet does not support date. See HIVE-6384;
Try the below code to cast all date-type columns to string.
from pyspark.sql import functions as F
df.select([F.col(f.name).cast("string") if f.dataType.typeName() == "date" else F.col(f.name) for f in df.schema.fields]).show()
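A minimal sketch of the full flow under that assumption (self.ss from the question is just a SparkSession, written here as spark; database and table_name are the question's placeholders):
from pyspark.sql import functions as F

# cast every date column to string so the Parquet-backed Hive table accepts it
casted = df.select([
    F.col(f.name).cast("string") if f.dataType.typeName() == "date" else F.col(f.name)
    for f in df.schema.fields
])
casted.createOrReplaceTempView("table_Template")
spark.sql("create table IF NOT EXISTS database." + table_name + " STORED AS PARQUET as select * from table_Template")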
I am learning Spark. I have a dataframe ts with the below structure.
ts.show()
+--------------------+--------------------+
| UTC| PST|
+--------------------+--------------------+
|2020-11-04 02:24:...|2020-11-03 18:24:...|
+--------------------+--------------------+
I need to insert ts into a partitioned Hive table with the below structure:
spark.sql(""" create table db.ts_part
(
UTC timestamp,
PST timestamp
)
PARTITIONED BY( bkup_dt DATE )
STORED AS ORC""")
How do I dynamically pass the system run date in the insert statement so that the table gets partitioned on bkup_dt by date?
I tried something like the code below, but it didn't work:
ts.write.partitionBy(current_date()).insertInto("db.ts_part",overwrite=False)
How should I do it? Can someone please help!
Try creating a new column with current_date() and then writing the dataframe into the partitioned Hive table.
Example:
df.\
withColumn("bkup_dt",current_date()).\
write.\
partitionBy("bkup_dt").\
insertInto("db.ts_part",overwrite=False)
UPDATE:
Try creating a temp view and then running an insert statement.
df.createOrReplaceTempView("tmp")
sql("insert into table <table_name> partition (bkup_dt) select *,current_date bkup_dt from tmp")
I have an Apache Spark (v2.4.2) dataframe that I want to insert into a Hive table.
df = spark.sparkContext.parallelize([["c1",21, 3], ["c1",32,4], ["c2",4,40089], ["c2",439,6889]]).toDF(["c", "n", "v"])
df.createOrReplaceTempView("df")
And I created a hive table:
spark.sql("create table if not exists sample_bucket(n INT, v INT)
partitioned by (c STRING) CLUSTERED BY(n) INTO 3 BUCKETS")
And then I tried to insert data from dataframe df into sample_bucket table:
spark.sql("INSERT OVERWRITE table SAMPLE_BUCKET PARTITION(c) select n, v, c from df")
Which gives me an error, saying:
Output Hive table `default`.`sample_bucket` is bucketed but Spark currently
does NOT populate bucketed output which is compatible with Hive.;
I tried a couple of ways which didn't work; one of them is:
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.enforce.bucketing=true")
spark.sql("INSERT OVERWRITE table SAMPLE_BUCKET PARTITION(c) select n, v, c from df cluster by n")
But no luck. Can anyone help me?
Spark (currently at 2.4.5) does not fully support Hive bucketed tables.
You can read bucketed tables (without any bucketing effect) and even insert into them (in which case the buckets are ignored and further Hive reads can have unpredictable behaviour).
See https://issues.apache.org/jira/browse/SPARK-19256
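If Hive-compatible buckets are not a hard requirement, one workaround is to let Spark manage its own (non-Hive-compatible) bucketing through saveAsTable. A minimal sketch, assuming the df from the question and a hypothetical table name:
# Spark-managed buckets; Hive will not recognize this table as bucketed
df.write.\
    mode("overwrite").\
    partitionBy("c").\
    bucketBy(3, "n").\
    sortBy("n").\
    saveAsTable("sample_bucket_spark")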
I am using Spark 2.2.1 and Hive 2.1. I am trying to insert overwrite multiple partitions into an existing partitioned Hive/Parquet table.
The table was created using sparkSession.
I have a table 'mytable' with partitions P1 and P2.
I have the following set on the sparkSession object:
"hive.exec.dynamic.partition"=true
"hive.exec.dynamic.partition.mode"="nonstrict"
Code:
val df = spark.read.csv(pathToNewData)
df.createOrReplaceTempView("updateTable") //here 'df' may contain data from multiple partitions, i.e. multiple values for P1 and P2
spark.sql("insert overwrite table mytable PARTITION(P1, P2) select c1, c2,..cn, P1, P2 from updateTable") // I made sure that partition columns P1 and P2 are at the end of projection list.
I am getting the following error:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {p1=, p2=, P1=1085, P2=164590861} contains non-partition columns;
Dataframe 'df' has records for P1=1085, P2=164590861. It looks like an issue with casing (lower vs upper). I tried both cases in my query but it still does not work.
EDIT:
The insert statement works with static partitioning, but that is not what I am looking for.
For example, the following works:
spark.sql("insert overwrite table mytable PARTITION(P1=1085, P2=164590861) select c1, c2,..cn, P1, P2 from updateTable where P1=1085 and P2=164590861")
Create table statement:
CREATE TABLE `my_table`(
`c1` int,
`c2` int,
`c3` string,
`p1` int,
`p2` int)
PARTITIONED BY (
`p1` int,
`p2` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'maprfs:/mds/hive/warehouse/my.db/xc_bonus'
TBLPROPERTIES (
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{.spark struct metadata here.......}',
'spark.sql.sources.schema.partCol.0'='P1', //Spark is using Capital Names for Partitions; while hive is using lowercase
'spark.sql.sources.schema.partCol.1'='P2',
'transient_lastDdlTime'='1533665272')
In the above, spark.sql.sources.schema.partCol.0 uses all uppercase while the PARTITIONED BY clause uses all lowercase for the partition columns.
Based on the exception, and also assuming that the table 'mytable' was created as a partitioned table with P1 and P2 as partitions, one way to overcome this exception would be to force a dummy partition manually before executing the command. Try doing:
spark.sql("alter table mytable add partition (p1=default, p2=default)")
Once successful, execute your insert overwrite statement. Hope this helps?
As I mentioned in the EDIT section, the issue was in fact a difference in partition column casing (lower vs upper) between Hive and Spark! I created the Hive table with all-uppercase names, but Hive still stored them internally as lowercase, while the Spark metadata kept them uppercase as I intended. Fixing the create statement to use all-lowercase partition columns fixed the issue for subsequent updates!
If you are using Hive 2.1 and Spark 2.2, make sure the following properties in the create statement have the same casing.
PARTITIONED BY (
p1 int,
p2 int)
'spark.sql.sources.schema.partCol.0'='p1',
'spark.sql.sources.schema.partCol.1'='p2',
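For example, a minimal sketch of a consistent all-lowercase definition (columns trimmed from the question's DDL) followed by the dynamic-partition insert:
spark.sql("""CREATE TABLE my_table (c1 INT, c2 INT, c3 STRING)
             PARTITIONED BY (p1 INT, p2 INT)
             STORED AS PARQUET""")
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("insert overwrite table my_table partition (p1, p2) select c1, c2, c3, p1, p2 from updateTable")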