I can create a Hive table with this query:
CREATE TABLE hbtable(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
I then used this query to insert data into the table, but it does not work:
insert overwrite table hbtable select * from hbtable s where s:hive fiels="value"
How can I insert values into an HBase table through a Hive table?
Follow these steps:
Step 1:
bin/hive --auxpath /hadoop/projects/hive-0.9.0/lib/hive-hbase-handler-0.9.0.jar,/hadoop/projects/hive-0.9.0/lib/hbase-0.92.0.jar,/hadoop/projects/hive-0.9.0/lib/zookeeper-3.3.4.jar,/hadoop/projects/hive-0.9.0/lib/guava-r09.jar -hiveconf hbase.master=localhost:60000
Step 2:
hive> CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
Step 3:
hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM xyz WHERE key=1;
Note: I am running Hive 0.9.0 and HBase 0.94.4 on a single Ubuntu box.
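Step 3 assumes a populated Hive source table named xyz already exists. If it does not, one common pattern is to stage the data in a plain Hive table first; here is a minimal sketch, where the staging table name, delimiter, and input path are assumptions for illustration:
hive> CREATE TABLE staging (key int, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> LOAD DATA LOCAL INPATH '/tmp/kv.tsv' INTO TABLE staging;
hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT key, value FROM staging;
Each row written this way lands in the HBase table xyz under column family cf1, so you can verify it with scan 'xyz' from the HBase shell.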
I created the table below using Spark SQL and inserted a value with spark.sql:
create_table=""" create table tbl1 (tran int,count int) partitioned by (year string) """
spark.sql(create_table)
insert_query="insert into tbl1 partition(year='2022') values (101,500)"
spark.sql(insert_query)
Now I want to insert values into a timestamp column using Spark SQL:
create_table="create table tbl2 (tran int,trandate timestamp) partitioned by (year string)"
spark.sql(create_table)
But the insert statement below does not work and throws an error:
insert_query="insert into tbl2 partition(year='2022') values (101,to_timestamp('2019-06-13 13:22:30.521000000', 'yyyy-mm-dd hh24:mi:ss.ff'))"
spark.sql(insert_query)
How do I insert a timestamp value into a table using Spark SQL? Please help.
Try the following:
create_table="create table tbl5 (tran int,trandate timestamp) partitioned by (year string)"
spark.sql(create_table)
insert_query="insert into tbl5 partition(year='2022') values (101,cast(date_format('2019-06-13 13:22:30.521000000', 'yyyy-MM-dd HH:mm:ss.SSS') as timestamp))"
spark.sql(insert_query)
spark.sql("select * from tbl5").show(100,False)
+----+-----------------------+----+
|tran|trandate |year|
+----+-----------------------+----+
|101 |2019-06-13 13:22:30.521|2022|
+----+-----------------------+----+
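An alternative that avoids the date_format/cast pair is to_timestamp with a Java-style pattern; Spark expects SimpleDateFormat/DateTimeFormatter patterns such as yyyy-MM-dd HH:mm:ss.SSS, not Oracle-style hh24:mi:ss.ff, which is why the original insert failed. A sketch, with the fractional seconds trimmed to three digits to match SSS:
insert_query = (
    "insert into tbl5 partition(year='2022') "
    "values (101, to_timestamp('2019-06-13 13:22:30.521', 'yyyy-MM-dd HH:mm:ss.SSS'))"
)
spark.sql(insert_query)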
I am trying to create a Hive partitioned table from a PySpark dataframe using Spark SQL. Below is the command I am executing, and the resulting error:
df.createOrReplaceTempView(df_view)
spark.sql("create table if not exists tablename PARTITION (date) AS select * from df_view")
Error: pyspark.sql.utils.ParseException:u"\nmismatched input 'PARTITION' expecting <EOF>
When I run it without PARTITION (date), it works fine. However, I am unable to create the table with a partition.
How do I create a partitioned table and insert data from a PySpark dataframe into Hive?
To address this, I created the table first:
spark.sql("create table if not exists table_name (name STRING,age INT) partitioned by (date_column STRING)")
Then I set dynamic partitioning to nonstrict:
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("insert into table table_name PARTITION (date_column) select *,'%s from df_view" % current_date))
Where current date is a variable with today's date.
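Putting the pieces together, a minimal end-to-end sketch; the table and view names follow the snippets above, and defining current_date via datetime is my assumption:
from datetime import date

current_date = date.today().isoformat()  # assumption: today's date as 'YYYY-MM-DD'

df.createOrReplaceTempView("df_view")
spark.sql("create table if not exists table_name (name STRING, age INT) "
          "partitioned by (date_column STRING)")
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("insert into table table_name PARTITION (date_column) "
          "select *, '%s' from df_view" % current_date)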
I created a database and a table (table1) using SQL syntax and executed them with spark.sql:
spark.sql("CREATE TABLE table1...");
I also loaded a CSV file into a dataframe using:
Dataset<Row> firstDF = spark.read().format("csv").load("C:/file.csv");
Now I use the following code to populate the existing table with the CSV data:
firstDF.toDF().writeTo("table1").append();
But when I select all from table1:
Dataset<Row> firstDFRes = spark.sql("SELECT * FROM table1");
firstDFRes.show();
I get an empty result (only the table's schema, no data).
My question is: how do I populate an existing SQL table with a dataframe?
PS: using DataFrameWriter's insertInto or saveAsTable creates the table from the CSV data and ignores the schema of the SQL-created table.
Thank you.
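One hedged possibility, not a confirmed fix from this thread: a CSV loaded without a schema comes back as all-string columns named _c0, _c1, ..., which will not line up with the table created in SQL. A minimal sketch that reads the file against an explicit schema and then inserts by position with DataFrameWriter.insertInto, which targets an existing table rather than creating one (the column names and types here are hypothetical):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical schema for illustration; it must match table1's columns.
Dataset<Row> csvDF = spark.read()
        .format("csv")
        .schema("id INT, name STRING")
        .load("C:/file.csv");

// insertInto writes into an existing table, matching columns by position.
csvDF.write().insertInto("table1");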
I am using Spark 2.2.1 and Hive 2.1. I am trying to insert overwrite multiple partitions into an existing partitioned Hive/Parquet table.
The table was created using sparkSession.
I have a table 'mytable' with partition columns P1 and P2.
I have the following set on the sparkSession object:
"hive.exec.dynamic.partition"=true
"hive.exec.dynamic.partition.mode"="nonstrict"
Code:
val df = spark.read.csv(pathToNewData)
df.createOrReplaceTempView("updateTable") // 'df' may contain data from multiple partitions, i.e. multiple values for P1 and P2.
spark.sql("insert overwrite table mytable PARTITION(P1, P2) select c1, c2,..cn, P1, P2 from updateTable") // I made sure that partition columns P1 and P2 are at the end of projection list.
I am getting following error:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {p1=, p2=, P1=1085, P2=164590861} contains non-partition columns;
The dataframe 'df' has records for P1=1085, P2=164590861. It looks like an issue with casing (lower vs. upper). I tried both cases in my query, but it still does not work.
EDIT:
The insert statement works with static partitioning, but that is not what I am looking for; e.g., the following works:
spark.sql("insert overwrite table mytable PARTITION(P1=1085, P2=164590861) select c1, c2,..cn, P1, P2 from updateTable where P1=1085 and P2=164590861")
Create table statement:
CREATE TABLE `my_table`(
`c1` int,
`c2` int,
`c3` string,
`p1` int,
`p2` int)
PARTITIONED BY (
`p1` int,
`p2` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'maprfs:/mds/hive/warehouse/my.db/xc_bonus'
TBLPROPERTIES (
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{.spark struct metadata here.......}',
'spark.sql.sources.schema.partCol.0'='P1', // Spark stores the partition names in upper case, while Hive uses lowercase
'spark.sql.sources.schema.partCol.1'='P2',
'transient_lastDdlTime'='1533665272')
Above, spark.sql.sources.schema.partCol.0 uses all uppercase, while the PARTITIONED BY clause uses all lowercase for the partition columns.
Based on the exception, and assuming the table 'mytable' was created as a partitioned table with P1 and P2 as partitions, one way to overcome the exception is to force a dummy partition manually before executing the command. Try doing
spark.sql("alter table mytable add partition (p1=default, p2=default)")
Once successful, execute your insert overwrite statement. Hope this helps.
As I mentioned in the EDIT section, the issue was in fact a difference in partition-column casing (lower vs. upper) between Hive and Spark. I created the Hive table with all upper case, but Hive still stored the names internally as lowercase, while the Spark metadata kept them upper case as I intended. Recreating the table with all-lowercase partition columns fixed the issue with subsequent updates.
If you are using Hive 2.1 and Spark 2.2, make sure the following properties in the create statement use the same casing:
PARTITIONED BY (
`p1` int,
`p2` int)
'spark.sql.sources.schema.partCol.0'='p1',
'spark.sql.sources.schema.partCol.1'='p2',
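To see how the names were actually recorded, you can compare what Spark reports for the table; a quick sketch, with the table name assumed:
# Spark's view of the table, including the partCol entries in TBLPROPERTIES:
spark.sql("show create table my_table").show(truncate=False)
# The partition columns as Spark resolves them:
spark.sql("describe my_table").show(100, False)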
I have loaded a table from an input file.
CREATE TABLE MyTable (
ID INT,
VALUE FLOAT,
RATE INT
...
);
LOAD DATA LOCAL INPATH 'MYPATH' INTO TABLE MyTable;
Now I'd like to create a new table based on this one:
DerivedTable =
SELECT ID, VALUE*RATE AS Total
FROM MyTable
WHERE VALUE IS NOT NULL;
Then I'm going to use this table as a source for other tables and for outputs.
What is the correct SQL (or Hive) way to create this "temporary" table? It should work in spark-sql.
PS: I know how to do this in spark-shell, but that is not what I'm looking for.
You can create a temporary view:
CREATE TEMPORARY VIEW DerivedTable AS (
SELECT ID, VALUE*RATE AS Total
FROM MyTable
WHERE VALUE IS NOT NULL);
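The temporary view lives only for the current spark-sql session, which is fine for feeding downstream tables and outputs; for example, a sketch with a hypothetical target table:
CREATE TABLE Totals AS
SELECT ID, Total
FROM DerivedTable
WHERE Total > 0;
If the definition needs to survive across sessions, CREATE VIEW (without TEMPORARY) persists it in the metastore instead.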