I have created a database and a table (table1) with SQL statements that I execute using spark.sql:
spark.sql("CREATE TABLE table1...");
I also loaded the data from a CSV file into a DataFrame using:
Dataset<Row> firstDF = spark.read().format("csv").load("C:/file.csv");
Now I use the following code to populate the existing table with the CSV data:
firstDF.toDF().writeTo("table1").append();
But when I select everything from table1:
Dataset<Row> firstDFRes = spark.sql("SELECT * FROM table1");
firstDFRes.show();
the result is empty (only the schema of the table, with no data).
My question is: how do I populate an existing SQL table from a DataFrame?
PS: using DataFrameWriter's insertInto or saveAsTable will create the table from the CSV data and will ignore the schema of the table created with SQL.
Thank you.
I am trying to create a partitioned Hive table from a PySpark DataFrame using Spark SQL. Below is the command I am executing, followed by the error message I get.
df.createOrReplaceTempView("df_view")
spark.sql("create table if not exists tablename PARTITION (date) AS select * from df_view")
Error: pyspark.sql.utils.ParseException:u"\nmismatched input 'PARTITION' expecting <EOF>
When I try to run without PARTITION (date) in the above line it works fine. However I am unable to create with partition.
How can I create a table with a partition and insert data from a PySpark DataFrame into Hive?
To address this, I created the table first:
spark.sql("create table if not exists table_name (name STRING,age INT) partitioned by (date_column STRING)")
Then I set the dynamic partition mode to nonstrict and ran the insert:
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("insert into table table_name PARTITION (date_column) select *, '%s' from df_view" % current_date)
Here current_date is a variable holding today's date.
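The insert step above can be sketched as a small helper that builds the statement with a properly quoted date literal. This is only an illustration: build_partition_insert is a hypothetical name, not part of Spark, and the table, view, and column names are the ones from the example above. It assumes the partition value is a trusted date string, because it is interpolated directly into the SQL text.

```python
from datetime import date


def build_partition_insert(table, view, partition_col, partition_value):
    """Return an INSERT statement that appends rows under one partition value.

    Hypothetical helper: the names passed in are illustrative only.
    """
    # The literal is wrapped in single quotes so it is parsed as a string,
    # and it is selected last because the partition column comes last.
    return (
        f"INSERT INTO TABLE {table} PARTITION ({partition_col}) "
        f"SELECT *, '{partition_value}' AS {partition_col} FROM {view}"
    )


# Build the statement for today's date in ISO format (YYYY-MM-DD).
stmt = build_partition_insert(
    "table_name", "df_view", "date_column", date.today().isoformat()
)
print(stmt)
```

With a live session, the resulting string would be passed to spark.sql(stmt) after the two SET statements shown earlier.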
I am learning Spark. I have a DataFrame ts with the structure below.
ts.show()
+--------------------+--------------------+
| UTC| PST|
+--------------------+--------------------+
|2020-11-04 02:24:...|2020-11-03 18:24:...|
+--------------------+--------------------+
I need to insert ts into a partitioned Hive table with the structure below:
spark.sql(""" create table db.ts_part
(
UTC timestamp,
PST timestamp
)
PARTITIONED BY( bkup_dt DATE )
STORED AS ORC""")
How do I dynamically pass the system run date in the insert statement so that the data is partitioned on bkup_dt by that date?
I tried the following code, but it didn't work:
ts.write.partitionBy(current_date()).insertInto("db.ts_part",overwrite=False)
How should I do it? Can someone please help?
Try adding a new column with current_date() and then inserting into the partitioned Hive table.
Example:
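from pyspark.sql.functions import current_date

# insertInto() cannot be combined with partitionBy(); the table already defines its partitions.
# Columns are matched by position, so the partition column bkup_dt must come last.
df.\
    withColumn("bkup_dt", current_date()).\
    write.\
    insertInto("db.ts_part", overwrite=False)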
UPDATE:
Alternatively, create a temp view and then run an insert statement:
df.createOrReplaceTempView("tmp")
sql("insert into table <table_name> partition (bkup_dt) select *,current_date bkup_dt from tmp")
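Both approaches above depend on column position rather than column name: insertInto and INSERT ... SELECT * resolve columns by position, and the partition column of a partitioned Hive table comes last. A minimal sketch of reordering a DataFrame's columns to match the target table follows. The helper name order_for_insert is hypothetical, and the case-insensitive matching is an assumption that suits Hive, which stores column names in lower case.

```python
def order_for_insert(df_columns, table_columns):
    """Return the DataFrame column names arranged in the table's column order.

    Hypothetical helper: matching is case-insensitive, and a table column
    with no counterpart in the DataFrame raises ValueError.
    """
    by_lower = {name.lower(): name for name in df_columns}
    ordered = []
    for target in table_columns:
        source = by_lower.get(target.lower())
        if source is None:
            raise ValueError(f"DataFrame has no column for table column {target!r}")
        ordered.append(source)
    return ordered


# The partition column bkup_dt is first in the DataFrame but last in the table.
print(order_for_insert(["bkup_dt", "UTC", "PST"], ["utc", "pst", "bkup_dt"]))
```

With a live session this could be used as df.select(*order_for_insert(df.columns, spark.table("db.ts_part").columns)) before calling insertInto.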
Example Hive table:
id|year
1|1990
New data added to the same table:
id|year
2|2010
But I need the insertion time in a new column, like this:
id|year|updateddate
1|1990|olddatatimestamp
2|2010|updateddatatimestamp
Is this possible with Hive? I am also interested in how to do this in Spark with Scala (especially with a DataFrame or RDD).
Thank you
There are no auto-calculated columns in Hive, so insert the timestamp explicitly using current_timestamp. You also need to add the updateddate column to the table first:
insert into table tablename
select 2 as ID, 2010 as year, current_timestamp as updateddate;
When I create a table using SQL in Spark, for example:
sql('CREATE TABLE example SELECT a, b FROM c')
How can I pull that table into the python namespace (I can't think of a better term) so that I can update it? Let's say I want to replace NaN values in the table like so:
import pyspark.sql.functions as F
table = sql('SELECT * FROM example')
for column in table.columns:
    # keep non-NaN values and replace NaN with null
    table = table.withColumn(column, F.when(~F.isnan(F.col(column)), F.col(column)).otherwise(None))
Does this operation update the original example table created with SQL? If I were to run sql('SELECT * FROM example').show(), would I see the updated results? When the original CREATE TABLE example ... SQL runs, is example automatically added to the Python namespace?
The sql function returns a new DataFrame, so the table is not modified. If you want to write a DataFrame's contents into a table created in Spark, do it like this:
table.write.mode("append").saveAsTable("example")
But what you are doing actually changes the table's schema. In that case, register the DataFrame as a temporary view and create a new table from it:
table.createOrReplaceTempView("mytempTable")
sql("create table example2 as select * from mytempTable");
I can create a Hive table with this query:
CREATE TABLE hbtable(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
I used this query to insert data into the table, but it's not working:
insert overwrite table hbtable select * from hbtable s where s:hive fiels="value"
How can I insert values into an HBase table through a Hive table?
Follow these steps:
Step 1 :
bin/hive --auxpath /hadoop/projects/hive-0.9.0/lib/hive-hbase-handler-0.9.0.jar,/hadoop/projects/hive-0.9.0/lib/hbase-0.92.0.jar,/hadoop/projects/hive-0.9.0/lib/zookeeper-3.3.4.jar,/hadoop/projects/hive-0.9.0/lib/guava-r09.jar -hiveconf hbase.master=localhost:60000
Step 2 :
hive> CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
Step 3 :
hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM xyz WHERE key=1;
Note : I am running hive-0.9.0 and hbase-0.94.4 on a single Ubuntu box.