I need to set a custom property in one of my Hive tables using pySpark.
Normally, I would just run this command in any Hive interface to do it:
ALTER TABLE table_name SET TBLPROPERTIES ('key1'='value1');
But the question is, can I accomplish the same within a pySpark script?
Thanks!
Well, that was actually easy... it can be set using sqlContext in pySpark:
sqlContext.sql("ALTER TABLE table_name SET TBLPROPERTIES('key1' = 'value1')")
It will return an empty DataFrame: DataFrame[]
But the property is actually present on the target table. It can be similarly retrieved using:
sqlContext.sql("SHOW TBLPROPERTIES table_name('key1')").collect()[0].asDict()
{'value': u'value1'}
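If you need this in more than one place, the two statements can be wrapped in small helpers. This is only a sketch: the function names are made up for illustration, `spark` stands for whatever session object you already have (sqlContext or a SparkSession with Hive support), and quoting is not handled beyond rejecting single quotes.

```python
def set_tblproperty_sql(table, key, value):
    """Build an ALTER TABLE ... SET TBLPROPERTIES statement for one key/value pair."""
    for part in (key, value):
        if "'" in part:
            # Escaping is out of scope for this sketch, so refuse such input.
            raise ValueError("single quotes are not handled in this sketch")
    return "ALTER TABLE {} SET TBLPROPERTIES ('{}' = '{}')".format(table, key, value)


def show_tblproperty_sql(table, key):
    """Build a SHOW TBLPROPERTIES statement for a single key."""
    return "SHOW TBLPROPERTIES {} ('{}')".format(table, key)


def set_and_get_property(spark, table, key, value):
    """Set the property, then read it back through the same session object."""
    spark.sql(set_tblproperty_sql(table, key, value))
    rows = spark.sql(show_tblproperty_sql(table, key)).collect()
    # Assumes the result has a 'value' column, as in the output shown above.
    return rows[0].asDict()["value"]
```

Usage would be something like `set_and_get_property(sqlContext, "table_name", "key1", "value1")`.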
I need to rename a column in Hive, so I ran the query below. After the change, I could see the result in Hive for "select columnnm1 from tablename".
alter table tablename change columnnm1 columnnm2 string;
But when I run select columnnm2 through spark.sql, I get NULL values, whereas I can see valid values in Hive.
It's a managed table. I tried refreshing the Spark metadata, but still no luck. For now, dropping the old table and creating a new Hive table with the correct column name works. But how should I handle this ALTER scenario?
spark.catalog.refreshTable("<tablename>")
Thank you.
The data is from a Hive table, to be more precise
The first table has the properties
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
This table should be converted to Parquet and have the properties
Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
The following Scala Spark code is executed:
val df = spark.sql("SELECT * FROM table")
df.write.format("parquet").mode("append").saveAsTable("table")
This still results in the unwanted properties:
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Hopefully somebody can help me
You cannot mix different file formats in the same table, nor can you change the file format of a table with data in it. (To be more precise, you can do these things, but neither Hive nor Spark will be able to read the data that is in a format that does not match the metadata.)
You should write the data to a new table, make sure that it matches your expectations, then rename or remove the old table and finally rename the new table to the old name. For example:
CREATE TABLE new_table STORED AS PARQUET AS SELECT * FROM orig_table;
ALTER TABLE orig_table RENAME TO orig_table_backup;
ALTER TABLE new_table RENAME TO orig_table;
You can execute these SQL statements in a Hive session directly or from Spark using spark.sql(...) statements (one by one).
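From PySpark, running them one by one could look like the sketch below. The helper names and the `backup_table` argument are my own, and `spark` is assumed to be an existing SparkSession with Hive support.

```python
def swap_table_statements(orig_table, new_table, backup_table):
    """Return the three statements from above, in the order they must run."""
    return [
        "CREATE TABLE {} STORED AS PARQUET AS SELECT * FROM {}".format(new_table, orig_table),
        "ALTER TABLE {} RENAME TO {}".format(orig_table, backup_table),
        "ALTER TABLE {} RENAME TO {}".format(new_table, orig_table),
    ]


def run_one_by_one(spark, statements):
    """Issue each statement in its own spark.sql(...) call."""
    for statement in statements:
        spark.sql(statement)
```

In practice you would check the new table between the first and second statement, before renaming anything.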
I have a Parquet table that contains a column with newline characters in its data. When a Hive query is run on this table, the newlines are treated as new records. I can overcome this in Hive by setting the parameter "set hive.query.result.fileformat=SequenceFile;". Now I am migrating this parameter and the MR query to run in Spark SQL. I also want to run some other statements, such as a drop table statement, before the actual query.
My code looks like this:
spark.sql("set hive.query.result.fileformat=SequenceFile;drop table output_table; create table output_table stored as orc as select * from source_table;")
With this query I get a parser error at the position of the semicolon (;). How can I properly execute the above code in Spark SQL?
You shouldn't have semicolons in the code. Remove the semicolons, run each statement in its own spark.sql(...) call, and pass the set parameter through the Spark config instead. Then it should work.
Example:
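from pyspark.sql import SparkSession

# Pass the Hive parameter as session config instead of a SET statement.
spark = (SparkSession
         .builder
         .appName('hive_validation_test')
         .enableHiveSupport()
         .config("hive.query.result.fileformat", "SequenceFile")
         .getOrCreate())

# One statement per spark.sql call, without trailing semicolons.
spark.sql('drop table output_table')
spark.sql('create table output_table stored as orc as select * from source_table')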
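If the statements arrive as one semicolon-separated script, a naive way to feed them to spark.sql one at a time is sketched below. The helper names are hypothetical, and the split assumes no semicolon appears inside a string literal or comment.

```python
def split_statements(script):
    """Split a script on ';' and drop empty pieces (naive: ignores quoting)."""
    return [piece.strip() for piece in script.split(";") if piece.strip()]


def run_script(spark, script):
    """Run each statement of the script in its own spark.sql(...) call."""
    for statement in split_statements(script):
        spark.sql(statement)
```

For the script in the question, `split_statements` yields the set statement, the drop and the create as three separate strings.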
I had a managed hive table and moved it to a different database using the following command:
alter table table_name rename to new_db.table_name
The table was successfully moved and all the data is under the new database now. The table shows up fine in Hive. However, when I try to read the table from Spark, it can read the schema but there is no content in it. That is, the count returns zero! What has happened? How can I fix this issue?
I am loading it in Spark using the following code:
val t = sqlContext.table("new_db.table_name")
Sometimes just altering the name isn't enough; I also had to alter the location.
sqlContext.sql("""
ALTER TABLE new_table
SET LOCATION "hdfs://.../new_location"
""")
And refresh the table in Spark for good measure:
sqlContext.sql("""
REFRESH TABLE new_table
""")
You can double-check that the location is correct with describe formatted new_table.
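Put together in PySpark, the steps could look roughly like this. It is a sketch, not a tested recipe: the helper names are invented, and it assumes DESCRIBE FORMATTED returns rows with `col_name` and `data_type` columns where the location row is labelled "Location" (some versions print "Location:", so the colon is stripped).

```python
def relocation_statements(table, location):
    """Build the SET LOCATION and REFRESH TABLE statements shown above."""
    if '"' in location:
        raise ValueError("double quotes in the location are not handled here")
    return [
        'ALTER TABLE {} SET LOCATION "{}"'.format(table, location),
        "REFRESH TABLE {}".format(table),
    ]


def find_location(described_rows):
    """Pick the location out of collected DESCRIBE FORMATTED rows, or None."""
    for row in described_rows:
        name = (row["col_name"] or "").strip().rstrip(":")
        if name == "Location":
            return (row["data_type"] or "").strip()
    return None


def relocate_and_check(spark, table, location):
    """Run both statements, then report the location the metastore now holds."""
    for statement in relocation_statements(table, location):
        spark.sql(statement)
    rows = spark.sql("DESCRIBE FORMATTED {}".format(table)).collect()
    return find_location(rows)
```

If the returned value is not the path you passed in, the table is still pointing at the old directory.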
I have found one way to save a DataFrame as an external table using the Parquet file format. Is there some other way to save DataFrames directly as an external table in Hive, like saveAsTable does for managed tables?
You can do it this way:
df.write.format("ORC").options(Map("path"-> "yourpath")) saveAsTable "anubhav"
In PySpark, an external table can be created as below:
df.write.option('path','<External Table Path>').saveAsTable('<Table Name>')
For an external table, don't use saveAsTable. Instead, save the data at the location of the external table specified by path. Then add a partition so that it is registered in the Hive metadata. This will allow you to query by partition in Hive later.
// hc is HiveContext, df is DataFrame.
df.write.mode(SaveMode.Overwrite).parquet(path)
val sql =
s"""
|alter table $targetTable
|add if not exists partition
|(year=$year,month=$month)
|location "$path"
""".stripMargin
hc.sql(sql)
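A PySpark version of the same two steps might look like the sketch below. The function names are hypothetical, the partition columns `year` and `month` are taken from the snippet above, and `spark` and `df` are assumed to be an existing Hive-enabled session and DataFrame.

```python
def add_partition_sql(target_table, year, month, path):
    """Build the ALTER TABLE ... ADD PARTITION statement for one partition."""
    return (
        "ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
        '(year={year}, month={month}) LOCATION "{path}"'
    ).format(table=target_table, year=year, month=month, path=path)


def write_and_register(spark, df, target_table, year, month, path):
    """Write the files to the partition path, then register the partition."""
    df.write.mode("overwrite").parquet(path)
    spark.sql(add_partition_sql(target_table, year, month, path))
```

The external table itself has to exist already, with `year` and `month` declared as its partition columns.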
You can also save the DataFrame by creating the table manually:
dataframe.registerTempTable("temp_table");
hiveSqlContext.sql("create external table if not exists table_name as select * from temp_table");
The following link has a good explanation of create table: https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html