Altered column in Hive, showing Null value in Spark SQL result - apache-spark

I need to alter one column name in Hive, so I did that with the query below. After altering, I could see the result in Hive for "select columnnm2 from tablename".
alter table tablename change columnnm1 columnnm2 string;
But when I execute "select columnnm2" via spark.sql, I get NULL values, whereas I can see valid values in Hive.
It's a managed table. I tried a Spark metadata refresh, but still no luck. As of now, dropping the old table and creating a new Hive table with the correct column name works, but how should this ALTER scenario be handled?
spark.catalog.refreshTable("<tablename>")
Thank you.
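For reference, the drop-and-recreate workaround mentioned above can be sketched roughly as follows from the Spark shell. This is only a sketch: the staging name tablename_new and the plain CTAS form are assumptions for illustration, not part of the original post, and the storage format clause (e.g. STORED AS PARQUET) depends on your setup.
// Rebuild the table under the new column name instead of ALTERing it in place.
// "tablename_new" is a hypothetical staging name.
spark.sql("create table tablename_new as select columnnm1 as columnnm2 from tablename")
spark.sql("drop table tablename")
spark.sql("alter table tablename_new rename to tablename")
spark.catalog.refreshTable("tablename")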

Related

Spark SQL query issue - SQL with Subquery doesn't seem to retrieve records

I have a Spark SQL query like:
Select * from xTable a Where Exists (filter subquery) AND (a.date IN (Select max(b.date) from xTable b))
Under certain circumstances (when a filter table is not provided), my filter subquery should simply do a Select 1.
Whenever I run this in Impala it returns records; in Hive it complains that only one subquery expression is allowed. However, when I run it as Spark SQL in Spark 2.4, it returns an empty dataframe. Any idea why? What am I doing wrong?
OK, I think I found the reason. It is not related to the query. It seems to be an issue with how the table was created from a CSV file in Hive:
when you select the source (the path to the CSV file in HDFS) and then, under the format options, check the 'Has Header' checkbox,
the table appears to be created fine.
Then, when I execute the following in Hive or Impala:
Select max(date) from xTable
I get the max date back (the date column is a string).
However, when I run the same query via Spark SQL,
I get the string "date" back, i.e. the column header itself.
If I remove the header from the CSV file, import it, and then manually create the headers and types, I do not face this issue.
Seems like some form of bug, or maybe a user error on my end.
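One way to confirm from the Spark side that the header row is being treated as data is to read the CSV directly with the header option and compare. This is only a sketch; the file path is a placeholder and spark is the usual SparkSession from the shell.
// Read the same CSV with Spark, telling it the first line is a header,
// then compare max(date) against what the Hive table returns.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///path/to/xTable.csv")   // placeholder path
df.selectExpr("max(`date`)").show()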

Spark SQL does not pick up changes in Hive table schema

I have a Hive table that I wrote using Spark (saveAsTable(tableName)). Now I want to add a column to this table. My first approach was to add the column via Hive, but apparently Spark does not pick up the new schema even though the column was added to the table. When I check the table details in Hue, spark.sql.sources.schema.part.0 is not updated. So I thought to add the column via a Spark job that executes the ALTER query on the table. Same result, and the column is not even added to the table. Is there a way around this issue? I thought to change the table name, create a new one with the right schema, and then insert select * ... into the new table, but that didn't work because the tables are partitioned.
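For what it's worth, the "add the column via a Spark job" step can be sketched like this; the table and column names are placeholders, and ALTER TABLE ... ADD COLUMNS should be available for Hive tables from Spark 2.2 onward.
// Sketch, assuming Spark 2.2+ and a table named "my_table" (placeholder).
spark.sql("ALTER TABLE my_table ADD COLUMNS (new_col STRING)")
// Invalidate Spark's cached copy of the table metadata so the new column shows up.
spark.catalog.refreshTable("my_table")
spark.sql("DESCRIBE my_table").show(100, truncate = false)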

drop table command is not deleting path of hive table which was created by spark-sql

I am trying to drop an internal (managed) table that was created via Spark SQL. Somehow the table is getting dropped, but the location of the table still exists. Can someone let me know how to do this?
I tried both Beeline and Spark SQL:
create table something(hello string)
PARTITIONED BY(date_d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "^"
LOCATION "hdfs://path";
Drop table something;
No rows affected (0.945 seconds)
Thanks
Spark internally uses the Hive metastore to create the table. If the table is created as an external Hive table from Spark, i.e. the data is present in HDFS and Hive provides a table view on it, the drop table command will only delete the metastore information and will not delete the data from HDFS.
So there are some alternative strategies you could take:
Manually delete the data from HDFS using the hadoop fs -rm -r -f command.
Do an ALTER TABLE on the table you want to delete: change the external table to an internal table, then drop the table.
ALTER TABLE <table-name> SET TBLPROPERTIES('external'='false');
drop table <table-name>;
The first statement converts the external table to an internal (managed) table, and the second statement deletes the table along with its data.
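Run from the Spark shell, the second strategy might look roughly like this. The table name is a placeholder, and note that some Hive versions reportedly only honor the uppercase form 'EXTERNAL'='FALSE', so check which spelling your metastore accepts.
// Sketch of "convert to managed, then drop", issued through Spark SQL.
spark.sql("ALTER TABLE my_table SET TBLPROPERTIES('external'='false')")
spark.sql("DROP TABLE my_table")
// Afterwards the table's HDFS directory should be gone as well (verify with hadoop fs -ls).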

Hive PartitionFilter is not applied

I am facing this issue with Hive.
When I query a table that is partitioned on the date column, e.g.
SELECT count(*) from table_name where date='2018-06-01'
the query reads the entire table data and keeps running for hours.
Using EXPLAIN, I found that Hive is not applying the PartitionFilter to the query.
I have double-checked with desc table_name that the table is partitioned on the date column.
The execution engine is Spark, and the data is stored in Azure Data Lake in Parquet format.
However, I have another table in the database for which the PartitionFilter is applied, and it executes as expected.
Could there be some issue with the Hive metadata, or is it something else?
Found the cause of this issue:
Hive wasn't applying the partition filters on some tables because those tables were cached.
Thus, when I restarted the Thrift server, the cache was cleared and the partition filters were applied.
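If restarting the Thrift server is not convenient, one possible alternative (an assumption, not what the answer above did) is to drop the cached entry for the affected table through Spark SQL before re-running the query; table_name is the placeholder from the question.
// Clear the stale cached entry for the table instead of restarting the Thrift server.
spark.sql("UNCACHE TABLE table_name")   // removes the in-memory cache for this table, if it is cached
spark.sql("REFRESH TABLE table_name")   // re-reads the table's metadata and partition list
// Or, more bluntly, drop every cached table: spark.sql("CLEAR CACHE")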

Spark can't read content of a table

I had a managed Hive table and moved it to a different database using the following command:
alter table table_name rename to new_db.table_name
The table was successfully moved and all the data is under the new database now. The table shows up fine in Hive. However, when I try to read the table from Spark, it can read the schema but there is no content: the count returns zero! What has happened? How can I fix this issue?
I am loading it in Spark using the following code:
val t = sqlContext.table("new_db.table_name")
Sometimes just altering the name isn't enough; I also had to alter the location.
sqlContext.sql("""
ALTER TABLE new_table
SET LOCATION "hdfs://.../new_location"
""")
And refresh the table in Spark for good measure
sqlContext.sql("""
REFRESH TABLE new_table
""")
You can double-check whether the location is correct with describe formatted new_table.
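From Spark itself, the same check can be run as a query (a small sketch using the same sqlContext as above):
// Inspect the location the metastore has registered for the renamed table.
sqlContext.sql("describe formatted new_table").show(100, truncate = false)
// Check the Location row and make sure it points at the directory that actually holds the data.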
