Spark can't read content of a table - apache-spark

I had a managed hive table and moved it to a different database using the following command:
alter table table_name rename to new_db.table_name
The table was successfully moved and all the data is under the database now. The table is shown fine in HIVE. However when I try to read the table from Spark, it can read the schema but there is no content in there. That is, the count returns zero! What has happened? How can I fix this issue?
I am loading it in Spark using the following code:
val t = sqlContext.table("new_db.table_name")

Sometimes just altering the name isn't enough; I had to alter the location as well.
sqlContext.sql("""
ALTER TABLE new_table
SET LOCATION "hdfs://.../new_location"
""")
And refresh the table in Spark for good measure
sqlContext.sql("""
REFRESH TABLE new_table
""")
You can double-check that the location is correct with describe formatted new_table.
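For example, a minimal verification sketch from the same sqlContext (the new_db.table_name name comes from the question; the rest is illustrative):
// Confirm the Location field now points at the new directory
sqlContext.sql("DESCRIBE FORMATTED new_db.table_name").show(100, false)
// The count should no longer be zero once the location and metadata are in sync
val t = sqlContext.table("new_db.table_name")
println(t.count())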

Related

Deleting data from Hive managed table (Partitioned and Bucketed)

We have a Hive managed table (it's both partitioned and bucketed, and 'transactional' = 'true').
We are using Spark (version 2.4) to interact with this Hive table.
We are able to successfully ingest data into this table using the following:
sparkSession.sql("insert into table values('')")
But we are not able to delete a row from this table. We are attempting to delete using the below command:
sparkSession.sql("delete from table where col1 = '' and col2 = ''")
We are getting an OperationNotAccepted exception.
Do we need to do anything specific to be able to perform this action?
Thanks
Anuj
Unless it is a DELTA table, this is not possible.
ORC does not support delete for Hive bucketed tables; see https://github.com/qubole/spark-acid
HUDI on AWS could also be an option.
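For what it's worth, here is a rough sketch of what the Delta route could look like; the table name, output path, and the Delta Lake dependency (io.delta, 0.6.x works with Spark 2.4) are assumptions, not part of the original setup:
// Hypothetical sketch: copy the Hive table's data into Delta format (path is illustrative)
import io.delta.tables.DeltaTable

sparkSession.table("my_db.my_table")
  .write
  .format("delta")
  .save("/data/delta/my_table")

// Row-level deletes are supported through the DeltaTable API on Spark 2.4 + Delta 0.6.x
val deltaTable = DeltaTable.forPath(sparkSession, "/data/delta/my_table")
deltaTable.delete("col1 = 'a' AND col2 = 'b'")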

Altered column in Hive, showing Null value in Spark SQL result

I needed to rename one column in Hive, so I did that with the below query. After altering, I could see the result in Hive for "select columnnm2 from tablename".
alter table tablename change columnnm1 columnnm2 string;
But when I execute select columnnm2 through spark.sql, I get NULL values, whereas I can see valid values in Hive.
It's a managed table. I tried a Spark metadata refresh, still no luck. As of now, dropping the old table and creating a new Hive table with the correct column name works, but how do I handle this ALTER scenario?
spark.catalog.refreshTable("<tablename>")
Thank you.

External hive table on top of parquet returns no data

I created a Hive table on top of a parquet folder written via Spark. On one test server it runs fine and returns results (Hive version 2.6.5.196), but in production it returns no records (Hive 2.6.5.179). Could someone please point out what the exact issue could be?
If you created the table on top of an existing partition structure, you have to make it known to the table that there are partitions at this location.
MSCK REPAIR TABLE table_name; -- adds missing partitions
SELECT * FROM table_name; -- should return records now
This problem shouldn't happen if there are only files in that location and they are in the expected format.
You can verify with:
SHOW CREATE TABLE table_name; -- to see the expected format
Since the Hive table was created on top of a parquet folder written via Spark, first check whether the database you are using is available:
show databases;
Check the DDL of the table you created on your test server against the one on production:
show create table table_name;
Make sure that both DDLs match exactly.
Run msck repair table table_name to load the incremental data or the data from all the partitions.
Run select * from table_name to view the records.
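If you are checking from Spark rather than the Hive CLI, the same steps look roughly like this (a sketch only; database and table names are placeholders):
// Placeholder names; these mirror the Hive CLI checks above
spark.sql("SHOW DATABASES").show(false)
spark.sql("SHOW CREATE TABLE my_db.table_name").show(false)
// Register partitions that exist on disk but are missing from the metastore
spark.sql("MSCK REPAIR TABLE my_db.table_name")
// Records should be visible once the partitions are registered
spark.sql("SELECT * FROM my_db.table_name LIMIT 10").show()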

spark.sql (hive) schema doesn't match Cassandra schema

So I'm trying to do a simple select statement in spark.sql; however, it comes up with an error even though the column clearly exists in the Cassandra table:
// Spark ------------------------------------
spark.sql("SELECT value2 FROM myschema.mytable").show()
>> org.apache.spark.sql.AnalysisException: cannot resolve '`value2`'
given input columns: [key, value1]
// Cassandra --------------------------------
DESCRIBE myschema.mytable;
>> CREATE TABLE myschema.mytable (
>> key int,
>> value1 text,
>> value2 text,
>> PRIMARY KEY (key)
>> ) WITH ...;
I assume Hive just isn't synced properly, but running a table refresh command does NOT work:
spark.sql("REFRESH TABLE myschema.mytable")
See https://spark.apache.org/docs/2.1.2/sql-programming-guide.html#metadata-refreshing
The only way I could get it to properly refresh was to:
Move all data out of the table
Drop the table
Delete the hive metadata row
DELETE FROM "HiveMetaStore".sparkmetastore WHERE key='_2_myschema' AND entity='org.apache.hadoop.hive.metastore.api.Table::mytable';
Recreate table
Copy all data back
Surely there is a better way?
This is still a problem in my Spark environment; however, I found that just truncating or removing specific records in the "HiveMetaStore".sparkmetastore table causes the schema to refresh properly after around 5 minutes.
This works even without restarting a Spark session.
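As a sketch of the truncate variant (assuming the Spark Cassandra Connector is on the classpath; the keyspace and table names are the same ones used in the DELETE above, and the ~5 minute delay still applies):
import com.datastax.spark.connector.cql.CassandraConnector

// Clear Spark's cached Hive metadata rows stored in the DSE metastore table
CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
  session.execute("""TRUNCATE "HiveMetaStore".sparkmetastore""")
}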

Hive PartitionFilter is not applied

I am facing this issue with Hive.
When I query a table which is partitioned on a date column,
SELECT count(*) from table_name where date='2018-06-01'
the query reads the entire table data and keeps running for hours.
Using EXPLAIN, I found that Hive is not applying the PartitionFilter to the query.
I have double-checked that the table is partitioned on the date column via desc table_name.
The execution engine is Spark, and the data is stored in Azure Data Lake in Parquet format.
However, I have another table in the database for which the PartitionFilter is applied and which executes as expected.
Could there be some issue with the Hive metadata, or is it something else?
Found the cause of this issue:
Hive wasn't applying the partition filters on some tables because those tables were cached.
Thus, when I restarted the Thrift server, the cache was cleared and the partition filters were applied.
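A possible alternative to bouncing the Thrift server, not verified here and with the table name as a placeholder, is to drop the cache entry and refresh the table from the same session:
// Placeholder table name; clears the in-memory cache entry so the planner re-reads partition metadata
spark.sql("UNCACHE TABLE table_name")
spark.sql("REFRESH TABLE table_name")
// The partition filter should now show up in the plan
spark.sql("EXPLAIN SELECT count(*) FROM table_name WHERE date = '2018-06-01'").show(false)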
