spark.sql (hive) schema doesn't match Cassandra schema - apache-spark

So I'm trying to do a simple SELECT statement in spark.sql, but it comes up with an error even though the column clearly exists in the Cassandra table:
// Spark ------------------------------------
spark.sql("SELECT value2 FROM myschema.mytable").show()
>> org.apache.spark.sql.AnalysisException: cannot resolve '`value2`'
given input columns: [key, value1]
// Cassandra --------------------------------
DESCRIBE myschema.mytable;
>> CREATE TABLE myschema.mytable (
>> key int,
>> value1 text,
>> value2 text,
>> PRIMARY KEY (key)
>> ) WITH ...;
I assume Hive just isn't synced properly, but running a table refresh command does NOT work: spark.sql("REFRESH TABLE myschema.mytable")
See https://spark.apache.org/docs/2.1.2/sql-programming-guide.html#metadata-refreshing
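For reference, the refresh attempt looks like this in a Spark session (a sketch; the spark.catalog call is the catalog-API counterpart of the SQL command, included here only on the assumption that it behaves the same way):
// The SQL refresh from above; it did not pick up the new column.
spark.sql("REFRESH TABLE myschema.mytable")
// Catalog-API counterpart (assumption: same effect; not mentioned in the question).
spark.catalog.refreshTable("myschema.mytable")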
The only way I could get it to properly refresh was to:
Move all data out of the table
Drop the table
Delete the hive metadata row
DELETE FROM "HiveMetaStore".sparkmetastore WHERE key='_2_myschema' AND entity='org.apache.hadoop.hive.metastore.api.Table::mytable';
Recreate table
Copy all data back
Surely there is a better way?

This is still a problem within my Spark environment; however, I found that just truncating or removing the specific records in the "HiveMetaStore".sparkmetastore table seems to refresh things properly after around 5 minutes.
This works even without restarting a spark session.
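A sketch of how that targeted delete can be issued from the same Spark session, assuming the DataStax spark-cassandra-connector is on the classpath and reusing the keyspace, key, and entity values from the DELETE shown above:
import com.datastax.spark.connector.cql.CassandraConnector

// Remove the stale table entry from the DSE-backed Hive metastore keyspace.
CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
  session.execute(
    """DELETE FROM "HiveMetaStore".sparkmetastore
      |WHERE key='_2_myschema'
      |AND entity='org.apache.hadoop.hive.metastore.api.Table::mytable'""".stripMargin)
}
// Per the note above, the corrected schema shows up after roughly 5 minutes,
// with no need to restart the Spark session.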

Related

Altered column in Hive, showing Null value in Spark SQL result

I needed to alter one column name in Hive, so I did that with the query below. After altering it, I could see the result in Hive for "select columnnm2 from tablename":
alter table tablename change columnnm1 columnnm2 string;
But when I tried to execute select columnnm2 via spark.sql, I got NULL values, whereas I could see valid values in Hive.
It's a managed table. I tried a Spark metadata refresh, but still no luck:
spark.catalog.refreshTable("<tablename>")
For now, dropping the old table and creating a new Hive table with the correct column name works, but how do I handle this ALTER scenario? Thank you.
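A sketch of that drop-and-recreate workaround run from Spark, starting from the table before the Hive-side ALTER (tablename, columnnm1 and columnnm2 are the placeholder names from the question):
// Build a copy with the desired column name, swap it in, then refresh the catalog.
spark.sql("CREATE TABLE tablename_renamed AS SELECT columnnm1 AS columnnm2 FROM tablename")
spark.sql("DROP TABLE tablename")
spark.sql("ALTER TABLE tablename_renamed RENAME TO tablename")
spark.catalog.refreshTable("tablename")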

External hive table on top of parquet returns no data

I created a Hive table on top of a parquet folder written via Spark. On one test server it runs fine and returns results (Hive version 2.6.5.196), but in production it returns no records (Hive 2.6.5.179). Could someone please point out what the exact issue could be?
If you created the table on top of an existing partition structure, you have to make it known to the table that there are partitions at this location.
MSCK REPAIR TABLE table_name; -- adds missing partitions
SELECT * FROM table_name; -- should return records now
This problem shouldn't happen if there are only files in that location (no partition directories) and they are in the expected format.
You can verify with:
SHOW CREATE TABLE table_name; -- to see the expected format
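If you query the table through Spark rather than the Hive shell, the same repair can be run there; a sketch (spark.catalog.recoverPartitions is Spark's catalog-API equivalent of MSCK REPAIR TABLE for partitioned tables):
spark.sql("MSCK REPAIR TABLE table_name")      // registers the missing partitions
spark.catalog.recoverPartitions("table_name")  // or the same thing via the catalog API
spark.sql("SELECT * FROM table_name").show()   // should return records now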
For a Hive table created on top of a parquet folder written via Spark, run through these checks:
Check whether the database you are using is available:
show databases;
Check the DDL of the table you created on your test server against the one on production:
show create table table_name;
Make sure both DDLs match exactly.
Run msck repair table table_name to load the incremental data, or the data from all the partitions.
Then run select * from table_name to view the records.

Spark SQL query issue - SQL with Subquery doesn't seem to retrieve records

I have a Spark SQL query like:
Select * from xTable a Where Exists (filter subquery) AND (a.date IN (Select max(b.date) from xTable b))
Under certain circumstances (when a filter table is not provided), my filter subquery should simply do a Select 1.
Whenever I run this in Impala it returns records; in Hive it complains that only one subquery expression is allowed. However, when I run it as Spark SQL in Spark 2.4, it returns an empty dataframe. Any idea why? What am I doing wrong?
OK, I think I found the reason. It is not related to the query. It seems to be an issue with how the table was created from a CSV file in Hive.
When you select the source (the path to the CSV file in HDFS) and then check the 'Has Header' check box under the format options, it seems to create the table OK.
Then, when I execute the following in Hive or Impala:
Select max(date) from xTable
I get the max date back (where the date column is a String).
However, when I try to run the same query via Spark SQL, I get the result "date" (the same as the column header).
If I remove the header from the CSV file, import it, and then manually create the headers and types, I don't face this issue.
Seems like some form of bug, or maybe a user error on my end.
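A sketch of a way to sidestep the header problem from the Spark side: instead of relying on the table created by the wizard, read the CSV directly with an explicit header option (the HDFS path is a placeholder):
// Treat the first line as column names rather than data, so it can never
// show up as a value in the date column.
val xTable = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///path/to/xtable/")

xTable.createOrReplaceTempView("xTable")
spark.sql("SELECT max(date) FROM xTable").show()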

Spark SQL does not pick up changes in Hive table schema

I have a Hive table that I wrote using Spark (saveAsTable(tableName)). Now I want to add a column to this table. My first approach was to add the column via Hive, but apparently Spark does not pick up the new schema even though the column was added to the table. When I check the table details in Hue, spark.sql.sources.schema.part.0 is not updated. So I then tried to add the column via a Spark job that executes the ALTER statement on the table. Same result, and the column is not even added to the table. Is there a way around this issue? I thought to rename the table, create a new one with the right schema and then insert select * ... into the new table, but that didn't work as the tables are partitioned.
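For reference, a sketch of the Spark-side attempt described above, with placeholder names; Spark 2.2+ accepts ALTER TABLE ... ADD COLUMNS, though in the scenario above neither path updated spark.sql.sources.schema.part.0:
// Add the column through Spark itself, then refresh the cached metadata.
spark.sql("ALTER TABLE my_db.my_table ADD COLUMNS (new_col STRING)")
spark.catalog.refreshTable("my_db.my_table")
spark.sql("DESCRIBE my_db.my_table").show(100, truncate = false)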

Spark can't read content of a table

I had a managed hive table and moved it to a different database using the following command:
alter table table_name rename to new_db.table_name;
The table was successfully moved and all the data is under the new database now. The table shows up fine in Hive. However, when I try to read the table from Spark it can read the schema, but there is no content: the count returns zero! What has happened? How can I fix this issue?
I load it in Spark using the following code:
val t = sqlContext.table("new_db.table_name")
Sometimes just altering the name isn't enough; I also had to alter the location.
sqlContext.sql("""
ALTER TABLE new_table
SET LOCATION "hdfs://.../new_location"
""")
And refresh the table in Spark for good measure
sqlContext.sql("""
REFRESH TABLE new_table
""")
You can double-check that the location is correct with describe formatted new_table.
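A quick way to run that check from the same Spark session (a sketch; the table location appears in the Location row of the output):
sqlContext.sql("DESCRIBE FORMATTED new_table").show(100, false)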
