Delta Lake Table shows data in Glue but not in Athena - apache-spark

I am writing data to S3 in Delta format from a Spark dataframe in Glue. When I read the same table back into a dataframe, I can see the data. I then created a table in Athena pointing to this S3 location (using a CREATE EXTERNAL TABLE statement, not a Glue crawler). When I query the table in Athena, I get zero records. What am I missing?
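For reference, a minimal sketch of the write/read side described above, assuming a placeholder bucket, path, and dataframe (the actual Glue job code was not posted):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])   # stand-in for the real dataframe

# Write to S3 in Delta format from the Glue job.
df.write.format("delta").mode("overwrite").save("s3://my-bucket/delta/my_table/")

# Reading the same path back in Spark returns the rows as expected.
spark.read.format("delta").load("s3://my-bucket/delta/my_table/").show()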

Related

How to find out whether Spark table is parquet or delta?

I have a database with some tables in parquet format and others in delta format. When I append data to a table, I need to specify the format explicitly if the table is delta (the default is parquet).
How can I determine a table's format?
I tried show tblproperties <tbl_name> but this gives an empty result.
According to the Delta Lake API docs, you can check:
DeltaTable.isDeltaTable(spark, "path")
Please see the note in the documentation:
This uses the active SparkSession in the current thread to read the table data. Hence, this throws an error if the active SparkSession has not been set, that is, if SparkSession.getActiveSession() is empty.
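A minimal PySpark sketch of that check, assuming the delta-spark package is available and using a placeholder path:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# True if the location holds a Delta table (i.e. it has a _delta_log directory), else False.
is_delta = DeltaTable.isDeltaTable(spark, "s3://my-bucket/path/to/table")
fmt = "delta" if is_delta else "parquet"
print(f"append using format: {fmt}")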

Hive Create Table Reading CSV from S3 data spill

I'm trying to create a Hive table from a CSV file at an external location on S3.
CREATE EXTERNAL TABLE coder_bob_schema.my_table (column data type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/path/file.CSV'
The resultant table has data from fields n-x spilling over into field n, which leads me to believe Hive doesn't like the CSV. However, I downloaded the CSV from S3 and it opens and looks okay in Excel. Is there a workaround, like using a different delimiter?
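One common cause of that symptom, assuming some fields contain commas inside quotes: ROW FORMAT DELIMITED (LazySimpleSerDe) splits on every comma and does not understand CSV quoting, while Excel does, so the file can look fine in Excel yet shift columns in Hive. A small Python illustration:

import csv

line = '1,"Bob, the coder",NY'      # hypothetical row with an embedded comma

# How a plain FIELDS TERMINATED BY ',' table sees it: a naive split on every comma.
print(line.split(","))              # ['1', '"Bob', ' the coder"', 'NY'] -> 4 values, columns shift

# How a CSV-aware parser (or Excel) sees it: quoting is respected.
print(next(csv.reader([line])))     # ['1', 'Bob, the coder', 'NY'] -> 3 values

If that is the cause, a CSV-aware SerDe such as org.apache.hadoop.hive.serde2.OpenCSVSerde, or a delimiter that never appears in the data, would avoid the spill.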

DataBricks - Ingest new data from parquet files into delta / delta lake table

I posted this question on the Databricks forum; I'll copy it below, but basically I need to ingest new data from parquet files into a delta table. I think I have to figure out how to use a MERGE statement effectively and/or use an ingestion tool.
I'm mounting some parquet files and then I create a table like this:
sqlContext.sql("CREATE TABLE myTableName USING parquet LOCATION 'myMountPointLocation'");
And then I create a delta table with a subset of columns and also a subset of the records. If I do both these things, my queries are super fast.
sqlContext.sql("CREATE TABLE $myDeltaTableName USING DELTA SELECT A, B, C FROM myTableName WHERE Created > '2021-01-01'");
What happens if I now run:
sqlContext.sql("REFRESH TABLE myTableName");
Does my table now update with any additional data that may be present in my parquet source files? Or do I have to re-mount those parquet files to get additional data?
Does my delta table also update with new records? I doubt it but one can hope...
Is this a case for Auto Loader? Or do I do some combination of mounting, re-creating / refreshing my source table, and then MERGE new / updated records into my delta table?
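For reference, a minimal sketch of the MERGE pattern mentioned in the question, assuming column A can serve as the join key (the key column and this exact statement are assumptions, not a confirmed answer):

# MERGE new/changed rows from the parquet-backed source table into the delta table.
spark.sql("""
  MERGE INTO myDeltaTableName AS t
  USING (SELECT A, B, C FROM myTableName WHERE Created > '2021-01-01') AS s
  ON t.A = s.A
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")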

Query Hive table in spark 2.2.0

I have a Hive table (say table1) in Avro file format with 1900 columns. When I query the table in Hive I am able to fetch data, but when I query the same table in Spark SQL I get "metastore client lost connection. Attempting to reconnect".
I have also queried another Hive table (say table2) in Avro file format with 130 columns; it fetches data in both Hive and Spark.
What I observed is that I can see data in the HDFS location of table2, but I can't see any data in table1's HDFS location (yet it fetches data when I query it in Hive).
Splits tell you about the number of mappers in an MR job.
They don't show you the exact location from where the data was picked.
The below will help you check where the data for table1 is stored in HDFS.
For table1: you can check the location of the data in HDFS by running a SELECT query with WHERE conditions in Hive, with MapReduce as the execution engine. Once the job is complete, check the map task's log in the YARN application (specifically for the text "Processing file") to find where the input data files were read from.
Also, check the location of the data for both tables in the Hive metastore by running "SHOW CREATE TABLE <table_name>;" in Hive for each table. In the output, look at the "LOCATION" detail.
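A quick way to surface that LOCATION detail from Spark as well (table names are the poster's):

# Show the full DDL, including the LOCATION clause, for both tables.
spark.sql("SHOW CREATE TABLE table1").show(truncate=False)
spark.sql("SHOW CREATE TABLE table2").show(truncate=False)

# DESCRIBE FORMATTED also lists a "Location" row in its output.
spark.sql("DESCRIBE FORMATTED table1").show(100, truncate=False)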

Count(*) over same External table gives different values in spark.sql() and hive

I am working on an AWS cluster with Hive and Spark. I ran into a weird situation the other day while running an ETL PySpark script over an external table in Hive.
We have a control table with an extract date column. We filter data from a staging table (a managed table in Hive, but with its location in an S3 bucket) based on the extract date and load it into a target table, which is an external table with data located in an S3 bucket. We load the table as below:
spark.sql("INSERT OVERWRITE target_table select * from DF_made_from_stage_table")
Now when I check count(*) over the target table via Spark as well as via the Hive CLI directly, the two give different counts.
In Spark:
spark.sql("select count(1) from target") -- gives 50K records
In Hive:
select count(1) from target -- gives a count of 50K - 100
Note: there was an intermittent issue with statistics on the external table that was returning -1 as the count in Hive. We resolved that by running:
ANALYZE TABLE target COMPUTE STATISTICS
But even after doing all this, we still get original_count - 100 in Hive, while Spark gives the correct count.
There was a mistake in the DDL used for the external table: "skip.header.line.count"="1" was set in the DDL and we have 100 output files, so one line per file was skipped, which caused original_count - 100 in Hive while Spark calculated it correctly. After removing "skip.header.line.count"="1", it gives the expected count.
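A sketch of the fix described above, assuming the property was set via TBLPROPERTIES and that you want to drop it without recreating the table (verify the exact statement on your Hive/Spark version):

# The offending property in the external table's DDL (per the answer above):
#   TBLPROPERTIES ("skip.header.line.count"="1")
# With 100 output files and no real header row, Hive drops the first line of each file
# (100 rows in total), while Spark counts every row, hence the difference of exactly 100.

spark.sql("ALTER TABLE target UNSET TBLPROPERTIES ('skip.header.line.count')")

# Re-check the count afterwards.
print(spark.sql("SELECT COUNT(1) FROM target").collect())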
