Same query resulting in different outputs in Hive vs Spark - apache-spark

Hive 2.3.6-mapr
Spark v2.3.1
I am running the same query in both:
select count(*)
from TABLE_A a
left join TABLE_B b
on a.key = b.key
and b.date > '2021-01-01'
and date_add(last_day(add_months(a.create_date, -1)),1) < '2021-03-01'
where cast(a.TIMESTAMP as date) >= '2021-01-20'
and cast(a.TIMESTAMP as date) < '2021-03-01'
But I am getting 1B rows as output in Hive, while 1.01B in Spark SQL.
From some initial analysis, it seems that all of the extra rows in Spark have the TIMESTAMP column set to 2021-02-28 00:00:00.000000.
Both the TIMESTAMP and create_date columns have data type string.
What could be the reason behind this?
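Before looking at possible causes, one way to narrow this down is to compare per-date counts around the boundary in both engines and see where the extra 2021-02-28 rows come from. A diagnostic sketch (table and column names taken from the query above; the spark session is assumed to have Hive support enabled):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Per-date counts near the upper boundary; run the same SQL in Hive and compare the two outputs.
spark.sql("""
    SELECT CAST(`TIMESTAMP` AS DATE) AS ts_date, COUNT(*) AS cnt
    FROM TABLE_A
    WHERE CAST(`TIMESTAMP` AS DATE) >= '2021-01-20'
      AND CAST(`TIMESTAMP` AS DATE) <  '2021-03-01'
    GROUP BY CAST(`TIMESTAMP` AS DATE)
    ORDER BY ts_date DESC
""").show(10)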

I will give you one possibility, but I need more information.
If you drop an external table, the data files remain on disk and Spark can still read them, but the metadata in Hive says the table doesn't exist, so Hive doesn't read it.
That could be why you have a difference.
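A hedged way to check for that kind of metastore/data mismatch (table name taken from the question; adjust to whichever table is suspect) is to compare what each engine reports for the table definition and, if it is partitioned, its partitions:
spark.sql("DESCRIBE FORMATTED TABLE_B").show(200, truncate=False)  # table type and storage location as Spark sees it
spark.sql("SHOW PARTITIONS TABLE_B").show(200, truncate=False)     # only valid if TABLE_B is partitioned
Then run the equivalent DESCRIBE FORMATTED / SHOW PARTITIONS from the Hive CLI and compare the two outputs.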

Related

Pyspark: Row count is not matching to the count of records appended

I am trying to identify and insert only the delta records into the target Hive table from a PySpark program. I am using a left anti join on ID columns and it identifies the new records successfully. But I noticed that the total number of delta records is not the same as the difference between the table record count before the load and after the load.
delta_df = src_df.join(tgt_df, src_df.JOIN_HASH == tgt_df.JOIN_HASH, how='leftanti') \
    .select(src_df.columns).drop("JOIN_HASH")
delta_df.count()  # gives the correct delta count
delta_df.write.mode("append").format("hive").option("compression", "snappy").saveAsTable(hivetable)
But delta_df.count() is not the same as (count(*) from hivetable after writing the data) minus (count(*) from hivetable before writing the data). The difference always comes out higher than the delta count.
I have a unique timestamp column for each load in the source, and to my surprise, the count of records in the target for the current load (grouping by the unique timestamp) is less than the delta count.
I am not able to identify the issue here, do I have to write the df.write in some other way?
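For reference, this is the comparison being described, written out explicitly (a sketch; src_df, tgt_df, delta_df, and hivetable are the names used in the question):
before_cnt = spark.table(hivetable).count()
delta_cnt = delta_df.count()

delta_df.write.mode("append").format("hive").option("compression", "snappy").saveAsTable(hivetable)

after_cnt = spark.table(hivetable).count()
print(before_cnt, delta_cnt, after_cnt, after_cnt - before_cnt)  # (after - before) should equal delta_cnt, but comes out higher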
It was a problem with the line delimiter. When the table is created with spark.write, no line.delim is specified in the SERDEPROPERTIES, and column values containing * were getting split into multiple rows.
Once I added the SERDEPROPERTIES below, the data is stored correctly.
'line.delim'='\n'
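For an existing table, one way to apply that property is an ALTER TABLE statement (a sketch; hivetable is the table name variable from the question, and ALTER TABLE ... SET SERDEPROPERTIES is standard Hive DDL that Spark SQL also accepts):
# Raw string so the \n escape is passed through to the SQL parser instead of being interpreted by Python.
spark.sql("ALTER TABLE " + hivetable + r" SET SERDEPROPERTIES ('line.delim' = '\n')")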

Insert selective columns to hive

I want to insert selective columns into Hive and I am unable to do so. This is what I was trying via Spark:
val df2 = spark.sql("SELECT Device_Version, date, SUM(size) as size FROM table1 WHERE date='2019-06-13' GROUP BY date, Device_Version")
df2.createOrReplaceTempView("tempTable")
spark.sql("Insert into table2 PARTITION (date,ID) (Device_Version) SELECT Device_Version, date, '1' AS ID FROM tempTable")
My aim is to insert only selective fields into table2. Table2 has many other columns which I want to be padded as null. I can do the padding as long as I can specify the order; I do not want the order to be taken by default.
Something like ...
spark.sql("Insert into table2 PARTITION (date,cuboid_id) (Device_Version,OS) SELECT Device_Version, null as os, date, '10001' AS CUBOID_ID FROM tempTable")
Is there any way to do this? Any options are welcome.
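No accepted answer is shown here, but one common workaround (a sketch, not a confirmed solution; other_col1/other_col2 stand in for table2's remaining columns, and their real names and types must come from the table definition) is to keep the insert positional: list every column of table2 in its schema order, pad the unwanted ones with typed NULLs, and put the partition columns last.
# Fully dynamic partition values usually require nonstrict mode.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    INSERT INTO table2 PARTITION (date, ID)
    SELECT
        Device_Version,
        CAST(NULL AS STRING) AS other_col1,  -- placeholder column to be left null
        CAST(NULL AS BIGINT) AS other_col2,  -- placeholder, match table2's actual type
        date,                                -- dynamic partition columns go last
        '1' AS ID
    FROM tempTable
""")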

spark-hive - Upsert into dynamic partition hive table throws an error - Partition spec contains non-partition columns

I am using Spark 2.2.1 and Hive 2.1. I am trying to insert overwrite multiple partitions into an existing partitioned Hive/Parquet table.
The table was created using sparkSession.
I have a table 'mytable' with partitions P1 and P2.
I have the following set on the sparkSession object:
"hive.exec.dynamic.partition"=true
"hive.exec.dynamic.partition.mode"="nonstrict"
Code:
val df = spark.read.csv(pathToNewData)
df.createOrReplaceTempView("updateTable") // here 'df' may contain data from multiple partitions, i.e. multiple values for P1 and P2.
spark.sql("insert overwrite table mytable PARTITION(P1, P2) select c1, c2,..cn, P1, P2 from updateTable") // I made sure that partition columns P1 and P2 are at the end of the projection list.
I am getting the following error:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {p1=, p2=, P1=1085, P2=164590861} contains non-partition columns;
The dataframe 'df' has records for P1=1085, P2=164590861. It looks like an issue with casing (lower vs upper). I tried both cases in my query but it's still not working.
EDIT:
The insert statement works with static partitioning, but that is not what I am looking for. For example, the following works:
spark.sql("insert overwrite table mytable PARTITION(P1=1085, P2=164590861) select c1, c2,..cn, P1, P2 from updateTable where P1=1085 and P2=164590861")
Create table stmt:
CREATE TABLE `my_table`(
  `c1` int,
  `c2` int,
  `c3` string,
  `p1` int,
  `p2` int)
PARTITIONED BY (
  `p1` int,
  `p2` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'maprfs:/mds/hive/warehouse/my.db/xc_bonus'
TBLPROPERTIES (
  'spark.sql.partitionProvider'='catalog',
  'spark.sql.sources.schema.numPartCols'='2',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{.spark struct metadata here.......}',
  'spark.sql.sources.schema.partCol.0'='P1', // Spark keeps the partition column names uppercase, while Hive stores them lowercase
  'spark.sql.sources.schema.partCol.1'='P2',
  'transient_lastDdlTime'='1533665272')
In the above, spark.sql.sources.schema.partCol.0 uses all uppercase while the PARTITIONED BY clause uses all lowercase for the partition columns.
Based on the exception, and assuming that the table 'mytable' was created as a partitioned table with P1 and P2 as partitions, one way to overcome this exception would be to force a dummy partition manually before executing the command. Try doing
spark.sql("alter table mytable add partition (p1=default, p2=default)")
Once that succeeds, execute your insert overwrite statement. Hope this helps.
As I mentioned in the EDIT section, the issue was in fact the difference in partition column casing (lower vs upper) between Hive and Spark. I created the Hive table with uppercase partition columns, but Hive still stored them internally as lowercase, while the Spark metadata kept them uppercase as I intended. Fixing the create statement to use all-lowercase partition columns fixed the issue for subsequent updates.
If you are using Hive 2.1 and Spark 2.2, make sure the following properties in the create statement use the same casing:
PARTITIONED BY (
  p1 int,
  p2 int)
'spark.sql.sources.schema.partCol.0'='p1',
'spark.sql.sources.schema.partCol.1'='p2',
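Putting the fix together, a minimal sketch (pathToNewData and columns c1/c2 come from the question, the column types are illustrative; the point is simply to keep partition column names lowercase everywhere so the Hive and Spark metadata agree):
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Recreate the table with lowercase partition column names.
spark.sql("""
    CREATE TABLE IF NOT EXISTS mytable (
        c1 INT,
        c2 STRING)
    PARTITIONED BY (p1 INT, p2 INT)
    STORED AS PARQUET
""")

df = spark.read.csv(pathToNewData)
df.createOrReplaceTempView("updateTable")

# Partition columns last in the projection, lowercase in the PARTITION clause.
spark.sql("""
    INSERT OVERWRITE TABLE mytable PARTITION (p1, p2)
    SELECT c1, c2, p1, p2 FROM updateTable
""")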

Hive query for row number

I am working in PySpark and need to write a query which reads data from a Hive table and returns a PySpark dataframe containing all the columns plus a row number.
This is what I tried:
SELECT *, ROW_NUMBER() OVER () as rcd_num FROM schema_name.table_name
This query works fine in Hive, but when I run it from a PySpark script it throws the following error:
Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;
Please suggest some solution.
Note: I do not wish to order the rows in any particular order; I just need row numbers for all the rows present in the table without any sorting or ordering.
Using Spark 2.1.
ROW_NUMBER() requires an ordering, so you can instead use the monotonically_increasing_id function, which generates a unique, monotonically increasing id for every row in the table.
from pyspark.sql.functions import monotonically_increasing_id  # monotonicallyIncreasingId is the old, deprecated name

df.withColumn("rcd_num", monotonically_increasing_id())
OR
SELECT *, ROW_NUMBER() OVER (Order by (select NULL)) as rcd_num FROM schema_name.table_name
Here ORDER BY (SELECT NULL) satisfies the requirement for an ordering clause without imposing any real sort order.
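An end-to-end usage sketch of the first approach (table name from the question; note that the generated ids are unique and increasing but not consecutive):
from pyspark.sql.functions import monotonically_increasing_id

df = spark.table("schema_name.table_name").withColumn("rcd_num", monotonically_increasing_id())
df.show(5)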

Read records from joining Hive tables with Spark

We can easily read records from a Hive table in Spark with this command:
Row[] results = sqlContext.sql("FROM my_table SELECT col1, col2").collect();
But when I join two tables, such as:
select t1.col1, t1.col2 from table1 t1 join table2 t2 on t1.id = t2.id
How can I retrieve the records from the above join query?
The SQLContext.sql method always returns a DataFrame, so there is no practical difference between a JOIN and any other type of query.
You shouldn't use the collect method, though, unless fetching the data to the driver is really the desired outcome. It is expensive and will crash if the data cannot fit in driver memory.
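For example (a sketch in the Python API; the same applies to the Java/Scala API used in the question), the join query also just returns a DataFrame, which can be inspected or sampled without collecting everything to the driver:
joined = spark.sql("""
    SELECT t1.col1, t1.col2
    FROM table1 t1
    JOIN table2 t2 ON t1.id = t2.id
""")

joined.show(10)                       # look at a few rows without pulling the full result
sample = joined.limit(100).collect()  # collect only a bounded subset if rows are really needed on the driver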
