PySpark dataframe drops records while writing to a hive table - apache-spark

I am trying to write a PySpark dataframe to a Hive table, which also gets created by the line below:
parks_df.write.mode("overwrite").saveAsTable("fs.PARKS_TNTO")
When I print the count of the dataframe with parks_df.count(), I get 1000 records.
But in the final table fs.PARKS_TNTO, I get 980 records, so 20 records are getting dropped. How can I resolve this issue? Also, how can I capture the records which are getting dropped? There are no partitions on the final table fs.PARKS_TNTO.
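One way to see which rows went missing (a minimal PySpark sketch, not a confirmed fix; the key column park_id is an assumption, so adjust it to your schema) is to persist the source dataframe, read the table back, and anti-join it against the source:
# Persist first so the count and the comparison see the same computed data.
parks_df.persist()

written_df = spark.table("fs.PARKS_TNTO")
# Rows present in parks_df but absent from the written table (park_id is a hypothetical key).
missing_df = parks_df.join(written_df, on="park_id", how="left_anti")
missing_df.show(truncate=False)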

Related

Incrementally aggregating a Hudi table value using Spark

I have a Spark streaming job that loads data into an Apache Hudi table every 10 seconds. It updates the row in the Hudi table if the row already exists; in other words, it performs an upsert operation.
But the Hudi table has an amount column that is also overwritten with the new value.
for example
batch 1: id=1, amount=10 --> in table, amount = 10
batch 2: id=1, amount=20 --> in table, amount = 20
But I need the amount value to be 30, not 20; I need to incrementally aggregate the amount column.
Does Hudi support this incremental aggregation use case without using an external cache/DB?
By default, Apache Hudi uses the class org.apache.hudi.common.model.OverwriteWithLatestAvroPayload to precombine your dataframe records and upsert the previously stored records. It simply checks whether your dataframe contains duplicate records with the same key, chooses the record with the maximum ordering field, and then replaces the old stored record with the new one selected from your inserted data.
But you can create your own record payload class by implementing the interface org.apache.hudi.common.model.HoodieRecordPayload and setting the config hoodie.compaction.payload.class to your class (the Hudi documentation lists more configs).
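For illustration only, a minimal PySpark write sketch under the assumption that such a payload class has been implemented, here called com.example.SumAmountPayload (a hypothetical name), which would add the incoming amount to the stored one:
# The option keys below are standard Hudi write configs; the payload class name is hypothetical.
(df.write.format("hudi")
    .option("hoodie.table.name", "hudi_amounts")                      # assumed table name
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")         # assumed ordering field
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.compaction.payload.class", "com.example.SumAmountPayload")
    .mode("append")
    .save("/path/to/hudi_amounts"))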

Best way to get the dropped records after using "dropDuplicates" function in Spark

I have a dataframe which contains duplicate records based on a column. My requirement is to drop the duplicates based on that column and perform certain operations on the unique records, and also to identify the duplicate records based on the column and save them to HBase for audit purposes.
input file:
A,B
1,2
1,3
2,5
Dataset<Row> datasetWithDupes=session.read().option("header","true").csv("inputfile");
// drop duplicates on column A
Dataset<Row> datasetWithoutDupes = datasetWithDupes.dropDuplicates("A");
I need a dataset containing the dropped records. I have tried two options.
Using the except function:
Dataset<Row> droppedRecords = datasetWithDupes.except(datasetWithoutDupes);
This should contain the dropped records.
Using the ranking function directly without using "dropDuplicates"
datasetWithDupes.withColumn("rank", functions.row_number().over(Window.partitionBy("A").orderBy("B")));
then filtering based on the rank to get the duplicate records.
Is there any faster way to get the duplicated records? I am using this in a streaming application, and most of the processing time (about 50%) is spent finding the duplicate records and saving them to the HBase table. The batch interval is 10 seconds, and around 5 seconds is spent on the task of filtering the duplicate records.
Please suggest a faster way to achieve this.
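As a hedged sketch (written in PySpark; the equivalent Window and row_number calls exist in the Java Dataset API), a single window pass can produce both the deduplicated rows and the dropped rows, avoiding the separate except scan. Note that unlike dropDuplicates, which keeps an arbitrary row per key, this keeps the row with the smallest B:
from pyspark.sql import Window, functions as F

# Rank rows within each value of A; the first row per key plays the role of the kept record.
w = Window.partitionBy("A").orderBy("B")
ranked = datasetWithDupes.withColumn("rn", F.row_number().over(w)).persist()

kept = ranked.filter("rn = 1").drop("rn")      # unique records to process further
dropped = ranked.filter("rn > 1").drop("rn")   # duplicates to archive in HBase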

Spark: Record count mismatch

I am quite confused because I am facing a weird situation.
My Spark application reads data from an Oracle database and loads it into a dataframe using this instruction:
private val df = spark.read.jdbc(
  url = [the jdbc url],
  table = "(" + [the query] + ") qry",
  properties = [the oracle driver]
)
Then, I save in a variable the number of records in this dataframe:
val records = df.count()
Then I create a Hive table ([my table]) with the dataframe schema, and I dump the content of the dataframe into it:
df.write
.mode(SaveMode.Append)
.insertInto([my hive db].[my table])
Well, here is where I am lost: when I perform a select count(*) on the Hive table the dataframe is loaded into, "sometimes" there are a few more records in Hive than in the records variable.
Can you think of what could be the source of this mismatch?
*Regarding the possible duplicate, my question is different. I am not counting my dataframe many times and getting different values. I count the records on my dataframe once, I dump the dataframe into Hive, and I count the records in the Hive table, and sometimes there are more in Hive than in my count.*
Thank you very much in advance.
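One possible cause (not confirmed in this thread) is that count() and the write are separate actions, so Spark runs the JDBC query twice; any rows added in Oracle between the two actions would then appear only in Hive. A minimal sketch that pins both actions to the same materialized data (shown in PySpark; the Scala calls are the same, and the table name placeholders are kept from the question):
# Cache the JDBC result so the count and the insert reuse one snapshot of the data.
df.persist()
records = df.count()                                       # materializes the cache
df.write.mode("append").insertInto("[my hive db].[my table]")
df.unpersist()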

Getting partition list of inserted df partitions

Is there a way to get the file list or partition name of the partition that was inserted into the table?
df.write.format("parquet").partitionBy("id", "name").insertInto(...)
For example, for the command above I wish to get a list like:
1,Jhon
2,Jake
3,Dain
I don't think that's possible, because you don't know what was already present in the table and what was newly added.
Of course you can query your dataframe to get this:
import org.apache.spark.sql.functions.concat_ws
val partitionList = df.select(concat_ws(",", $"id", $"name")).distinct.map(_.getString(0)).collect
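If listing every partition of the target table after the insert is enough (pre-existing ones included, per the caveat above), a hedged alternative is to ask the metastore directly; the table name below is a placeholder, and the call is shown in PySpark but spark.sql works the same in Scala:
# Lists all partitions registered for the table, old and newly added alike.
spark.sql("SHOW PARTITIONS my_db.my_table").show(truncate=False)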

Spark-Hive partitioning

The Hive table was created with 4 buckets:
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following lines in the spark code insert data into this table
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen, the Hive table has 128 partitions instead of 4 buckets.
The defaultParallelism cannot be reduced to 4, as that leads to a very slow system. I have also tried the DataFrame.coalesce method, but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of today (Spark 2.2.0), Spark does not support writing to bucketed Hive tables natively using spark-sql. When creating the bucketed table, there should be a CLUSTERED BY clause on one of the columns from the table schema. I don't see that in the CREATE TABLE statement shown. Assuming that it does exist and you know the clustering column, you could add the
.bucketBy([numBuckets], [colName])
method of the DataFrameWriter API.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
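A minimal sketch of what that could look like (shown in PySpark, the Scala API is analogous; the clustering column cells is only an assumption, and note that bucketBy currently works with saveAsTable rather than insertInto):
# Writes 4 buckets per partition; using "cells" as the clustering column is an assumption.
(hourlies.write
    .partitionBy("traffic_date_hour")
    .bucketBy(4, "cells")
    .sortBy("cells")
    .format("orc")
    .saveAsTable("hourly_suspect"))
Whether Hive itself recognizes Spark's bucketing layout depends on the versions involved, so treat this as a sketch rather than a guaranteed fix.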
