Best way to get the dropped records after using "dropDuplicates" function in Spark - apache-spark

I have a dataframe which contains duplicate records based on a column. My requirement is to drop duplicates based on that column and perform certain operations on the unique records, and also to identify the duplicate records based on the same column and save them to HBase for audit purposes.
input file:
A,B
1,2
1,3
2,5
Dataset<Row> datasetWithDupes = session.read().option("header", "true").csv("inputfile");
// drop duplicates on column A
Dataset<Row> datasetWithoutDupes = datasetWithDupes.dropDuplicates("A");
A dataset containing the dropped records is required. I have tried 2 options:
Using the except function
Dataset<Row> droppedRecords = datasetWithDupes.except(datasetWithoutDupes);
This should contain the dropped records.
Using the ranking function directly, without using "dropDuplicates"
datasetWithDupes.withColumn("rank", functions.row_number().over(Window.partitionBy("A").orderBy("B")))
then filtering based on the rank to get the duplicate records (sketched below).
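For reference, a minimal sketch of the second option (Java, reusing the names from the question). Note that row_number() keeps the row with the smallest B for each A, whereas dropDuplicates("A") gives no guarantee about which duplicate survives, so the two approaches may keep different rows:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.functions;

// Rank the rows within each value of A, ordered by B.
WindowSpec byA = Window.partitionBy("A").orderBy("B");
Dataset<Row> ranked = datasetWithDupes.withColumn("rank", functions.row_number().over(byA));

// rank = 1 is the record that would have been kept; rank > 1 are the dropped duplicates.
Dataset<Row> keptRecords = ranked.filter("rank = 1").drop("rank");
Dataset<Row> droppedRecords = ranked.filter("rank > 1").drop("rank");
Caching ranked before the two filters avoids recomputing the window for both branches; droppedRecords can then be saved to HBase while keptRecords continues through the normal processing path.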
Is there any faster way to get the duplicated records? I am using this in a streaming application and most of the processing time (around 50%) is spent on finding the duplicate records and saving them to the HBase table. The batch interval is 10 seconds, and around 5 seconds of it goes to filtering and saving the duplicate records.
Please suggest a faster way to achieve this.

Related

Incrementing aggregate the hudi table value using spark

I have a Spark streaming job that loads data into an Apache Hudi table every 10 seconds. It updates the row in the Hudi table if the row already exists; in effect it is doing an upsert operation.
But the Hudi table has an amount column that is also overwritten with the new value.
For example:
batch 1, id=1, amount value=10 --> in table, amount value = 10
batch 2, id=1, amount value=20 --> in table, amount value = 20
But I need the amount value to be 30, not 20; I need to incrementally aggregate the amount column.
Does Hudi support this incremental aggregation use case without using an external cache/DB?
Apache Hudi uses the class org.apache.hudi.common.model.OverwriteWithLatestAvroPayload by default to precombine your dataframe records and upsert the already stored records: it checks whether your dataframe contains duplicate records with the same key, chooses the record with the maximum ordering field, and then replaces the old stored record with the new one selected from your inserted data.
But you can create your own record payload class by implementing the interface org.apache.hudi.common.model.HoodieRecordPayload and setting the config hoodie.compaction.payload.class to your class. (The Hudi configuration docs list more related options.)
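A minimal sketch of such a payload, assuming the column is called amount and extending the default payload rather than implementing the interface from scratch (the class name, field name and null handling here are illustrative, not part of Hudi):
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
import org.apache.hudi.common.util.Option;

// Illustrative payload that adds the incoming "amount" to the already
// stored value instead of overwriting it (field name "amount" is assumed).
public class SumAmountPayload extends OverwriteWithLatestAvroPayload {

    public SumAmountPayload(GenericRecord record, Comparable orderingVal) {
        super(record, orderingVal);
    }

    @Override
    public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
            throws IOException {
        Option<IndexedRecord> incoming = getInsertValue(schema);
        if (!incoming.isPresent()) {
            // Nothing to merge (e.g. a delete); keep the default behaviour.
            return super.combineAndGetUpdateValue(currentValue, schema);
        }
        GenericRecord incomingRecord = (GenericRecord) incoming.get();
        GenericRecord storedRecord = (GenericRecord) currentValue;
        double summed = toDouble(storedRecord.get("amount")) + toDouble(incomingRecord.get("amount"));
        incomingRecord.put("amount", summed);
        return Option.of(incomingRecord);
    }

    private static double toDouble(Object value) {
        return value == null ? 0d : Double.parseDouble(value.toString());
    }
}
The class is then wired in through the hoodie.compaction.payload.class config mentioned above.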

Joining two large tables which have large regions of no overlap

Let's say I have the following join (modified from Spark documentation):
impressionsWithWatermark.join(
clicksWithWatermark,
expr("""
clickAdId = impressionAdId AND
clickTime >= cast(impressionTime as date) AND
clickTime <= cast(impressionTime as date) + interval 1 day
""")
)
Assume that both tables have trillions of rows for 2 years of data. I think that joining everything from both tables is unnecessary. What I want to do is create subsets, similar to this: create 365 * 2 * 2 smaller dataframes so that there is 1 dataframe for each day of each table for 2 years, then create 365 * 2 join queries and take a union of them. But that is inefficient. I am not sure how to do it properly. I think I should add table.repartition(factor/multiple of 365 * 2) for both tables and add write.partitionBy(cast(impressionTime as date), cast(impressionTime as date)) to the streamwriter, and set the number of executors times cores to a factor or multiple of 365 * 2.
What is a proper way to do this? Does Spark analyze the query and optimize it so that the entries from a single day are automatically put in the same partition? What if I am not joining all records from the same day, but rather from the same hour, and there are very few records from 11pm to 1am? Does Spark know that it is most efficient to partition by day, or will it find something even more efficient?
First, just to restate what I have understood from your question: you have two tables with two years' worth of data, around a trillion records in each, and you want to join them efficiently for a given timeframe, for example a specific month of a year or a custom date range, reading only that much data and not all of it.
Now, to answer your question, you can do something like the below.
First, when you are writing the data to create the tables, partition them by a day column so that each day's data sits in a separate directory/partition for both tables. Spark won't do that by default for you; you have to decide it based on your dataset.
Second, when you read the data and perform the join, do not do it on the whole table. Read only the specific partitions by applying a filter condition on the dataframe, so that Spark applies partition pruning and reads only the partitions that satisfy the condition in the filter clause.
Once you have filtered the data at read time and stored it in dataframes, join those dataframes on the key relationship; that is the most efficient and performant way of doing it at the first shot, as sketched below.
If it is still not fast enough, you can look at bucketing your data along with partitioning, but in most cases that is not required.
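A rough sketch of that write-then-read pattern in Java (paths, partition column names and the date range are assumptions for illustration; the join condition is the one from the question, and impressions/clicks stand for the already loaded Dataset<Row> inputs):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.expr;
import static org.apache.spark.sql.functions.to_date;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().getOrCreate();

// 1. Write each table partitioned by a derived date column so that every
//    day ends up in its own directory.
impressions.withColumn("impression_date", to_date(col("impressionTime")))
    .write().partitionBy("impression_date").mode("overwrite")
    .parquet("/data/impressions");
clicks.withColumn("click_date", to_date(col("clickTime")))
    .write().partitionBy("click_date").mode("overwrite")
    .parquet("/data/clicks");

// 2. Read back only the needed date range; the filter on the partition
//    column lets Spark prune every other directory before the join.
Dataset<Row> imp = spark.read().parquet("/data/impressions")
    .filter(col("impression_date").between("2023-01-01", "2023-01-31"));
Dataset<Row> clk = spark.read().parquet("/data/clicks")
    .filter(col("click_date").between("2023-01-01", "2023-01-31"));

// 3. Join only the pruned subsets using the original condition.
Dataset<Row> joined = imp.join(clk, expr(
    "clickAdId = impressionAdId AND "
    + "clickTime >= cast(impressionTime as date) AND "
    + "clickTime <= cast(impressionTime as date) + interval 1 day"));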

PySpark dataframe drops records while writing to a hive table

I am trying to write a PySpark dataframe to a Hive table, which is also being created by the line below
parks_df.write.mode("overwrite").saveAsTable("fs.PARKS_TNTO")
When I print the count of the dataframe with parks_df.count(), I get 1000 records.
But in the final table fs.PARKS_TNTO, I get 980 records, so 20 records are getting dropped. How can I resolve this issue? Also, how can I capture the records which are getting dropped? There are no partitions on the final table fs.PARKS_TNTO.

Why is row count different when using spark.table().count() and df.count()?

I am trying to use Spark to read data stored in a very large table (181,843,820 rows and 50 columns) which is my training set; however, when I use spark.table() I noticed that the row count is different from the row count returned by the DataFrame's count(). I am currently using PyCharm.
I want to preprocess the data in the table before I can use it further as a training set for a model I need to train.
When loading the table I found out that the DataFrame I'm loading the table to is much smaller (10% of the data in this case).
What I have tried:
raised the spark.kryoserializer.buffer.max capacity.
loaded a smaller table (70k rows) into a DataFrame and found no difference in the count() outputs.
This sample is very similar to the code I ran in order to investigate the problem:
df = spark.table('myTable')
print(spark.table('myTable').count()) # output: 181,843,820
print(df.count()) # output 18,261,961
I expect both outputs to be the same (the original 181m), yet they are not, and I don't understand why.

Fetch record counts from a list of tables in Netezza using Python

I have to fetch record counts from tables starting with WC_* from database "TEST_DB".
Currently I'm using the code below to do that, but it's taking too long as there are billions of records in many tables. Is there any way to improve performance?
for item in list_tables:
    total_count_query = "select count(*) from TEST_DB.." + item[0]
    cur.execute(total_count_query)
    total_record_count = cur.fetchone()[0]
    print(item[0], " : ", total_record_count)
There's a column called reltuples in the system table "_V_TABLE" which gives the row count of every table, provided statistics are collected regularly. You don't have to calculate the count for each table yourself:
total_count_query = "select reltuples from " + database + ".._v_table where tablename = '" + item[0] + "'"
