Incrementing aggregate the hudi table value using spark - apache-spark

I have a spark streaming job that loads the data in apache hudi table every 10 seconds. It update the row in hudi table if the row already exists. Actually, it is doing an upsert operation.
But in hudi table, there is an amount column that is also updated with the new value.
for example
1 batch, id=1, amount value=10. --> in table, amount value = 10
2 batch, id=1, amount value=20. --> in table, amount value = 20
But I need the amount value should 30 not 20. I need to incrementally aggregate the amount column.
Does hudi support incremental aggregation usecase without using external caching/db?

Apache Hudi uses the class org.apache.hudi.common.model.OverwriteWithLatestAvroPayload by default to precombine your dataframe records and upsert old stored records, which simply checks if your dataframe contains duplicated records with the same key, and choose the records which have the max ordering field, then replace the old stored records with the new ones coming selected from your inserted data.
But you can create your own record payload class by implementing the interface org.apache.hudi.common.model.HoodieRecordPayload, and setting the config hoodie.compaction.payload.class to your class. (Here is more configs)

Related

Best Cassandra data model for maintaining bounded lists per user

I have Kafka streams containing interactions of users with a website, so every event has a timestamp and information about the event. For each user I want to store the last K events in Cassandra (e.g. 100 events).
Our website is constantly experiencing bot / heavy users that is why we want to cap events, just to consider "normal" users.
I currently have the current data model in Cassandra:
user_id, event_type, timestamp, event_blob
where
<user_id, event_type> = partition key, timestamp = clustering key
For now we write a new record in Cassandra as soon as a new event happens and later on we go and clean up "heavier" partitions (i.e. count of events > 100).
This doesn't happen in real time and until we don't clean up the heavy partitions we sometimes get bad latencies when reading.
Do you have any suggestions of a better table design for such case?
Is there a way to tell Cassandra to store only at most K elements for partition and expire the old ones in a FIFO way? Or is there a better table design that I can opt for?
Do you have any suggestions of a better table design for such case?
When data modeling for scenarios like this, I recommend a pattern that makes use of three things:
Default TTL set on the table.
Clustering on a time component in descending order.
Adjust query to use a range on the timestamp, never querying data past the TTL.
TTL:
later on we go and clean up "heavier" partitions
How long (on average) before the cleanup happens? One thing I would do, is to use a TTL on that table set to somewhere around the maximum amount of time before your team usually has to clean them up.
Clustering Key, Descending Order:
So your PRIMARY KEY definition looks like this:
PRIMARY KEY ((user_id,event_type),timestamp)
Make sure that you're also clustering in a descending order on timestamp.
WITH CLUSTERING ORDER BY (timestamp DESC)
This is important to use in conjunction with your TTL. Here, your tombstones are on the "bottom" of the partition (when sorting on timestamp descinding) and the recent data (the data you care about) is at the "top" of the partition.
Range Query:
Finally, make sure your query has a range component on the timestamp.
For example: if today is the 11th, and my TTL is 5 days, I can then query the last 4 days of data without pulling back tombstones:
SELECT * FROM events
WHERE user_id = 11111 AND event_type = 'B'
AND timestamp > '2020-03-07 00:00:00';
Problem with your existing implementation is that deletes create tombstones which eventually cause latencies in the read. Creating too many tombstones is not recommended.
FIFO implementation based on count (number of rows per partition) is not possible. The better approach for your use case is not to delete records in the same table. Use Spark to migrate the table into a new temp table and remove the extra records in the migration process. Something like:
1) Create a new table
2) Using Spark , read from the orignal table , migrate all required records (filter extra records) and write to new temp table.
3) Truncate the orignal table. Note that truncate operation do not create Tombstones.
4) Migrate everything from the temp table back to orignal table using Spark.
5) Truncate the temp table.
You can do this in maintenance window of your application ( something like once in a month) until then you can restrict reads with Limit 100 per partition.

Best way to get the dropped records after using "dropDuplicates" function in Spark

I have a dataframe which contains duplicate records based on a column. My requirement is to drop duplicates based on the column and perform certain operation on unique records. And also identify the duplicate record based on column and save it to hbase for audit purposes.
input file:
A,B
1,2
1,3
2,5
Dataset<Row> datasetWithDupes=session.read().option("header","true").csv("inputfile");
//drop dupliactes on column A
Dataset<Row> datasetWithoutDupes = datasetWithDupes.dropDuplicates("A")
A dataset is required with dropped record. I have tried 2 options
Using except function
Dataset<Row> droppedRecords = datasetWithDupes.except("datasetWithoutDupes ")
this should contain the dropped records
Using the ranking function directly without using "dropDuplicates"
datasetWithDupes.withColumn("rank", functions.row_number().over(Window.partitionBy("A").seq()).orderBy("B").seq())))
then filtering based on the rank to get the duplicate records.
Is there any faster way to get the duplicated records, because I am using it in streaming application and most of the processing time(50%) is getting elapsed on finding the duplicate records and saving it to hbase table. I have batch interval of 10 sec and around 5 sec is getting spent for the task for filtering the duplicate record.
Please suggest to achieve in faster way.

Cassandra failure during read query

I have a Cassandra Table with ~500 columns and primary key ((userId, version, shredId), rowId) where shredId is used to distribute data evenly into different partitions. Table also has a default TTL of 2 days to expire data as data are for real-time aggregation. The compaction strategy is TimeWindowCompactionStrategy.
The workflow is:
write data to input table (with consistency EACH_QUORUM)
Run spark aggregation (on rows with same the userId and version)
write aggregated data to output table.
But I'm getting Cassandra failure during read query when size of data gets large; more specifically, once there are more than 210 rows in one partition, read queries fail.
How can I tune my database and change properties to fix this?
After investigation and research, the issued is caused by null values been inserted for some empty column. this creates large amount of tombstones and eventually timeout the query.

Spark and cassandra, range query on clustering key

I have cassandra table with following structure:
CREATE TABLE table (
key int,
time timestamp,
measure float,
primary key (key, time)
);
I need to create a Spark job which will read data from previous table, within specified start and end timestamp do some processing, and flush results back to cassandra.
So my spark-cassandra-connector will have to do a range query on clustering cassandra table column.
Are there any performance differences if I do:
sc.cassandraTable(keyspace,table).
as(caseClassObject).
filter(a => a.time.before(startTime) && a.time.after(endTime).....
so what I am doing is loading all the data into Spark and applying filtering
OR if I do this:
sc.cassandraTable(keyspace, table).
where(s"time>$startTime and time<$endTime)......
which filters all the data in Cassandra and then loads smaller subset to Spark.
The selectivity of a range query will be around 1%
It is impossible to include partition key in the query.
Which of these two solutions is preferred?
sc.cassandraTable(keyspace, table).where(s"time>$startTime and time<$endTime)
Will be MUCH faster. You are basically doing a percentage (if you only pull 5% of the data 5% of the total work) of the full grab in the first command to get the same data.
In the first case you are
Reading all of the data from Cassandra.
Serializing every object and then moving it to Spark.
Then finally filtering everything.
In the second case you are
Reading only the data you actually want from C*
Serializing only this tiny subset
There is no step 3
As an additional comment you can also put your case class type right in the call
sc.cassandraTable[CaseClassObject](keyspace, table)

Spark-Hive partitioning

The Hive table was created using 4 partitions.
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following lines in the spark code insert data into this table
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen in the hive table, it has 128 partitions instead of 4 buckets.
The defaultParallelism cannot be reduced to 4 as that leads to a very very slow system. Also, I have tried the DataFrame.coalesce method but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of today {spark 2.2.0} Spark does not support writing to bucketed hive tables natively using spark-sql. While creating the bucketed table, there should be a clusteredBy clause on one of the columns form the table schema. I don't see that in the specified CreateTable statement. Assuming, that it does exist and you know the clustering column, you could add the
.bucketBy([colName])
API while using DataFrameWriter API.
More details for Spark2.0+: [Link] (https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html)

Resources