Duplicate records when Spark tasks fail - apache-spark

I am seeing duplicate records in a Cassandra table when a Spark task fails and is restarted. This is the schema of the table I am inserting into:
CREATE TABLE duplicate_record (
    object_id bigint,
    entity_key timeuuid,
    PRIMARY KEY (object_id, entity_key)
);
Sample duplicate records in the table:
1181592431 uuid
1181592431 uuid1
8082869622 uuid2
8082869622 uuid3
I have a DataFrame produced by a left join between Oracle and Cassandra, so it contains records that already exist in Cassandra as well as new records generated from Oracle.
I map over each record to check whether an entity_key already exists. If it does, I reuse it; otherwise I generate a fresh entity_key for the new record and then save. I am using saveToCassandra to insert this DataFrame into Cassandra.
When a task fails and is restarted, records that were already inserted are inserted again with a different entity_key. I guess the record inserted during the successful attempt is not visible when the task is resubmitted, resulting in duplicate records.
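Roughly, my write path looks like the sketch below (names are placeholders; it assumes the Spark-Cassandra connector plus the Java driver's UUIDs helper, and spark.cassandra.connection.host is configured elsewhere):

import java.util.UUID
import com.datastax.spark.connector._              // adds saveToCassandra / SomeColumns
import com.datastax.driver.core.utils.UUIDs        // time-based UUIDs (Java driver 3.x)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("entity-key-writer").getOrCreate()

case class Record(objectId: Long, entityKey: UUID)

// Stand-in for the Oracle/Cassandra left join: (object_id, existing entity_key if any)
val joined = spark.sparkContext.parallelize(Seq(
  (1181592431L, Some(UUIDs.timeBased())),   // row that already exists in Cassandra
  (8082869622L, None)                       // new row coming from Oracle
))

val toWrite = joined.map { case (objectId, existingKey) =>
  // reuse the existing key, otherwise mint a fresh timeuuid for the new row
  Record(objectId, existingKey.getOrElse(UUIDs.timeBased()))
}

// keyspace name is a placeholder; the table is the one shown above
toWrite.saveToCassandra("my_keyspace", "duplicate_record",
  SomeColumns("object_id", "entity_key"))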

Spark speculative execution can cause duplicate records, so please set spark.speculation to false.
If it still happens, it could be due to some Spark nodes/tasks getting restarted. There was a Spark bug around duplicate record commits; it was resolved in version 2.1.3:
https://issues.apache.org/jira/browse/SPARK-24589
Please ensure you are running on 2.1.3+.
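For reference, a minimal sketch of setting this explicitly when building the session (the app name is a placeholder; spark.speculation already defaults to false, so this only makes the intent explicit):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-writer")              // placeholder app name
  .config("spark.speculation", "false")     // no speculative duplicate task attempts
  .getOrCreate()

The same setting can also be passed at submission time with --conf spark.speculation=false.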

Related

Cassandra fails to start after schema changes

I'm using Cassandra 3.11.6 on CentOS 7 with a 3-node cluster. I ran some schema changes (dropping tables/materialized views, altering tables, etc.), and after that one of the materialized views started failing with this error:
org.apache.cassandra.schema.SchemaKeyspace$MissingColumns: No partition key columns found in schema table for my_keyspace.my_materialized_view.
I wanted to replace that materialized view with a table of the same name, which might be why it's failing.
I ran nodetool describecluster and found that the schema versions differed between nodes. I tried to run a repair, which didn't work, then I restarted the nodes, but they didn't come back up.
This is the error showing up in cassandra.log:
ERROR [main] 2020-12-09 10:13:15,827 SchemaKeyspace.java:1017 - No partition columns found for table my_keyspace.my_materialized_view in system_schema.columns. This may be due to corruption or concurrent dropping and altering of a table. If this table is supposed to be dropped, run the following query to cleanup: "DELETE FROM system_schema.tables WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_materialized_view'; DELETE FROM system_schema.columns WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_materialized_view';" If the table is not supposed to be dropped, restore system_schema.columns sstables from backups.
org.apache.cassandra.schema.SchemaKeyspace$MissingColumns: No partition key columns found in schema table for my_keyspace.my_materialized_view
at org.apache.cassandra.schema.SchemaKeyspace.fetchColumns(SchemaKeyspace.java:1106) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.schema.SchemaKeyspace.fetchTable(SchemaKeyspace.java:1046) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.schema.SchemaKeyspace.fetchTables(SchemaKeyspace.java:1000) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspace(SchemaKeyspace.java:959) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspacesWithout(SchemaKeyspace.java:936) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.schema.SchemaKeyspace.fetchNonSystemKeyspaces(SchemaKeyspace.java:924) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:92) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:82) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:269) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:757) [apache-cassandra-3.11.6.jar:3.11.6]
I tried starting Cassandra with -Dcassandra.ignore_corrupted_schema_tables=true, but it didn't work.
It looks like you've either made concurrent updates to your schema or at least issued multiple DDL statements close together, causing a schema disagreement in your cluster.
You are supposed to wait for a single DDL change to propagate through the cluster and check that the schema is in agreement before making the next DDL change, to prevent disagreement (a client-side sketch of that check is at the end of this answer).
My suggestion is to attempt to start the node where you were performing the schema changes and leave the other nodes down temporarily. Hopefully it's also a seed node.
Remove the ignore_corrupted_schema_tables flag and see if you can bring it back online. If it comes back, proceed to the next node and watch the startup sequence (do a tail -f on the system.log). Keep going until all nodes are back online.
The issue is that depending on the state of the schema on each node, it may be difficult to "unscramble the egg". Good luck!
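If you script your DDL, one way to enforce that pause is to check schema agreement from the client before issuing the next statement. A rough sketch, assuming the DataStax Java driver 4.x (contact point defaults to localhost; keyspace/table/column names are placeholders):

import com.datastax.oss.driver.api.core.CqlSession

val session = CqlSession.builder().build()

// one DDL statement at a time...
session.execute("ALTER TABLE my_keyspace.my_table ADD new_col text")   // placeholder DDL

// ...then wait until every reachable node reports the same schema version
while (!session.checkSchemaAgreement()) {
  Thread.sleep(1000)
}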

Can this use-case be handled with spark-sql streaming and Cassandra?

I need to do a PoC for a business use-case.
Use case:
I need to update a record in a Cassandra table if it already exists.
Does Spark Streaming support comparing each record and updating an existing Cassandra record?
For each record received from a Kafka topic, I want to check whether it is already in Cassandra; if yes, update the record, otherwise insert a new one.
How can this be done using Spark Structured Streaming and Cassandra?
Any snippet or sample would be appreciated.
Do a normal write to Cassandra using the Spark-Cassandra connector. If the row key is already present, the row gets updated; if not, it is inserted.
This is how Cassandra works: inserts and updates both perform write operations (writes are effectively upserts).
Hope this helps!
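A rough sketch of that pattern with Structured Streaming writing through the connector via foreachBatch (assumes Spark 2.4+ and the Spark-Cassandra connector on the classpath; the broker, topic, keyspace, table names and the CSV-style payload are placeholders):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("kafka-to-cassandra").getOrCreate()
import spark.implicits._

// Read the raw Kafka stream (placeholder broker/topic)
val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "my_topic")
  .load()

// Assume a simple "id,value" payload just for illustration
val records = kafka.selectExpr("CAST(value AS STRING) AS raw")
  .select(split($"raw", ",").getItem(0).as("id"),
          split($"raw", ",").getItem(1).as("value"))

// Writing through the connector is an upsert: an existing primary key is updated, a new one is inserted
val query = records.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/kafka-to-cassandra-checkpoint")  // placeholder path
  .start()

query.awaitTermination()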

Spark Hive insert transaction: batch data at most once

I have data to insert into Hive each week. For example, in the 1st week I insert:
spark.range(0,2).write.mode("append").saveAsTable("batches")
and in the 2nd week I insert:
spark.range(2,4).write.mode("append").saveAsTable("batches")
My worry is: if the record with id 2 is inserted and then some exception occurs so that 3 is not inserted, when I insert the 2nd week's data again there will be two records with id 2.
I googled and found:
Hive is not suited for deleting a particular set of records (link), so I cannot delete the data left over from the failed 2nd-week insert.
I thought I could use Hive transactions, but Spark (version 2.3) is currently not fully compliant with Hive transactional tables; it seems we cannot even read a Hive table once transactions are enabled on it.
I also saw on another website: "Hive's ACID feature (which introduces transactions) is not required for inserts, only updates and deletes. Inserts should be supported on a vanilla Hive shell." But I do not understand why a transaction is not needed for inserts, or what that means.
So, if I do not want duplicated data, i.e. I want the batch to run at most once, what should I do?
If you say there won't be an exception between 2 and 3, consider another case where I have many tables to write:
spark.range(2,4).write.mode("append").saveAsTable("batches")
val x=1/0
spark.range(2,4).write.mode("append").saveAsTable("batches2")
I tested it: the new records 2 and 3 were inserted into table "batches" but not into "batches2". So if I want to insert again, must I insert only into "batches2"? But how can I know where the exception happened and which tables I should insert again? I would have to add many try/catch blocks, which makes the code hard to read and write. And what about exceptions like the disk being full or the power going off?
How can I prevent duplicated data?
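One idea I am considering (untested) is to make each weekly load idempotent by tagging rows with the week and overwriting only that week's partition on a retry, so a re-run replaces the half-written data instead of appending to it. A rough sketch, assuming Spark 2.3+ dynamic partition overwrite, a Hive-enabled build, and a hypothetical batch_week column:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .appName("weekly-batches")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")  // overwrite only the partitions being written
  .enableHiveSupport()
  .getOrCreate()

// Week 1: create the partitioned table, tagging every row with its batch
spark.range(0, 2)
  .withColumn("batch_week", lit(1))
  .write
  .partitionBy("batch_week")
  .saveAsTable("batches")

// Week 2, and any retry of week 2: overwrite only the batch_week=2 partition
spark.range(2, 4)
  .withColumn("batch_week", lit(2))
  .write
  .mode("overwrite")
  .insertInto("batches")

But I am not sure whether this covers all the failure cases, hence the question.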

spark Dataframe execute UPDATE statement

Hi guys,
I need to perform JDBC operations using an Apache Spark DataFrame.
Basically I have a historical JDBC table called Measures where I have to do two operations:
1. Set the endTime validity attribute of the old measure record to the current time
2. Insert a new measure record with endTime set to 9999-12-31
Can someone tell me how to perform (if possible) an UPDATE statement for the first operation and an INSERT for the second?
I tried to use this statement for the first operation:
val dfWriter = df.write.mode(SaveMode.Overwrite)
dfWriter.jdbc("jdbc:postgresql:postgres", tableName, prop)
But it doesn't work because there is a duplicate key violation. And if we can do an update, how can we do a DELETE statement?
Thanks in advance.
I don't think it's supported out of the box by Spark yet. What you can do is iterate over the DataFrame/RDD, for example with foreachPartition(), and manually update/delete rows in the table using the JDBC API.
Here is a link to a similar question:
Spark Dataframes UPSERT to Postgres Table
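A rough sketch of that manual approach (the JDBC URL, credentials, table and column names are placeholders, and it assumes the PostgreSQL JDBC driver is on the executors' classpath):

import java.sql.{DriverManager, Timestamp}
import org.apache.spark.sql.DataFrame

// Hypothetical: df holds the measure ids whose current record should be closed out
def closeOldMeasures(df: DataFrame): Unit = {
  df.rdd.foreachPartition { rows =>
    // one connection per partition, reused for every row in it
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/postgres", "user", "password")   // placeholder URL/credentials
    try {
      val stmt = conn.prepareStatement(
        "UPDATE measures SET end_time = ? WHERE measure_id = ? AND end_time = '9999-12-31'")
      rows.foreach { row =>
        stmt.setTimestamp(1, new Timestamp(System.currentTimeMillis()))
        stmt.setLong(2, row.getAs[Long]("measure_id"))
        stmt.addBatch()
      }
      stmt.executeBatch()
      stmt.close()
    } finally {
      conn.close()
    }
  }
}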

how to archive or delete cassandra data after receiving event in wso2bam

I use WSO2 BAM version 2.3.0, where I defined a stream holding a large amount of data in a Cassandra datasource. Currently my Hive script processes all events from the keyspace, although 99% of the data is unnecessary, and it takes up disk space too.
My idea is to clear this data once it becomes unnecessary.
The format of the stream is:
{"streamId":"kroki_i_kolejki_zlecen:1.0.0","name":"kroki_i_kolejki_zlecen","version":"1.0.0","nickName":"Kroki i kolejki zlecen","description":"Wyniki i daty zamkniecia zlecen","payloadData":[{"name":"casenum","type":"STRING"},{"name":"type_id","type":"STRING"},{"name":"id_zlecenie","type":"STRING"},{"name":"sid","type":"STRING"},{"name":"step_name","type":"STRING"},{"name":"proc_name","type":"STRING"},{"name":"step_desc","type":"STRING"},{"name":"audit_date","type":"STRING"},{"name":"audit_usecs","type":"STRING"},{"name":"user_name","type":"STRING"}]}
My intention is to delete the data with the same payload_id_zlecenie column value after I receive an event with a specific payload_type_id.
In a relational database this would be equivalent to the query:
delete from kroki_i_kolejki_zlecen where payload_id_zlecenie = [argument];
Is it possible to do this?
To my knowledge, you cannot delete Cassandra data from Hive. The link [1] given by Inosh describes how to archive Cassandra records older than a specific time duration (e.g. records older than 3 months). All the archived data will be stored in a column family with the suffix "_arch". In that feature a custom analyzer is used inside the generated Hive script to delete Cassandra rows. Also note that it takes about 10 days for deleted rows and their row keys to be removed completely; until that happens you will see some empty fields associated with the Cassandra row ID.
Inosh's link [2] is the real solution to your problem. Once incremental processing is enabled, the Hive script will process only the Cassandra rows that were not processed in the previous execution. That means Hive will aggregate the values processed in each execution and keep them for the future; the next time, it will use that stored value together with the last processed timestamp and process all records that arrived after that timestamp. The new and old aggregated values are then combined to get the overall value.
[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data
[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660
You can use the Cassandra data archival feature [1] to archive Cassandra data.
Also refer to Incremental Analysis [2], a new feature released with BAM 2.4.0. Using that feature, received data can be analyzed incrementally, without processing all events in the column families.
[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data
[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660
