spark hive insert transaction,batch data at most once - apache-spark

I have data to insert to hive each week,for example,1st week I insert:
spark.range(0,2).write.mode("append").saveAsTable("batches")
and 2nd week I insert:
spark.range(2,4).write.mode("append").saveAsTable("batches")
I am worry when the record which id is 2 is inserted,some exception occur,3 is not inserted,then I insert the data of 2nd week again,there will be two 2.
I googled,
hive is not suit for delete particular set of records:link,so I can not delete the data left before the exception of 2nd week.
I think I can use hive transaction,but spark is not right now (2.3 version) fully compliant with hive transactional tables. ,seems even we can not read hive if we enable hive transaction.
And I see another website:Hive's ACID feature (which introduces transactions) is not required for
inserts, only updates and deletes. Inserts should be supported on a vanilla
Hive shell.,but I do not why transaction is not needed for insert,and what does it mean.
So,if I do not want to get the duplicated data,or like run at most once,what shall I do?
If you say there won`t be an exception between 2 and 3,but consider another case,if I have many tables to write:
spark.range(2,4).write.mode("append").saveAsTable("batches")
val x=1/0
spark.range(2,4).write.mode("append").saveAsTable("batches2")
I tested it,new records 2 and 3 have been inserted into table "batches",but not inserted into "batches2",so if I want to insert again,must I only insert "batches2"?but how can I know where is the exception,and which tables should I insert again?I must to insert many try and catch,but it makes code hard to read and write.And what about the exception is disk is full or power off?
How to prevent duplicated data?

Related

how to avoid errors when querying hive table being loaded by Spark at the same time

We have a use case where we run an ETL written in spark on top of some streaming data, the ETL writes results to the target hive table every hour, but users are commonly running queries to the target table and we have faced cases of having query errors due to spark loading the table at the same time. What alternatives do we have to avoid or minimize this errors? Any property to the spark job(or to the hive table)? or something like creating a temporary table?
The error is:
java.io.FileNotFoundException: File does not exist [HDFS PATH]
Which i think happens because the metadata says there is a file A that gets deleted during the job execution.
The table is partitioned by year, month, day(using HDFS as storage) and every time the ETL runs it updates(via a partition overwrite) only current date partition. Currently no "transactional" tables are enabled in the cluster(even if they were i tested the use case on a test cluster without luck)
The easy option is to use a table format thats designed to handle concurrent reads and writes like hudi or delta lake. The more complicated version involves using a partitioned append only table that the writer writes to. On completion the writer updates a view to point to the new data. Another possible option is to partition the table on insert time.
Have a set of two tables and a view over them:
CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);
CREATE VIEW foo AS SELECT x, y, z, ... FROM foo_a;
First iteration of ETL process needs to:
Synchronize foo_a -> foo_b
Do the work on foo_b
Drop view foo and recreate it pointing to foo_b
Until step 3 user queries run against table foo_a. From the moment of switch they run against foo_b. Next iteration of ETL will work in the opposite way.
This is not perfect. You need double storage and some extra complexity in the ETL. And anyway this approach might fail if:
user is unlucky enough to hit a short time between dropping and
recreating the view
user submits a query that's heavy enough to run across two iterations of ETL
not sure but check it out
CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);

Cassandra 3.7 CDC / incremental data load

I'm very new to the ETL world and I wish to implement Incremental Data Loading with Cassandra 3.7 and Spark. I'm aware that later versions of Cassandra do support CDC, but I can only use Cassandra 3.7. Is there a method through which I can track the changed records only and use spark to load them, thereby performing incremental data loading?
If it can't be done on the cassandra end, any other suggestions are also welcome on the Spark side :)
It's quite a broad topic, and efficient solution will depend on the amount of data in your tables, table structure, how data is inserted/updated, etc. Also, specific solution may depend on the version of Spark available. One downside of Spark-only method is you can't easily detect deletes of the data, without having a complete copy of previous state, so you can generate a diff between 2 states.
In all cases you'll need to perform full table scan to find changed entries, but if your table is organized specifically for this task, you can avoid reading of all data. For example, if you have a table with following structure:
create table test.tbl (
pk int,
ts timestamp,
v1 ...,
v2 ...,
primary key(pk, ts));
then if you do following query:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("tbl", "test").load()
val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
then Spark Cassandra Connector will push this query down to the Cassandra, and will read only data where ts is in the given time range - you can check this by executing filtered.explain and checking that both time filters are marked with * symbol.
Another way to detect changes is to retrieve the write time from Cassandra, and filter out the changes based on that information. Fetching of writetime is supported in RDD API for all recent versions of SCC, and is supported in the Dataframe API since release of SCC 2.5.0 (requires at least Spark 2.4, although may work with 2.3 as well). After fetching this information, you can apply filters on the data & extract changes. But you need to keep in mind several things:
there is no way to detect deletes using this method
write time information exists only for regular & static columns, but not for columns of primary key
each column may have its own write time value, in case if there was a partial update of the row after insertion
in most versions of Cassandra, call of writetime function will generate error when it's done for collection column (list/map/set), and will/may return null for column with user-defined type
P.S. Even if you had CDC enabled, it's not a trivial task to use it correctly:
you need to de-duplicate changes - you have RF copies of the changes
some changes could be lost, for example, when node was down, and then propagated later, via hints or repairs
TTL isn't easy to handle
...
For CDC you may look for presentations from 2019th DataStax Accelerate conference - there were several talks on that topic.

Cassandra query table without partition key

I am trying to extract data from a table as part of a migration job.
The schema is as follows:
CREATE TABLE IF NOT EXISTS ${keyspace}.entries (
username text,
entry_type int,
entry_id text,
PRIMARY KEY ((username, entry_type), entry_id)
);
In order to query the table we need the partition keys, the first part of the primary key.
Hence, if we know the username and the entry_type, we can query the table.
In this case the username can be whatever, but the entry_type is an integer in the range 0-9.
When doning the extraction we iterate the table 10 times for every username to make sure we try all versions of entry_type.
We can no longer find any entries as we have depleted our list of usernames. But our nodetool tablestats report that there is still data left in the table, gigabytes even. Hence we assume the table is not empty.
But I cannot find a way to inspect the table to figure out what usernames remains in the table. If I could inspect it I could add the usernames left in the table to our extraction job and eventually we could deplete the table. But I cannot simply query the table as such:
SELECT * FROM ${keyspace}.entries LIMIT 1
as cassandra requires the partition keys to make meaningful queries.
What can I do to figure out what is left in our table?
As per the comment, the migration process includes a DELETE operation from the Cassandra table, but the engine will have a delay before actually removing from disk the affected records; this process is controlled internally with tombstones and the gc_grace_seconds attribute of the table. The reason for this delay is fully explained in this blog entry, for a tl dr, if the default value is still in place, Cassandra will need to pass at least 10 days (864,000 seconds) from the execution of the delete before the actual removal of the data.
For your case, one way to proceed is:
Ensure that all your nodes are "Up" and "Healthy" (UN)
Decrease the gc_grace_seconds attribute of your table, in the example, it will set it to 1 minute, while the default is
ALTER TABLE .entries with GC_GRACE_SECONDS = 60;
Manually compact the table:
nodetool compact entries
Once that the process is completed, nodetool tablestats should be up to date
To answer your first question, I would like to put more light on gc_grace_seconds property.
In Cassandra, data isn’t deleted in the same way it is in RDBMSs. Cassandra is designed for high write throughput, and avoids reads-before-writes. So in Cassandra, a delete is actually an update, and updates are actually inserts. A “tombstone” marker is written to indicate that the data is now (logically) deleted (also known as soft delete). Records marked tombstoned must be removed to claim back the storage space. Which is done by a process called Compaction. But remember that tombstones are eligible for physical deletion / garbage collection only after a specific number of seconds known as gc_grace_seconds. This is a very good blog to read more in detail : https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Now possibly you are looking into table size before gc_grace_seconds and data is still there.
Coming to your second issue where you want to fetch some samples from the table without providing partition keys. You can analyze your table content using Spark. The Spark Cassandra Connector allows you to create Java applications that use Spark to analyze database data. You can follow the articles / documentation to write a quick handy spark application to analyze Cassandra data.
https://www.instaclustr.com/support/documentation/cassandra-add-ons/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkJavaApi.html
I would recommend not to delete records while you do the migration. Rather first complete the migration and post that do a quick validation / verification to ensure all records are migrated successfully (this use can easily do using Spark buy comparing dataframes from old and new tables). Post successful verification truncate the old table as truncate does not create tombstones and hence more efficient. Note that huge no of tombstone is not good for cluster health.

Require help in creating design for cassandra data model for my requirement

I have a Job_Status table with 3 columns:
Job_ID (numeric)
Job_Time (datetime)
Machine_ID (numeric)
Other few fields containing stats (like memory, CPU utilization)
At a regular interval (say 1 min), entries are inserted in the above table for the Jobs running on each Machines.
I want to design the data model in Cassandra.
My requirement is to get list (pair) of jobs which are running at the same time on 2 or more than 2 machines.
I have created table with Job_Id and Job_Time as primary key for row but in order to achieve the desired result I have to do lots of parsing of data after retrieval of records.
Which is taking a lot of time when the number of records reach around 500 thousand.
This requirement expects the operation like inner join of SQL, but I can’t use SQL due to some business reasons and also SQL query with such huge data set is also taking lots of time as I tried that with dummy data in SQL Server.
So I require your help on below points:
Kindly suggest some efficient data model in Cassandra for this requirement.
How the join operation of SQL can be achieved/implemented in Cassandra database?
Kindly suggest some alternate design/algorithm. I am stuck at this problem for a very long time.
That's a pretty broad question. As a general approach you might want to look at pairing Cassandra with Spark so that you could do the large join in parallel.
You would insert jobs into your table when they start and delete them when they complete (possibly with a TTL set on insert so that jobs that don't get deleted will auto delete after some time).
When you wanted to update your pairing of jobs, you'd run a spark batch job that would load the table data into an RDD, and then do a map/reduce operation on the data, or use spark SQL to do a SQL style join. You'd probably then write the resulting RDD back to a Cassandra table.

how to archive or delete cassandra data after receiving event in wso2bam

I use WSO2BAM in version 2.3.0 where I defined a stream holding much amount of data in Cassandra datasource. Currently my Hive script processes all events from keyspace where 99% of data is unneccesary. And it takes disk space too.
My idea is to clear this data after it becomes unnecessary.
The format of stream is:
{"streamId":"kroki_i_kolejki_zlecen:1.0.0","name":"kroki_i_kolejki_zlecen","version":"1.0.0","nickName":"Kroki i kolejki zlecen","description":"Wyniki i daty zamkniecia zlecen","payloadData":[{"name":"casenum","type":"STRING"},{"name":"type_id","type":"STRING"},{"name":"id_zlecenie","type":"STRING"},{"name":"sid","type":"STRING"},{"name":"step_name","type":"STRING"},{"name":"proc_name","type":"STRING"},{"name":"step_desc","type":"STRING"},{"name":"audit_date","type":"STRING"},{"name":"audit_usecs","type":"STRING"},{"name":"user_name","type":"STRING"}]}
My intention is to delete data with the same column payload_id_zlecenie after I receive event with specific payload_type_id.
In relational database it would be equal to query:
delete from kroki_i_kolejki_zlecen where payload_id_zlecenie = [argument];
Is it possible to do?
In Hive you cannot delete Cassandra data according to my knowledge. The [1] link given by Inosh describes how to archive Cassandra records older than a specific time duration. (e.g. records older than 3 months) All the archived data will be stored in a column family with the postfix, "_arch". In that feature a custom analyzer is used inside the generated Hive script to delete Cassandra rows. And also note that deleted records will take about 10 days to completely delete entire rows with it's row key. Until that happens you will see some empty fields associated with the Cassandra row ID.
Inosh's [2] is the real solution for your problem. Once incremental processing is enabled, hive script will process only the Cassandra rows unprocessed in the previous hive script execution. That means, the Hive will aggregate the values processed in each execution and will keep them for future. The next time hive will use that value, and previously processed last timestamp and process all the records came after that timestamp. The new aggregated value and older aggregated value will be used to get the overall value.
[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data
[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660
You can use Cassandra data archival feature [1] to archive cassandra data.
Also refer to Incremental Analysis [2] which is a new feature released with BAM 2.4.0. Using that feature, received data can be analyzed incrementally, without processing all events in CFs.
[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data
[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660

Resources