Spark Job for Inserting data to Cassandra - apache-spark

I am trying to write data to Cassandra tables using Spark on Scala. Sometimes the Spark task fails midway, leaving partial writes. Does Spark roll back the partial writes when the failed task is retried from the beginning?

No. Spark (and Cassandra, for that matter) doesn't do commit-style, transactional inserts scoped to a whole task. This means that your writes must be idempotent, otherwise you can end up with strange behavior.

No, but if I understand correctly, you can simply reprocess your data, which will overwrite the partial writes. When writing to Cassandra, inserting data with the same primary key performs an update (an upsert).
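As a sketch of that upsert behaviour (the keyspace, table, and column names below are assumptions, not from the question), an Append write through the Spark Cassandra Connector turns each row into an independent CQL INSERT, and rows sharing a primary key simply overwrite earlier values:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Two sample rows with the same assumed primary key (user_id, event_time):
// on write, the second simply overwrites the first.
val rows = Seq(
  (1, "2021-01-01T10:00", "login"),
  (1, "2021-01-01T10:00", "logout")
)

def writeEvents(spark: SparkSession): Unit = {
  import spark.implicits._
  val df = rows.toDF("user_id", "event_time", "action")
  df.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "ks", "table" -> "events")) // assumed names
    .mode(SaveMode.Append) // Append to Cassandra is an upsert, not a duplicate
    .save()
}
```

This is why re-running a failed job over the same input is safe: the retried task rewrites identical keys rather than creating duplicate rows.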

Related

Can Apache Spark be used in place of Sqoop

I have tried connecting Spark over JDBC to fetch data from MySQL, Teradata, and similar RDBMSs, and was able to analyse the data.
Can Spark be used to store the data to HDFS?
Is there any possibility of Spark outperforming Sqoop at these tasks?
Looking forward to your valuable answers and explanations.
There are two main things to know about Sqoop and Spark here. The main difference is that Sqoop will read the data from your RDBMS regardless of how your table is configured, so you don't need to worry much about that.
With Spark over a JDBC connection, how you load the data is a little different. If your table doesn't have a column like a numeric ID or a timestamp, Spark will load ALL the data into one single partition, and only then try to process and save it. If you do have a column to partition on, Spark can sometimes be even faster than Sqoop.
I would recommend you take a look at the Spark JDBC data source documentation.
The conclusion is: if you are doing a simple daily export with no transformation, I would recommend Sqoop, since it is simple to use and will not impact your database that much. Spark will work well IF your table is ready for that; otherwise, go with Sqoop.
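A sketch of that JDBC partitioning (the URL, table name, and bounds are placeholders): without partitionColumn, Spark issues one query and lands everything in a single partition; with it, Spark issues numPartitions range queries in parallel.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Splitting options; "id" is assumed to be a roughly uniformly
// distributed numeric column of the source table.
val jdbcOpts = Map(
  "url"             -> "jdbc:mysql://dbhost:3306/sales", // placeholder
  "dbtable"         -> "orders",                         // placeholder
  "partitionColumn" -> "id",        // column Spark splits the read on
  "lowerBound"      -> "1",
  "upperBound"      -> "1000000",
  "numPartitions"   -> "8"          // 8 parallel range queries
)

def loadOrders(spark: SparkSession): DataFrame =
  spark.read.format("jdbc").options(jdbcOpts).load()
```

Note that lowerBound and upperBound only shape the split ranges; rows outside them are still read, just by the edge partitions.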

How to write data into a Hive table?

I use Spark 2.0.2.
While learning how to write a Dataset to a Hive table, I understood that we can do it in two ways:
using sparkSession.sql("your sql query")
using dataframe.write.mode(SaveMode.<mode>).insertInto("tableName")
Could anyone tell me the preferred way of loading a Hive table using Spark?
In general I prefer the second approach. First, because for many rows you cannot build such a long SQL string, and second, because it reduces the chance of errors and of issues like SQL injection attacks.
In the same way, with plain JDBC I use PreparedStatements as much as possible.
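A minimal sketch of that second approach (the database and table names are assumptions). One detail worth knowing: insertInto matches columns by position, not by name, so the DataFrame's column order must match the table definition.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val targetTable = "mydb.target_table" // assumed, pre-existing Hive table

def buildSession(): SparkSession =
  SparkSession.builder()
    .appName("hive-write")
    .enableHiveSupport() // required so Spark can see the Hive metastore
    .getOrCreate()

def appendToHive(df: DataFrame): Unit =
  // insertInto resolves columns by POSITION, not by name; reorder the
  // DataFrame with select(...) first if its columns don't match the table.
  df.write.mode(SaveMode.Append).insertInto(targetTable)
```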
Think of it this way: we need to apply updates to Hive on a daily basis.
This can be achieved in two ways:
Process all the data in the Hive table.
Process only the affected partitions.
For the first option, SQL works like a gem, but keep in mind that the data should be small enough to process in its entirety.
The second option works well if you want to process only the affected partitions: use dataframe.write.mode(SaveMode.Overwrite).partitionBy(...) against the table's path.
You should write the logic in such a way that it processes only the affected partitions. This is the approach for tables whose data runs to millions or billions of records.
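A hedged sketch of that second option (the path and partition column are assumptions). Note that a plain mode(Overwrite) replaces the whole path; overwriting only the affected partitions needs Spark 2.3+ with partitionOverwriteMode set to dynamic.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

val partitionCol = "dt"                 // assumed date partition column
val tablePath    = "/warehouse/events"  // assumed table location

def overwriteAffectedPartitions(updated: DataFrame): Unit = {
  // With dynamic partition overwrite (Spark 2.3+), only the dt=...
  // directories present in `updated` are rewritten; all other
  // partitions on disk are left untouched.
  updated.sparkSession.conf
    .set("spark.sql.sources.partitionOverwriteMode", "dynamic")
  updated.write
    .mode(SaveMode.Overwrite)
    .partitionBy(partitionCol)
    .parquet(tablePath)
}
```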

Read from Hbase + Convert to DF + Run SQLs

Edit
My use case is a Spark Streaming app (Spark 2.1.1 + Kafka 0.10.2.1), wherein I read from Kafka and, for each message/trigger, need to pull data from HBase. After the pull, I need to run some SQL statements on the data received from HBase.
Naturally, I intend to push the processing (the HBase read and the SQL execution) to the worker nodes to achieve parallelism.
So far, my attempts to convert the data from HBase to a DataFrame (so that I can run SQL statements) are failing. Someone else mentioned that it's not "allowed", since that part runs on the executors. However, running those pieces on the worker nodes is my conscious choice.
Is that sound thinking? If not, why not?
What's the recommendation on that, or on the overall idea?
Reading from HBase and running SQL for every streamed record seems like too much happening in a streaming app.
In any case, you can create a connection to HBase per partition, fetch the records, and then compare. I am not sure about the SQL part; if it is just another read per streaming record, again handle it at the partition level in Spark.
But the above approach will be time-consuming, so make sure you finish all of it before the next batch starts.
You also mentioned converting "HBase to DataFrame" and doing it "in parallel". Those pull in opposite directions: with a DataFrame you start from one logical dataset (perhaps read from HBase once) and then parallelize it. Hope this clears up some of your doubts.
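A sketch of the connection-per-partition pattern mentioned above (the HBase table name and key layout are assumptions): the connection is opened once per partition on the executor, rather than once per record or on the driver.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.spark.rdd.RDD

val lookupTable = "lookup" // assumed HBase table name

def enrichFromHBase(keys: RDD[Array[Byte]]): Unit =
  keys.foreachPartition { part =>
    // One connection per partition, created on the executor; HBase
    // connections are not serializable, so they cannot be built on
    // the driver and shipped out.
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf(lookupTable))
    try {
      part.foreach { key =>
        val result = table.get(new Get(key))
        // ... compare / process the Result here ...
      }
    } finally {
      table.close()
      conn.close()
    }
  }
```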

Reading cassandra simultaneously while writing

I am trying to read a Cassandra table immediately while data is being inserted into it. The table has a timestamp as one of the primary key columns (not the partition key). We have a Spark job that reads from Kafka and writes to Cassandra every 15 seconds. A server component reads from Cassandra almost immediately after the Spark job starts inserting the data. Since the data being inserted into Cassandra is huge, we read it in pages. While reading in pages, we observed that a few records are skipped before the read reaches the last record.
But when we run the same page-by-page reading logic on already-inserted data, it works fine (no records are skipped). Is there any way to read the data in pages while data is being inserted into Cassandra?
What you observe might be a result of the current Cassandra consistency level. To make sure all written data is available to read, you could use the ALL consistency level, but then every write has to wait for all replica nodes to acknowledge the change.
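As a sketch, the read consistency can be raised through the Spark Cassandra Connector's input.consistency.level setting (the keyspace and table names are assumptions). ALL guarantees a page only returns rows every replica has, at the cost of availability; QUORUM writes combined with QUORUM reads is the usual lighter alternative that still makes reads see acknowledged writes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val consistency = "ALL" // or "QUORUM", if the writes also use QUORUM

def buildReader(): SparkSession =
  SparkSession.builder()
    .appName("consistent-read")
    .config("spark.cassandra.input.consistency.level", consistency)
    .getOrCreate()

def readEvents(spark: SparkSession): DataFrame =
  spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "ks", "table" -> "events")) // assumed names
    .load()
```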

Spark Streaming and Cassandra - keeping data of previous insert in memory

I have a Spark Streaming process, which inserts data into Cassandra.
The result of each computation depends on the previous insert, so what I need is a way to efficiently keep only the data of the previous insert in memory. That in-memory structure will also be updated with each insert.
I considered using accumulator variables, but I would need too many values (more than 5-6000). Querying them from Cassandra every time definitely does not give good performance.
What do you think I should do?
