Reading from Cassandra while data is being written - cassandra

I am trying to read a Cassandra table while data is still being inserted into it. The table has a timestamp as part of its primary key (not the partition key). A Spark job reads from Kafka and writes to Cassandra every 15 seconds, and a server component starts reading from Cassandra almost as soon as the Spark job begins inserting. Because the volume of inserted data is large, we read it in pages. While paging, we observed that a few records are skipped before the read reaches the last record.
When we run the same paging logic against data that has already been fully inserted, it works fine (no records are skipped). Is there a way to read the data in pages while it is still being inserted into Cassandra?

What you observe might be a result of the consistency level currently in use. To make sure every written row is visible to reads, you could use consistency level ALL, but each operation then has to wait for all replicas to respond.
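
As a rough illustration of why the consistency level matters, here is a minimal sketch of the usual overlap rule: a read is guaranteed to touch at least one replica that holds an acknowledged write when the number of replicas read plus the number written exceeds the replication factor. The function names and the replication factor of 3 are assumptions made for this example. It only does the arithmetic and does not talk to a cluster, and it does not by itself prove that consistency is the cause of the skipped rows.

```python
# Replicas that must respond for a given consistency level and
# replication factor (rf). Only a few common levels are modelled.
def replicas_required(level: str, rf: int) -> int:
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "TWO":
        return 2
    if level == "QUORUM":
        return rf // 2 + 1
    if level == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {level}")


def read_sees_latest_write(read_cl: str, write_cl: str, rf: int) -> bool:
    # Read and write replica sets must overlap when R + W > RF.
    r = replicas_required(read_cl, rf)
    w = replicas_required(write_cl, rf)
    return r + w > rf


# Hypothetical cluster with replication factor 3.
for write_cl, read_cl in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    print(write_cl, read_cl, read_sees_latest_write(read_cl, write_cl, rf=3))
```

With a replication factor of 3, writing and reading at ONE gives no overlap guarantee, while QUORUM for both does, without waiting for every replica as ALL would.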

Related

Spark Structured Streaming - Streaming data joined with static data which will be refreshed every 5 mins

In my Spark Structured Streaming job, one input comes from a Kafka topic and the second input is a file that a Python API refreshes every 5 minutes. I need to join these two inputs and write the result to a Kafka topic.
The issue is that when the second input file is being refreshed while the Spark streaming job is reading it, I get the error below:
File file:/home/hduser/code/new/collect_ip1/part-00163-55e17a3c-f524-4dac-89a4-b9e12f1a79df-c000.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by recreating the Dataset/DataFrame involved.
Any help will be appreciated.
Use HBase as the store for your static data. It is certainly more work, but it allows concurrent updates.
Where I work, all Spark Streaming jobs use HBase for data lookups, and it is far faster. Consider 100M customers against a microbatch of 10k records: a keyed lookup touches only the rows the microbatch needs. It was a lot of work initially, though.
See https://medium.com/#anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
If you have a small static reference table, a static join is fine, but in your case the table is also being updated, and that is what causes the issue.

Concept for temporary data in Apache Cassandra

I have a question about using Cassandra for temporary data (data that is written to the database once, read once, and then deleted).
We use Cassandra to exchange data between processes running on different machines or in different containers. Process1 writes some data to Cassandra and Process2 reads it. After that, the data can be deleted.
Because we learned that Cassandra does not cope well with frequent writes and deletes in the same table (tombstones and the resulting performance issues), we create temporary tables for this.
Process1 : Create table, write data to table.
Process2 : Read data from table, drop table.
But when we do this at a high rate (500-1000 tables created and dropped per hour), we run into schema synchronization problems between our nodes (we have a cluster of 6 nodes).
The Cassandra cluster became very slow, we got many timeout warnings and errors about differing schemas on the nodes, the CPU load on the cluster nodes rose to 100%, and then the cluster was dead :-).
Is Cassandra the right database for this use case?
Is it a problem with how we configured our cluster?
Would it be a better solution to create temporary keyspaces for this?
Does anyone have experience handling such a use case with Cassandra?
You don't need a database here. Your use case is to let your applications hand data over to each other asynchronously. There are two possible solutions:
1) For batch-based writes and reads, consider using something like HDFS for intermediate storage. Process 1 writes data files into HDFS directories and Process 2 reads them from HDFS.
2) For a message-based system, consider something like Kafka. Process 1 processes the data stream and writes to Kafka topics, and Process 2 consumes the data from those topics. Kafka also provides acknowledgement features (producer acks and consumer offset commits).
Continuously creating and dropping large numbers of tables in Cassandra is not good practice and is not recommended.
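To make the message-based option concrete, here is a minimal sketch of the handoff pattern. It uses Python's standard-library `queue.Queue` as a stand-in for a broker topic; this is an assumption made so the example runs on its own, and it is not Kafka or a Kafka client. The names `process1`, `process2`, and `topic` are hypothetical.

```python
import queue
import threading

# In-process stand-in for a broker topic (NOT Kafka).
topic = queue.Queue()
received = []


def process1(n: int) -> None:
    # Producer: publish n messages, then a sentinel marking the end.
    for i in range(n):
        topic.put({"id": i, "payload": f"record-{i}"})
    topic.put(None)


def process2() -> None:
    # Consumer: read each message once and acknowledge it.
    while True:
        msg = topic.get()
        try:
            if msg is None:
                return
            received.append(msg)
        finally:
            topic.task_done()  # acknowledgement for this message


consumer = threading.Thread(target=process2)
consumer.start()
process1(5)
topic.join()      # wait until every message has been acknowledged
consumer.join()
print(len(received))
```

Each message is consumed once and then disappears from the queue, so there is no table to create, no rows to delete, and no schema change to propagate.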

Looking up about 40k records out of 150 million records in Cassandra in every job run?

I am building a near-real-time/microbatch data application with Cassandra as the lookup store. Each incremental run has ~40K records, while the Cassandra table has about 150 million records. In each run, I need to look up the id field and fetch some attributes from Cassandra. These lookups can be random (no dependency on time, region, or country), so there is no obvious partitioning scheme.
How should I try to partition the Cassandra table to ensure decent/ good performance (for microbatches running every 15-30 mins)?
Apart from partitioning, any other tips?
The joinWithCassandraTable and leftJoinWithCassandraTable functions were designed specifically for efficient data lookup in Cassandra from Spark jobs. They fetch data by primary or partition key, and because the fetches are executed by multiple executors in parallel, they can be fast (although ~40K lookups could still take some time, depending on the size of your Cassandra and Spark clusters). See the Spark Cassandra Connector (SCC) documentation for details on how to use them, but remember that these functions are available only in the RDD API. DataStax's version of the connector supports the so-called "DirectJoin", which performs efficient joins with Cassandra in the DataFrame API.
Regarding partitioning: it depends on how you perform the lookup. Does one record in Cassandra match one record in Spark? If yes, just use this ID as the primary key (a single-column primary key is also the partition key).
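As a rough illustration of why a random ID works well as the partition key, the sketch below hashes 40,000 random IDs onto a hypothetical 6-node cluster and counts how many lookups each node would serve. The MD5 hash and the modulo placement are simplifications chosen for the example; Cassandra itself uses its own partitioner and token ranges, and the node count and ID format are assumptions.

```python
import hashlib
import random
from collections import Counter

NUM_NODES = 6  # hypothetical cluster size


def node_for(record_id: str, num_nodes: int = NUM_NODES) -> int:
    # Simplified placement: hash the key, then map it to a node.
    digest = hashlib.md5(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


random.seed(42)
# One microbatch: ~40K random ids drawn from a 150M id space.
lookup_ids = [f"id-{random.randrange(150_000_000)}" for _ in range(40_000)]

per_node = Counter(node_for(record_id) for record_id in lookup_ids)
for node in sorted(per_node):
    print(node, per_node[node])
```

Because the hash spreads keys evenly, each node serves a similar share of the point lookups, which is the situation the join functions above are meant to exploit.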

Spark Job for Inserting data to Cassandra

I am trying to write data to Cassandra tables using Spark with Scala. Sometimes a Spark task fails part-way through and leaves partial writes. Does Spark roll back the partial writes when the task is restarted from the beginning?
No. Spark (and Cassandra, for that matter) does not do a commit-style insert covering the whole task. This means your writes must be idempotent; otherwise you can end up with strange behavior.
No, but if I'm right you can simply reprocess your data, which will overwrite the partial writes. When you insert data with a primary key that already exists, Cassandra treats the insert as an update (upsert).
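A minimal sketch of why upserts make a retry safe, using a plain dictionary keyed by primary key as a stand-in for a table (an assumption for illustration; this is not a Cassandra client). The `upsert` helper and the sample rows are hypothetical.

```python
# In-memory stand-in for a table keyed by primary key (NOT Cassandra).
table = {}


def upsert(row: dict) -> None:
    # Insert-or-overwrite by primary key, like an upsert.
    table[row["id"]] = row


batch = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]

# First attempt: the task fails after writing two of three rows.
for row in batch[:2]:
    upsert(row)

# Retry: the whole batch is replayed from the start.
for row in batch:
    upsert(row)

print(len(table))  # rows 1 and 2 were overwritten, not duplicated

# Contrast: a non-idempotent write (an increment) is applied twice
# for the rows that were already written before the failure.
counter = 0
for _ in batch[:2]:
    counter += 1
for _ in batch:
    counter += 1
print(counter)  # 5 increments for a 3-row batch
```

The replay leaves exactly three rows because each write is keyed by the primary key, whereas increments (or keys generated at write time) would not survive a retry unchanged.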

Cassandra Fast Read Configuration

I have 4 Cassandra nodes with 1 seed in a single data center. I have about 5M records, which Cassandra takes around 4 minutes to read, whereas MySQL takes only 17 seconds. My guess is that something is wrong in my configuration. Could anyone tell me which configuration attributes I should check in cassandra.yaml?
You may be doing an apples to oranges comparison if you are reading all 5M records from one client.
With MySQL all the data is local and optimized for reads since data is updated in place.
Cassandra is distributed and optimized for writes. Writes are simple appends, but reads are expensive since all the appends need to be read and merged to get the current value of each column.
Since the data is distributed across multiple nodes, there is a lot of overhead of accessing and retrieving the data over the network.
If you were using Spark with Cassandra and loading the data into Spark workers in parallel without shuffling it across the network to a single client, then it would be a more similar comparison.
Cassandra is generally good at ingesting large amounts of data and then working on small slices of it (i.e. partitions) rather than doing table scan operations such as reading the entire table.
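The "append on write, merge on read" behavior described above can be sketched as follows. This is a deliberately simplified model: each dictionary stands for one immutable data file, the values carry a write timestamp, and the newest timestamp wins per column. The data layout and names are assumptions for illustration, and the real read path has optimizations that let it skip files.

```python
# Each "sstable" maps (row_key, column) -> (value, write_timestamp).
# Writes only ever add a new file; nothing is updated in place.
sstables = [
    {("u1", "name"): ("Ann", 1), ("u1", "city"): ("Oslo", 1)},
    {("u1", "city"): ("Rome", 2)},
    {("u1", "name"): ("Anna", 3), ("u2", "name"): ("Bob", 3)},
]


def read_row(row_key: str) -> dict:
    # A read has to consult every file and keep the newest value
    # of each column (last write wins).
    merged = {}
    for sstable in sstables:
        for (key, column), (value, ts) in sstable.items():
            if key != row_key:
                continue
            if column not in merged or ts > merged[column][1]:
                merged[column] = (value, ts)
    return {column: value for column, (value, _ts) in merged.items()}


print(read_row("u1"))
```

The write side is a cheap append, while the read side does the merging work, which is why full-table reads from a single client compare poorly with a database that updates rows in place.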
