Need help creating a Cassandra data model design for my requirement

I have a Job_Status table with the following columns:
Job_ID (numeric)
Job_Time (datetime)
Machine_ID (numeric)
A few other fields containing stats (like memory and CPU utilization)
At a regular interval (say 1 min), entries are inserted into the above table for the jobs running on each machine.
I want to design the data model in Cassandra.
My requirement is to get the list of pairs of jobs that are running at the same time on 2 or more machines.
I have created a table with Job_ID and Job_Time as the primary key, but in order to achieve the desired result I have to do a lot of parsing of the data after retrieving the records.
This takes a lot of time once the number of records reaches around 500 thousand.
This requirement calls for something like SQL's inner join, but I can't use SQL for some business reasons, and a SQL query over such a huge data set also takes a long time, as I found when I tried it with dummy data in SQL Server.
So I need your help on the points below:
Kindly suggest an efficient data model in Cassandra for this requirement.
How can SQL's join operation be achieved/implemented in Cassandra?
Kindly suggest some alternate design/algorithm. I have been stuck on this problem for a very long time.

That's a pretty broad question. As a general approach you might want to look at pairing Cassandra with Spark so that you could do the large join in parallel.
You would insert jobs into your table when they start and delete them when they complete (possibly with a TTL set on insert so that jobs that don't get deleted will auto delete after some time).
When you wanted to update your pairing of jobs, you'd run a Spark batch job that would load the table data into an RDD and then do a map/reduce operation on the data, or use Spark SQL to do a SQL-style join. You'd probably then write the resulting RDD back to a Cassandra table.
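For illustration only, here is a minimal sketch of that approach using DataFrames and the spark-cassandra-connector rather than raw RDDs. The contact point, keyspace, and table names are assumptions, not something taken from your schema, and the output table is assumed to already exist.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumed contact point, keyspace and table names; adjust to your cluster.
val spark = SparkSession.builder
  .appName("concurrent-job-pairs")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

val jobStatus = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "monitoring", "table" -> "job_status"))
  .load()
  .select("job_id", "job_time", "machine_id")

// Self-join on job_time: pairs of distinct jobs sampled in the same interval.
val a = jobStatus.as("a")
val b = jobStatus.as("b")
val pairs = a
  .join(b, col("a.job_time") === col("b.job_time") && col("a.job_id") < col("b.job_id"))
  .select(col("a.job_id").as("job_a"), col("b.job_id").as("job_b"), col("a.job_time").as("job_time"))
  .distinct()

// Write the pairs back to a (pre-created) Cassandra table.
pairs.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "monitoring", "table" -> "concurrent_job_pairs"))
  .mode("append")
  .save()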

Related

What is the performance difference between stream.filter and CQL ALLOW FILTERING?

My Cassandra DB table doesn't have much data right now.
However, since it is a table where data continuously accumulates, I am interested in the performance implications.
First of all, please set aside the suggestion that the table needs to be redesigned.
Think of it as a general RDBMS date-based lookup (startDate ~ endDate).
Option 1: From Cassandra DB, apply ALLOW FILTERING and force the query. This will get you exactly the data you want.
Option 2: Query "all data" from Cassandra DB (no WHERE clause); this query only needs to be done once. After that, only the data within the desired date range is extracted through the stream().filter() function.
Which method would you choose?
In general, which one has more performance issues?
Summary: about 6 such lookups need to be done.
Option 1: execute the ALLOW FILTERING query 6 times; no stream filter.
Option 2: execute the findAll query once, then run the stream filter 6 times.
The challenge with both options is that neither will scale. It may work with very small data sets, say less than 1000 partitions, but you will quickly find that neither will work once your tables grow.
Cassandra is designed for real-time OLTP workloads where you are retrieving a single partition for real-time applications.
For analytics workloads, you should instead use Spark with the spark-cassandra-connector because it optimises analytics queries. Cheers!
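To make that concrete, here is a rough sketch of the Spark route; it assumes an existing SparkSession configured with the spark-cassandra-connector, and the keyspace, table, and column names are made up for illustration.

import org.apache.spark.sql.functions.col

// Assumes a SparkSession `spark` configured with the spark-cassandra-connector;
// keyspace, table and column names are illustrative only.
val rows = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .load()

// The date-range filter runs in parallel across executors instead of forcing
// a full scan through a single coordinator with ALLOW FILTERING.
val inRange = rows.filter(col("event_date").between("2023-01-01", "2023-01-31"))
inRange.count()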

Is Spark good for processing data from a SQL DB in a job? How to avoid processing the same data in the job?

I have a problem and I wonder if spark is a good tool to solve it:
There is a SQL DB. I want to process data from a table like this:
Orders Table:
| id | product | date |
I would like to create a "processing job" which can scan all the records and save results to another DB/file.
Ultimately, I would like to have several features/tables in the target database/file (for example, the oldest product orders, or the number of orders for a given month).
So, the target database/file will contain an ordersForGivenMonths table with values like: September: 150 (orders with the same id), October: 230 ... etc.
Tables in the database will be expanded. I have given only two examples.
Can it be done at Spark? Is it a good tool for this type of task?
Can I create jobs in Spark that will process the SQL database at a given interval?
New records will be constantly added to the source SQL database. Is it possible to configure Spark so that it does not process data that it has already processed and pushed into the target database/file earlier?
I was looking for tutorials/docs but most are introductions without specific solutions.
I think you can use Spark Streaming with custom receivers, and you can add some logic in the receiver: http://spark.apache.org/docs/latest/streaming-custom-receivers.html
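Custom receivers are one option; for the "don't reprocess old data" part, here is a hedged sketch of a plain scheduled batch job that reads only rows past a stored watermark over JDBC and recomputes the per-month counts. The JDBC URL, credentials, column names, output path, and the watermark storage are all assumptions.

import org.apache.spark.sql.functions.{col, date_format}

// Assumes a SparkSession `spark`; connection details and the watermark store are assumptions.
val lastProcessedId = 12345L // e.g. read back from the target DB/file at the start of each run

val newOrders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/shop")
  .option("query", s"select id, product, date from orders where id > $lastProcessedId")
  .option("user", "spark")
  .option("password", "***")
  .load()

// Per-month order counts for the newly seen rows only.
val ordersForGivenMonths = newOrders
  .groupBy(date_format(col("date"), "yyyy-MM").as("month"))
  .count()

// Append (or merge) into the target table/file, then persist the new max(id) as the watermark.
ordersForGivenMonths.write.mode("append").parquet("/data/ordersForGivenMonths")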

Is there a way to check counts in Cassandra system tables? Where can we check the metadata of the latest inserts?

I am working on an Oracle-to-Cassandra migration tool, where I want to maintain a validation table with Oracle count and Cassandra count columns so that I can validate the migration job. In Cassandra, does the system maintain the count of recently executed/inserted queries, or the total count of a particular table? Is it stored anywhere in the Cassandra system tables? If so, where? If not, please suggest some way to design a validation framework for the data migration.
Is there a way in Cassandra to get the record count of the latest insert and the total count of a table from any system table, so that we can read the counts instead of executing a count(*) query on the tables? Does Cassandra maintain these counts anywhere internally? If so, where can we check the metadata of the latest inserts, i.e. in which system tables?
Cassandra is a distributed system and there is no place where it collects per-table counts. You can get some estimates from system.size_estimates, but it only gives the partition count per token range, and their sizes.
For the framework you're asking about, you may need to develop custom Spark code (the easiest way) that performs the row counting and other checks. Spark is highly optimized for efficient data access and is likely preferable to writing your own custom driver code.
Also, during migration, consider using a consistency level greater than ONE to make sure that several nodes confirm each write. That said, it depends on the amount of data and the timing requirements for your migration jobs.
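As a rough illustration of that custom Spark validation code: the keyspace/table names and the Oracle connection details below are assumptions.

// Assumes a SparkSession `spark` with the spark-cassandra-connector and an Oracle JDBC driver
// on the classpath; names and connection details are illustrative only.
val cassandraCount = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "migrated", "table" -> "customers"))
  .load()
  .count()

val oracleCount = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//orahost:1521/ORCL")
  .option("dbtable", "CUSTOMERS")
  .option("user", "migrator")
  .option("password", "***")
  .load()
  .count()

// Record both counts in your validation table, or simply compare them here.
println(s"oracle=$oracleCount cassandra=$cassandraCount match=${oracleCount == cassandraCount}")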

How to update or even reset rows in a persistent table given multiple simultaneous readers?

I have an exchangeRates table that gets updated in batch once per week. This is to be used by other batch and streaming jobs, across different clusters - thus I want to save it as a persistent, shared table for all jobs to share.
allExchangeRatesDF.write.saveAsTable("exchangeRates")
How best then (for the batch job that manages this data) to gracefully update the table contents (actually overwrite them completely), considering the various Spark jobs that consume it, and particularly given its use in some 24/7 Structured Streaming queries?
I've checked the APIs; maybe I am missing something obvious! Very likely.
Thanks!
I think you expect some kind of transaction support from Spark, so that while saveAsTable is in progress Spark would hold all writes until the update/reset has finished.
I think that the best way to deal with the requirement is to append new records (using insertInto) with the batch id that would denote the rows that belong to a "new table".
insertInto(tableName: String): Unit Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
You'd then use the batch id to deal with the rows as if they were the only rows in the dataset.
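A minimal sketch of that batch-id idea, assuming the exchangeRates table was created with an extra batch_id column (the column name and the use of a timestamp as the id are assumptions):

import org.apache.spark.sql.functions.{col, lit, max}

// Writer side: tag the fresh rates with a new batch id and append them.
// Assumes the exchangeRates table already has a batch_id column (insertInto resolves by position).
val batchId = System.currentTimeMillis()
allExchangeRatesDF
  .withColumn("batch_id", lit(batchId))
  .write
  .insertInto("exchangeRates")

// Reader side: only pick up rows belonging to the most recent batch.
val latestBatch = spark.table("exchangeRates").agg(max("batch_id")).first().getLong(0)
val currentRates = spark.table("exchangeRates").filter(col("batch_id") === latestBatch)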

Is there a data architecture for efficient joins in Spark (a la RedShift)?

I have data that I would like to do a lot of analytic queries on and I'm trying to figure out if there is a mechanism I can use to store it so that Spark can efficiently do joins on it. I have a solution using RedShift, but would ideally prefer to have something that is based on files in S3 instead of having a whole RedShift cluster up 24/7.
Introduction to the data
This is a simplified example. We have 2 initial CSV files.
Person records
Event records
The two tables are linked via the person_id field. person_id is unique in the Person table. Events have a many-to-one relationship with person.
The goal
I'd like to understand how to set up the data so I can efficiently perform the following query. I will need to perform many queries like this (all queries are evaluated on a per person basis):
The query is to produce a data frame with 4 columns, and 1 row for every person.
person_id - person_id for each person in the data set
age - "age" field from the person record
cost - The sum of the "cost" field for all event records for that person where "date" is during the month of 6/2013
All current solutions I have with Spark to this problem involve reshuffling all the data, which ends up making the process slow for large amounts (hundreds of millions of people). I am happy with a solution that requires me to reshuffle the data and write it to a different format once if that can then speed up later queries.
The solution using RedShift
I can accomplish this solution using RedShift in a fairly straightforward way:
Both files are loaded in as RedShift tables, with DISTKEY person_id, SORTKEY person_id. This distributes the data so that all the data for a person is on a single node. The following query will produce the desired data frame:
select person_id, age, e.cost from person
left join (select person_id, sum(cost) as cost from events
where date between '2013-06-01' and '2013-06-30'
group by person_id) as e using (person_id)
The solution using Spark/Parquet
I have thought of several potential ways to handle this in Spark, but none accomplishes what I need. My ideas and the issues are listed below:
Spark Dataset write 'bucketBy' - Read the CSV files and then rewrite them out as parquet files using "bucketBy". Queries on these parquet files could then be very fast. This would produce a data setup similar to RedShift, but parquet files don't support bucketBy.
Spark parquet partitioning - Parquet does support partitioning. Because parquet creates a separate set of files for each partition key, you have to create a computed column to partition on and use a hash of person_id to create the partitionKey. However, when you later join these tables in spark based on "partition_key" and "person_id", the query plan still does a full hash partition. So this approach is no better than just reading the CSVs and shuffling every time.
Stored in some other data format besides parquet - I am open to this, but don't know of another data source that will work.
Using a compound record format - Parquet supports hierarchical data formats, so you can prejoin both tables into a hierarchical record (where a person record has an "events" field which is an array of struct elements) and then do processing on that. When you have a hierarchical record, there are two approaches to processing it:
Use explode to create separate records - Using this approach you explode array fields into full rows, then use standard data frame operations to do analytics, and then join them back to the main table. Unfortunately, I've been unable to get this approach to efficiently compile queries.
Use UDFs to perform operations on subrecords - This preserves the structure and executes without shuffles, but is an awkward and verbose way to program. Also, it requires lots of UDFs which aren't great for performance (although they beat large scale shuffling of data).
For my use cases, Spark has advantages over RedShift which aren't obvious in this simple example, so I'd prefer to do this with Spark. Please let me know if I am missing something and there is a good approach to this.
Edited per comment.
Assumptions:
Using parquet
Here's what I would try:
val eventAgg = spark.sql("""select person_id, sum(cost) as cost
from events
where date between '2013-06-01' and '2013-06-30'
group by person_id""")
eventAgg.cache.count
val personDF = spark.sql("""SELECT person_id, age from person""")
personDF.cache.count // cache is less important here, so feel free to omit
eventAgg.join(personDF, Seq("person_id"), "left")
I just did this with some of my data and here's how it went (9 node / 140 vCPU cluster, ~600GB RAM):
27,000,000,000 "events" (aggregated to 14,331,487 "people")
64,000,000 "people" (~20 columns)
aggregated events building and caching took ~3 min
people caching took ~30 seconds (pulling from network, not parquet)
left joining took several seconds
Not caching the "people" led to the join taking a few seconds longer. Then forcing Spark to broadcast the couple hundred MB of aggregated events made the join take under 1 second.
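For reference, a hedged sketch of that broadcast variant: joining from the person side so every person keeps a row, and hinting Spark to broadcast the (now small) aggregated events.

import org.apache.spark.sql.functions.broadcast

// Broadcasting the aggregated events avoids shuffling the large person table;
// joining from the person side keeps one row per person.
val result = personDF.join(broadcast(eventAgg), Seq("person_id"), "left")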
