Managing multiple database connections and data with foreachPartition - apache-spark

Will try to make it as clear as possible so an example isn't required as this has to be a concept that I didn't grasp properly and I'm struggling with rather than a problem with data or Spark code itself.
I'm required to insert city data within their own database (MongoDB) and I'm trying to perform those upserts as fast as possible.
Take into account a sample DataFrame with the following, where I want to do some upserts against MongoDB based on, for example, year, city and zone.
year - city - zone - num_business - num_vehicles.
Having groupedBy those columns I'm just pending to perform the upsert into the DB.
Using the MongoDB Driver I'm required to instantiate several WriteConfigs to cope with multiple databases (1 database per city).
// the 'getDatabaseWriteConfigsPerCity' method filters the 'df' so it only contains the docs from a single city.
for (cityDBConnection <- getDatabaseWriteConfigsPerCity(df) {
cityDBConnection.getDf.foreach(
... // set MongoDB upsert criteria.
)
}
Doing it that way works but still, more performance can be gained when using foreachPartition as I hope that those records within the DF are spread to the executors are more data is concurrently being upsert.
However, I get erroneous results when using foreachPartition. Erroneus because they seem incomplete. Counters are way off and such.
I suspect this is because, among the partitions, same keys are in different partitions and it's not until those are merged in the master when those are inserted to MongoDB as a single record.
Is there any way I can make sure partitions contain the total of documents related to an upsert key?
Don't really know if I'm being clear enough, but if it's still too complicated I will update as soon as possible.

Is there any way I can make sure partitions contain the total of
documents related to an upsert key? if you do:
df.repartition("city").foreachPartition{...}
You can be sure that all records with same city are in the same partition (but there is probably more than 1 city per partition!)

Related

Cassandra Query Performance: Using IN clause for one portion of the composite partition key

I currently have a table set up in Cassandra that has either text, decimal or date type columns with a composite partition key of a business_date and an account_number. For queries to this table, I need to be able to support look-ups for a single account, or for a list of accounts, for a given date.
Example:
select x,y,z from my_table where business_date = '2019-04-10' and account_number IN ('AAA', 'BBB', 'CCC')
//Note: Both partition keys are provided for this query
I've been struggling to resolve performance issues related to accessing this data because I'm noticing latency patterns that I am having trouble trying to understand / explain.
In many scenarios, the same exact query can be run a total of three times in a short period by the client application. For these scenarios, I see that two out of three requests will have really bad response times (800 ms), and one of them will have a really fast one (50 ms). At first I thought this would be due to key or row caches, however, I'm not so sure since I believe that if this were true, the third request out of the three should always be the fastest, which isn't the case.
The second issue I believed I was facing was the actual data model itself. Although the queries are being submitted with all the partition keys being provided, since it's an IN clause, the results would be separate partitions and can be distributed across the cluster and so, this would be a bad access pattern. However, I see these latency problems when even single account queries are run. Additionally, I see queries that come with 15 - 20 accounts performing really well (under 50ms), so I'm not sure if the data model is actually an issue.
Cluster setup:
Datacenters: 2
Number of nodes per data center: 3
Keyspace Replication:local_dc = 2, remote_dc = 2
Java Driver set:
Load-balancing: DCAware with LatencyAware
Protocol: v3
Queries are still set up to use "IN" clauses instead of async individual queries
Read_consistency: LOCAL_ONE
Does anyone have any ideas / clues of what I should be focusing on in terms of really identifying the root cause of this issue?
the use of IN on the partition key is always the bad idea, even for composite partition keys. The value of partition key defines the location of your data in cluster, and different values of partition key will most probably put data onto different servers. In this case, coordinating node (that received the query) will need to contact nodes that hold the data, wait that these nodes will deliver results, and only after that, send you results back.
If you need to query several partition keys, then it will be faster if you issue individual queries asynchronously, and collect result on client side.
Also, please note that TokenAware policy works best when you use PreparedStatement - in this case, driver is able to extract value of partition key, and find what server holds data for it.

Cassandra data model too many table

I have a single structured row as input with write rate of 10K per seconds. Each row has 20 columns. Some queries should be answered on these inputs. Because most of the queries needs different WHERE, GROUP BY or ORDER BY, The final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
I am aware of the limit in number of tables in Cassandra data model (200 is warning and 500 would fail)
Because for every input row I should do an insert in every table, the final write per seconds became big * big data!:
writes per seconds = 10K (input)
* number of tables (queries)
* replication factor
The main question: am I on the right path? Is it normal to have a table for every query even when the input rate is already so high?
Shouldn't I use something like spark or hadoop instead of relying on bare datamodel? Or event Hbase instead of Cassandra?
It could be that Elassandra would resolve your problem.
The query system is quite different from CQL, but the duplication for indexing would automatically be managed by Elassandra on the backend. All the columns of one table will be indexed so the Elasticsearch part of Elassandra can be used with the REST API to query anything you'd like.
In one of my tests, I pushed a huge amount of data to an Elassandra database (8Gb) going non-stop and I never timed out. Also the search engine remained ready pretty much the whole time. More or less what you are talking about. The docs says that it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess it will somewhat depend on your installation, but I think that's more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once in place, it's incredible how fast you can find results. It includes incredible (powerful) WHERE for sure. The GROUP BY is a bit difficult to put in place. The ORDER BY is simple enough, however, when (re-)ordering you lose on speed... Something to keep in mind. On my tests, though, even the ORDER BY equivalents was very fast.

Require help in creating design for cassandra data model for my requirement

I have a Job_Status table with 3 columns:
Job_ID (numeric)
Job_Time (datetime)
Machine_ID (numeric)
Other few fields containing stats (like memory, CPU utilization)
At a regular interval (say 1 min), entries are inserted in the above table for the Jobs running on each Machines.
I want to design the data model in Cassandra.
My requirement is to get list (pair) of jobs which are running at the same time on 2 or more than 2 machines.
I have created table with Job_Id and Job_Time as primary key for row but in order to achieve the desired result I have to do lots of parsing of data after retrieval of records.
Which is taking a lot of time when the number of records reach around 500 thousand.
This requirement expects the operation like inner join of SQL, but I can’t use SQL due to some business reasons and also SQL query with such huge data set is also taking lots of time as I tried that with dummy data in SQL Server.
So I require your help on below points:
Kindly suggest some efficient data model in Cassandra for this requirement.
How the join operation of SQL can be achieved/implemented in Cassandra database?
Kindly suggest some alternate design/algorithm. I am stuck at this problem for a very long time.
That's a pretty broad question. As a general approach you might want to look at pairing Cassandra with Spark so that you could do the large join in parallel.
You would insert jobs into your table when they start and delete them when they complete (possibly with a TTL set on insert so that jobs that don't get deleted will auto delete after some time).
When you wanted to update your pairing of jobs, you'd run a spark batch job that would load the table data into an RDD, and then do a map/reduce operation on the data, or use spark SQL to do a SQL style join. You'd probably then write the resulting RDD back to a Cassandra table.

Getting rid of confusion regarding NoSQL databases

This question is about NoSQL (for instance take cassandra).
Is it true that when you use a NoSQL database without data replication that you have no consistency concerns? Also not in the case of access concurrency?
What happens in case of a partition where the same row has been written in both partitions, possible multiple times? When the partition is gone, which written value is used?
Let's say you use N=5 W=3 R=3. This means you have guaranteed consistency right? How good is it to use this quorum? Having 3 nodes returning the data isn't that a big overhead?
Can you specify on a per query basis in cassandra whether you want the query to have guaranteed consistency? For instance you do an insert query and you want to enforce that all replica's complete the insert before the value is returned by a read operation?
If you have: employees{PK:employeeID, departmentId, employeeName, birthday} and department{PK:departmentID, departmentName} and you want to get the birthday of all employees with a specific department name. Two problems:
you can't ask for all the employees with a given birthday (because you can only query on the primary key)
You can't join the employee and the department column families because joins are impossible.
So what you can do is create a column family:
departmentBirthdays{PK:(departmentName, birthday), [employees-whos-birthday-it-is]}
In that case whenever an employee is fired/hired it has to be removed/added in the departmentBirthdays column family. Is this process something you have to do manually? So you have to manually create queries to update all redundant/denormalized data?
I'll answer this from the perspective of cassandra, coz that's what you seem to be looking at (hardly any two nosql stores are the same!).
For a single node, all operations are in sequence. Concurrency issues can be orthogonal though...your web client may have made a request, and then another, but due to network load, cassandra got the second one first. That may or may not be an issue. There are approaches around such problems, like immutable data. You can also leverage "lightweight transactions".
Cassandra uses last write wins to resolve conflicts. Based on you replication factor and consistency level for your query, this can work well.
Quurom for reads AND writes will give you consistency. There is an edge case..if the coordinator doesn't know a quorum node is down, it sends the write requests, then the write would complete when quorum is re-established. The client in this case would get a timeout and not a failure. The subsequent query may get the stale data, but any query after that will get latest data. This is an extreme edge case, and typically N=5, R=3, W3= will give you full consistency. Reading from three nodes isn't actually that much of an overhead. For a query with R=3, the client would make that request to the node it's connected to (the coordinator). The coordinator will query replicas in parallel (not sequenctially). It willmerge up the results with LWW to get the result (and issue read repairs etc. if needed). As the queries happen in parallel, the overhead is greatly reduced.
Yes.
This is a matter of data modelling. You describe one approach (though partitioning on birthday rather than dept might be better and result in more even distribution of partitions). Do you need the employee and department tables...are they needed for other queries? If not, maybe you just need one. If you denormalize, you'll need to maintain the data manually. In Cassandra 3.0, global indexes will allow you to query on an index without being inefficient (which is the case when using a secondary index without specifying the partition key today). Yes another option is to partition employeed by birthday and do two queries, and do the join in memory in the client. Cassandra queries hitting a partition are very fast, so doing two won't really be that expensive.

Is a read with one secondary index faster than a read with multiple in cassandra?

I have this structure that I want a user to see the other user's feeds.
One way of doing it is to fan out an action to all interested parties's feed.
That would result in a query like select from feeds where userid=
otherwise i could avoid writing so much data and since i am already doing a read I could do:
select from feeds where userid IN (list of friends).
is the second one slower? I don't have the application yet to test this with a lot of data/clustering. As the application is big writing code to test a single node is not worth it so I ask for your knowledge.
If your title is correct, and userid is a secondary index, then running a SELECT/WHERE/IN is not even possible. The WHERE/IN clause only works with primary key values. When you use it on a column with a secondary index, you will see something like this:
Bad Request: IN predicates on non-primary-key columns (columnName) is not yet supported
Also, the DataStax CQL3 documentation for SELECT has a section worth reading about using IN:
When not to use IN
The recommendations about when not to use an index apply to using IN
in the WHERE clause. Under most conditions, using IN in the WHERE
clause is not recommended. Using IN can degrade performance because
usually many nodes must be queried. For example, in a single, local
data center cluster with 30 nodes, a replication factor of 3, and a
consistency level of LOCAL_QUORUM, a single key query goes out to two
nodes, but if the query uses the IN condition, the number of nodes
being queried are most likely even higher, up to 20 nodes depending on
where the keys fall in the token range.
As for your first query, it's hard to speculate about performance without knowing about the cardinality of userid in the feeds table. If userid is unique or has a very high number of possible values, then that query will not perform well. On the other hand, if each userid can have several "feeds," then it might do ok.
Remember, Cassandra data modeling is about building your data structures for the expected queries. Sometimes, if you have 3 different queries for the same data, the best plan may be to store that same, redundant data in 3 different tables. And that's ok to do.
I would tackle this problem by writing a table geared toward that specific query. Based on what you have mentioned, I would build it like this:
CREATE TABLE feedsByUserId
userid UUID,
feedid UUID,
action text,
PRIMARY KEY (userid, feedid));
With a composite primary key made up of userid as the partitioning key you will then be able to run your SELECT/WHERE/IN query mentioned above, and achieve the expected results. Of course, I am assuming that the addition of feedid will make the entire key unique. if that is not the case, then you may need to add an additional field to the PRIMARY KEY. My example is also assuming that userid and feedid are version-4 UUIDs. If that is not the case, adjust their types accordingly.

Resources