Cluster New Record in Dedupe Clustered Table - python-dedupe

I am using Python Dedupe to de-duplicate our MDM database. So far it works well after sufficient training, and it produces an entity map table that shows the cluster IDs, canonical names, and a score.
I'm stuck on what to do when a new record is inserted into the database: how can this new record be merged into the existing clusters in the entity_map table? I could not find a function for this in the dedupe documentation either.
Running the entire process (creating the blocking map, plural key, and clustered dupes) again for the new records would be costly, so I am looking for a less expensive way to cluster new records against the existing clusters in the entity map table.


Is it possible to insert or update without providing all primary key columns in a Cassandra database?

I have an application that uses Cassandra as its database. Each row of the table is filled in at three separate moments (by three inputs). The table has four primary key columns, and not all of them are available at each moment an insert or update takes place.
The error is:
Some partition key parts are missing
Please note that my application has to perform a large number of writes (nearly 300,000) in a short interval, so I want to keep the database's write throughput as high as possible.
One approach might be to read from the database first, then write, using dummy values for any primary key column that is not available at the time of the insert or update. But that would add roughly another 300,000 reads, which would slow down both the database and my application.
So I am looking for another solution.
"The table has four primary key columns, and not all of them are available at each moment an insert or update takes place."
As you are finding out, that is not possible. For partition keys in particular, they are used (hashed) to determine which node in the cluster is primarily responsible for the data. As it is a fundamental part of the Cassandra write path, it must be complete at write-time and cannot be changed/updated later.
The same is true with clustering keys (keys which determine the on-disk sort order of all data within a partition). Omitting one or more will yield this message:
Some clustering keys are missing
Unfortunately, there isn't a good way around this. A row's keys must all be known before writing. It's worth mentioning that keys in Cassandra are unique, so any attempt to update them will result in a new row.
Perhaps writing the data to a streaming topic or message broker (like Pulsar or Kafka) beforehand would be a better option? Then, once all keys are known, the data (message) can be consumed from the topic and written to Cassandra.
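To illustrate the buffering idea, here is a minimal in-memory sketch in Python. It is not tied to any broker API: the key column names (`k1` to `k4`), the `RowAssembler` class, and the `correlation_id` that links the three inputs are all assumptions made for this example, since the question does not name them. In a real deployment the pending state would live in the consumer of the topic rather than in a plain dict.

```python
from dataclasses import dataclass, field

# Hypothetical key column names; the question does not name its four keys.
KEY_COLUMNS = ("k1", "k2", "k3", "k4")


@dataclass
class RowAssembler:
    """Buffers partial inputs per correlation id until every key column is known."""

    pending: dict = field(default_factory=dict)

    def add(self, correlation_id, fragment):
        """Merge a fragment; return the full row once all keys are present, else None."""
        row = self.pending.setdefault(correlation_id, {})
        row.update(fragment)
        if all(row.get(k) is not None for k in KEY_COLUMNS):
            # All key columns are known: hand the row over for a single write.
            return self.pending.pop(correlation_id)
        return None
```

Each of the three inputs calls `add` with whatever columns it has; only the call that completes the key set returns a row, and that row can then be written to Cassandra with one INSERT and no prior read.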

How do I design a table in Cassandra for a TinyURL use case?

Recently I came across a well-known design problem.
'Tiny URL'
What I found was people vouching for NoSQL databases such as DynamoDB or Cassandra. I've been reading about Cassandra for a couple of days, and I want to design my solution around this database for this specific problem.
What would be the table definition? If I choose the following table definition:
Create table UrlMap(tiny_url text PRIMARY KEY, url text);
Wouldn't this result in a lot of partitions, since my partition key can take on around 68 billion values (64^6, using 6-character base64 strings)?
Would that somehow affect the overall read/write performance? If so, what would be a better model for the table?
Lots of partitions is fine; think of it as using C* (Cassandra) as a key-value store.
The primary principle of data modelling in Cassandra is to design one table for each application query.
For a URL shortening service, the main application query is to retrieve the equivalent full URL for a given tiny URI. In pseudo-code, the query looks like:
GET long url FROM datastore WHERE uri = ?
Note that we won't store the web domain name, so that the service is reusable for any domain. The filter (WHERE clause) is the URI, so the URI is what you want as the partition key, and we design the table accordingly:
CREATE TABLE urls_by_uri (
    uri text,
    long_url text,
    PRIMARY KEY (uri)
);
If we want to retrieve the URL for http://tinyu.rl/abc123, the CQL query is:
SELECT long_url FROM urls_by_uri WHERE uri = 'abc123'
As Phact and Andrew pointed out, there is no need to worry about the number of partitions (records) you'll be storing in the table, because you can store as many as 2^128 partitions in a Cassandra table, which for practical purposes is limitless.
In Cassandra, each partition gets hashed into a token value using the Murmur3 hash algorithm (default partitioner). This implementation distributes each partition randomly across all nodes in the cluster. The same hash algorithm is used to determine which node "owns" the partition making retrieval (reads) very fast in Cassandra.
As long as you limit the SELECT queries to a single partition, retrieving the data is extremely fast. In fact, I work with hundreds of companies who have an SLA on reads of 95% between 6-9 milliseconds. This is achievable in Cassandra when you model your data correctly and size your cluster correctly. Cheers!
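The hashing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Cassandra's implementation: MD5 truncated to a signed 64-bit integer stands in for Murmur3, and the `TokenRing` class, node names, and token values are made up for the example.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a signed 64-bit token.

    Assumption: MD5 is a stand-in here; Cassandra's default partitioner uses Murmur3.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class TokenRing:
    """Toy token ring: each node owns the token range that ends at its own token."""

    def __init__(self, node_tokens):
        # node_tokens maps a (hypothetical) node name to the token it is assigned.
        self._ring = sorted((tok, name) for name, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]

    def owner(self, partition_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around the ring.
        idx = bisect.bisect_left(self._tokens, token_for(partition_key))
        return self._ring[idx % len(self._ring)][1]


ring = TokenRing({"node-a": -(2**62), "node-b": 0, "node-c": 2**62})
owner = ring.owner("abc123")  # the same key always maps to the same node
```

Because the owner is computed from the key alone, a read for `uri = 'abc123'` can go straight to the responsible node, regardless of how many other partitions exist.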

Managing multiple database connections and data with foreachPartition

I will try to make this as clear as possible so that an example isn't required; I think this is a concept I haven't grasped properly rather than a problem with the data or the Spark code itself.
I'm required to insert each city's data into its own MongoDB database, and I'm trying to perform those upserts as fast as possible.
Consider a sample DataFrame with the following columns, where I want to do upserts against MongoDB based on, for example, year, city and zone:
year - city - zone - num_business - num_vehicles.
Having grouped by those columns, all that remains is to perform the upsert into the DB.
Using the MongoDB driver, I have to instantiate several WriteConfigs to cope with multiple databases (one database per city).
// The 'getDatabaseWriteConfigsPerCity' method filters 'df' so it only contains the docs from a single city.
for (cityDBConnection <- getDatabaseWriteConfigsPerCity(df)) {
  ... // set MongoDB upsert criteria.
}
Doing it that way works, but I expect more performance can be gained by using foreachPartition, since the records in the DataFrame would be spread across the executors and more data would be upserted concurrently.
However, I get erroneous results when using foreachPartition. Erroneous because they seem incomplete: counters are way off, and so on.
I suspect this is because the same keys end up in different partitions, and it's not until those are merged on the master that they are inserted into MongoDB as a single record.
Is there any way I can make sure partitions contain all of the documents related to an upsert key?
I don't really know if I'm being clear enough; if it's still too complicated I will update the question as soon as possible.
"Is there any way I can make sure partitions contain the total of documents related to an upsert key?"
If you do:
You can be sure that all records with the same city are in the same partition (but there is probably more than one city per partition!)
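The guarantee described here comes from hash partitioning on the key, which is what Spark does when a DataFrame is repartitioned by a column. The following is a plain-Python sketch of that behaviour, not Spark code: `partition_by_key` is a hypothetical helper, and CRC32 is used only as a convenient deterministic hash.

```python
import zlib
from collections import defaultdict


def partition_by_key(rows, key, num_partitions):
    """Assign each row to a partition based on a hash of row[key].

    Rows with equal key values always land in the same partition, but one
    partition may hold several distinct keys.
    """
    partitions = defaultdict(list)
    for row in rows:
        # CRC32 is an arbitrary deterministic hash chosen for this sketch.
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return dict(partitions)


rows = [
    {"year": 2017, "city": "Madrid", "zone": "A", "num_business": 3},
    {"year": 2017, "city": "Lisbon", "zone": "B", "num_business": 5},
    {"year": 2017, "city": "Madrid", "zone": "C", "num_business": 2},
    {"year": 2017, "city": "Porto", "zone": "A", "num_business": 1},
]
parts = partition_by_key(rows, "city", num_partitions=2)
```

Since a partition can contain more than one city, the per-partition code still has to group its rows by city (and pick the matching write configuration) before upserting.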

Audit Trail Design using Table Storage

I'm considering implementing an audit trail for my application using Table Storage.
I need to be able to log all actions for a specific customer and all actions for entities from that customer.
My first guess was to create a table for each customer (Audits_CustomerXXX), using the entity id as the partition key and the value of (DateTime.MaxValue.Ticks - DateTime.Now.Ticks).ToString("D19") as the row key. This works great when my question is "what happened to a certain entity?" For instance, the audit of a purchase would have PartitionKey = "Purchases/12345" and the timestamp as the RowKey.
But when I want a bird's-eye view of the entire customer, can I just query the table sorted by row key across partitions? Or is it better to create a secondary table to hold the data with different partition keys? Also, when using (DateTime.MaxValue.Ticks - DateTime.Now.Ticks).ToString("D19"), is there a way to prevent errors when two actions in the same partition happen in the same tick (unlikely, but who knows...)?
You could certainly create a separate table for the bird's-eye view, but you really don't have to. Because Azure Tables are schema-less, you can keep this data in the same table as well. You would keep the PartitionKey as reverse ticks and the RowKey as the entity id. Because you would be querying only on PartitionKey, you could also use a GUID as the RowKey, which ensures that all entities are unique. Or you could append a GUID to your entity id and use that as the RowKey.
However, do keep in mind that because you're inserting two entities with different PartitionKey values, you will have to safeguard your code against possible network failures, as each entry will be a separate request to the Table service. The way we handle this in our application is to write the payload to a queue message and then process that message in a background process.
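As a rough illustration of the reverse-ticks key and the GUID suffix, here is a Python sketch that mirrors the C# expression from the question. The helper names (`to_ticks`, `reverse_ticks_key`) and the `_` separator are assumptions for this example; appending the GUID to the reverse-ticks value itself is one variation on the suggestion above, aimed at the same-tick collision the question asks about. The constant is the value of .NET's `DateTime.MaxValue.Ticks`.

```python
import uuid
from datetime import datetime

# Value of .NET's DateTime.MaxValue.Ticks (ticks are 100 ns units since 0001-01-01).
DOTNET_MAX_TICKS = 3_155_378_975_999_999_999
_TICK_EPOCH = datetime(1, 1, 1)


def to_ticks(moment: datetime) -> int:
    """Convert a naive datetime to .NET-style ticks using integer arithmetic."""
    delta = moment - _TICK_EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10


def reverse_ticks_key(moment: datetime, unique: bool = True) -> str:
    """Build a zero-padded reverse-ticks key so newer entries sort first.

    With unique=True a GUID suffix is appended, so two actions recorded in the
    same tick still produce different keys.
    """
    key = f"{DOTNET_MAX_TICKS - to_ticks(moment):019d}"
    return f"{key}_{uuid.uuid4().hex}" if unique else key


row_key = reverse_ticks_key(datetime.utcnow())
```

The fixed 19-digit width matters because Table Storage compares keys as strings; without zero padding, lexical order would not match numeric order.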

Require help in creating design for cassandra data model for my requirement

I have a Job_Status table with the following columns:
Job_ID (numeric)
Job_Time (datetime)
Machine_ID (numeric)
A few other fields containing stats (such as memory and CPU utilization)
At a regular interval (say, every minute), entries are inserted into this table for the jobs running on each machine.
I want to design the data model in Cassandra.
My requirement is to get the list (pairs) of jobs that are running at the same time on two or more machines.
I have created a table with Job_Id and Job_Time as the primary key, but to get the desired result I have to do a lot of parsing of the data after retrieving the records.
This takes a lot of time once the number of records reaches around 500 thousand.
This requirement calls for something like a SQL inner join, but I can't use SQL for business reasons, and a SQL query over such a large data set also takes a long time, as I found when I tried it with dummy data in SQL Server.
So I need your help on the points below:
Kindly suggest an efficient data model in Cassandra for this requirement.
How can the SQL join operation be achieved or implemented in a Cassandra database?
Kindly suggest an alternate design or algorithm. I have been stuck on this problem for a very long time.
That's a pretty broad question. As a general approach you might want to look at pairing Cassandra with Spark so that you could do the large join in parallel.
You would insert jobs into your table when they start and delete them when they complete (possibly with a TTL set on insert so that jobs that don't get deleted will auto delete after some time).
When you want to update your pairing of jobs, you'd run a Spark batch job that loads the table data into an RDD and then either does a map/reduce operation on the data or uses Spark SQL to do a SQL-style join. You'd probably then write the resulting RDD back to a Cassandra table.
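The grouping that such a batch job would perform can be sketched in plain Python. This is not Spark code and it reads the requirement in one particular way, namely "a job sampled at the same time on two or more machines"; the function name and the tuple layout `(job_id, job_time, machine_id)` are assumptions for this example.

```python
from collections import defaultdict


def jobs_on_multiple_machines(samples, min_machines=2):
    """Return {(job_id, job_time): [machine_ids]} for jobs seen on several machines.

    samples is an iterable of (job_id, job_time, machine_id) tuples.
    """
    machines = defaultdict(set)
    # "Map" step: key every sample by (job, time) and collect its machines.
    for job_id, job_time, machine_id in samples:
        machines[(job_id, job_time)].add(machine_id)
    # "Reduce" step: keep only keys observed on at least min_machines machines.
    return {
        key: sorted(ids)
        for key, ids in machines.items()
        if len(ids) >= min_machines
    }


samples = [
    (1, "2015-01-01 10:00", 100),
    (1, "2015-01-01 10:00", 101),
    (2, "2015-01-01 10:00", 100),
    (1, "2015-01-01 10:01", 100),
]
result = jobs_on_multiple_machines(samples)
```

In Spark the same shape would be a group-by on (job id, job time) followed by a filter on the number of distinct machines, run in parallel across the cluster.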
