Reusing Dedupe training for Gazetteer matching - python-dedupe

I'm using the Dedupe library to clean up some data. However, once the first deduplication is done using the Dedupe object, I understand we are supposed to use the Gazetteer object to match any new incoming data against the clustered data.
For the sake of explaining the issue, let's assume that:
The first batch of data is 500k rows of restaurants, with name, address, and phone number fields.
The second batch of data is, for instance, 1k new restaurants that did not exist at the time, but that I now want to match against the first 500k.
If I describe the pipeline, it goes something like this:
Step 1) Initial deduplication
Train a Dedupe object on a sample of the 500k restaurants
Cluster the 500k rows with a Dedupe / StaticDedupe object
Step 2) Incremental deduplication
Train a Gazetteer object on a sample of the 500k restaurants vs 1k new restaurants
Match incoming 1k rows against 500k previous rows
Assign canonical ID according to the 1k rows that actually matched an existing restaurant
So, the questions are:
Is the pipeline actually correct?
Do I have to retrain the Gazetteer each time new data comes in?
Can't I use the same blocking rules that I learned during the first step? Or at least the same labelled pairs? Assuming, of course, that the fields are the same and the data goes through exactly the same preprocessing.
I understand I could keep redoing step 1, but from what I read, that is not considered best practice.
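For concreteness, here is roughly what I have in mind in code. This is only a sketch against the dedupe 2.x API as I understand it; restaurants_500k and new_restaurants_1k stand in for my two data dicts (record_id -> record, shown here with a couple of dummy rows), and whether the training file from step 1 can legitimately be fed into the Gazetteer like this is exactly what I'm asking:

import dedupe

fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
    {"field": "phone", "type": "String", "has missing": True},
]

restaurants_500k = {  # in reality ~500k records
    "r1": {"name": "Chez Marcel", "address": "12 Rue Cler", "phone": "0147050505"},
    "r2": {"name": "Chez Marcel Bistro", "address": "12 rue Cler", "phone": None},
}
new_restaurants_1k = {  # in reality ~1k new records
    "n1": {"name": "Chez  Marcel", "address": "12 Rue Cler, Paris", "phone": "0147050505"},
}

# Step 1: initial deduplication of the 500k rows
deduper = dedupe.Dedupe(fields)
deduper.prepare_training(restaurants_500k)
dedupe.console_label(deduper)                 # interactive labelling session
deduper.train()
clusters = deduper.partition(restaurants_500k, threshold=0.5)

# Persist what was learned in step 1
with open("training.json", "w") as tf:
    deduper.write_training(tf)
with open("settings.bin", "wb") as sf:
    deduper.write_settings(sf)

# Step 2: incremental matching of the 1k new rows
gazetteer = dedupe.Gazetteer(fields)
with open("training.json") as tf:
    # Can the labelled pairs from step 1 be reused here, or is fresh labelling required?
    gazetteer.prepare_training(restaurants_500k, new_restaurants_1k, training_file=tf)
gazetteer.train()
gazetteer.index(restaurants_500k)
matches = gazetteer.search(new_restaurants_1k, n_matches=1, generator=False)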
@fgregg I went through all the Stack Overflow and GitHub issues (the most recent one being this one), but could not find any helpful answers.
Thanks!

Related

Designing Twitter Search - How to sort large datasets?

I'm reading an article about how to design Twitter Search. The basic idea is to map tweets, based on their ids, to servers, where each server has the mapping
English word -> A set of tweetIds having this word
Now, if we want to find all the tweets that contain some word, all we need to do is query all the servers and aggregate the results. The article casually suggests that we can also sort the results by some parameter like "popularity", but isn't that a heavy task, especially if the word is a hot word?
What is done in practice in such search systems?
Maybe some tradeoffs are being made?
Thanks!
First of all, there are two types of indexes: local and global.
A local index is stored on the same computer as the tweet data. For example, you may have 10 shards, and each of these shards will have its own index, e.g. word "car" -> sorted list of tweet ids.
When a search runs, we have to send the query to every server, since we don't know in advance which shards hold the most popular tweets. That query asks every server to return its top results. All of these results are collected on the same box - the one executing the user request - and that process picks the top 10 out of the entire population.
Since each shard's results are already sorted in the index itself, picking the overall top 10 is cheap: a small heap (or watermarking pass) over the heads of the sorted per-shard lists yields each of the fixed number of returned tweets in time logarithmic in the number of shards, independent of how many tweets match in total.
The second nice property is pagination: the next query is also sent to every box, with one additional piece of data - "give me your top 10 with popularity below X", where X is the popularity of the last tweet already returned to the customer.
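As a minimal sketch of that merge step, assuming hypothetical per-shard result lists already sorted by popularity, Python's heapq does the k-way merge and the same "popularity below X" cursor handles pagination:

import heapq

# Hypothetical per-shard results: (popularity, tweet_id) pairs,
# each list already sorted by popularity descending.
shard_results = [
    [(980, "t1"), (750, "t4"), (120, "t9")],
    [(860, "t2"), (640, "t5")],
    [(990, "t3"), (300, "t7"), (90, "t8")],
]

def top_k(shards, k=10, below=None):
    """k-way merge of the sorted shard lists; 'below' is the
    'popularity below X' pagination cursor described above."""
    page = []
    for popularity, tweet_id in heapq.merge(*shards, reverse=True):
        if below is not None and popularity >= below:
            continue
        page.append((popularity, tweet_id))
        if len(page) == k:
            break
    return page

first_page = top_k(shard_results, k=3)                          # overall top 3
next_page = top_k(shard_results, k=3, below=first_page[-1][0])  # next page below the cursor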
A global index is a different beast - it does not have to live on the same boxes as the data (it can, but does not have to). In that case, when we search for a keyword, we know exactly where to look, and the index itself is also sorted, so it is fast to get the top 10 most popular results (or to paginate).
Since the global index returns only tweet ids and not the tweets themselves, we have to look up the tweet for every id - this is the N+1 problem: one query to get the list of ids and then one query per id. There are several ways to solve it; caching and data duplication are by far the most common approaches.
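As a hedged illustration of working around N+1, a cache-aside batch lookup could look like this (cache and tweet_store are hypothetical objects standing in for e.g. Redis and the tweet database):

# Hypothetical stores: 'cache' maps tweet_id -> tweet (get/set),
# 'tweet_store' supports a batched multi_get(ids) read.
def hydrate(tweet_ids, cache, tweet_store):
    """Turn the ids returned by the index into full tweets without
    issuing one query per id (the N+1 problem)."""
    tweets = {tid: cache.get(tid) for tid in tweet_ids}
    missing = [tid for tid, tweet in tweets.items() if tweet is None]
    if missing:
        # One batched read instead of len(missing) individual queries.
        for tid, tweet in tweet_store.multi_get(missing).items():
            cache.set(tid, tweet)
            tweets[tid] = tweet
    return [tweets[tid] for tid in tweet_ids]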

Mongoose: how to use index in aggregate?

How can I use indexes in aggregate?
I saw the documentation at https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes, which says:
The $match and $sort pipeline operators can take advantage of an index when they occur at the beginning of the pipeline.
Is there any way to use an index when these stages are not at the beginning of the pipeline - for example a later $sort, $match, or $group?
Please help me.
An index works by keeping a record of certain pieces of data that point to a given record in your collection. Think of it like having a novel, and then having a sheet of paper that lists the names of various people or locations in that novel with the page numbers where they're mentioned.
Aggregation is like taking that novel and transforming the different pages into an entirely different stream of information. You don't know where the new information is located until the transformation actually happens, so you can't possibly have an index on that transformed information.
In other words, it's impossible to use an index in any aggregation pipeline stage that is not at the very beginning, because by that point the data has been transformed and MongoDB has no way of knowing whether the transformed data could make efficient use of an index at all.
If your aggregation pipeline is too large to handle efficiently, then you need to limit the size of your pipeline in some way such that you can handle it more efficiently. Ideally this would mean having a $match stage that sufficiently limits the documents to a reasonably-sized subset. This isn't always possible, however, so additional effort may be required.
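For example, here is a sketch in pymongo (the same pipeline shape applies to Mongoose's Model.aggregate(); the shop database, orders collection, and field names are made up): the leading $match and $sort can use the index, and everything after them works on the reduced document set.

from pymongo import MongoClient, ASCENDING, DESCENDING

db = MongoClient()["shop"]                        # hypothetical database/collection names
db.orders.create_index([("status", ASCENDING), ("created_at", DESCENDING)])

pipeline = [
    # These first two stages can use the index above because they
    # come before any transformation of the documents.
    {"$match": {"status": "completed"}},
    {"$sort": {"created_at": -1}},
    # From here on the documents are reshaped, so later stages
    # cannot fall back on the collection's indexes.
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
]
results = list(db.orders.aggregate(pipeline))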
One possibility is generating "summary" documents that are the result of aggregating all new data together, then performing your primary aggregation pipeline using only these summary documents. For example, if you have a log of transactions in your system that you wish to aggregate, then you could generate a daily summary of the quantities and types of the different transactions that have been logged for the day, along with any other additional data you would need. You would then limit your aggregation pipeline to only these daily summary documents and avoid using the normal transaction documents.
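A hedged sketch of that summary-document idea, again with made-up transactions / daily_summaries collections; $merge requires MongoDB 4.2+, and older versions would use $out instead:

from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient()["shop"]                        # hypothetical database name
day_start = datetime(2023, 1, 1)                  # hypothetical day being summarised
day_end = day_start + timedelta(days=1)

db.transactions.aggregate([
    # Leading $match keeps this rollup itself index-friendly.
    {"$match": {"timestamp": {"$gte": day_start, "$lt": day_end}}},
    {"$group": {
        "_id": {
            "day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$timestamp"}},
            "type": "$type",
        },
        "count": {"$sum": 1},
        "total_amount": {"$sum": "$amount"},
    }},
    # Write the per-day rollup into a small summary collection; the main
    # reporting pipeline then aggregates daily_summaries instead of raw transactions.
    {"$merge": {"into": "daily_summaries", "whenMatched": "replace", "whenNotMatched": "insert"}},
])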
An actual solution is beyond the scope of this question, however. Just be aware that this index limitation is one you cannot avoid.

How can I implement an iterative optimization problem in Spark

Assume I have the following two sets of data. I'm attempting to associate products on hand with their rolled-up tallies. A rolled-up tally may cover products from multiple categories, with a primary and an alternate category. In a relational database I would load the second set of data into a temporary table, then use a stored procedure to iterate through the rollup data and decrement the quantities until they were zero or I had matched the tallies. I'm trying to implement a solution in Spark/PySpark and I'm not entirely sure where to start. I've attached a possible output solution that I'm trying to achieve, though I recognize there are multiple outputs that would work.
#Rolled Up Quantities#
owner,category,alternate_category,quantity
ABC,1,4,50
ABC,2,3,25
ABC,3,2,15
ABC,4,1,10
#Actual Stock On Hand#
owner,category,product_id,quantity
ABC,1,123,30
ABC,2,456,20
ABC,3,789,20
ABC,4,012,30
#Possible Solution#
owner,category,product_id,quantity
ABC,1,123,30
ABC,1,012,20
ABC,2,456,20
ABC,2,789,5
ABC,3,789,15
ABC,4,012,10
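To make the intent concrete, this is roughly the logic I would have put in the stored procedure, written here as plain sequential Python over the sample data above (just a sketch of what I want to express, not the Spark solution I'm after):

from collections import defaultdict

# Sample data from above: (owner, category, alternate_category, quantity)
rollups = [("ABC", 1, 4, 50), ("ABC", 2, 3, 25), ("ABC", 3, 2, 15), ("ABC", 4, 1, 10)]
# (owner, category, product_id, quantity)
stock = [("ABC", 1, "123", 30), ("ABC", 2, "456", 20), ("ABC", 3, "789", 20), ("ABC", 4, "012", 30)]

# Remaining stock per (owner, category), as mutable [product_id, quantity] pairs.
remaining = defaultdict(list)
for owner, cat, pid, qty in stock:
    remaining[(owner, cat)].append([pid, qty])

allocations = []  # (owner, rollup_category, product_id, quantity_taken)
for owner, cat, alt_cat, need in rollups:
    # Draw from the primary category first, then fall back to the alternate.
    for bucket in ((owner, cat), (owner, alt_cat)):
        for item in remaining[bucket]:
            if need == 0:
                break
            take = min(need, item[1])
            if take:
                allocations.append((owner, cat, item[0], take))
                item[1] -= take
                need -= take
        if need == 0:
            break

for row in allocations:
    print(",".join(str(x) for x in row))
# Prints the same rows as the possible solution above.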

Back-filling a feed?

Is there a way to insert activities into a feed so they appear as if they were inserted at a specific time in the past? I had assumed that when adding items to a feed it would use the 'time' value to sort the results, even when propagated to other feeds following the initial feed, but it seems that's not the case and they just get sorted by the order they were added to the feed.
I'm working on a timeline view for our users, and I have a couple of reasons for wanting to insert activities at previous points in time:
1) We have a large number of entities in our database but a relatively small number of them will be followed (especially at first), so to be more efficient I had planned to only add activities for an entity once it had at least one follower. Once somebody follows it, I would like to go back 14 days and insert activities for that entity as if they were created at the time they occurred, so the new follower would see them in their feed at the appropriate place. Currently they will just see a huge group of activities from the past at the top of their feed, which is not useful.
2) Similarly, we already have certain following relationships within our database and at launch I would like to go back a certain amount of time and insert activities for all entities that already have followers so that the feed is immediately useful.
Is there any way to do this, or am I out of luck?
My feeds are a combination of flat and aggregated feeds - the main timeline for a user is aggregated, but most entity feeds are flat. All of my aggregation groups would be based on the time of the activity so ideally there would be a way to sort the final aggregation groups by time as well.
Feeds on Stream are sorted differently depending on their type:
Flat feeds are sorted by activity time, descending
Aggregated feeds and Notification feeds sort activity groups by last-updated time (activities inside a group are sorted by time, descending)
This means that you can back-fill flat feeds but not aggregated feeds.
One possible way to get something similar to what you describe is to create the follow relationship with copy_limit set to a low number, so that only the most recent activities are propagated to followers.
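As a rough sketch with the stream-python client (placeholder credentials and made-up feed groups; the copy limit is exposed as activity_copy_limit in the clients I've seen, though the name may differ by version):

import datetime
import stream

# Placeholder credentials.
client = stream.connect("YOUR_API_KEY", "YOUR_API_SECRET")

# Flat entity feed: back-filled activities carry their historical 'time',
# and flat feeds order by that value.
entity_feed = client.feed("entity", "restaurant_42")
entity_feed.add_activity({
    "actor": "user:1",
    "verb": "review",
    "object": "review:99",
    "time": (datetime.datetime.utcnow() - datetime.timedelta(days=10)).isoformat(),
})

# The aggregated timeline follows the entity; limit how many old activities
# are copied across at follow time.
timeline = client.feed("timeline_aggregated", "user_7")
timeline.follow("entity", "restaurant_42", activity_copy_limit=10)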

Building a collaborative filtering recommendation engine using Spark mlLib

I am trying to build a recommendation engine based on collaborative filtering using Apache Spark. I have been able to run recommendation_example.py on my data, with quite good results (MSE ~ 0.9). Some of the specific questions that I have are:
How to make recommendations for users who have not done any activity on the site? Isn't there some API call for popular items, which would give me the most popular items based on user actions? One way to do this is to identify the popular items ourselves, catch the java.util.NoSuchElementException, and return those popular items.
How to reload the model after some data has been added to the input file? I am trying to reload the model using another function, which tries to save the model, but it gives an org.apache.hadoop.mapred.FileAlreadyExistsException. One way to do this is to listen for incoming data on a parallel thread, save it using model.save(sc, "target/tmp/<some target>"), and then reload the model after significant data has been received. I am lost here on how to achieve that.
It would be very helpful, if I could get some direction here.
For the first part, you can compute, for each item_id, the number of times that item appeared; Spark's map and reduceByKey functions work well for that. After that, take the top 10/20 items with the highest counts. You can also weight items by how recent they are.
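A rough PySpark sketch of that counting step, using a few inline sample lines in place of the real "user_id,item_id,rating" input file:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# In practice this would be sc.textFile("path/to/ratings.csv"); inline sample for illustration.
lines = sc.parallelize([
    "u1,item_1,5.0", "u2,item_1,4.0", "u3,item_2,3.0",
]).map(lambda line: line.split(","))

popular_items = (
    lines
    .map(lambda fields: (fields[1], 1))            # one count per (item_id, user action)
    .reduceByKey(lambda a, b: a + b)               # item_id -> total number of actions
    .takeOrdered(20, key=lambda pair: -pair[1])    # top 20 items by count
)
print(popular_items)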
For the second part, you can save the model under a new name every time. I generally build a folder name on the fly from the current date and time and use that same name to reload the model from the saved folder. You will always have to train the model again, using the past data plus the newly received data, and then use the retrained model to predict.
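A hedged sketch of that timestamped save / retrain / reload cycle with the MLlib ALS model (the base path, rank, and iteration count are placeholders):

import time
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel

def retrain_and_save(sc, ratings_rdd, base_path="models/als"):
    # Retrain on past + new data, then save under a fresh timestamped folder
    # to avoid FileAlreadyExistsException on an existing path.
    model = ALS.train(ratings_rdd, rank=10, iterations=10)
    path = "{}/{}".format(base_path, time.strftime("%Y%m%d-%H%M%S"))
    model.save(sc, path)
    return path

def load_model(sc, path):
    # Reload whichever saved folder you kept track of as the latest.
    return MatrixFactorizationModel.load(sc, path)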
Independent of platforms like Spark, there are some very good link-prediction techniques (for example, non-negative matrix factorization) that predict links between two sets.
Other very effective recommendation techniques are:
1. Thompson Sampling
2. MABs (Multi-Armed Bandits)
A lot depends on the raw dataset and how it is distributed. I would recommend applying the above methods to 5% of the raw dataset, building a hypothesis, using A/B testing, predicting links, and moving forward.
Again, all these techniques are platform-independent. I would also recommend starting from scratch instead of using platforms like Spark, which are only useful for large datasets; you can always move to such platforms in the future for scalability.
Hope it helps!
