Firestore, how to structure a "likedBy" query

Firestore, how to structure a "likedBy" query - node.js

I'm having a little trouble wrapping my head around how to best structure my (very simple) Firestore app. I have a set of users like this:
users: {
'A123': {
'name':'Adam'
},
'B234': {
'name':'Bella'
},
'C345': {
'name':'Charlie'
}
}
...and each user can 'like' or 'dislike' any number of other users (like Tinder).
I'd like to structure a "likes" table (or Firestore equivalent) so that I can list people who I haven't yet liked or disliked. My initial thought was to create a "likes" object within the user table with boolean values like this:
users: {
'A123': {
'name':'Adam',
'likedBy': {
'B234':true,
},
'disLikedBy': {
'C345':true
}
},
'B234': {
'name':'Bella'
},
'C345': {
'name':'Charlie'
}
}
That way if I am Charlie and I know my ID, I could list users that I haven't yet liked or disliked with:
var usersRef = firebase.firestore().collection('users')
.where('likedBy.C345','==',false)
.where('dislikedBy.C345','==',false)
This doesn't work (everyone gets listed) so I suspect that my approach is wrong, especially the '==false' part. Could someone please point me in the right direction of how to structure this? As a bonus extra question, what happens if somebody changes their name? Do I need to change all of the embedded "likedBy" data? Or could I use a cloud function to achieve this?
Thanks!

There isn't a perfect solution for this problem, but there are alternatives you can do depending on what trade-offs you want.
The options: Overscan vs Underscan
Remember that Cloud Firestore only allows queries that scale independent of the total size of your dataset.
This can be really helpful in preventing you from building something that works in test with 10 documents, but blows up as soon as you go to production and become popular. Unfortunately, this type of problem doesn't fit that scalable pattern and the more profiles you have, and the more likes people create, the longer it takes to answer the query you want here.
The solution then is to find a one or more queries that scale and most closely represent what you want. There are 2 options I can think of that make trade-offs in different ways:
Overscan --> Do a broader query and then filter on the client-side
Underscan --> Do one or more narrower queries that might miss a few results.
Overscan
In the Overscan option, you're basically trading increased cost to get 100% accuracy.
Given your use-case, I imagine this might actually be your best option. Since the total number of profiles is likely orders of magnitude larger than the number of profiles an individual has liked, the increased cost of overscanning is probably inconsequential.
Simply select all profiles that match any other conditions you have, and then on the client side, filter out any that the user has already liked.
First, get all the profiles liked by the user:
var likedUsers = firebase.firestore().collection('users')
.where('likedBy.C345','==',false)
Then get all users, checking against the first list and discarding anything that matches.
var allUsers = firebase.firestore().collection('users').get()
Depending on the scale, you'll probably want to optimize the first step, e.g. every time the user likes someone, update an array in a single document for that user for everyone they have liked. This way you can simply get a single document for the first step.
var likedUsers = firebase.firestore().collection('likedUsers')
.doc('C345').get()
Since this query does scale by the size of the result set (by defining the result set to be the data set), Cloud Firestore can answer it without a bunch of hidden unscalable work. The unscalable part is left to you to optimize (with 2 examples above).
Underscan
In the Underscan option, you're basically trading accuracy to get a narrower (hence cheaper) set of results.
This method is more complex, so you probably only want to consider it if for some reason the liked to unliked ratio is not as I suspect in the Overscan option.
The basic idea is to exclude someone if you've definitely liked them, and accept the trade-off that you might also exclude someone you haven't yet liked - yes, basically a Bloom filter.
In each users profile store a map of true/false values from 0 to m (we'll get to what m is later), where everything is set to false initially.
When a user likes the profile, calculate the hash of the user's ID to insert into the Bloom filter and set all those bits in the map to true.
So let's say C345 hashes to 0110 if we used m = 4, then your map would look like:
likedBy: {
0: false,
1: true,
2: true,
3: false }
Now, to find people you definitely haven't liked, you need use the same concept to do a query against each bit in the map. For any bit 0 to m that your hash is true on, query for it to be false:
var usersRef = firebase.firestore().collection('users')
.where('likedBy.1','==',false)
Etc. (This will get easier when we support OR queries in the future). Anyone who has a false value on a bit where your user's ID hashes to true definitely hasn't been liked by them.
Since it's unlikely you want to display ALL profiles, just enough to display a single page, you can probably randomly select a single one of the ID's hash bits that is true and just query against it. If you run out of profiles, just select another one that was true and restart.
Assuming most profiles are liked 500 or less times, you can keep the false positive ratio to ~20% or less using m = 1675.
There are handy online calculators to help you work out ratios of likes per profile, desired false positive ratio, and m, for example here.
Overscan - bonus
You'll quickly realize in the Overscan option that every time you run the query, the same profiles the user didn't like last time will be shown. I'm assuming you don't want that. Worse, all the ones the user liked will be early on in the query, meaning you'll end up having to skip them all the time and increase your costs.
There is an easy fix for that, use the method I describe on this question, Firestore: How to get random documents in a collection. This will enable you to pull random profiles from the set, giving you a more even distribution and reducing the chance of stumbling on lots of previously liked profiles.
Underscan - bonus
One problem I suspect you'll have with the Underscan option is really popular profiles. If someone is almost always liked, you might start exceeding the usefulness of a bloom filter if that profile has a size not reasonable to keep in a single document (you'll want m to be less than say 8000 to avoid running into per document index limits in Cloud Firestore).
For this problem, you want to combine the Overscan option just for these profiles. Using Cloud Functions, any profile that has more than x% of the map set to true gets a popular flag set to true. Overscan everyone on the popular flag and weave them into your results from the Underscan (remember to do the discard setup).

Related

Original Estimates in Azure DevOps - Max value?

Our evolution of using DevOps is continuing (slowly but surely). One thing we've noticed is that some people are trying to but excessive estimates in for their time, but what we really want to be encouraging is for people to be breaking work down into multiple tasks.
Is there a way that we can set our DevOps work items to only accept a maximum value? I've had a look at the 'rules' and there doesn't seem to be anything there to let us do this, and because it's an out of the box field I don't think we can put a value limit against it.
I suppose what I want to understand is whether it would be possible to do this in some way? Could I do something with the existing 'Original Estimate' field or would I have to create a new custom field to have any chance of preventing people from putting in 100 hours for something that's actually more like 2?

If you are also using Boards, you could highlight work items where the original estimate is higher than a certain value. This would not prevent setting these values, but rather encourage the users to put in lower values.
https://learn.microsoft.com/en-us/azure/devops/boards/boards/customize-cards?view=azure-devops
Beware that this might not really help the underlying issue: People must be convinced of the benefits of splitting up tasks, otherwise they will just work around the tooling. Like always putting in the maximum value or not putting in the actual work hours.

Is there a way that we can set our DevOps work items to only accept a
maximum value?
I am afraid that setting the value limit for the Original Estimates field is currently not supported.
As workaround, you could need to create a custom field of type Picklist, and then specify the available values in the picklist.
You could add your request for this feature on our UserVoice site , which is our main forum for product suggestions.After suggest raised, you can vote and add your comments for this feedback. The product team would provide the updates if they view it.

Correct structure for MongoDB

I am fairly new to mongoDB and databases in general and I am not sure what the correct/typical structure is for setting up different attributes.
For example, Lets say I have a person named Tim and he can play basketball, soccer, and tennis. How do you go about stating this? do you use booleans or store an array of strings?
This is how I think the format is..is this the correct way to think about it?
name: 'Tim',
sports: {
soccer: true,
tennis: true,
basketball: true,
football: false
}

Data modeling in MongoDB works differently than with RDBMS. A typical workflow with RDBMs is that you define your entities and their properties as well as their relations and then bang your head against the wall to get your „upper left above and beyond“™ JOINs right so that the data gives you the answers you need.
The typical workflow for NoSQL databases in general and MongoDB in particular is different. You identify the questions you need to get answered by your data first, and model your data so that these questions can be answered in the most efficient way. Hence, let us play this through.
You obviously have people, for which sports they participate in should be recorded. So the first question we have to ask ourselves is wether we embed the data or wether we reference the data. Since it is rather unlikely that we hit the 16MB document limit, embedding seems to be a good choice. Now, with staying with an object holding Boolean values for every sport imaginable (later down the road), not only to we introduce unnecessary verbosity, we add exactly 0 informational value in comparison to holding an array of sports:
{
Name: “Tim“,
Sports: [”Soccer”,“Tennis”,”Basketball”]
}
This makes it much easier to introduce new Sports later down the road (simply append said sport to the array of people doing it), does not pollute each and every document with irrelevant data and holds the same informational value.

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on Solr side if relevancy score is under a threshold. I'm not sure how to do this, and from what I've read this isn't a good idea anyway. e.g. If a result returns only 10 listings I would want to display them all instead of filter any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc... Also has same problems of the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks

If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and doesn't say anything about "this was xyz relevant!".
You'll have to decide why those documents that are included aren't relevant and exclude them based on that criteria, and then either use the review score as a way to boost them further up (if you want the search to appear organic / by relevance). Otherwise you can just exclude them and sort by user score. But remember that user score, as an experience for the user, is usually a harder problem to make relevant than just order by the average of the votes.

Usually the client can choose different ordering options, by relevance or ratings for example. But you are right that ordering by rating is probably not useful enough. What you could do is take into account the rating in the relevance scoring. For example, by multiplying an "organic" score with a rating transformed as a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not only what constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used besides only the content similarity. This opens a plethora of possibilities to define a retrieval model. For example, what has become popular are Learning to Rank (LTR) approaches where different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behavior. Solr offers this as module.

CouchDB car sale example, possible or not?

I was explaining to a friend of mine how great CouchDB was and I was actually doing a very good job of it, until he asked me if you could do a car sale database. After giving this quite a lot of thought, I have no answer, I kind of think this is impossible.
My dilemma is like this. Lets say a car has an owner_id, manufacturer, year, type, color, milage and price.
My first initial thought was to just emit all the keys. But you might want to search for a car that is blue or red or yellow and is driven between 30.000 and 80.000 miles and with some price range. And given this query, what if you don't search for color ?
The only way I can think of, is doing many queries, and doing a manual brute force diff array in my database layer code. But that seems to be quite excessive, even if there are only a few thousand cars.
So, in short, is this possible to do, in a viable way ?

CouchDB is designed for scalability, and infinite-flexibility ad-hoc queries are not scalable, so therefore it is discouraged. It is possible though, with temporary views. You can POST your view (query) as a JSON object to /db/_temp_view.
See for more details http://wiki.apache.org/couchdb/HTTP_view_API#Temporary_Views and https://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Concept.
Also, this answer might be of use to you.

As you say, there's no nice way to spell this in CouchDB, in the same way as you might do this in a relational database with support for spatial indices; That's not to say that there's no hope at all. You could, for instance, do some simple clustering to index cars that meet a certain set of attributes together.
Mileage and price look like good candidates for this approach. Most queries will probably specify these values to a single digit
function oneDigit(value) {
var strValue = String(value);
return (Number(strValue[0]) * Math.pow(10, strValue.length - 1);
}
With this, we can build a view which organizes cars into bins based on their price and mileage.
function (doc) {
emit([[oneDigit(doc.mileage), oneDigit(doc.price)], null]);
}
It's then a simple matter of getting all of the cars that have that feature:
for mileage in range(60000, 100000, 10000):
cars.append(db.view('cars/mileageAndPrice', startkey=[mileage, minPrice], endkey=[mileage + 10000, maxPrice]))

Dynamics CRM 2011 Import Data Duplication Rules

I have a requirement in which I need to import data from excel (CSV) to Dynamics CRM regularly.
Instead of using some simple Data Duplication Rules, I need to implement a point system to determine whether a data is considered duplicate or not.
Let me give an example. For example these are the particular rules for Import:
First Name, exact match, 10 pts
Last Name, exact match, 15 pts
Email, exact match, 20 pts
Mobile Phone, exact match, 5 pts
And then the Threshold value => 19 pts
Now, if a record have First Name and Last Name matched with an old record in the entity, the points will be 25 pts, which is higher than the threshold (19 pts), therefore the data is considered as Duplicate
If, for example, the particular record only have same First Name and Mobile Phone, the points will be 15 pts, which is lower than the threshold and thus considered as Non-Duplicate
What is the best approach to achieve this requirement? Is it possible to utilize the default functionality of Import Data in the MS CRM? Is there any 3rd party Add-on that answer my requirement above?
Thank you for all the help.
Updated
Hi Konrad, thank you for your suggestions, let me elaborate here:
Excel. You could filter out the data using Excel and then, once you've obtained a unique list, import it.
Nice one but I don't think it is really workable in my case, the data will be coming regularly from client in moderate numbers (hundreds to thousands). Typically client won't check about the duplication on the data.
Workflow. Run a process removing any instance calculated as a duplicate.
Workflow is a good idea, however since it is being processed asynchronously, my concern is the user in some cases may already do some update/changes to the data inserted, before the workflow finish working.. therefore creating some data inconsistency or at the very least confusing user experience
Plugin. On every creation of a new record, you'd check if it's to be regarded as duplicate-ish and cancel it's creation (or mark for removal).
I like this approach. So I just import like usual (for example, to contact entity), but I already have a plugin in place that getting triggered every time a record is created, the plugin will check whether the record is duplicat-ish or not and took necessary action.

I haven't been fiddling a lot with duplicate detection but looking at your criteria you might be able to make rules that match those, pretty much three rules to cover your cases, full name match, last name and mobile phone match and email match.
If you want to do the points system I haven't seen any out of the box components that solve this, however CRM Extensions have a product called Import Manager that might have that kind of duplicate detection. They claim to have customized duplicate checking. Might be worth asking them about this.
Otherwise it's custom coding that will solve this problem.

I can think of the following approaches to the task (depending on the number of records, repetitiveness of the import, automatization requirement etc.) they may be all good somehow. Would you care to elaborate on the current conditions?
Excel. You could filter out the data using Excel and then, once you've obtained a unique list, import it.
Plugin. On every creation of a new record, you'd check if it's to be regarded as duplicate-ish and cancel it's creation (or mark for removal).
Workflow. Run a process removing any instance calculated as a duplicate.
You also need to consider the implication of such elimination of data. There's a mathematical issue. Suppose that the uniqueness' radius (i.e. the threshold in this 1D case) is 3. Consider the following set of numbers (it's listed twice, just in different order).
1 3 5 7 -> 1 _ 5 _
3 1 5 7 -> _ 3 _ 7
Are you sure that's the intended result? Under some circumstances, you can even end up with sets of records of different sizes (only depending on the order). I'm a bit curious on why and how the setup came up.
Personally, I'd go with plugin, if the above is OK by you. If you need to make sure that some of the unique-ish elements never get omitted, you'd probably best of applying a test algorithm to a backup of the data. However, that may defeat it's purpose.
In fact, it sounds so interesting that I might create the solution for you (just to show it can be done) and blog about it. What's the dead-line?

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string