CouchDB car sale example, possible or not?

CouchDB car sale example, possible or not? - couchdb

I was explaining to a friend of mine how great CouchDB was and I was actually doing a very good job of it, until he asked me if you could do a car sale database. After giving this quite a lot of thought, I have no answer, I kind of think this is impossible.
My dilemma is like this. Lets say a car has an owner_id, manufacturer, year, type, color, milage and price.
My first initial thought was to just emit all the keys. But you might want to search for a car that is blue or red or yellow and is driven between 30.000 and 80.000 miles and with some price range. And given this query, what if you don't search for color ?
The only way I can think of, is doing many queries, and doing a manual brute force diff array in my database layer code. But that seems to be quite excessive, even if there are only a few thousand cars.
So, in short, is this possible to do, in a viable way ?

CouchDB is designed for scalability, and infinite-flexibility ad-hoc queries are not scalable, so therefore it is discouraged. It is possible though, with temporary views. You can POST your view (query) as a JSON object to /db/_temp_view.
See for more details http://wiki.apache.org/couchdb/HTTP_view_API#Temporary_Views and https://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Concept.
Also, this answer might be of use to you.

As you say, there's no nice way to spell this in CouchDB, in the same way as you might do this in a relational database with support for spatial indices; That's not to say that there's no hope at all. You could, for instance, do some simple clustering to index cars that meet a certain set of attributes together.
Mileage and price look like good candidates for this approach. Most queries will probably specify these values to a single digit
function oneDigit(value) {
var strValue = String(value);
return (Number(strValue[0]) * Math.pow(10, strValue.length - 1);
}
With this, we can build a view which organizes cars into bins based on their price and mileage.
function (doc) {
emit([[oneDigit(doc.mileage), oneDigit(doc.price)], null]);
}
It's then a simple matter of getting all of the cars that have that feature:
for mileage in range(60000, 100000, 10000):
cars.append(db.view('cars/mileageAndPrice', startkey=[mileage, minPrice], endkey=[mileage + 10000, maxPrice]))

Related

Correct structure for MongoDB

I am fairly new to mongoDB and databases in general and I am not sure what the correct/typical structure is for setting up different attributes.
For example, Lets say I have a person named Tim and he can play basketball, soccer, and tennis. How do you go about stating this? do you use booleans or store an array of strings?
This is how I think the format is..is this the correct way to think about it?
name: 'Tim',
sports: {
soccer: true,
tennis: true,
basketball: true,
football: false
}

Data modeling in MongoDB works differently than with RDBMS. A typical workflow with RDBMs is that you define your entities and their properties as well as their relations and then bang your head against the wall to get your „upper left above and beyond“™ JOINs right so that the data gives you the answers you need.
The typical workflow for NoSQL databases in general and MongoDB in particular is different. You identify the questions you need to get answered by your data first, and model your data so that these questions can be answered in the most efficient way. Hence, let us play this through.
You obviously have people, for which sports they participate in should be recorded. So the first question we have to ask ourselves is wether we embed the data or wether we reference the data. Since it is rather unlikely that we hit the 16MB document limit, embedding seems to be a good choice. Now, with staying with an object holding Boolean values for every sport imaginable (later down the road), not only to we introduce unnecessary verbosity, we add exactly 0 informational value in comparison to holding an array of sports:
{
Name: “Tim“,
Sports: [”Soccer”,“Tennis”,”Basketball”]
}
This makes it much easier to introduce new Sports later down the road (simply append said sport to the array of people doing it), does not pollute each and every document with irrelevant data and holds the same informational value.

Is my database design consistent with RDMS

I am working on my website where I sell concert tickets.
I am working on designing the part of the website where I generate tickets based on seat and rows available.
After some thinking and drawing I have to the conclusion that this design would be best for my problem.
I was wondering is this poor design or are there any improvements that I can make?
Thank you

I wouldn't expect to have a table of unbooked seats. A table of bookings seems more logical. Your concerts table looks questionable if you expect to have a series of dates for the same concert.
Perhaps you should first sketch out the key functions of your site as User Stories or Use Cases and list out the required attributes for each. That could give you a better set of requirements for your database design, e.g. what customer attributes; what about seat attributes such as restricted view, standing places or accessible places for the disabled.

Firestore, how to structure a "likedBy" query

I'm having a little trouble wrapping my head around how to best structure my (very simple) Firestore app. I have a set of users like this:
users: {
'A123': {
'name':'Adam'
},
'B234': {
'name':'Bella'
},
'C345': {
'name':'Charlie'
}
}
...and each user can 'like' or 'dislike' any number of other users (like Tinder).
I'd like to structure a "likes" table (or Firestore equivalent) so that I can list people who I haven't yet liked or disliked. My initial thought was to create a "likes" object within the user table with boolean values like this:
users: {
'A123': {
'name':'Adam',
'likedBy': {
'B234':true,
},
'disLikedBy': {
'C345':true
}
},
'B234': {
'name':'Bella'
},
'C345': {
'name':'Charlie'
}
}
That way if I am Charlie and I know my ID, I could list users that I haven't yet liked or disliked with:
var usersRef = firebase.firestore().collection('users')
.where('likedBy.C345','==',false)
.where('dislikedBy.C345','==',false)
This doesn't work (everyone gets listed) so I suspect that my approach is wrong, especially the '==false' part. Could someone please point me in the right direction of how to structure this? As a bonus extra question, what happens if somebody changes their name? Do I need to change all of the embedded "likedBy" data? Or could I use a cloud function to achieve this?
Thanks!

There isn't a perfect solution for this problem, but there are alternatives you can do depending on what trade-offs you want.
The options: Overscan vs Underscan
Remember that Cloud Firestore only allows queries that scale independent of the total size of your dataset.
This can be really helpful in preventing you from building something that works in test with 10 documents, but blows up as soon as you go to production and become popular. Unfortunately, this type of problem doesn't fit that scalable pattern and the more profiles you have, and the more likes people create, the longer it takes to answer the query you want here.
The solution then is to find a one or more queries that scale and most closely represent what you want. There are 2 options I can think of that make trade-offs in different ways:
Overscan --> Do a broader query and then filter on the client-side
Underscan --> Do one or more narrower queries that might miss a few results.
Overscan
In the Overscan option, you're basically trading increased cost to get 100% accuracy.
Given your use-case, I imagine this might actually be your best option. Since the total number of profiles is likely orders of magnitude larger than the number of profiles an individual has liked, the increased cost of overscanning is probably inconsequential.
Simply select all profiles that match any other conditions you have, and then on the client side, filter out any that the user has already liked.
First, get all the profiles liked by the user:
var likedUsers = firebase.firestore().collection('users')
.where('likedBy.C345','==',false)
Then get all users, checking against the first list and discarding anything that matches.
var allUsers = firebase.firestore().collection('users').get()
Depending on the scale, you'll probably want to optimize the first step, e.g. every time the user likes someone, update an array in a single document for that user for everyone they have liked. This way you can simply get a single document for the first step.
var likedUsers = firebase.firestore().collection('likedUsers')
.doc('C345').get()
Since this query does scale by the size of the result set (by defining the result set to be the data set), Cloud Firestore can answer it without a bunch of hidden unscalable work. The unscalable part is left to you to optimize (with 2 examples above).
Underscan
In the Underscan option, you're basically trading accuracy to get a narrower (hence cheaper) set of results.
This method is more complex, so you probably only want to consider it if for some reason the liked to unliked ratio is not as I suspect in the Overscan option.
The basic idea is to exclude someone if you've definitely liked them, and accept the trade-off that you might also exclude someone you haven't yet liked - yes, basically a Bloom filter.
In each users profile store a map of true/false values from 0 to m (we'll get to what m is later), where everything is set to false initially.
When a user likes the profile, calculate the hash of the user's ID to insert into the Bloom filter and set all those bits in the map to true.
So let's say C345 hashes to 0110 if we used m = 4, then your map would look like:
likedBy: {
0: false,
1: true,
2: true,
3: false }
Now, to find people you definitely haven't liked, you need use the same concept to do a query against each bit in the map. For any bit 0 to m that your hash is true on, query for it to be false:
var usersRef = firebase.firestore().collection('users')
.where('likedBy.1','==',false)
Etc. (This will get easier when we support OR queries in the future). Anyone who has a false value on a bit where your user's ID hashes to true definitely hasn't been liked by them.
Since it's unlikely you want to display ALL profiles, just enough to display a single page, you can probably randomly select a single one of the ID's hash bits that is true and just query against it. If you run out of profiles, just select another one that was true and restart.
Assuming most profiles are liked 500 or less times, you can keep the false positive ratio to ~20% or less using m = 1675.
There are handy online calculators to help you work out ratios of likes per profile, desired false positive ratio, and m, for example here.
Overscan - bonus
You'll quickly realize in the Overscan option that every time you run the query, the same profiles the user didn't like last time will be shown. I'm assuming you don't want that. Worse, all the ones the user liked will be early on in the query, meaning you'll end up having to skip them all the time and increase your costs.
There is an easy fix for that, use the method I describe on this question, Firestore: How to get random documents in a collection. This will enable you to pull random profiles from the set, giving you a more even distribution and reducing the chance of stumbling on lots of previously liked profiles.
Underscan - bonus
One problem I suspect you'll have with the Underscan option is really popular profiles. If someone is almost always liked, you might start exceeding the usefulness of a bloom filter if that profile has a size not reasonable to keep in a single document (you'll want m to be less than say 8000 to avoid running into per document index limits in Cloud Firestore).
For this problem, you want to combine the Overscan option just for these profiles. Using Cloud Functions, any profile that has more than x% of the map set to true gets a popular flag set to true. Overscan everyone on the popular flag and weave them into your results from the Underscan (remember to do the discard setup).

SOLR Query parameters to avoid flooding with the same manufacturer

I've been a long time browser here, but never have had a question that wasn't already asked. So here goes:
I've run into a problem using SOLR search where some searches on SOLR (let's say DVD Players) tend to return a lot of search results from the same manufacturer in the first 50 results.
Now assuming that I want to provide my end-user with the best experience searching, but also the best variety of products in my catalog, how would I go about providing a type of demerit to reduce the same brand from showing up in the search results more than 5 times. For the record I'm using a fairly standard DisMax search handler.
This logic would only be applied to extremely broad queries like 'DVD Players', or 'Hard Drives', and naturally I wouldn't use it to shape 'Samsung DVD Players' search results.
I don't know if SOLR has a nifty feature that does this automatically, or if I would have to start modifying search handler logic.

I haven't used this but I believe field collapsing / grouping would be what you want.
http://wiki.apache.org/solr/FieldCollapsing
If I understand this feature correctly it would group similar results kind of how http://news.google.com/ does it by grouping similar news stories.

Some ideas here, although I've not tried them myself.
You can use Carrot plugin for Solr to cluster search results lets say on manufacturer and then feed it to custom RequestHandler to re-order (cherry picking from each mfr. cluster) the result for diversity.
However, there is a downside to the approach that you may need to fetch larger than necessary and secondly the search results will be synthetic.

To achieve this is a lengthy and complex process but worth trying. Let's say the main field on which you are searching is a single field called title, first you'll need to make sure that all the documents containing "dvd player" in it have same score. This you can do by neglecting solr scoring parameteres like field norm (set omitNorms=true) & term frequency (write a solr plugin to neglect it) code attached..
Implementation Details:
1) compile the following class and put it into Solr WEB-INF/classes
package my.package;
import org.apache.lucene.search.DefaultSimilarity;
public class CustomSimilarity extends DefaultSimilarity {
public float tf(float freq) {
return freq > 0 ? 1.0f : 0.0f;
}
}
In solrconfig.xml use this new similarity class add
similarity class="my.package.CustomSimilarity"
All this will help you to make score for all the documents with "dvd player" in their title same. After that you can define one field of random type. Then when you query solr you can arrange first by score, then by the random field. Since score for all the documents containing DVD players would be same, results will get arranged by random field, giving the customer better variety of products in your catalog.

String Matching Algorithms

I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple", as well as "appel" and "ple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!

The Levenshtein distance should do.
Look around - there are implementations in many languages.

Levenshtein distance will do this.
Note: this is a distance, you have to calculate it to every string in your database, which can be a big problem if you have a lot of entries.
If you have this problem then record all the typos the users make (typo=no direct match) and offline build a correction database which contains all the typo->fix mappings. some companies do this even more clever, eg: google watches how users correct their own typos and learns the mappings from this.

Soundex or Metaphone might work.

I think what you are looking for is a huge field of Data Quality and Data Cleansing. I fear if you could find a python implementation regarding this as it has to be something which cleanses considerable amount of data in db which could be of business value.

Levensthein distance goes in the right direction but only half the way. There are several tricks to get it to use the half matches as well.
One would be to use a subsequence dynamic time warping (DTW is actually a generalization of levensthein distance). For this you relax the start and end cases when calcualting the cost matrix. If you only relax one of the conditions you can get autocompletion with spell checking. I am not sure if there is a python implementation available, but if you want to implement it for yourself it should not be more than 10-20 LOC.
The other idea would be to use a Trie for speed up, which can do DTW/Levensthein on multiple results simultaniously (huge speedup if your database is large). There is a paper on Levensthein on Tries at IEEE, so you can find the algorithm there. Again for this you would need to relax the final boundary condition, so you get partial matches. However since you step down in the trie you just need to check when you have fully consumed the input and then return all leaves.

check this one http://docs.python.org/library/difflib.html
it should help you

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string