Searching List<T> data


I'm trying to work out the best way to search a List based on, say, merchant name and amount. For example, consider the following Transaction class:
public class Transaction
{
    public string MerchantName;
    public double Amount;
    public int Rank;
}
The input for a search of the List might be "Acme" for merchant and "13.37" for amount, or "Amce" (Acme spelled incorrectly). Whether the merchant is spelled 100% correctly or slightly off, I would like to return all "Acme" transactions. Additionally, based on the closeness of the input amount and spelling to the actual Transaction amount and name, I'd like to assign a rank or weighting to the transactions so the UI layer can present them accordingly.
Conceptually I understand SoundEx and edit-distance type algorithms, but I have zero practical experience implementing them in code. I'm looking to draw on the community's experience and guidance here. I understand this may (or may not) be better suited to SQL (in my case SQL), but right now I'd like to see if this is achievable in the application code, in C#. I'm open to SQL suggestions though.
Thanks
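A rough sketch of the idea, written in Python rather than C# purely for illustration (the logic ports directly). It assumes plain Levenshtein edit distance for the name, a relative difference for the amount, and a weighted sum of the two as the rank. The 0.5 name threshold, the 0.7 name weight, and all function names are made-up values for the example, not anything prescribed by the question.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    merchant_name: str
    amount: float
    rank: float = 0.0


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(query: str, name: str) -> float:
    """1.0 for an exact (case-insensitive) match, falling towards 0.0."""
    q, n = query.lower(), name.lower()
    longest = max(len(q), len(n)) or 1
    return 1.0 - levenshtein(q, n) / longest


def amount_similarity(query: float, amount: float) -> float:
    """1.0 for an equal amount, falling towards 0.0 as the gap grows."""
    scale = max(abs(query), abs(amount), 1.0)
    return max(0.0, 1.0 - abs(query - amount) / scale)


def search(transactions, merchant, amount,
           min_name_similarity=0.5, name_weight=0.7):
    """Return fuzzy name matches with rank set, best match first."""
    hits = []
    for t in transactions:
        name_score = name_similarity(merchant, t.merchant_name)
        if name_score < min_name_similarity:
            continue  # spelling is too far off to count as the same merchant
        amount_score = amount_similarity(amount, t.amount)
        t.rank = name_weight * name_score + (1 - name_weight) * amount_score
        hits.append(t)
    return sorted(hits, key=lambda t: t.rank, reverse=True)
```

Note that plain Levenshtein counts the swap in "Amce" as two edits. A variant that treats adjacent transpositions as one edit (Damerau-Levenshtein) would score such typos more generously.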

Related

Is my database design consistent with RDBMS

I am working on my website where I sell concert tickets.
I am working on designing the part of the website where I generate tickets based on seat and rows available.
After some thinking and drawing I have come to the conclusion that this design would be best for my problem.
I was wondering whether this is a poor design, or whether there are any improvements I can make.
Thank you

I wouldn't expect to have a table of unbooked seats. A table of bookings seems more logical. Your concerts table looks questionable if you expect to have a series of dates for the same concert.
Perhaps you should first sketch out the key functions of your site as User Stories or Use Cases and list out the required attributes for each. That could give you a better set of requirements for your database design, e.g. what customer attributes; what about seat attributes such as restricted view, standing places or accessible places for the disabled.

Spark Item Similarity Interpretation (Cross-Similarity and Similarity)

I've been using Spark Item Similarity through mahout by following the steps in this article:
https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
I was able to clean my data, set up a local-only spark/hadoop node and all that.
Now, my question is more about the interpretation of the matrices. I've tried some Google queries with limited success.
I'm creating a multi-modal recommender - and one of my datasets is very similar to the Mahout example.
Example input:
Customer ActionName Product
11064612 view 241505
11086047 purchase 110915
11121878 view CERT_DL
11149030 purchase CERT_FS
11104130 view 111401
The output of mahout is 2 sets of matrices: a similarity matrix and a cooccurrence matrix.
This is my similarity matrix (I assume mahout uses my "filter1" purchases)
**791207-WP** 791520-WP:11.350536461453885 791520:9.547158147208393 76130142:7.938639976084232 711215:7.0641921646893024 751309:6.805891904514283
So how would I interpret this? If someone purchased 791207-WP they could be interested in 791520-WP? (so I'd use the left part against purchases of a customer and rank products in the right part?).
The row for 791520-WP looks like this:
791520-WP 76151220:18.954662238247693 791604-WP:13.951210170984268
So, in theory, I'd recommend 76151220 to someone who bought 791520-WP, correct?
Part 2 of the question is interpreting the cross-similarity matrix. Remember my filter2 is "views".
How would I interpret this:
**790907** 76120956:14.2824428207241 791500-LXQ2:13.864741460885853 190907:10.735807818360627
I take this matrix as "someone who visited the 76120956 web page ended up purchasing 790907". So I should promote 790907 to customers who bought 76120956 and maybe even add a link between these 2 products on our site, for example.
Or is it "people who visited the webpage of 790907 ended up buying 76120956"?
My plan is not to use these as-is. I'll still use RowSimilarity and different sources to rank products - but I'm missing the basic interpretation of the outputs from mahout.
If you know of any documentation that clarifies this, that would be a great asset to have.
Thank you.

In both cases the matrix is telling you that the item-id key is similar to the listed items, with the strength given by the LLR value attached to each similar item. Similar in the sense that similar users purchased the items. In the second case it is saying that similar people viewed the items and that this view also appears to have led to a purchase of the same item.
Cooccurrence works for purchases alone; cross-occurrence adds a check to make sure the view also correlated with a purchase. This allows you to use both for recommendations.
The output is generally meant to be used with a search engine: you would use a user's history of purchases and views as a 2 field query against the matrices, one in each field.
There are analogous methods to find item-based recommendations.
Better yet, use something like the Universal Recommender here: actionml.com/docs/ur with PredictionIO for an end-to-end system.
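To make the "2 field query" idea concrete, here is a small in-memory sketch. It assumes each output row is read as "recommend the row key to users whose history contains the listed items", and it replaces the search engine's scoring with a plain count of matching items, which is a simplification. The function names and the use of Python are illustrative only.

```python
def parse_row(line):
    """Parse 'item id1:llr id2:llr ...' into (item, {id: llr})."""
    item, *pairs = line.split()
    indicators = {}
    for pair in pairs:
        other, _, score = pair.rpartition(":")
        indicators[other] = float(score)
    return item, indicators


def recommend(purchase_rows, view_rows, purchased, viewed, top_n=5):
    """Rank items whose indicator lists overlap the user's history."""
    purchase_ind = dict(parse_row(r) for r in purchase_rows)
    view_ind = dict(parse_row(r) for r in view_rows)
    scores = {}
    for item in set(purchase_ind) | set(view_ind):
        if item in purchased:
            continue  # do not recommend what the user already bought
        # One "field" per matrix: purchases against the similarity rows,
        # views against the cross-similarity rows.
        hits = len(purchased.intersection(purchase_ind.get(item, {})))
        hits += len(viewed.intersection(view_ind.get(item, {})))
        if hits:
            scores[item] = hits
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


purchase_rows = [
    "791207-WP 791520-WP:11.350536461453885 791520:9.547158147208393",
    "791520-WP 76151220:18.954662238247693 791604-WP:13.951210170984268",
]
view_rows = ["790907 76120956:14.2824428207241 791500-LXQ2:13.864741460885853"]
print(recommend(purchase_rows, view_rows,
                purchased={"791520-WP"}, viewed={"76120956"}))
```

Under this reading, a user who viewed 76120956 is a candidate for a 790907 recommendation, and a user who bought 791520-WP is a candidate for 791207-WP.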

CouchDB car sale example, possible or not?

I was explaining to a friend of mine how great CouchDB was, and I was actually doing a very good job of it, until he asked me if you could do a car sale database. After giving this quite a lot of thought, I have no answer; I kind of think this is impossible.
My dilemma is this. Let's say a car has an owner_id, manufacturer, year, type, color, mileage and price.
My first thought was to just emit all the keys. But you might want to search for a car that is blue or red or yellow, has been driven between 30.000 and 80.000 miles, and falls in some price range. And given this query, what if you don't search for color?
The only way I can think of is doing many queries and doing a manual brute-force diff of the arrays in my database layer code. But that seems quite excessive, even if there are only a few thousand cars.
So, in short, is this possible to do in a viable way?

CouchDB is designed for scalability, and infinitely flexible ad-hoc queries are not scalable, so they are discouraged. They are possible, though, with temporary views: you can POST your view (query) as a JSON object to /db/_temp_view.
See for more details http://wiki.apache.org/couchdb/HTTP_view_API#Temporary_Views and https://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Concept.
Also, this answer might be of use to you.

As you say, there's no nice way to spell this in CouchDB the way you might in a relational database with support for spatial indices. That's not to say there's no hope at all: you could, for instance, do some simple clustering to index cars that share a certain set of attributes together.
Mileage and price look like good candidates for this approach. Most queries will probably specify these values to a single significant digit:
// Keep only the leading digit, e.g. 73500 -> 70000.
function oneDigit(value) {
    var strValue = String(value);
    return Number(strValue[0]) * Math.pow(10, strValue.length - 1);
}
With this, we can build a view which organizes cars into bins based on their price and mileage.
function (doc) {
    // key = [mileage bin, price bin]; no value is needed
    emit([oneDigit(doc.mileage), oneDigit(doc.price)], null);
}
It's then a simple matter of getting all of the cars that have that feature:
cars = []
for mileage in range(60000, 100000, 10000):
    # one range query per mileage bin; start and end keys share the same bin
    cars.extend(db.view('cars/mileageAndPrice', startkey=[mileage, minPrice], endkey=[mileage, maxPrice]))
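To see what the binning buys you without a running CouchDB, here is a small in-memory sketch of the same idea in Python. A dictionary stands in for the view, and the bins only narrow down the candidates; the exact bounds are still checked afterwards. The names build_index and find are invented for the example, and values are assumed to be non-negative numbers.

```python
from collections import defaultdict


def one_digit(value):
    """Keep only the leading digit: 73500 -> 70000, 8200 -> 8000."""
    digits = str(int(value))
    return int(digits[0]) * 10 ** (len(digits) - 1)


def build_index(cars):
    """Group cars by (mileage bin, price bin), like the view's keys."""
    index = defaultdict(list)
    for car in cars:
        index[(one_digit(car["mileage"]), one_digit(car["price"]))].append(car)
    return index


def find(index, min_mileage, max_mileage, min_price, max_price):
    """Visit only the bins that can contain matches, then filter exactly."""
    results = []
    for (mileage_bin, price_bin), cars in index.items():
        if not one_digit(min_mileage) <= mileage_bin <= one_digit(max_mileage):
            continue
        if not one_digit(min_price) <= price_bin <= one_digit(max_price):
            continue
        results.extend(
            car for car in cars
            if min_mileage <= car["mileage"] <= max_mileage
            and min_price <= car["price"] <= max_price
        )
    return results
```

Optional criteria such as color can then be applied to the (much smaller) candidate list in application code.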

SOLR Query parameters to avoid flooding with the same manufacturer

I've been a long time browser here, but never have had a question that wasn't already asked. So here goes:
I've run into a problem using SOLR search where some searches on SOLR (let's say DVD Players) tend to return a lot of search results from the same manufacturer in the first 50 results.
Now, assuming that I want to provide my end-user with the best search experience, but also the best variety of products in my catalog, how would I go about applying a type of demerit to keep the same brand from showing up in the search results more than 5 times? For the record, I'm using a fairly standard DisMax search handler.
This logic would only be applied to extremely broad queries like 'DVD Players', or 'Hard Drives', and naturally I wouldn't use it to shape 'Samsung DVD Players' search results.
I don't know if SOLR has a nifty feature that does this automatically, or if I would have to start modifying search handler logic.

I haven't used this but I believe field collapsing / grouping would be what you want.
http://wiki.apache.org/solr/FieldCollapsing
If I understand this feature correctly it would group similar results kind of how http://news.google.com/ does it by grouping similar news stories.
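As a rough illustration, the grouping request could be built like this. The parameters group, group.field and group.limit are Solr's result-grouping parameters; the host, the core name "products" and the field name "manufacturer" are assumptions made up for the example.

```python
from urllib.parse import urlencode


def grouped_query_url(base_url, query, group_field="manufacturer", per_group=5):
    """Build a select URL that returns at most per_group docs per group."""
    params = {
        "q": query,
        "defType": "dismax",         # the question uses a DisMax handler
        "group": "true",             # turn result grouping on
        "group.field": group_field,  # hypothetical field name
        "group.limit": per_group,    # max documents returned per group
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and core name.
print(grouped_query_url("http://localhost:8983/solr/products", "DVD Players"))
```

The response then comes back organized by manufacturer rather than as one flat list, so the UI would still need to decide how to interleave the groups.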

Some ideas here, although I've not tried them myself.
You can use the Carrot2 clustering plugin for Solr to cluster search results, let's say on manufacturer, and then feed the clusters to a custom RequestHandler that re-orders the results for diversity by cherry-picking from each manufacturer's cluster.
However, this approach has two downsides: you may need to fetch more results than necessary, and the resulting order will be synthetic rather than purely relevance-based.
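The cherry-picking step itself is simple. Below is a sketch in Python of a round-robin re-ordering over an already-fetched, relevance-sorted result list. It groups by an assumed "manufacturer" key rather than using real cluster output, and the function name is invented for the example.

```python
from collections import deque


def diversify(results, key="manufacturer"):
    """Round-robin across groups while keeping each group's own order."""
    groups = {}
    for doc in results:  # results are assumed to be sorted by relevance
        groups.setdefault(doc[key], deque()).append(doc)
    queues = list(groups.values())  # groups in order of first appearance
    reordered = []
    while queues:
        for queue in queues:
            reordered.append(queue.popleft())  # best remaining doc per group
        queues = [queue for queue in queues if queue]
    return reordered
```

Because the best document of every manufacturer is taken before any manufacturer's second-best, a single brand can no longer fill the top of the list. The cost is the one noted above: enough results must be fetched up front for the smaller brands to appear at all.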

Achieving this is a lengthy and complex process, but it is worth trying. Let's say the main field you are searching on is a single field called title. First you'll need to make sure that all documents containing "dvd player" in it get the same score. You can do this by disabling Solr scoring factors such as the field norm (set omitNorms=true) and term frequency (write a Solr plugin that ignores it; code below).
Implementation details:
1) Compile the following class and put it into Solr's WEB-INF/classes:
package mypackage;

import org.apache.lucene.search.DefaultSimilarity;

// Scores a term as present (1.0) or absent (0.0), ignoring how often it occurs.
public class CustomSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}
2) In schema.xml, register the new similarity class by adding:
<similarity class="mypackage.CustomSimilarity"/>
All of this makes the score identical for every document with "dvd player" in its title. After that, define one field of the random type. Then, when you query Solr, sort first by score and then by the random field. Since the score is the same for all documents containing "dvd player", the results will be ordered by the random field, giving the customer a better variety of products from your catalog.

How do I sort search results by relevance?

I'm working on a project which searches through a database, then sorts the search results by relevance, according to a string the user inputs. I think my current search is fairly decent, but the comparator I wrote to sort the results by relevance is giving me funny results. I don’t know what to consider relevant. I know this is a big branch of information retrieval, but I have no idea where to start finding examples of searches which sort objects by relevance and would appreciate any feedback.
To give a little more background about my specific issue, the user will input a string in a website database, which stores objects (items in the store) with various fields, such as a minor and major classification (for example, an XBox 360 game might be stored with major=video_games and minor=xbox360 fields along with its specific name). The four main fields that I think should be considered in the search are the specific name, major, minor, and genre of the type of object, if that helps.

In case you don't want to use Lucene/Solr, you can always use distance metrics to measure the similarity between the query and the rows retrieved from the database. Once you have a score for each row, you can sort by it, and the rows can then be considered sorted by relevance.
This is roughly what happens behind the scenes in Lucene. You can use simple similarity metrics such as Manhattan distance or Euclidean distance between points in n-dimensional space. Look up the Lucene scoring formula for more insight.
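As a starting point, here is a small Python sketch of that approach. It swaps the distance metrics named above for Jaccard token overlap, computed per field and combined with field weights. The weights, field names and helper names are assumptions for illustration and would need tuning against real data.

```python
import re

# Hypothetical weights: a hit in the name counts more than one in a category.
FIELD_WEIGHTS = {"name": 3.0, "genre": 1.5, "minor": 1.5, "major": 1.0}


def tokens(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Shared tokens divided by all tokens; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def relevance(query, item):
    """Weighted sum of per-field similarity between query and item."""
    q = tokens(query)
    return sum(weight * jaccard(q, tokens(item.get(field, "")))
               for field, weight in FIELD_WEIGHTS.items())


def rank(query, items):
    """Return items sorted by descending relevance to the query."""
    return sorted(items, key=lambda item: relevance(query, item), reverse=True)
```

Sorting by a numeric score like this avoids hand-written pairwise comparators, which are a common source of inconsistent orderings.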
