Difference between a search engine's relevance rankings and a recommender system [closed]

What's the difference between a search engine's relevance rankings and a recommender system?
Don't both try to achieve the same purpose, i.e. finding the most relevant items for the user?

There is a major difference between a search engine and a recommender system:
In a search engine, the user knows what he is looking for, and he makes the query! For instance, I might wonder whether I should go to see a movie, and search for information about it, such as its actors and director.
In a recommender system, the user isn't supposed to know in advance what we are recommending to her. We match her tastes against those of similar users (or use whatever algorithm you like) and find things she wouldn't have looked for herself, like a new movie!
One is more about information retrieval, while the other is more about information filtering and discovery.
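A minimal sketch of that contrast (the catalogue, tags and function names here are all hypothetical, just for illustration): search filters a collection by the user's explicit query, while a recommender scores unseen items against the user's past behaviour.

```python
# Hypothetical toy catalogue: item -> set of descriptive tags.
CATALOGUE = {
    "The Matrix":   {"sci-fi", "action", "dystopia"},
    "Inception":    {"sci-fi", "thriller", "dreams"},
    "Notting Hill": {"romance", "comedy", "london"},
}

def search(query_terms):
    """Information retrieval: return items matching the explicit query."""
    terms = set(query_terms)
    return [item for item, tags in CATALOGUE.items() if terms & tags]

def recommend(liked_items, top_n=1):
    """Information filtering: rank unseen items by overlap with the user's taste."""
    profile = set().union(*(CATALOGUE[i] for i in liked_items))
    candidates = [(len(CATALOGUE[i] & profile), i)
                  for i in CATALOGUE if i not in liked_items]
    return [item for _, item in sorted(candidates, reverse=True)[:top_n]]

print(search(["sci-fi"]))         # the user asked for sci-fi explicitly
print(recommend(["The Matrix"]))  # the user never asked; we infer from taste
```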

No, they work at two different levels of analysis.
A search engine looks into a collection of data to return items that match a query, even if all the results are identical or don't change from day to day. It is very much like a specialized form of database.
A recommender system uses information about you to provide content tailored specifically to you, much like a servant who knows you well and uses a search engine on your behalf.
Beware: some tools that started out as web search engines are now more like recommender systems.

Related

Looking for ICD10 API [closed]

Does anybody know of a good ICD-10 API for diagnostic code lookups that they can recommend? I am currently building a simple app to tag patients with medical conditions, and the idea is to have a lookup API where one can type "asthma", for example, and get back all the different ICD-10 codes for asthma.
My R package, icd, converts ICD-9 and ICD-10 codes to descriptions, in addition to its main purpose of finding comorbidities. Documentation is at https://jackwasey.github.io/icd/ and code is at https://github.com/jackwasey/icd . It does this using the function explain_code. It currently uses ICD-10-CM, i.e. the USA billing-adapted ICD-10 code set, which in general is more specific than the canonical WHO version, but does have some areas of less detail.
E.g. WHO ICD-10 has "HIV disease resulting in Pneumocystis jirovecii pneumonia" as a subdivision of HIV infection, whereas ICD-10-CM just has HIV. On the other hand, ICD-10-CM has "Sucked into jet engine, subsequent encounter", whereas the WHO is happy with the terribly vague "Person on ground injured in air transport accident".
The volume of data for all the descriptions is not very high, just a handful of megabytes, so although an API may seem convenient, you might consider just keeping all the data locally and not having to ping some random server.
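To make the "just keep the data locally" suggestion concrete, here is a minimal sketch. It assumes you have exported the ICD-10-CM codes and descriptions into a two-column CSV; the file name and column layout are my own assumptions, not part of any particular package.

```python
import csv

def load_codes(path="icd10cm_codes.csv"):
    """Load (code, description) pairs from a local CSV export (assumed layout)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [(row[0], row[1]) for row in csv.reader(f) if len(row) >= 2]

def lookup(codes, term):
    """Case-insensitive substring match on the description, e.g. 'asthma'."""
    term = term.lower()
    return [(code, desc) for code, desc in codes if term in desc.lower()]

if __name__ == "__main__":
    codes = load_codes()
    for code, desc in lookup(codes, "asthma"):
        print(code, desc)
```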
I'm going to assume you're ignoring all of the usual stuff around variations of spelling of medical terms, proper terms vs. colloquialisms, labels vs. descriptions, etc. that get to be a pain with term / code finders.
If you want to use a hosted option and are OK with the terms of use, you could use UMLS (https://uts.nlm.nih.gov/home.html#apidocumentation). It's a great resource, but the use case you're describing isn't necessarily what it's intended to address.
Personally - and I usually don't like to roll my own stuff - I'd consider doing your own thing. You could do something focused on your needs and tailor it to any specific behaviors you might want (like preferring specific codes based on an organization's billing preference, for example). You could also probably make it far, far more ... perky ... and address short forms of terms (e.g. synonyms like "DVT") or misspellings ("asthma" vs. "athsma"). If you go that route, I'd suggest getting your hands on the ICD-10 code info and then mashing it into Elasticsearch. You could extend the data by mixing it with other info and really make it hum. And Elastic is wicked fast.
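A rough sketch of that approach, assuming a local Elasticsearch instance and some already-parsed (code, description) data; the index name, analyzer settings, synonym list and sample documents below are illustrative assumptions, and exact client call signatures vary between Elasticsearch client versions.

```python
from elasticsearch import Elasticsearch  # call signatures differ across client versions

es = Elasticsearch("http://localhost:9200")

# Illustrative index with a synonym filter and an analyzed text field.
es.indices.create(index="icd10", body={
    "settings": {
        "analysis": {
            "filter": {
                "med_synonyms": {"type": "synonym",
                                 "synonyms": ["dvt => deep vein thrombosis"]}
            },
            "analyzer": {
                "med_text": {"tokenizer": "standard",
                             "filter": ["lowercase", "med_synonyms"]}
            }
        }
    },
    "mappings": {"properties": {
        "code": {"type": "keyword"},
        "description": {"type": "text", "analyzer": "med_text"}
    }}
})

# Index an example document (real data would come from the ICD-10-CM release files).
es.index(index="icd10", body={"code": "J45.909",
                              "description": "Unspecified asthma, uncomplicated"})
es.indices.refresh(index="icd10")

# Fuzzy matching tolerates misspellings such as "athsma".
hits = es.search(index="icd10", body={
    "query": {"match": {"description": {"query": "athsma", "fuzziness": "AUTO"}}}
})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["code"], hit["_source"]["description"], hit["_score"])
```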
That's just my $0.02, though.
There is a project called the Unified Medical Language System (UMLS), funded by the NIH, and apparently they are working on a RESTful Web API for medical terms.
https://documentation.uts.nlm.nih.gov/rest/home.html
I haven't worked with their API yet, and the samples I see on their website suggest it is more SNOMED-CT oriented.
The option I would go for is to get the whole ICD-10-CM from CMS and build my own Web API.
https://www.cms.gov/Medicare/Coding/ICD10/2016-ICD-10-CM-and-GEMs.html
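A minimal sketch of that do-it-yourself route, assuming the CMS release files have already been parsed into a code-to-description mapping; Flask and the route shape are my own choices here, not something the CMS files prescribe.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In practice this dict would be parsed from the CMS ICD-10-CM release files.
CODES = {
    "J45.909": "Unspecified asthma, uncomplicated",
    "J45.901": "Unspecified asthma with (acute) exacerbation",
}

@app.route("/icd10/search")
def search_codes():
    """Return all codes whose description contains the query term."""
    term = request.args.get("q", "").lower()
    matches = {c: d for c, d in CODES.items() if term and term in d.lower()}
    return jsonify(matches)

if __name__ == "__main__":
    app.run(port=5000)  # e.g. GET /icd10/search?q=asthma
```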
You can check the full documentation of the WHO ICD API at https://icd.who.int/icdapi

Approximate string matching algorithms state-of-the-art [closed]

I am looking for state-of-the-art algorithms for approximate string matching.
Can you point me to references (articles, theses, ...)?
Thank you.
You might have got your answer already, but I want to share my points on approximate string matching so that others might benefit. I am speaking from experience, having worked on cloud-service problems with really large-scale requirements.
If we just want to talk about approximate string matching algorithms, there are many.
A few of them are:
Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc.
A quick Google search will give you the details of each.
The irony is that they work well when matching two given input strings - fine in theory, and for demonstrating how fuzzy or approximate string matching works.
However, a grossly understated point is how to use them in production settings. Not everybody I know who was scouting for an approximate string matching algorithm knew how to apply it in a production environment.
Assume we have a list of millions of names: searching a given input name against every entry in that list using one of the standard algorithms above would mean disaster.
A typical edit distance algorithm has a time complexity of O(N^2), where N is the number of characters in a string. To scan a list of size M, the complexity becomes O(M * N^2). That means very high hardware requirements, and it just doesn't work in your favor regardless of how much hardware you stack up.
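For reference, a straightforward dynamic-programming Levenshtein implementation looks like this; comparing two strings of lengths N and M costs O(N * M) time, which is exactly what makes a brute-force scan over millions of names so expensive.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("asthma", "athsma"))  # 2 (a transposition counts as two edits)
```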
This is where we have to start thinking about other approaches.
One of the common approaches for solving such a problem in a production environment is to use a standard search engine like Apache Lucene.
https://lucene.apache.org/
The Lucene indexing engine indexes the reference data (as "documents"), and input queries can then be fired against the engine. The results that come back are ranked by how close they are to the input.
This is close to how the Google search engine works. Google crawls and indexes the whole web, but you can have a miniature system mimicking what Google does.
This works for most cases, including complicated name matching where the first, middle and last names are interchanged.
You can select your results based on the scores emitted by Lucene.
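A hedged sketch of why the index-based approach scales: instead of scanning all M names, a Lucene-like engine first narrows the search to candidates sharing tokens (or n-grams) with the query, and only then ranks that much smaller set. The toy version below uses token overlap for candidate retrieval and a standard-library similarity ratio for re-ranking; real Lucene scoring is far more sophisticated.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def build_index(names):
    """Map each lowercase token to the set of names containing it."""
    index = defaultdict(set)
    for name in names:
        for token in name.lower().split():
            index[token].add(name)
    return index

def search(index, query, top_n=3):
    """Retrieve candidates sharing a token with the query, then rank by similarity."""
    candidates = set()
    for token in query.lower().split():
        candidates |= index.get(token, set())
    score = lambda n: SequenceMatcher(None, query.lower(), n.lower()).ratio()
    return sorted(candidates, key=score, reverse=True)[:top_n]

names = ["John Ronald Smith", "Smith John", "Jane Doe", "Jon Smyth"]
idx = build_index(names)
print(search(idx, "john smith"))  # also finds the interchanged "Smith John"
```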
As you mature in your role, you will start thinking about hosted solutions like Amazon CloudSearch, which wraps Solr and Elasticsearch for you. Of course it uses Lucene underneath, and it keeps you independent of the potential size of the index as the reference data used for indexing grows.
http://aws.amazon.com/cloudsearch/
You might want to read about Levenshtein distance.
http://en.wikipedia.org/wiki/Levenshtein_distance

design a search function for high scale [closed]

(not sure if this is the right forum for this question)
I am very curious about how search in a major site, say YouTube/Quora/StackExchange, works.
I'm NOT looking for an answer like "they use the Lucene search engine"; I want to understand exactly how the indexing works there.
Is there a different index for text search than for the autocomplete feature?
Is it done in the background, like map reduce?
How exactly does map reduce help deliver results? (I know that it counts words in each document, but what happens after that when I search for a keyword?)
I also heard that Google stopped using map reduce and is now using Cloud Dataflow - how does that work?
Help Please :-)
I voted to close, because I think your question is too broad; each bullet could form the basis of an SO question. That said, I'll take a crack at answering how SolrCloud attempts to solve each of the problems you are asking about:
Is there a different Index for text search than the autocomplete feature?
The short answer is "yes". Solr has several options for implementing an autocomplete feature and all of them rely on either building a separate index or being supplied a separate dictionary. You can also roll your own in an even more sophisticated fashion as the blog post "Super flexible AutoComplete with Solr" demonstrates.
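To make the "separate structure" point concrete, here is a minimal prefix-completion sketch (a sorted list plus binary search) that is built and queried independently of the main search index; Solr's suggesters are of course far richer than this.

```python
import bisect

class PrefixSuggester:
    """A tiny autocomplete dictionary kept separate from the main index."""

    def __init__(self, terms):
        self.terms = sorted(t.lower() for t in terms)

    def suggest(self, prefix, limit=5):
        prefix = prefix.lower()
        start = bisect.bisect_left(self.terms, prefix)
        out = []
        for term in self.terms[start:]:
            if not term.startswith(prefix) or len(out) >= limit:
                break
            out.append(term)
        return out

suggester = PrefixSuggester(["solr", "solrcloud", "sharding", "search", "stackexchange"])
print(suggester.suggest("so"))  # ['solr', 'solrcloud']
```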
Is it done in the background like map reduce?
Generally speaking, no. SolrCloud is based on the idea of shards with leaders and replicas: a shard is a subset of your overall index, made up of a leader and possibly one or more replicas.
Queries are executed against all shard leaders, with one particular shard assigned to act as the aggregator of each shard's response. But unlike map reduce, where the individual node responses contain all the data the reducing node needs, the aggregating Solr shard may make multiple requests back to the other shards to figure out sort order, for example.
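A rough sketch of that scatter-gather pattern (the shard contents and scores below are made up): the aggregator fans the query out to every shard, then merges the per-shard top hits into one ranked response, possibly going back to shards for more detail.

```python
import heapq

# Pretend each shard holds part of the index and returns (score, doc_id) pairs.
SHARDS = [
    [(0.91, "doc-17"), (0.40, "doc-3")],
    [(0.87, "doc-52"), (0.61, "doc-44")],
    [(0.95, "doc-8")],
]

def query_shard(shard, query):
    """Stand-in for a per-shard search; a real shard would score against 'query'."""
    return shard

def scatter_gather(query, k=3):
    """Fan out to all shards, then merge their hits into one ranked list."""
    per_shard = [query_shard(s, query) for s in SHARDS]            # scatter
    merged = heapq.merge(*[sorted(r, reverse=True) for r in per_shard],
                         reverse=True)                             # gather
    return [doc for _, doc in list(merged)[:k]]

print(scatter_gather("solr sharding"))  # ['doc-8', 'doc-17', 'doc-52']
```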
How exactly does map reduce help deliver results? (I know that it counts words in each document but what happens after that when I search for a keyword?)
See my response to your previous question. In short, the query is executed against each shard, aggregated by one of those shards, and returned to the requestor. The useful "magic" that people most often associate with Solr - Lucene, really - is Term Frequency / Inverse Document Frequency (TF-IDF) ranking, usually combined with stemming on text searches. While this is not exactly what happens under the hood, and you can vary what's actually done via configuration, it gives a fairly good idea of what's being done.
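A stripped-down illustration of the TF-IDF idea (stemming omitted, documents invented): terms that appear in many documents are down-weighted, while terms concentrated in a few documents dominate the score.

```python
import math
from collections import Counter

docs = {
    "d1": "asthma attack treatment",
    "d2": "asthma inhaler dosage",
    "d3": "fracture treatment options",
}
tokenised = {d: text.split() for d, text in docs.items()}
doc_freq = Counter(term for terms in tokenised.values() for term in set(terms))
N = len(docs)

def tfidf_score(doc_terms, query):
    """Sum of tf * idf over the query terms for one document."""
    counts = Counter(doc_terms)
    score = 0.0
    for term in query.split():
        tf = counts[term] / len(doc_terms)                    # term frequency
        idf = math.log(N / doc_freq[term]) if doc_freq[term] else 0.0
        score += tf * idf                                     # rare terms weigh more
    return score

query = "asthma treatment"
for doc_id, terms in tokenised.items():
    print(doc_id, round(tfidf_score(terms, query), 3))
```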
Other searching, on dates, numbers, or simple textual values, is done in a fashion similar to database indexing. That is a simplification; if you want to understand it more fully, read the Javadoc on NumericRangeQuery for an in-depth explanation.
I also heard that google stopped using map reduce and now using cloud dataFlow here - how does that work?
If I knew the answer to that I would probably be working for Google and not answering StackOverflow questions :). Seriously, whatever they've built is new PhD-level work that, as far as I know, they haven't even released a research paper on yet - which is what they did with map reduce, and that paper led to Yahoo building Hadoop.

How to define a PBI that has no perceived value to the user? [closed]

I need to add an item to our product backlog list that has no (perceived) value to the users.
Context: every week we need to parse and import a TXT file into our system. Now the provider has decided to change the format to XML, so we need to rewrite the parsing engine.
In the end the user won't see any benefit as he'll keep getting his new data, but we still have to do this to keep importing the data.
How to add an item like this to the product backlog list?
What happens if you don't make the change? Is there value to the user in preventing that from happening? If the answer is yes, I'd recommend tying your business value statement to that. Then, you can write a typical user story with business value and treat it like any other PBI.
It has no value to the user, but it has value to your company.
As company X I want to be able to support the new XML format so that I can keep importing data from provider Y.
How does that sound? Not all stories necessarily revolve around the end user.
Note: technical stories and technical-improvement stories are not a good practice and should be avoided. Why? Because you can't prioritize them correctly, as they have no estimable value.
The correct way to do tech stories is to include them in the definition of done. For example: decide that every new story played is only complete once database access is via Dapper and not L2S. This is a viable DoD definition and makes sure you can evolve your system appropriately.
We typically just add it as a "technical improvement" and give it a priority that we think fits. If the user asks you about it, you just explain to them at a high level what the change does and why it's needed.
Don't forget that your application will most likely start failing in the future if you don't make the change. Just tell them that, and let them decide whether they want that or not.

How to document software algorithms? [closed]

I'm working on a large project for a university assignment: we're developing an application that is used by a business to compile quotes for their various services.
I need to document the algorithms in a way that the client can sign off on, to make sure the way we calculate the prices is correct.
So far I've tried using a large flow chart with decision diamonds like in information systems modelling, but it's proving to be overkill for even simple algorithms.
Can anybody please suggest some ways to do this? It needs to be as little like software code as possible, and enough for the client to see how we decide what prices are quoted.
Maybe you should then use pseudocode.
Create two documents.
First: The business process model (BPM) that shows the sequence of steps required to be done. This should be annotated with the details for each step.
Second: Create a spreadsheet with each input data item defined, so that the business can see that you understand the type of field for entry of each data point and the rules for each data point. If a calculation step uses a lookup table, that is where you define the input lookup value from the table. So for each step you know where the data is coming from and where it is going. Your spreadsheet can include links to the BPM so they can walk through each data point in the BPM and see where it comes from and goes to.
You can prepare screen designs to show the users what your system actually does.
Well, the usual way to document algorithms is writing papers.
If your clients have studied business, I'm sure they are familiar with reading formulas.
Would a data flow diagram help? Put pseudocode or math in the bubbles. I've had some success combining data flow models and entity relationship diagrams, but it's non-standard.
What about a Nassi-Shneiderman diagram? It's a diagram type from structured programming. I think it's good for showing decision flows.
http://en.wikipedia.org/wiki/Nassi%E2%80%93Shneiderman_diagram
You could create an algorithm test screen to display and comment on the various steps through the calculations.
