Filtering and sorting of multiallelic sites - vcf-variant-call-format

since splitting of multiallelic sites the way bcftools does it, doesn't really work for me, I am looking for a good way to sort and/or filter multiallelic sites by allele frequency.
bcftools view --min-af 0.05:alt1 will filter all variants of multiallelic sites with AF of ALT1 < 5% even, if ALT2 (or ALT3) would have a higher AF.
bcftools view --min-af 0.05:nref wil combine the AFs of all ALT alleles, which also isn't what I want to do.
Is there a good tool, that can filter all ALT alleles separately? Or is there a tool to sort all ALTs by their AF so I would at least have the most common at the beginning?
Thanks in advance

Related

Is there a way to group items based on a broader category (e.g., skittles and snickers get labeled as "candy")?

I'm wondering if there is a way (specific package, process, etc.) of grouping items based on an overall category? For example, I'm looking at empty search results and want to see what category customers are most interested in.
Let's say I have a list of searched terms: skittles, laundry, snickers and detergent. I would want to group these items based on a broader category (i.e., skittles and snickers are "candy" and laundry and detergent would be "cleaners").
I've done some research on this and have seen similar (but not exact) ways of doing this (e.g., common keyword grouping using NLP) but not sure if something like this exists in the world when there isn't necessarily any commonality. Any help or direction would be greatly appreciated.
Update here: The best way to handle this scenario is to use pretrained word embeddings using something like Google's BERT algorithm as the first pass and then layer on another ML model that is specific to the use case.

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on Solr side if relevancy score is under a threshold. I'm not sure how to do this, and from what I've read this isn't a good idea anyway. e.g. If a result returns only 10 listings I would want to display them all instead of filter any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc... Also has same problems of the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and doesn't say anything about "this was xyz relevant!".
You'll have to decide why those documents that are included aren't relevant and exclude them based on that criteria, and then either use the review score as a way to boost them further up (if you want the search to appear organic / by relevance). Otherwise you can just exclude them and sort by user score. But remember that user score, as an experience for the user, is usually a harder problem to make relevant than just order by the average of the votes.
Usually the client can choose different ordering options, by relevance or ratings for example. But you are right that ordering by rating is probably not useful enough. What you could do is take into account the rating in the relevance scoring. For example, by multiplying an "organic" score with a rating transformed as a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not only what constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used besides only the content similarity. This opens a plethora of possibilities to define a retrieval model. For example, what has become popular are Learning to Rank (LTR) approaches where different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behavior. Solr offers this as module.

Product search engines, and filtering on attributes ala eBay - how is it done?

Apologies in advance if this is a common question...I think I'm having trouble finding answers because I'm not sure what the problem is actually called!
The background to the problem is - if you look at a service like ebay, when you make a query, you can select a category to drill down in you results. And then when you drill down a leaf category, you can start using filters. So if you select televisions, you might get a variety of filters - like panel technology (oled, lcd, crt), screen size (22", 32", 40" etc.), brand (sony, samsung, lg etc.). The different filters show you the number of results each filter will produce.
Key point: as you select filters, the filters available update. So if you select sony and oled, the screensize filter (and the others) will update to match results available within the constraints of the previously chosen filters.
My question is...how would you implement this kind of filter system in a search engine. Or specifically, how would you calculate the number of results available for a give combination of filters? How do you work out and update the 'filter histogram' as the user makes filter choices?
It seems like a complex problem. Does ebay precalculate the number of results for every possible combination of filters under a leaf category?
Or is there some other smarter way of handling this?
I hope my question makes sense :) Thanks for ANY help! :)

String Matching Algorithms

I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple", as well as "appel" and "ple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
Levenshtein distance will do this.
Note: this is a distance, you have to calculate it to every string in your database, which can be a big problem if you have a lot of entries.
If you have this problem then record all the typos the users make (typo=no direct match) and offline build a correction database which contains all the typo->fix mappings. some companies do this even more clever, eg: google watches how users correct their own typos and learns the mappings from this.
Soundex or Metaphone might work.
I think what you are looking for is a huge field of Data Quality and Data Cleansing. I fear if you could find a python implementation regarding this as it has to be something which cleanses considerable amount of data in db which could be of business value.
Levensthein distance goes in the right direction but only half the way. There are several tricks to get it to use the half matches as well.
One would be to use a subsequence dynamic time warping (DTW is actually a generalization of levensthein distance). For this you relax the start and end cases when calcualting the cost matrix. If you only relax one of the conditions you can get autocompletion with spell checking. I am not sure if there is a python implementation available, but if you want to implement it for yourself it should not be more than 10-20 LOC.
The other idea would be to use a Trie for speed up, which can do DTW/Levensthein on multiple results simultaniously (huge speedup if your database is large). There is a paper on Levensthein on Tries at IEEE, so you can find the algorithm there. Again for this you would need to relax the final boundary condition, so you get partial matches. However since you step down in the trie you just need to check when you have fully consumed the input and then return all leaves.
check this one http://docs.python.org/library/difflib.html
it should help you

smart search by first/last name

I have to build a search facility capable of searching members by their first name/last name and may be some other search parameters (i.e. address).
The search should provide a list of match candidates so that the user can select whatever he/she seems the "correct" match.
The search should be smart enough so that the "correct" result would be among the first few items on the list. The search should also be tolerant to typos and misspellings and, may be, even be aware of name shortcuts i.e. Bob vs. Robert or Bill vs. William.
I started investigating Lucene and the family (like elastic search) as a tool for the job. While it has an impressive array of features addressing similar problems for the full text search, I am not so sure how to use them for my task - up to the point that maybe Lucene is not the right tool here at all.
What do you guys think - how can I harness Elastic Search to solve my problem? Or should I look elsewhere?
Lucene supports edit distance queries so that your search query will tolerate some typos, you define this as the allowed edit distance for a term.
for instance:
name:johnni~0.8
would return "johnny"
Also Solr provides a wide array of ready made search filters and analyzers you can use for search.
In your case I would probably chain several filter factories together:
TrimFilterFactory - trim the query
LowerCaseFilterFactory - to get rid of case differences
ISOLatin1AccentFilterFactory - to remove accents from letters (most people don't search with the accent anyway)
PhoneticFilterFactory - for matching sounds like queries like: kris -> chris
look at the documentation under the link it is pretty straight forward how to set up a new solr instance with an Analyzer that uses all the above filters. I used something similar for searching city names and it worked fairly well.
Lucene can be made tolerant of typos and misspellings, and can use synonyms. As for
The search should be smart enough so that the "correct" result would be among the first few items on the list
Are there any search engines which don't try to do this?
As far as Bob/Robert goes, that can be done with synonyms, but you need to get the synonym data from some reliable source.
In addition to what #Asaf mentioned, you might try to use N-gram indexing to deal with spelling variants. See the CJKAnalyzer for an example of how to do that.

Resources