What is the cost of performing a sequential search for kNN queries?
Put differently, what is the average number of comparisons needed to obtain the kNN records?
If you want the exact top-k scores, you must scan all candidate items; in other words, perform a full scan. If you are willing to retrieve "nearly" the best scores, you can apply an approximate kNN search instead. There are several approximate kNN indexing algorithms: LSH, HNSW, PQ, IVF, etc. You can use open-source libraries such as Faiss, NMSLIB, and Annoy; open-source cloud tools such as Jina; or managed services such as AWS Elasticsearch and Pinecone.io.
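To make the trade-off concrete, here is a minimal sketch with Faiss comparing an exact flat index against an approximate HNSW index (the data and parameter values are purely illustrative):

    import numpy as np
    import faiss

    d = 128                                             # vector dimensionality
    xb = np.random.rand(100_000, d).astype('float32')   # database vectors
    xq = np.random.rand(10, d).astype('float32')        # query vectors
    k = 5

    # Exact search: every query is compared against all 100k vectors.
    flat = faiss.IndexFlatL2(d)
    flat.add(xb)
    D_exact, I_exact = flat.search(xq, k)

    # Approximate search: an HNSW graph does far fewer comparisons per
    # query, at the cost of occasionally missing a true neighbour.
    hnsw = faiss.IndexHNSWFlat(d, 32)                   # 32 = graph connectivity
    hnsw.add(xb)
    D_approx, I_approx = hnsw.search(xq, k)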
Recently I trained a BYOL model on a set of images to learn an embedding space where similar vectors lie close to each other. The performance was fantastic when I performed approximate k-nearest-neighbour search.
Now the next task, where I am facing a problem, is to find a clustering algorithm that uncovers a set of clusters using the embedding vectors generated by the BYOL-trained feature extractor (the vectors have dimension 1024 and there are 1 million of them). I have no a priori information about the number of classes, i.e. clusters, in my dataset and thus cannot use k-means. Is there any scalable clustering algorithm that can help me uncover such clusters? I tried FISHDBC, but the repository does not have good documentation.
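For what it's worth, a density-based method such as HDBSCAN infers the number of clusters by itself; a minimal sketch with the hdbscan package (the file name and min_cluster_size are assumptions you would need to adapt) might look like:

    import numpy as np
    import hdbscan

    # embeddings: (1_000_000, 1024) float32 array from the BYOL feature
    # extractor; at this scale, reducing dimensionality first (e.g. PCA)
    # is often necessary for acceptable runtime.
    embeddings = np.load('embeddings.npy')   # hypothetical file

    clusterer = hdbscan.HDBSCAN(min_cluster_size=50, core_dist_n_jobs=-1)
    labels = clusterer.fit_predict(embeddings)   # label -1 marks noise points
    print('clusters found:', labels.max() + 1)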
I classify clients with many small XGBoost models trained on different parts of the dataset.
Since it is hard to maintain many models manually, I decided to automate hyperparameter tuning via Hyperopt and feature selection via Boruta.
Could you please advise me: which should come first, hyperparameter tuning or feature selection? Or perhaps the order does not matter?
After feature selection, the number of features decreases from 2,500 to 100 (actually, I have 50 true features and 5 categorical features that were turned into 2,400 via one-hot encoding).
If some code is needed, please let me know. Thank you very much.
Feature selection (FS) can be considered a preprocessing activity whose aim is to identify features having low bias and low variance [1].
Meanwhile, the primary aim of hyperparameter optimization (HPO) is to automate the hyper-parameter tuning process and make it possible for users to apply machine learning (ML) models to practical problems effectively [2]. Some important reasons for applying HPO techniques to ML models are as follows [3]:
It reduces the human effort required, since many ML developers spend considerable time tuning the hyper-parameters, especially for large datasets or complex ML algorithms with a large number of hyper-parameters.
It improves the performance of ML models. Many ML hyper-parameters have different optimal values for achieving the best performance on different datasets or problems.
It makes models and research more reproducible. Different ML algorithms can only be compared fairly when the same level of hyper-parameter tuning is applied; hence, using the same HPO method on different ML algorithms also helps to determine the most suitable ML model for a specific problem.
Given the above difference between the two, I think FS should be applied first, followed by HPO, for a given algorithm.
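A rough sketch of that order, using Boruta for FS and Hyperopt for HPO (the loader, the search space, and all parameter values are assumptions, not your actual setup):

    import numpy as np
    import xgboost as xgb
    from boruta import BorutaPy
    from hyperopt import fmin, tpe, hp, Trials
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_data()   # hypothetical loader; X is a (n_samples, 2500) numpy array

    # Step 1: feature selection with Boruta (expects numpy arrays).
    rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
    selector = BorutaPy(rf, n_estimators='auto', random_state=42)
    selector.fit(X, y)
    X_sel = X[:, selector.support_]   # keep only the confirmed features

    # Step 2: hyperparameter tuning with Hyperopt on the reduced feature set.
    space = {
        'max_depth': hp.choice('max_depth', [3, 5, 7]),
        'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.3)),
    }

    def objective(params):
        model = xgb.XGBClassifier(n_estimators=200, **params)
        return -cross_val_score(model, X_sel, y, cv=3).mean()   # hyperopt minimizes

    best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())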
References
[1] Tsai, C.F., Eberle, W. and Chu, C.Y., 2013. Genetic algorithms in feature and instance selection. Knowledge-Based Systems, 39, pp. 240-247.
[2] Kuhn, M. and Johnson, K., 2013. Applied Predictive Modeling. Springer. ISBN: 9781461468493.
[3] Hutter, F., Kotthoff, L. and Vanschoren, J. (Eds.), 2019. Automated Machine Learning: Methods, Systems, Challenges. Springer. ISBN: 9783030053185.
In Mahout, there is support for item-based recommendation via the API method:
ItemBasedRecommender.mostSimilarItems(int productid, int maxResults, Rescorer rescorer)
But in Spark MLlib, it appears that the ALS API can only fetch recommended products when a user id is provided, via:
MatrixFactorizationModel.recommendProducts(int user, int num)
Is there a way to get recommended products based on a similar product, without having to provide user id information, similar to how Mahout performs item-based recommendation?
Spark 1.2.x does not provide an "item-similarity based recommender" like the one present in Mahout.
However, MLlib currently supports model-based collaborative filtering, where users and products are described by a small set of latent factors. (Understand the use case for implicit feedback, such as views and clicks, versus explicit feedback, such as ratings, when constructing the user-item matrix.)
MLlib uses the alternating least squares (ALS) algorithm (which can be considered similar to the SVD algorithm) to learn these latent factors.
If you need to construct a purely item-similarity based recommender, I would recommend the following:
Represent all items by a feature vector
Construct an item-item similarity matrix by computing a similarity metric (such as cosine) for each pair of items
Use this item similarity matrix to find similar items for users
Since similarity matrices do not scale well (imagine how your similarity matrix would grow if you had 100 items versus 10,000 items), this read on DIMSUM might be helpful if you're planning to implement it for a large number of items:
https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
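For reference, DIMSUM is exposed in later Spark releases through RowMatrix.columnSimilarities; a minimal PySpark sketch (rows_rdd and the threshold value are assumptions) might look like:

    from pyspark.mllib.linalg.distributed import RowMatrix

    # Assumed layout: rows_rdd is an RDD of vectors in which each COLUMN
    # represents one item, so column similarities are item-item similarities.
    mat = RowMatrix(rows_rdd)

    # Any threshold > 0 turns on DIMSUM sampling: faster, but approximate.
    sims = mat.columnSimilarities(threshold=0.1)
    print(sims.entries.take(10))   # CoordinateMatrix entries (i, j, similarity)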
Please see my implementation of an item-item recommendation model using Apache Spark here. You can implement this by using the productFeatures matrix that is generated when you run the MLlib ALS algorithm on user-product-ratings data. The ALS algorithm essentially factorizes the ratings matrix into two matrices: a userFeatures matrix and a productFeatures matrix. You can run a cosine similarity over the productFeatures low-rank matrix to find item-item similarities.
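A minimal sketch of that last step, assuming a trained PySpark MatrixFactorizationModel named model and a catalogue small enough to collect to the driver:

    import numpy as np

    # Collect the (productId, latentFactors) pairs learned by ALS.
    features = dict(model.productFeatures().collect())
    ids = sorted(features)
    M = np.array([features[i] for i in ids])      # shape: (n_items, rank)

    # Normalize rows so a dot product equals cosine similarity.
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    sims = M @ M.T

    def most_similar_items(product_id, k=10):
        row = sims[ids.index(product_id)]
        nearest = np.argsort(-row)[1:k + 1]       # skip the item itself
        return [(ids[j], row[j]) for j in nearest]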
I want to calculate the similarity between two documents indexed in Elasticsearch. I know it can be done in Lucene using term vectors. What is the most direct way to do it?
I found that there is a similarity module doing exactly this:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-similarity.html
How do I integrate this into my system? I am using pyelasticsearch for calling Elasticsearch commands, but I am open to using the REST API for similarity if needed.
I think the Elasticsearch documentation can easily be misinterpreted.
Here "similarity" is not a comparison of documents or fields but rather a mechanism for scoring matching documents based on matching terms from the query.
The documentation states:
A similarity (scoring / ranking model) defines how matching documents are scored.
The similarity algorithms that Elasticsearch supports are probabilistic models based on term distribution in the corpus (index).
The term "term vectors" can also be misinterpreted.
Here "term vectors" refer to statistics for the terms of a document that can easily be queried. It seems that any similarity measurements across term vectors would then have to be done in your application post-query. The documentation on term vectors state:
Returns information and statistics on terms in the fields of a particular document.
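As an illustration of that post-query step, a rough sketch against the REST term-vectors endpoint (the host, index name, document ids, and endpoint path are assumptions; the path varies across Elasticsearch versions):

    import requests

    ES = 'http://localhost:9200'   # assumed host

    def term_freqs(index, doc_id, field):
        # Fetch the term vector for one field of one document.
        url = f'{ES}/{index}/_termvectors/{doc_id}'
        resp = requests.get(url, json={'fields': [field]})
        terms = resp.json()['term_vectors'][field]['terms']
        return {t: info['term_freq'] for t, info in terms.items()}

    def cosine(a, b):
        # Cosine similarity between two sparse term-frequency dicts.
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        na = sum(v * v for v in a.values()) ** 0.5
        nb = sum(v * v for v in b.values()) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    sim = cosine(term_freqs('articles', '1', 'body'),
                 term_freqs('articles', '2', 'body'))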
If you need a performant (fast) similarity metric over a very large corpus, you might consider a low-rank embedding of your documents, stored in an index built for approximate nearest-neighbour searches. After your kNN lookup, which greatly reduces the candidate set, you can run more costly metric calculations for ranking.
Here is an excellent resource for evaluation of approximate KNN solutions:
https://github.com/erikbern/ann-benchmarks
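A bare-bones version of that pipeline, using TF-IDF plus truncated SVD for the low-rank embedding and Annoy for the approximate index (the toy corpus and all parameter values are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from annoy import AnnoyIndex

    docs = ['first document text', 'second document text', 'third one']  # toy corpus

    # Low-rank embedding of the corpus.
    tfidf = TfidfVectorizer().fit_transform(docs)
    emb = TruncatedSVD(n_components=2).fit_transform(tfidf)  # tiny rank for toy data

    # Approximate nearest-neighbour index over the embeddings.
    index = AnnoyIndex(emb.shape[1], 'angular')
    for i, vec in enumerate(emb):
        index.add_item(i, vec)
    index.build(10)   # 10 trees; more trees = better recall, bigger index

    candidates = index.get_nns_by_item(0, 2)   # reduced candidate set for doc 0
    # ...re-rank `candidates` here with an exact, more costly metric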
I have a dataset whose instances have about 200 features; about 11 of these features are numerical (integer) and the rest are binary (1/0). These features may be correlated, and they follow different probability distributions.
For a while now I have been looking for a good similarity score that works for such a mixed vector and takes the correlation between the features into account.
Do you know of such a similarity score?
Thanks,
Arian
In your case, the similarity function relies heavily on the patterns in the input data. You might benefit from learning a distance metric for the input space from a given collection of pairs of similar/dissimilar points, such that the metric preserves the distance relations among the training data.
Here is a nice survey paper.
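One concrete, closed-form example of a correlation-aware distance (a fixed special case rather than a learned metric) is the Mahalanobis distance, which whitens the features with the inverse covariance matrix so that correlated features are not double-counted. A sketch with synthetic stand-in data:

    import numpy as np
    from scipy.spatial.distance import mahalanobis

    # Synthetic stand-in for the real data: 11 numeric + 189 binary features.
    rng = np.random.default_rng(0)
    X = np.hstack([rng.normal(size=(500, 11)),
                   rng.integers(0, 2, size=(500, 189))])

    # Inverse covariance; pinv guards against singularity from constant columns.
    VI = np.linalg.pinv(np.cov(X, rowvar=False))
    d = mahalanobis(X[0], X[1], VI)
    print('Mahalanobis distance between the first two instances:', d)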
The numerous types of distance measures, Euclidean, Manhattan, etc., will provide different levels of accuracy depending on the dataset. It is best to read papers covering your method of data fitting and see what heuristics they use. Bear in mind, too, that some methods require homogeneous data that is scaled accordingly. Here is a paper that discusses a whole host of measures that you might find attractive.
And as always, test and cross-validate to see whether mixing feature types really has an impact.