Apache Lucene inverted index - search

Does Lucene index use tf-idf as weights? Is it possible to define your own statistics and weights for each document, and "plug" them into Lucene?

Yes, the default scoring algorithm incorporates tf-idf, and it is fully documented in the TFIDFSimilarity documentation.
There are a number of ways to customize the scoring of documents.
The simplest and most common is to incorporate a boost, either on a field at index time, or on a query term when querying.
Many query types modify the scoring used for that query. Examples include ConstantScoreQuery and DisjunctionMaxQuery.
The Similarity you use defines the scoring algorithm, and you can select a different one (e.g. BM25Similarity).
You can also implement your own Similarity, usually by extending a higher-level implementation such as DefaultSimilarity, TFIDFSimilarity, or SimilarityBase (see the sketch below).
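As a rough illustration of the last two points, here is a minimal sketch of swapping in a different Similarity at index time and at search time. It assumes PyLucene with a 7.x/8.x-era Lucene (where RAMDirectory still exists and IndexWriterConfig takes just an analyzer); the same setSimilarity calls exist on the Java API, and note that in recent Lucene versions BM25Similarity is already the default.

```python
import lucene
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import DirectoryReader, IndexWriter, IndexWriterConfig
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.search.similarities import BM25Similarity
from org.apache.lucene.store import RAMDirectory

lucene.initVM()

directory = RAMDirectory()
config = IndexWriterConfig(StandardAnalyzer())
config.setSimilarity(BM25Similarity())      # used for norms written at index time
writer = IndexWriter(directory, config)
# ... add documents here ...
writer.close()

searcher = IndexSearcher(DirectoryReader.open(directory))
searcher.setSimilarity(BM25Similarity())    # used when scoring queries
```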

Just go through this example; it may help you understand how to make custom changes to the indexing process:
http://lucene.apache.org/core/4_3_1/demo/src-html/org/apache/lucene/demo/IndexFiles.html

Related

Customize Apache Spark implementation of TF-IDF

On the one hand, I want to use Spark's capability to compute TF-IDF for a collection of documents; on the other hand, the typical definition of TF-IDF (which the Spark implementation is based on) does not fit my case. I want the TF to be the term frequency across all documents, whereas in the typical TF-IDF it is computed per (word, document) pair. The IDF definition is the same as the typical one.
I implemented my customized TF-IDF using Spark RDDs, but I was wondering if there is any way to customize the Spark TF-IDF source so that I can reuse its capabilities, such as hashing.
Actually, I need something like:
public static class newHashingTF implements Something<String>
Thanks
It is pretty simple to implement different hashing strategies, as you can see from the simplicity of HashingTF itself, in both its (modern) Dataset version and its (old) RDD version.
This talk and its slides can help and there are many others online.
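If, as in the question, you want the term frequency counted over the whole corpus rather than per document, a small PySpark sketch of that idea is below. It is only an illustration, not part of Spark's API: the class name CorpusHashingTF is made up, and the bucketing simply mimics the hashing trick that HashingTF uses.

```python
import zlib
from pyspark.sql import SparkSession

def bucket(term, num_features):
    # Deterministic hash -> feature bucket, the same idea HashingTF uses.
    return zlib.crc32(term.encode("utf-8")) % num_features

class CorpusHashingTF(object):
    """HashingTF-style hashing, but term frequencies are aggregated over the
    whole corpus rather than per document (hypothetical class, not in Spark)."""

    def __init__(self, num_features=1 << 18):
        self.num_features = num_features

    def transform(self, tokens_rdd):
        # tokens_rdd: an RDD where each element is the token list of one document.
        n = self.num_features
        return (tokens_rdd
                .flatMap(lambda tokens: [(bucket(t, n), 1) for t in tokens])
                .reduceByKey(lambda a, b: a + b))

if __name__ == "__main__":
    spark = SparkSession.builder.appName("corpus-tf").getOrCreate()
    docs = spark.sparkContext.parallelize([["spark", "tf", "idf"], ["spark", "hashing"]])
    print(sorted(CorpusHashingTF(num_features=1000).transform(docs).collect()))
    spark.stop()
```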

Natural Language Processing in Python

How can I find similar issues for a new, unseen issue, based on past issues (including each issue's summary and description), using natural language processing in Python?
If I understand you correctly you have a new issue (query) and you want to look up other similar issues (documents) in your database. If so, then what you need is a way to find the similarity between your query and existing documents. And once you have them, you can rank them and select the most relevant ones. One such method that allows you to do this is Latent Semantic Indexing (LSI).
To do this you'll have to construct a document-term matrix: using your existing documents, you create a matrix of term occurrences across documents. What this means is that you basically record how many times a word appears in each document (or some other, more complex measure, e.g. tf-idf). This can be done either through a bag-of-words representation or a TF-IDF representation.
Once you have that, you'll have to process your query so that it is in the same form as your documents. Now that you have your query in usable form, you can calculate the cosine similarity between documents and your query. The one with the highest cosine similarity is the closest match.
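A minimal sketch of that flow using scikit-learn (one possible toolkit; Gensim would work too): build a TF-IDF document-term matrix, apply LSI via a truncated SVD, project the query into the same space, and rank by cosine similarity. The issue texts and the number of components are placeholders.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_issues = [
    "app crashes when uploading a large attachment",
    "login page throws a 500 error after password reset",
    "search results are empty for unicode queries",
]
new_issue = ["crash while uploading a big file attachment"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(past_issues)        # document-term matrix
query_vec = vectorizer.transform(new_issue)             # query in the same form

lsi = TruncatedSVD(n_components=2, random_state=0)      # LSI: low-rank projection
doc_lsi = lsi.fit_transform(doc_term)
query_lsi = lsi.transform(query_vec)

# Highest cosine similarity = closest match.
scores = cosine_similarity(query_lsi, doc_lsi)[0]
for score, text in sorted(zip(scores, past_issues), reverse=True):
    print(round(score, 3), text)
```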
Note: The topic that you may want to read about is Information Retrieval and LSI is just one such method. You should look into other methods as well.

Best for resume, document matching

I have used three different ways to calculate the matching between a resume and a job description. Can anyone tell me which method is the best, and why?
I used NLTK for keyword extraction and then RAKE for keyword/keyphrase scoring, then I applied cosine similarity.
Scikit-learn for keyword extraction, tf-idf, and cosine similarity calculation.
The Gensim library with an LSA/LSI model to extract keywords and calculate cosine similarity between documents and the query.
Nobody here can give you the answer. The only way to decide which method works better is to have one or more humans independently match lots and lots of resumes and job descriptions, and compare what they do to what your algorithms do. Ideally you'd have a dataset of already matched resumes and job descriptions (companies must do this kind of thing when people apply), because it takes a lot of work to create a sufficiently large dataset.
Next time you take on this kind of project, start by considering how you are going to evaluate the performance of the solution you'll put together.
As already mentioned in other answers, try to use Doc2Vec.
It seems that using Doc2Vec from Gensim on both corpora (CVs and job descriptions) and then computing cosine similarity between the two vectors is the easiest flow that works. It does better than other approaches on documents that are not similar in form and word content but are similar in context and semantics, so keywords alone would not help much here.
Then you can try to train a CNN on a corpus of matched CV/JD pairs with yes/no labels, if available, and use it to qualify CVs/resumes against job descriptions.
Basically, I'm going to try these approaches on pretty much the same task; see https://datascience.stackexchange.com/questions/22421/is-there-an-algorithm-or-nn-to-match-two-documents-basically-not-closely-simila
Since it is highly likely that the job description and resume content will differ, you should think about it from a semantics point of view. One possible thing you can do is use some domain knowledge, but it is pretty difficult to acquire domain knowledge for a wide variety of job types. Researchers sometimes use a dictionary to augment similarity matching between documents.
Researchers are also using deep neural networks to capture both the syntactic and semantic structure of documents. You can use Doc2Vec to compare two documents; Gensim can produce the Doc2Vec representation for you. I believe that will give better results than keyword extraction plus similarity computation. You can also build your own neural network model to train on job descriptions and resumes; I expect neural networks will be effective for this work.
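A hedged sketch of the Doc2Vec flow described above, using Gensim. It assumes Gensim 4.x (document vectors under model.dv; older releases used model.docvecs), and the resume texts, job description, and hyperparameters are placeholders.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

resumes = [
    "java developer with lucene and elasticsearch experience",
    "data scientist working with spark, gensim and scikit-learn",
    "front-end engineer focused on react and typescript",
]
job_description = "backend engineer with java and search engine experience"

# Train Doc2Vec on the resume corpus (each document gets an integer tag).
corpus = [TaggedDocument(simple_preprocess(text), [i]) for i, text in enumerate(resumes)]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer a vector for the job description and rank resumes by cosine similarity.
jd_vector = model.infer_vector(simple_preprocess(job_description))
for tag, score in model.dv.most_similar([jd_vector], topn=len(resumes)):
    print(round(score, 3), resumes[tag])
```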

Document Similarity in ElasticSearch

I want to calculate the similarity between two documents indexed in Elasticsearch. I know it can be done in Lucene using term vectors. What is the most direct way to do it?
I found that there is a similarity module doing exactly this:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-similarity.html
How do I integrate this into my system? I am using pyelasticsearch for calling Elasticsearch commands, but I am open to using the REST API for similarity if needed.
I think the Elasticsearch documentation can easily be misinterpreted.
Here "similarity" is not a comparison of documents or fields but rather a mechanism for scoring matching documents based on matching terms from the query.
The documentation states:
A similarity (scoring / ranking model) defines how matching documents are scored.
The similarity algorithms that Elasticsearch supports are probabilistic models based on term distribution in the corpus (index).
The term "term vectors" can also be misinterpreted.
Here "term vectors" refers to statistics for the terms of a document that can easily be queried. It seems that any similarity measurement across term vectors would then have to be done in your application, post-query. The documentation on term vectors states:
Returns information and statistics on terms in the fields of a particular document.
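For example, a sketch of doing that post-query computation in the application: fetch the term vectors of two documents over the REST API and compute a cosine similarity over their term frequencies. The index name, field name, and document ids are placeholders, and the URL layout assumes Elasticsearch 7+ (a single _termvectors endpoint with no mapping type).

```python
import math
import requests

ES = "http://localhost:9200"

def term_freqs(index, doc_id, field="body"):
    # Term vectors are computed on the fly if they are not stored in the mapping.
    resp = requests.get(f"{ES}/{index}/_termvectors/{doc_id}",
                        params={"fields": field}).json()
    terms = resp["term_vectors"][field]["terms"]
    return {term: info["term_freq"] for term, info in terms.items()}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine(term_freqs("articles", "1"), term_freqs("articles", "2")))
```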
If you need a performant (fast) similarity metric over a very large corpus you might consider a low-rank embedding of your documents stored in an index for doing approximate nearest neighbor searches. After your KNN lookup, which greatly reduces the candidate set, you can do more costly metric calculations for ranking.
Here is an excellent resource for evaluation of approximate KNN solutions:
https://github.com/erikbern/ann-benchmarks
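As one possible shape for that pipeline (an illustration, not a recommendation of a specific library): build a low-rank embedding of the documents and index it with Annoy, one of the libraries covered by ann-benchmarks. The corpus, embedding dimensionality, and tree count are placeholders.

```python
from annoy import AnnoyIndex
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the first document in a very large corpus",
    "a second, rather different document",
    "a third document that resembles the first document",
]
dim = 2  # low-rank embedding size (tiny here; hundreds in practice)

tfidf = TfidfVectorizer().fit_transform(corpus)
embeddings = TruncatedSVD(n_components=dim, random_state=0).fit_transform(tfidf)

index = AnnoyIndex(dim, "angular")          # angular distance ~ cosine similarity
for i, vec in enumerate(embeddings):
    index.add_item(i, vec.tolist())
index.build(10)                             # 10 trees; more trees = better recall

# The approximate KNN lookup gives a small candidate set that can then be
# re-ranked with a costlier metric.
print(index.get_nns_by_vector(embeddings[0].tolist(), 2, include_distances=True))
```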

What are the applications of length normalization?

I found some info about length normalization, but only in the context of search engines. Have people used it for other textual purposes? (Please forgive my ignorance; I've truly searched for other uses of it, but Google keeps conflating the term "normalization" with "scaling"...)
The link you provide in the question already mentions one reason for using length normalization: to avoid having disproportionately high term-frequency counts in the vectors of long documents. This affects document ranking considerably. A direct application of this is, of course, query-based document retrieval.
There are other algorithm-specific applications as well. For example, if you want to cluster documents using cosine similarity between the vectors: simple clustering algorithms such as k-means may not converge unless the vectors are all on a sphere, i.e. all vectors have the same length.
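A small sketch of that clustering point: L2-normalize raw count vectors so every document lies on the unit sphere, after which k-means with Euclidean distance behaves like clustering by cosine similarity. The toy documents are placeholders; CountVectorizer is used instead of TfidfVectorizer only to make the normalization step explicit (scikit-learn's TfidfVectorizer already L2-normalizes by default).

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

docs = [
    "search engines rank documents",
    "documents are ranked by search engines",
    "cats sleep most of the day",
    "the cat slept all day long",
]

counts = CountVectorizer().fit_transform(docs)   # raw term counts, varying lengths
unit_vectors = normalize(counts, norm="l2")      # every vector now has length 1

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(unit_vectors)
print(labels)
```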
