What is the meaning of verbose and n_jobs in a decision tree, and what is the backend process of GridSearchCV? - decision-tree

I want an explanation of the above-mentioned terms.
I searched many websites but couldn't find a clear explanation.
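(For reference: in scikit-learn, n_jobs and verbose are parameters of GridSearchCV itself, not of the decision tree. n_jobs controls how many cross-validation fits run in parallel, dispatched through the joblib backend, and verbose controls how much progress output is printed. A minimal sketch, assuming a plain iris-style classification setup:)

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    param_grid = {"max_depth": [2, 3, 4], "min_samples_split": [2, 5]}

    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid,
        cv=5,
        n_jobs=-1,   # run the cross-validation fits in parallel (joblib backend); -1 = all cores
        verbose=2,   # print a progress message for each fit
    )
    search.fit(X, y)
    print(search.best_params_)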

Related

Sentence Transformers Using BOW?

I have a collection of terms that appear on or are somehow related to web pages (e.g. keywords from the HTML tags). These are not sentences; they are just a collection of keywords, words in a title, etc. Given such a web page, I am interested in finding the pages most similar to it. If I had sentences or paragraphs I would think of using a sentence transformer or even Doc2vec. But in this case I only have the set of words of a page and there is no real context or sentences. Am I correct that this precludes me from using a sentence transformer / Doc2vec?
Nothing precludes you from using anything. The relevant test is: does using it work, for your unique data & goals?
Doc2Vec and other shallow techniques work fine on things like lists-of-keywords that aren't perfect grammatical sentences: they're generally using the presence or absence of words, without rigorous grammatical understanding, as signals. And that's plenty for many purposes!
Some deeper transformers might have more order-dependent reliance on coherent natural-language utterances – but I wouldn't be sure of that until it was tried and shown lacking. It might work! And no one with only the vaguest sketch (from your question) of your data & goals can give you hints better than your own experiments.
Try things – including super-simple things like cosine-similarities on bag-of-words representation, or keyword searches based on some measure of most significant terms – then evaluate the results according to your needs/desired results.
You might start some evaluations via ad-hoc eyeballing – "this seems good, this seems wrong" – but would ideally record judgements of which docs "should" be more-similar than others, in your desired end-system, so that eventually you can run an automatic, quantitative comparison of alternate approaches.
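A minimal sketch of the "super-simple" bag-of-words baseline suggested above, using scikit-learn; the keyword strings here are made-up placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical keyword collections, one string per web page
    pages = [
        "tennis racket ball shoe court",
        "php web programming mysql remote",
        "web development php javascript",
    ]

    vectors = CountVectorizer().fit_transform(pages)
    similarities = cosine_similarity(vectors)   # pairwise page-to-page similarity matrix
    print(similarities[0])                      # similarity of page 0 to every page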

Building a thesaurus from corpus

I am working on a natural language processing application. I have a text describing 30 domains. Each domain is defined by a short paragraph that explains it. My aim is to build a thesaurus from this text so that I can determine, from an input string, which domains are concerned. The text is about 5000 words and each domain is described by 150 words. My questions are:
Do I have a long enough text to create a thesaurus from?
Is my idea of building a thesaurus legitimate, or should I just use NLP libraries to analyse my corpus and the input string?
At the moment, I have calculated the total number of occurrences of each word, grouped by domain, because I first thought of an indexed approach. But I am really not sure which method is best. Does anyone have experience in both NLP and thesaurus building?
I think what you are looking for is topic modeling. Given a word, you want to get the probability of which domain the word belongs to. I would recommend using off-the-shelf algorithms that implement LDA (Latent Dirichlet Allocation).
Alternatively, you can visit David Blei's website. He has written some great software that implements LDA, and topic modeling in general. He also has presented several tutorials for topic modeling for beginners.
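A minimal sketch of the off-the-shelf LDA route, here using scikit-learn's implementation; the placeholder texts and the number of topics are assumptions, not recommendations:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    # domain_texts: the 30 short domain paragraphs, one string each (placeholders here)
    domain_texts = [
        "plumbing pipes water heating boiler leak repair",
        "networking routers switches cabling bandwidth",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    doc_term = vectorizer.fit_transform(domain_texts)

    lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(doc_term)

    # Topic distribution for an input string
    query = vectorizer.transform(["burst pipe in the kitchen"])
    print(lda.transform(query))   # probability of each topic for the query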
If your goal is to build a thesaurus, then build a thesaurus; if your goal is not to build a thesaurus, then you'd better use what's already available out there.
More generally, for any task in NLP - from data acquisition to machine translation - you're going to face numerous problems (both technical and theoretical), and it is very easy to stray from the path, as these problems are - most of the time - fascinating.
Whatever the task is, build a system using existing resources. Then you get the big picture; then you can start thinking about improving component A or B.
Good luck.

Finding how relevant a text is, given a whitelist and blacklist of words/phrases

This is a case of me wanting to search for something online but not knowing what it's called.
I have a collection of job descriptions in text files, some only a sentence or two long, most a paragraph or two. I want to write a script that, given a set of rules, will notify me when it finds a job description I would want.
For example, let's say I am looking for a job in PHP programming, but not a full-time position and not a designing position. So my "rule book" could be:
want: PHP
want: web programming
want: telecommuting
do not want: designing
do not want: full-time position
What is a method I could use to sort these files into a "pass" (descriptions that match what I'm looking for) and a "fail" (descriptions that are not relevant)? Some ideas I was considering:
Count the occurrences of the phrases in the text file that are also in my "rule book", and reject those that contain words that I do not want. This doesn't always work, though, because what if a description says "web designing not required"? Then my algorithm would say "That contains the word designing so it is not relevant" when it really was!
When searching the text for phrases that I do and do not want, count phrases within a certain Levenshtein distance as the same phrase. For example, designing and design should be treated the same way, as well as misspellings of words, such as programing.
I have a large collection of descriptions that I have looked through manually. Is there a way I could "teach" the program "these are examples of good descriptions, these are examples of bad ones"?
Does anyone know what this "filtering process" is called, and/or have any advice or methods on how I can accomplish this?
You basically have a text classification or document classification problem. This is a specific case of binary classification, which is itself a specific case of supervised learning. It's a well-studied problem and there are many tools to do it. Basically, you give a set of good documents and bad documents to a learning or training process, which finds words that correlate strongly with positive and negative documents, and it outputs a function capable of classifying unseen documents as positive or not. Naive Bayes is the simplest learning algorithm for this kind of task, and it will do a decent job. There are fancier algorithms like logistic regression and support vector machines, which will probably do somewhat better, but they are more complicated.
To determine which word variants are actually equivalent to each other, you want to do some kind of stemming. The Porter stemmer is a common choice here.
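A minimal sketch of that workflow (Naive Bayes plus Porter stemming) with scikit-learn and NLTK; the example texts and labels are placeholders:

    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    stemmer = PorterStemmer()

    def stem_tokens(text):
        # Crude whitespace tokenization plus Porter stemming, so that
        # "designing" and "design" map to the same feature
        return [stemmer.stem(tok) for tok in text.lower().split()]

    model = make_pipeline(
        CountVectorizer(tokenizer=stem_tokens, token_pattern=None),
        MultinomialNB(),
    )

    # Placeholder training data: 1 = "pass", 0 = "fail"
    texts = ["PHP web programming, telecommuting possible",
             "Full-time web designing position"]
    labels = [1, 0]

    model.fit(texts, labels)
    print(model.predict(["Remote PHP programming contract"]))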

Semantic search with NLP and elasticsearch

I am experimenting with Elasticsearch as a search server and my task is to build "semantic" search functionality. From a short text phrase like "I have a burst pipe" the system should infer that the user is searching for a plumber and return all plumbers indexed in Elasticsearch.
Can that be done directly in a search server like Elasticsearch, or do I have to use a natural language processing (NLP) tool like, e.g., Maui Indexer? What is the exact terminology for my task at hand - text classification? Though the given text is very short, as it is a search phrase.
There may be several approaches with different implementation complexity.
The easiest one is to create a list of topics (like plumbing), attach a bag of words to each (like "pipe"), identify the topic of a search request by the majority of its keywords, and search only within that topic (you can add a topic field to your Elasticsearch documents and make it mandatory with + during search).
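A rough sketch of that manual approach; the topic names, keyword bags, and the topic/description field names are assumptions about your own index:

    # Hand-built topics and their bags of words
    TOPIC_KEYWORDS = {
        "plumbing": {"pipe", "burst", "leak", "tap", "drain"},
        "electrics": {"socket", "fuse", "wiring", "outlet"},
    }

    def detect_topic(query):
        words = set(query.lower().split())
        # Pick the topic whose bag of words overlaps the query the most
        return max(TOPIC_KEYWORDS, key=lambda t: len(words & TOPIC_KEYWORDS[t]))

    query = "I have a burst pipe"
    topic = detect_topic(query)

    # Elasticsearch query body: full-text match, restricted to the detected topic
    es_query = {
        "bool": {
            "must": {"match": {"description": query}},
            "filter": {"term": {"topic": topic}},
        }
    }
    print(topic, es_query)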
Of course, if you have lots of documents, manually creating the topic list and bags of words is very time-consuming. You can use machine learning to automate some of these tasks. Basically, it is enough to have a distance measure between words and/or documents to automatically discover topics (e.g. by data clustering) and classify a query into one of those topics. A mix of these techniques may also be a good choice (for example, you can manually create topics and assign initial documents to them, but use classification for query assignment). Take a look at Wikipedia's article on latent semantic analysis to better understand the idea. Also pay attention to the two linked articles on data clustering and document classification. And yes, Maui Indexer may become a good helper tool this way.
Finally, you can try to build an engine that "understands" the meaning of the phrase (rather than just using term frequencies) and searches the appropriate topics. Most probably, this will involve natural language processing and ontology-based knowledge bases. But in fact, this field is still under active research, and without previous experience it will be very hard for you to implement something like this.
You may want to explore https://blog.conceptnet.io/2016/11/03/conceptnet-5-5-and-conceptnet-io/.
It combines semantic networks and distributional semantics.
When most developers need word embeddings, the first and possibly only place they look is word2vec, a neural net algorithm from Google that computes word embeddings from distributional semantics. That is, it learns to predict words in a sentence from the other words around them, and the embeddings are the representation of words that make the best predictions. But even after terabytes of text, there are aspects of word meanings that you just won’t learn from distributional semantics alone.
Some results
The ConceptNet Numberbatch word embeddings, built into ConceptNet 5.5, solve these SAT analogies better than any previous system. It gets 56.4% of the questions correct. The best comparable previous system, Turney’s SuperSim (2013), got 54.8%. And we’re getting ever closer to “human-level” performance on SAT analogies — while particularly smart humans can of course get a lot more questions right, the average college applicant gets 57.0%.
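For a sense of how these embeddings are used in practice, a rough sketch with gensim; the file name is only an example of an English-only Numberbatch release in word2vec text format, not a requirement:

    from gensim.models import KeyedVectors

    # Assumes you have downloaded e.g. numberbatch-en-19.08.txt.gz from the ConceptNet site
    vectors = KeyedVectors.load_word2vec_format("numberbatch-en-19.08.txt.gz", binary=False)
    print(vectors.most_similar("plumber", topn=5))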
Semantic search is basically search with meaning. Elasticsearch uses JSON serialization by default; to apply search with meaning to JSON you would need to extend it to support edge relations via JSON-LD. You can then apply your semantic analysis over the JSON-LD schema to disambiguate the plumber entity and the burst-pipe context as subject-predicate-object relationships. Elasticsearch has very weak semantic search support, but you can work around it using faceted search and bags of words. You can index a thesaurus schema for plumbing terms, then do semantic matching over the text phrases in your sentences.
"Elasticsearch 7.3 introduced introduced text similarity search with vector fields".
They describe the application of using text embeddings (e.g., word embeddings and sentence embeddings) to implement this sort of semantic similarity measure.
A bit late to the party, but part II of this blog seems to address this through "contextual searches". It basically makes a two-part query to Elasticsearch in order to build a list of "seed" documents and then an expanded query via the more-like-this API. The result is a set of documents most contextually similar to the search query.
It's possible. This GitHub repo shows how to integrate Elasticsearch with the current state of the art in NLP for semantic representation of language, BERT (Bidirectional Encoder Representations from Transformers): https://github.com/Hironsan/bertsearch
Good luck.
My suggestion is to use BERT embeddings for your sentences and add an embedding field to your Elasticsearch index, as described in https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch
For BERT embeddings I suggest using the sentence-transformers library (models are available on the Hugging Face hub). You can find sample code in https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8
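A minimal sketch of that suggestion: embed documents with sentence-transformers and declare a dense_vector field sized to the model's output. The model name, index fields, and example texts are assumptions:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = ["Joe's Plumbing - burst pipes, leaks, boiler repair",
            "Ace Electrics - wiring and fuse boxes"]
    embeddings = model.encode(docs)   # one vector per document

    # Index mapping: a dense_vector field sized to the embedding dimension
    mapping = {
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {"type": "dense_vector", "dims": int(embeddings.shape[1])},
            }
        }
    }
    print(mapping)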
There are several options for that:
You can perform it in Elasticsearch itself. Elasticsearch supports indexing dense embeddings of documents. From there, you can write your own search pipeline and use your preferred relevance score formula, e.g. cosine similarity or something else.
Use a Haystack pipeline; refer to my blog, which describes setting up a semantic search pipeline (end-to-end).
You can use Meta's Faiss (see the sketch below).
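For the Faiss option, a rough sketch of cosine-similarity search over document embeddings (random placeholder vectors here; in practice you would reuse the sentence-transformers embeddings from above):

    import faiss
    import numpy as np

    embeddings = np.random.rand(1000, 384).astype("float32")   # placeholder document vectors
    faiss.normalize_L2(embeddings)             # normalize so inner product == cosine similarity

    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)

    query = np.random.rand(1, 384).astype("float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)       # top-5 most similar documents
    print(ids[0], scores[0])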

Finding related words (specifically physical objects) to a specific word

I am trying to find words (specifically physical objects) related to a single word. For example:
Tennis: tennis racket, tennis ball, tennis shoe
Snooker: snooker cue, snooker ball, chalk
Chess: chessboard, chess piece
Bookcase: book
I have tried to use WordNet, specifically the meronym semantic relationship; however, this method is not consistent as the results below show:
Tennis: serve, volley, foot-fault, set point, return, advantage
Snooker: nothing
Chess: chess move, checkerboard (whose own meronym relationships shows ‘square’ & 'diagonal')
Bookcase: shelve
Weighting of terms will eventually be required, but that is not really a concern now.
Anyone have any suggestions on how to do this?
Just an update: Ended up using a mixture of both Jeff's and StompChicken's answers.
The quality of information retrieved from Wikipedia is excellent, specifically how (unsurprisingly) there is so much relevant information (in comparison to some corpora where terms such as 'blog' and 'ipod' do not exist).
The range of results from Wikipedia is the best part. The software is able to match terms such as (lists cut for brevity):
golf: [ball, iron, tee, bag, club]
photography: [camera, film, photograph, art, image]
fishing: [fish, net, hook, trap, bait, lure, rod]
The biggest problem is classifying certain words as physical artefacts; default WordNet is not a reliable resource as many terms (such as 'ipod', and even 'trampolining') do not exist in it.
I think what you are asking for is a source of semantic relationships between concepts. For that, I can think of a number of ways to go:
Semantic similarity algorithms. These algorithms usually perform a tree walk over the relationships in WordNet to come up with a real-valued score of how related two terms are. These will be limited by how well WordNet models the concepts that you are interested in. WordNet::Similarity (written in Perl) is pretty good; see the sketch after this list for a Python equivalent.
Try using OpenCyc as a knowledge base. OpenCyc is an open-source version of Cyc, a very large knowledge base of 'real-world' facts. It should have a much richer set of semantic relationships than WordNet does. However, I have never used OpenCyc, so I can't speak to how complete it is or how easy it is to use.
n-gram frequency analysis. As mentioned by Jeff Moser. A data-driven approach that can 'discover' relationships from large amounts of data, but can often produce noisy results.
Latent Semantic Analysis. A data-driven approach similar to n-gram frequency analysis that finds sets of semantically related words.
[...]
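A small illustration of the first option, using NLTK's WordNet interface as a Python stand-in for WordNet::Similarity (assumes the WordNet corpus has been downloaded with nltk.download("wordnet")):

    from nltk.corpus import wordnet as wn

    tennis = wn.synsets("tennis")[0]
    for syn in wn.synsets("racket"):
        # Wu-Palmer similarity: higher means more closely related in the WordNet hierarchy
        # (None when the two synsets share no common taxonomy, e.g. noun vs. verb senses)
        print(syn.name(), tennis.wup_similarity(syn))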
Judging by what you say you want to do, I think the last two options are more likely to be successful. If the relationships are not in WordNet then semantic similarity won't work, and OpenCyc doesn't seem to know much about snooker other than the fact that it exists.
I think a combination of both n-grams and LSA (or something like it) would be a good idea. N-gram frequencies will find concepts tightly bound to your target concept (e.g. tennis ball) and LSA would find related concepts mentioned in the same sentence/document (e.g. net, serve). Also, if you are only interested in nouns, filtering your output to contain only nouns or noun phrases (by using a part-of-speech tagger) might improve results.
In the first case, you probably are looking for n-grams where n = 2. You can get them from places like Google or create your own from all of Wikipedia.
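A quick sketch of the n = 2 case in plain Python: count bigrams that start with the target word. The corpus text here is a tiny placeholder; in practice it would be something like a Wikipedia dump:

    import re
    from collections import Counter

    corpus_text = "a tennis racket and a tennis ball ... the tennis ball hit the net"
    tokens = re.findall(r"[a-z]+", corpus_text.lower())

    target = "tennis"
    bigrams = Counter((w1, w2) for w1, w2 in zip(tokens, tokens[1:]) if w1 == target)
    print(bigrams.most_common(5))   # e.g. [(('tennis', 'ball'), 2), (('tennis', 'racket'), 1)]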
For more information, check out this related Stack Overflow question.
