Generating synonyms or similar words using BERT word embeddings - nlp

I want to generate synonyms or similar words using BERT word embeddings.
I started to do this using BERT.
For later software integration, it has to be done in Java, so I went with easy-bert
(https://github.com/robrua/easy-bert).
It appears I can get word embeddings this way:
try (Bert bert = Bert.load(new File("com/robrua/nlp/easy-bert/bert-uncased-L-12-H-768-A-12"))) {
    float[][] embedding = bert.embedTokens("A sequence");
    float[][][] embeddings = bert.embedTokens("Multiple", "Sequences");
}
Do you know how I could get similar words from these word embeddings?
Thanks for your help!

I developed a way to do this using Luminoso. I work for them so this is a bit of an ad, but it does exactly what you want it to do.
https://www.luminoso.com/search
Luminoso is really good at understanding conversational text like product reviews, product descriptions, survey results and trouble tickets. It doesn't require ANY kind of training or ontology building and will build a language model around your language. You feed the text for your pages into Luminoso and it will generate a set of synonyms for the concepts used in your text.
As an example project I built a search using Amazon.com beauty products. I'll just copy a few of the automatically generated synonyms, around four concepts. There were 17,851 synonyms generated from this dataset.
scent, rose-like, not sickeningly, not nauseating, not overwhelming, herb-y, no sweetness, cucumber-y, not too citrus-y, no gardenia, not lemony, pachouli, vanilla-like, fragarance, not spicy, flowerly, musk, perfume-like, floraly, not cloyingly => scent
recommend, recommende, advice, suggestion, highly recommend, suggest, recommeded, recommendation, recommend this product, reccommended, advise, suggest, indicated, suggestion, advice, agree, recommend, say, considering, mentioned => recommend
bottle, no sprayer, 8-oz, beaker, decanter, push-down, dispenser, pipet, pint, not the bottle, no dropper, keg, gallon, jug, pump-top, liter, half-full, decant, tumbler, vial => bottle
eczema, non-steroidal, ulcerative, dematitis, ecsema, Elidel, dermititis, inflammation, pityriasis, hydrocortizone, dyshidrotic, chickenpox, Stelatopia, perioral, rosacea, dry skin, nummular, ecxema, mild-moderate, ezcema => eczema
There were 800k products in this search index, so the result set was large as well, but this also works on small datasets.
Besides using the synonym format above (see the sketch at the end of this answer for one way to load it into Elasticsearch), you can also place this directly into Elasticsearch and associate the synonyms for a specific page with that page.
This is a sample of an Elasticsearch index enhanced with the same technology. It's dialed up very high, so too many concepts are added, but it shows how well it finds relationships between concepts.
{"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "130414089X"}}
{"title": "New Benefit Waterproof Automatic Eyeliner Pen - Black - BAD Gal Liner", "text": "Length : 13.5 cm\nColor: Black\n100% Brand new and unused.\nSmudge free.\nFine-tip. Easy to blend and smooth to apply\nCan make fine and bold eyeline with new texture and furnishing.\nProvide rich and consistant colour\nLongwearing and waterproof\nFregrance Free", "primary_concepts": ["not overpoweringly", "concoction", "equipped", "fine-tip", "water-resistant", "luxuriant", "make", "fixture", "☆", "not lengthen", "washable", "not too heady", "blendable", "doesn't collect", "shade", "niche", "supple", "smudge-proof", "sumptuous", "movable", "black", "over-apply", "quick", "silky", "colored", "sweatproof", "opacity", "accomodate", "fuchsia", "furnishes", "meld", "sturdily", "smear", "inch", "mid-back", "chin-length", "smudge", "alredy", "not cheaply", "long-wearing", "eyeline", "texture", "steady", "no-name", "audacious", "easy", "edgy", "is:A", "marketers", "greys", "decadent", "applicable", "Crease-free", "magenta", "free", "itIn", "stay-true", "racy", "application", "glides", "smooth", "sleek", "taupe", "grainy", "dark", "wealthy", "JP7506CF", "gray", "grayish", "width", "newness", "purfumes", "Lancme", "blackish", "easily", "doesn't smudge", "maroon", "blend", "convenient", "smoother", "Moschino", "long-wear", "mauve", "medium-length", "no raccoon", "revamp", "demure", "richly", "white", "brand", "offers", "lenght", "soft", "doesn't smear", "provide", "provides", "unusable", "eye-liner", "unopened", "straightforward", "silky-smooth", "uniting", "compactness", "bold", "fearless", "mix", "indulgent", "brash", "serviceable", "unmarked", "not musky", "constructed", "racoon", "smoothly", "sealant", "merged", "boldness", "reuse", "unused", "long", "Kors", "effortless", "luscious", "stain", "rich", "discard", "richness", "opulent", "short", "consistency", "fine", "sents", "newfound", "fade-resistant", "mixture", "hue", "sassy", "apply", "fragnance", "heathy", "adventurous", "not enthusiastic", "longwearing", "fregrance", "non-waterproof", "empty", "lashline", "simple", "newly", "you'r", "combined", "no musk", "mingle", "waterproof", "painless", "pinkish", "thickness", "clump-free", "gos", "consistant", "color", "smoothness", "name-brand", "new", "smudgeproof", "yaaay", "water-proof", "eyemakeup", "not instant", "spidery", "furnish", "tint", "product", "reapply", "not black", "no globs", "imitators", "blot", "cinch", "uncomplicated", "untouched", "length"], "related_concepts": ["eyeliner", "no goofs", "doesn't smear", "pen", "hundreds"]}
{"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "130414643X"}}
{"title": "Goodskin Labs Eyliplex-2 Eye Life and Circle Reducer - 10ml", "text": "Eyliplex-2 is a dual solution that focuses on the problematic eye area. This breakthrough, 24-hour system from the scientists at good skin pharmacy visibly tightens eye areas while reducing dark circles. 0.34 oz. each. 64% of subjects reported younger looking eyes immediately and a 20% reduction in the appearance of dark circles in clinical studies.", "primary_concepts": ["coloration", "Laboratories", "oncology", "cornea", "undereye", "eye", "immediately", "☆", "teen", "dry-skin", "good", "eyelids", "puffiness", "behold", "research", "temperamental", "dermatological", "breakthrough", "study", "store", "nice", "lasik", "instantaneously", "teenaged", "multi", "rheostat", "dermatology", "chemist", "invisibly", "PhD", "pharmacy", "alredy", "not cheaply", "optional", "pharmacist", "Obagi-C", "topic", "supermarket", "reversible", "studies", "Younger", "medically", "report", "thermo", "tightness", "dual", "eliminate", "researcher", "Minimization", "cutaneous", "hydration", "O2", "taupe", "increase", "moisturization", "dark", "preliminary", "excellent", "Quad", "well", "appearance", "dusky", "quickly", "instantly", "CVS", "Dermal", "great", "revolutionary", "biologist", "epidermis", "blackish", "disclosed", "problem", "youngsters", "murky", "scientific", "teenager", "oz", "dark circles", "clinically", "emphasis", "absorption", "skin", "loosen", "intractable", "technological", "reduction", "clinician", "nutritional", "forthwith", "grocer", "scientifically", "swiftly", "examination", "state-of-the-art", "not acne prone", "zone", "decrease", "younger-looking", "excellently", "troublesome", "system", "radius", "tighten", "FDA", "decent", "noticeably", "WD-40", "clearer", "scientist", "saggy", "significantly", "improvement", "Teamine", "interchangeable", "visible", "visable", "no fine line", "shortly", "minimize", "survey", "problematic", "young", "glance", "racoon", "vicinity", "youthful", "exacerbated", "focal", "region", "groundbreaking", "reddish", "focus", "reduce", "increments", "nad", "fasten", "area", "soon", "complexion", "squinting", "look", "grocery", "eyliplex-2", "Eyliplex-2", "subsequently", "even-toned", "bothersome", "eyes", "mitigate", "markedly", "philosophy:you", "difficult", "darkish", "bluish", "satisfactory", "darken", "epidermal", "lessen", "appearence", "ocular", "ergonomically", "diminished", "progression", "purplish", "sun-damaged", "Cellex-C", "visibly", "diagnosis", "drugstore", "under-eye", "apothecary", ":-D", "terrific", "clinical", "oz.", "Endocrinology", "time-released", "Nouriva", "tight", "adolescent", "subject", "eyeballs", "sking", "Pro-Retinol", "aggravate", "younger", "shortcomings", "solution", "assess", "promptly", "teenage", "Kinetin", "24-hour", "Mart", "youth", "visibility", "scientists", "taut", "better", "eyesight", "no dark circles", "not reduce", "photoaging", "Pending"], "related_concepts": ["A22", "A82", "Amazon", "daytime", "HK", "nighttime", "smell", "dark circles", "purchased"]}
{"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "1304146537"}}
Luminoso uses word embeddings from ConceptNet, which it also develops, and the technology goes above and beyond what ConceptNet alone gives you. I'm biased, but every time I've run data through it I'm amazed. It's not free, but it really works with absolutely zero pre-training of the data, and nothing is actually free.
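If you prefer to let Elasticsearch do the expansion itself, synonym lines in the a, b => c format shown above can also be loaded into a synonym token filter. This is a minimal sketch, not what the sample documents above do (they add concept fields directly to each document); the index, filter and analyzer names are made up, and the two rules are just shortened lines from the lists above:
PUT /amzbeauty
{
  "settings": {
    "analysis": {
      "filter": {
        "generated_synonyms": {
          "type": "synonym",
          "synonyms": [
            "rose-like, perfume-like, musk => scent",
            "suggestion, highly recommend, advise => recommend"
          ]
        }
      },
      "analyzer": {
        "synonym_text": {
          "tokenizer": "standard",
          "filter": ["lowercase", "generated_synonyms"]
        }
      }
    }
  }
}
With the explicit a, b => c mapping, the left-hand terms are rewritten to the right-hand concept at analysis time, so documents and queries that go through this analyzer meet on the same term.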

A similar task to this (lexical substitution) is covered by the LS07 and LS14 benchmarks.
One researcher achieved the state of the art on those benchmarks using BERT.
You might be interested in reading this paper:
https://www.aclweb.org/anthology/P19-1328.pdf
The author describes the approach as follows:
"…applies dropout to the target word’s embedding for partially masking
the word, allowing BERT to take balanced consideration of the target
word’s semantics and contexts for proposing substitute candidates, and
then validates the candidates based on their substitution’s influence
on the global contextualized representation of the sentence."
I don't know how to reproduce the same result because the implementation is not open to the public, but here's the hint: embedding dropout could be applied to generate substitute candidates.
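If you just want something workable in Java with easy-bert rather than a reproduction of the paper, a much cruder approach is to embed a target word and a list of candidate words and rank the candidates by cosine similarity. This is only a sketch under assumptions: the candidate list is made up (in practice it could come from a word list or thesaurus), and it uses the embedSequence method from the easy-bert README to get one pooled vector per string instead of the per-token embedTokens output:
import java.io.File;
import com.robrua.nlp.bert.Bert;

public class SimilarWords {
    // Cosine similarity between two embedding vectors
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) throws Exception {
        String target = "car";
        // Hypothetical candidate list; in practice take it from a word list or thesaurus
        String[] candidates = {"vehicle", "automobile", "bicycle", "banana"};

        try (Bert bert = Bert.load(new File("com/robrua/nlp/easy-bert/bert-uncased-L-12-H-768-A-12"))) {
            float[] targetVec = bert.embedSequence(target);
            for (String candidate : candidates) {
                float[] candidateVec = bert.embedSequence(candidate);
                // Higher cosine similarity = closer in the embedding space
                System.out.printf("%s -> %s : %.4f%n", target, candidate, cosine(targetVec, candidateVec));
            }
        }
    }
}
Single words out of context lose much of what BERT is good at, so embedding the word inside a carrier sentence (or averaging the embedTokens vectors for the word's positions in real sentences) will usually give more useful neighbors than isolated words.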

Related

Azure full phrase match cognitive search query returning results which don't fully match

I am getting unexpected results when using a phrase search query. According to the Microsoft docs (https://learn.microsoft.com/en-us/azure/search/query-simple-syntax), phrases encapsulated within quotation marks (" ") should only return the full phrase. However, I am getting back results that I shouldn't, as they don't fully match.
Query string: "building"&parameterName=propertyName&queryType=Full
Results:
"value": [
{
"#search.score": 3.236124,
"id": "PROP127",
"propertyName": "SILVER BUILDING",
"address": "test address",
"fullAddress": "test full address",
"division": "commercial",
"transaction": "lettings",
"selectedCount": null
},
{
"#search.score": 3.2345672,
"id": "PROP323",
"propertyName": "SJW BUILDING",
"address": "test address",
"fullAddress": "test full address",
"division": "commercial",
"transaction": "lettings",
"selectedCount": null
},
The results return property names containing the word building, but these should only appear when typing in "Silver Building", for example.
Is there something wrong with the query string?
Any help would be much appreciated!
The document with the property name "silver building" is being returned because Azure Cognitive Search tokenizes the phrase into individual terms. Therefore, searching for any of the following will return the document:
building
silver
silver building
"silver building"
The quotes are used to make sure that a specific phrase is found within a document, but they do not mean exact match. For example, a document with the phrase "The quick brown fox" will be found if you search for "brown fox" or "quick brown".
If you do not want the field to be tokenized (broken up into words), you can use the keyword analyzer, which emits the entire field value as a single token. This means a document with "Silver Building" will only match when you search for that exact text.
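For example, the field definition in the index schema could look roughly like this (the field name is taken from the question; the rest of the index definition is omitted):
{
    "name": "propertyName",
    "type": "Edm.String",
    "searchable": true,
    "analyzer": "keyword"
}
Note that with the keyword analyzer the whole value is kept as one token, so matching also becomes case- and whitespace-sensitive.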

GPT-3 davinci gives different results with the same prompt

I am not sure if you have access to GPT-3, particularly DaVinci (the complete-a-sentence tool). You can find the API and info here.
I've been trying this tool for the past hour, and every time I hit their API with the same prompt (indeed the same input), I receive a different response.
Do you happen to encounter the same situation?
If this is expected, do you happen to know the reason behind it?
Here are some examples
Request body (I tried to use the same example they provide):
{
    "prompt": "Once upon a time",
    "max_tokens": 3,
    "temperature": 1,
    "top_p": 1,
    "n": 1,
    "stream": false,
    "logprobs": null,
    "stop": "\n"
}
Output 1
"choices": [
{
"text": ", this column",
"index": 0,
"logprobs": null,
"finish_reason": "length"
}
]
Output 2
"choices": [
{
"text": ", winter break",
"index": 0,
"logprobs": null,
"finish_reason": "length"
}
]
Output 3
"choices": [
{
"text": ", the traditional",
"index": 0,
"logprobs": null,
"finish_reason": "length"
}
]
I just talked to OpenAI and they said that their responses are not deterministic. They are probabilistic so that the model can be creative. In order to make the output deterministic, or at least reduce the randomness, they suggest adjusting the temperature parameter. By default it is 1 (i.e. 100% risk-taking). If you want to make it completely deterministic, set it to 0.
Another parameter is top_p (default = 1) that can also be used to control how deterministic the output is, but they don't recommend tweaking both temperature and top_p; adjusting one of them is enough.
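So, to get (near-)deterministic completions for the example above, the same request body can be sent with only the temperature changed to 0:
{
    "prompt": "Once upon a time",
    "max_tokens": 3,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": false,
    "logprobs": null,
    "stop": "\n"
}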
OpenAI documentation:
https://beta.openai.com/docs/api-reference/completions/create
temperature number Optional Defaults to 1
What sampling temperature to use. Higher values means the model will
take more risks. Try 0.9 for more creative applications, and 0 (argmax
sampling) for ones with a well-defined answer.
We generally recommend altering this or top_p but not both.
top_p number Optional Defaults to 1
An alternative to sampling with temperature, called nucleus sampling,
where the model considers the results of the tokens with top_p
probability mass. So 0.1 means only the tokens comprising the top 10%
probability mass are considered.
We generally recommend altering this or temperature but not both.

Searching for terms with underscore doesn't return expected results

How can I search for a document named "Hola-Mundo_Army.jpg" using the term Army* (always with the asterisk at the end, please)? The thing is that if I search the documents using Army*, the result count is zero. I think the problem is the underscore before the word Army.
But if I search for Mundo_Army*, one result is found, correctly.
docs?api-version=2016-09-01&search=Mundo_Army* <--- 1 result OK
docs?api-version=2016-09-01&search=Army* <--- 0 results and it should find 1 result like the previous search. I always need to use the asterisk at the end.
Thank you!
This is the blob information that I have to search and find:
{
"#search.score": 1,
"content": "{\"azure_cdn\":\"http:\\/\\/dev-dr-documents.azureedge.net\\/localhost-hugo-docs-not-indexed\\/Hola-Mundo_Army.jpg\"}\n",
"source": "dr",
"title": "Hola-Mundo_Army.jpg",
"file_name": "Hola-Mundo_Army.jpg",
"file_type": "Image",
"year_created": "2017",
"client": "LALALA",
"brand": "LELELE",
"description": "HUGO_DEV-TUCUMAN",
"categories": "Clothing and Accessories",
"media": "Online media",
"tags": null,
"channel": "Case Study",
"azuresearch_skipcontent": "1",
"id": "1683",
"metadata_storage_content_type": "application/octet-stream",
"metadata_storage_size": 109,
"metadata_storage_last_modified": "2017-04-26T18:30:35Z",
"metadata_storage_content_md5": "o2yZWelvS/EAukoOhCuuKg==",
"metadata_storage_name": "Hola-Mundo_Army.json",
"metadata_content_encoding": "ISO-8859-1",
"metadata_content_type": "text/plain; charset=ISO-8859-1",
"metadata_language": "en"
}
The best way to troubleshoot cases like this is with the Analyze API. It helps you understand how your documents and query terms are processed by the search engine. In your case, assuming you are not setting the analyzer property on the field you are searching against, the text Hola-Mundo_Army.jpg is broken down by the default analyzer into the following two terms: hola and mundo_army.jpg. These are the terms that are in your index. That's why, when you search for the prefix mundo_army*, the term mundo_army.jpg is matched, while the prefix army* doesn't match anything in your index.
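You can verify this yourself by posting the file name to the Analyze API, roughly like this (the index name is a placeholder, and you may need a newer api-version than the one in your queries); the response should list the two tokens mentioned above:
POST /indexes/your-index/analyze?api-version=2016-09-01
{
    "text": "Hola-Mundo_Army.jpg",
    "analyzer": "standard"
}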
You can learn more about the default behavior of the search engine and how to customize it in this article: How full text search works in Azure Search

Data structure / data model for multi-language phrasebook

We want to create a multi-language phrasebook / dictionary for a specific area.
And now I'm thinking about the best data structure / data model for that.
Since it should be more phrasebook than dictionary, we want to keep the data model / structure simple at first. It should only be used for fast translation: i.e. the user selects two languages, types a word and gets the translation. The article and description parts are just for display, not for search.
There are some specific cases I'm thinking about:
One term can be expressed with several (1..n) words in any language
Any term can also be translated into several (1..m) words in another language
In some languages the word's article could be important to know
For some words a description could be important (e.g. for words from dialects etc.)
I'm not sure about one point: am I reinventing the wheel by creating a data model myself? But I couldn't find any existing solutions.
I've just drafted a JSON data model, but I'm not sure whether it is good enough:
[
  {
    "wordgroup-id": 1,
    "en": [
      {"word": "car", "plural": "cars"},
      {"word": "auto", "plural": "autos"},
      {"word": "vehicle", "plural": "vehicles"}
    ],
    "de": [
      {"word": "Auto", "article": "das", "description": "Some explanation, e.g. when to use this word", "plural": "Autos"},
      {"word": "Fahrzeug", "article": "das", "plural": "Fahrzeuge"}
    ],
    "ru": [...],
    ...
  },
  {
    "wordgroup-id": 2,
    ...
  },
  ...
]
I also thought about some "corner" cases #triplee wrote about. I thought to solve them with some kind of redundancy. Only the word group id and the word within a language should be unique.
I would be very thankful for any feedback on this first draft of the data model.

How do I find all exact matches within a block of text in Elasticsearch?

I've got an index of hundreds of book titles in Elasticsearch, with documents like:
{"_id": 123, "title": "The Diamond Age", ...}
And I've got a block of freeform text entered by a user. The block of text could contain a number of book titles throughout it, with varying capitalization.
I'd like to find all the book titles in the block of text, so I can link to the specific book pages.
Any idea how I can do this? I've been looking around for exact phrase matches in blocks of text, with no luck.
You need to index the title field as not_analyzed or with the keyword analyzer.
This tells Elasticsearch not to analyze the field, which makes exact match searches possible.
I would suggest keeping an analyzed version as well as a not_analyzed version so you can run both exact and analyzed searches. Your mappings would look like this; here I assume the type name is movies.
"mappings":{
"movies":{
"properties":{
"title":{
"type": "string",
"fields":{
"row":{
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
This gives you two fields: title, which contains the analyzed title, and title.row, which contains the exact value indexed with no processing at all.
title.row will only match if you enter the exact title.
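For example, an exact lookup against the not_analyzed sub-field could look roughly like this (the index name books is just an assumption):
GET /books/_search
{
    "query": {
        "term": {
            "title.row": "The Diamond Age"
        }
    }
}
Note that a term query against a not_analyzed field is case-sensitive, so it only matches the title exactly as indexed; for titles with varying capitalization you would query the analyzed title field (e.g. with a match_phrase query) or add a custom analyzer that only lowercases.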
