Data structure / data model for multi-language phrasebook - nlp

We want to create a multi-language phrasebook / dictionary for a specific area,
and I'm now thinking about the best data structure / data model for it.
Since it should be more phrasebook than dictionary, we want to keep the data model / structure simple at first. It will only be used for fast translation: the user selects two languages, types a word and gets the translation. The article and description parts are only for display, not for search.
There are some specific cases I'm thinking about:
One term can be expressed with several (1..n) words in any language
Any term can also be translated into several (1..m) words in another language
In some languages the word's article can be important to know
For some words a description can be important (e.g. for words from dialects etc.)
I'm not sure about one point: am I reinventing the wheel by creating a data model myself? But I couldn't find any existing solutions.
I've just created a JSON data model, and I'm not sure whether it is good enough:
[
    {
        "wordgroup-id": 1,
        "en": [
            {"word": "car", "plural": "cars"},
            {"word": "auto", "plural": "autos"},
            {"word": "vehicle", "plural": "vehicles"}
        ],
        "de": [
            {"word": "Auto", "article": "das", "description": "Some explanation, e.g. when to use this word", "plural": "Autos"},
            {"word": "Fahrzeug", "article": "das", "plural": "Fahrzeuge"}
        ],
        "ru": [...],
        ...
    },
    {
        "wordgroup-id": 2,
        ...
    },
    ...
]
I also thought about some "corner" cases #triplee wrote about. I plan to solve them with some kind of redundancy. Only the word group id and the word within a language have to be unique.
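For illustration, here is a minimal Python sketch of the fast lookup this model is meant to support (the index-building approach, the function names and the file name are my own assumptions, not part of the draft):

import json

def build_index(wordgroups, lang):
    """Map each word of `lang` (lowercased) to its word group for O(1) lookup."""
    index = {}
    for group in wordgroups:
        for entry in group.get(lang, []):
            index[entry["word"].lower()] = group
    return index

def find_translations(index, word, target_lang):
    """Return all entries of the target language from the matching word group."""
    group = index.get(word.lower())
    return group.get(target_lang, []) if group else []

# assuming the draft model above is stored in phrasebook.json
with open("phrasebook.json", encoding="utf-8") as f:
    wordgroups = json.load(f)
en_index = build_index(wordgroups, "en")
print(find_translations(en_index, "car", "de"))  # -> the 'Auto' and 'Fahrzeug' entries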
I would be very thankful for any feedback on this first draft of the data model.

Related

python Spacy custom NER – how to prepare multi-word entities?

:) Please help :)
I'm preparing custom Named Entity Recognition using a spaCy (blank) model. I use only one entity: Brand (we can name it 'ORG', as in Organisation). I have short texts with ORGs and have prepared data like this (but I can change it):
train_data = [
    ('First text in string with Name I want', {'entities': [(START, END, 'ORG')]}),
    ('Second text with Name and Name2', {'entities': [(START, END, 'ORG'), (START2, END2, 'ORG')]})
]
START and END are the start and end indexes of the brand name in the text, of course.
This is working well, but...
The problem I have is how to prepare entities for brands that consist of 2 (or more) words.
Let's say the brand name is the full name of a company. How do I prepare the entity?
Consider the tuple itself for a single text:
text = 'Third text with Brand Name'
company = 'Brand Name'
Can I treat company as one word?
('Third text with Brand Name', {'entities': [(16, 26, 'ORG')]})
Or as 2 separate brands, 'Brand' and 'Name'? (This will not be useful in my case when using the model later. :( )
('Third text with Brand Name', {'entities': [(16, 21, 'ORG'), (22, 26, 'ORG')]})
Or should I use a different labeling format, e.g. BIO?
So Brand would be B-ORG and Name would be I-ORG?
If so, can I prepare it like this for spaCy:
('Third text with Brand Name', {'entities': [(16, 21, 'B-ORG'), (22, 26, 'I-ORG')]})
or should I change the format of train_data because I also need the 'O' from BIO?
How? Like this?
('Third text with Brand Name', {'entities': ['O', 'O', 'O', 'B-ORG', 'I-ORG']})
The question is about the format of the train_data for 'Third text with Brand Name' - how to label the entity. If I have the format, I will handle the code. :)
The same question applies to entities of 3 or more words. :)
You can just provide the start and end offsets for the whole entity. You describe this as "treating it as one word", but the character offsets don't have any direct relation to tokenization - they won't affect tokenizer output.
You will get an error if the start and end of your entity don't match token boundaries, but it doesn't matter if the entity is one token or many.
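For instance, a two-token brand gets a single (start, end) pair, and you can check the alignment quickly (a minimal sketch, assuming the v2-style training format from the question):

import spacy

nlp = spacy.blank("en")
text = 'Third text with Brand Name'
train_example = (text, {'entities': [(16, 26, 'ORG')]})  # one entity, two tokens

# char_span returns None when the offsets don't align with token boundaries
doc = nlp.make_doc(text)
print(doc.char_span(16, 26, label='ORG'))  # -> Brand Name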
I recommend you take a look at the training data section in the spaCy docs. Your specific question isn't answered explicitly, but that's only because multi-token entities don't require special treatment. The examples there include multi-token entities.
Regarding BIO tagging, for details on how to use it with spaCy you can see the docs for spacy convert.

String matching keywords and key phrases in Python

I am trying to perform a smart dynamic lookup with strings in Python for an NLP-like task. I have a large number of similarly structured sentences that I would like to parse, tokenizing parts of each sentence. For example, I first parse a string such as "bob goes to the grocery store".
I am taking this string in, splitting it into words and my goal is to look up matching words in a keyword list. Let's say I have a list of single keywords such as "store" and a list of keyword phrases such as "grocery store".
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery store', 'computer store', 'coffee shop']
for word in sample.split():
    # do dynamic length lookups
Now the issue is this: sometimes my sentences might simply be "bob goes to the store" instead of "bob goes to the grocery store".
I want to find the keyword "store" in any case, but if there are descriptive words such as "grocery" or "computer" before the word "store", I would like to capture those as well. That is why I have the keyphrases list. I am trying to figure out a way to capture a keyword at the very least, and then, if there are words related to it that might form a "phrase", capture those too.
Maybe an alternative is to have some sort of adjective list instead of a phrase list of multiple words?
How could I go about doing these sort of variable length lookups where I look at more than just a single word if one is captured, or is there an entirely different method I should be considering?
Here is how you can use a nested for loop and a formatted string (note that keyphrases now holds only the modifier words):

sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery', 'computer', 'coffee']

for kw in keywords:
    for kp in keyphrases:
        # check whether the modifier + keyword combination occurs in the sentence
        if f"{kp} {kw}" in sample:
            print(f"matched phrase: {kp} {kw}")  # -> matched phrase: grocery store
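If you also want the bare-keyword fallback from the question, one simple extension (my own sketch, not part of the answer above) is to check the longest candidates first, so "grocery store" wins over "store" when both are present:

keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery store', 'computer store', 'coffee shop']

def find_matches(sentence, candidates):
    """Return candidates found in the sentence, preferring longer matches."""
    found = []
    rest = sentence
    # longest candidates first, so a phrase beats the keyword it contains
    for cand in sorted(candidates, key=len, reverse=True):
        if cand in rest:
            found.append(cand)
            rest = rest.replace(cand, ' ')
    return found

print(find_matches('bob goes to the grocery store', keyphrases + keywords))  # -> ['grocery store']
print(find_matches('bob goes to the store', keyphrases + keywords))          # -> ['store']

Note that the naive substring check would also match 'shop' inside 'shopping'; splitting into tokens or using regex word boundaries would tighten this.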

Generating synonyms or similar words using BERT word embeddings

I want to generate synonyms or similar words using BERT word embeddings.
I started to do this using BERT.
For later software integration, it has to be done in Java, so I went with easy-bert
(https://github.com/robrua/easy-bert).
It appears I can get word embeddings this way:
try (Bert bert = Bert.load(new File("com/robrua/nlp/easy-bert/bert-uncased-L-12-H-768-A-12"))) {
    float[][] embedding = bert.embedTokens("A sequence");
    float[][][] embeddings = bert.embedTokens("Multiple", "Sequences");
}
Do you know how I could get similar words from these word embeddings?
Thanks for your help!
I developed a way to do this using Luminoso. I work for them, so this is a bit of an ad, but it does exactly what you want.
https://www.luminoso.com/search
Luminoso is really good at understanding conversational text like product reviews, product descriptions, survey results and trouble tickets. It doesn't require ANY kind of training or ontology building and will build a language model around your language. You feed the text for your pages into Luminoso and it generates a set of synonyms for the concepts used in your text.
As an example project, I built a search over Amazon.com beauty products. I'll just copy a few of the automatically generated synonym sets. There were 17851 synonyms generated from this dataset.
scent, rose-like, not sickeningly, not nauseating, not overwhelming, herb-y, no sweetness, cucumber-y, not too citrus-y, no gardenia, not lemony, pachouli, vanilla-like, fragarance, not spicy, flowerly, musk, perfume-like, floraly, not cloyingly => scent
recommend, recommende, advice, suggestion, highly recommend, suggest, recommeded, recommendation, recommend this product, reccommended, advise, suggest, indicated, suggestion, advice, agree, recommend, say, considering, mentioned => recommend
bottle, no sprayer, 8-oz, beaker, decanter, push-down, dispenser, pipet, pint, not the bottle, no dropper, keg, gallon, jug, pump-top, liter, half-full, decant, tumbler, vial => bottle
eczema, non-steroidal, ulcerative, dematitis, ecsema, Elidel, dermititis, inflammation, pityriasis, hydrocortizone, dyshidrotic, chickenpox, Stelatopia, perioral, rosacea, dry skin, nummular, ecxema, mild-moderate, ezcema => eczema
There were 800k products in this search index, so the results were large as well, but this works on small datasets too.
Besides the synonym format shown here, you can also put this directly into Elasticsearch and associate the synonyms for a specific page with that page.
Below is a sample of an Elasticsearch index enhanced with the same technology. It's dialed up super high, so too many concepts are added, but it shows how well it finds relationships between concepts.
{"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "130414089X"}}
{"title": "New Benefit Waterproof Automatic Eyeliner Pen - Black - BAD Gal Liner", "text": "Length : 13.5 cm\nColor: Black\n100% Brand new and unused.\nSmudge free.\nFine-tip. Easy to blend and smooth to apply\nCan make fine and bold eyeline with new texture and furnishing.\nProvide rich and consistant colour\nLongwearing and waterproof\nFregrance Free", "primary_concepts": ["not overpoweringly", "concoction", "equipped", "fine-tip", "water-resistant", "luxuriant", "make", "fixture", "☆", "not lengthen", "washable", "not too heady", "blendable", "doesn't collect", "shade", "niche", "supple", "smudge-proof", "sumptuous", "movable", "black", "over-apply", "quick", "silky", "colored", "sweatproof", "opacity", "accomodate", "fuchsia", "furnishes", "meld", "sturdily", "smear", "inch", "mid-back", "chin-length", "smudge", "alredy", "not cheaply", "long-wearing", "eyeline", "texture", "steady", "no-name", "audacious", "easy", "edgy", "is:A", "marketers", "greys", "decadent", "applicable", "Crease-free", "magenta", "free", "itIn", "stay-true", "racy", "application", "glides", "smooth", "sleek", "taupe", "grainy", "dark", "wealthy", "JP7506CF", "gray", "grayish", "width", "newness", "purfumes", "Lancme", "blackish", "easily", "doesn't smudge", "maroon", "blend", "convenient", "smoother", "Moschino", "long-wear", "mauve", "medium-length", "no raccoon", "revamp", "demure", "richly", "white", "brand", "offers", "lenght", "soft", "doesn't smear", "provide", "provides", "unusable", "eye-liner", "unopened", "straightforward", "silky-smooth", "uniting", "compactness", "bold", "fearless", "mix", "indulgent", "brash", "serviceable", "unmarked", "not musky", "constructed", "racoon", "smoothly", "sealant", "merged", "boldness", "reuse", "unused", "long", "Kors", "effortless", "luscious", "stain", "rich", "discard", "richness", "opulent", "short", "consistency", "fine", "sents", "newfound", "fade-resistant", "mixture", "hue", "sassy", "apply", "fragnance", "heathy", "adventurous", "not enthusiastic", "longwearing", "fregrance", "non-waterproof", "empty", "lashline", "simple", "newly", "you'r", "combined", "no musk", "mingle", "waterproof", "painless", "pinkish", "thickness", "clump-free", "gos", "consistant", "color", "smoothness", "name-brand", "new", "smudgeproof", "yaaay", "water-proof", "eyemakeup", "not instant", "spidery", "furnish", "tint", "product", "reapply", "not black", "no globs", "imitators", "blot", "cinch", "uncomplicated", "untouched", "length"], "related_concepts": ["eyeliner", "no goofs", "doesn't smear", "pen", "hundreds"]}
{"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "130414643X"}}
{"title": "Goodskin Labs Eyliplex-2 Eye Life and Circle Reducer - 10ml", "text": "Eyliplex-2 is a dual solution that focuses on the problematic eye area. This breakthrough, 24-hour system from the scientists at good skin pharmacy visibly tightens eye areas while reducing dark circles. 0.34 oz. each. 64% of subjects reported younger looking eyes immediately and a 20% reduction in the appearance of dark circles in clinical studies.", "primary_concepts": ["coloration", "Laboratories", "oncology", "cornea", "undereye", "eye", "immediately", "☆", "teen", "dry-skin", "good", "eyelids", "puffiness", "behold", "research", "temperamental", "dermatological", "breakthrough", "study", "store", "nice", "lasik", "instantaneously", "teenaged", "multi", "rheostat", "dermatology", "chemist", "invisibly", "PhD", "pharmacy", "alredy", "not cheaply", "optional", "pharmacist", "Obagi-C", "topic", "supermarket", "reversible", "studies", "Younger", "medically", "report", "thermo", "tightness", "dual", "eliminate", "researcher", "Minimization", "cutaneous", "hydration", "O2", "taupe", "increase", "moisturization", "dark", "preliminary", "excellent", "Quad", "well", "appearance", "dusky", "quickly", "instantly", "CVS", "Dermal", "great", "revolutionary", "biologist", "epidermis", "blackish", "disclosed", "problem", "youngsters", "murky", "scientific", "teenager", "oz", "dark circles", "clinically", "emphasis", "absorption", "skin", "loosen", "intractable", "technological", "reduction", "clinician", "nutritional", "forthwith", "grocer", "scientifically", "swiftly", "examination", "state-of-the-art", "not acne prone", "zone", "decrease", "younger-looking", "excellently", "troublesome", "system", "radius", "tighten", "FDA", "decent", "noticeably", "WD-40", "clearer", "scientist", "saggy", "significantly", "improvement", "Teamine", "interchangeable", "visible", "visable", "no fine line", "shortly", "minimize", "survey", "problematic", "young", "glance", "racoon", "vicinity", "youthful", "exacerbated", "focal", "region", "groundbreaking", "reddish", "focus", "reduce", "increments", "nad", "fasten", "area", "soon", "complexion", "squinting", "look", "grocery", "eyliplex-2", "Eyliplex-2", "subsequently", "even-toned", "bothersome", "eyes", "mitigate", "markedly", "philosophy:you", "difficult", "darkish", "bluish", "satisfactory", "darken", "epidermal", "lessen", "appearence", "ocular", "ergonomically", "diminished", "progression", "purplish", "sun-damaged", "Cellex-C", "visibly", "diagnosis", "drugstore", "under-eye", "apothecary", ":-D", "terrific", "clinical", "oz.", "Endocrinology", "time-released", "Nouriva", "tight", "adolescent", "subject", "eyeballs", "sking", "Pro-Retinol", "aggravate", "younger", "shortcomings", "solution", "assess", "promptly", "teenage", "Kinetin", "24-hour", "Mart", "youth", "visibility", "scientists", "taut", "better", "eyesight", "no dark circles", "not reduce", "photoaging", "Pending"], "related_concepts": ["A22", "A82", "Amazon", "daytime", "HK", "nighttime", "smell", "dark circles", "purchased"]}
{"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "1304146537"}}
Luminoso uses word embeddings from ConceptNet, which it also develops, and the technology goes above and beyond what ConceptNet alone gives you. I'm biased, but every time I've run data through it I'm amazed. It's not free, but it really works with absolutely zero pre-training of your data, and nothing is actually free.
A similar task for this subject (lexical substitution) is covered by the LS07 and LS14 benchmarks.
One researcher achieved the SOTA on those benchmarks using BERT.
You'd be interested in reading this paper:
https://www.aclweb.org/anthology/P19-1328.pdf
The author describes the approach as follows:
"applies dropout to the target word's embedding for partially masking the word, allowing BERT to take balanced consideration of the target word's semantics and contexts for proposing substitute candidates, and then validates the candidates based on their substitution's influence on the global contextualized representation of the sentence."
I don't know how to reproduce the same result because the implementation is not open to the public. But here's the hint: the embedding dropout could be applied to generate substitute candidates.
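Independently of that paper, the basic nearest-neighbour step over embeddings looks like this (a minimal sketch of my own, in Python for brevity - the same arithmetic ports to Java; most_similar and vocab_vecs are illustrative names, and the toy vectors stand in for real BERT outputs):

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(target_vec, vocab_vecs, top_n=5):
    """Rank vocabulary words by cosine similarity to the target embedding."""
    scored = [(word, cosine(target_vec, vec)) for word, vec in vocab_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# vocab_vecs would be built by embedding every vocabulary word,
# e.g. via bert.embedTokens(...) on the Java side; toy vectors here:
vocab_vecs = {
    'car': np.array([1.0, 0.1]),
    'auto': np.array([0.9, 0.2]),
    'tree': np.array([0.0, 1.0]),
}
print(most_similar(np.array([1.0, 0.0]), vocab_vecs, top_n=2))  # -> 'car', then 'auto'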

Elasticsearch: mapping text field for search optimization

I have to implement a text search application that indexes news articles and then allows a user to search for keywords, phrases or dates inside these texts.
After some consideration regarding my options (mainly SOLR vs. Elasticsearch), I ended up doing some testing with Elasticsearch.
The part I am stuck on now concerns the mapping and search query construction options best suited for some special cases I have encountered. My current mapping has only one field that contains all the text; it needs to be analyzed in order to be searchable.
The specific part of the mapping with the field:
"txt": {
"type" : "string",
"term_vector" : "with_positions_offsets",
"analyzer" : "shingle_analyzer"
}
where shingle_analyzer is:
"analysis" : {
"filter" : {
"filter_snow": {
"type":"snowball",
"language":"romanian"
},
"shingle":{
"type":"shingle",
"max_shingle_size":4,
"min_shingle_size":2,
"output_unigrams":"true",
"filler_token":""
},
"filter_stop":{
"type":"stop",
"stopwords":["_romanian_"]
}
},
"analyzer" : {
"shingle_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["lowercase","asciifolding", "filter_stop","filter_snow","shingle"]
}
}}
My question concerns the following situations:
I have to search for "ING", and several "ing." hits are returned.
I have to search for "E!", and the analyzer strips the punctuation, so there are no results.
I have to search for certain uppercased common terms used as company names (like "Apple", but with multiple words), and the lowercase filter produces useless results.
My idea would be to index the text into several fields with different filters that could cover all these possible issues.
Three questions:
Is splitting the field into three fields with different analyzers the correct way?
How would I use the different fields when searching?
Could someone explain how scoring would work to include all these fields?
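For what it's worth, here is a minimal sketch of that multi-field idea (my own illustration, not a confirmed answer; the sub-field names "exact" and "plain" are invented, and it keeps the pre-5.x "string" type used above). Each sub-field gets its own analyzer, e.g. a case- and punctuation-preserving one for "ING" / "E!" style queries:

"txt": {
    "type": "string",
    "analyzer": "shingle_analyzer",
    "fields": {
        "exact": { "type": "string", "analyzer": "whitespace" },
        "plain": { "type": "string", "analyzer": "standard" }
    }
}

A multi_match query can then target all variants at once; by default it scores with best_fields, so the highest-scoring sub-field wins, and a boost like txt.exact^2 weights exact-case matches higher:

{
    "query": {
        "multi_match": {
            "query": "ING",
            "fields": ["txt", "txt.exact^2", "txt.plain"]
        }
    }
}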

Mongoose.js schema description issue (array vs object)

I need to store some user dictionaries (sets of words, stored as a user property), and each dictionary actually has only one property: language.
So I can describe that property like this:
dictionaries: [{
    language: 'string',
    words: [.. word entry schema desc ..]
}]
and store dictionaries like this:
dictionaries: [
    {language: 'en', words: [.. words of English dictionary ..]},
    {language: 'es', words: [.. words of Spanish dictionary ..]}
]
But actually I could store the dictionaries in a "less nested" way: not an array but an object:
dictionaries: {
    en: [.. words of English dictionary ..],
    es: [.. words of Spanish dictionary ..]
}
But I don't see a way to describe such an object with a Mongoose schema. So the question is: which option is better (more reasonable in terms of storage and querying), given that I use Mongoose?
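For reference, a minimal sketch of how the "less nested" object form could still be described, assuming Mongoose 5+ (which added Map types); wordSchema and the field layout here are illustrative:

const mongoose = require('mongoose');

const wordSchema = new mongoose.Schema({ word: String });

const userSchema = new mongoose.Schema({
    // keys are arbitrary language codes ('en', 'es', ...),
    // values are arrays of word entries
    dictionaries: {
        type: Map,
        of: [wordSchema]
    }
});

With a Map you keep the flatter shape without giving up schema validation; entries are then accessed as user.dictionaries.get('en').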
