I want to format a parallel text so that words and sentences are aligned in two or more languages. Most of the structured text formats I found are XML based and are used by translation tools or Bible software. I want to find or create a format suitable for reading foreign language texts. The reader will have the ability to select words and see their equivalent in the source or target language.
I've thought about using multidimensional arrays with words aligned by index. But the issue is that there are many words and phrases which do not have a one-to-one mapping. So then I thought about using a relational database, such as SQLite. I could have a table for each language with each word indexed by id and join tables for the alignment. But then the question is how to represent punctuation, paragraph breaks, and other necessary formatting.
Are there other data structures or formats I have not thought of? Ideally it would be a flat-file, markup format to facilitate editing.
Presumably you have one or more text files, one in Language A, one in Language B, etc., each of the latter being a translation of the first. With that assumption:
You could mark up your plain-text file(s) with uniquely numbered tags around words, phrases and/or punctuation, e.g.: "Dear Sir, How are you today?" translated to German becomes: "Sehr geehrter Herr, wie geht es dir heute?":
<Language-English:<11:<4:<1:Dear> <2:Sir><3:,>> <10:<7:<5:How are> <6:you>> <8:today><9:?>>>>
<Language-Deutsch:<11:<4:<1:Sehr geehrter> <2:Herr><3:,>> <10:<7:<5:wie geht> <6:es>> <8:dir heute><9:?>>>>
My German is fairly rusty, so I may not have the tags quite correct, but they should still show what I have in mind.
As you can see, the entire sentence and its parts each have their own tags. When displaying the text, each <n: > pair would be stripped out, and could be replaced with an underline or some other form of highlighting to indicate the groups. Of course, there could be multiple underlines/highlights (this example would have up to four). When clicking on (and visually emphasizing) the highlight on the text in Language A, the corresponding highlight(s) in Language B (and other languages if present) would also be emphasized.
Naturally, it would most likely be the job of a human translator to do the markup as automating the actual translation and applying tags at that point is a non-trivial task.
However, a UI where elements in each language could be simultaneously highlighted then marked as being equivalent could facilitate the process of generating the marked-up file(s).
As to your other considerations (arrays and databases), they seem to be something of an over-complication. You would still have to somehow mark up your texts so that they could be loaded into these structures, since words or even phrases in one language don't necessarily have a 1:1 translation to their equivalent in another language, and usually can't easily be translated by machine. Once you have the markup, talking about array/dictionary/database/other structures becomes a bit irrelevant, and only of concern to the UI programmer.
EDIT:
On further consideration, the tags may not be perfectly nested, and may be split, so you may need a <n: :n> tag pair, to allow partially overlapping and split tagged areas. E.g.:
<1:The:1> <2:black:2> <1:dog:1> <3:and <4:the dog:3>'s puppies:4>
has the fragments: "The dog", "black", "and the dog", and "the dog's puppies".
Edit 2:
You could support multi-texts by just having the tag pair IDs unique across all languages:
EN: "The Knight's coat of arms was defaced"
DE: "Das Wappen des Ritters wurde verunstaltet"
FR: "Le blason du Chevalier a été abîmé"
<1:The Knight's:1> <2:coat of arms:2> <5:<3:was:3> <4:defaced:4>:5>.
<2:Das Wappen:2> <1:des Ritters:1> <3:wurde:3> <4:verunstaltet:4>.
<2:Le blason:2> <1:du Chevalier:1> <5:a été abîmé:5>.
As you can see, tags 1, 2, 3 & 4 are applicable to English and German, and tags 1, 2 & 5 are applicable to English and French. The tags could quite easily be split and partially overlapping.
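To make the parsing side concrete, here is a minimal Python sketch (the names are entirely mine, purely illustrative) of reading the final <n: ... :n> syntax. It strips the tags for display while collecting each tag's text, and handles split and partially overlapping spans by tracking the set of currently open IDs:

import re

# Matches an opening tag "<n:", a closing tag ":n>", or any single character.
TOKEN = re.compile(r'<(\d+):|:(\d+)>|(.)', re.DOTALL)

def parse(marked_up):
    spans = {}        # tag id -> list of text fragments (split tags have several)
    open_ids = set()  # tags currently open at this point in the text
    plain = []        # the displayable text with all tags stripped
    for open_id, close_id, char in TOKEN.findall(marked_up):
        if open_id:
            open_ids.add(int(open_id))
            spans.setdefault(int(open_id), []).append([])  # start a new fragment
        elif close_id:
            open_ids.discard(int(close_id))
        else:
            plain.append(char)
            for tag_id in open_ids:
                spans[tag_id][-1].append(char)
    return (''.join(plain),
            {k: ' '.join(''.join(f) for f in frags) for k, frags in spans.items()})

# parse("<1:The:1> <2:black:2> <1:dog:1>")
#   -> ('The black dog', {1: 'The dog', 2: 'black'})

Clicking a highlight in one language would then just be a lookup of the same ID in the other languages' span maps.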
Related
I have a text field for tags. For example, some entities:
{"tags": "apple. fruits. eat."}
{"tags": "green apple."}
{"tags": "banana. apple."}
I want to select entities with the tag apple, but not green apple or something-apple-something. The usual approaches all come down to the same thing: select any text that contains the expression, regardless of what the rest of it looks like. But in this case the exact form matters.
How can I do this using Lucene syntax or Azure Search tools? Or, in general, how can I search for an exactly matching phrase?
I presume that the "." is a delimiter for the different tags. There may be a way to express this in Lucene, but you may need to add a custom analyzer to preserve the "."s during tokenization.
A better strategy in this case would be to use a field of type Collection(Edm.String). This will let you preserve the structure of the tag phrases, and you can use a filter to select the specific value "apple". Collection(Edm.String) also allows you to enable faceting on the tags, which is useful.
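For illustration (a sketch; the field name "tags" is assumed), the field definition in the index and a filter that matches exactly "apple" might look like:

{ "name": "tags", "type": "Collection(Edm.String)", "filterable": true, "facetable": true }

$filter=tags/any(t: t eq 'apple')

Because each tag is stored as a discrete string in the collection, the filter compares against whole values, so "green apple" will not match.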
I am using SyntaxNet in Spanish and I have found that all the words have a field called "feats" whose format depends on the type of word (noun, pronoun, verb). There are some fields whose meaning is obvious, but in other cases I cannot figure out what is being shown. For example, this is the case with fields such as "fPOS" or "Case" in pronouns. Is there any guide or list with explanations available?
Universal features are documented here.
I have a document
doc = nlp('x-xxmessage-id:')
When I want to extract the tokens of this one, I get 'x', 'xx', 'message', 'id' and ':'. Everything goes well.
Then I create a new document
test_doc = nlp('id')
If I try to extract the tokens of test_doc, I will get 'i' and 'd'. Is there any way to get past this problem? Because I want to get the same token as above and this is creating problems in the text processing.
Just like language itself, tokenization is context-dependent and the language-specific data defines rules that tell spaCy how to split the text based on the surrounding characters. spaCy's defaults are also optimised for general-purpose text, like news text, web texts and other modern writing.
In your example, you've come across an interesting case: the string "x-xxmessage-id:" is split on punctuation, while the isolated lowercase string "id" is split into "i" and "d", because in written text, it's most commonly an alternate spelling of "I'd" or "i'd" ("I would", "I had" etc.). You can find the respective rules here.
If you're dealing with specific texts that are substantially different from regular natural language texts, you usually want to customise the tokenization rules or possibly even add a Language subclass for your own custom "dialect". If there's a fixed number of cases you want to tokenize differently that can be expressed by rules, another option would be to add a component to your pipeline that merges the split tokens back together.
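For example, here's a minimal sketch of the rule-based route, using a tokenizer special case (the model name is just a placeholder for however you load your pipeline):

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm')  # placeholder: load your own pipeline here
# Override the default exception so the isolated string 'id' stays one token
nlp.tokenizer.add_special_case('id', [{ORTH: 'id'}])
# nlp('id') now yields a single token: 'id'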
Finally, you could also try using the language-independent xx / MultiLanguage class instead. It still includes very basic tokenization rules, like splitting on punctuation, but none of the rules specific to the English language.
from spacy.lang.xx import MultiLanguage
nlp = MultiLanguage()
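doc = nlp('id')
# With only the basic, language-independent rules, this stays a single token: ['id']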
I have a list of "known phrases" stored in an XML Document under an element named label. I am trying to figure out how to write a function, that can tokenize a search phrase into all of its label pieces (if available).
For instance, I have a Label for North Korea, and one for ICBM.
If the user types in North Korea ICBM, I would expect to get back two tokens, one for each label, as opposed to North, Korea and ICBM.
In another example if the user types in New York City, I would expect only one token (label) of "New York City".
If no labels are found, it would return the default tokenization of each word.
I tried to start writing this, but am not sure how to do it properly without a while-loop facility, and am pretty new to XQuery in general.
The code below is how I started, but I quickly realized it would not work for scaling out search terms.
Basically, it checks whether the full phrase is in the Label fields. If it is not, it strips words from the back of the search phrase, checking what's left for a label.
let $label-query := cts:element-value-query(fn:QName('', 'label'), $searchTerm, ('case-insensitive', 'whitespace-sensitive'))
let $results := cts:search(fn:collection('typea'), $label-query)
let $test :=
  if (fn:empty($results)) then
    (: no full-phrase label match: fall back to word tokens and try a shorter phrase :)
    let $tokens := fn:tokenize($searchTerm, " ")
    let $tokenCount := fn:count($tokens)
    let $lastWord := $tokens[last()]
    let $firstPhrase := $tokens[position() ne last()]
    let $_ :=
      if (fn:count($firstPhrase) = 1) then
        ()
      else
        (: re-join the remaining tokens into a single phrase before the label lookup :)
        let $label-query2 := cts:element-value-query(fn:QName('', 'label'), fn:string-join($firstPhrase, ' '), ('case-insensitive', 'whitespace-sensitive'))
        let $results2 := cts:search(fn:collection('typea'), $label-query2)
        return
          if (fn:empty($results2)) then
            xdmp:log('second empty')
          else
            xdmp:log($results2)
    let $l := xdmp:log($firstPhrase)
    return $tokens
  else
    let $_ := xdmp:log('full')
    return element result { $results }
return $test
Does anyone have any advice on how I could implement this recursively, or perhaps any alternative strategies? I am essentially trying to say: break this sentence up into all of the phrases that exist in the Label fields of the typea collection; if no labels are found, tokenize by word.
Thanks, I look forward to your guidance.
Update to help clarify my ultimate intention.
Below is the document referring to North Korea.
The goal is to parse the search phrase, and use extra information found in these documents to aid in search.
Meaning, if the person types in DPRK or North Korea, both should search the same way. The search should also include Narrower Labels as an OR condition, and will more likely than not be updated to include other relationships as well (e.g. Kim Jong Un is Notably Associated with North Korea).
So, in short, I would like to reconcile the multi-word search terms using the label field and then, if a match is found, use the information from all labels plus the narrower labels from that document.
Edit 2: I am trying to use cts:highlight to get the phrases. Once I have the phrases, I will do an element lookup to get to the right document, and then get the associated document's data for submission to query building.
The issue now is that the cts:highlight does not always return the full phrase under one <phrase> tag.
let $phrases := cts:highlight(<node>New York City FC</node>, cts:or-query((//label)), <phrase>{ $cts:text }</phrase>)
A possible alternative approach, if you are using MarkLogic 9, is to set up a custom tokenization dictionary. See the custom dictionary API documentation and the Search Developer's Guide for details.
But the gist is, if you add an entry "North Korea" in your tokenization dictionary for a language, you'll get it back as a single token for that language. This will apply everywhere: in content or searches.
That said, it isn't clear from your code what you are ultimately trying to achieve with this. If it is more accuracy with phrase searches, there are better ways to achieve this (enabling fast-phrase for 2-word phrases, or word positions for longer ones).
If this is about search parsing only, you could still use the tokenization dictionary approach, but you probably want to use a special language code for it so it doesn't mess up your actual content, and then use cts:tokenize, e.g. cts:tokenize("North Korea ICBM", "xen") where "xen" is your special language code.
Another approach is to use cts:highlight to apply markup to matches to your phrases in the string and go from there:
cts:highlight(<node>North Korea ICBM</node>,
cts:or-query((//label)),
<phrase>{$cts:text}</phrase>)
That would embed markup around any matching phrase: <node><phrase>North Korea</phrase> ICBM</node>
Some care would have to be taken around overlaps, if you want to force a particular winner, by applying the set you want to win first, and then running a second pass with the others.
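For example (a sketch; $priority-labels and $other-labels are assumed to be your label queries split by precedence):

let $first-pass := cts:highlight(<node>New York City FC</node>,
  cts:or-query($priority-labels),
  <phrase>{$cts:text}</phrase>)
return cts:highlight($first-pass,
  cts:or-query($other-labels),
  <phrase>{$cts:text}</phrase>)

(A real implementation would likely also need to avoid re-matching text already inside <phrase> elements on the second pass.)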
I admit that I haven't searched extensively in the SO database. I tried reading the natural npm package, but it doesn't seem to provide this feature. I would like to know if the requirement below is at all possible.
I have a database that has a list of all cities of a country. I also have ratings of these cities (best place to live, worst place to live, best rated city, worst rated city, etc.). Now, from the user interface, I would like to enable the user to enter free text, and from there I should be able to search my database.
For example: Best place to live in California
or places near California
or places in California
From the above sentence, I want to extract only the nouns (perhaps), as these will be the names of the cities or the country that I can search for.
Then extracting 'best' means I can sort in a particular order, etc.
Any suggestions or directions to look for?
I risk the chance that the question will be marked as 'debatable', but the reason I posted is to get some direction on how to proceed.
[I came across this question whilst looking for some use cases to test a module I'm working on. Obviously the question is a little old, but since my module addresses the question I thought I might as well add some information here for future searchers.]
You should be able to do what you want with a POS chunker. I've recently released one for Node that is modelled on the chunkers provided by the NLTK (Python) and Stanford NLP (Java) libraries (the chunk() and TokensRegex() methods, respectively).
The module processes strings that already contain parts-of-speech, so first you'll need to run your text through a parts-of-speech tagger, such as pos:
var pos = require('pos');
var words = new pos.Lexer().lex('Best place to live in California');
var tags = new pos.Tagger()
.tag(words)
.map(function(tag){return tag[0] + '/' + tag[1];})
.join(' ');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN California/NNP ./.
Now you can use pos-chunker to find all proper nouns:
var chunker = require('pos-chunker');
var places = chunker.chunk(tags, '[{ tag: NNP }]');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN {California/NNP} ./.
Similarly you could extract verbs to understand what people want to do ('live', 'swim', 'eat', etc.):
var verbs = chunker.chunk(tags, '[{ tag: VB }]');
Which would yield:
Best/JJS place/NN to/TO {live/VB} in/IN California/NNP ./.
You can also match words, sequences of words and tags, use lookahead, group sequences together to create chunks (and then match on those), and other such things.
You probably don't have to identify what is a noun. Since you already have a list of city and country names that your system can handle, you just have to check whether the user input contains one of these names.
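That lookup can be a simple longest-match scan over the input. A minimal sketch of the idea (shown in Python just for illustration; KNOWN_PLACES stands in for your city/country table):

KNOWN_PLACES = {'california', 'new york city', 'north korea'}  # from your database

def find_places(query):
    words = query.lower().split()
    found = []
    # Try the longest word sequences first so multi-word names win
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            candidate = ' '.join(words[start:start + length])
            if candidate in KNOWN_PLACES and candidate not in found:
                found.append(candidate)
    return found

# find_places('Best place to live in California') -> ['california']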
Well, firstly you'll need to find a way to identify nouns. There is no core Node module or anything that can do this for you. You need to loop through all the words in the string and compare them against some kind of dictionary database, so you can look up each word and check whether it's a noun.
I found this API, which looks pretty promising. You query the API for a word and it sends you back a blob of data like this:
<?xml version="1.0" encoding="UTF-8"?>
<results>
  <result>
    <term>consistent, uniform</term>
    <definition>the same throughout in structure or composition</definition>
    <partofspeech>adj</partofspeech>
    <example>bituminous coal is often treated as a consistent and homogeneous product</example>
  </result>
</results>
You can see that it includes a partofspeech member which tells you that the word "consistent" is an adjective.
Another (and better) option if you have control over the text being stored is to use some kind of markup language to identify important parts of the string before you save it. Something like BBCode. I even found a BBCode node module that will help you do this.
Then you can save your strings to the database like this:
Best place to live in [city]California[/city] or places near [city]California[/city] or places in [city]California[/city].
or
My name is [first]Alex[/first] [last]Ford[/last].
If you're letting users type whole sentences of text and then trying to figure out which parts of those sentences are data you should use in your app, then you're making things unnecessarily hard on yourself. You should either ask them to input the important pieces of data into their own text boxes, or give the user a formatting language such as the aforementioned BBCode syntax so they can identify the important bits for you. The job of finding out which parts of a string are important is going to be a huge one, I think.