gcloud translate submitting lists - python-3.x

The codelab example for using gcloud translate via python only translates one string:
sample_text = "Hello world!"
target_language_code = "tr"
response = client.translate_text(
contents=[sample_text],
target_language_code=target_language_code,
parent=parent,
)
for translation in response.translations:
print(translation.translated_text)
But since it puts sample_text in a list and iterates over the response, I take it one can submit a longer list. Is this true and can I count on the items in the response corresponding to the order of items in contents? This must be the case but I can't find a clear answer in the docs.

translate_text contents is a Sequence[str] but must be less than 30k (codepoints).
For longer than 30k, use batch_translate_text
APIs Explorer provides an explanation of the request and response types for the translateText method. This allows you to call the underlying REST API method and it generates a 'form' for you in which content is an array of string (as expected).
The TranslateTextResponse describes translations as having the same length as contents.
There's no obvious other way to map entries in contents with translations so these must be in the same order, translations[foo] being the translation of contents[foo].
You can prove this to yourself by:
making the call with multiple known translations
including one word not in the source language (i.e. notknowninenglish in English) to confirm the translation result.

Related

How to find most similar to an array in gensim

I know the most_similar method works when entering a previously added string, but how do you reverse search a numpy array of some word?
modelw2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz',binary=True)
differenceArr = modelw2v["King"] - modelw2v["Queen"]
# This line does not work
modelw2v.most_similar(differenceArr)
The most_similar() method can take vectors as the origin of a search, but you should explicitly specify them as one member of a list provided to the method's positive parameter, so that its logic for handling more simple origins (like a string or list of strings) isn't confused.
Specifically, this should work with your other code:
model23v.most_similar(positive=[differenceArr,])
More generally, you can supply lists of vectors (or word-keys for looking up vectors) to both the positive and negative parameters of this method, and the method will combine them (according to the exact logic you can see in the source code). So for example the prominent word2vec example...
wv('king') - wv('man') + wv('woman') = ?
...can be effected with the most_similar() method without doing your own other vector-arithmetic:
sims = modelw2v.most_similar(positive=['king', 'woman'], negative=['man'])

How to perform Lexer Actions that send an Exception?

I'm new to ANTL4 and I can't seem to figure out how to get lexer actions to perform properly.
I have a code snippet that looks for input text:
SIZE10 : [a-zA-Z]* {getText().length() <= 10}?
I would expect that it does not match any combinations of letters that are over 10 letters long, however what this does is treat a 10+ letter string as two different tokens, instead of just nullifying the whole set of 10+ letters. How can I get this action to nullify the whole set of letters?
In addition, where can I go to see all the different token functions I can use (other than getText())? The documentation about lexer actions is really poor. In general, I'm having a hard time figuring out what resources can give me a definitive list of everything in the language. Even an entry point into the source code for me to read would be good at this point. The documentation is too general/basic for me.
EDIT: I've figured out how to send a RuntimeException, but I don't know where to get the elements needed for a proper RecognitionException.
The predicate in a rule directs the parsing process in a way that allows to match only partial input (like in your case) or essentially switch off a part of the grammar depending on certain conditions. In your case the SIZE10 rule is matched until the predicate returns false. Everything up to this event is then returned as a match for SIZE10. After that lexing continues at the point it ended for the previous token and if that is again a letter it will again match SIZE10 as long as the predicate says it is correct. That's a bit different than what you would expect (e.g. using the predicate as an all or nothing switch).
However, if you instead want to match the full set of letters first and then check if the length is <= 10 you can do this in a listener. You can hook into the exitSIZE10() event and reject the match by throwing a recognition exception.
For the usable functions in your actions see the API documentation for ANTLR. For instance here is the one for Token which shows you other possibilities beside getText(). In your action, consider the context you have. In a lexer rule you deal with a Token, hence getText() etc. work on the token. In a parser rule you have a ParserContext instead, which also has a getText() function but that works differently (collecting all child contexts text into a comma separated list).

issues while serializing to YAML file

I have started using .net API for yaml and it seems to be helpful. However I have few questions and wondering if you can provide some sample/work around for the same.
(1) I have an object consisting 4 strings I would like to serialize its collection (List or String[]). I wrote a helper method to return me the strings in the format I want, however it adds an extra single quote before and after the string. So I am getting
-'{str1: str2, str3: str4}'
-'{str5: str6, str7: str8}'
instead of
-{str1: str2, str3: str4}
-{str5: str6, str7: str8}
Can you suggest any workarounds?
(2) I am trying to insert xaml as a string in a yaml document. My xaml is well formed xml but when I serialize it, it cuts before 3rd last element. Any idea why?
Regarding the first question, if you are serializing an array of strings, then it is normal that each element is quoted because it starts with a '{'. In this case, you should be serializing the list of objects directly instead of converting them to string first.
Regarding the second question, you should add some code to the question to clarify what you are doing.

Checking if values in List is part of String

I have a string like this:
val a = "some random test message"
I have a list like this:
val keys = List("hi","random","test")
Now, I want to check whether the string a contains any values from keys. How can we do this using the in built library functions of Scala ?
( I know the way of splitting a to List and then do a check with keys list and then find the solution. But I'm looking a way of solving it more simply using standard library functions.)
Something like this?
keys.exists(a.contains(_))
Or even more idiomatically
keys.exists(a.contains)
The simple case is to test substring containment (as remarked in rarry's answer), e.g.
keys.exists(a.contains(_))
You didn't say whether you actually want to find whole word matches instead. Since rarry's answer assumed you didn't, here's an alternative that assumes you do.
val a = "some random test message"
val words = a.split(" ")
val keys = Set("hi","random","test") // could be a List (see below)
words.exists(keys contains _)
Bear in mind that the list of keys is only efficient for small lists. With a list, the contains method typically scans the entire list linearly until it finds a match or reaches the end.
For larger numbers of items, a set is not only preferable, but also is a more true representation of the information. Sets are typically optimised via hashcodes etc and therefore need less linear searching - or none at all.

How to get (translatable) strings from specific domain with POEdit

I have been trying for hours finding a way to setup POEdit so that it can grab the text from specific domain only
My gettext function looks like this:
function ri($id, $parameters = array(), $domain = 'default', $locale = null)
A sample call:
echo ri('Text %xyz%', array('%xyz%'=>100), 'myDomain');
I will need to grab only the text with the domain myDomain to translate, or at least I want POEdit to put these texts into domain specific files. Is there a way to do it?
I found several questions that are similar but the answers don't really tell me what to do (I think I'm such a noob it must be explained in plain English for me to understand):
How to set gettext text domain in Poedit?
How to get list of translatable messages
So I finally figured it out after days of searching, I finally found the answer here:
http://sourceforge.net/mailarchive/message.php?msg_id=27691818
xgettext recognizes context in strings, and gives a msgctxt field in the *.pot file, which is recognized by translation software as a
context and is shown as such (check image of Pootle showing context
below)
This can be done in 3 ways:
String in code should be in the format _t('context','string'); and xgettext invocation should be in the form --keyword=_t:1c,2
(this basically explains to xgettext that there are 2 arguments in
the keyword function, 1st one is context, 2nd one is string)
String in code in the format _t('string','context'); and xgettext invocation should be in the form --keyword=_t:1,2c
String in the code should be as _t('context|string') and xgettext invocation should be in the form --keyword=_t:1g
So to answer my own question, I added this to the "sources keywords" tab of Poedit:
ri:1,3c
ri is the function name, 1 is the location of the stringid, 3 is the location of the context/domain
Hope this helps someone else, I hate all these cryptic documents
(This is a repost of my answer to the same thing here.)
Neither GNU gettext tools nor Poedit (which uses them) support this particular misuse of gettext.
In gettext, domain is roughly “a piece of software” — a program, a library, a plugin, a theme. As such, it typically resides in a single directory tree and is alone there — or at the very least, if you have multiple pieces=domains, you have them organized sanely into some subdirectories that you can limit the extraction to.
Mixing and matching domains within a single file as you do is not how gettext was intended to be used, and there’s no reasonable solution to handle it other than using your own helper function, e.g. by wrapping all myDomain texts into __mydomain (which you must define, obviously) and adding that to the list of keywords in Poedit when extracting for myDomain and not adding that to the list of keywords for other domains' files.

Resources