Produce a path to a field in a protobuf in Python

I am working on a function that analyzes data (based on some domain-specific logic) in protobufs. When the function finds an issue, I want to include the path to the offending field, including the indexes for the repeated fields.
For example, given the protobuf below:
proto = ECS(
    service=[
        Service(),
        Service(
            capacity_provider_strategy=[
                ServiceCapacityProviderStrategyItem(base=1),
                ServiceCapacityProviderStrategyItem(base=2),
            ]
        ),
    ]
)
Let's assume that the offending field is field = proto.service[1].capacity_provider_strategy[0].
How would I, given only the field, produce ecs.service[1].capacity_provider_strategy[0] in a general way?
Please note that I am looking for a way to produce the path mentioned above solely based on the supplied field, since the logic of producing the error message is de-coupled from the analyzing logic. I realize that (in the analyzing logic) I could keep track of the indexes of the repeated fields, but this would put more overhead on the analyzing function.
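One possible sketch, assuming the root message (proto here) is still available to the error-reporting code: since a field value does not expose a public parent pointer, you can recover the path with a depth-first search that compares sub-messages by identity. The find_path name is illustrative, and the identity comparison is an assumption that holds for the pure-Python protobuf implementation:

from google.protobuf.descriptor import FieldDescriptor

def find_path(root, target, prefix=""):
    # Depth-first search for `target` (compared by identity) inside `root`.
    # Returns e.g. "service[1].capacity_provider_strategy[0]", or None.
    for field, value in root.ListFields():
        if field.label == FieldDescriptor.LABEL_REPEATED:
            for i, item in enumerate(value):
                if item is target:
                    return f"{prefix}{field.name}[{i}]"
                if field.type == FieldDescriptor.TYPE_MESSAGE:
                    sub = find_path(item, target, f"{prefix}{field.name}[{i}].")
                    if sub is not None:
                        return sub
        elif field.type == FieldDescriptor.TYPE_MESSAGE:
            if value is target:
                return f"{prefix}{field.name}"
            sub = find_path(value, target, f"{prefix}{field.name}.")
            if sub is not None:
                return sub
    return None

# print(find_path(proto, field))  # -> "service[1].capacity_provider_strategy[0]"

The leading ecs. prefix is just the name of the root variable, which the caller would have to prepend themselves.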

Related

gcloud translate submitting lists

The codelab example for using gcloud translate via python only translates one string:
sample_text = "Hello world!"
target_language_code = "tr"

response = client.translate_text(
    contents=[sample_text],
    target_language_code=target_language_code,
    parent=parent,
)

for translation in response.translations:
    print(translation.translated_text)
But since it puts sample_text in a list and iterates over the response, I take it one can submit a longer list. Is this true, and can I count on the items in the response corresponding to the order of items in contents? This must be the case, but I can't find a clear answer in the docs.
translate_text's contents is a Sequence[str], but the total must be less than 30k code points.
For anything longer than 30k, use batch_translate_text.
APIs Explorer provides an explanation of the request and response types for the translateText method. It lets you call the underlying REST API method and generates a 'form' for you in which contents is an array of strings (as expected).
The TranslateTextResponse describes translations as having the same length as contents.
There's no obvious other way to map entries in contents with translations so these must be in the same order, translations[foo] being the translation of contents[foo].
You can prove this to yourself by:
- making the call with multiple known translations, and
- including one word that is not in the source language (e.g. notknowninenglish in English) to confirm the translation result.
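For example, a sketch that reuses the client and parent from the snippet above and zips the request list with the response to keep the pairing explicit:

sample_texts = ["Hello world!", "Good morning", "notknowninenglish"]

response = client.translate_text(
    contents=sample_texts,
    target_language_code="tr",
    parent=parent,
)

# translations[i] is the translation of contents[i]
for source, translation in zip(sample_texts, response.translations):
    print(f"{source} -> {translation.translated_text}")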

Text first data serialization with separate metadata

I'm trying to find a format that will help solve a very particular problem:
- A text-first solution.
- The ability to specify complex objects in a single text line (properties, key/value pairs, lists, complex objects).
- The object metadata structure should be separate from the data.
For example:
Metadata: Prop1:int|Prop2:string|PropList:int[,]
Data: 20|Something|10,20,30
That would mean:
Prop1 = 20
Prop2 = "Something"
PropList = [10,20,30]
Is there any existing serialization format resembling this?
I don't see any existing format that supports the scheme in your example. If you really need this schema (a type section plus a data section), you will need to write your own parser, which is easy to do.
If you don't want to write your own parser, the most suitable mature format is still JSON.
As for specifying complex objects in a single text line: not YAML, not XML, not INI, not TOML.
Common formats are designed to be generic rather than tied to any particular semantics or business domain.
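If you do write your own parser, a minimal sketch for the Metadata/Data example above could look like this (the parse name and the supported type names are only illustrative):

def parse(metadata, data):
    # Converters for the type names used in the metadata line.
    casts = {
        "int": int,
        "string": str,
        "int[,]": lambda s: [int(x) for x in s.split(",")],
    }
    result = {}
    for meta, raw in zip(metadata.split("|"), data.split("|")):
        name, type_name = meta.split(":")
        result[name] = casts[type_name](raw)
    return result

print(parse("Prop1:int|Prop2:string|PropList:int[,]", "20|Something|10,20,30"))
# {'Prop1': 20, 'Prop2': 'Something', 'PropList': [10, 20, 30]}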

MATLAB: selecting items based on the end of their names

I have to extract the onset times for an fMRI experiment. I have a nested output called "ResOut", which contains different matrices. One of these is called "cond", and I need its 4th element [1,2,3,4]. But I only need the onset time when the items in the "pict" matrix (inside the ResOut file) have a name that ends with "*v.JPG".
Here's the part of the code that I wrote (but it's not working):
for i=1:length(ResOut);
if ResOut(i).cond(4)==1 && ResOut(i).pict== endsWith(*"v.JPG")
What's wrong? Can you help me fix it?
Thank you in advance,
Adriano
It's generally helpful to start with unfamiliar functions by reading their documentation to understand what inputs they are expecting. Per the documentation for endsWith, it expects two inputs: the input text and the pattern to match. In your example, you are only passing it one (incorrectly formatted) string input, so it's going to error out.
To fix this, call the function properly. For example:
filepath = ["./Some Path/mazeltov.jpg"; "~/Some Path/myfile.jpg"];
test = endsWith(filepath, 'v.jpg')
Returns:
test =
2×1 logical array
1
0
Or, more specifically to your code snippet:
endsWith(ResOut(i).pict, 'v.JPG')
Note that there is an optional 'IgnoreCase' name-value argument, which you can set to true or false to control whether or not the matching ignores case.
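Putting that back into the original loop might look like this (a sketch, assuming ResOut(i).pict holds a single file name; the onset-time extraction itself depends on fields not shown in the question):

for i = 1:length(ResOut)
    if ResOut(i).cond(4) == 1 && endsWith(ResOut(i).pict, 'v.JPG')
        % extract the onset time for this trial here
    end
end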

Possible to balance unidic vs. unidic-neologd?

With the sentence "場所は多少わかりづらいんですけど、感じのいいところでした。" (i.e. "It is a bit hard to find, but it is a nice place.") using mecab with -d mecab-unidic-neologd the first line of output is:
場所 バショ バショ 場所 名詞-固有名詞-人名-姓
I.e. it says "場所" is a person's surname. Using normal mecab-unidic, it more accurately says that "場所" is just a common noun:
場所 バショ バショ 場所 名詞-普通名詞-一般
My first question is has unidic-neologd replaced all the entries in unidic, or has it simply appended its 3 million proper nouns?
Then, secondly, assuming it is a merger, is it possible to re-weight the entries to prefer plain unidic entries a bit more strongly? I.e. I'd love to get 中居正広のミになる図書館 and SMAP each recognized as a single proper noun, but I also need it to see that 場所 is always going to mean "place" (except when it is followed by a name suffix such as さん or 様, of course).
References: unidic-neologd
Neologd merges with unidic (or ipadic), which is the reason it keeps "unidic" in the name. If an entry has multiple parts of speech, like 場所, which entry to use is chosen by minimizing cost across the sentence using part-of-speech transitions and, for words in the dictionary, the per-token cost.
If you look in the CSV file that contains neologd dictionary entries you'll see two entries for 場所:
場所,4786,4786,4329,名詞,固有名詞,一般,*,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*
場所,4790,4790,4329,名詞,固有名詞,人名,姓,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*
And in lex.csv, the default unidic dictionary:
場所,5145,5145,4193,名詞,普通名詞,一般,*,*,*,バショ,場所,場所,バショ,場所,バショ,混,*,*,*,*
The fourth column is the cost. A lower cost item is more likely to be selected, so in this case you can raise the cost for 場所 as a proper noun, though honestly I would just delete it. You can read more about fiddling with cost here (Japanese).
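If you go the deletion route, something along these lines would drop those rows (a sketch: it assumes the seed entries have been extracted to neolog.csv, as in the awk example below, and the output file name is only illustrative):

grep -v '^場所,.*,固有名詞,' neolog.csv > neolog.nobasho.csv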
If you want to weight all the default unidic entries more strongly, you can modify the neologd CSV file to increase all of its costs. This is one way to create a file like that:
awk -F, 'BEGIN{OFS=FS}{$4 = $4 * 100; print $0}' neolog.csv > neolog.fix.csv
You will have to remove the original csv file before building (see Note 2 below).
In this particular case, I think you should report this as a bug to the Neologd project.
Note 1: As mentioned above, since which entry is selected depends on the sentence as a whole, it's possible to get the non-proper-noun tag even with the default configuration. Example sentence:
お店の場所知っている?
Note 2: The way the neologd dictionary combines with the default unidic dictionary is based on a subtle aspect of the way Mecab dictionary builds work. Specifically, all CSV files in a dictionary build directory are used when creating the system dictionary. Order isn't specified so it's unclear what happens in the case of collisions.
This feature is mentioned in the Mecab documentation here (Japanese).

Logstash - Convert Field Names to Lowercase

I am processing http logs and converting querystring parameters to fields.
kv {
  source      => "uriQuerystring"
  field_split => "&"
  target      => "uriQuerystringKeys"
}
However, because callers are using mixed-case parameters, I end up with numerous duplicates,
e.g. uriQuerystringKeys.apiKey, uriQuerystringKeys.ApiKey, uriQuerystringKeys.APIKey.
What do I need to do in my logstash configuration to convert all these field names to lowercase?
I see there's an open issue for this feature to be implemented in Logstash, but it's incomplete. There's a suggestion for some ruby code to be directly executed, but it looks like this converts all fields (not just ones of a certain prefix).
Here's a prior answer that contains the basic code you would need.
You can see a conditional inside the loop, which you could use to enforce the prefix limitations on the fields.
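As a sketch of what that could look like here, a ruby filter that lowercases only the keys under the kv target above (treat the details as an assumption to adapt; if two keys differ only by case, the last one processed wins):

ruby {
  code => "
    params = event.get('uriQuerystringKeys')
    if params.is_a?(Hash)
      lowered = {}
      params.each { |name, value| lowered[name.downcase] = value }
      event.set('uriQuerystringKeys', lowered)
    end
  "
}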
