According to the Avro schema specification (for Unions): https://avro.apache.org/docs/current/spec.html
Unions
Unions, as mentioned above, are represented using JSON arrays. For example, ["null", "string"] declares a schema which may be either a null or string.
(Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing "null", the "null" is usually listed first, since the default value of such unions is typically null.)
It appears from the standard that, when declaring unions, the first branch must be the type of the default value, with the other data types listed after it.
In our product, we are using Avro encoding with the following Schema:
{
"name": "data",
"type": {
"name": "data",
"type": "record",
"fields": [
{
"name": "data_asset",
"type": ["string", "null"],
"default": null,
"doc": "The serialized JSON describing the entity - can be null for special cases"
}
]
}
}
What we have found is that, while unions have a "must" requirement that the first branch matches the type of the default, no errors are thrown by the schema validator when we reverse the order to ["string", "null"], as shown above.
The question I have is:
Why does the validation pass, even though it is "incorrect" as per the standard?
This is a case where the implementation doesn't match the specification. Some libraries might implement this check and so it's probably best to make sure your schema matches the specification even if the specific library you are using doesn't check it.
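For example, a spec-conformant version of the field above lists "null" first, so the null default matches the first branch of the union:
{
    "name": "data_asset",
    "type": ["null", "string"],
    "default": null,
    "doc": "The serialized JSON describing the entity - can be null for special cases"
}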
I am using the Perspective API (you can check it out at: http://perspectiveapi.com/) for my Discord application. I am sending an analyze request and the API returns this:
{
"attributeScores": {
"TOXICITY": {
"spanScores": [
{
"begin": 0,
"end": 22,
"score": {
"value": 0.9345592,
"type": "PROBABILITY"
}
}
],
"summaryScore": {
"value": 0.9345592,
"type": "PROBABILITY"
}
}
},
"languages": [
"en"
],
"detectedLanguages": [
"en"
]
}
I need to get the "value" in "summaryScore" as an integer. I searched on Google, but I only found examples for reading values from flat or singly-nested JSON files. How can I do that?
Note: Sorry if I am asking something really easy or if my English is poor. English is not my primary language and I am not very experienced with Node.js.
First you must make sure the object you have received is recognized by Node.js as a JSON object; look at this answer for how. Once the object is stored as a JSON object you can do the following:
Reading from nested objects or arrays is as easy as doing this:
object.attributeScores.TOXICITY.summaryScore.value
If you look closer at the object and its structure you can see that the root object (the first {}) contains 3 values: "attributeScores", "languages" and "detectedLanguages".
The field you are looking for exists inside the "summaryScore" object, which exists inside the "TOXICITY" object, and so on. Thus you need to traverse the object structure until you get to the value you need.
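A minimal sketch (assuming the API response is in a variable named response, which may arrive as a JSON string or as an already-parsed object; rounding to a percentage is just one interpretation of "as an integer"):
// Parse the response if it arrived as a JSON string; skip this if it is already an object.
const data = typeof response === "string" ? JSON.parse(response) : response;

// Walk the nested structure down to the summary score.
const score = data.attributeScores.TOXICITY.summaryScore.value; // 0.9345592

// One way to get an integer: scale to a percentage and round.
const toxicityPercent = Math.round(score * 100); // 93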
I have a backend API (that implements ApiController) which I'd like to put behind an APIM API. ApiController allows us to discriminate between two different GET operations based on the query parameters that are passed in. When I attempt to define these endpoints in APIM, I get the following error:
The message suggests an endpoint is defined solely by the path and operation. But that seems to contradict documentation I found here which suggests there's a way to differentiate between operations based on query parameters:
Required parameters across both path and query must have unique names.
(In OpenAPI a parameter name only needs to be unique within a
location, for example path, query, header. However, in API Management
we allow operations to be discriminated by both path and query
parameters (which OpenAPI doesn't support). That's why we require
parameter names to be unique within the entire URL template.)
I have an ApiController that defines two different Get operations, differing only by the query parameters. How do I represent that in my APIM API?
The problem comes from multiple operation objects with the same operationId, which is invalid Swagger. In my case, the title in the Swagger file did not match the name of the selected API; after changing the title attribute to match the destination API, it worked.
Here is a similar SO thread you could refer to.
I got my answer from Azure support, sharing the info here:
APIM endpoints are defined by the path, method, and the name you assign to the operation. To differentiate between two GET endpoints to the same controller, differing only by query parameters, you need to hardcode required query parameters into the path. See the following two images:
In the latter image, the hardcoded query parameter is classified by the UI as a template parameter, but it still behaves like a regular query parameter. Query arguments defined in this way:
Are required
Can appear anywhere in a request's list of query arguments
Are not case-sensitive
Are listed as a "Request Parameter" alongside all other path parameters and query arguments in the developer portal
Edit:
There's a typo in the screenshots. The URLs are case-sensitive, and the casing of "blah" was different in each case. Here's what the Open API specification looks like when the casing is consistent. The overloaded path (with the query parameter hardcoded into the path template) appears in a section called x-ms-paths:
{
"swagger": "2.0",
"info": {
"title": "Echo API",
"version": "1.0"
},
"host": "<hostUrl>",
"basePath": "/echo",
"schemes": ["https"],
"securityDefinitions": {
"apiKeyHeader": {
"type": "apiKey",
"name": "Ocp-Apim-Subscription-Key",
"in": "header"
},
"apiKeyQuery": {
"type": "apiKey",
"name": "subscription-key",
"in": "query"
}
},
"security": [{
"apiKeyHeader": []
}, {
"apiKeyQuery": []
}],
"paths": {
"/Blah": {
"get": {
"operationId": "blah",
"summary": "Blah",
"responses": {}
}
}
},
"tags": [],
"x-ms-paths": {
"/Blah?alpha={alpha}": {
"get": {
"operationId": "blah2",
"summary": "Blah2",
"parameters": [{
"name": "alpha",
"in": "query",
"required": true,
"type": "string"
}],
"responses": {}
}
}
}
}
I have the following problem.
I need to apply a field mapping update to an index. The payload is complex; I have:
{
"type": "abc",
"Party": [{
"Type": "abc",
"Id": "123",
"Name": "manasa",
"Phone": [{
"Type": "Office",
"Number": "12345"
}]
}]
}
And now I want to create a field for the index. The field name is phonenumber, of type Collection(Edm.String),
where the mapping is
{
"sourceFieldName" : "/Party/Phone/Number",
"targetFieldName" : "phonenumber",
"mappingFunction" : { "name" : "jsonArrayToStringCollection" }
}
in the HTTP POST body.
But still, after indexing, I get the phone number result as null. That means the mapping went wrong. If you look at the phone number in the source JSON, it is inside a JSON array, it is itself an array, and the result needs to be stored inside a collection of strings. Is this possible, and how can I achieve it?
If this is not possible, I at least want a field mapping up to the phone array, i.e. /Party/Phone/.
If I index the complete Party array as text, I get an error while running the indexer saying:
"Field 'partydetails' contains a term that is too large to process. The max length for UTF-8 encoded terms is 32766 bytes. The most likely cause of this error is that filtering, sorting, and/or faceting are enabled on this field, which causes the entire field value to be indexed as a single term. Please avoid the use of these options for large fields."
Can someone please help!
If Party had been a JSON object rather than an array, and Phone had been just a string array, for example
{
"type": "abc",
"Party": {
"Type": "abc",
"Id": "123",
"Name": "manasa",
"Phone": [
"12345",
"23463"
]
}
}
Then I could have mapped
{
"sourceFieldName" : "Party/Phonenumber",
"targetFieldName" : "phonenumbers",
"mappingFunction" : { "name" : "jsonArrayToStringCollection" }
}
It would map as a collection of OData type Edm.String.
So, to put this in a better and more straightforward way, either:
Transform your JSON to something flatter (the example that I gave above), or
Use an index into the specific element, if you know it beforehand, as Luis Cabrera said:
"sourceFieldName": "/Party/0/Phone/0/Type"
It is a limitation on the Azure Search side.
Note that Party and Phone are arrays, so the field mapping you mention won't work.
You will need to index into the specific element. For example:
{
"sourceFieldName": "/Party/0/Phone/0/Type",
"targetFieldName": "firstPhoneNumberTypeOfFirstParty"
}
You may want to give that a shot.
Thanks!
Luis Cabrera | Program Manager | Azure Search
With Elasticsearch I have created an index using a custom mapping and a custom set of analyzers; however, I'm not able to run a query search on the _all field.
I'm using these analyzers:
{
"analysis": {
"analyzer": {
"case_insensitive": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
],
"char_filter": "punctuation"
}
},
"char_filter": {
"punctuation": {
"type": "mapping",
"mappings": [
".=>\\u0020",
"-=>\\u0020",
"_=>\\u0020"
]
}
}
}
}
and this mapping:
{
"article": {
"_all": {
"enabled": true,
"store": "yes",
"index_analyzer": "case_insensitive",
"search_analyzer": "case_insensitive"
},
"properties": {
"title": {
"type": "string",
"index": "analyzed"
},
"subtitle": {
"type": "string",
"analyzer": "case_insensitive"
},
"comment": {
"type": "string",
"index": "not_analyzed"
},
"review": {
"type":"string",
"index": "not_analyzed",
"include_in_all":false
}
}
}
}
Then I add a document like this:
{
"title": "This is the story of a wonderful man.",
"subtitle":"A man goes on vacation in the worst place possible.",
"comment": "I like the movie very much, however I did not undertand it.",
"review":"Very well"
}
and I expect that 3 of the 4 fields will be included in _all, in particular title, subtitle and comment.
The analyzer is working as follows (tested using the analyze API in elasticsearch):
"I like the movie very much, however I did not undertand it." -> "i like the movie very much, however i did not undertand it "
"This is the story of a wonderful man." -> "this is the story of a wonderful man "
I expect that, at least when searching on _all with the query "This is the story of a wonderful man.", I should be able to find the document.
What am I doing wrong?
How is elasticsearch populating the _all field?
If the field 'title' is to be added to the _all field, which data is used and how? Is it using the output of the analyzer selected for the 'title' field as input to the analyzer of _all, or is it using the raw data?
How does the data flow into the _all field? For example, is it
input -> analyzer -> title -> index_analyzer -> _all
or
input -> analyzer -> title
input -> index_analyzer -> _all
Thank you in advance...
Your mapping looks ok to me. The only thing I would try is to set one of the fields explicitly to include_in_all=true and then rerun your query.
According to the docs, it may be that, because you are overriding the default value of include_in_all for one of the fields, it has changed the default for all the other fields of the object. See here: _all
Relevant text from the documentation is below:
Inclusion in the _all field can be controlled on a field-by-field basis by using the include_in_all setting, which defaults to true. Setting include_in_all on an object (or on the root object) changes the default for all fields within that object.
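A minimal sketch of that suggestion, using the same legacy mapping syntax as the question (only the title field shown; the rest of the mapping stays as it was):
{
    "article": {
        "_all": {
            "enabled": true
        },
        "properties": {
            "title": {
                "type": "string",
                "include_in_all": true
            }
        }
    }
}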
UPDATE:
I think I know why it's not working. Here is what I did. First, I removed the custom analyzers from the _all field (so it uses the standard analyzer). With this I was able to query and get the results as expected. Results were returned for terms that were in any of the document attributes except review. At least this confirms that the general behaviour of _all is correct. Next, to test the analyzers, I did a query on the subtitle field with the exact text (as it is using the keyword analyzer). This also worked. Then I realised that _all is an aggregated field which is then analyzed.
So the query should include all the text from all the fields to work. But again, how do we know in which order they were aggregated :)
This link _all custom analyser has some information. Relevant bits extracted below (from Shay).
You don't want to set the analyzer for _all to be keyword; _all is an aggregation of all the other fields in the doc, so you would basically treat the whole aggregation of text as a single token.
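For example, once _all is back on the standard analyzer, a match query along these lines (a sketch; the terms are taken from the sample document above) finds the document, whereas with the keyword-style analyzer the entire aggregated text would have to match as one giant token:
{
    "query": {
        "match": {
            "_all": "wonderful man"
        }
    }
}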
I've got data coming in from Logstash that's being analyzed in an overeager manner. Essentially, the field "OS X 10.8" would be broken into "OS", "X", and "10.8". I know I could just change the mapping and re-index for existing data, but how would I change the default analyzer (either in ElasticSearch or LogStash) to avoid this problem in future data?
Concrete Solution: I created a mapping for the type before I sent data to the new cluster for the first time.
Solution from IRC: Create an Index Template
According to this page, analyzers can be specified per-query, per-field or per-index.
At index time, Elasticsearch will look for an analyzer in this order:
The analyzer defined in the field mapping.
An analyzer named default in the index settings.
The standard analyzer.
At query time, there are a few more layers:
The analyzer defined in a full-text query.
The search_analyzer defined in the field mapping.
The analyzer defined in the field mapping.
An analyzer named default_search in the index settings.
An analyzer named default in the index settings.
The standard analyzer.
On the other hand, this page points to an important detail:
An analyzer is registered under a logical name. It can then be referenced from mapping definitions or certain APIs. When none are defined, defaults are used. There is an option to define which analyzers will be used by default when none can be derived.
So the only way to define a custom analyzer as the default is to override one of the pre-defined analyzers, in this case the default analyzer. This means we cannot use an arbitrary name for our analyzer; it must be named default.
Here is a simple example of an index setting:
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"char_filter": {
"charMappings": {
"type": "mapping",
"mappings": [
"\\u200C => "
]
}
},
"filter": {
"persian_stop": {
"type": "stop",
"stopwords_path": "stopwords.txt"
}
},
"analyzer": {
"default": {<--------- analyzer name must be default
"tokenizer": "standard",
"char_filter": [
"charMappings"
],
"filter": [
"lowercase",
"arabic_normalization",
"persian_normalization",
"persian_stop"
]
}
}
}
}
}
As you know, Elasticsearch uses the standard analyzer when no analyzer is specified explicitly. So when setting up templates, you can register your custom analyzer under the name standard, and there you can set your own rules for the analyzer, tokenizer, and token filters.
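A minimal sketch of such a template (legacy _template syntax; the logstash-* pattern and the keyword tokenizer are assumptions chosen so that values like "OS X 10.8" stay a single token, and the analyzer is registered under the name default, following the settings example in the previous answer):
{
    "template": "logstash-*",
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"]
                }
            }
        }
    }
}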
Here are some helpful links that will help you understand better:
http://elasticsearch-users.115913.n3.nabble.com/How-we-can-change-Elasticsearch-default-analyzer-td4040411.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis.html