Json Schema minimum validator but for string - python-3.x

using:
from jsonschema import validate
import jsonschema
I am trying to validate that a string value is at least 4096, with no upper bound, using JSON Schema. I see that regex pattern matching may be an option, but I am unsure how to do this with no upper bound.
json_data = {"value": "4096"}
# what i would like to do
json_schema = {"type": "string", "minimum": 4096}
try:
    validate(json_data, schema=json_schema)
    return True
except jsonschema.ValidationError:
    return False
Really appreciate any input. Please comment if other info is needed. Thank you.

First, you should store your value as an int instead of a string.
Then you validate the schema of your whole JSON object, of which value is a key/property:
from jsonschema import validate
import jsonschema
json_data = {"value": 4096}
json_schema = {
    "type": "object",
    "properties": {
        "value": {
            "type": "number",
            "minimum": 4096
        }
    }
}
try:
    validate(json_data, schema=json_schema)
    print(True)
except jsonschema.ValidationError:
    print(False)
Try the above code with values < 4096 and >= 4096
EDIT: if changing the type of value is not an option, you can still use regex (in my opinion far uglier):
json_schema = {
    "type": "object",
    "properties": {
        "value": {
            "type": "string",
            "pattern": "^(\d{5,}|[5-9]\d{3}|4[1-9]\d\d|409[6-9])$"
        }
    }
}
This assumes your string has no leading zeroes and no separators (no commas or dots as thousands separators). The alternatives are wrapped in a group so that the ^ and $ anchors apply to all of them. We match the following:
\d{5,}: 5 or more digits (>= 10000)
[5-9]\d{3}: 5 to 9 followed by 3 digits (5000-9999)
4[1-9]\d\d: 4100 to 4999
409[6-9]: 4096 to 4099
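A minimal sketch of how the pattern-based schema could be used from Python with jsonschema (the wrapper function and the sample inputs are just for illustration):
import jsonschema
from jsonschema import validate

json_schema = {
    "type": "object",
    "properties": {
        "value": {
            "type": "string",
            "pattern": r"^(\d{5,}|[5-9]\d{3}|4[1-9]\d\d|409[6-9])$"
        }
    }
}

def is_at_least_4096(data):
    # True only if "value" is a plain digit string whose number is >= 4096.
    try:
        validate(data, schema=json_schema)
        return True
    except jsonschema.ValidationError:
        return False

print(is_at_least_4096({"value": "4096"}))  # True
print(is_at_least_4096({"value": "4095"}))  # False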

Related

How to mock Athena query results values with Moto3 for a specific table?

I am using pytest and moto to test some code similar to this:
response = athena_client.start_query_execution(
    QueryString='SELECT * FROM xyz',
    QueryExecutionContext={'Database': myDb},
    ResultConfiguration={'OutputLocation': someLocation},
    WorkGroup=myWG
)
execution_id = response['QueryExecutionId']
if response['QueryExecution']['Status']['State'] == 'SUCCEEDED':
    response = athena_client.get_query_results(
        QueryExecutionId=execution_id
    )
    results = response['ResultSet']['Rows']
    ...etc
In my test, I need the values from results = response['ResultSet']['Rows'] to be controlled by the test. I am using some code like this:
backend = athena_backends[DEFAULT_ACCOUNT_ID]["us-east-1"]
rows = [{"Data": [{"VarCharValue": "xyz"}]}, {"Data": [{"VarCharValue": ...}, etc]}]
column_info = [
    {
        "CatalogName": "string",
        "SchemaName": "string",
        "TableName": "xyz",
        "Name": "string",
        "Label": "string",
        "Type": "string",
        "Precision": 123,
        "Scale": 123,
        "Nullable": "NOT_NULL",
        "CaseSensitive": True,
    }
]
results = QueryResults(rows=rows, column_info=column_info)
backend.query_results[NEEDED_QUERY_EXECUTION_ID] = results
but that is not working, as I guess NEEDED_QUERY_EXECUTION_ID is not known in advance by the test. How can I control it?
UPDATE
Based on a suggestion, I tried to use:
results = QueryResults(rows=rows, column_info=column_info)
d = defaultdict(lambda: results.to_dict())
backend.query_results = d
to force a return of values, but it does not seem to work, because moto's models.AthenaBackend.get_query_results contains this code:
results = (
    self.query_results[exec_id]
    if exec_id in self.query_results
    else QueryResults(rows=[], column_info=[])
)
return results
which will fail, as the if condition won't be satisfied.
Extending the defaultdict solution, you could create a custom dictionary that claims to contain every execution id and always returns the same object:
class QueryDict(dict):
    def __contains__(self, item):
        return True

    def __getitem__(self, item):
        rows = [{"Data": [{"VarCharValue": "xyz"}]}, {"Data": [{"VarCharValue": "..."}]}]
        column_info = [
            {
                "CatalogName": "string",
                "SchemaName": "string",
                "TableName": "xyz",
                "Name": "string",
                "Label": "string",
                "Type": "string",
                "Precision": 123,
                "Scale": 123,
                "Nullable": "NOT_NULL",
                "CaseSensitive": True,
            }
        ]
        return QueryResults(rows=rows, column_info=column_info)

backend = athena_backends[DEFAULT_ACCOUNT_ID]["us-east-1"]
backend.query_results = QueryDict()
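For context, a sketch of how this could be wired into a full test. This assumes moto 4.x (where mock_athena and these import paths are available), that the QueryDict class above is in scope, that moto returns the stubbed rows unchanged, and that the bucket and test names are placeholders:
import boto3
from moto import mock_athena
from moto.athena.models import athena_backends, QueryResults  # QueryResults is used inside QueryDict
from moto.core import DEFAULT_ACCOUNT_ID

@mock_athena
def test_athena_results_are_stubbed():
    # Swap the backend's result store so any execution id returns the canned rows.
    backend = athena_backends[DEFAULT_ACCOUNT_ID]["us-east-1"]
    backend.query_results = QueryDict()

    client = boto3.client("athena", region_name="us-east-1")
    execution_id = client.start_query_execution(
        QueryString="SELECT * FROM xyz",
        ResultConfiguration={"OutputLocation": "s3://some-bucket/"},
    )["QueryExecutionId"]

    rows = client.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    assert rows[0]["Data"][0]["VarCharValue"] == "xyz"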
An alternative solution to using custom dictionaries would be to seed Moto.
Seeding Moto ensures that it will always generate the same 'random' identifiers, which means you always know what the value of NEEDED_QUERY_EXECUTION_ID is going to be.
backend = athena_backends[DEFAULT_ACCOUNT_ID]["us-east-1"]
rows = [{"Data": [{"VarCharValue": "xyz"}]}, {"Data": [{"VarCharValue": "..."}]}]
column_info = [...]
results = QueryResults(rows=rows, column_info=column_info)
backend.query_results["bdd640fb-0667-4ad1-9c80-317fa3b1799d"] = results
import requests
requests.post("http://motoapi.amazonaws.com/moto-api/seed?a=42")
# Test - the execution id will always be the same because we just seeded Moto
execution_id = athena_client.start_query_execution(...)
Documentation on seeding Moto can be found here: http://docs.getmoto.org/en/latest/docs/configuration/recorder/index.html#deterministic-identifiers
(It only talks about seeding Moto in the context of recording/replaying requests, but the functionality can be used on its own.)

Make a group of attributes of which only 1 is required be reflected in the schema in Pydantic

I want to, as the title says, create a group of attributes a, b, and c, such that any combination can be supplied as long as at least one is given. I have managed to achieve the functionality, but it is not reflected in the schema, which is what I can't manage to do.
from pydantic import BaseModel, root_validator

class Foo(BaseModel):
    a: str | None = None
    b: str | None = None
    c: str | None = None

    @root_validator
    def check_at_least_one_given(cls, values):
        if not any((values.get('a'), values.get('b'), values.get('c'))):
            raise ValueError("At least one of a, b, or c must be given")
        return values

# Doesn't have required fields
print(Foo.schema_json(indent=2))
{
  "title": "Foo",
  "type": "object",
  "properties": {
    "a": {
      "title": "A",
      "type": "string"
    },
    "b": {
      "title": "B",
      "type": "string"
    },
    "c": {
      "title": "C",
      "type": "string"
    }
  }
}
# No error
print(Foo(a="1"))
>>> a='1' b=None c=None
print(Foo(b="2"))
>>> a=None b='2' c=None
print(Foo(c="3"))
>>> a=None b=None c='3'
print(Foo(a="1", b="2"))
>>> a='1' b='2' c=None
print(Foo(a="1", c="3"))
>>> a='1' b=None c='3'
print(Foo(b="2", c="3"))
>>> a=None b='2' c='3'
print(Foo(a="1", b="2", c="3"))
>>> a='1' b='2' c='3'
# Invalid
Foo()
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pydantic\main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Foo
__root__
At least one of a, b, or c must be given (type=value_error)
I want the schema to output something like
{
  "title": "Foo",
  "type": "object",
  "properties": {
    "a": {
      "title": "A",
      "type": "string"
    },
    "b": {
      "title": "B",
      "type": "string"
    },
    "c": {
      "title": "C",
      "type": "string"
    }
  },
  "required": [
    ["a", "b", "c"]
  ]
}
or something else that (probably) more clearly expresses the intent of at least one of these is required.
Is this possible and if so how is it done?
As far as I can tell, Pydantic does not have a built-in mechanism for this, and the validation logic you provide in a custom validator will never find its way into the JSON schema. You could search their issue tracker and, if unsuccessful, post a feature request for something like this.
The JSON schema core specification defines the anyOf keyword, which takes subschemas to validate against. This allows specifying the required keyword once for each of your fields in its own subschema. See this answer for details and an example.
In the Pydantic Config you can utilize schema_extra to extend the auto-generated schema. Here is an example of how you can write a corresponding workaround:
from typing import Any

from pydantic import BaseModel, root_validator

class Foo(BaseModel):
    a: str | None = None
    b: str | None = None
    c: str | None = None

    @root_validator
    def check_at_least_one_given(cls, values: dict[str, Any]) -> dict[str, Any]:
        if all(
            (v is None for v in (values.get("a"), values.get("b"), values.get("c")))
        ):
            raise ValueError("Any one of `a`, `b`, or `c` must be given")
        return values

    class Config:
        @staticmethod
        def schema_extra(schema: dict[str, Any]) -> None:
            assert "anyOf" not in schema, "Oops! What now?"
            schema["anyOf"] = [
                {"required": ["a"]},
                {"required": ["b"]},
                {"required": ["c"]},
            ]
            for prop in schema.get("properties", {}).values():
                prop.pop("title", None)

if __name__ == "__main__":
    print(Foo.schema_json(indent=2))
Output:
{
  "title": "Foo",
  "type": "object",
  "properties": {
    "a": {
      "type": "string"
    },
    "b": {
      "type": "string"
    },
    "c": {
      "type": "string"
    }
  },
  "anyOf": [
    {
      "required": [
        "a"
      ]
    },
    {
      "required": [
        "b"
      ]
    },
    {
      "required": [
        "c"
      ]
    }
  ]
}
This conforms to the specs and expresses your custom validation.
But note that I put in that assert to indicate that I have no strong basis to assume that the automatic schema will not provide its own anyOf key at some point, which would greatly complicate things. Consider this an unstable solution.
Side note:
Be careful with the any check in your validator. An empty string is "falsy" just like None, which might lead to unexpected results, depending on whether you want to consider empty strings to be valid values in this context or not. any(v for v in ("", 0, False, None)) is False.
I adjusted your validator in my code to explicitly check against None for this reason.
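A tiny illustration of the difference, using the model above with a hypothetical empty-string input:
# any(("", None, None)) is False, so the truthiness-based check from the question would raise here;
# the explicit `is None` comparison treats "" as a provided value and lets it through.
print(Foo(a=""))  # a='' b=None c=None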

Converting Key=Value text file to JSON

I'm looking for a library to convert a text file to JSON.
Do you know which one has the following behavior?
I already tested some libraries, but without success.
The source files contain a list of key=value pairs, one key per line.
Converting to the correct data types is important; my files have:
string values
number values
boolean values
object (JSON) values
array values (of simple strings or of JSON objects)
Example
name = "test"
version = 3
enabled = true
fruit = {"type":"orange","color":"orange"}
consumers = ["kids", "adults"]
years = [2014, 2015]
fruits = [{"type":"orange","color":"orange"},{"type":"apples","method":"red"}]
Expected result after conversion: valid JSON (style/indentation not needed)
{
  "name": "test",
  "version": 3,
  "enabled": true,
  "fruit": {
    "type": "orange",
    "color": "orange"
  },
  "consumers": [
    "kids",
    "adults"
  ],
  "years": [
    2014,
    2015
  ],
  "fruits": [
    {
      "type": "orange",
      "color": "orange"
    },
    {
      "type": "apples",
      "method": "red"
    }
  ]
}
The format you're using isn't standardized, so I'm doubtful you'll find a package that can parse it out of the box. Your values do look to be valid JSON primitives, so you can leverage JSON.parse to parse the right-hand side. With that, you'd just need a parser to robustly pull out all the raw [key, value] pairs, but most parsers probably try to do more than just that, which might not be what you want.
If you know the input will always be clean and don't need a completely robust parser, it's not difficult to roll this yourself:
const fs = require('fs')

const data = fs.readFileSync('./data.txt', {encoding: 'utf8'}).split('\n').filter(Boolean)
const obj = {}
for (const line of data) {
    const [key, val] = line.split(/\s*=\s*(.+)/)
    obj[key] = JSON.parse(val)
}
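If you are working in Python rather than Node, the same idea works with json.loads; a minimal sketch under the same clean-input assumption (the data.txt filename is a placeholder):
import json

obj = {}
with open("data.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # Split on the first "=" only, then let json.loads pick the right type.
        key, _, val = line.partition("=")
        obj[key.strip()] = json.loads(val.strip())

print(json.dumps(obj))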

Work with decimal values after avro deserialization

I take Avro bytes from Kafka and deserialize them.
But I get strange output because of a decimal value, and I cannot work with it afterwards (for example, turn it into JSON or insert it into a DB):
import avro.schema, json
from avro.io import DatumReader, BinaryDecoder

# only needed part of schemaDict
schemaDict = {
    "name": "ApplicationEvent",
    "type": "record",
    "fields": [
        {
            "name": "desiredCreditLimit",
            "type": [
                "null",
                {
                    "type": "bytes",
                    "logicalType": "decimal",
                    "precision": 14,
                    "scale": 2
                }
            ],
            "default": None
        }
    ]
}

schema_avro = avro.schema.parse(json.dumps(schemaDict))
reader = DatumReader(schema_avro)
decoder = BinaryDecoder(data)  # data - binary data from kafka
event_dict = reader.read(decoder)
print(event_dict)
# {'desiredCreditLimit': Decimal('100000.00')}
print(json.dumps(event_dict))
# TypeError: Object of type Decimal is not JSON serializable
I tried to use avro_json_serializer, but got the error: "AttributeError: 'decimal.Decimal' object has no attribute 'decode'".
And because of this Decimal in the dictionary I cannot insert the values into the DB either.
I also tried the fastavro library, but I could not deserialize the message, as I understand because the serialization was done without fastavro.
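One common way around the TypeError (a general json workaround, not specific to Avro) is to give json.dumps a default converter for Decimal, for example rendering it as a string, or as a float if losing exactness is acceptable. A minimal sketch with a made-up event_dict:
import json
from decimal import Decimal

def decimal_default(obj):
    # Fallback encoder: render Decimal as a string so the exact value is preserved.
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

event_dict = {"desiredCreditLimit": Decimal("100000.00")}
print(json.dumps(event_dict, default=decimal_default))
# {"desiredCreditLimit": "100000.00"}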

How to get variables into a json object?

A really simple noob question. How do I place variables in this json_body variable? The problem is with all the quotation marks. I can't use the .format string method for this as the json_body variable isn't holding a simple string.
from influxdb import InfluxDBClient
json_body = [
    {
        "measurement": "cpu_load_short",
        "tags": {
            "host": "server01",
            "region": "us-west"
        },
        "time": "2009-11-10T23:00:00Z",
        "fields": {
            "value": 0.64
        }
    }
]

client = InfluxDBClient('localhost', 8086, 'root', 'root', 'example')
client.create_database('example')
client.write_points(json_body)
Source: https://github.com/influxdata/influxdb-python#examples
So, for example, how do I get a variable in there, e.g.
"value": 0.64
to:
"value": variable_name?
In the example code, all values are hard-coded.
The right way is to build the dictionary as a Python object, using your variables, then use json.dumps.
For example
import json
x = 42
d = {'answer': x}
print(json.dumps(d))
This prints
{"answer": 42}
As you surmised in your question, you must not use string interpolation because the values you interpolate may contain quotes. Good observation!
EDIT
You mentioned in a comment that you had a larger object. Perhaps you want to do something like this?
def create_json_body_for_value(value):
    return json.dumps([
        {
            "measurement": "cpu_load_short",
            "tags": {
                "host": "server01",
                "region": "us-west"
            },
            "time": "2009-11-10T23:00:00Z",
            "fields": {
                "value": value
            }
        }
    ])
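Note that for the InfluxDB case specifically, write_points accepts the list of dicts directly (as in the question's own snippet), so the json.dumps step may be unnecessary there. A sketch of that variant (the make_point helper and its host parameter are just for illustration):
from influxdb import InfluxDBClient

def make_point(value, host="server01"):
    # Build the point as plain Python data; the client handles serialization itself.
    return {
        "measurement": "cpu_load_short",
        "tags": {"host": host, "region": "us-west"},
        "time": "2009-11-10T23:00:00Z",
        "fields": {"value": value},
    }

client = InfluxDBClient('localhost', 8086, 'root', 'root', 'example')
client.write_points([make_point(0.64)])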
