kafka-python : avro.io.SchemaResolutionException: Can't access branch index 55 for union with 2 branches - python-3.x

I am using kafka-python 2.0.1 for consuming avro data. Following is the code I have tried:
from kafka import KafkaConsumer
import avro.schema
from avro.io import DatumReader, BinaryDecoder
import io

schema_path = "schema.avsc"
schema = avro.schema.parse(open(schema_path).read())
reader = DatumReader(schema)

consumer = KafkaConsumer(
    bootstrap_servers='xxx.xxx.xxx.xxx:9093',
    security_protocol='SASL_SSL',
    sasl_mechanism='GSSAPI',
    auto_offset_reset='latest',
    ssl_check_hostname=False,
    api_version=(1, 0, 0))
consumer.subscribe(['test'])

for message in consumer:
    message_val = message.value
    print(message_val)
    bytes_reader = io.BytesIO(message_val)
    bytes_reader.seek(5)
    decoder = avro.io.BinaryDecoder(bytes_reader)
    record = reader.read(decoder)
    print(record)
I am getting the following error:
avro.io.SchemaResolutionException: Can't access branch index 55 for union with 2 branches
Writer's Schema: [
"null",
"int"
]
Reader's Schema: [
"null",
"int"
]
Can anyone please suggest what the possible cause of this error might be? I already followed this thread to skip the initial 5 bytes:
How to decode/deserialize Avro with Python from Kafka

I got it working. The issue was that the wrong schema was being referenced. Thanks.
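For anyone who hits the same error: the exception usually means the decoder read a value (here 55) where a union index of 0 or 1 was expected, i.e. the bytes and the schema are out of sync. A minimal sketch, assuming the producer used the Confluent wire format (a magic byte of 0 plus a 4-byte big-endian schema id in front of the Avro body), that prints the header before decoding so a mismatch like this is easier to spot:

import io
import struct

import avro.io
import avro.schema
from avro.io import DatumReader

schema = avro.schema.parse(open("schema.avsc").read())  # avro.schema.Parse on some avro versions
reader = DatumReader(schema)

def decode_confluent_avro(raw_bytes):
    buf = io.BytesIO(raw_bytes)
    # Confluent framing: 1 magic byte (0) + 4-byte big-endian schema id
    magic, schema_id = struct.unpack(">bI", buf.read(5))
    print("magic:", magic, "schema id:", schema_id)  # compare the id against the schema registry
    return reader.read(avro.io.BinaryDecoder(buf))

Comparing the printed schema id with what the registry reports for the topic quickly shows whether the schema file being parsed is the one the producer actually used.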

Related

unable to initialize snowflake data source

I am trying to access the Snowflake data source using the "great_expectations" library.
The following is what I tried so far:
from ruamel import yaml
import great_expectations as ge
from great_expectations.core.batch import BatchRequest, RuntimeBatchRequest

context = ge.get_context()

datasource_config = {
    "name": "my_snowflake_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": "snowflake://myusername:mypass@myaccount/myDB/myschema?warehouse=mywh&role=myadmin",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetSqlDataConnector",
            "include_schema_name": True,
        },
    },
}

print(context.test_yaml_config(yaml.dump(datasource_config)))
I initialized great_expectations before executing the above code:
great_expectations init
but I am getting the error below:
great_expectations.exceptions.exceptions.DatasourceInitializationError: Cannot initialize datasource my_snowflake_datasource, error: 'NoneType' object has no attribute 'create_engine'
What am I doing wrong?
Your configuration seems to be OK; it corresponds to the example here.
If you look at the traceback you should notice that the error propagates starting at the file great_expectations/execution_engine/sqlalchemy_execution_engine.py in your virtual environment.
The actual line where the error occurs is:
self.engine = sa.create_engine(connection_string, **kwargs)
And if you search for that sa at the top of that file:
try:
    import sqlalchemy as sa
    make_url = import_make_url()
except ImportError:
    sa = None
So sqlalchemy is not installed; it is not pulled in automatically when you install great_expectations. The thing to do is to install snowflake-sqlalchemy, since you want to use SQLAlchemy's Snowflake plugin (an assumption based on your connection_string):
/your/virtualenv/bin/python -m pip install snowflake-sqlalchemy
After that you should no longer get an error; it looks like test_yaml_config then just waits for the connection to time out.
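A quick sanity check, assuming you run it with the same virtualenv's interpreter, that both packages are importable (a hypothetical one-off script, not part of great_expectations):

import sqlalchemy
print("sqlalchemy", sqlalchemy.__version__)

# snowflake-sqlalchemy provides the snowflake:// dialect; this import fails if it is missing
from snowflake.sqlalchemy import URL
print("snowflake-sqlalchemy OK")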
What worries me greatly is the documented use of a deprecated API of ruamel.yaml.
The function ruamel.yaml.dump is going to be removed in the near future, and you
should use the .dump() method of a ruamel.yaml.YAML() instance.
You should use the following code instead:
import sys
from ruamel.yaml import YAML
import great_expectations as ge

context = ge.get_context()

datasource_config = {
    "name": "my_snowflake_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": "snowflake://myusername:mypass@myaccount/myDB/myschema?warehouse=mywh&role=myadmin",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetSqlDataConnector",
            "include_schema_name": True,
        },
    },
}

yaml = YAML()
yaml.dump(datasource_config, sys.stdout, transform=context.test_yaml_config)
I'll make a PR for great-expectations to update their documentation/use of ruamel.yaml.

Default schema value conversion fails in to_avro() while publishing data to Kafka using databricks spark-avro

I am trying to publish data into a Kafka topic using the Confluent schema registry.
Following is my schema registration:
schemaRegistryClient.register("primitive_type_str_avsc", new Schema.Parser().parse(
  s"""
     |{
     |  "type": "record",
     |  "name": "RecordLevel",
     |  "fields": [
     |    {"name": "id", "type": ["string","null"], "default": null}
     |  ]
     |}
  """.stripMargin
))
The following case class is used to match the schema:
case class myCaseClass (id:Option[String] = None)
Here is my notebook code snippet:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import scala.util.Try
import spark.implicits._

val df1 = Seq(("Welcome")).toDF("a")
  .map(row => myCaseClass(Some(row.getAs("a"))))
val cols = df1.columns

df1.select(struct(cols.map(column): _*).as('struct))
  .select(to_avro('struct, lit("primitive_type_str_avsc"), schemaRegistryAddress).as('value))
  .show()
I am facing the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 77.0 failed 4 times, most recent failure: Lost task 0.3 in stage 77.0 (TID 186, 10.73.122.72, executor 3): org.spark_project.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:191)
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:218)
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:284)
at org.spark_project.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:272)
at org.spark_project.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getIdFromRegistry(CachedSchemaRegistryClient.java:78)
at org.spark_project.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getId(CachedSchemaRegistryClient.java:205)
at org.apache.spark.sql.avro.SchemaRegistryClientProxy.getId(SchemaRegistryClientProxy.java:52)
at org.apache.spark.sql.avro.SchemaRegistryAvroEncoder.encoder(SchemaRegistryUtils.scala:97)
at org.apache.spark.sql.avro.CatalystDataToAvroWithSchemaRegistry.nullSafeEval(CatalystDataToAvroWithSchemaRegistry.scala:57)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:544)
Could you please help in resolving this issue? Thanks in advance.

Python SnowflakeOperator setup snowflake_default

Good day, I cannot find how to do the basic setup for airflow.contrib.operators.snowflake_operator.SnowflakeOperator to connect to Snowflake. snowflake.connector.connect works fine.
When I do it with SnowflakeOperator:
op = snowflake_operator.SnowflakeOperator(sql = "create table test(*****)", task_id = '123')
I get the exception:
airflow.exceptions.AirflowException: The conn_id `snowflake_default` isn't defined
I tried to insert into the backend sqlite db:
INSERT INTO connection(
conn_id, conn_type, host
, schema, login, password
, port, is_encrypted, is_extra_encrypted
) VALUES (*****)
But after that I get an error:
snowflake.connector.errors.ProgrammingError: 251001: None: Account must be specified.
Passing the account kwarg into the SnowflakeOperator constructor does not help. It seems I cannot pass account into the db or into the constructor, but it's required.
Please let me know what data I should insert into the backend local db to be able to connect via SnowflakeOperator.
Go to Admin -> Connections and update the snowflake_default connection. Based on the source code (airflow/contrib/hooks/snowflake_hook.py:53), we need to add extras like this:
{
    "schema": "schema",
    "database": "database",
    "account": "account",
    "warehouse": "warehouse"
}
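If you prefer scripting this instead of using the UI, here is a hedged sketch (field values are placeholders; it assumes a reasonably recent Airflow, where a connection can also be supplied via an AIRFLOW_CONN_<CONN_ID> environment variable holding a connection URI):

import json
from airflow.models import Connection

conn = Connection(
    conn_id="snowflake_default",
    conn_type="snowflake",
    login="my_user",
    password="my_password",
    schema="my_schema",
    extra=json.dumps({
        "account": "my_account",
        "database": "my_database",
        "warehouse": "my_warehouse",
    }),
)
# Print the URI form of the connection, e.g. for
# export AIRFLOW_CONN_SNOWFLAKE_DEFAULT='<printed value>'
print(conn.get_uri())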
With this context:
$ airflow version
2.2.3
$ pip install snowflake-connector-python==2.4.1
$ pip install apache-airflow-providers-snowflake==2.5.0
You have to specify the Snowflake Account and Snowflake Region twice like this:
airflow connections add 'my_snowflake_db' \
--conn-type 'snowflake' \
--conn-login 'my_user' \
--conn-password 'my_password' \
--conn-port 443 \
--conn-schema 'public' \
--conn-host 'my_account_xyz.my_region_abc.snowflakecomputing.com' \
--conn-extra '{ "account": "my_account_xyz", "warehouse": "my_warehouse", "region": "my_region_abc" }'
Otherwise it doesn't work throwing the Python exception:
snowflake.connector.errors.ProgrammingError: 251001: 251001: Account must be specified
I think this might be because the airflow command parameter --conn-host expects a full domain including the subdomain (the my_account_xyz.my_region_abc part), whereas for Snowflake these values are usually specified as query parameters, in a way similar to this template (although I did not check all the combinations of the airflow connections add command and the DAG execution):
"snowflake://{user}:{password}#{account}{region}{cloud}/{database}/{schema}?role={role}&warehouse={warehouse}&timezone={timezone}"
Then a dummy Snowflake DAG like the one below (it just runs SELECT 1;) will find its way to the Snowflake cloud service and work:
import datetime
from datetime import timedelta

from airflow.models import DAG
# https://airflow.apache.org/docs/apache-airflow-providers-snowflake/stable/operators/snowflake.html
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

my_dag = DAG(
    "example_snowflake",
    start_date=datetime.datetime.utcnow(),
    default_args={"snowflake_conn_id": "my_snowflake_db"},
    schedule_interval="0 0 1 * *",
    tags=["example"],
    catchup=False,
    dagrun_timeout=timedelta(minutes=10),
)

sf_task_1 = SnowflakeOperator(
    task_id="sf_task_1",
    dag=my_dag,
    sql="SELECT 1;",
)

Create Stack Instances Parameter Issue

I'm creating a stack instance using the Python boto3 SDK. According to the documentation I should be able to use ParameterOverrides, but I'm getting the following error:
botocore.exceptions.ParamValidationError: Parameter validation failed:
Unknown parameter in input: "ParameterOverrides", must be one of: StackSetName, Accounts, Regions, OperationPreferences, OperationId
Environment:
aws-cli/1.11.172 Python/2.7.14 botocore/1.7.30
Imports used:
import boto3
import botocore
Following is the code:
try:
    stackset_instance_response = stackset_client.create_stack_instances(
        StackSetName=cloudtrail_stackset_name,
        Accounts=[
            account_id
        ],
        Regions=[
            stack_region
        ],
        OperationPreferences={
            'RegionOrder': [
                stack_region
            ],
            'FailureToleranceCount': 0,
            'MaxConcurrentCount': 1
        },
        ParameterOverrides=[
            {
                'ParameterKey': 'CloudtrailBucket',
                'ParameterValue': 'test-bucket'
            },
            {
                'ParameterKey': 'Environment',
                'ParameterValue': 'SANDBOX'
            },
            {
                'ParameterKey': 'IsCloudTrailEnabled',
                'ParameterValue': 'NO'
            }
        ]
    )
    print("Stackset create Response : " + str(stackset_instance_response))
    operation_id = stackset_instance_response['OperationId']
    print(operation_id)
except botocore.exceptions.ClientError as e:
    print("Stackset creation error : " + str(e))
I'm not sure what I'm doing wrong; any help would be greatly appreciated.
Thank you.
1.8.0 is the first version of botocore that has ParameterOverrides defined.
https://github.com/boto/botocore/blob/1.8.0/botocore/data/cloudformation/2010-05-15/service-2.json#L1087-L1090
1.7.30 doesn't have that defined. https://github.com/boto/botocore/blob/1.7.30/botocore/data/cloudformation/2010-05-15/service-2.json
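So the fix is to upgrade. A small defensive sketch (plain Python, nothing boto3-specific beyond the version string) that fails fast when the installed botocore is too old:

import botocore

# create_stack_instances only accepts ParameterOverrides once the bundled
# CloudFormation model includes it, which happened in botocore 1.8.0
required = (1, 8, 0)
current = tuple(int(p) for p in botocore.__version__.split(".")[:3])
if current < required:
    raise RuntimeError(
        "botocore %s is too old for ParameterOverrides; "
        "run: pip install --upgrade boto3 botocore" % botocore.__version__
    )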

Error when running job that queries against Cassandra via Spark SQL through Spark Jobserver

So I'm trying to run a job that simply runs a query against Cassandra using Spark SQL; the job is submitted fine and starts fine. This code works when it is not run through Spark Jobserver (i.e. when simply using spark-submit). Could someone tell me what is wrong with my job code or configuration files that is causing the error below?
{
  "status": "ERROR",
  "ERROR": {
    "errorClass": "java.util.concurrent.ExecutionException",
    "cause": "Failed to open native connection to Cassandra at {127.0.1.1}:9042",
    "stack": [
      "com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:155)",
      "com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:141)",
      "com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:141)",
      "com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)",
      "com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)",
      "com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:73)",
      "com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:101)",
      "com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:112)",
      "com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:243)",
      "org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:22)",
      "org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:19)",
      "com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)",
      "com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)",
      "com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)",
      "com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)",
      "com.google.common.cache.LocalCache.get(LocalCache.java:4000)",
      "com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)",
      "com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)",
      "org.apache.spark.sql.cassandra.CassandraCatalog.lookupRelation(CassandraCatalog.scala:28)",
      "org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(CassandraSQLContext.scala:218)",
      "org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161)",
      "org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161)",
      "scala.Option.getOrElse(Option.scala:120)",
      "org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)",
      "org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.lookupRelation(CassandraSQLContext.scala:218)",
      "org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:174)",
      "org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:186)",
      "org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:181)",
      "org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)",
      "org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)",
      "org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)",
      "org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:187)",
      "org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:208)",
      "scala.collection.Iterator$$anon$11.next(Iterator.scala:328)",
      "scala.collection.Iterator$class.foreach(Iterator.scala:727)",
      "scala.collection.AbstractIterator.foreach(Iterator.scala:1157)",
      "scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)",
      "scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)",
      "scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)",
      "scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)",
      "scala.collection.AbstractIterator.to(Iterator.scala:1157)",
      "scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)",
      "scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)",
      "scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)",
      "scala.collection.AbstractIterator.toArray(Iterator.scala:1157)",
      "org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:238)",
      "org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:193)",
      "org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:178)",
      "org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:181)",
      "org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:171)",
      "org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)",
      "org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)",
      "scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)",
      "scala.collection.immutable.List.foldLeft(List.scala:84)",
      "org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)",
      "org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)",
      "scala.collection.immutable.List.foreach(List.scala:318)",
      "org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)",
      "org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:1082)",
      "org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:1082)",
      "org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1080)",
      "org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)",
      "org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:211)",
      "org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:214)",
      "CassSparkTest$.runJob(CassSparkTest.scala:23)",
      "CassSparkTest$.runJob(CassSparkTest.scala:9)",
      "spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:235)",
      "scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)",
      "scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)",
      "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)",
      "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)",
      "java.lang.Thread.run(Thread.java:745)"
    ],
    "causingClass": "java.io.IOException",
    "message": "java.io.IOException: Failed to open native connection to Cassandra at {127.0.1.1}:9042"
  }
}
Here is the job I am running:
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra.CassandraSQLContext
import org.apache.spark.sql._
import spark.jobserver._
import com.typesafe.config.Config
import com.typesafe.config.ConfigFactory

object CassSparkTest extends SparkJob {
  def main(args: Array[String]) {
    val sc = new SparkContext("spark://192.168.10.11:7077", "test")
    val config = ConfigFactory.parseString("")
    val results = runJob(sc, config)
    println("Results:" + results)
  }

  override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
    SparkJobValid
  }

  override def runJob(sc: SparkContext, config: Config): Any = {
    val sqlC = new CassandraSQLContext(sc)
    val df = sqlC.sql(config.getString("input.sql"))
    df.collect()
  }
}
And here is my configuration file for spark-jobserver:
# Template for a Spark Job Server configuration file
# When deployed these settings are loaded when job server starts
#
# Spark Cluster / Job Server configuration
spark {
  # spark.master will be passed to each job's JobContext
  master = "spark://192.168.10.11:7077"
  # master = "mesos://vm28-hulk-pub:5050"
  # master = "yarn-client"

  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 1

  jobserver {
    port = 2020
    jar-store-rootdir = /tmp/jobserver/jars
    jobdao = spark.jobserver.io.JobFileDAO
    filedao {
      rootdir = /tmp/spark-job-server/filedao/data
    }
  }

  # predefined Spark contexts
  # contexts {
  #   my-low-latency-context {
  #     num-cpu-cores = 1          # Number of cores to allocate. Required.
  #     memory-per-node = 512m     # Executor memory per node, -Xmx style eg 512m, 1G, etc.
  #   }
  #   # define additional contexts here
  # }

  # universal context configuration. These settings can be overridden, see README.md
  context-settings {
    num-cpu-cores = 1          # Number of cores to allocate. Required.
    memory-per-node = 512m     # Executor memory per node, -Xmx style eg 512m, 1G, etc.

    # in case spark distribution should be accessed from HDFS (as opposed to being installed on every mesos slave)
    # spark.executor.uri = "hdfs://namenode:8020/apps/spark/spark.tgz"

    spark-cassandra-connection-host = "127.0.0.1"

    # uris of jars to be loaded into the classpath for this context. Uris is a string list, or a string separated by commas ','
    # dependent-jar-uris = ["file:///some/path/present/in/each/mesos/slave/somepackage.jar"]
    dependent-jar-uris = ["file:///home/vagrant/lib/spark-cassandra-connector-assembly-1.3.0-M2-SNAPSHOT.jar"]

    # If you wish to pass any settings directly to the sparkConf as-is, add them here in passthrough,
    # such as hadoop connection settings that don't use the "spark." prefix
    passthrough {
      #es.nodes = "192.1.1.1"
    }
  }

  # This needs to match SPARK_HOME for cluster SparkContexts to be created successfully
  # home = "/home/spark/spark"
}
# Note that you can use this file to define settings not only for job server,
# but for your Spark jobs as well. Spark job configuration merges with this configuration file as defaults.
@vicg, first you need spark.cassandra.connection.host -- periods, not dashes. Also note in the error how the IP is "127.0.1.1", not the one in the config. You can also pass the IP when you create a context, like:
curl -X POST 'localhost:8090/contexts/my-context?spark.cassandra.connection.host=127.0.0.1'
If the above doesn't work, try the following PR:
https://github.com/spark-jobserver/spark-jobserver/pull/164
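For completeness, a sketch (not from the original answer) of how the corrected setting could look in the jobserver config, using the passthrough block that the template itself documents as being passed straight to the SparkConf; the IP below is a placeholder for your actual Cassandra node:

context-settings {
  num-cpu-cores = 1
  memory-per-node = 512m
  dependent-jar-uris = ["file:///home/vagrant/lib/spark-cassandra-connector-assembly-1.3.0-M2-SNAPSHOT.jar"]
  passthrough {
    # periods, not dashes, so the key reaches the SparkConf that the Cassandra connector reads
    spark.cassandra.connection.host = "10.0.0.5"   # placeholder: use your Cassandra node's address
  }
}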
