Python SnowflakeOperator setup snowflake_default - python-3.x

Good day. I cannot find how to do the basic setup for airflow.contrib.operators.snowflake_operator.SnowflakeOperator to connect to Snowflake. snowflake.connector.connect works fine.
When I do it with SnowflakeOperator:
op = snowflake_operator.SnowflakeOperator(sql = "create table test(*****)", task_id = '123')
I get the exception:
airflow.exceptions.AirflowException: The conn_id `snowflake_default` isn't defined
I tried to insert a row into the backend SQLite DB:
INSERT INTO connection(
conn_id, conn_type, host
, schema, login, password
, port, is_encrypted, is_extra_encrypted
) VALUES (*****)
But after that I get an error:
snowflake.connector.errors.ProgrammingError: 251001: None: Account must be specified.
Passing an account kwarg into the SnowflakeOperator constructor does not help. It seems I cannot pass the account into the DB or into the constructor, yet it is required.
Please let me know what data I should insert into the local backend DB to be able to connect via SnowflakeOperator.

Go to Admin -> Connections and update the snowflake_default connection like this:
Based on the source code (airflow/contrib/hooks/snowflake_hook.py:53), we need to add the following extras:
{
    "schema": "schema",
    "database": "database",
    "account": "account",
    "warehouse": "warehouse"
}
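If you prefer to create the connection from code instead of the Admin UI, here is a minimal sketch (my own addition, not from the original answer) using the standard Airflow ORM; every value below is a placeholder to adjust for your account:
# Minimal sketch: create snowflake_default programmatically via the Airflow ORM.
# All values are placeholders; the extras field carries account/warehouse/database/schema.
from airflow import settings
from airflow.models import Connection
conn = Connection(
    conn_id="snowflake_default",
    conn_type="snowflake",
    host="my_account.snowflakecomputing.com",  # placeholder host
    login="my_user",
    password="my_password",
    schema="my_schema",
    extra='{"account": "my_account", "warehouse": "my_warehouse", "database": "my_database", "schema": "my_schema"}',
)
session = settings.Session()
session.add(conn)
session.commit()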

With this context:
$ airflow version
2.2.3
$ pip install snowflake-connector-python==2.4.1
$ pip install apache-airflow-providers-snowflake==2.5.0
You have to specify the Snowflake Account and Snowflake Region twice like this:
airflow connections add 'my_snowflake_db' \
--conn-type 'snowflake' \
--conn-login 'my_user' \
--conn-password 'my_password' \
--conn-port 443 \
--conn-schema 'public' \
--conn-host 'my_account_xyz.my_region_abc.snowflakecomputing.com' \
--conn-extra '{ "account": "my_account_xyz", "warehouse": "my_warehouse", "region": "my_region_abc" }'
Otherwise it doesn't work, throwing the Python exception:
snowflake.connector.errors.ProgrammingError: 251001: 251001: Account must be specified
I think this might be because the airflow command parameter --conn-host expects a full domain including the subdomain (the my_account_xyz.my_region_abc part), whereas for Snowflake these values are usually specified as query parameters, following a template similar to this (although I did not check all the combinations of the airflow connections add command and the DAG execution):
"snowflake://{user}:{password}@{account}{region}{cloud}/{database}/{schema}?role={role}&warehouse={warehouse}&timezone={timezone}"
A dummy Snowflake DAG like the SELECT 1; below will then find its own way to the Snowflake cloud service and work:
import datetime
from datetime import timedelta
from airflow.models import DAG
# https://airflow.apache.org/docs/apache-airflow-providers-snowflake/stable/operators/snowflake.html
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
my_dag = DAG(
    "example_snowflake",
    start_date=datetime.datetime.utcnow(),
    default_args={"snowflake_conn_id": "my_snowflake_db"},
    schedule_interval="0 0 1 * *",
    tags=["example"],
    catchup=False,
    dagrun_timeout=timedelta(minutes=10),
)

sf_task_1 = SnowflakeOperator(
    task_id="sf_task_1",
    dag=my_dag,
    sql="SELECT 1;",
)

Related

PySpark write data to Ceph returns 400 Bad Request

I have a problem with PySpark configuration when writing data inside a Ceph bucket.
With the following Python code snippet I can read data from the Ceph bucket, but when I try to write inside the bucket, I get the following error:
22/07/22 10:00:58 DEBUG S3ErrorResponseHandler: Failed in parsing the error response :
org.apache.hadoop.shaded.com.ctc.wstx.exc.WstxEOFException: Unexpected EOF in prolog
at [row,col {unknown-source}]: [1,0]
at org.apache.hadoop.shaded.com.ctc.wstx.sr.StreamScanner.throwUnexpectedEOF(StreamScanner.java:701)
at org.apache.hadoop.shaded.com.ctc.wstx.sr.BasicStreamReader.handleEOF(BasicStreamReader.java:2217)
at org.apache.hadoop.shaded.com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2123)
at org.apache.hadoop.shaded.com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1179)
at com.amazonaws.services.s3.internal.S3ErrorResponseHandler.createException(S3ErrorResponseHandler.java:122)
at com.amazonaws.services.s3.internal.S3ErrorResponseHandler.handle(S3ErrorResponseHandler.java:71)
at com.amazonaws.services.s3.internal.S3ErrorResponseHandler.handle(S3ErrorResponseHandler.java:52)
[...]
22/07/22 10:00:58 DEBUG request: Received error response: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: null; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null
22/07/22 10:00:58 DEBUG AwsChunkedEncodingInputStream: AwsChunkedEncodingInputStream reset (will reset the wrapped stream because it is mark-supported).
Pyspark code (not working):
from pyspark.sql import SparkSession
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages com.amazonaws:aws-java-sdk-bundle:1.12.264,org.apache.spark:spark-sql-kafka-0-10_2.13:3.3.0,org.apache.hadoop:hadoop-aws:3.3.3 pyspark-shell"
spark = (
    SparkSession.builder.appName("app")
    .config("spark.hadoop.fs.s3a.access.key", access_key)
    .config("spark.hadoop.fs.s3a.secret.key", secret_key)
    .config("spark.hadoop.fs.s3a.connection.timeout", "10000")
    .config("spark.hadoop.fs.s3a.endpoint", "http://HOST_NAME:88")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.endpoint.region", "default")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("TRACE")
# This works
spark.read.csv("s3a://test-data/data.csv")
# This throws the provided error
df_to_write = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df_to_write.write.csv("s3a://test-data/with_love.csv")
Also, referring to the same Ceph bucket, I am able to read and write data to the bucket via boto3:
import boto3
from botocore.exceptions import ClientError
from botocore.client import Config
config = Config(connect_timeout=20, retries={'max_attempts': 0})
s3_client = boto3.client('s3', config=config,
                         aws_access_key_id=access_key,
                         aws_secret_access_key=secret_key,
                         region_name="defaut",
                         endpoint_url='http://HOST_NAME:88',
                         verify=False
                         )
response = s3_client.list_buckets()
# Read
print('Existing buckets:')
for bucket in response['Buckets']:
    print(f' {bucket["Name"]}')
# Write
dummy_data = b'Dummy string'
s3_client.put_object(Body=dummy_data, Bucket='test-spark', Key='awesome_key')
Also s3cmd with the same configuration is working fine.
I think I'm missing some PySpark (hadoop-aws) configuration; could anyone help me identify the configuration problem? Thanks.
After some research on the web, I was able to solve the problem using this hadoop-aws configuration:
fs.s3a.signing-algorithm: S3SignerType
I configured this property in pySpark with:
spark = (
    SparkSession.builder.appName("app")
    .config("spark.hadoop.fs.s3a.access.key", access_key)
    .config("spark.hadoop.fs.s3a.secret.key", secret_key)
    .config("spark.hadoop.fs.s3a.connection.timeout", "10000")
    .config("spark.hadoop.fs.s3a.endpoint", "http://HOST_NAME:88")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.endpoint.region", "default")
    .config("spark.hadoop.fs.s3a.signing-algorithm", "S3SignerType")
    .getOrCreate()
)
From what I understand, the version of Ceph I am using (16.2.3) does not support the default signing algorithm used by Spark v3.3.0 on Hadoop 3.3.2.
For further details see this documentation.
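As a side note (my own addition, not part of the original answer): if the SparkSession has already been created, the same property can also be set on the active Hadoop configuration, a common pattern that relies on the technically private spark.sparkContext._jsc accessor:
# Hedged sketch: set the signing algorithm on an existing session's Hadoop conf.
# Assumes `spark` is the already-created SparkSession from the snippet above.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.signing-algorithm", "S3SignerType")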

unable to initialize snowflake data source

I am trying to access a Snowflake data source using the "great_expectations" library.
The following is what I tried so far:
from ruamel import yaml
import great_expectations as ge
from great_expectations.core.batch import BatchRequest, RuntimeBatchRequest
context = ge.get_context()
datasource_config = {
    "name": "my_snowflake_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": "snowflake://myusername:mypass@myaccount/myDB/myschema?warehouse=mywh&role=myadmin",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetSqlDataConnector",
            "include_schema_name": True,
        },
    },
}
print(context.test_yaml_config(yaml.dump(datasource_config)))
I initialized great_expectations before executing the above code:
great_expectations init
but I am getting the error below:
great_expectations.exceptions.exceptions.DatasourceInitializationError: Cannot initialize datasource my_snowflake_datasource, error: 'NoneType' object has no attribute 'create_engine'
What am I doing wrong?
Your configuration seems to be OK; it corresponds to the example here.
If you look at the traceback you should notice that the error propagates starting at the file great_expectations/execution_engine/sqlalchemy_execution_engine.py in your virtual environment.
The actual line where the error occurs is:
self.engine = sa.create_engine(connection_string, **kwargs)
And if you search for that sa at the top of that file:
try:
    import sqlalchemy as sa
    make_url = import_make_url()
except ImportError:
    sa = None
So sqlalchemy is not installed, which you don't get automatically in your environment when you install great_expectations. The thing to do is to install snowflake-sqlalchemy, since you want to use SQLAlchemy's Snowflake plugin (an assumption based on your connection_string):
/your/virtualenv/bin/python -m pip install snowflake-sqlalchemy
After that you should no longer get an error; it looks like test_yaml_config then just waits for the connection to time out.
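As a quick sanity check (my own suggestion, not part of the original answer), you can verify that SQLAlchemy and the Snowflake dialect now resolve before re-running test_yaml_config; the connection string is the same placeholder one from the question:
# Sanity check: create_engine only resolves the dialect, it does not connect yet.
import sqlalchemy as sa
engine = sa.create_engine(
    "snowflake://myusername:mypass@myaccount/myDB/myschema"
    "?warehouse=mywh&role=myadmin"
)
print(engine.dialect.name)  # prints "snowflake" once snowflake-sqlalchemy is installed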
What worries me greatly is the documented use of a deprecated API of ruamel.yaml.
The function ruamel.yaml.dump is going to be removed in the near future, and you
should use the .dump() method of a ruamel.yaml.YAML() instance.
You should use the following code instead:
import sys
from ruamel.yaml import YAML
import great_expectations as ge
context = ge.get_context()
datasource_config = {
    "name": "my_snowflake_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": "snowflake://myusername:mypass@myaccount/myDB/myschema?warehouse=mywh&role=myadmin",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetSqlDataConnector",
            "include_schema_name": True,
        },
    },
}
yaml = YAML()
yaml.dump(datasource_config, sys.stdout, transform=context.test_yaml_config)
I'll make a PR for great-expectations to update their documentation and use of ruamel.yaml.

Authentication failure for user in connecting to GCP PostgreSQL Cloud SQL via cloud sql proxy and using sqlalchemy in Python on an Ubuntu machine

I have created a PostgreSQL Cloud SQL instance in GCP and I have created a user and a DB for it. I can connect to it via the cloud_sql_proxy tool:
$ cloud_sql_proxy -instances=project_name:REGION:instance_name=tcp:5432 -credential_file=/path/to/key.json
I then can successfully connect to instance via psql and run queries and insert data etc in command line:
$ psql "host=127.0.0.1 port=5432 sslmode=disable dbname=myDBname user=myUser"
Password:
psql (10.18 (Ubuntu 10.18-0ubuntu0.18.04.1), server 13.3)
WARNING: psql major version 10, server major version 13.
Some psql features might not work.
Type "help" for help.
myDBname=>SELECT * FROM MyTable;
My issue is that when I try to use the sqlalchemy library with the sample code provided in the sqlalchemy example, like this:
import sqlalchemy
import os

db_config = {
    "pool_size": 5,
    "max_overflow": 2,
    "pool_timeout": 30,  # 30 seconds
    "pool_recycle": 1800,  # 30 minutes
}

def init_tcp_connection_engine(db_config):
    db_user = "myUser"
    db_pass = "myPassword"
    db_name = "myDBname"
    db_hostname = "127.0.0.1"
    db_port = 5432
    pool = sqlalchemy.create_engine(
        # Equivalent URL:
        # postgresql+pg8000://<db_user>:<db_pass>@<db_host>:<db_port>/<db_name>
        sqlalchemy.engine.url.URL.create(
            drivername="postgresql+pg8000",
            username=db_user,  # e.g. "my-database-user"
            password=db_pass,  # e.g. "my-database-password"
            host=db_hostname,  # e.g. "127.0.0.1"
            port=db_port,  # e.g. 5432
            database=db_name,  # e.g. "my-database-name"
        ),
        **db_config
    )
    # [END cloud_sql_postgres_sqlalchemy_create_tcp]
    pool.dialect.description_encoding = None
    return pool

def main():
    db = init_tcp_connection_engine(db_config)
    with db.connect() as conn:
        rows = conn.execute("SELECT * FROM MyTable;").fetchall()
        for row in rows:
            print(row)

if __name__ == "__main__":
    main()
I get the error of
Exception has occurred: ProgrammingError (note: full exception trace is shown but execution is paused at: <module>)
(pg8000.dbapi.ProgrammingError) {'S': 'FATAL', 'V': 'FATAL', 'C': '28P01', 'M': 'password authentication failed for user "myUser"', 'F': 'auth.c', 'L': '347', 'R': 'auth_failed'}
(Background on this error at: https://sqlalche.me/e/14/f405)
Any idea what is wrong and how I can resolve this?
I changed the password via the web UI, pasted it into the code, and it worked.
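A small hedged refinement on top of that (my own addition, not from the original answer): keep the freshly reset password out of the source file by reading it from an environment variable inside init_tcp_connection_engine:
# Read the reset password from the environment instead of hardcoding it.
import os
db_pass = os.environ["DB_PASS"]  # e.g. export DB_PASS='password reset in the Cloud Console'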

Elasticsearch pyspark connection in insecure mode

My end goal is to insert data from HDFS into Elasticsearch, but the issue I am facing is connectivity.
I am able to connect to my Elasticsearch node using the below curl command:
curl -u username -X GET 'https://xx.xxx.xx.xxx:9200/_cat/indices?v' --insecure
but when it comes to connecting with Spark I am unable to do so. My command to insert data is:
df.write.mode("append").format('org.elasticsearch.spark.sql') \
    .option("es.net.http.auth.user", "username") \
    .option("es.net.http.auth.pass", "password") \
    .option("es.index.auto.create", "true") \
    .option('es.nodes', 'https://xx.xxx.xx.xxx') \
    .option('es.port', '9200') \
    .save('my-index/my-doctype')
The error I am getting is:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
....
....
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings) - all nodes failed; tried [[xx.xxx.xx.xxx:9200]]
....
...
Here, what would be the PySpark equivalent of curl --insecure?
Thanks
After many attempts and different config options, I found a way to connect to Elasticsearch running on HTTPS insecurely:
dfToEs.write.mode("append").format('org.elasticsearch.spark.sql') \
    .option("es.net.http.auth.user", username) \
    .option("es.net.http.auth.pass", password) \
    .option("es.net.ssl", "true") \
    .option("es.net.ssl.cert.allow.self.signed", "true") \
    .option("mergeSchema", "true") \
    .option('es.index.auto.create', 'true') \
    .option('es.nodes', 'https://{}'.format(es_ip)) \
    .option('es.port', '9200') \
    .option('es.batch.write.retry.wait', '100s') \
    .save('{index}/_doc'.format(index=index))
with SSL enabled:
(es.net.ssl, true)
We also have to allow self-signed certificates like below:
(es.net.ssl.cert.allow.self.signed, true)
I checked a lot of things and finally I can write to the AWS Elasticsearch Service (ES), but with Scala/Spark.
In a VPC, create security groups to allow access from EMR to ES on port 443 (inbound rules in ES for the security group of EMR, and inbound rules in EMR on the same port).
Check connectivity from the EMR master node with a telnet command:
telnet xyz.eu-west-1.es.amazonaws.com 443
Once the above checks out, check the application level with a curl command:
curl 'https://xyz.eu-west-1.es.amazonaws.com:443/domainname/_search?pretty=true&?q=*'
After that, go to the code. In my case I tested with spark-shell, but the server confs were included at startup like this:
spark-shell --jars elasticsearch-spark-20_2.11-7.1.1.jar --conf spark.es.nodes="xyz.eu-west-1.es.amazonaws.com" --conf spark.es.port=443 --conf spark.es.nodes.wan.only=true --conf spark.es.nodes.discovery="false" --conf spark.es.index.auto.create="true" --conf spark.es.resource="domain/doc" --conf spark.es.scheme="https"
Finally the code to write:
import java.util.Date
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._
val dateformat = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
val currentdate = dateformat.format(new Date)
val colorsDF = spark.read.json("multilinecolors.json")
val mcolors = colorsDF.withColumn("Date",lit(currentdate))
mcolors.write.mode("append").format("org.elasticsearch.spark.sql")
  .option("es.net.http.auth.user", "")
  .option("es.net.http.auth.pass", "")
  .option("es.net.ssl", "true")
  .option("es.net.ssl.cert.allow.self.signed", "true")
  .option("mergeSchema", "true")
  .option("es.index.auto.create", "true")
  .option("es.nodes", "https://xyz.eu-west-1.es.amazonaws.com")
  .option("es.port", "443")
  .option("es.batch.write.retry.wait", "100")
  .save("domainname/_doc")
Can you try with the below sparkConfs?
val sparkConf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.es.index.auto.create", "true")
.set("spark.es.nodes", "yourESaddress")
.set("spark.es.port", "9200")
.set("spark.es.net.http.auth.user","")
.set("spark.es.net.http.auth.pass", "")
.set("spark.es.resource", indexName)
.set("spark.es.nodes.wan.only", "true")
If you still face the problem, then set es.net.ssl = true and see.
If you still get the error, try adding the below configs:
'es.resource' = 'ctrl_rater_resumen_lla/hb',
'es.nodes' = 'localhost',
'es.port' = '9200',
'es.index.auto.create' = 'true',
'es.index.read.missing.as.empty' = 'true',
'es.nodes.discovery' = 'true',
'es.net.ssl' = 'false',
'es.nodes.client.only' = 'false',
'es.nodes.wan.only' = 'true',
'es.net.http.auth.user' = 'xxxxx',
'es.net.http.auth.pass' = 'xxxxx',
'es.nodes.discovery' = 'false'

Azure-ML Deployment does NOT see AzureML Environment (wrong version number)

I've followed the documentation pretty well as outlined here.
I've set up my Azure Machine Learning environment the following way:
from azureml.core import Workspace
# Connect to the workspace
ws = Workspace.from_config()
from azureml.core import Environment
from azureml.core import ContainerRegistry
myenv = Environment(name = "myenv")
myenv.inferencing_stack_version = "latest" # This will install the inference specific apt packages.
# Docker
myenv.docker.enabled = True
myenv.docker.base_image_registry.address = "myazureregistry.azurecr.io"
myenv.docker.base_image_registry.username = "myusername"
myenv.docker.base_image_registry.password = "mypassword"
myenv.docker.base_image = "4fb3..."
myenv.docker.arguments = None
# Environment variables (I need Python to look at folders)
myenv.environment_variables = {"PYTHONPATH":"/root"}
# python
myenv.python.user_managed_dependencies = True
myenv.python.interpreter_path = "/opt/miniconda/envs/myenv/bin/python"
from azureml.core.conda_dependencies import CondaDependencies
conda_dep = CondaDependencies()
conda_dep.add_pip_package("azureml-defaults")
myenv.python.conda_dependencies=conda_dep
myenv.register(workspace=ws) # works!
I have a score.py file configured for inference (not relevant to the problem I'm having)...
I then set up the inference configuration:
from azureml.core.model import InferenceConfig
inference_config = InferenceConfig(entry_script="score.py", environment=myenv)
I set up my compute cluster:
from azureml.core.compute import ComputeTarget, AksCompute
from azureml.exceptions import ComputeTargetException
# Choose a name for your cluster
aks_name = "theclustername"
# Check to see if the cluster already exists
try:
    aks_target = ComputeTarget(workspace=ws, name=aks_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    prov_config = AksCompute.provisioning_configuration(vm_size="Standard_NC6_Promo")
    aks_target = ComputeTarget.create(workspace=ws, name=aks_name, provisioning_configuration=prov_config)
    aks_target.wait_for_completion(show_output=True)
from azureml.core.webservice import AksWebservice
# Example
gpu_aks_config = AksWebservice.deploy_configuration(autoscale_enabled=False,
                                                    num_replicas=3,
                                                    cpu_cores=4,
                                                    memory_gb=10)
Everything succeeds; then I try and deploy the model for inference:
from azureml.core.model import Model
model = Model(ws, name="thenameofmymodel")
# Name of the web service that is deployed
aks_service_name = 'tryingtodeply'
# Deploy the model
aks_service = Model.deploy(ws,
                           aks_service_name,
                           models=[model],
                           inference_config=inference_config,
                           deployment_config=gpu_aks_config,
                           deployment_target=aks_target,
                           overwrite=True)
aks_service.wait_for_deployment(show_output=True)
print(aks_service.state)
And it fails saying that it can't find the environment. More specifically, my environment version is version 11, but it keeps trying to find an environment with a version number that is 1 higher (i.e., version 12) than the current environment:
Failed
ERROR - Service deployment polling reached non-successful terminal state, current service state: Failed
Operation ID: 0f03a025-3407-4dc1-9922-a53cc27267d4
More information can be found here:
Error:
{
    "code": "BadRequest",
    "statusCode": 400,
    "message": "The request is invalid",
    "details": [
        {
            "code": "EnvironmentDetailsFetchFailedUserError",
            "message": "Failed to fetch details for Environment with Name: myenv Version: 12."
        }
    ]
}
I have tried to manually edit the environment JSON to match the version that azureml is trying to fetch, but nothing works. Can anyone see anything wrong with this code?
Update
Changing the name of the environment (e.g., my_inference_env) and passing it to InferenceConfig seems to be on the right track. However, the error now changes to the following
Running..........
Failed
ERROR - Service deployment polling reached non-successful terminal state, current service state: Failed
Operation ID: f0dfc13b-6fb6-494b-91a7-de42b9384692
More information can be found here: https://some_long_http_address_that_leads_to_nothing
Error:
{
    "code": "DeploymentFailed",
    "statusCode": 404,
    "message": "Deployment not found"
}
Solution
The answer from Anders below is indeed correct regarding the use of Azure ML environments. However, the last error I was getting was because I was setting the container image using the digest value (a SHA) and NOT the image name and tag (e.g., imagename:tag). Note the line of code in the first block:
myenv.docker.base_image = "4fb3..."
I referenced the digest value, but it should be changed to:
myenv.docker.base_image = "imagename:tag"
Once I made that change, the deployment succeeded! :)
One concept that took me a while to get was the bifurcation of registering and using an Azure ML Environment. If you have already registered your env, myenv, and none of the details of your environment have changed, there is no need to re-register it with myenv.register(). You can simply get the already registered env using Environment.get() like so:
myenv = Environment.get(ws, name='myenv', version=11)
My recommendation would be to name your environment something new: like "model_scoring_env". Register it once, then pass it to the InferenceConfig.
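Here is a minimal sketch of that recommendation (my own illustration, reusing the ws, score.py and imagename:tag placeholders from the question): register the renamed environment once, then fetch it with Environment.get() on later runs instead of re-registering it:
# Minimal sketch: register "model_scoring_env" once, then reuse it on later runs.
from azureml.core import Environment
from azureml.core.model import InferenceConfig
scoring_env = Environment(name="model_scoring_env")
scoring_env.docker.enabled = True
scoring_env.docker.base_image = "imagename:tag"  # name:tag, not the image digest
scoring_env.python.user_managed_dependencies = True
scoring_env.register(workspace=ws)  # register once
# On later runs, fetch the already registered environment instead of re-registering:
scoring_env = Environment.get(ws, name="model_scoring_env")
inference_config = InferenceConfig(entry_script="score.py", environment=scoring_env)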
