org.postgresql.util.PSQLException: SSL error: Received fatal alert: handshake_failure while writing from Azure Databricks to Azure Postgres Citus - azure

I am trying to write pyspark dataframe to Azure Postgres Citus (Hyperscale).
I am using latest Postgres JDBC Driver and I tried writing on Databricks Runtime 7,6,5.
df.write.format("jdbc").option("url","jdbc:postgresql://<HOST>:5432/citus?user=citus&password=<PWD>&sslmode=require" ).option("dbTable", table_name).mode(method).save()
This is what I get after running the above command
org.postgresql.util.PSQLException: SSL error: Received fatal alert: handshake_failure
I have already tried different parameters in the URL and unders the option as well, but no luck so far.
However, I am able to connect to this instance using my local machine and on databricks driver/notebook using psycopg2
Both the Azure Postgres Citus and Databricks are in the same region and Azure Postgres Citus is public.

It worked by overwriting the java security properties for driver and executor
spark.driver.extraJavaOptions -Djava.security.properties=
spark.executor.extraJavaOptions -Djava.security.properties=
Explanation:
What is happening in reality is that the “security” variable of the JVM is reading by default the following file (/databricks/spark/dbconf/java/extra.security) and in this file there are some TLS algorithms that are being disabled by default. That means that if I edit this file and replace the TLS cyphers that work for PostGres citus for an empty string that should also work.
When I set this variable to the executors (spark.executor.extraJavaOptions) it will not change the default variables from the JVM. The same does not happen for the driver which overwrites and so it starts to work.
Note: We need to edit this file before the variable is read and so the init script is the only way of accomplishing that.

Related

Airflow can't reach logs from webserver due to 403 error

I use Apache Airflow for daily ETL jobs. I installed it in Azure Kubernetes Service using the provided Helm chart. It's been running fine for half a year, but since recently I'm unable to access the logs in the webserver (this used to always work fine).
I'm getting the following error:
*** Log file does not exist: /opt/airflow/logs/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log
*** Fetching from: http://airflow-worker-0.airflow-worker.default.svc.cluster.local:8793/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log
*** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!
****** See more at https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#secret-key
****** Failed to fetch log file from worker. Client error '403 FORBIDDEN' for url 'http://airflow-worker-0.airflow-worker.default.svc.cluster.local:8793/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log'
For more information check: https://httpstatuses.com/403
What have I tried:
I've made sure that the log file exists (I can exec into the airflow-worker-0 pod and read the file on command line in the location specified in the error).
I've rolled back my deployment to an earlier commit from when I know for sure it was still working, but it made no difference.
I was using webserverSecretKeySecretName in the values.yaml configuration. I changed the secret to which that name was pointing (deleted it and created a new one, as described here: https://airflow.apache.org/docs/helm-chart/stable/production-guide.html#webserver-secret-key) but it didn't work (no difference, same error).
I changed the config to use a webserverSecretKey instead (in plain text), no difference.
My thoughts/observations:
The error states that the log file doesn't exist, but that's not true. It probably just can't access it.
The time is the same in all pods (I double checked be exec-ing into them and typing date in the command line)
The webserver secret is the same in the worker, the scheduler, and the webserver (I double checked by exec-ing into them and finding the corresponding env variable)
Any ideas?
Turns out this was a known bug with the latest release (2.4.0) of the official Airflow Helm chart, reported here:
https://github.com/apache/airflow/discussions/26490
Should be resolved in version 2.4.1 which should be available in the next couple of days.

Providing AWS_PROFILE when reading S3 files with Spark

I want my Spark app (Scala) to be able to read S3 files
spark.read.parquet("s3://my-bucket-name/my-object-key")
On my dev machine, I could access S3 files using awscli a pre-configured profile in ~/.aws/config or ~/.aws/credentials, like:
aws --profile my-profile s3 ls s3://my-bucket-name/my-object-key
But when trying to read those files from Spark, with the aws_profile provided as an env variable (AWS_PROFILE), I got the following error:
doesBucketExist on my-bucket-name: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint
Also tried to provide the profile as a JVM option (-Daws.profile=my-profile), with no luck.
Thanks for reading.
The solution is to provide the spark property: fs.s3a.aws.credentials.provider, setting it to com.amazonaws.auth.profile.ProfileCredentialsProvider.
If I could change the code to build the Spark Session, then something like:
SparkSession
.builder()
.config("fs.s3a.aws.credentials.provider","com.amazonaws.auth.profile.ProfileCredentialsProvider")
.getOrCreate()
The other way is to provide the JVM option -Dspark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider.*NOTE the prefix spark.hadoop
If problems arise still after setting fs.s3a.aws.credentials.provider to com.amazonaws.auth.profile.ProfileCredentialsProvider and correctly setting AWS_PROFILE, it might be because you're using Hadoop 2 for which the above configuration is not supported.
Therefore, the only workaround I found was to upgrade to Hadoop 3.
Check this post and Hadoop docs for more information.

How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster

I am trying to read data from a secured Kafka cluster using spark structured streaming.
Also I am using the below library to read the data - "spark-sql-kafka-0-10_2.12":"3.0.0-preview" since it has the feature to specify our custom group id (instead of spark setting its own custom group id)
Dependency used in code:
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.12</artifactId>
<version>3.0.0-preview</version>
I am getting the below error - even after specifying the required JAAS configuration in spark options.
Caused by: java.lang.IllegalArgumentException: requirement failed: Delegation token must exist for this connector.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.kafka010.KafkaTokenUtil$.isConnectorUsingCurrentToken(KafkaTokenUtil.scala:299)
at org.apache.spark.sql.kafka010.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:533)
at org.apache.spark.sql.kafka010.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:275)
Following document specifies that we can disable the feature of obtaining delegation token - https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html
I tried setting this property spark.security.credentials.kafka.enabled to false in spark config, but it is still failing with the same error.
Apparently there seems to be a bug on the preview release and has been fixed on the GA Spark 3.x release.
Reference :
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-30495
Now, we can specify our custom consumer group name while fetching the data from Kafka (Even though it's not recommended and we will see a warning message while specifying it).

Access hdfs cluster from pydoop

I have hdfs cluster and python on the same google cloud platform. I want to access the files present in the hdfs cluster from python. I found that using pydoop one can do that but I am struggling with giving it right parameters maybe. Below is the code that I have tried so far:-
import pydoop.hdfs as hdfs
import pydoop
pydoop.hdfs.hdfs(host='url of the file system goes here',
port=9864, user=None, groups=None)
"""
class pydoop.hdfs.hdfs(host='default', port=0, user=None, groups=None)
A handle to an HDFS instance.
Parameters
host (str) – hostname or IP address of the HDFS NameNode. Set to an empty string (and port to 0) to connect to the local file system; set to 'default' (and port to 0) to connect to the default (i.e., the one defined in the Hadoop configuration files) file system.
port (int) – the port on which the NameNode is listening
user (str) – the Hadoop domain user name. Defaults to the current UNIX user. Note that, in MapReduce applications, since tasks are spawned by the JobTracker, the default user will be the one that started the JobTracker itself.
groups (list) – ignored. Included for backwards compatibility.
"""
#print (hdfs.ls("/vs_co2_all_2019_v1.csv"))
It gives this error:-
RuntimeError: Hadoop config not found, try setting HADOOP_CONF_DIR
And if I execute this line of code:-
print (hdfs.ls("/vs_co2_all_2019_v1.csv"))
nothing happens. But this "vs_co2_all_2019_v1.csv" file does exist in the cluster but is not available at the moment, when I took screenshot.
My hdfs screenshot is shown below:
and the credentials that I have are shown below:
Can anybody tell me that what am I doing wrong? Which credentials do I need to put where in the pydoop api? Or maybe there is another simpler way around this problem, any help will be much appreciated!!
Have you tried the following?
import pydoop.hdfs as hdfs
import pydoop
hdfs_object = pydoop.hdfs.hdfs(host='url of the file system goes here',
port=9864, user=None, groups=None)
hdfs_object.list_directory("/vs_co2_all_2019_v1.csv")
or simply:
hdfs_object.list_directory("/")
Keep in mind that pydoop.hdfs module is not directly related with the hdfs class (hdfs_object). Thus, the connection that you established in the first command is not used in hdfs.ls("/vs_co2_all_2019_v1.csv")

Read CSV file from AWS S3

I have an EC2 instance running pyspark and I'm able to connect to it (ssh) and run interactive code within a Jupyter Notebook.
I have a S3 bucket with a csv file that I want to read, when I attempt to read it with:
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')
Which throws a long Python error message and then something related to:
Py4JJavaError: An error occurred while calling o131.csv.
Specify S3 path along with access key and secret key as following:
's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>#my.bucket/folder/input_data.csv'
Access key-related information can be introduced in the typical username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see Technically what is the difference between s3n, s3a and s3?). Putting this together, you get
spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>#bucketname/filename.csv")
As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.

Resources