ISSUE:
Able to successfully download the file using the AWS CLI as well as boto3.
However, while using the S3A connector from Hadoop/Spark, I receive the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o24.parquet.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: BCFFD14CB2939D68, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: MfT8J6ZPlJccgHBXX+tX1fpX47V7dWCP3Dq+W9+IBUfUhsD4Nx+DcyqsbgbKsPn8NZzjc2U
Configuration:
Running this on my Local machine
Spark Version 2.4.4
Hadoop Version 2.7
Jars added:
hadoop-aws-2.7.3.jar
aws-java-sdk-1.7.4.jar
Hadoop Config:
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.session.token", session_key)
hadoop_conf.set("fs.s3a.endpoint", "s3-us-west-2.amazonaws.com") # yes, I am using central eu server.
hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true")
Code to Read the file:
from pyspark import SparkConf, SparkContext, SQLContext
sc = SparkContext.getOrCreate()
hadoop_conf=sc._jsc.hadoopConfiguration()
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet(path)
print(df.head())
Set the AWS credentials provider to the profile credentials provider:
hadoop_conf.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
I faced the same issue as above. It turns out that I needed to add the session token to my configuration. See the documentation.
configuration.set("fs.s3a.session.token", System.getenv("AWS_SESSION_TOKEN"))
I am trying to load data from a remote HDFS file system to my local PySpark session on my local Mac machine:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
path = "/xx/yy/order_info_20220413/partn_date=20220511/part-00085-dd.gz.parquet"
host = "host"
port = 1234
orders = spark.read.parquet(
    f"hdfs://{host}:{port}{path}"
)
Here is the error:
Py4JJavaError: An error occurred while calling o55.parquet.
: org.apache.hadoop.ipc.RpcException: RPC response exceeds maximum data length
at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1936)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1238)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1134)
I tried to understand what "RPC response exceeds maximum data length" means. I did not find anything similar to the code block in core-site.xml shown in https://stackoverflow.com/a/60701948/6693221:
<property>
    <name>fs.default.name</name>
    <value>hdfs://host:port</value>
</property>
However, when I typed telnet host port in my macOS terminal, I was connected. What is the solution?
You should configure your file system before creating the Spark session. You can do that in the core-site.xml file or directly in your session config. Then, to read the Parquet file, you only need to provide the path, since the session is already configured to use the remote HDFS cluster as its file system:
from pyspark.sql import SparkSession
path = "/xx/yy/order_info_20220413/partn_date=20220511/part-00085-dd.gz.parquet"
host = "host"
port = 1234
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.default.name", f"hdfs://{host}:{port}")
    .config("spark.hadoop.fs.defaultFS", f"hdfs://{host}:{port}")
    .getOrCreate()
)
orders = spark.read.parquet(path)
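For the core-site.xml route mentioned above, the equivalent setting would look roughly like the snippet below; host and port are placeholders, exactly as in the question.
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://host:1234</value>
</property>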
I am trying to read a file in Spark on EMR, for which I have temporary credentials provided by a different system (Illumina ICA).
When trying to read the file with spark.read.csv using the S3 URI, it gives me the error:
Py4JJavaError: An error occurred while calling o65.csv.
: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
But when I try the same credentials using a boto3 call, it works, so the credentials (in the environment) are fine.
Here's my test code (from a notebook):
import os
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('s3://stratus-gds-use1/241dd164-decb-48f6-eba1-08d881d902b2/dummy.vcf.gz', sep='\t')
#-=> Py4JJavaError: An error occurred while calling o65.csv.
#    : java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
access_key_id=os.environ['AWS_ACCESS_KEY_ID']
secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY']
region=os.environ['AWS_DEFAULT_REGION']
session_token=os.environ['AWS_SESSION_TOKEN']
bucket_name='stratus-gds-use1'
key_prefix='241dd164-decb-48f6-eba1-08d881d902b2/dummy.vcf.gz'
import boto3
s3_session = boto3.session.Session(aws_access_key_id=access_key_id,
                                   aws_secret_access_key=secret_access_key,
                                   aws_session_token=session_token,
                                   region_name=region)
s3_client = s3_session.client('s3')
%ls -l dummy.vcf.gz
#-=> ls: cannot access dummy.vcf.gz: No such file or directory
r = s3_client.download_file(Filename='dummy.vcf.gz',
                            Bucket=bucket_name,
                            Key=key_prefix)
%ls -l dummy.vcf.gz
#-=> -rw-rw-r-- 1 hadoop hadoop 2535 Apr 6 18:45 dummy.vcf.gz
Any ideas why spark on AWS EMR cannot access the file with the provided S3 URI?
I have tested other S3 URIs like that and they work fine, so the java classes work fine.
I finally figured out the solution. I needed to provide the temporary AWS credentials in the Spark configuration and use the special credentials provider class org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as well as the session_token. Note that the path is then read with the s3a:// scheme (rather than the s3:// scheme from the original attempt), so that the fs.s3a.* settings actually apply.
So this is the procedure to ensure Spark can read an S3 bucket with temporary credentials:
import pyspark
from pyspark.sql import SparkSession
conf = (
    pyspark.SparkConf()
    .set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
    .set('spark.hadoop.fs.s3a.access.key', access_key_id)
    .set('spark.hadoop.fs.s3a.secret.key', secret_access_key)
    .set('spark.hadoop.fs.s3a.session.token', session_token)
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
df = spark.read.csv(f's3a://{BUCKET}/{KEY_PREFIX}', sep='\t')
I am trying to read streaming data coming from Azure Event Hubs into Azure Databricks.
This is the code I've been using:
connectionString = "Connection string"
ehConf = {
    'eventhubs.connectionString' : connectionString
}
df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()
query = df \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
And it's giving me this error:
ERROR: Some streams terminated before this command could finish!
I understand that we have to provide the Azure Event Hubs jar file matching the Databricks runtime and the Spark version.
My Spark version is 2.4.5 and the Databricks runtime is 6.6, and the jar file I used for this combination, as specified, is azure-eventhubs-spark_2.12-2.3.17.jar.
But I am still facing the issue "Some streams terminated before this command could finish!". Can anyone please help me with this?
Thanks
When I started working on this, I first faced the same issue you are experiencing:
ERROR: Some streams terminated before this command could finish!
After making the changes below, it works perfectly with the following configuration:
Databricks Runtime: 6.6 (includes Apache Spark 2.4.5, Scala 2.11)
Azure EventHub library: com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.17
Step 1: Install the library.
You can install "com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.17" using the Install Library option.
Step 2: Change the configuration related to Azure Event Hubs.
If you are using "ehConf = {'eventhubs.connectionString' : connectionString}" with version 2.3.15 or above, you will receive the error message below.
java.lang.IllegalArgumentException: Input byte array has wrong 4-byte ending unit
Note: All configuration relating to Event Hubs happens in your Event Hubs configuration dictionary. The configuration dictionary must contain an Event Hubs connection string:
connectionString = "YOUR.CONNECTION.STRING"
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
For version 2.3.15 and above, the configuration dictionary requires that the connection string be encrypted:
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
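Putting the two steps together, a minimal end-to-end sketch for library versions 2.3.15 and above (assuming the com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.17 library from Step 1 is installed on the cluster, and that sc and spark are the notebook's predefined SparkContext and SparkSession) would be:
connectionString = "YOUR.CONNECTION.STRING"

# For 2.3.15+ the connection string must be encrypted before it is passed
# to the connector.
ehConf = {
    'eventhubs.connectionString':
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
}

df = spark.readStream.format("eventhubs").options(**ehConf).load()

query = df.writeStream.outputMode("append").format("console").start()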
I am trying to read files from S3 using hadoop-aws; the command used to run the code is mentioned below.
Please help me resolve this and understand what I am doing wrong.
# run using command
# time spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.1 connect_s3_using_keys.py
from pyspark import SparkContext, SparkConf
import ConfigParser
import pyspark
# create Spark context with Spark configuration
conf = SparkConf().setAppName("Deepak_1ST_job")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
hadoop_conf = sc._jsc.hadoopConfiguration()
config = ConfigParser.ConfigParser()
config.read("/home/deepak/Desktop/secure/awsCred.cnf")
accessKeyId = config.get("aws_keys", "access_key")
secretAccessKey = config.get("aws_keys", "secret_key")
hadoop_conf.set(
"fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs3a.access.key", accessKeyId)
hadoop_conf.set("s3a.secret.key", secretAccessKey)
sqlContext = pyspark.SQLContext(sc)
df = sqlContext.read.json("s3a://bucket_name/logs/20191117log.json")
df.show()
EDIT 1:
As I am new to PySpark, I am unaware of these dependencies, and the error is not easy to understand.
I am getting the following error:
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 98, in deco
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.json.
: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:816)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:792)
at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:747)
at org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.
I had the same issue with spark 3.0.0 / hadoop 3.2.
What worked for me was to replace the hadoop-aws-3.2.1.jar in spark-3.0.0-bin-hadoop3.2/jars with hadoop-aws-3.2.0.jar found here: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.2.0
Check your Spark Guava jar version. If, like me, you downloaded Spark from Amazon via the link in their documentation (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz), you can see that the included Guava version is guava-14.0.1.jar while their container is using guava-21.0.jar.
I have reported the issue to them and they will repack their Spark distribution to include the correct version. If you are interested in the bug itself, here is the link: https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#ClassNotFoundException:_org.apache.hadoop.fs.s3a.S3AFileSystem
While using Spark 2.3.0 and hadoop-aws 2.7.6, I tried to read from S3:
spark.sparkContext.textFile("s3a://ap-northeast-2-bucket/file-1").take(10)
But an AmazonS3Exception was raised:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 202ABEDF0E955321, AWS Error Code: null, AWS Error Message: Bad Request
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
...
I launched the EC2 instance with an instance profile, so the AWS SDK uses instance profile credentials, and in the console I can use the AWS CLI successfully:
aws s3 ls ap-northeast-2-bucket
aws s3 cp s3://ap-northeast-2-bucket/file-a file-a
I did set fs.s3a.endpoint to s3.ap-northeast-2.amazonaws.com in spark-defaults.conf:
# spark-defaults.conf
spark.hadoop.fs.s3a.endpoint s3.ap-northeast-2.amazonaws.com
This was caused by a combination of several factors.
I was using Spark 2.3.0 with Hadoop 2.7, so I was using hadoop-aws 2.7.6, which by dependency pulls in aws-java-sdk version 1.7.4.
My bucket is located in Seoul (ap-northeast-2), and the Seoul and Frankfurt regions only support the V4 signing mechanism, so I had to set the endpoint for the AWS SDK to use V4 properly. This can be fixed by setting the Hadoop configuration:
spark.hadoop.fs.s3a.endpoint s3.ap-northeast-2.amazonaws.com
Also, aws-java-sdk releases from before June 2016 use the V2 signing mechanism by default, so I had to explicitly configure the SDK to use V4. This can be fixed by setting a Java system property:
import com.amazonaws.SDKGlobalConfiguration
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")
If both fixes are not applied, the Bad Request error occurs.
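Since the setup above already goes through spark-defaults.conf, one way to apply both fixes without touching application code is sketched below: it passes the same com.amazonaws.services.s3.enableV4 flag that SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY refers to as a JVM system property for both the driver and the executors. Treat this as a sketch rather than a verified configuration.
# spark-defaults.conf (sketch)
spark.hadoop.fs.s3a.endpoint     s3.ap-northeast-2.amazonaws.com
spark.driver.extraJavaOptions    -Dcom.amazonaws.services.s3.enableV4=true
spark.executor.extraJavaOptions  -Dcom.amazonaws.services.s3.enableV4=true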