How to rename GCS files in Spark running on Dataproc Serverless? - apache-spark

After writing a Spark DataFrame to a file, I am attempting to rename the file using code like the following:
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val file = fs.globStatus(new Path(path + "/part*"))(0).getPath().getName()
fs.rename(new Path(path + "/" + file), new Path(path + "/" + fileName))
This works fine when running Spark locally. However, when I run my jar on Dataproc Serverless, I get an error like the one below:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong bucket: prj-***, in path: gs://prj-*****/part*, expected bucket: dataproc-temp-***
It seems the files may not be saved to the target bucket until the end of the job, so I am struggling to rename them. I have tried setting .option("mapreduce.fileoutputcommitter.algorithm.version", "2"), as I read something about it that looked promising.
Update:
Still no luck. It seems that spark.sparkContext.hadoopConfiguration expects the base bucket to be a dataproc-temp-* bucket. Full stack trace below:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong bucket: prj-**, in path: gs://p**, expected bucket: dataproc-temp-u***
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.checkPath(GoogleHadoopFileSystem.java:95)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:667)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.makeQualified(GoogleHadoopFileSystemBase.java:394)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:149)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.globStatus(GoogleHadoopFileSystemBase.java:1085)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.globStatus(GoogleHadoopFileSystemBase.java:1059)

The HCFS instance returned by the FileSystem.get(...) call is tied to a specific file system (in this case a GCS bucket). By default, Dataproc Serverless Spark is configured to use a gs://dataproc-temp-*/ bucket as the default HCFS via the spark.hadoop.fs.defaultFS Spark property.
To solve this issue you need to create the HCFS instance using the FileSystem#get(URI uri, Configuration conf) call instead:
val fs = FileSystem.get(path.toUri, spark.sparkContext.hadoopConfiguration)
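Assuming path and fileName are plain String values, as they appear to be in the question, a minimal sketch of the full rename with a bucket-scoped FileSystem could look like this (only the FileSystem.get call changes; the glob and rename are the same as in the question):

import org.apache.hadoop.fs.{FileSystem, Path}

// Bind the FileSystem to the bucket of the output path instead of the
// default dataproc-temp-* bucket that fs.defaultFS points to.
val outputDir = new Path(path) // the gs://... output directory from the question
val fs = FileSystem.get(outputDir.toUri, spark.sparkContext.hadoopConfiguration)

// Locate the single part file produced by the write and rename it.
val file = fs.globStatus(new Path(path + "/part*"))(0).getPath.getName
fs.rename(new Path(path + "/" + file), new Path(path + "/" + fileName))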

Related

Error writing data to Bigquery using Databricks Pyspark

I run a daily job to write data to BigQuery using Databricks Pyspark. There was a recent update of configuration for Databricks (https://docs.databricks.com/data/data-sources/google/bigquery.html) which caused the job to fail. I followed all the steps in the docs. Reading data works again but writing throws the following error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS not found
I also tried adding the configuration directly in the code (as advised for similar errors in Spark), but it did not help:
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "<path-to-key.json>")
My code is:
upload_table_dataset = 'testing_dataset'
upload_table_name = 'testing_table'
upload_table = upload_table_dataset + '.' + upload_table_name
(import_df.write.format('bigquery')
.mode('overwrite')
.option('project', 'xxxxx-test-project')
.option('parentProject', 'xxxxx-test-project')
.option('temporaryGcsBucket', 'xxxxx-testing-bucket')
.option('table', upload_table)
.save()
)
You need to install the GCS connector on your cluster first.
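One way to do that, as a rough sketch, is to attach the connector to the cluster as a Maven library (Databricks lets you install cluster libraries by Maven coordinate); the coordinate and version below are illustrative, not taken from the question, and should be matched to your cluster runtime:

# Illustrative only: Maven coordinate of the GCS connector to install as a cluster library
com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.5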

Writing data to timestreamDb from AWS Glue

I'm trying to use Glue streaming to write data to AWS Timestream, but I'm having a hard time configuring the JDBC connection.
The steps I'm following are below, along with the documentation link: https://docs.aws.amazon.com/timestream/latest/developerguide/JDBC.configuring.html
I'm uploading the jar to S3. There are multiple jars there and I tried each one of them: https://github.com/awslabs/amazon-timestream-driver-jdbc/releases
In the Glue job I'm pointing the jar lib path to the above S3 location.
In the job script I'm trying to read from Timestream using both Spark and Glue with the code below, but it's not working. Can someone explain what I'm doing wrong here?
This is my code:
url = "jdbc:timestream://AccessKeyId=<myAccessKeyId>;SecretAccessKey=<mySecretAccessKey>;SessionToken=<mySessionToken>;Region=us-east-1"
source_df = sparkSession.read.format("jdbc").option("url",url).option("dbtable","IoT").option("driver","software.amazon.timestream.jdbc.TimestreamDriver").load()
datasink1 = glueContext.write_dynamic_frame.from_options(frame = applymapping0, connection_type = "jdbc", connection_options = {"url": url, "driver": "software.amazon.timestream.jdbc.TimestreamDriver", "database": "CovidTestDb", "dbtable": "CovidTestTable"}, transformation_ctx = "datasink1")
As of April 2022 there is no support for write operations in Timestream's JDBC driver (I reviewed the code and saw a bunch of "no write support" exceptions). It is possible to read data from Timestream using Glue, though. The following steps worked for me:
Upload the timestream-query and timestream-jdbc jars to an S3 bucket that you can reference in your Glue script.
Ensure that the IAM role for the script has read access to the Timestream database and table.
You don't need the access key and secret parameters in the JDBC URL; something like jdbc:timestream://Region=<timestream-db-region> should be enough.
Specify the driver and fetchsize options: option("driver", "software.amazon.timestream.jdbc.TimestreamDriver") and option("fetchsize", "100") (tweak the fetchsize according to your needs).
The following is a complete example of reading a DataFrame from Timestream:
val df = sparkSession.read.format("jdbc")
.option("url", "jdbc:timestream://Region=us-east-1")
.option("driver","software.amazon.timestream.jdbc.TimestreamDriver")
// optionally add a query to narrow the data to fetch
.option("query", "select * from db.tbl where time between ago(15m) and now()")
.option("fetchsize", "100")
.load()
df.write.format("console").save()
Hope this helps

Moving data from Kinesis -> RDS using Spark with AWS Glue implementation locally

I have a Spark project with AWS Glue implementation running locally.
I listen to a Kinesis stream, so when data arrives in JSON format I can store it to S3 correctly.
I want to store the data in AWS RDS instead of S3.
I have tried to use:
dataFrame.write
.format("jdbc")
.option("url","jdbc:mysql://aurora.cluster.region.rds.amazonaws.com:3306/database")
.option("user","user")
.option("password","password")
.option("dbtable","test-table")
.option("driver","com.mysql.jdbc.Driver")
.save()
The Spark project gets data from a Kinesis stream using an AWS Glue job.
I want to add the data to an Aurora database.
It fails with the following error:
Caused by: java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '-glue-table (`label2` TEXT , `customerid` TEXT , `sales` TEXT , `name` TEXT )' at line 1
This is the test DataFrame I'm using (dataFrame.show()):
+------+----------+-----+--------------------+
|label2|customerid|sales| name|
+------+----------+-----+--------------------+
| test6| test| test|streamingtesttest...|
+------+----------+-----+--------------------+
Use a Glue DynamicFrame instead of a Spark DataFrame and use the glueContext sink to publish to Aurora.
So the final code could be:
lazy val mysqlJsonOption = jsonOptions(MYSQL_AURORA_URI)
//Write to Aurora
val dynamicFrame = DynamicFrame(joined, glueContext)
glueContext.getSink("mysql", mysqlJsonOption).writeDynamicFrame(dynamicFrame)
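For reference, a minimal sketch of what the jsonOptions helper used above might contain is shown below. This is an assumption built on the Glue Scala API class com.amazonaws.services.glue.util.JsonOptions; the user, password, and dbtable values are placeholders, not values from the question:

import com.amazonaws.services.glue.util.JsonOptions

// Hypothetical helper: connection options for the Aurora MySQL sink.
// All values other than the uri parameter are placeholders.
def jsonOptions(uri: String): JsonOptions = JsonOptions(
  s"""{"url": "$uri", "user": "user", "password": "password", "dbtable": "test_table"}"""
)

Note that the placeholder table name avoids the hyphen, which appears to be what triggered the SQL syntax error in the question.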

Running Custom Java Class in PySpark on EMR

I am attempting to use the Cerner Bunsen package for FHIR processing in PySpark on AWS EMR, specifically the Bundles class and its methods. I am creating the Spark session using the Apache Livy API:
import json
import logging

import requests

def create_spark_session(master_dns, kind, jars):
    # 8998 is the port on which the Livy server runs
    host = 'http://' + master_dns + ':8998'
    data = {'kind': kind, 'jars': jars}
    headers = {'Content-Type': 'application/json'}
    response = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
    logging.info(response.json())
    return response.headers
Where kind = pyspark3 and jars is an S3 location that houses the jar (bunsen-shaded-1.4.7.jar)
The data transformation is attempting to import the jar and call the methods via:
# Setting the Spark Session and Pulling the Existing SparkContext
sc = SparkContext.getOrCreate()
# Cerner Bunsen
from py4j.java_gateway import java_import, JavaGateway
java_import(sc._gateway.jvm,"com.cerner.bunsen.Bundles")
func = sc._gateway.jvm.Bundles()
The error I am receiving is:
py4j.protocol.Py4JError: An error occurred while calling None.com.cerner.bunsen.Bundles. Trace:
py4j.Py4JException: Constructor com.cerner.bunsen.Bundles([]) does not exist
This is the first time I have attempted to use java_import so any help would be appreciated.
EDIT: I changed up the transformation script slightly and am now seeing a different error. I can see the jar being added in the logs so I am certain it is there and that the jars: jars functionality is working as intended. The new transformation is:
# Setting the Spark Session and Pulling the Existing SparkContext
sc = SparkContext.getOrCreate()
# Manage logging
#sc.setLogLevel("INFO")
# Cerner Bunsen
from py4j.java_gateway import java_import, JavaGateway
java_import(sc._gateway.jvm,"com.cerner.bunsen")
func_main = sc._gateway.jvm.Bundles
func_deep = sc._gateway.jvm.Bundles.BundleContainer
fhir_data_frame = func_deep.loadFromDirectory(spark,"s3://<bucket>/source_database/Patient",1)
fhir_data_frame_fromJson = func_deep.fromJson(fhir_data_frame)
fhir_data_frame_clean = func_main.extract_entry(spark,fhir_data_frame_fromJson,'patient')
fhir_data_frame_clean.show(20, False)
and the new error is:
'JavaPackage' object is not callable
Searching for this error has been a bit futile, but again, if anyone has ideas I will gladly take them.
If you want to use a Scala/Java function in PySpark, you also have to add the jar package to the classpath. You can do it in two different ways:
Option 1:
In spark-submit, with the --jars flag (the flag must come before the application file):
spark-submit --jars /path/to/bunsen-shaded-1.4.7.jar example.py
Option 2: Add it to the spark-defaults.conf file as a property.
Add the following in path/to/spark/conf/spark-defaults.conf:
# Comma-separated list of jars include on the driver and executor classpaths.
spark.jars /path/to/bunsen-shaded-1.4.7.jar

NativeAzureFileSystem not recognizing other containers

My objective is to access, from the spark-shell of an HDInsight instance, blobs that are located in a container inside the storage account over which the cluster was created.
These are the steps I took:
Created an HDInsight cluster over the container https://mystorage.blob.core.windows.net:443/maincontainer.
Created another container on the same storage account: https://mystorage.blob.core.windows.net:443/extracontainer.
Created a file named person.json inside the extracontainer: https://mystorage.blob.core.windows.net:443/extracontainer/data/person.json
Opened a spark-shell session
Then I executed the following code:
scala> import org.apache.hadoop.fs._
scala> val conf = sc.hadoopConfiguration
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml
scala> val fs: FileSystem = FileSystem.newInstance(conf)
fs: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.azure.NativeAzureFileSystem@417e5282
scala> val files = fs.listFiles(new Path("wasbs://extracontainer@mystorage.blob.core.windows.net/data"), true)
java.io.FileNotFoundException: File wasbs://extracontainer@mystorage.blob.core.windows.net/data does not exist.
Then I created the same folder and file on the maincontainer:
https://mystorage.blob.core.windows.net:443/maincontainer/data/person.json and I got the following result:
scala> val files = fs.listFiles(new Path("wasbs://extracontainer@mystorage.blob.core.windows.net/data"), true)
scala> while( files.hasNext() ) { println(files.next().getPath) }
wasb://maincontainer@mystorage.blob.core.windows.net/data/person.json
It shows me the file in the maincontainer and not the one in the extracontainer.
Does anyone know what's happening?
I also tried creating the FileSystem object using new Configuration() and I got the same behavior.
The correct behavior is obtained when using the hadoop fs command line:
> hadoop fs -ls wasbs://extracontainer@mystorage.blob.core.windows.net/data/
Found 1 item
-rwxrwxrwx 1 977 2017-02-27 08:46 wasbs://extracontainer@mystorage.blob.core.windows.net/data/person.json
Based on your description, my understanding is that you want to read data from Azure Blob Storage with Spark, but the fs.defaultFS setting in the Hadoop Configuration was set to your maincontainer when the HDInsight instance was created.
There are two ways to implement what you need.
Use the Configuration methods addResource(new Path("wasbs://extracontainer@mystorage.blob.core.windows.net/data")) or set("fs.defaultFS", "wasbs://extracontainer@mystorage.blob.core.windows.net/data") to override the fs.defaultFS value and switch the container being referenced. This only works if the fs.defaultFS property in core-site.xml is not marked <final>true</final>; if it is, you first need to go to /etc/hadoop/conf and change that.
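A minimal spark-shell sketch of this first approach, assuming fs.defaultFS is not marked final and reusing the container and account names from the question, could look like:

import org.apache.hadoop.fs.{FileSystem, Path}

// Point the default file system at the extra container, then create a fresh
// FileSystem instance from the updated configuration.
val conf = sc.hadoopConfiguration
conf.set("fs.defaultFS", "wasbs://extracontainer@mystorage.blob.core.windows.net")
val fs = FileSystem.newInstance(conf)

val files = fs.listFiles(new Path("wasbs://extracontainer@mystorage.blob.core.windows.net/data"), true)
while (files.hasNext()) { println(files.next().getPath) }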
Alternatively, refer to the similar SO thread Reading data from Azure Blob with Spark; you can try to use the code below to read the data.
conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
conf.set("fs.azure.account.key.<youraccount>.blob.core.windows.net", "<yourkey>")
Hope it helps.
