Pretty simple objective: load my custom/local jars from S3 into a Zeppelin notebook (using Zeppelin on AWS EMR).
Location of the Jar
s3://my-config-bucket/process_dataloader.jar
Following the Zeppelin documentation, I opened the interpreter settings and added spark.jars as a property name, with s3://my-config-bucket/process_dataloader.jar as its value.
I restarted the interpreter and then, in the notebook, tried to import the class using the following:
import com.org.dataloader.DataLoader
but it throws the following
<console>:23: error: object org is not a member of package com
import com.org.dataloader.DataLoader
Any suggestions for solving this problem?
A bit late, though, but for anyone else who might need this in the future, try the option below.
"https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar" is basically your S3 object URL.
%spark.dep
z.reset()
z.load("https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar")
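Note that the %spark.dep paragraph has to run before the Spark interpreter starts (i.e. right after an interpreter restart, before any %spark paragraph). Once it has run, the class from the question should be importable in a regular %spark paragraph, for example:
%spark
import com.org.dataloader.DataLoader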
How to use external jar lib function and method in spark application (jupyter notebook)
For example, I was trying to use ua-parser as an external library to parse user agent strings.
So in my Jupyter notebook I added that jar file to the Spark config using:
from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName(sparkAppName).master(sparkMaster)
         .config("spark.jars", "my_hdfs_jar_path/ua_parser1.3.0.jar")
         .getOrCreate())
How can someone use a class/method from such an external jar in PySpark/Spark SQL code?
In a Python or Java app, one can easily use methods and classes from an external library by just adding it to the application classpath and importing the class names, e.g.
from ua_parser import user_agent_parser
In Java we can add the jar as an external jar and use its methods/functions just by doing:
import ua_parser.Parser
Note: I have used ua-parser just as an example.
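One way this is commonly handled in PySpark (a sketch only, with the jar path taken from the question and the constructor/method names assumed from the ua-parser Java library, so treat them as illustrative): once the jar is on the classpath via spark.jars, the Java class can be reached through the Py4J gateway that the SparkSession exposes as spark._jvm.
from pyspark.sql import SparkSession
# Sketch: reuses the jar path from the question; adjust to your environment.
spark = (SparkSession.builder
         .appName("ua-parser-demo")
         .config("spark.jars", "my_hdfs_jar_path/ua_parser1.3.0.jar")
         .getOrCreate())
# Reach the Java class through the Py4J gateway; names here are illustrative.
parser = spark._jvm.ua_parser.Parser()
client = parser.parse("Mozilla/5.0 (X11; Linux x86_64) ...")  # returns a Java object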
I'm trying to follow the AWS Glue documentation to develop a Scala program and create a new Glue job. These have been my steps so far:
Built a sample Scala program as guided by https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-example.html.
Bundled the scala main class into a jar-with-dependencies assembly file and uploaded it to S3 under a /bin folder
Launched AWS Glue Service on AWS Management Console
Under "jobs" clicked on Add Job and setup the following
Name:
IAM Role : Role that has access to S3, Glue, etc
Type: Spark
Glue Version: Spark 3.1, Scala 2 (Glue Version 3.0)
This job runs as : "An existing Script that you provided"
Script file Name: FQCN for the scala main class
S3 path where the script is stored: S3 link to the jar-with-dependencies file
Temporary directory: did not change
Click on "Next"
Save the job and clicked on Run job
After a while it shows the following error in the cloudwatch logs
2021-12-14 02:24:50,558 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): User class 'xxx.xxx.xxxx.GlueJob' is not initialized. java.lang.ClassNotFoundException: xxx.xxx.xxxx.GlueJob
Where am I going wrong?
I was able to get past the above error and run the Scala job on Glue. To answer the question above, I would say the issue was more about my understanding of how Glue executes Scala.
Outside of AWS, a Spark job lifecycle follows these steps:
code development -> compile -> bundle with libraries -> deploy -> invoke main class
The exception with Glue is the COMPILE step! Glue expects the following:
The main class is in a separate source file (by default GlueApp.scala) on S3
All the relevant libraries and files are passed accordingly in the "Security configuration, script libraries and job parameters" section
When the job is started, Glue compiles the main class with all the relevant libraries and any external libraries passed to it.
The learning for me has been NOT TO BUNDLE the main class into the jar.
If there is any other better way, please feel free to add it. Thanks.
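For reference, the same setup can also be expressed through the API. Below is a rough boto3 sketch (bucket, role, job, and class names are placeholders); the point it illustrates is that the script location points at the plain .scala source file while the dependency jar goes in --extra-jars:
import boto3
glue = boto3.client("glue")
# Placeholder names and paths, for illustration only.
glue.create_job(
    Name="my-scala-job",
    Role="MyGlueServiceRole",
    GlueVersion="3.0",
    Command={
        "Name": "glueetl",
        # Main class as a plain .scala source file on S3, not bundled into the jar.
        "ScriptLocation": "s3://my-bucket/scripts/GlueApp.scala",
    },
    DefaultArguments={
        "--job-language": "scala",
        "--class": "GlueApp",
        # Supporting libraries (without the main class) are passed here.
        "--extra-jars": "s3://my-bucket/jars/my-deps-assembly.jar",
    },
)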
I want to use the EMRFS S3-optimized committer locally, without an EMR cluster.
I set "fs.s3a.impl" = "com.amazon.ws.emr.hadoop.fs.EmrFileSystem" instead of "org.apache.hadoop.fs.s3a.S3AFileSystem", and the following exception was raised:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
I tried to use the following packages from Maven, without any success:
com.amazonaws:aws-java-sdk:1.12.71
com.amazonaws:aws-java-sdk-emr:1.12.70
Sorry, but using EMRFS, including the S3-optimized committer, is not possible off of EMR.
EMRFS is not an open-source package, nor is the library available in Maven Central. This is why the class is not found when you try to add aws-java-sdk-emr as a dependency; that artifact contains only the AWS Java SDK client used when interfacing with the EMR service (e.g., to create clusters).
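For local runs the practical route is therefore to stay on the open-source S3A filesystem that the question mentions replacing. A minimal sketch (the hadoop-aws version is illustrative and should match the Hadoop build your Spark ships with):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("local-s3a")
         # Illustrative version; match it to your Spark's bundled Hadoop.
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
         .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
         .getOrCreate())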
Using an AWS Glue development endpoint, Spark version 2.4, Python version 3.
Code:
df=spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")
I am getting the following error message while trying to read an Avro file:
Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
I found the following links, but they were not helpful in resolving my issue:
Apache Avro Data Source Guide: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
Apache Avro as a Built-in Data Source in Apache Spark 2.4
You just need to include that package:
org.apache.spark:spark-avro_2.11:2.4.3 (pick the version that matches your Spark release and Scala version)
Check which version you need here
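For example, from PySpark one way to pull in the Apache package when the session is created (a sketch; the artifact version is illustrative and must match your Spark release and Scala version, and on a Glue development endpoint you may need to supply the jar through the endpoint's dependent-jars setting instead):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("avro-read")
         # Illustrative coordinates; match your Spark release and Scala version.
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.3")
         .getOrCreate())
df = spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")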
Have you imported the package while starting the shell? If not, you need to start the shell as shown below. The package below is applicable for Spark 2.4+.
pyspark --packages com.databricks:spark-avro_2.11:4.0.0
Also write as below inside read.format:
df=spark.read.format("com.databricks.spark.avro").load("s3://dataexport/users/prod-users.avro")
Note: For pyspark you need to write 'com.databricks.spark.avro' instead of 'avro'.
I was looking for some info on the MSDN forums but couldn't find a good one. While reading on the Spark site, I got the hint that I would have better chances here.
So, bottom line: I want to read from Blob storage where there is a continuous feed of XML files, all small files, and finally we store these files in an Azure DW.
Using Azure Databricks I can use Spark and Python, but I can't find a way to 'read' the XML type. Some sample scripts used the xml.etree.ElementTree library, but I can't get it imported.
So any help pushing me in a good direction is appreciated.
One way is to use the Databricks spark-xml library:
Import the spark-xml library into your workspace
https://docs.databricks.com/user-guide/libraries.html#create-a-library (search spark-xml in the maven/spark package section and import it)
Attach the library to your cluster https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster
Use the following code in your notebook to read the XML file, where "note" is the root tag of my XML file.
xmldata = spark.read.format('xml').option("rowTag", "note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
I found this one really helpful:
https://github.com/raveendratal/PysparkTelugu/blob/master/Read_Write_XML_File.ipynb
He has a YouTube video walking through the steps as well.
In summary, there are two approaches:
Install it in your Databricks cluster via the 'Libraries' tab.
Install it by launching spark-shell in the notebook itself; see the sketch below.
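A hedged sketch of the second style for a session you create yourself (e.g. a local PySpark session); on a managed Databricks cluster the Libraries tab from the first approach is the usual route, and the package version below is illustrative:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("xml-read")
         # Illustrative coordinates; match the Scala version of your Spark build.
         .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.14.0")
         .getOrCreate())
# Path carried over from the earlier answer; replace with your own location.
df = spark.read.format("xml").option("rowTag", "note").load("dbfs:/mnt/mydatafolder/xmls/note.xml")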
I got one solution for reading an XML file in Databricks:
Install this library: com.databricks:spark-xml_2.12:0.11.0
using this cluster configuration: 10.5 (includes Apache Spark 3.2.1, Scala 2.12).
Using this command (%fs head "") you will get the rootTag and rowTag.
df = spark.read.format('xml').option("rootTag","orders").option("rowTag","purchase_item").load("dbfs:/databricks-datasets/retail-org/purchase_orders/purchase_orders.xml")
display(df)