I'm trying to follow the AWS Glue documentation to develop a Scala program and create a new Glue job. My steps so far have been:
Built a sample Scala program as guided by https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-example.html.
Bundled the Scala main class into a jar-with-dependencies assembly file and uploaded it to S3 under a /bin folder
Launched the AWS Glue service in the AWS Management Console
Under "Jobs", clicked "Add job" and set up the following:
Name:
IAM Role: a role that has access to S3, Glue, etc.
Type: Spark
Glue Version: Spark 3.1, Scala 2 (Glue Version 3.0)
This job runs: "An existing script that you provide"
Script file name: the FQCN of the Scala main class
S3 path where the script is stored: the S3 link to the jar-with-dependencies file
Temporary directory: left unchanged
Clicked "Next"
Saved the job and clicked "Run job"
After a while, it shows the following error in the CloudWatch logs:
2021-12-14 02:24:50,558 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): User class 'xxx.xxx.xxxx.GlueJob' is not initialized. java.lang.ClassNotFoundException: xxx.xxx.xxxx.GlueJob
Where am I going wrong?
I was able to get past the above error and run a Scala job on Glue. To answer the question above: it was really a matter of my understanding of how Glue executes Scala.
Outside of AWS, a Spark job lifecycle follows these steps:
code development -> compile -> bundle with libraries -> deploy -> invoke main class
The exception with Glue is the first step, COMPILE! Glue expects the following:
The main class is in a separate source file (default GlueApp.scala) on S3, not inside a jar
All the relevant libraries and files are passed in the "Security configuration, script libraries and job parameters" section
When the job is started, Glue compiles the main class together with all the relevant libraries and any external libraries passed to it.
The learning for me has been NOT TO BUNDLE the main class into the jar; only the dependencies go into jars, while the main class stays a plain .scala script, as sketched below.
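For reference, a minimal sketch of what such a standalone main class file can look like, uploaded to S3 as e.g. GlueApp.scala and referenced directly by the job's script path (the catalog database/table names and the output path are placeholders; the structure follows the AWS Glue Scala example linked above):

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    // Resolve the standard job arguments that Glue passes in
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Illustrative ETL step: read a Data Catalog table and write it out as Parquet
    val dyf = glueContext.getCatalogSource(database = "my_database", tableName = "my_table").getDynamicFrame()
    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://my-output-bucket/output/"}"""),
      format = "parquet"
    ).writeDynamicFrame(dyf)

    Job.commit()
  }
}

The dependency jar (which no longer contains the main class) is then supplied separately, e.g. via the "Dependent jars path" field / the --extra-jars job parameter, and the job's script path points at the .scala file rather than at a jar.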
If there is any other, better way, please feel free to add it. Thanks
Related
How to use an external jar library's functions and methods in a Spark application (Jupyter notebook)
For example, I was trying to use ua-parser as an external library to parse user agent strings.
So in my Jupyter notebook I added that jar file to the Spark config using:
from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName(sparkAppName).master(sparkMaster)
    .config("spark.jars", "my_hdfs_jar_path/ua_parser1.3.0.jar")
    .getOrCreate())
How can someone use a class/method from such an external jar in PySpark / Spark SQL code?
In a Python or Java app, one can easily use methods and classes from an external library by adding it to the application classpath and importing the class names:
from ua_parser import user_agent_parser
In Java we can add the jar as an external jar and use its methods/functions just by doing:
import ua_parser.Parser
Note: I have used ua-parser just as an example.
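On the Scala / Spark SQL side (the question mentions both), a rough sketch of the usual pattern: once the jar is on the driver and executor classpath (e.g. via spark.jars), wrap the library call in a UDF and register it so it is callable from SQL. This assumes the ua-parser Java artifact exposes ua_parser.Parser with a parse(String) method returning a Client; check the API of the version you actually ship.

import org.apache.spark.sql.SparkSession
import ua_parser.Parser  // comes from the jar added via spark.jars (assumed API)

object UaParserExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ua-parser-example")
      .config("spark.jars", "my_hdfs_jar_path/ua_parser1.3.0.jar")  // same jar path as in the question
      .getOrCreate()

    // Register a UDF that extracts the browser family from a user agent string.
    // For brevity a Parser is created per call; in practice you would reuse one per executor.
    spark.udf.register("ua_family", (ua: String) => new Parser().parse(ua).userAgent.family)

    spark.sql("SELECT ua_family('Mozilla/5.0 (Windows NT 10.0) Chrome/96.0') AS browser").show()

    spark.stop()
  }
}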
I have a Spark application that contains multiple Spark jobs to be run on Azure Databricks. I want to build and package the application into a fat jar. The application compiles successfully. When I try to package the application (command: sbt package), it gives an error "[warn] multiple main classes detected: run 'show discoveredMainClasses' to see the list".
How do I build the application jar (without specifying any main class) so that I can upload it to a Databricks job and specify the main class there?
This message is just a warning (note the [warn] in it); it doesn't prevent generation of the jar files (normal or fat). You can then upload the resulting jar to DBFS (or ADLS for newer Databricks Runtime versions) and create a Databricks job either as a Jar task or a Spark Submit task.
If sbt fails and doesn't produce jars, then you have some plugin that turns warnings into errors.
Also note that sbt package doesn't produce a fat jar; it produces a jar only for the classes in your project. You will need sbt assembly (install the sbt-assembly plugin for that) to generate a fat jar, but make sure you mark the Spark & Delta dependencies as provided, as in the sketch below.
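As a rough sketch of that setup (the plugin, Scala, Spark and Delta versions below are only examples and should be matched to your Databricks Runtime):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt
name := "my-spark-jobs"    // illustrative project name
scalaVersion := "2.12.15"  // match the Scala version of your Databricks Runtime

libraryDependencies ++= Seq(
  // Provided: already available on the cluster, so excluded from the fat jar
  "org.apache.spark" %% "spark-sql"  % "3.2.1" % Provided,
  "io.delta"         %% "delta-core" % "1.1.0" % Provided
)

Running sbt assembly then produces the fat jar under target/scala-2.12/, which is what you upload and reference in the Jar or Spark Submit task.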
I want to use the EMRFS S3-optimized committer locally, without an EMR cluster.
I have set "fs.s3a.impl" = "com.amazon.ws.emr.hadoop.fs.EmrFileSystem" instead of "org.apache.hadoop.fs.s3a.S3AFileSystem", and the following exception was raised:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
I tried to use the following packages from Maven, without any success:
com.amazonaws:aws-java-sdk:1.12.71
com.amazonaws:aws-java-sdk-emr:1.12.70
Sorry, but using EMRFS, including the S3-optimized committer, is not possible outside of EMR.
EMRFS is not an open-source package, nor is the library available in Maven Central. This is why the class is not found when you try to add aws-java-sdk-emr as a dependency; that package is solely the AWS Java SDK client for the EMR service itself (e.g., for creating clusters).
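If the underlying goal is simply to read and write S3 from a local Spark job, the usual route is the open-source S3A connector instead. A minimal sketch, assuming the hadoop-aws artifact (matching your Hadoop version) and its bundled AWS SDK are on the classpath; the bucket and path are placeholders:

import org.apache.spark.sql.SparkSession

// Requires e.g. "org.apache.hadoop" % "hadoop-aws" % "<your Hadoop version>" on the classpath.
val spark = SparkSession.builder()
  .appName("local-s3a")
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getOrCreate()

spark.read.parquet("s3a://my-bucket/some/path/").show()  // placeholder bucket/path

This does not give you the EMRFS S3-optimized committer, only plain S3A access; credentials are picked up from the usual AWS credential chain unless configured explicitly.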
I'm implementing different Apache Spark solutions using IntelliJ IDEA, Scala and sbt; however, each time I want to run my implementation I need to do the following steps after creating the jar:
Amazon: send the .jar to the master node using SSH, and then run it from the spark-shell command line.
Azure: I'm using the Databricks CLI, so each time I want to upload a jar, I uninstall the old library, remove the jar stored on the cluster, and finally upload and install the new .jar.
So I was wondering whether it is possible to do all of this in just one click, for example using the IntelliJ IDEA Run button, or with another method that makes it all simpler. I was also thinking about Jenkins as an alternative.
Basically, I'm looking for easier deployment options.
Pretty simple objective: load my custom/local jars from S3 into a Zeppelin notebook (using Zeppelin on AWS EMR).
Location of the jar:
s3://my-config-bucket/process_dataloader.jar
Following the Zeppelin documentation, I opened the interpreter settings, added spark.jars as the property name, and set its value to s3://my-config-bucket/process_dataloader.jar.
I restarted the interpreter, and then in the notebook I tried to import the class using the following:
import com.org.dataloader.DataLoader
but it throws the following
<console>:23: error: object org is not a member of package com
import com.org.dataloader.DataLoader
Any suggestions for solving this problem?
A bit late, but for anyone else who might need this in the future, try the option below.
"https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar" is basically your S3 object URL.
%spark.dep
z.reset()
z.load("https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar")
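Once the dependency has loaded, the import from the question should resolve in a regular %spark paragraph, for example:

%spark
import com.org.dataloader.DataLoader
// use DataLoader from here as usual

Note that the %spark.dep paragraph has to run before the Spark interpreter starts, hence the interpreter restart and the z.reset().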