AWS Glue Passing Parameters via Boto3 causing exception - python-3.x

For the life of me I cannot figure out what is going on here.
I am starting a Glue job via Boto3 (from Lambda, but testing locally gives the exact same issue), and whenever I pass parameters in via the start_job_run API I get the same error, even though the parameters all look correct in the logs. Here is the output (I have changed some names of the buckets etc.).
Glue Code (sample):
import sys

import boto3
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext


def main():
    args = getResolvedOptions(sys.argv, [
        'JOB_NAME',
        's3_bucket',
        's3_temp_prefix',
        's3_schema_prefix',
        's3_processed_prefix',
        'ingestion_run_id'
    ])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    logger = glueContext.get_logger()
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    s3_client = boto3.client('s3')
    s3_bucket = args['s3_bucket']
    temp_prefix = args['s3_temp_prefix']
    schema_prefix = args['s3_schema_prefix']
    processed_prefix = args['s3_processed_prefix']
    ingestion_run_id = args['ingestion_run_id']

    logger.info(f's3_bucket: {s3_bucket}')
    logger.info(f'temp_prefix: {temp_prefix}')
    logger.info(f'schema_prefix: {schema_prefix}')
    logger.info(f'processed_prefix: {processed_prefix}')
    logger.info(f'ingestion_run_id: {ingestion_run_id}')
SAM Template to make the Glue Job:
CreateDataset:
  Type: AWS::Glue::Job
  Properties:
    Command:
      Name: glueetl
      PythonVersion: 3
      ScriptLocation: !Sub "s3://bucket-name/GLUE/create_dataset.py"
    DefaultArguments:
      "--extra-py-files": "s3://bucket-name/GLUE/S3GetKeys.py"
      "--enable-continuous-cloudwatch-log": ""
      "--enable-metrics": ""
    GlueVersion: 2.0
    MaxRetries: 0
    Role: !GetAtt GlueRole.Arn
    Timeout: 360
    WorkerType: Standard
    NumberOfWorkers: 15
Code to attempt to start the Glue Job:
import boto3

session = boto3.session.Session(profile_name='glue_admin', region_name=region)
client = session.client('glue')

name = 'CreateDataset-1uPuNfIw1Tjd'
args = {
    "--s3_bucket": 'bucket-name',
    "--s3_temp_prefix": 'TEMP',
    "--s3_schema_prefix": 'SCHEMA',
    "--s3_processed_prefix": 'PROCESSED',
    "--ingestion_run_id": 'FakeRun'
}
client.start_job_run(JobName=name, Arguments=args)
This starts the job fine, but then the script errors; this is the log left behind. From what I can see, the parameters appear to be lined up correctly:
Wed Feb 10 09:16:00 UTC 2021/usr/bin/java -cp /opt/amazon/conf:/opt/amazon/lib/hadoop-lzo/*:/opt/amazon/lib/emrfs-lib/*:/opt/amazon/spark/jars/*:/opt/amazon/superjar/*:/opt/amazon/lib/*:/opt/amazon/Scala2.11/* com.amazonaws.services.glue.PrepareLaunch --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=29 --conf spark.executor.memory=5g --conf spark.executor.cores=4 --conf spark.driver.memory=5g --JOB_ID j_76c49a0d580594d5c0f584458cc0c9d519 --enable-metrics --extra-py-files s3://bucket-name/GLUE/S3GetKeys.py --JOB_RUN_ID jr_c0b9049abf1ee1161de189a901dd4be05694c1c42863 --s3_schema_prefix SCHEMA --enable-continuous-cloudwatch-log --s3_bucket bucket-name --scriptLocation s3://bucket-name/GLUE/create_dataset.py --s3_temp_prefix TEMP --ingestion_run_id FakeRun --s3_processed_prefix PROCESSED --JOB_NAME CreateDataset-1uPuNfIw1Tjd
Bucket name has been altered for this post but it matches exactly.
Fail point in the Glue job log:
java.lang.IllegalArgumentException: For input string: "--s3_bucket"
The bucket name has no illegal characters, though it does contain '-'.
Thanks in advance for any help.

This happened because the --enable-continuous-cloudwatch-log argument expects a value. Since you didn't provide one, the argument parser assumed the next argument was its value (--enable-continuous-cloudwatch-log --s3_bucket). In this case that next argument was --s3_bucket, which is an invalid value for the --enable-continuous-cloudwatch-log option, hence the error.
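One way to avoid this is to give each of those flags an explicit value rather than an empty string; AWS Glue's special parameters accept "true"/"false" values. A minimal sketch, reusing the names from the question (the same values could equally go in the template's DefaultArguments):
# Sketch: pass explicit values for the special flags so the argument
# parser never consumes the following "--flag" as a value. All names
# here are the example values from the question above.
args = {
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-metrics": "true",
    "--s3_bucket": "bucket-name",
    "--s3_temp_prefix": "TEMP",
    "--s3_schema_prefix": "SCHEMA",
    "--s3_processed_prefix": "PROCESSED",
    "--ingestion_run_id": "FakeRun",
}
client.start_job_run(JobName=name, Arguments=args)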

Related

AWS Glue job throwing java.lang.OutOfMemoryError: Java heap space

I am running a Glue ETL transformation job. This job is supposed to read data from S3 and convert it to Parquet.
Below is the Glue source; sourcePath is the location of the S3 files.
In this location we have around 100 million JSON files, all of them nested into sub-folders.
That is the reason I am applying exclusionPattern: I believe that with it only the files starting with a (which are around 2.7 million files) will be processed.
val file_paths = Array(sourcePath)
val exclusionPattern = "\"" + sourcePath + "{[!a]}**" + "\""

glueContext
  .getSourceWithFormat(
    connectionType = "s3",
    options = JsonOptions(Map(
      "paths" -> file_paths,
      "recurse" -> true,
      "groupFiles" -> "inPartition",
      "exclusions" -> s"[$exclusionPattern]"
    )),
    format = "json",
    transformationContext = "sourceDF"
  )
  .getDynamicFrame()
  .map(transformRow, "error in row")
  .toDF()
After running this job with the Standard worker type, and with the G2 worker type as well, I keep getting this error:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 27788"...
And in CloudWatch I can see that driver memory utilisation reaches 100%, but executor memory usage is almost nil.
When running the job I am setting spark.driver.memory=10g and spark.driver.memoryOverhead=4096 via the --conf job parameter.
These are the details in the logs:
--conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000
--conf spark.hadoop.fs.defaultFS=hdfs://ip-myip.compute.internal:1111
--conf spark.hadoop.yarn.resourcemanager.address=ip-myip.compute.internal:1111
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.minExecutors=1
--conf spark.dynamicAllocation.maxExecutors=4
--conf spark.executor.memory=20g
--conf spark.executor.cores=16
--conf spark.driver.memory=20g
--conf spark.default.parallelism=80
--conf spark.sql.shuffle.partitions=80
--conf spark.network.timeout=600
--job-bookmark-option job-bookmark-disable
--TempDir s3://my-location/admin
--class com.example.ETLJob
--enable-spark-ui true
--enable-metrics
--JOB_ID j_111...
--spark-event-logs-path s3://spark-ui
--conf spark.driver.memory=20g
--JOB_RUN_ID jr_111...
--conf spark.driver.memoryOverhead=4096
--scriptLocation s3://my-location/admin/Job/ETL
--SOURCE_DATA_LOCATION s3://xyz/
--job-language scala
--DESTINATION_DATA_LOCATION s3://xyz123/
--JOB_NAME ETL
Any ideas what the issue could be?
Thanks
If you have too many files, you are probably overwhelming the driver. Try using useS3ListImplementation. This is an implementation of the Amazon S3 ListKeys operation, which splits large result sets into multiple responses.
Try adding:
"useS3ListImplementation" -> true
[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-oom-java-heap-space-error/
As suggested by @eman, I applied all three of groupFiles, groupSize and useS3ListImplementation, like below:
options = JsonOptions(Map(
  "path" -> sourcePath,
  "recurse" -> true,
  "groupFiles" -> "inPartition",
  "groupSize" -> 104857600, // 100 MB
  "useS3ListImplementation" -> true
))
And that is working for me... there is also an "acrossPartitions" option if the data is not arranged properly.
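For reference, a rough PySpark equivalent of the same grouped read (a sketch, not from the original answers; glueContext and source_path stand in for your own objects):
# Sketch: the same S3 grouping options applied from a Python Glue job.
# `glueContext` is an awsglue GlueContext; `source_path` is a placeholder.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": [source_path],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "104857600",  # 100 MB
        "useS3ListImplementation": True,
    },
    format="json",
)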

Not able to read jceks file in yarn cluster mode in python

I am using a jceks file to decrypt my password, and am unable to read the encrypted password in yarn cluster mode.
I have tried different methods, including:
spark-submit --deploy-mode cluster
--file /localpath/credentials.jceks#credentials.jceks
--conf spark.hadoop.hadoop.security.credential.provider.path=jceks://file////localpath/credentials.jceks test.py
spark1 = SparkSession.builder.appName("xyz").master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
x = spark1.sparkContext._jsc.hadoopConfiguration()
x.set("hadoop.security.credential.provider.path", "jceks://file///credentials.jceks")
a = x.getPassword("<password alias>")
passw = ""
for i in range(a.__len__()):
    passw = passw + str(a.__getitem__(i))
I am getting the below error:
AttributeError: 'NoneType' object has no attribute '__len__'
and when I print a, it is None.
FWIW, if you put your jceks file on HDFS, the YARN workers will be able to find it when running in cluster mode; at least it works for me. Hope it works for you.
hadoop fs -put ~/.jceks /user/<uid>/.jceks
spark1 = SparkSession.builder.appName("xyz").master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
x = spark1.sparkContext._jsc.hadoopConfiguration()
jceks_hdfs_path = "jceks://hdfs#<host>/user/<uid>/.jceks"
x.set("hadoop.security.credential.provider.path", jceks_hdfs_path)
a = x.getPassword("<password alias>")
passw = ""
for i in range(a.__len__()):
    passw = passw + str(a.__getitem__(i))
With that you won't need to specify --files and --conf in the arguments when you run spark-submit. Hope it helps.
You can refer to the jceks file like this:
# Choosing jceks from the Spark staging directory
jceks_location = "jceks://" + str("/user/your_user_name/.sparkStaging/" + str(spark.sparkContext.applicationId) + "/credentials.jceks")
x.set("hadoop.security.credential.provider.path", jceks_location)

Retrieve the time it took for spark to finish the job

I need to time some things in Spark, e.g. how long it takes for Spark to read my file, so I like to use sc.setLogLevel("INFO") to enable extra information printed to the screen. One thing I find really useful is a message like this:
2018-12-18 02:05:38 INFO DAGScheduler:54 - Job 2 finished: count at <console>:26, took 9.555080 s
because it tells me how long something took.
Is there any way to get this programmatically (preferably in Scala)? Right now I just copy this result and save it in a text file.
You can create something like:
import scala.concurrent.duration._

case class TimedResult[R](result: R, durationInNanoSeconds: FiniteDuration)

def time[R](block: => R): TimedResult[R] = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  val duration = t1 - t0
  TimedResult(result, duration.nanoseconds)
}
and then use it to wrap your code block:
val timedResult = time {
  someDataframe.count()
}
println(s"Count of records: ${timedResult.result}")
println(s"Time taken: ${timedResult.durationInNanoSeconds.toSeconds} s")
There are two solutions available for recording logs from your Spark program.
a) You can redirect your console output to a desired file while using the spark-submit command.
spark-submit your_code_file > logfile.txt 2>&1
b) Two log files (log4j.properties) can be created, one for the driver and one for the executors; while issuing the spark-submit command, include them by giving their paths in the Java options for the driver and the executors.
spark-submit --class MAIN_CLASS --driver-java-options "-Dlog4j.configuration=file:LOG4J_PATH" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:LOG4J_PATH" --master MASTER_IP:PORT JAR_PATH

Unable to access HBase Via API

I have HBase installed across three nodes. I am trying to load data into HBase via Spark with the code below.
from __future__ import print_function
import sys
import json

from pyspark import SparkContext

if __name__ == "__main__":
    print("*******************************")
    sc = SparkContext(appName="HBaseOutputFormat")

    host = sys.argv[1]
    table = "hbase_test"
    port = "2181"

    conf = {
        "hbase.zookeeper.quorum": host,
        "hbase.mapred.outputtable": table,
        "hbase.zookeeper.property.clientPort": port,
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"
    }
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    rdd = sc.parallelize([sys.argv[2:]]).map(lambda x: (x[0], x))
    print(rdd.collect())

    rdd.saveAsNewAPIHadoopDataset(
        conf=conf,
        keyConverter=keyConv,
        valueConverter=valueConv)
    sc.stop()
I am executing code as:
spark-submit --driver-class-path /usr/iop/4.3.0.0-0000/hbase/lib/hbase-server.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-common.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-client.jar:/usr/iop/4.3.0.0-0000/hbase/lib/zookeeper.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-protocol.jar:/usr/iop/4.3.0.0-0000/spark2/examples/jars/scopt_2.11-3.3.0.jar:/home/tanveer/spark-examples_2.10-1.1.0.jar --conf spark.ui.port=5054 --master local[2] /data/usr/tanveer/from_home/spark/hbase_outputformat.py HBASE_MASTER_ip row1 f1 q1 value1
But the job gets stuck and doesn't proceed (screenshot omitted).
As per some previous threads I tried editing /etc/hosts to comment out the localhost line, but it didn't work.
Requesting your help.
On further debugging, I referred to the Hortonworks blog post below for best practices:
https://community.hortonworks.com/articles/4091/hbase-client-application-best-practices.html
I added the HBase configuration directory to the driver class path (so the HBase client picks up hbase-site.xml), ran the code, and it worked perfectly fine.
The modified spark-submit can be viewed as:
spark-submit --driver-class-path /usr/iop/4.3.0.0-0000/hbase/lib/hbase-server.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-common.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-client.jar:/usr/iop/4.3.0.0-0000/hbase/lib/zookeeper.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-protocol.jar:/usr/iop/4.3.0.0-0000/spark2/examples/jars/scopt_2.11-3.3.0.jar:/home/tanveer/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf --conf spark.ui.port=5054 --master local[2] /data/usr/tanveer/from_home/spark/hbase_outputformat.py host row1 f1 q1 value1

Getting parameters of Spark submit while running a Spark job

I am running a Spark job via spark-submit and using its --files parameter to ship a log4j.properties file.
In my Spark job I need to get the value of this parameter:
object LoggerSparkUsage {

  def main(args: Array[String]): Unit = {
    //DriverHolder.log.info("unspark")
    println("args are....." + args.mkString(" "))
    val conf = new SparkConf().setAppName("Simple_Application") //.setMaster("local[4]")
    val sc = new SparkContext(conf)
    // conf.getExecutorEnv.
    val count = sc.parallelize(Array(1, 2, 3)).count()
    println("these are files" + conf.get("files"))
    LoggerDriver.log.info("log1 for info..")
    LoggerDriver.log.info("log2 for infor..")
    f2
  }

  def f2 { LoggerDriver.log.info("logs from another function..") }
}
My spark-submit is something like this:
/opt/mapr/spark/spark-1.6.1/bin/spark-submit --class "LoggerSparkUsage" --master yarn-client --files src/main/resources/log4j.properties /mapr/cellos-mapr/user/mbazarganigilani/SprkHbase/target/scala-2.10/sprkhbase_2.10-1.0.2.jar
I tried to get the properties using
conf.get("files")
but it gives me an exception.
Can anyone give me a solution for this?
The correct key for files is spark.files:
scala.util.Try(sc.getConf.get("spark.files"))
but to get actual path on the workers you have to use SparkFiles:
org.apache.spark.SparkFiles.get(fileName)
If that is not sufficient, you can pass these paths as application arguments and retrieve them from main's args, or use a custom key in spark.conf.
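For reference, a rough PySpark equivalent (a sketch against the modern SparkSession API rather than the Spark 1.6 shown in the question; log4j.properties is just the example file name from above):
# Sketch: read spark.files and resolve a shipped file's local path in PySpark.
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files_demo").getOrCreate()

# Comma-separated list of files passed via --files; empty if none were passed.
shipped = spark.sparkContext.getConf().get("spark.files", "")
print(shipped)

# Absolute local path of a shipped file on this node.
print(SparkFiles.get("log4j.properties"))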
