Not able to read jceks file in yarn cluster mode in python - apache-spark

I am using a jceks file to decrypt my password, but I am unable to read the encrypted password in YARN cluster mode.
I have tried different methods, including:
spark-submit --deploy-mode cluster \
--files /localpath/credentials.jceks#credentials.jceks \
--conf spark.hadoop.hadoop.security.credential.provider.path=jceks://file////localpath/credentials.jceks test.py
spark1 = SparkSession.builder.appName("xyz").master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
x = spark1.sparkContext._jsc.hadoopConfiguration()
x.set("hadoop.security.credential.provider.path", "jceks://file///credentials.jceks")
a = x.getPassword("<password alias>")
passw = ""
for i in range(a.__len__()):
    passw = passw + str(a.__getitem__(i))
I am getting the following error:
AttributeError: 'NoneType' object has no attribute '__len__'
and when I print a, it is None.

FWIW, if you put your jceks file on HDFS, the YARN workers will be able to find it when running in cluster mode; at least it works for me. Hope it works for you.
hadoop fs -put ~/.jceks /user/<uid>/.jceks
spark1 = SparkSession.builder.appName("xyz").master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
x = spark1.sparkContext._jsc.hadoopConfiguration()
jceks_hdfs_path = "jceks://hdfs#<host>/user/<uid>/.jceks"
x.set("hadoop.security.credential.provider.path", jceks_hdfs_path)
a = x.getPassword("<password alias>")
passw = ""
for i in range(a.__len__()):
    passw = passw + str(a.__getitem__(i))
With that you won't need to specify --files and --conf in the arguments when you run spark-submit. Hope it helps.
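For example, the submit command then reduces to something like the following (test.py being the script from the question; a sketch, assuming the provider path is set in code as above):
spark-submit --master yarn --deploy-mode cluster test.py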

You can refer to the jceks file from the Spark staging directory like this:
# Choosing the jceks file from the Spark staging directory
jceks_location = "jceks://" + str("/user/your_user_name/.sparkStaging/" + str(spark.sparkContext.applicationId) + "/credentials.jceks")
x.set("hadoop.security.credential.provider.path", jceks_location)

Related

AWS Glue Passing Parameters via Boto3 causing exception

For the life of me I cannot figure out what is going on here.
I am starting a Glue job via Boto3 (from Lambda, but testing locally gives the exact same issue), and when I pass parameters in via the start_job_run API I get the same error, yet looking at the logs the parameters all look correct. Here is the output (I have changed some names of the buckets etc.).
Glue Code (sample):
import sys

import boto3
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext


def main():
    args = getResolvedOptions(sys.argv, [
        'JOB_NAME',
        's3_bucket',
        's3_temp_prefix',
        's3_schema_prefix',
        's3_processed_prefix',
        'ingestion_run_id'
    ])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    logger = glueContext.get_logger()
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    s3_client = boto3.client('s3')
    s3_bucket = args['s3_bucket']
    temp_prefix = args['s3_temp_prefix']
    schema_prefix = args['s3_schema_prefix']
    processed_prefix = args['s3_processed_prefix']
    ingestion_run_id = args['ingestion_run_id']
    logger.info(f's3_bucket: {s3_bucket}')
    logger.info(f'temp_prefix: {temp_prefix}')
    logger.info(f'schema_prefix: {schema_prefix}')
    logger.info(f'processed_prefix: {processed_prefix}')
    logger.info(f'ingestion_run_id: {ingestion_run_id}')
SAM Template to make the Glue Job:
CreateDataset:
  Type: AWS::Glue::Job
  Properties:
    Command:
      Name: glueetl
      PythonVersion: 3
      ScriptLocation: !Sub "s3://bucket-name/GLUE/create_dataset.py"
    DefaultArguments:
      "--extra-py-files": "s3://bucket-name/GLUE/S3GetKeys.py"
      "--enable-continuous-cloudwatch-log": ""
      "--enable-metrics": ""
    GlueVersion: 2.0
    MaxRetries: 0
    Role: !GetAtt GlueRole.Arn
    Timeout: 360
    WorkerType: Standard
    NumberOfWorkers: 15
Code to attempt to start the Glue Job:
import boto3
session = boto3.session.Session(profile_name='glue_admin', region_name=region)
client = session.client('glue')
name = 'CreateDataset-1uPuNfIw1Tjd'
args = {
    "--s3_bucket": 'bucket-name',
    "--s3_temp_prefix": 'TEMP',
    "--s3_schema_prefix": 'SCHEMA',
    "--s3_processed_prefix": 'PROCESSED',
    "--ingestion_run_id": 'FakeRun'
}
client.start_job_run(JobName=name, Arguments=args)
This starts the job fine, but then the script errors, and this is the log left behind. From what I can see, the parameters seem to be lined up fine?
Wed Feb 10 09:16:00 UTC 2021/usr/bin/java -cp /opt/amazon/conf:/opt/amazon/lib/hadoop-lzo/*:/opt/amazon/lib/emrfs-lib/*:/opt/amazon/spark/jars/*:/opt/amazon/superjar/*:/opt/amazon/lib/*:/opt/amazon/Scala2.11/* com.amazonaws.services.glue.PrepareLaunch --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=29 --conf spark.executor.memory=5g --conf spark.executor.cores=4 --conf spark.driver.memory=5g --JOB_ID j_76c49a0d580594d5c0f584458cc0c9d519 --enable-metrics --extra-py-files s3://bucket-name/GLUE/S3GetKeys.py --JOB_RUN_ID jr_c0b9049abf1ee1161de189a901dd4be05694c1c42863 --s3_schema_prefix SCHEMA --enable-continuous-cloudwatch-log --s3_bucket bucket-name --scriptLocation s3://bucket-name/GLUE/create_dataset.py --s3_temp_prefix TEMP --ingestion_run_id FakeRun --s3_processed_prefix PROCESSED --JOB_NAME CreateDataset-1uPuNfIw1Tjd
The bucket name has been altered for this post, but it matches exactly.
Fail point in the Glue job log:
java.lang.IllegalArgumentException: For input string: "--s3_bucket"
The bucket name has no illegal chars but does have '-' in it?
Thanks in advance for help.
This happened because the --enable-continuous-cloudwatch-log argument expects a value. Since you didn't provide one, the argument parser assumed the next argument was its value (--enable-continuous-cloudwatch-log --s3_bucket), which in this case was --s3_bucket; --s3_bucket is an invalid value for the --enable-continuous-cloudwatch-log option, hence the error.
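One way to avoid this (a sketch based on the start_job_run call from the question; you could equally set the same value in the SAM template's DefaultArguments) is to give the flag an explicit value so the parser never consumes the next flag as its value:
args = {
    "--s3_bucket": 'bucket-name',
    "--s3_temp_prefix": 'TEMP',
    "--s3_schema_prefix": 'SCHEMA',
    "--s3_processed_prefix": 'PROCESSED',
    "--ingestion_run_id": 'FakeRun',
    # explicit value instead of "" so no argument is left without a value
    "--enable-continuous-cloudwatch-log": "true"
}
client.start_job_run(JobName=name, Arguments=args)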

Error initializing SparkContext when using SPARK-SHELL in spark standalone

I have installed Scala.
I have installed Java 8.
All environment variables have been set for Spark, Java and Hadoop.
Still getting this error while running the spark-shell command. Please, someone help... I googled it a lot but didn't find anything.
(screenshots: spark-shell error, spark-shell error 2)
Spark's shell provides a simple way to learn the API. Start the shell by running the following in the Spark directory:
./bin/spark-shell
Then run the Scala code snippet below:
import org.apache.spark.sql.SparkSession
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
If the error still persists, then we have to look into the environment setup (the Spark, Java and Hadoop variables you mention).

Unable to use a local file using spark-submit

I am trying to execute a Spark word count program. My input file and output directory are local, not on HDFS. When I execute the code, I get an "input directory not found" exception.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object WordCount {
  val sparkConf = new SparkConf()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(sparkConf).master("yarn").getOrCreate()
    val input = args(0)
    val output = args(1)
    val text = spark.sparkContext.textFile("input", 1)
    val outPath = text.flatMap(line => line.split(" "))
    val words = outPath.map(w => (w, 1))
    val wc = words.reduceByKey((x, y) => (x + y))
    wc.saveAsTextFile("output")
  }
}
Spark Submit:
spark-submit --class com.practice.WordCount sparkwordcount_2.11-0.1.jar --files home/hmusr/ReconTest/inputdir/sample /home/hmusr/ReconTest/inputdir/wordout
I am using the --files option to ship the local input file and point the output to the output directory in spark-submit. When I submit the jar using spark-submit, it says the input path does not exist:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://dev/user/hmusr/input
Could anyone let me know what mistake I am making here?
A couple of things:
val text = spark.sparkContext.textFile(input, 1)
To use the variable, remove the double quotes: it is input, not "input" (and likewise use output, not "output", in saveAsTextFile).
You expect input and output as arguments, so pass them after the jar in spark-submit (without --files), and use local as the master.
Also, use file:// to refer to local files.
Your spark-submit should look something like:
spark-submit --master local[2] \
--class com.practice.WordCount \
sparkwordcount_2.11-0.1.jar \
file:///home/hmusr/ReconTest/inputdir/sample \
file:///home/hmusr/ReconTest/inputdir/wordout

Unable to access HBase Via API

I have HBase installed across three nodes. I am trying to load data into HBase via Spark with the help of the code below.
from __future__ import print_function
import sys
from pyspark import SparkContext
import json

if __name__ == "__main__":
    print("*******************************")
    sc = SparkContext(appName="HBaseOutputFormat")
    host = sys.argv[1]
    table = "hbase_test"
    port = "2181"
    conf = {"hbase.zookeeper.quorum": host,
            "hbase.mapred.outputtable": table,
            "hbase.zookeeper.property.clientPort": port,
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
    rdd = sc.parallelize([sys.argv[2:]]).map(lambda x: (x[0], x))
    print(rdd.collect())
    rdd.saveAsNewAPIHadoopDataset(
        conf=conf,
        keyConverter=keyConv,
        valueConverter=valueConv)
    sc.stop()
I am executing the code as:
spark-submit --driver-class-path /usr/iop/4.3.0.0-0000/hbase/lib/hbase-server.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-common.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-client.jar:/usr/iop/4.3.0.0-0000/hbase/lib/zookeeper.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-protocol.jar:/usr/iop/4.3.0.0-0000/spark2/examples/jars/scopt_2.11-3.3.0.jar:/home/tanveer/spark-examples_2.10-1.1.0.jar --conf spark.ui.port=5054 --master local[2] /data/usr/tanveer/from_home/spark/hbase_outputformat.py HBASE_MASTER_ip row1 f1 q1 value1
But the job gets stuck and doesn't proceed.
As suggested in some previous threads, I tried commenting out the localhost line in /etc/hosts, but it didn't work.
Requesting your help.
On further debugging, I referred to the Hortonworks blog post below for best practices:
https://community.hortonworks.com/articles/4091/hbase-client-application-best-practices.html
I added the HBase configuration directory to the driver classpath, ran the code, and it worked perfectly fine.
The modified spark-submit can be viewed as:
spark-submit --driver-class-path /usr/iop/4.3.0.0-0000/hbase/lib/hbase-server.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-common.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-client.jar:/usr/iop/4.3.0.0-0000/hbase/lib/zookeeper.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-protocol.jar:/usr/iop/4.3.0.0-0000/spark2/examples/jars/scopt_2.11-3.3.0.jar:/home/tanveer/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf --conf spark.ui.port=5054 --master local[2] /data/usr/tanveer/from_home/spark/hbase_outputformat.py host row1 f1 q1 value1

Getting parameters of Spark submit while running a Spark job

I am running a Spark job via spark-submit and using its --files parameter to load a log4j.properties file.
In my Spark job I need to get this parameter:
import org.apache.spark.{SparkConf, SparkContext}

object LoggerSparkUsage {
  def main(args: Array[String]): Unit = {
    //DriverHolder.log.info("unspark")
    println("args are....." + args.mkString(" "))
    val conf = new SparkConf().setAppName("Simple_Application") //.setMaster("local[4]")
    val sc = new SparkContext(conf)
    // conf.getExecutorEnv.
    val count = sc.parallelize(Array(1, 2, 3)).count()
    println("these are files" + conf.get("files"))
    LoggerDriver.log.info("log1 for info..")
    LoggerDriver.log.info("log2 for infor..")
    f2
  }

  def f2 { LoggerDriver.log.info("logs from another function..") }
}
My spark-submit is something like this:
/opt/mapr/spark/spark-1.6.1/bin/spark-submit --class "LoggerSparkUsage" --master yarn-client --files src/main/resources/log4j.properties /mapr/cellos-mapr/user/mbazarganigilani/SprkHbase/target/scala-2.10/sprkhbase_2.10-1.0.2.jar
I tried to get the properties using
conf.get("files")
but it gives me an exception.
Can anyone give me a solution for this?
The correct key for files is spark.files:
scala.util.Try(sc.getConf.get("spark.files"))
but to get actual path on the workers you have to use SparkFiles:
org.apache.spark.SparkFiles.get(fileName)
If that is not sufficient, you can pass the path as an application argument and retrieve it from the main args, or use a custom key in spark.conf.
