How to assume an AWS role in pyspark - apache-spark

I am currently using Spark 3.1, and I am using
spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key", config.access_id)
spark_context._jsc.hadoopConfiguration().set("fs.s3a.secret.key", config.access_key)
to read data from S3.
I have two questions here:
i) Which is the better way to run a PySpark job, with an AWS access key ID and secret key or by assuming a role, and why?
ii) Can someone please give an example of using an assumed role in a PySpark job?
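Regarding (ii), here is a minimal sketch of the assumed-role approach, assuming the Hadoop 3.x S3A connector (hadoop-aws) is on the classpath; the role ARN, bucket and path are placeholders, and on EC2/EMR you would typically let the instance profile supply the base credentials rather than raw keys:

hconf = spark_context._jsc.hadoopConfiguration()
# Use the S3A assumed-role credential provider instead of raw access keys
hconf.set("fs.s3a.aws.credentials.provider",
          "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
hconf.set("fs.s3a.assumed.role.arn", "arn:aws:iam::123456789012:role/my-read-role")
# Credentials used to call STS and assume the role (instance profile, env vars, or keys)
hconf.set("fs.s3a.assumed.role.credentials.provider",
          "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")

df = spark.read.parquet("s3a://my-bucket/some/path/")  # placeholder path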

Related

How to get cluster details like clusterID from code running on an HDInsight cluster

I need a ClusterInner object via the Azure API, or at least some cluster information such as the cluster ID.
But to get the ClusterInner object or the cluster ID, I need to provide an authentication object to the API. This code will be running on the same HDInsight cluster, so ideally it won't ask for credentials or rely on environment variables (my Spark job is already running on this cluster, and the job needs this information).
Is there any API or alternative way to get this information from the same running HDInsight cluster?
Editing my answer as per the comment.
This particular method is not a clean way, but you can get the details.
Note: this is applicable only to HDInsight clusters.
The deployment details can be extracted only from the head nodes. You have to look for the field server.jdbc.database_name in /etc/ambari-server/conf/ambari.properties. For example, if its value is v40e8b2c1e26279460ca3e8c0cbc75af8f8AmbariDb, then you can trim the first 3 characters and the last 8 characters of the string; what is left is your cluster ID.
You can use a Linux shell script within your job to extract the details from the file.
Below is the shell script:
#!/bin/bash
# Read the Ambari property that embeds the cluster id
string=$(sed -n 's/server.jdbc.database_name=//p' /etc/ambari-server/conf/ambari.properties)
# Drop the 3-character prefix and keep the 32-character id
POS=3
LEN=32
clusterid=${string:$POS:$LEN}
echo "$clusterid"
You can embed the script in Python or Java. I am using Python to achieve this:
import subprocess

# Run the shell script that extracts the cluster id from ambari.properties
subprocess.call(['sh', '/path/to/script.sh'])
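Alternatively, if you prefer not to shell out, the same parsing can be done directly in Python. A minimal sketch, assuming the code runs on a head node where ambari.properties is readable (the helper get_cluster_id is just illustrative):

AMBARI_PROPERTIES = "/etc/ambari-server/conf/ambari.properties"

def get_cluster_id(path=AMBARI_PROPERTIES):
    with open(path) as f:
        for line in f:
            if line.startswith("server.jdbc.database_name="):
                value = line.split("=", 1)[1].strip()
                # e.g. v40e8b2c1e26279460ca3e8c0cbc75af8f8AmbariDb -> drop first 3 and last 8 chars
                return value[3:-8]
    return None

print(get_cluster_id())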

Suggestions for using Spark in AWS without any cluster?

I want to leverage a couple of Spark APIs to convert some data in an EC2 container. We deploy these containers on Kubernetes.
I'm not super familiar with Spark, but I see there are requirements for a Spark context, etc. Is it possible for me to just leverage the Spark APIs/RDDs, etc. without needing any cluster? I just have a simple script I want to run that leverages Spark. I was thinking I could somehow fat-jar this dependency or something, but I'm not quite sure what I'm looking for.
Yes, you need a cluster to run Spark.
A cluster is nothing more than a platform on which Spark is installed.
I think your question should be "Can Spark run on a single/standalone node?"
If that is what you want to know, then yes: Spark can run on a single node, since it ships with its own standalone cluster manager (and a local mode).
"but I see there are requirements on Spark context":
SparkContext is the entry point of Spark; you need to create it in order to use any Spark functionality.
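For a simple script, a minimal sketch of single-node usage, assuming pyspark is installed in the container image; the app name and file paths are placeholders:

from pyspark.sql import SparkSession

# local[*] runs the driver and executors in a single JVM on the current machine,
# using all available cores; no separate cluster is required.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("single-node-conversion")
         .getOrCreate())

df = spark.read.json("/data/input.json")                  # placeholder input path
df.select("id", "value").write.parquet("/data/output/")   # placeholder columns/output
spark.stop()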

Airflow Livy job submission on AWS EMR

We are using Airflow to schedule our jobs on EMR, and we now want to use Apache Livy to submit Spark jobs via Airflow.
I need more guidance on the points below:
Which Airflow-Livy operator should we use for Python 3+ PySpark and Scala jobs?
I have seen the following:
https://github.com/rssanders3/airflow-spark-operator-plugin
and
https://github.com/panovvv/airflow-livy-operators
I want to know more about a stable Airflow Livy operator that people are using in production, ideally on an AWS stack.
A step-by-step installation guide for the integration would also help.
I would recommend using LivyOperator from https://github.com/apache/airflow/blob/master/airflow/providers/apache/livy/operators/livy.py
Currently, it is only available in master, but you could copy-paste the code and use it as a custom operator until we backport all the new operators to the Airflow 1.10.* series.
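A minimal DAG sketch using that operator, assuming the provider code is available (as a backported custom operator or via the apache-airflow-providers-apache-livy package) and that an Airflow connection named livy_default points at the Livy endpoint on the EMR master node; the S3 script path and arguments are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator

with DAG(
    dag_id="emr_livy_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_pyspark = LivyOperator(
        task_id="submit_pyspark_job",
        file="s3://my-bucket/jobs/etl_job.py",  # placeholder path to the PySpark script
        args=["--run-date", "{{ ds }}"],
        livy_conn_id="livy_default",
        polling_interval=30,  # poll Livy every 30 seconds until the batch finishes
    )

For a Scala/Java job, you would point file at the application jar and pass class_name instead.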

EMR pyspark trackable logging architecture

I am in the middle of building a PySpark application that fails a lot and has many jobs with many steps, so it is not practical to search by cluster ID and step ID. The current format in which Spark on EMR saves logs is below:
s3://bucket-name/logs/sparksteps/j-{clusterid}/steps/s-{stepid}/stderr.gz
I want something traceable in place of {clusterid} and {stepid}, such as clustername+datetime and step-name.
I looked at log4j.properties and it has something named DatePattern, but it is not saving anything with a datetime.
You could index the logs into an ELK cluster (managed or not) using Filebeat.
Or send the logs to CloudWatch Logs using a bootstrap script on the EMR cluster or a Lambda. You can then customize the log group and log stream names to your needs.
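A rough sketch of the CloudWatch Logs route with a small Lambda (or a script installed by a bootstrap action), assuming it has s3:GetObject and CloudWatch Logs permissions; the log group name and the way cluster_name and step_name are resolved (e.g. from cluster tags or job metadata) are placeholders:

import gzip
import time

import boto3

s3 = boto3.client("s3")
logs = boto3.client("logs")

def ship_step_log(bucket, key, cluster_name, step_name):
    # Build a traceable stream name: cluster name + date/time + step name
    group_name = "/emr/spark-steps"  # placeholder log group
    stream_name = f"{cluster_name}/{time.strftime('%Y-%m-%d-%H%M')}/{step_name}"

    # Fetch and decompress the stderr.gz written by the EMR step
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    text = gzip.decompress(body).decode("utf-8", errors="replace")

    try:
        logs.create_log_group(logGroupName=group_name)
    except logs.exceptions.ResourceAlreadyExistsException:
        pass
    try:
        logs.create_log_stream(logGroupName=group_name, logStreamName=stream_name)
    except logs.exceptions.ResourceAlreadyExistsException:
        pass

    now_ms = int(time.time() * 1000)
    events = [{"timestamp": now_ms, "message": line}
              for line in text.splitlines() if line]
    if events:
        # A single PutLogEvents batch is capped (10,000 events / ~1 MB); fine for a sketch
        logs.put_log_events(logGroupName=group_name,
                            logStreamName=stream_name,
                            logEvents=events[:10000])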

Can Spark access DynamoDB without EMR

I have a set of AWS instances where an Apache Hadoop distribution along with Apache Spark is set up.
I am trying to access DynamoDB through Spark Streaming for reading from and writing to a table. But while writing the Spark-DynamoDB code, I learned that emr-ddb-hadoop.jar is required to get the DynamoDB InputFormat and OutputFormat, and that jar is present only on an EMR cluster.
After checking a few blogs, it seems that this is accessible only with Spark on EMR.
Is that correct?
However, I used the standalone Java SDK to access DynamoDB, and that worked fine.
I found the solution to the problem.
I downloaded the emr-ddb-hadoop.jar file from EMR and am using it in my environment.
Please note: to access DynamoDB this way, only the above jar is needed.
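A minimal read sketch from PySpark using that jar, assuming the EMR DynamoDB connector's class and property names (DynamoDBInputFormat, dynamodb.input.tableName, etc.) match the version you downloaded; the table name, region and endpoint are placeholders:

# Submit with the connector on the classpath, e.g.:
#   spark-submit --jars /path/to/emr-ddb-hadoop.jar my_job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ddb-read").getOrCreate()
sc = spark.sparkContext

conf = {
    "dynamodb.input.tableName": "my-table",                  # placeholder table
    "dynamodb.servicename": "dynamodb",
    "dynamodb.regionid": "us-east-1",                        # placeholder region
    "dynamodb.endpoint": "dynamodb.us-east-1.amazonaws.com",
}

# Each record comes back as a (Text, DynamoDBItemWritable) pair from the InputFormat
rdd = sc.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=conf,
)
print(rdd.count())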
