Reuse a cached Spark dataset loaded in a previous AWS EMR step - apache-spark

I am using AWS EMR and Spark for processing data on S3. My use-case is to access the same data in a new EMR step. Can this be achieved using dataset.persist()?
These are the set of steps:
EMR Step start
dataset = sqlContext.read().textFile("s3a://path/to/folder")
dataset.persist()
EMR Step complete
New EMR Step started
newDataset = sqlContext.read().textFile("s3a://path/to/folder")
In such a case, will Spark read all the data from the S3 path again or will it use the data available in-memory because of the call to persist()?
If not, is there another way of accessing the cached data?
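A persisted dataset lives in the executors of the running Spark application, and each EMR step typically runs as its own spark-submit application, so the cache would not normally survive into a new step. A common workaround is to write an intermediate copy to S3 (or HDFS) at the end of one step and read it back in the next. A rough PySpark sketch, where the intermediate path is a hypothetical location:
# Step 1: read the raw text and save an intermediate, query-friendly copy
dataset = spark.read.text("s3a://path/to/folder")
dataset.write.mode("overwrite").parquet("s3a://path/to/intermediate")  # hypothetical path
# Step 2 (a separate application): read the intermediate copy instead of the raw data
newDataset = spark.read.parquet("s3a://path/to/intermediate")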

Related

Spark NiFi site to site connection

I am new to NiFi. I am trying to send data from NiFi to Spark, i.e. to establish a stream from a NiFi output port to Spark, according to this tutorial.
NiFi is running on Kubernetes, and I am using the Spark operator on the same cluster to submit my applications.
It seems like Spark is able to reach the NiFi web API and it starts a streaming receiver. However, no data is coming into the Spark app through the output port and I get empty RDDs. I have not seen any warnings or errors in the Spark logs.
Any idea or information that could help me solve this issue is appreciated.
My code:
import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.NiFiReceiver
import org.apache.spark.storage.StorageLevel

// Site-to-site client configuration for the secured NiFi instance
val conf = new SiteToSiteClient.Builder()
  .keystoreFilename("..")
  .keystorePass("...")
  .keystoreType(...)
  .truststoreFilename("..")
  .truststorePass("..")
  .truststoreType(...)
  .url("https://...../nifi")
  .portName("spark")
  .buildConfig()

// Receiver stream backed by the NiFi output port named above
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))

Connect to Databricks managed Hive from outside

I have:
An existing Databricks cluster
Azure blob store (wasb) mounted to HDFS
A Database with its LOCATION set to a path on wasb (via mount path)
A Delta table (Which ultimately writes Delta-formatted parquet files to blob store path)
A Kubernetes cluster that reads and writes data in Parquet and/or Delta format within the same Azure blob store that Databricks uses (writing in Delta format via spark-submit PySpark jobs)
What I want to do:
Utilize the managed Hive metastore in Databricks to act as data catalog for all data within Azure blob store
To this end, I'd like to connect to the metastore from my outside pyspark job such that I can use consistent code to have a catalog that accurately represents my data.
In other words, if I were to prep my db from within Databricks:
dbutils.fs.mount(
    source = "wasbs://container@storage.blob.core.windows.net",
    mount_point = "/mnt/db",
    extra_configs = {..})
spark.sql('CREATE DATABASE db LOCATION "/mnt/db"')
Then from my Kubernetes pyspark cluster, I'd like to execute
df.write.mode('overwrite').format("delta").saveAsTable("db.table_name")
Which should write the data to wasbs://container@storage.blob.core.windows.net/db/table_name as well as register this table with Hive (and thus make it queryable with HiveQL).
How do I connect to the Databricks managed Hive from a PySpark session outside of the Databricks environment?
This doesn't answer my question (I don't think it's possible), but it mostly solves my problem: Writing a crawler to create tables from delta files.
Mount Blob container and create a DB as in question
Write a file in delta format from anywhere:
df.write.mode('overwrite').format("delta").save("/mnt/db/table")  # equivalently, save to wasb:..../db/table
Create a Notebook, schedule it as a job to run regularly
import os

# Recursively yield directories that contain a Delta _delta_log folder
def find_delta_dirs(ls_path):
    for dir_path in dbutils.fs.ls(ls_path):
        if dir_path.isFile():
            pass
        elif dir_path.isDir() and ls_path != dir_path.path:
            if dir_path.path.endswith("_delta_log/"):
                yield os.path.dirname(os.path.dirname(dir_path.path))
            yield from find_delta_dirs(dir_path.path)

# Turn a blob path under the mount into a flat table name
def fmt_name(full_blob_path, mount_path):
    relative_path = full_blob_path.split(mount_path)[-1].strip("/")
    return relative_path.replace("/", "_")

db_name = "db"  # the database created above
db_mount_path = "/mnt/db"

for path in find_delta_dirs(db_mount_path):
    spark.sql(f"CREATE TABLE IF NOT EXISTS {db_name}.{fmt_name(path, db_mount_path)} USING DELTA LOCATION '{path}'")

fs.s3 configuration with two s3 account with EMR

I have a pipeline using Lambda and EMR, where I read CSV from an S3 bucket in account A and write Parquet to an S3 bucket in account B.
I created the EMR cluster in account B, and it has access to S3 in account B.
I cannot add access to account A's S3 bucket to EMR_EC2_DefaultRole (as that account is enterprise-wide data storage), so I use an access key, secret key, and session token to access account A's S3 bucket. These are obtained through a Cognito token.
METHOD 1:
I am using the fs.s3 protocol to read the CSV from S3 in account A and writing to S3 in account B.
I have PySpark code which reads from S3 (A) and writes Parquet to S3 (B). I submit 100 jobs at a time. This PySpark code runs on EMR.
Reading, using the following settings:
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.set("fs.s3.awsAccessKeyId", dl_access_key)
hadoop_config.set("fs.s3.awsSecretAccessKey", dl_secret_key)
hadoop_config.set("fs.s3.awsSessionToken", dl_session_key)
spark_df_csv = spark_session.read.option("Header", "True").csv("s3://somepath")
Writing:
I am using the s3a protocol, s3a://some_bucket/
It works, but sometimes I see a _temporary folder present in the S3 bucket and not all CSVs converted to Parquet.
When I enable EMR step concurrency of 256 (EMR 5.28) and submit 100 jobs, I get a _temporary rename error.
Issues:
This method creates a _temporary folder and sometimes does not delete it; I can see the _temporary folder left in the S3 bucket.
When I enable EMR step concurrency (EMR latest version 5.28), which allows steps to run in parallel, I get a rename _temporary error for some of the files.
METHOD 2:
I feel s3a is not good for parallel jobs, so I want to read and write using fs.s3, as it has better file committers.
So I did this: initially I set the Hadoop configuration as above for account A, and then unset the configuration so that it falls back to the default account B S3 access.
Like this:
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.unset("fs.s3.awsAccessKeyId")
hadoop_config.unset("fs.s3.awsSecretAccessKey")
hadoop_config.unset("fs.s3.awsSessionToken")
spark_df_csv.repartition(1).write.partitionBy(['org_id', 'institution_id']) \
    .mode('append').parquet(write_path)
Issues:
This works, but the issue is: say I trigger the Lambda, which in turn submits jobs for 100 files (in a loop); some 10-odd files result in access denied while writing to the S3 bucket.
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n ... 1 more\nCaused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service:
This could be because the unset is not working sometimes, or because with parallel runs the Spark context/session set and unset happen concurrently. I mean the Spark context for one job is unsetting the Hadoop configuration while another is setting it, which may cause this issue, though I am not sure how Spark contexts work in parallel.
Doesn't each job have a separate Spark context and session?
Please suggest alternatives for my situation.
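One alternative sometimes suggested for cross-account setups like this is S3A per-bucket configuration, which scopes the account A credentials to that bucket only, so there is no cluster-wide set/unset to race between concurrent jobs. A minimal sketch, assuming the EMR release ships a Hadoop version with per-bucket s3a settings (2.8+); account-a-bucket and account-b-bucket are placeholder bucket names:
hadoop_config = sc._jsc.hadoopConfiguration()
# These credentials apply only to the account A bucket; the account B bucket
# keeps using the instance-profile credentials from EMR_EC2_DefaultRole.
hadoop_config.set("fs.s3a.bucket.account-a-bucket.access.key", dl_access_key)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.secret.key", dl_secret_key)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.session.token", dl_session_key)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.aws.credentials.provider",
                  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
spark_df_csv = spark_session.read.option("header", "true").csv("s3a://account-a-bucket/somepath")
spark_df_csv.write.mode("append").parquet("s3a://account-b-bucket/write_path")
For the _temporary rename failures under high concurrency, the usual recommendation is to move off the default FileOutputCommitter, for example to the S3A committers (Hadoop 3.x) or the EMRFS S3-optimized committer that newer EMR releases enable for Parquet.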

What does "avoid multiple Kudu clients per cluster" mean?

I am looking at Kudu's documentation.
Below is a partial description of kudu-spark.
https://kudu.apache.org/docs/developing.html#_avoid_multiple_kudu_clients_per_cluster
Avoid multiple Kudu clients per cluster.
One common Kudu-Spark coding error is instantiating extra KuduClient objects. In kudu-spark, a KuduClient is owned by the KuduContext. Spark application code should not create another KuduClient connecting to the same cluster. Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.
To diagnose multiple KuduClient instances in a Spark job, look for signs in the logs of the master being overloaded by many GetTableLocations or GetTabletLocations requests coming from different clients, usually around the same time. This symptom is especially likely in Spark Streaming code, where creating a KuduClient per task will result in periodic waves of master requests from new clients.
Does this mean that I can only run one kudu-spark task at a time?
If I have a Spark Streaming program that is always writing data to Kudu, how can I connect to Kudu from other Spark programs?
In a non-Spark program you use a KuduClient to access Kudu. In a Spark app you use a KuduContext, which already owns such a client for that Kudu cluster.
A simple Java program requires a KuduClient, using the Java API and a Maven approach:
KuduClient kuduClient = new KuduClient.KuduClientBuilder("kudu-master-hostname").build();
See http://harshj.com/writing-a-simple-kudu-java-api-program/
A Spark/Scala program, of which many can run at the same time against the same cluster using the Spark Kudu integration. Snippet borrowed from the official guide, as it has been quite some time since I looked at this:
import org.apache.kudu.client._
import collection.JavaConverters._
// Read a table from Kudu
val df = spark.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .format("kudu").load
// Query using the Spark API...
df.select("id").filter("id >= 5").show()
// ...or register a temporary table and use SQL
df.registerTempTable("kudu_table")
val filteredDF = spark.sql("select id from kudu_table where id >= 5").show()
// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)
// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"),
  new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("key").asJava, 3))
// Insert data
kuduContext.insertRows(df, "test_table")
See https://kudu.apache.org/docs/developing.html
A clearer statement of "avoid multiple Kudu clients per cluster" would be "avoid multiple Kudu clients per Spark application".
Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.
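For completeness, the same read can also be issued from PySpark, with each application owning its own context/client, so several applications can safely run against one cluster. A rough sketch, assuming a kudu-spark package on the classpath that registers the short "kudu" format name (older releases use org.apache.kudu.spark.kudu), and treating the master address and table name as placeholders:
# Read a Kudu table through the kudu-spark data source
df = (spark.read
      .format("kudu")
      .option("kudu.master", "kudu.master:7051")
      .option("kudu.table", "kudu_table")
      .load())
df.filter("id >= 5").select("id").show()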

Redis and Pyspark configuration

We have one EC2 test VM with a Spark master and 3 Spark workers. What configuration needs to be done for Redis to work with PySpark? Thanks.
1) Make a zip file of the redis Python module
2) Use PySpark's addPyFile as below
sc.addPyFile("/path/to/redis.zip")
Reference: Write data to Redis from PySpark
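With the zipped redis module shipped to the executors, writes can then happen per partition in worker code. A minimal sketch, assuming a DataFrame df with string columns key and value, and a Redis instance reachable from the workers at the hypothetical address redis-host:6379:
import redis
def write_partition(rows):
    # One connection per partition, opened on the executor
    r = redis.StrictRedis(host="redis-host", port=6379, db=0)
    for row in rows:
        r.set(row["key"], row["value"])
# df is assumed to already exist with "key" and "value" columns
df.foreachPartition(write_partition)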
