Spark + EMRFS/S3 - Is there a way to read client-side encrypted data and write it back using server-side encryption?

I have a use-case in Spark where I have to read data from an S3 bucket that uses client-side encryption, process it, and write it back using only server-side encryption. I'm wondering if there's a way to do this in Spark?
Currently, I have these options set:
spark.hadoop.fs.s3.cse.enabled=true
spark.hadoop.fs.s3.enableServerSideEncryption=true
spark.hadoop.fs.s3.serverSideEncryption.kms.keyId=<kms id here>
But obviously, it ends up using both CSE and SSE while writing the data. So, I'm wondering if it's possible to somehow set spark.hadoop.fs.s3.cse.enabled to true only while reading and then set it to false for writing, or whether there is another alternative.
Thanks for the help.

One option is to use programmatic configuration to define a second S3 filesystem scheme:
spark.hadoop.fs.s3.cse.enabled=true
spark.hadoop.fs.s3sse.impl=foo.bar.S3SseFilesystem
and then add a custom implementation for s3sse:
package foo.bar
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.s3a.S3AFileSystem
class S3SseFilesystem extends S3AFileSystem {
  override def initialize(name: URI, originalConf: Configuration): Unit = {
    // Build a fresh configuration so the CSE flag set on the default
    // fs.s3 scheme does not leak into this filesystem.
    val conf = new Configuration()
    // NOTE: no spark.hadoop prefix here
    conf.set("fs.s3.enableServerSideEncryption", "true")
    conf.set("fs.s3.serverSideEncryption.kms.keyId", "<kms id here>")
    super.initialize(name, conf)
  }
}
After this, the custom filesystem can be used with Spark's read method:
spark.read.json("s3sse://bucket/prefix")
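Writes through the new scheme then carry only the SSE settings. As a minimal sketch, for a DataFrame df (the output prefix is a placeholder):
df.write.json("s3sse://bucket/output-prefix")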

Related

Running Custom Java Class in PySpark on EMR

I am attempting to utilize the Cerner Bunsen package for FHIR processing in PySpark on AWS EMR, specifically the Bundles class and its methods. I am creating the Spark session using the Apache Livy API:
import json
import logging
import requests

def create_spark_session(master_dns, kind, jars):
    # 8998 is the port on which the Livy server runs
    host = 'http://' + master_dns + ':8998'
    data = {'kind': kind, 'jars': jars}
    headers = {'Content-Type': 'application/json'}
    response = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
    logging.info(response.json())
    return response.headers
where kind = pyspark3 and jars is an S3 location that houses the jar (bunsen-shaded-1.4.7.jar).
The data transformation is attempting to import the jar and call the methods via:
from pyspark import SparkContext

# Setting the Spark session and pulling the existing SparkContext
sc = SparkContext.getOrCreate()
# Cerner Bunsen
from py4j.java_gateway import java_import, JavaGateway
java_import(sc._gateway.jvm, "com.cerner.bunsen.Bundles")
func = sc._gateway.jvm.Bundles()
The error I am receiving is:
py4j.protocol.Py4JError: An error occurred while calling None.com.cerner.bunsen.Bundles. Trace:
py4j.Py4JException: Constructor com.cerner.bunsen.Bundles([]) does not exist
This is the first time I have attempted to use java_import so any help would be appreciated.
EDIT: I changed the transformation script slightly and am now seeing a different error. I can see the jar being added in the logs, so I am certain it is there and that the 'jars': jars functionality is working as intended. The new transformation is:
from pyspark import SparkContext

# Setting the Spark session and pulling the existing SparkContext
sc = SparkContext.getOrCreate()
# Manage logging
#sc.setLogLevel("INFO")
# Cerner Bunsen
from py4j.java_gateway import java_import, JavaGateway
java_import(sc._gateway.jvm, "com.cerner.bunsen")
func_main = sc._gateway.jvm.Bundles
func_deep = sc._gateway.jvm.Bundles.BundleContainer
fhir_data_frame = func_deep.loadFromDirectory(spark, "s3://<bucket>/source_database/Patient", 1)
fhir_data_frame_fromJson = func_deep.fromJson(fhir_data_frame)
fhir_data_frame_clean = func_main.extract_entry(spark, fhir_data_frame_fromJson, 'patient')
fhir_data_frame_clean.show(20, False)
and the new error is:
'JavaPackage' object is not callable
Searching for this error has been a bit futile, but again, if anyone has ideas I will gladly take them.
If you want to use a Scala/Java class from PySpark, you also have to add the jar to the classpath. You can do it in two different ways:
Option 1: In spark-submit, with the --jars flag (note that the flag must come before the application script, otherwise it is passed to the application as an argument):
spark-submit --jars /path/to/bunsen-shaded-1.4.7.jar example.py
Option 2: Add it to the spark-defaults.conf file. Add the following line in path/to/spark/conf/spark-defaults.conf:
# Comma-separated list of jars include on the driver and executor classpaths.
spark.jars /path/to/bunsen-shaded-1.4.7.jar
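A third route, if you create the session yourself in PySpark rather than through Livy, is to pass the same property through the session builder. A minimal sketch (the app name and jar path are placeholders):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bunsen-example")  # hypothetical app name
         .config("spark.jars", "/path/to/bunsen-shaded-1.4.7.jar")  # same property as in spark-defaults.conf
         .getOrCreate())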

Hide a spark property from displaying in the spark web UI without implementing a security filter

The application web UI at http://<driver>:4040 lists Spark properties in the "Environment" tab. All values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. However, for security reasons, I do not want my Cassandra password to display in the web UI. Is there some sort of switch to ensure that certain Spark properties are not displayed?
Please note, I see some solutions that suggest implementing a security filter and using spark.ui.filters setting to refer to the class. I am hoping to avoid this complexity.
I think there is no general solution for hiding a custom property from the Spark web UI in earlier releases.
I assume you are using Spark 2.0 or below (I have not seen the feature described below in 2.0), because 2.0.1 supports preprocessing passwords to "*****".
Check issue SPARK-16796 Visible passwords on Spark environment page
If we take a look into the Apache Spark source code and do some investigation, we can see how a property gets "hidden" in the Spark web UI.
SparkUI
By default, the Environment page is attached during initialization: attachTab(new EnvironmentTab(this)) [line 71].
EnvironmentPage renders the properties as a tab in the web GUI as follows:
def render(request: HttpServletRequest): Seq[Node] = {
  val runtimeInformationTable = UIUtils.listingTable(
    propertyHeader, jvmRow, listener.jvmInformation, fixedWidth = true)
  val sparkPropertiesTable = UIUtils.listingTable(
    propertyHeader, propertyRow, listener.sparkProperties.map(removePass), fixedWidth = true)
  val systemPropertiesTable = UIUtils.listingTable(
    propertyHeader, propertyRow, listener.systemProperties, fixedWidth = true)
  val classpathEntriesTable = UIUtils.listingTable(
    classPathHeaders, classPathRow, listener.classpathEntries, fixedWidth = true)
  val content =
    <span>
      <h4>Runtime Information</h4> {runtimeInformationTable}
      <h4>Spark Properties</h4> {sparkPropertiesTable}
      <h4>System Properties</h4> {systemPropertiesTable}
      <h4>Classpath Entries</h4> {classpathEntriesTable}
    </span>
  UIUtils.headerSparkPage("Environment", content, parent)
}
All properties are rendered without any hiding preprocessing, except sparkProperties, which go through the functionality provided in removePass:
private def removePass(kv: (String, String)): (String, String) = {
  if (kv._1.toLowerCase.contains("password")) (kv._1, "******") else kv
}
As we can see, every key that contains "password" is masked (by the way, in the master branch they also filter keys containing the keyword "secret"; check the source if you are interested).
I cannot test it now, but you could try patching Spark so that, e.g., SparkSubmitArguments.scala in mergeDefaultSparkProperties() treats spark.cassandra.auth.password as a Spark property and populates it into sparkProperties (with removePass preprocessing).
At the end of the day, the property should then show up as ****** in the EnvironmentTab of the web GUI.
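As an aside for newer releases: from Spark 2.2 onward, the Environment page masks any key matching the spark.redaction.regex property (default (?i)secret|password), so a key like spark.cassandra.auth.password is already redacted by default. If your sensitive key does not match, you can widen the regex, e.g. in spark-defaults.conf:
spark.redaction.regex (?i)secret|password|credential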

Query Spark SQL from Node.js server

I'm currently using npm's cassandra-driver to query my Cassandra database from a Node.js server. Since I want to be able to write more complex queries, I'd like to use Spark SQL instead of CQL. Is there any way to create a RESTful API (or something else) so that I can use Spark SQL the same way that I currently use CQL?
In other words, I want to be able to send a Spark SQL query from my Node.js server to another server and get a result back.
Is there any way to do this? I've been searching for solutions to this problem for a while and haven't found anything yet.
Edit: I'm able to query my database with Scala and Spark SQL from the Spark shell, so that bit is working. I just need to connect Spark and my Node.js server somehow.
I had a similar problem, and I solved it by using Spark-JobServer.
The main approach with Spark-Jobserver (SJS) usually is to create a special job that extends their SparkSQLJob such as in the following example:
object ExecuteQuery extends SparkSQLJob {
  override def validate(sqlContext: SQLContext, config: Config): SparkJobValidation = {
    // Code to validate the parameters received in the request body
  }

  override def runJob(sqlContext: SQLContext, jobConfig: Config): Any = {
    // Assuming your request sent a { "query": "..." } in the body:
    val df = sqlContext.sql(jobConfig.getString("query"))
    createResponseFromDataFrame(df) // You should implement this
  }
}
However, for this approach to work well with Cassandra, you have to use the spark-cassandra-connector, and then you will have two options to load the data:
1) Before calling this ExecuteQuery via REST, you have to transfer the full data you want to query from Cassandra to Spark. For that, you would do something like (code adapted from the spark-cassandra-connector documentation):
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()
And then register it as a table so that Spark SQL can access it:
df.registerTempTable("myTable") // As a temporary table
df.write.saveAsTable("myTable") // As a persistent Hive table
Only after that would you be able to use ExecuteQuery to query from myTable.
2) As the first option can be inefficient in some use cases, there is another option.
The spark-cassandra-connector has a special CassandraSQLContext that can be used to query C* tables directly from Spark. It can be used like:
val cc = new CassandraSQLContext(sc)
val df = cc.sql("SELECT * FROM keyspace.table ...")
However, to use a different type of context with Spark-JobServer, you need to extend SparkContextFactory and use it at the moment of context creation (which can be done by a POST request to /contexts). An example of a special context factory can be seen on the SJS GitHub. You also have to create a SparkCassandraJob, extending SparkJob (but this part is very easy).
Finally, the ExecuteQuery job has to be adapted to use the new classes. It would be something like:
object ExecuteQuery extends SparkCassandraJob {
  override def validate(cc: CassandraSQLContext, config: Config): SparkJobValidation = {
    // Code to validate the parameters received in the request body
  }

  override def runJob(cc: CassandraSQLContext, jobConfig: Config): Any = {
    // Assuming your request sent a { "query": "..." } in the body:
    val df = cc.sql(jobConfig.getString("query"))
    createResponseFromDataFrame(df) // You should implement this
  }
}
After that, the ExecuteQuery job can be executed via REST with a POST request.
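From Node.js (or any HTTP client) this is a plain REST call. As a minimal sketch in Python, assuming the job jar was uploaded to SJS under the app name my-app and a SQL context named sql-context was already created (the host, names, and query are placeholders):
import requests

# Submit the ExecuteQuery job synchronously through Spark-JobServer's REST API
resp = requests.post(
    "http://sjs-host:8090/jobs",
    params={
        "appName": "my-app",          # jar previously uploaded to SJS
        "classPath": "ExecuteQuery",  # the job object defined above
        "context": "sql-context",     # the pre-created context
        "sync": "true",               # wait for the result instead of polling
    },
    data='query = "SELECT * FROM myTable"',  # body read via jobConfig.getString("query")
)
print(resp.json())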
Conclusion
Here I use the first option because I need the advanced queries available in the HiveContext (window functions, for example), which are not available in the CassandraSQLContext. However, if you don't need those kinds of operations, I recommend the second approach, even if it needs some extra coding to create a new ContextFactory for SJS.

MemSQL Spark Job

I am trying to read a CSV file in a Spark job using the MemSQL Extractor, do some enrichment using a Transformer, and load it into a MemSQL database using Java.
I see there is a memsql-spark interface jar, but I am not finding any useful Java API documentation or examples.
I have started writing the extractor to read from CSV, but I don't know how to move further.
public Option<RDD<byte[]>> nextRDD(SparkContext sparkContext, UserExtractConfig config, long batchInterval, PhaseLogger logger) {
    RDD<String> inputFile = sparkContext.textFile(filePath, minPartitions);
    RDD<byte[]> bytes = inputFile.map(ByteUtils.utf8StringToBytes(filePath), String.class); // compilation error
    return bytes; // compilation error
}
Would appreciate it if someone could point me in some direction to get started...
thanks...
First, configure the Spark connector in Java using the following code:
SparkConf conf = new SparkConf();
conf.set("spark.datasource.singlestore.clientEndpoint", "singlestore-host");
conf.set("spark.datasource.singlestore.user", "admin");
conf.set("spark.datasource.singlestore.password", "s3cur3-pa$$word");
With that configuration in place, Spark can connect to the database from Java. You can then read the CSV into a Spark DataFrame, transform and manipulate the data as required, and write the DataFrame to a database table.
Also attaching a link for your reference:
spark-singlestore.
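To make the read-transform-write flow concrete, here is a minimal PySpark sketch; the "singlestore" format name comes from the connector linked above, while the input path, table name, and transformations are placeholders (the same DataFrame API is available from Java):
# Read the CSV into a DataFrame (hypothetical input path)
df = spark.read.option("header", "true").csv("s3://bucket/input.csv")

# ... enrichment / transformations go here ...

# Write the result to a database table via the connector (hypothetical database.table)
df.write.format("singlestore").save("mydb.mytable")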

Is it possible to get the current spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current SparkContext.
If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?
Spark 2.1+
spark.sparkContext.getConf().getAll(), where spark is your SparkSession (this gives you a list of (key, value) pairs with all configured settings).
Yes: sc.getConf().getAll()
Which uses the method:
SparkConf.getAll()
as accessed by
SparkContext.sc.getConf()
See it in action:
In [4]: sc.getConf().getAll()
Out[4]:
[(u'spark.master', u'local'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.app.name', u'PySparkShell')]
Update configuration in Spark 2.3.1
To change the default Spark configurations, you can follow these steps:
Import the required classes
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
Get the default configurations
spark.sparkContext._conf.getAll()
Update the default configurations
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Spark 1.6+ (Scala)
sc.getConf.getAll.foreach(println)
For a complete overview of your Spark environment and configuration I found the following code snippets useful:
SparkContext:
for item in sorted(sc._conf.getAll()): print(item)
Hadoop Configuration:
hadoopConf = {}
iterator = sc._jsc.hadoopConfiguration().iterator()
while iterator.hasNext():
    prop = iterator.next()
    hadoopConf[prop.getKey()] = prop.getValue()
for item in sorted(hadoopConf.items()): print(item)
Environment variables:
import os
for item in sorted(os.environ.items()): print(item)
Simply running
sc.getConf().getAll()
should give you a list with all settings.
Unfortunately, no, the Spark platform as of version 2.3.1 does not provide any way to programmatically access the value of every property at run time. It provides several methods to access the values of properties that were explicitly set through a configuration file (like spark-defaults.conf), set through the SparkConf object when you created the session, or set through the command line when you submitted the job, but none of these methods will show the default value for a property that was not explicitly set. For completeness, the best options are:
The Spark application’s web UI, usually at http://<driver>:4040, has an “Environment” tab with a property value table.
The SparkContext keeps a hidden reference to its configuration in PySpark, and the configuration provides a getAll method: spark.sparkContext._conf.getAll().
Spark SQL provides the SET command that will return a table of property values: spark.sql("SET").toPandas(). You can also use SET -v to include a column with the property’s description.
(These three methods all return the same data on my cluster.)
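For instance, a short sketch of the Spark SQL route, assuming a SparkSession named spark and pandas installed:
# Explicitly-set properties as a pandas DataFrame
props = spark.sql("SET").toPandas()
# -v also includes defaults and a description column
props_verbose = spark.sql("SET -v").toPandas()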
For Spark 2+, you can also use the following when using Scala:
spark.conf.getAll; //spark as spark session
You can use:
sc.sparkContext.getConf.getAll
For example, I often have the following at the top of my Spark programs:
logger.info(sc.sparkContext.getConf.getAll.mkString("\n"))
Just for the record, the analogous Java version:
Tuple2<String, String>[] sc = sparkConf.getAll();
for (int i = 0; i < sc.length; i++) {
    System.out.println(sc[i]);
}
Suppose I want to increase the driver memory at runtime using the Spark session:
s2 = SparkSession.builder.config("spark.driver.memory", "29g").getOrCreate()
Now I want to view the updated settings:
s2.conf.get("spark.driver.memory")
To get all the settings, you can make use of spark.sparkContext._conf.getAll()
Hope this helps
Not sure if you can get all the default settings easily, but specifically for the worker dir, it's quite straightforward:
from pyspark import SparkFiles
print(SparkFiles.getRootDirectory())
If you want to see the configuration in Databricks, use the command below:
spark.sparkContext._conf.getAll()
I would suggest you try the method below in order to get the current Spark context settings:
SparkConf.getAll()
as accessed by
SparkContext.sc._conf
Specifically for Spark 2.1+, get the current configurations:
spark.sparkContext.getConf().getAll()
Stop the current Spark session:
spark.sparkContext.stop()
And create a new Spark session (with a conf object built as in the earlier answer):
spark = SparkSession.builder.config(conf=conf).getOrCreate()
