Redis and PySpark configuration - apache-spark

We have one EC2 test VM with a Spark master and 3 Spark workers. What configuration is needed for Redis to work with PySpark? Thanks.

1) Make a zip file of the redis Python module
2) Use PySpark's addPyFile as below:
sc.addPyFile("/path/to/redis.zip")
Reference: Write data to Redis from PySpark
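For completeness, here is a minimal sketch of how the zipped module could then be used on the executors; the Redis host, the DataFrame, and the column names are placeholders, not from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redis-example").getOrCreate()
sc = spark.sparkContext
sc.addPyFile("/path/to/redis.zip")  # ships the redis package to every executor

def write_partition(rows):
    # import inside the function so it resolves on the executors, where redis.zip was shipped
    import redis
    r = redis.StrictRedis(host="redis-host", port=6379)  # placeholder host/port
    for row in rows:
        r.set(row["key"], row["value"])                  # placeholder column names

df = spark.createDataFrame([("k1", "v1"), ("k2", "v2")], ["key", "value"])
df.foreachPartition(write_partition)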

Related

Spark NiFi site-to-site connection

I am new to NiFi. I am trying to send data from NiFi to Spark, or to establish a stream from a NiFi output port to Spark, according to this tutorial.
NiFi is running on Kubernetes, and I am using the Spark operator on the same cluster to submit my applications.
It seems like Spark is able to access the NiFi web UI and it starts a streaming receiver. However, data is not coming to the Spark app through the output port and I get empty RDDs. I have not seen any warnings or errors in the Spark logs.
Any idea or information that could help me solve this issue is appreciated.
My code:
val conf = new SiteToSiteClient.Builder()
.keystoreFilename("..")
.keystorePass("...")
.keystoreType(...)
.truststoreFilename("..")
.truststorePass("..")
.truststoreType(...)
.url("https://...../nifi")
.portName("spark")
.buildConfig()
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))

How can we add MySQL details as a property in PySpark?

When creating a SparkSession, there is a property to connect to Cassandra:
.config("spark.cassandra.connection.host", "ip-address")
which can be added directly while creating the SparkSession. Can we add the MySQL details in a similar way, so that we can avoid passing them in every Spark function?
No, there is no such option when connecting to MySQL. Cassandra has its own spark-cassandra-connector, while MySQL is accessed via JDBC, which requires the connection params to be passed as Java Properties.
They differ in configuration options and in how they work.
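That said, you can keep the JDBC details in one place (for example, a dict) and reuse them for each read or write. A minimal PySpark sketch, where the URL, driver, credentials, and table name are placeholders (and the MySQL JDBC driver jar is assumed to be on the classpath):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-jdbc-example").getOrCreate()

# Placeholder connection details; the dict plays the role of the Java Properties mentioned above
mysql_options = {
    "url": "jdbc:mysql://ip-address:3306/mydb",
    "driver": "com.mysql.cj.jdbc.Driver",
    "user": "username",
    "password": "password",
}

# Reading a table; the same options dict can be reused for every read or write
df = (spark.read.format("jdbc")
      .options(**mysql_options)
      .option("dbtable", "my_table")
      .load())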

Reuse a cached spark dataset loaded in previous AWS EMR step

I am using AWS EMR and Spark for processing data on S3. My use-case is to access the same data in a new EMR step. Can this be achieved using dataset.persist()?
These are the steps:
EMR Step start
dataset = sqlContext.read().textFile("s3a://path/to/folder")
dataset.persist()
EMR Step complete
New EMR Step started
newDataset = sqlContext.read().textFile("s3a://path/to/folder")
In such a case, will Spark read all the data from the S3 path again or will it use the data available in-memory because of the call to persist()?
If not, is there another way of accessing the cached data?

Apache Spark: resulting file being created at worker node instead of master node

I configured one master on my local PC and a worker node inside VirtualBox, and the result file is being created on the worker node instead of being sent back to the master node. I wonder why that is.
Is it because my worker node cannot send the result back to the master node? How can I verify that?
I use Spark 2.2.
I use the same username on the master and worker nodes.
I also configured passwordless SSH.
I tried --deploy-mode client and --deploy-mode cluster.
I tried once, then switched the master/worker nodes, and got the same result.
val result = joined.distinct()
result.write.mode("overwrite").format("csv")
.option("header", "true").option("delimiter", ";")
.save("file:///home/data/KPI/KpiDensite.csv")
Also, I load the input file like this:
val commerce = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true")
.option("delimiter", "|").load("file:///home/data/equip-serv-commerce-infra-2016.csv").distinct()
But why must I place the file on both the master and the worker node at the same path? I am not using YARN or Mesos right now.
You are exporting to a local file system, which tells Spark to write to the file system of the machine running the code. On a worker, that will be the file system of the worker machine.
If you want the data to be stored on the file system of the driver (not the master; you'll need to know where the driver is running in your cluster), then you need to collect the RDD or DataFrame and use normal IO code to write the data to a file.
The easiest option, however, especially if you're running your application in cluster mode, is to use a distributed storage system such as HDFS (.save("hdfs://master:port/data/KPI/KpiDensite.csv")) or to export to a database (writing via JDBC or using a NoSQL DB).
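A minimal sketch of the collect-and-write approach (shown in PySpark for brevity; the question's code is Scala, and result here stands in for the joined, de-duplicated DataFrame from the question):
import csv

rows = result.collect()  # pulls every row to the driver; only reasonable for small results

with open("/home/data/KPI/KpiDensite.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(result.columns)            # header
    writer.writerows([list(r) for r in rows])  # each Row behaves like a tuple of values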

How to process DynamoDB Stream in a Spark streaming application

I would like to consume a DynamoDB Stream from a Spark Streaming application.
Spark Streaming uses the KCL to read from Kinesis. There is a lib that makes the KCL able to read from a DynamoDB Stream: dynamodb-streams-kinesis-adapter.
But is it possible to plug this lib into Spark? Has anyone done this?
I'm using Spark 2.1.0.
My backup plan is to have another app reading from DynamoDB stream into a Kinesis stream.
Thanks
The way to do this is to implement the KinesisInputDStream to use the Worker provided by dynamodb-streams-kinesis-adapter.
The official guidelines suggest something like this:
final Worker worker = StreamsWorkerFactory
.createDynamoDbStreamsWorker(
recordProcessorFactory,
workerConfig,
adapterClient,
amazonDynamoDB,
amazonCloudWatchClient);
From Spark's perspective, it is implemented under the kinesis-asl module in KinesisInputDStream.scala.
I have tried this for Spark 2.4.0. Here is my repo. It needs a little refining but gets the work done:
https://github.com/ravi72munde/spark-dynamo-stream-asl
After modifying the KinesisInputDStream, we can use it as shown below.
val stream = KinesisInputDStream.builder
.streamingContext(ssc)
.streamName("sample-tablename-2")
.regionName("us-east-1")
.initialPosition(new Latest())
.checkpointAppName("sample-app")
.checkpointInterval(Milliseconds(100))
.storageLevel(StorageLevel.MEMORY_AND_DISK_2)
.build()
