Issue while storing data from Spark-Streaming to Cassandra - apache-spark

A Spark Streaming context reads a stream from RabbitMQ with an interval of 30 seconds. I want to modify the values of a few columns of the corresponding rows that already exist in Cassandra and then store the data back to Cassandra. For that I need to check whether the row for the particular primary key exists in Cassandra; if yes, fetch it and do the necessary operation. But the problem is that I create the StreamingContext on the driver while the actions get performed on the workers. So they cannot access the StreamingContext object, because it isn't serialized and sent to the workers, and I get this error:
java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext. I also know that we cannot access the StreamingContext inside foreachRDD. But how do I achieve the same functionality here without getting a serialization error?
I have looked at a few examples here but they didn't help.
Here is the snippet of the code:
val ssc = new StreamingContext(sparkConf, Seconds(30))
val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
receiverStream.start()
val lines = receiverStream.map(EventData.fromString(_))
lines.foreachRDD { x =>
  if (x.toLocalIterator.nonEmpty) {
    x.foreachPartition { it =>
      for (tuple <- it) {
        val cookieid = tuple.cookieid
        val sessionid = tuple.sessionid
        val logdate = tuple.logdate
        val EventRows = ssc.cassandraTable("SparkTest", CassandraTable).select("*")
          .where("cookieid = '" + cookieid + "' and logdate = '" + logdate + "' and sessionid = '" + sessionid + "'")
        // some logic to check whether the row exists or not for this cookieid
      }
    }
  }
}

The SparkContext cannot be serialized and passed across multiple workers on possibly different nodes. If you need to do something like this you could use foreachPartition or mapPartitions.
Otherwise, do this within the function that gets passed around:
CassandraConnector(SparkWriter.conf).withSessionDo { session =>
  // ...
  session.executeAsync(/* CQL statement */)
}
In the SparkConf you need to give the Cassandra details:
val conf = new SparkConf()
.setAppName("test")
.set("spark.ui.enabled", "true")
.set("spark.executor.memory", "8g")
// .set("spark.executor.core", "4")
.set("spark.eventLog.enabled", "true")
.set("spark.eventLog.dir", "/ephemeral/spark-events")
//to avoid disk space issues - default is /tmp
.set("spark.local.dir", "/ephemeral/spark-scratch")
.set("spark.cleaner.ttl", "10000")
.set("spark.cassandra.connection.host", cassandraip)
.setMaster("spark://10.255.49.238:7077")
The Java CSVParser is a library that is not serializable. So Spark cannot send it to possibly different nodes if you call map or foreach on the RDD. One workaround is using mapPartitions, in which case one full partition is executed on one Spark node, so it does not need to be serialized for each call. Example:
val rdd_initial_parse = rdd.mapPartitions(pLines)

def pLines(lines: Iterator[String]) = {
  val parser = new CSVParser() // cannot be serialized, would fail if used with rdd.map(pLines)
  lines.map(x => parseCSVLine(x, parser.parseLine))
}

Try x.sparkContext.cassandraTable() instead of ssc.cassandraTable() and see if it helps.
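Putting the two suggestions together, a minimal sketch of the foreachRDD body might look like this (the keyspace and column names come from the question, the table name is a placeholder, and the actual update logic is left out):
import com.datastax.spark.connector.cql.CassandraConnector

// Build the connector on the driver from the SparkConf; it is serializable,
// unlike the StreamingContext, so it can be used inside foreachPartition.
val connector = CassandraConnector(sparkConf)

lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    connector.withSessionDo { session =>
      // prepare once per partition and reuse for every record
      val stmt = session.prepare(
        "SELECT * FROM SparkTest.your_table WHERE cookieid = ? AND logdate = ? AND sessionid = ?")
      partition.foreach { tuple =>
        val rs = session.execute(stmt.bind(tuple.cookieid, tuple.logdate, tuple.sessionid))
        val rowExists = rs.one() != null
        // if rowExists, modify the columns and write back, e.g. via session.executeAsync(...)
      }
    }
  }
}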

Related

Invalid status code '400' from .. error payload: "requirement failed: Session isn't active

I am running PySpark scripts to write a DataFrame to a CSV in a Jupyter notebook as below:
df.coalesce(1).write.csv('Data1.csv',header = 'true')
After an hour of runtime I am getting the below error.
Error: Invalid status code from http://.....session isn't active.
My config is like:
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("shuffle.service.enabled","true")
spark.conf.set("spark.dynamicAllocation.minExecutors",6)
spark.conf.set("spark.executor.heartbeatInterval","3600s")
spark.conf.set("spark.cores.max", "4")
spark.conf.set("spark.sql.tungsten.enabled", "true")
spark.conf.set("spark.eventLog.enabled", "true")
spark.conf.set("spark.app.id", "Logs")
spark.conf.set("spark.io.compression.codec", "snappy")
spark.conf.set("spark.rdd.compress", "true")
spark.conf.set("spark.executor.instances", "6")
spark.conf.set("spark.executor.memory", '20g')
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("spark.driver.allowMultipleContexts", "true")
spark.conf.set("spark.master", "yarn")
spark.conf.set("spark.driver.memory", "20G")
spark.conf.set("spark.executor.instances", "32")
spark.conf.set("spark.executor.memory", "32G")
spark.conf.set("spark.driver.maxResultSize", "40G")
spark.conf.set("spark.executor.cores", "5")
I have checked the container nodes and the error there is:
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed:container_e836_1556653519610_3661867_01_000005 on host: ylpd1205.kmdc.att.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
I am not able to figure out the issue.
Judging by the output, if your application is not finishing with a FAILED status, this sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), so even though the Spark app succeeds, your notebook will receive this error if the app takes longer than the Livy session's timeout.
If that's the case, here's how to address it:
Edit the /etc/livy/conf/livy.conf file (on the cluster's master node).
Set livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app); see the snippet after these steps.
Restart Livy to update the setting: sudo restart livy-server on the cluster's master node.
Test your code again.
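For reference, the edited line in livy.conf would look something like this (8h is just an example value; adjust it to your workload):
# /etc/livy/conf/livy.conf
livy.server.session.timeout = 8h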
I am not well versed in PySpark, but in Scala the solution would involve something like this.
First we need to create a method for creating a header file:
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
  // format the header file path
  val fileName = "dfheader.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)
  // write the header to HDFS one column at a time
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)
  for (h <- colNames) {
    writer.write(h + ",")
  }
  writer.write("\n")
  writer.close()
}
You will also need a method that calls Hadoop to merge the part files written by the df.write method:
def mergeOutputFiles(sourcePaths: String, destLocation: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // in case of Array[String], use a for loop to iterate over the multiple source paths; if not, use the code below
  // for (sourcePath <- sourcePaths) {
  // get the path under destination where the partitioned files are temporarily stored
  val pathText = sourcePaths.split("/")
  val destPath = "%s/%s".format(destLocation, pathText.last)
  // merge the part files into one
  FileUtil.copyMerge(hdfs, new Path(sourcePaths), hdfs, new Path(destPath), true, hadoopConfig, null)
  // }
  // delete the temp partitioned files (the parent temp folder of the source path) once the merge is complete
  val tempfilesPath = new Path(sourcePaths).getParent
  hdfs.delete(tempfilesPath, true)
}
Here is the method that generates the output files, i.e. your df.write call, where you pass in the huge DataFrame to be written out to HDFS:
def generateOutputFiles(processedDf: DataFrame, opPath: String, tempOutputFolder: String,
                        spark: SparkSession): String = {
  import spark.implicits._
  val fileName = "%s%sNameofyourCsvFile.csv".format(opPath, tempOutputFolder)
  // write as CSV to the output directory, then create the header file
  processedDf.write.mode("overwrite").csv(fileName)
  createHeaderFile(fileName, processedDf.columns)
  // keep the partitioned file path so it can be sent for merging
  val outputFilePathList = fileName
  // you can use an Array[String] instead of a single String if the output needs to be divided into
  // multiple files based on some parameter; in that case change the return type to Array[String],
  // assign outputFilePathList(counter) = fileName inside a loop and increment the counter
  outputFilePathList
}
With all the methods defined, here is how you can implement them:
def processYourLogic(/* your parameters, if any */): DataFrame = {
  // your logic to do whatever needs to be done to your data
}
Assuming the above method returns a DataFrame, here is how you can put everything together:
val yourbigDf = processYourLogic(/* your parameters */) // returns a DataFrame
yourbigDf.cache() // caching just in case you need it
val outputPathFinal = "location where you want your file to be saved"
val tempOutputFolderLocation = "temp/"
val partFiles = generateOutputFiles(yourbigDf, outputPathFinal, tempOutputFolderLocation, spark)
mergeOutputFiles(partFiles, outputPathFinal)
Let me know if you have any other questions relating to this. If the answer you seek is different, then the original question should be asked differently.

Filtering and selecting data from a DataFrame in Spark

I am working on a Spark-JDBC program and I came up with the following code so far:
object PartitionRetrieval {
  var conf = new SparkConf().setAppName("Spark-JDBC")
  val log = LogManager.getLogger("Spark-JDBC Program")
  Logger.getLogger("org").setLevel(Level.ERROR)
  val conFile = "/home/hmusr/ReconTest/inputdir/testconnection.properties"
  val properties = new Properties()
  properties.load(new FileInputStream(conFile))
  val connectionUrl = properties.getProperty("gpDevUrl")
  val devUserName = properties.getProperty("devUserName")
  val devPassword = properties.getProperty("devPassword")
  val driverClass = properties.getProperty("gpDriverClass")
  val tableName = "source.bank_accounts"

  try {
    Class.forName(driverClass).newInstance()
  } catch {
    case cnf: ClassNotFoundException =>
      log.error("Driver class: " + driverClass + " not found")
      System.exit(1)
    case e: Exception =>
      log.error("Exception: " + e.printStackTrace())
      System.exit(1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().getOrCreate()
    val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
      .option("dbtable", tableName)
      .option("user", devUserName)
      .option("password", devPassword).load()
    val rc = gpTable.filter(gpTable("source_system_name") === "ORACLE").count()
    println("gpTable Count: " + rc)
  }
}
In the above code, will the statement val gpTable = spark.read.format("jdbc").option("url", connectionUrl)... dump the whole data of the table bank_accounts into the DataFrame gpTable, after which the DataFrame rc gets the filtered data? I have this doubt because the table bank_accounts is very small and it doesn't matter if it is loaded into memory as a DataFrame as a whole. But in our production there are tables with billions of records. In that case, what is the recommended way to load data into a DataFrame using a JDBC connection?
Could anyone explain the concept of Spark-JDBC's entry point here?
will the statement ... dump the whole data of the table bank_accounts into the DataFrame gpTable and then the DataFrame rc gets the filtered data?
No. DataFrameReader is not eager. It only defines data bindings.
Additionally, simple predicates, like trivial equality checks, are pushed down to the source, and only the required columns should be loaded when the plan is executed.
In the database log you should see a query similar to
SELECT 1 FROM table WHERE source_system_name = 'ORACLE'
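As a side note, you can also verify the pushdown from the Spark side by printing the physical plan, which for a JDBC source lists the pushed predicates:
gpTable.filter(gpTable("source_system_name") === "ORACLE").explain()
// the physical plan should contain something roughly like:
// PushedFilters: [IsNotNull(source_system_name), EqualTo(source_system_name,ORACLE)]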
if it is loaded into memory as a dataframe as a whole.
No. Spark doesn't load data into memory unless it is instructed to (primarily with cache), and even then it limits itself to the blocks that fit into the available storage memory.
During standard processing it keeps only the data that is required to compute the plan. For the global plan, the memory footprint shouldn't depend on the amount of data.
In that case what is the recommended way to load data into a DataFrame using a JDBC connection ?
For questions related to scalability, please check Partitioning in spark while reading from RDBMS via JDBC, Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters? and https://stackoverflow.com/a/45028675/8371915.
Additionally, you can read Does spark predicate pushdown work with JDBC?
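To illustrate what those links describe, a partitioned JDBC read might look roughly like this; the partition column, bounds, and partition count below are made-up placeholders you would derive from your own table:
val gpTablePartitioned = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", tableName)
  .option("user", devUserName)
  .option("password", devPassword)
  // split the read into parallel queries over the chosen numeric column
  .option("partitionColumn", "account_id")  // placeholder column name
  .option("lowerBound", "1")                // placeholder bounds
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()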

How does Spark work with a JDBC connection?

I am new to Spark and I am trying to work on a spark-jdbc program to count the number of rows in a database.
I have come up with this code:
object PartitionRetrieval {
  var conf = new SparkConf().setAppName("Spark-JDBC")
  val log = LogManager.getLogger("Spark-JDBC Program")
  Logger.getLogger("org").setLevel(Level.ERROR)
  val conFile = "/home/hmusr/ReconTest/inputdir/testconnection.properties"
  val properties = new Properties()
  properties.load(new FileInputStream(conFile))
  val connectionUrl = properties.getProperty("gpDevUrl")
  val devUserName = properties.getProperty("devUserName")
  val devPassword = properties.getProperty("devPassword")
  val driverClass = properties.getProperty("gpDriverClass")
  val tableName = "source.bank_accounts"

  try {
    Class.forName(driverClass).newInstance()
  } catch {
    case cnf: ClassNotFoundException =>
      log.error("Driver class: " + driverClass + " not found")
      System.exit(1)
    case e: Exception =>
      log.error("Exception: " + e.printStackTrace())
      System.exit(1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().getOrCreate()
    val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
      .option("dbtable", tableName)
      .option("user", devUserName)
      .option("password", devPassword).load()
    val rc = gpTable.filter(gpTable("source_system_name") === "ORACLE").count()
    println("gpTable Count: " + rc)
  }
}
So far, this code is working, but I have two conceptual doubts about it:
In Java, we create a connection class and use that connection to query multiple tables, closing it once our requirement is met. But Spark-JDBC appears to work in a different way.
If I have to query 10 tables in a database, should I use this line 10 times with different table names in it:
val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable",tableName)
.option("user",devUserName)
.option("password",devPassword).load()
The table currently used here has about 2000 rows in total, so I can use the filter/select/aggregate functions accordingly.
But in our production there are tables with millions of rows, and if I put one of those huge tables in the above statement, even though our requirement filters it later, wouldn't it create a huge DataFrame first?
Could anyone give me some insight regarding the doubts I mentioned above?
Pass an SQL query to it first, which is known as pushdown to the database.
E.g.
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample where k = 1) e", connectionProperties)
You can substitute the k = 1 with host variables using an s-interpolated string (s"""..."""), or build your own SQL string and reuse it as you suggest, but if you don't, the world will still exist.
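For example, a rough sketch of the s-interpolated variant, reusing the connection options from the earlier snippet (the predicate is only illustrative):
val systemName = "ORACLE"
// the subquery (with an alias) is executed on the database side, so only matching rows are transferred
val pushdownQuery = s"(select * from source.bank_accounts where source_system_name = '$systemName') as filtered_accounts"
val filteredDf = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", pushdownQuery)
  .option("user", devUserName)
  .option("password", devPassword)
  .load()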

How to broadcast data from MySQL and use it in streaming batches?

// How do I get attributes from MYSQL DB during each streaming batch and broadcast it.
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext (sc, Seconds(streamingBatchSizeinSeconds))
val eventDStream=getDataFromKafka(ssc)
val eventDtreamFiltered=eventFilter(eventDStream,eventType)
Whatever you do in getDataFromKafka and eventFilter, I think you get a DStream to work with. That's how your future computations are described, and every batch interval you have an RDD to work with.
The answer to your question greatly depends on what exactly you want to do, but let's assume that you're done with this stream processing of the Kafka records and you want to do something with them.
If foreachRDD were acceptable, you could do the following:
// I use Spark 2.x here
// Read attributes from MySQL
val myAttrs = spark.read.jdbc([mysql-url-here]).collect
// Broadcast the attributes so they're available on the executors
val attrs = sc.broadcast(myAttrs) // do it once OR move it as part of foreachRDD below
eventDtreamFiltered.foreachRDD { rdd =>
  // for each RDD reach out to the attrs broadcast
  val _attrs = attrs.value
  // do something here with the rdd and _attrs
}
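If the attributes can change between batches (the "do it once OR move it" comment above), a rough sketch of the per-batch variant might look like this; the JDBC options are placeholders:
eventDtreamFiltered.foreachRDD { rdd =>
  // re-read the attributes at the start of every batch interval
  val myAttrs = spark.read.format("jdbc")
    .option("url", "jdbc:mysql://...")       // placeholder connection details
    .option("dbtable", "attributes_table")   // placeholder table name
    .load()
    .collect()
  // broadcast them for this batch only
  val attrs = rdd.sparkContext.broadcast(myAttrs)
  // do something here with the rdd and attrs.value
  attrs.unpersist()
}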
And that's it!

Spark Cassandra Connector: SQLContext.read + SQLContext.write vs. manual parsing and inserting (JSON -> Cassandra)

Good morning,
I just started investigating Apache Spark and Apache Cassandra. The first step is a really simple use case: taking a file containing e.g. customer + score.
The Cassandra table has customer as its primary key. Cassandra is just running locally (so no cluster at all!).
So the Spark job (standalone, local[2]) parses the JSON file and then writes the whole thing into Cassandra.
The first solution was:
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val cass = CassandraConnector(conf)
val customerScores = sc.textFile(file).cache()

val customerScoreRDD = customerScores.mapPartitions(lines => {
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)
  lines
    .map(line => {
      mapper.readValue(line, classOf[CustomerScore])
    })
    // filter corrupt ones: empty values
    .filter(customerScore => customerScore.customer != null && customerScore.score != null)
})

customerScoreRDD.foreachPartition(rows => cass.withSessionDo(session => {
  val statement: PreparedStatement = session.prepare("INSERT INTO playground.customer_score (customer,score) VALUES (:customer,:score)")
  rows.foreach(row => {
    session.executeAsync(statement.bind(row.customer.asInstanceOf[Object], row.score))
  })
}))

sc.stop()
This means doing everything manually: parsing the lines and then inserting into Cassandra.
This takes roughly 714020 ms in total for 10000000 records (including creating the SparkContext and so on ...).
Then I read about the spark-cassandra-connector and did the following:
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sql = new SQLContext(sc)

val customerScores = sql.read.json(file)

val customerScoresCorrected = customerScores
  // filter corrupt ones: empty values
  .filter("customer is not null and score is not null")
  // filter corrupt ones: invalid properties
  .select("customer", "score")

customerScoresCorrected.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "playground", "table" -> "customer_score"))
  .save()

sc.stop()
So much simpler in terms of the code needed, and it uses the given API.
This solution takes roughly 1232871 ms for 10000000 records (again all in all, so the same measuring points).
(I had a third solution as well, parsing manually plus using saveToCassandra, which takes 1530877 ms.)
Now my question:
Which is the "correct" way to fulfil this use case, i.e. which one is the "best practice" (and, in a real scenario with clustered Cassandra and Spark, the most performant one) nowadays?
Because from my results I would use the "manual" approach instead of SQLContext.read + SQLContext.write.
Thanks for your comments and hints in advance.
Actually, after playing around for quite a long time now, the following has to be considered:
Of course, the amount of data
The type of your data: especially the variety of partition keys (each one different vs. lots of duplicates)
The environment: Spark executors, Cassandra nodes, replication ...
For my use case, playing around with
def initSparkContext: SparkContext = {
  val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
    // since we have nearly totally different partition keys, default: 1000
    .set("spark.cassandra.output.batch.grouping.buffer.size", "1")
    // write as much concurrently as possible, default: 5
    .set("spark.cassandra.output.concurrent.writes", "1024")
    // batch same replica, default: partition
    .set("spark.cassandra.output.batch.grouping.key", "replica_set")
  val sc = new SparkContext(conf)
  sc
}
did boost the speed dramatically in my local run.
So there is very much a need to try out the various parameters to find YOUR best way. At least that is the conclusion I reached.
