Spark not saving all data to Redshift

The following code loads data from S3, cleans it and removes duplicates using Spark SQL, and then saves it to Redshift over JDBC. I have also tried the spark-redshift Maven dependency and get the same result. I am using Spark 2.0.
What I cannot understand is that when I show the result loaded in memory, the sum is the expected number, yet when Spark saves to Redshift the count is always lower. Somehow not all records are saved, and I do not see any errors in STL_LOAD_ERRORS either. Has anybody encountered this, or have any idea why it happens?
// Load files that were loaded into firehose on this day
var s3Files = spark.sqlContext.read.schema(schema).json("s3://" + job.getAWSAccessKey + ":" + job.getAWSSecretKey + "#" + job.getBucketName + "/"+ job.getAWSS3RawFileExpression + "/" + year+ "/" + monthCheck+ "/" + dayCheck + "/*/" ).rdd
// Apply the schema to the RDD, here we will have duplicates
val usersDataFrame = spark.createDataFrame(s3Files, schema)
usersDataFrame.createOrReplaceTempView("results")
// Clean and use partition by the keys to eliminate duplicates and get latest record
var results = spark.sql(buildCleaningQuery(job,"results"))
results.createOrReplaceTempView("filteredResults")
// This returns the correct result!
var check = spark.sql("select sum(Reward) from filteredResults where period=1706")
check.show()
var path = UUID.randomUUID().toString()
println("s3://" + job.getAWSAccessKey + ":" + job.getAWSSecretKey + "#" + job.getAWSS3TemporaryDirectory + "/" + path)
val prop = new Properties()
results.write.jdbc(job.getRedshiftJDBC,"work.\"" + path + "\"",prop)

Using jdbc means that Spark will be issuing repeated INSERT INTO statements - which is massively slow in Redshift and bypasses the COPY path entirely. That's why you're not seeing any entries in stl_load_errors.
I'd suggest using the spark-redshift library instead. It's well tested and performs far better, since it stages the data in S3 and loads it with a single COPY: https://github.com/databricks/spark-redshift
Example (showing many options):
my_dataframe.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://my_cluster.qwertyuiop.eu-west-1.redshift.amazonaws.com:5439/my_database?user=my_user&password=my_password")
.option("dbtable", "my_table")
.option("tempdir", "s3://my-bucket")
.option("diststyle", "KEY")
.option("distkey", "dist_key")
.option("sortkeyspec", "COMPOUND SORTKEY(key_1, key_2)")
.option("extracopyoptions", "TRUNCATECOLUMNS COMPUPDATE OFF STATUPDATE OFF")
.mode("overwrite") // "append" / "error"
.save()

Related

Spark structured streaming - update data frame's schema on the fly

I have a simple structured streaming job which monitors a directory for CSV files and writes parquet files - no transformation in between.
The job starts by building a data frame from reading CSV files using readStream(), with a schema which I get from calling a function called buildSchema(). Here is the code:
var df = spark
.readStream
.option("sep", "|")
.option("header","true")
.schema(buildSchema(spark, table_name).get) // buildSchema() gets schema for me
.csv(input_base_dir + table_name + "*")
logger.info(" new batch indicator")
if (df.schema != buildSchema(spark, table_name).get) {
df = spark.sqlContext.createDataFrame(df.collectAsList(), buildSchema(spark, table_name).get)
}
val query =
df.writeStream
.format("parquet")
.queryName("convertCSVtoPqrquet for table " + table_name)
.option("path", output_base_dir + table_name + "/")
.trigger(ProcessingTime(60.seconds))
.start()
The job runs fine, but my question is: I'd like to always use the latest schema to build my data frame, i.e. to read the CSV files with it. While buildSchema() can get me the latest schema, I'm not sure how to call it periodically (or once per CSV file) and then use that schema to regenerate or modify the data frame.
When testing, my observation is that only the query object runs continuously, batch after batch; the log statement and the if() statement for the schema comparison only execute once, at the start of the application.
Can the data frame schema in a structured streaming job be modified after query.start() is called? If not, what would you suggest as a good workaround?
Thanks in advance.
You can use the foreachBatch sink (available since Spark 2.4) to load the latest schema for each micro-batch and compare it with the schema of that batch's DataFrame.
Example:
var streamingDF = spark
.readStream
.option("sep", "|")
.option("header", "true")
.schema(buildSchema(spark, table_name).get) // buildSchema() gets schema for me
.csv(input_base_dir + table_name + "*")
val query =
streamingDF
.writeStream
.foreachBatch((ds, i) => {
logger.info(s"New batch indicator(${i})")
val batchDf =
if (ds.schema != buildSchema(spark, table_name).get) {
spark.sqlContext.createDataFrame(ds.collectAsList(), buildSchema(spark, table_name).get)
} else {
ds
}
batchDf.write.mode("append").parquet(output_base_dir + table_name + "/") // append, otherwise the second batch fails because the path already exists
})
.trigger(ProcessingTime(60.seconds))
.start()

Connection to remote hbase through scala spark

I am trying to connect to a remote HBase cluster through Scala and Spark, but have not succeeded.
Can anyone suggest any methods for doing this?
Thanks in advance.
There are two common ways to connect to HBase from Spark/Scala:
HBase REST API
Apache Phoenix -- https://phoenix.apache.org/
HBase REST API code:
import org.apache.hadoop.hbase.client.{Get, Put, Scan, ResultScanner}
import org.apache.hadoop.hbase.rest.client.{Client, RemoteHTable}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConversions._ // lets us iterate the ResultScanner in a for loop

val hbaseCluster = new org.apache.hadoop.hbase.rest.client.Cluster()
hbaseCluster.add("localhost or IP of the REST server", <port>)
val restClient = new Client(hbaseCluster)
val table = new RemoteHTable(restClient, "STUDENT")
println("connected...")
var p = new Put(Bytes.toBytes("row1"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("NAME"),Bytes.toBytes("raju"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("COURSE"),Bytes.toBytes("SCALA"))
p.add(Bytes.toBytes("0"), Bytes.toBytes("YEAR"),Bytes.toBytes("2017"))
table.put(p)
val scan = new Scan()
val scanner : ResultScanner = table.getScanner(scan)
println("got scanner...")
val g = new Get(Bytes.toBytes("row1"))
val result = table.get(g)
val name = Bytes.toString(result.getValue(Bytes.toBytes("0"),Bytes.toBytes("NAME")))
val course = Bytes.toString(result.getValue(Bytes.toBytes("0"),Bytes.toBytes("COURSE")))
val year = Bytes.toString(result.getValue(Bytes.toBytes("0"),Bytes.toBytes("YEAR")))
println("row1 " + "name: " + name + " course: " + course + " year:" + year);
for (scanResult <- scanner) {
val scannedName = Bytes.toString(scanResult.getValue(Bytes.toBytes("0"), Bytes.toBytes("NAME")))
println("scanned NAME: " + scannedName)
}
Apache Phoenix
Phoenix provides a Spark plugin and JDBC connectivity as well.
spark plugin - https://phoenix.apache.org/phoenix_spark.html
JDBC Connection (query server)- https://phoenix.apache.org/server.html
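For example, with the phoenix-spark plugin on the classpath, reading a Phoenix table into a DataFrame looks roughly like the sketch below (the table name STUDENT and the ZooKeeper quorum are placeholders, not values taken from the question):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-read").getOrCreate()

// Read a Phoenix table through the phoenix-spark data source.
// "zkUrl" must point to the ZooKeeper quorum used by your HBase/Phoenix cluster.
val studentDF = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "STUDENT")
  .option("zkUrl", "zookeeper-host:2181")
  .load()

studentDF.show()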
I came across a similar problem last week. Eventually I got it working using the HBase-Spark connector. It needs quite a bit of setup/configuration; I've documented my steps in the link below:
Setup Apache Zeppelin with Spark and HBase
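For reference, once the connector jars and hbase-site.xml are on the Spark classpath, a read through the HBase-Spark data source looks roughly like the sketch below (the table name and column mapping are illustrative assumptions, and option names can differ slightly between connector versions):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-read").getOrCreate()

// Map HBase columns (family:qualifier) to DataFrame columns.
val studentDF = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "STUDENT")
  .option("hbase.columns.mapping",
    "rowkey STRING :key, name STRING 0:NAME, course STRING 0:COURSE")
  .load()

studentDF.show()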

Spark Job hang when writing out using DF

The application executes sysDF.write.partitionBy and writes the first set of Parquet files out successfully. After that, however, the application hangs with all executors killed, until a timeout eventually occurs. The action code is as below:
import sqlContext.implicits._
val systemRDD = basicLogRDD.map(basicLog => if (basicLog.isInstanceOf[SystemLog]) basicLog.asInstanceOf[SystemLog] else null).filter(_ != null)
val sysDF = systemRDD.toDF()
sysDF.write.partitionBy("appId").parquet(outputPath + "/system/date=" + dateY4M2D2)
val customRDD = basicLogRDD.map(basicLog => if (basicLog.isInstanceOf[CustomLog]) basicLog.asInstanceOf[CustomLog] else null).filter(_ != null)
val customDF = customRDD.toDF()
customDF.write.partitionBy("appId").parquet(outputPath + "/custom/date=" + dateY4M2D2)
val illegalRDD = basicLogRDD.map(basicLog => if (basicLog.isInstanceOf[IllegalLog]) basicLog.asInstanceOf[IllegalLog] else null).filter(_ != null)
val illegalDF = illegalRDD.toDF()
illegalDF.write.partitionBy("appId").parquet(outputPath + "/illegal/date=" + dateY4M2D2)
First, each map-plus-filter pair can be collapsed into a single filter followed by a cast, which tidies up the job a bit:
val rdd = basicLogRDD.cache()
rdd.filter(_.isInstanceOf[SystemLog]).map(_.asInstanceOf[SystemLog]).toDF().write.partitionBy("appId").parquet(outputPath + "/system/date=" + dateY4M2D2)
rdd.filter(_.isInstanceOf[CustomLog]).map(_.asInstanceOf[CustomLog]).toDF().write.partitionBy("appId").parquet(outputPath + "/custom/date=" + dateY4M2D2)
rdd.filter(_.isInstanceOf[IllegalLog]).map(_.asInstanceOf[IllegalLog]).toDF().write.partitionBy("appId").parquet(outputPath + "/illegal/date=" + dateY4M2D2)
Second, it is a good idea to cache basicLogRDD, as it is used multiple times; the .cache() operator keeps the RDD in memory after the first pass over the data.
Finally, the filtered RDD still needs an explicit .toDF() before it can be written out as Parquet, which is why import sqlContext.implicits._ is required.

Issue while storing data from Spark-Streaming to Cassandra

A Spark Streaming context reads a stream from RabbitMQ with a 30-second batch interval. I want to modify the values of a few columns of the corresponding rows already in Cassandra and then store the data back to Cassandra. For that I need to check whether a row for the particular primary key exists in Cassandra; if it does, fetch it and do the necessary operation. The problem is that I create the StreamingContext on the driver while the actions are performed on the workers, so the workers cannot get the StreamingContext object (it isn't serialized and sent to them) and I get this error:
java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext. I also know that we cannot access the StreamingContext inside foreachRDD. How do I achieve the same functionality without the serialization error?
I have looked at a few examples here but they didn't help.
Here is the snippet of the code :
val ssc = new StreamingContext(sparkConf, Seconds(30))
val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
receiverStream.start()
val lines = receiverStream.map(EventData.fromString(_))
lines.foreachRDD{ x => if (x.toLocalIterator.nonEmpty) {
x.foreachPartition { it => for (tuple <- it) {
val cookieid = tuple.cookieid
val sessionid = tuple.sessionid
val logdate = tuple.logdate
val EventRows = ssc.cassandraTable("SparkTest", CassandraTable).select("*")
.where("cookieid = '" + cookieid + "' and logdate = '" + logdate+ "' and sessionid = '" + sessionid + "')
Somelogic Whether row exist or not for Cookieid
} } }
Neither the SparkContext nor the StreamingContext can be serialized and passed to workers that may sit on different nodes. If you need to do something like this, use foreachPartition or mapPartitions, or do the following inside the function that gets passed around:
CassandraConnector(SparkWriter.conf).withSessionDo { session =>
// ...
session.executeAsync(<CQL Statement>)
}
and in the SparkConf you need to provide the Cassandra connection details:
val conf = new SparkConf()
.setAppName("test")
.set("spark.ui.enabled", "true")
.set("spark.executor.memory", "8g")
// .set("spark.executor.core", "4")
.set("spark.eventLog.enabled", "true")
.set("spark.eventLog.dir", "/ephemeral/spark-events")
//to avoid disk space issues - default is /tmp
.set("spark.local.dir", "/ephemeral/spark-scratch")
.set("spark.cleaner.ttl", "10000")
.set("spark.cassandra.connection.host", cassandraip)
.setMaster("spark://10.255.49.238:7077")
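Putting the pieces together for the streaming job in the question, a minimal sketch could look like the following. It reuses the EventData class, the RabbitMQ stream and the keyspace/table names from the question, and leaves the actual row-lookup/update logic as a placeholder:
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(30))
val connector = CassandraConnector(conf) // created once on the driver; it is serializable
val lines = RabbitMQUtils.createStream(ssc, rabbitParams).map(EventData.fromString(_))

lines.foreachRDD { rdd =>
  rdd.foreachPartition { events =>
    // one Cassandra session per partition, reused for every event in it
    connector.withSessionDo { session =>
      events.foreach { e =>
        val rs = session.execute(
          "SELECT * FROM \"SparkTest\"." + CassandraTable +
            " WHERE cookieid = '" + e.cookieid + "' AND logdate = '" + e.logdate +
            "' AND sessionid = '" + e.sessionid + "'")
        val existingRow = rs.one() // null if no row exists for this key
        // ... apply the required column changes and write back, e.g. session.execute(<CQL update>)
      }
    }
  }
}

ssc.start()
ssc.awaitTermination()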
A related issue: the Java CSVParser is a class that is not serializable, so Spark cannot ship it to (possibly different) nodes if you reference it inside map or foreach on an RDD. One workaround is to use mapPartitions, in which case one full partition is processed on one Spark node and the parser is created there, so it never has to be serialized for each call. Example:
def pLines(lines: Iterator[String]) = {
val parser = new CSVParser() // cannot be serialized; would fail if used inside rdd.map()
lines.map(x => parseCSVLine(x, parser.parseLine))
}
val rdd_initial_parse = rdd.mapPartitions(pLines)
Try x.sparkContext.cassandraTable() instead of ssc.cassandraTable() and see if that helps.
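That is, inside foreachRDD the closure body runs on the driver, so a usable SparkContext is reachable through the RDD itself. A rough sketch (someCookieId is a made-up placeholder for the key you are looking up):
import com.datastax.spark.connector._ // adds cassandraTable() to SparkContext

lines.foreachRDD { rdd =>
  val sc = rdd.sparkContext // valid here: this part of the closure runs on the driver
  val eventRows = sc.cassandraTable("SparkTest", CassandraTable)
    .where("cookieid = ?", someCookieId) // placeholder predicate
  // ... compare eventRows with the incoming rdd and save the result back to Cassandra ...
}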

How to convert a Cassandra ResultSet to a Spark DataFrame?

I would normally load data from Cassandra into Apache Spark this way using Java:
SparkContext sparkContext = StorakleSparkConfig.getSparkContext();
CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");
DataFrame customers = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
"WHERE CAST(store_id as string) = '" + storeId + "'");
But imagine I have a sharding scheme and need to load several partition keys into this DataFrame. I could use WHERE IN (...) in my query and again use the cassandraSql method, but I am reluctant to use WHERE IN because of the well-known problem of turning the coordinator node into a single point of failure. This is explained here:
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Is there a way to use several queries but load them into a single DataFrame?
One way to do this is to run the individual queries and then unionAll/union the resulting DataFrames/RDDs.
SparkContext sparkContext = StorakleSparkConfig.getSparkContext();
CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");
DataFrame customersOne = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " + "WHERE CAST(store_id as string) = '" + storeId1 + "'");
DataFrame customersTwo = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " + "WHERE CAST(store_id as string) = '" + storeId2 + "'");
DataFrame allCustomers = customersOne.unionAll(customersTwo);
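If there are more than a couple of partition keys, the same idea can be folded over a collection of keys. A short sketch in Scala (storeIds and the column list are placeholders, and the CassandraSQLContext is assumed to be set up as above):
// One query per partition key, combined into a single DataFrame with unionAll.
val customerDFs = storeIds.map { storeId =>
  sqlContext.cassandraSql(
    "SELECT email, first_name, last_name FROM store_customer " +
    "WHERE CAST(store_id as string) = '" + storeId + "'")
}
val allCustomers = customerDFs.reduce(_ unionAll _)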
