Dataframe each row iteration save to cassandra

I have the following code:
def writeToCassandra(cassandraConnector: CassandraConnector) = new ForeachWriter[Row] {
override def process(row: Row): Unit = {
println("row is " + row.toString())}
override def close(errorOrNull: Throwable): Unit = {}
override def open(partitionId: Long, version: Long): Boolean =
true
}
val conf = new SparkConf()
.setAppName("Data")
.set("spark.cassandra.connection.host", "192.168.0.40,192.168.0.106,192.168.0.113")
.set("spark.cassandra.connection.keep_alive_ms", "20000")
.set("spark.executor.memory", "1g")
.set("spark.driver.memory", "2g")
.set("spark.submit.deployMode", "cluster")
.set("spark.executor.instances", "9")
.set("spark.executor.cores", "1")
.set("spark.cores.max", "9")
.set("spark.driver.cores", "3")
.set("spark.ui.port", "4040")
.set("spark.streaming.backpressure.enabled", "true")
.set("spark.speculation", "true")
println("Spark Configuration Done")
val spark = SparkSession
.builder
.appName("Data")
.config(conf)
.master("local[2]")
.getOrCreate()
println("Spark Session Config Done")
val cassandraConnector = CassandraConnector(conf)
import spark.implicits._
import org.apache.spark.sql.streaming.OutputMode
val dataStream =
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.0.78:9092,192.168.0.78:9093,192.168.0.78:9094")
.option("subscribe", "historyfleet")
.load()
val query =
dataStream
.writeStream
.outputMode(OutputMode.Append())
.foreach(writeToCassandra(cassandraConnector))
.format("console")
.start()
query.awaitTermination()
query.stop()
It gives the following runtime error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/streaming/Source$class
at org.apache.spark.sql.kafka010.KafkaSource.<init>(KafkaSource.scala:80)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.createSource(KafkaSourceProvider.scala:94)
at org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:240)
at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$1.applyOrElse(StreamingQueryManager.scala:245)
at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$1.applyOrElse(StreamingQueryManager.scala:241)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:287)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.streaming.Source$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 14 more
Inserting the DataFrame into Cassandra is slow, so I am trying to check whether writing row by row would improve performance, but the code fails with the error above.
I am using a 3-node cluster with 12 executors of 1 core each. It achieves about 6000 inserts per second into Cassandra, and I need to optimise this.
Any suggestions, please? Thanks.
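A note on the stack trace: Scala 2.11 compiles trait method bodies into a helper class named Trait$class, so a missing Source$class typically means the spark-sql-kafka-0-10 jar was compiled against a different Spark (or Scala) build than the spark-sql jar on the classpath. A minimal build.sbt sketch that keeps the Spark modules on one version is shown below. The version numbers are assumptions for illustration only (the post does not state them), and the Cassandra connector dependency is left out because its version is also unknown.

```scala
// build.sbt (sketch; both version numbers are assumptions, not taken from the post)
scalaVersion := "2.11.12"

// One shared value so that spark-sql and the Kafka source cannot drift apart
val sparkVersion = "2.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % sparkVersion,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
)
```

Separately, as far as I can tell, calling .format("console") after .foreach(...) replaces the foreach sink, so the ForeachWriter in the code above would not be invoked even once the classpath problem is resolved.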

Related

Spark structured streaming sinks to output is delayed

The Spark Structured Streaming code below collects data from Kafka every 10 seconds:
window($"timestamp", "10 seconds")
I was expecting the results to be printed on the console every 10 seconds, but I notice that the console sink only prints roughly every 2 minutes or more.
May I know what I am doing wrong?
def streaming(): Unit = {
System.setProperty("hadoop.home.dir", "/Documents/ ")
val conf: SparkConf = new SparkConf().setAppName("Histogram").setMaster("local[8]")
conf.set("spark.eventLog.enabled", "false");
val sc: SparkContext = new SparkContext(conf)
val sqlcontext = new SQLContext(sc)
val spark = SparkSession.builder().config(conf).getOrCreate()
import sqlcontext.implicits._
import org.apache.spark.sql.functions.window
val inputDf = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "wonderful")
.option("startingOffsets", "latest")
.load()
import scala.concurrent.duration._
val personJsonDf = inputDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
.withWatermark("timestamp", "500 milliseconds")
.groupBy(
window($"timestamp", "10 seconds")).count()
val consoleOutput = personJsonDf.writeStream
.outputMode("complete")
.format("console")
.option("truncate", "false")
.outputMode(OutputMode.Update())
.start()
consoleOutput.awaitTermination()
}
object SparkExecutor {
val spE: SparkExecutor = new SparkExecutor();
def main(args: Array[String]): Unit = {
println("test")
spE.streaming
}
}

I think you are missing a trigger definition on the writeStream operation for personJsonDf. When no trigger is set, Spark starts the next micro-batch as soon as the previous one finishes, so the ~2 minute gap is probably how long each batch takes rather than a configured interval.
The groupBy window that you have defined is used in the query, but it does not define how often results are emitted.
One way to configure this could be:
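// requires: import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val consoleOutput = personJsonDf.writeStream
.outputMode(OutputMode.Update()) // the earlier outputMode("complete") call was overridden by this one, so it is dropped
.trigger(Trigger.ProcessingTime("10 seconds"))
.format("console")
.option("truncate", "false")
.start()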
Finally, the Trigger class has some other useful methods that are worth checking out.
Hope it helps.
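To try the trigger in isolation, here is a minimal self-contained sketch. It is not the asker's job: it swaps Kafka for Spark's built-in "rate" test source, and the object name, the function name and the shuffle-partition setting are my own assumptions for a small local run.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery, Trigger}

object TriggerSketch {
  // Starts a windowed count that is emitted to the console every 10 seconds.
  def startWindowedCount(spark: SparkSession): StreamingQuery = {
    import spark.implicits._

    // The "rate" source generates (timestamp, value) rows; it stands in for Kafka here.
    val input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val counts = input
      .withWatermark("timestamp", "500 milliseconds")
      .groupBy(window($"timestamp", "10 seconds"))
      .count()

    counts.writeStream
      .outputMode(OutputMode.Update())
      .trigger(Trigger.ProcessingTime("10 seconds")) // fire a micro-batch every 10 seconds
      .format("console")
      .option("truncate", "false")
      .start()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TriggerSketch")
      .master("local[2]")
      // Assumption: a small partition count is enough for a local test (the default is 200)
      .config("spark.sql.shuffle.partitions", "4")
      .getOrCreate()

    startWindowedCount(spark).awaitTermination()
  }
}
```

If batches still arrive late with the trigger in place, the time each micro-batch takes is the next thing I would look at, since a trigger cannot fire before the previous batch has finished.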

How can I overcome the FileNotFoundException

I am trying to read multiple Excel files that are under one directory, but I encounter the error java.io.FileNotFoundException: File path/** does not exist
object example {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Excel to DataFrame").master("local[2]").getOrCreate()
val path = "C:\\excel\\files"
val df = spark.read.format("com.crealytics.spark.excel")
.option("location", "true")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema","true")
.option("addColorColumns", "true")
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss")
.load("path")

Try this:
def readExcel(file: String): DataFrame = sqlContext.read
.format("com.crealytics.spark.excel")
.option("location", file)
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "False")
.load()
val data = readExcel("path to your excel file")
data.show(false)
If you want to read a particular sheet:
.option("sheetName", "Sheet2")
EDIT: To read multiple Excel files into one DataFrame (provided the columns in the Excel files are consistent):
For this I have used the spark-excel package. It can be added to the build.sbt file as:
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
The code is as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File
val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val spark = SparkSession.builder().getOrCreate()
// Function to read xlsx file using spark-excel.
// This code format with "trailing dots" can be sent to Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
format("com.crealytics.spark.excel").
option("location", file).
option("useHeader", "true").
option("treatEmptyValuesAsNulls", "true").
option("inferSchema", "true").
option("addColorColumns", "False").
load()
val dir = new File("path to your excel directory") // a directory, not a single file
val excelFiles = dir.listFiles.sorted.map(f => f.toString) // Array[String]
val dfs = excelFiles.map(f => readExcel(f)) // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_)) // DataFrame
ppdf.count()
ppdf.show(5)
Hope this helps. Good luck.
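One caveat with the listing step above: dir.listFiles returns null when the path is not a directory, and it also picks up any non-Excel files in the folder. A small sketch of a more defensive listing follows. The object and function names are hypothetical, and the set of accepted extensions (.xlsx and .xls) is my assumption.

```scala
import java.io.File

object ExcelFileListing {
  // Returns the absolute paths of the .xlsx/.xls files directly inside `dir`, sorted by path.
  // Returns an empty sequence when `dir` is missing or is not a directory,
  // because File.listFiles returns null in that case.
  def listExcelFiles(dir: File): Seq[String] =
    Option(dir.listFiles)
      .map(_.toSeq)
      .getOrElse(Seq.empty)
      .filter { f =>
        val name = f.getName.toLowerCase
        f.isFile && (name.endsWith(".xlsx") || name.endsWith(".xls"))
      }
      .map(_.getAbsolutePath)
      .sorted
}
```

The result can be passed to readExcel in place of excelFiles. Note that dfs.reduce(_.union(_)) throws on an empty collection, so it is worth checking that the list is non-empty first.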

StreamingQueryException: Text data source supports only a single column

I know this question has been asked multiple times before, but none of the answers help in my case.
Below is my Spark code:
class ParseLogs extends java.io.Serializable {
def formLogLine(logLine: String): (String,String,String,Int,String,String,String,Int,Float,String,String,Float,Int,String,Int,Float,String)={
//some logic
//return value
(recordKey._2.toString().replace("\"", ""),recordKey._3,recordKey._4,recordKey._5,recordKey._6,recordKey._8,sbcId,recordKey._10,recordKey._11,recordKey._12,recordKey._13.trim(),LogTransferTime,contentAccessed,OTT,dataTypeId,recordKey._14,logCaptureTime1)
}
}
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
val myDf = inputDf.selectExpr("CAST(value AS STRING)")
val df1 = myDf.map(line => new ParseLogs().formLogLine(line.get(0).toString()))
I get the error below:
User class threw exception: org.apache.spark.sql.streaming.StreamingQueryException: Text data source supports only a single column, and you have 17 columns.;

Use a UDF to convert logLine into what you want. For example:
spark.sqlContext.udf.register("YOURLOGIC", (logLine: String) => {
//some logic
(recordKey._2.toString().replace("\"",""),recordKey._3,recordKey._4,recordKey._5,recordKey._6,recordKey._8,sbcId,recordKey._10,recordKey._11,recordKey._12,recordKey._13.trim(),LogTransferTime,contentAccessed,OTT,dataTypeId,recordKey._14,logCaptureTime1)
})
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
val myDf = inputDf.selectExpr("CAST(value AS STRING)")
val df1 = myDf.selectExpr("YOURLOGIC(value) as result")
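// A tuple returned by a UDF becomes a struct column with fields _1 ... _17
val result = df1.select(
df1("result").getField("_1"),
df1("result").getField("_2"),
df1("result").getField("_3"),
df1("result").getField("_4"),
// ... continue in the same way for the remaining fields, up to "_17"
df1("result").getField("_17"))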
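The error message itself comes from the text data source, which writes exactly one string column per line. The writeStream part is not shown in the question, so I am assuming the stream is written with format("text"). Under that assumption, one option is to collapse the columns into a single string column before writing; another is to pick a sink format that accepts several columns, such as csv or parquet. A minimal sketch of the first option (the object and function names are hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws}

object SingleColumnForTextSink {
  // Collapses every column of `df` into one string column named "value",
  // joining the fields with `sep`. concat_ws skips null fields.
  def toSingleColumn(df: DataFrame, sep: String = ","): DataFrame =
    df.select(
      concat_ws(sep, df.columns.map(c => col(c).cast("string")): _*).as("value")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SingleColumnSketch").master("local[1]").getOrCreate()
    import spark.implicits._

    // Stand-in for the 17-column parsed log DataFrame from the question
    val parsed = Seq(("a", 1, 2.5f), ("b", 2, 3.5f)).toDF("x", "y", "z")
    toSingleColumn(parsed).show(false)
    spark.stop()
  }
}
```

The same select works on a streaming DataFrame, since it is an ordinary column expression.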

GraphFrames Connected Components Performance

When I generate the connected components using GraphFrames, it takes substantially longer than I expected. I am running on Spark 2.1, GraphFrames 0.5 and AWS EMR with 3 r4.xlarge instances. Generating the connected components for a graph of about 12 million edges takes around 3 hours.
The code is below. I am fairly new to Spark, so any suggestions would be awesome.
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
.setMaster("yarn-cluster")
.setAppName("Connected Component")
val sc = new SparkContext(sparkConf)
sc.setCheckpointDir("s3a://......")
AWSUtils.setS3Credentials(sc.hadoopConfiguration)
implicit val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._
val historical = sqlContext
.read
.option("mergeSchema", "false")
.parquet("s3a://.....")
.map(x => (x(0).toString, x(2).toString, x(1).toString, x(3).toString, x(4).toString.toLong, x(5).toString.toLong))
// Complete graph
val g = GraphFrame(
historical.flatMap(e => List((e._1, e._3, e._5), (e._2, e._4, e._5))).toDF("id", "type", "timestamp"),
historical.toDF("src", "dst", "srcType", "dstType", "timestamp", "companyId")
)
val connectedComponents: DataFrame = g.connectedComponents.run()
connectedComponents.toDF().show(100, false)
sc.stop()
}

Join files in Apache Spark

I have a file like this. code_count.csv
code,count,year
AE,2,2008
AE,3,2008
BX,1,2005
CD,4,2004
HU,1,2003
BX,8,2004
Another file like this. details.csv
code,exp_code
AE,Aerogon international
BX,Bloomberg Xtern
CD,Classic Divide
HU,Honololu
I want the total sum for each code but in the final output, I want the exp_code. Like this
Aerogon international,5
Bloomberg Xtern,4
Classic Divide,4
Here is my code
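val countData = sc.textFile("C:\\path\\to\\code_count.csv")
// Skip the header row, then key each record by code with its count as the value
val countDataKV = countData.filter(!_.startsWith("code,")).map(x => x.split(",")).map(x => (x(0), x(1).toInt))
val sum = countDataKV.foldByKey(0)((acc, ele) => acc + ele)
sum.take(2)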
gives
Array[(String, Int)] = Array((AE,5), (BX,9))
Here sum is RDD[(String, Int)]. I am kind of confused about how to pull the exp_code from the other file. Please guide.

You need to calculate the sum after grouping by code and then join with the other DataFrame. Below is a similar example.
import spark.implicits._
import org.apache.spark.sql.functions.sum
val df1 = spark.sparkContext.parallelize(Seq(("AE",2,2008), ("AE",3,2008), ("BX",1,2005), ("CD",4,2004), ("HU",1,2003), ("BX",8,2004)))
.toDF("code","count","year")
val df2 = spark.sparkContext.parallelize(Seq(("AE","Aerogon international"),
("BX","Bloomberg Xtern"), ("CD","Classic Divide"), ("HU","Honololu"))).toDF("code","exp_code")
val sumdf1 = df1.select("code", "count").groupBy("code").agg(sum("count"))
val finalDF = sumdf1.join(df2, "code").drop("code")
finalDF.show()
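If you would rather stay with RDDs, as in the question, the same result can be had with a pair-RDD join. This is a minimal sketch with the sample data inlined instead of read from the CSV files; the object and function names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddJoinSketch {
  // Sums the counts per code, joins the totals with (code, exp_code) pairs,
  // and returns (exp_code, total) pairs.
  def totalsByName(counts: RDD[(String, Int)], details: RDD[(String, String)]): RDD[(String, Int)] =
    counts
      .reduceByKey(_ + _)
      .join(details) // RDD[(code, (total, exp_code))]
      .map { case (_, (total, name)) => (name, total) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddJoinSketch").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.parallelize(Seq(("AE", 2), ("AE", 3), ("BX", 1), ("CD", 4), ("HU", 1), ("BX", 8)))
    val details = sc.parallelize(Seq(
      ("AE", "Aerogon international"), ("BX", "Bloomberg Xtern"),
      ("CD", "Classic Divide"), ("HU", "Honololu")))

    totalsByName(counts, details).collect().foreach { case (name, total) => println(s"$name,$total") }
    spark.stop()
  }
}
```

With the real files, counts would be the countDataKV RDD from the question, and details would be built the same way from details.csv.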

If you are using Spark 2.0 or later, you can use the following code directly.
CSV support (formerly the separate com.databricks.spark.csv package) is built into Spark 2.0.
val codeDF = spark
.read
.option("header", "true")
.option("inferSchema", "true")
.csv("hdfs://pathTo/code_count.csv")
val detailsDF = spark
.read
.option("header", "true")
.option("inferSchema", "true")
.csv("hdfs://pathTo/details.csv")
import org.apache.spark.sql.functions._
val resDF = codeDF.join(detailsDF,codeDF.col("code")===detailsDF.col("code")).groupBy(codeDF.col("code"),detailsDF.col("exp_code")).agg(sum("count").alias("cnt"))
If you are using Spark 1.6 or earlier, you can use the following code.
You can follow this link to use com.databricks.spark.csv:
https://github.com/databricks/spark-csv
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
import hiveContext.implicits._
val codeDF = hiveContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("delimiter",",")
.load("hdfs://pathTo/code_count.csv")
val detailsDF = hiveContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter",",")
.load("hdfs://pathTo/details.csv")
import org.apache.spark.sql.functions._
val resDF = codeDF.join(detailsDF,codeDF.col("code")===detailsDF.col("code")).groupBy(codeDF.col("code"),detailsDF.col("exp_code")).agg(sum("count").alias("cnt"))
