I am trying to create a generic function to read a CSV file using the Databricks CSV reader. However, the options are not fixed; they can differ based on my input JSON configuration file.
Example 1:
"ReaderOption":{
"delimiter":";",
"header":"true",
"inferSchema":"true",
"schema":"""some custome schema.."""
},
Example 2:
"ReaderOption":{
"delimiter":";",
"schema":"""some custome schema.."""
},
Is it possible to construct the options, or the entire read statement, at runtime and run it in Spark? Like below:
def readCsvWithOptions(): DataFrame = {
  val options: Map[String, String] = Map("inferSchema" -> "true")
  val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
    .option(options)
    .load(inputPath)
  readDF
}
def readCsvWithOptions(): DataFrame = {
  val options: Map[String, String] = Map("inferSchema" -> "true")
  val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
    .options(options)
    .load(inputPath)
  readDF
}
Yes. There is an options method that takes key/value pairs as a Map, so you can build the map at runtime and pass it in a single call, as in the second snippet above.
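As a hedged sketch of the runtime part, the "ReaderOption" object from the configuration file can be parsed into a Map[String, String] and passed straight to options. Parsing with json4s is just one choice here; jobContext and inputPath are taken from the question, while readerOptions and the configJson parameter are illustrative names, not a fixed API:
import org.apache.spark.sql.DataFrame
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Turn the "ReaderOption" block of the JSON config into reader options.
// Error handling is omitted for brevity.
def readerOptions(configJson: String): Map[String, String] = {
  implicit val formats: Formats = DefaultFormats
  (parse(configJson) \ "ReaderOption").extract[Map[String, String]]
}

def readCsvWithOptions(configJson: String, inputPath: String): DataFrame =
  jobContext.spark.read
    .format("com.databricks.spark.csv")
    .options(readerOptions(configJson))
    .load(inputPath)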
Shareplex CDC offers three JSON sub-structs per CDC record:
meta - the operation type (insert, update, delete, ...)
data - the actual changed data, with column names
key - the before image, i.e. all fields, including those that changed in "data"
This is what the data engineers state, and the documentation also seems to describe only this possibility.
My question is: how can we get the complete after image of the record, including both changed and non-changed data? Maybe it is simply not possible.
{
"meta":{
"op":"upd",
"table":"BILL.PRODUCTS"
},
"data":{
"PRICE":"3599"
},
"key":{
"PRODUCT_ID":"230117",
"DESCRIPTION":"Hamsberry vintage tee, cherry",
"PRICE":"4099"
}
}
As far as I can see, the above format is unhandy to work with in Spark: either the schema is computed per batch, or the complete schema has to be defined up front, which brings NULL-value issues.
No, this is not possible out of the box.
What you can do is read the Kafka JSON, reconstruct the after image as below, write it to a new Kafka topic, and proceed from there:
import org.json4s._
import org.json4s.jackson.JsonMethods._
val jsonS =
"""
{
"meta":{
"op":"upd",
"table":"BILL.PRODUCTS"
},
"data":{
"PRICE":"3599"
},
"key":{
"PRODUCT_ID":"230117",
"DESCRIPTION":"Hamsberry vintage tee, cherry",
"PRICE":"4099"
}
}
""".stripMargin
val jsonNN = parse(jsonS)
val meta = jsonNN \ "meta"
val data = jsonNN \ "data"
val key = jsonNN \ "key"
// Diff the before image ("key") against the changed columns ("data"):
// "changed" carries the new values, "deleted" the columns that did not change.
val Diff(changed, added, deleted) = key diff data
// After image = new values merged with the unchanged columns.
val afterImage = changed merge deleted
// Convert to JSON
println(pretty(render(afterImage)))
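As a hedged sketch only, the same diff/merge logic can be wrapped in a UDF and applied to the Kafka value column in Spark before the result is written to a new topic. kafkaDf, the column names, and the downstream write are assumptions for illustration:
import org.apache.spark.sql.functions.{col, udf}

// Compute the after image per record; assumes kafkaDf has a "value" column
// holding the Shareplex JSON shown above.
val afterImageUdf = udf { json: String =>
  import org.json4s._
  import org.json4s.jackson.JsonMethods._
  val parsed = parse(json)
  val Diff(changed, _, deleted) = (parsed \ "key") diff (parsed \ "data")
  compact(render(changed merge deleted))
}

val withAfterImage = kafkaDf
  .selectExpr("CAST(value AS STRING) AS value")
  .withColumn("afterImage", afterImageUdf(col("value")))
// withAfterImage can then be written back to a new Kafka topic.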
I'm doing some statistics using Spark Streaming and Cassandra. When I read a Cassandra table with the spark-cassandra-connector and turn the Cassandra row RDD into a DStream via ConstantInputDStream, the "CurrentDate" variable in the where clause stays fixed at the day the program started.
The purpose is to analyze the total score by some dimensions up to the current date, but right now the code only analyzes data up to the day it started running. I started the job on 2019-05-25, and data inserted into the table after that cannot be taken in.
The code I use looks like this:
class TestJob extends Serializable {
def test(ssc : StreamingContext) : Unit={
val readTableRdd = ssc.cassandraTable(Configurations.getInstance().keySpace1,Constants.testTable)
.select(
"code",
"date",
"time",
"score"
).where("date<= ?",new Utils().getCurrentDate())
val DStreamRdd = new ConstantInputDStream(ssc,readTableRdd)
DStreamRdd.foreachRDD{r=>
//DO SOMETHING
}
}
}
object GetSSC extends Serializable {
def getSSC() : StreamingContext ={
val conf = new SparkConf()
.setMaster(Configurations.getInstance().sparkHost)
.setAppName(Configurations.getInstance().appName)
.set("spark.cassandra.connection.host", Configurations.getInstance().casHost)
.set("spark.cleaner.ttl", "3600")
.set("spark.default.parallelism","3")
.set("spark.ui.port","5050")
.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
@transient lazy val ssc = new StreamingContext(sc, Seconds(30))
ssc
}
}
object Main {
val logger : Log = LogFactory.getLog(Main.getClass)
def main(args : Array[String]) : Unit={
val ssc = GetSSC.getSSC()
try{
new TestJob().test(ssc)
ssc.start()
ssc.awaitTermination()
}catch {
case e : Exception =>
logger.error(Main.getClass.getSimpleName + " error: ", e)
}
}
}
The table used in this demo looks like this:
CREATE TABLE test.test_table (
code text PRIMARY KEY, //UUID
date text, // '20190520'
time text, // '12:00:00'
score int); // 90
Any help is appreciated!
In general, the RDDs returned by the Spark Cassandra Connector aren't streaming RDDs - Cassandra has no functionality that lets you subscribe to a change feed and analyze it. You can implement something like that by explicitly looping and fetching the data on every batch, but it requires careful design of the tables, and it's hard to say more without digging deeper into the requirements for latency, data volume, etc.
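For the specific symptom above (the date frozen at start-up time), one hedged sketch is to re-evaluate the date and rebuild the Cassandra read inside the per-batch closure instead of binding the where clause once. Utils, Configurations and Constants are the helpers from the question; the empty trigger RDD is an assumption used purely to drive the batches:
// Use an empty ConstantInputDStream only as a batch trigger and build the
// Cassandra read (with a freshly computed date) on every batch interval.
val trigger = new ConstantInputDStream(ssc, ssc.sparkContext.emptyRDD[Int])
trigger.foreachRDD { _ =>
  val today = new Utils().getCurrentDate()   // evaluated per batch, not once at start-up
  val snapshot = ssc.cassandraTable(Configurations.getInstance().keySpace1, Constants.testTable)
    .select("code", "date", "time", "score")
    .where("date <= ?", today)
  // DO SOMETHING with `snapshot`
}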
I'm using a LongAccumulator to count the number of records that I save in Cassandra.
object Main extends App {
val conf = args(0)
val ssc = StreamingContext.getStreamingContext(conf)
Runner.apply(conf).startJob(ssc)
StreamingContext.startStreamingContext(ssc)
StreamingContext.stopStreamingContext(ssc)
}
class Runner (conf: Conf) {
override def startJob(ssc: StreamingContext): Unit = {
accTotal = ssc.sparkContext.longAccumulator("total")
val inputKafka = createDirectStream(ssc, kafkaParams, topicsSet)
val rddAvro = inputKafka.map{x => x.value()}
saveToCassandra(rddAvro)
println("XXX:" + accTotal.value) //-->0
}
def saveToCassandra(upserts: DStream[Data]) = {
val rddCassandraUpsert = upserts.map {
record =>
accTotal.add(1)
println("ACC: " + accTotal.value) --> 1,2,3,4.. OK. Spark Web UI, ok too.
DataExt(record.data,
record.data1)}
rddCassandraUpsert.saveToCassandra(keyspace, table)
}
}
I can see that the code executes correctly and the data is saved in Cassandra, but when I finally print the accumulator its value is 0, whereas if I print it inside the map function I can see the right values. Why?
I'm using Spark 2.0.2 and running from IntelliJ in local mode. I have checked the Spark web UI and I can see the accumulator being updated.
The problem is probably here:
object Main extends App {
...
Spark doesn't support applications extending App; doing so can result in non-deterministic behavior:
Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
You should always use standard applications with main:
object Main {
  def main(args: Array[String]) {
    ...
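As a hedged sketch, the entry point from the question rewritten with an explicit main method; the StreamingContext and Runner helpers are the custom ones assumed to exist in the original code:
object Main {
  def main(args: Array[String]): Unit = {
    val conf = args(0)
    val ssc = StreamingContext.getStreamingContext(conf)
    Runner.apply(conf).startJob(ssc)
    StreamingContext.startStreamingContext(ssc)
    StreamingContext.stopStreamingContext(ssc)
  }
}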
I want to read whole text files in a non-UTF-8 encoding into Spark via
val df = spark.sparkContext.wholeTextFiles(path, 12).toDF
How can I change the encoding?
I want to read ISO-8859-encoded text, but it is not CSV; it is something similar to XML (SGML).
Edit
Maybe a custom Hadoop file input format should be used?
https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma
http://henning.kropponline.de/2016/10/23/custom-matlab-inputformat-for-apache-spark/
You can read the files using SparkContext.binaryFiles() instead and build the String for the contents with the charset you need. E.g.:
import java.nio.charset.StandardCharsets
import spark.implicits._ // needed for .toDF

val df = spark.sparkContext.binaryFiles(path, 12)
  .mapValues(content => new String(content.toArray(), StandardCharsets.ISO_8859_1))
  .toDF
It's simple. Here is the source code:
import java.nio.charset.Charset
import org.apache.hadoop.io.{Text, LongWritable}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
object TextFile {
  val DEFAULT_CHARSET = Charset.forName("UTF-8")

  def withCharset(context: SparkContext, location: String, charset: String): RDD[String] = {
    if (Charset.forName(charset) == DEFAULT_CHARSET) {
      context.textFile(location)
    } else {
      // can't pass a Charset object here cause its not serializable
      // TODO: maybe use mapPartitions instead?
      context.hadoopFile[LongWritable, Text, TextInputFormat](location).map(
        pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)
      )
    }
  }
}
It is copied from here:
https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala
To use it, see:
https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/util/TextFileSuite.scala
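A hedged usage sketch (the path is a placeholder, not a real location):
// Reads line-oriented text in the given charset; see TextFileSuite (linked above)
// for the original tests of this helper.
val rdd = TextFile.withCharset(spark.sparkContext, "/path/to/sgml-files", "ISO-8859-1")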
Edit:
If you need whole-text files, here is the actual Spark source of the wholeTextFiles implementation:
def wholeTextFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
  assertNotStopped()
  val job = NewHadoopJob.getInstance(hadoopConfiguration)
  // Use setInputPaths so that wholeTextFiles aligns with hadoopFile/textFile in taking
  // comma separated files as input. (see SPARK-7155)
  NewFileInputFormat.setInputPaths(job, path)
  val updateConf = job.getConfiguration
  new WholeTextFileRDD(
    this,
    classOf[WholeTextFileInputFormat],
    classOf[Text],
    classOf[Text],
    updateConf,
    minPartitions).map(record => (record._1.toString, record._2.toString)).setName(path)
}
Try changing:
.map(record => (record._1.toString, record._2.toString))
to (probably):
.map(record => (record._1.toString, new String(record._2.getBytes, 0, record._2.getLength, "myCustomCharset")))
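Since wholeTextFiles hard-codes Text.toString (UTF-8 decoding) inside Spark itself, a user-side workaround in the spirit of the first answer is a small helper on top of binaryFiles. This is a hedged sketch; the object and method names are illustrative only:
import java.nio.charset.Charset

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object WholeTextFile {
  // Mirrors wholeTextFiles (one (path, content) pair per file) but decodes the
  // raw bytes with the requested charset instead of UTF-8.
  def withCharset(sc: SparkContext, path: String, charset: String,
                  minPartitions: Int = 12): RDD[(String, String)] =
    sc.binaryFiles(path, minPartitions)
      .mapValues(stream => new String(stream.toArray, Charset.forName(charset)))
}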
I am using Spark to read and analyse a data file; the file contains data like the following:
1,unit1,category1_1,100
2,unit1,category1_2,150
3,unit2,category2_1,200
4,unit3,category3_1,200
5,unit3,category3_2,300
The file contains around 20 million records. If the user inputs a unit or a category, Spark needs to filter the data by inputUnit or inputCategory.
Solution 1:
sc.textFile(file).map(line => {
  val Array(id, unit, category, amount) = line.split(",")
  if ((StringUtils.isNotBlank(inputUnit) && unit != inputUnit) ||
      (StringUtils.isNotBlank(inputCategory) && category != inputCategory)) {
    null
  } else {
    val obj = new MyObj(id, unit, category, amount)
    (id, obj)
  }
}).filter(_ != null).collectAsMap()
Solution 2:
var rdd = sc.textFile(file).map(line => {
  val Array(id, unit, category, amount) = line.split(",")
  (id, unit, category, amount)
})
if (StringUtils.isNotBlank(inputUnit)) {
  rdd = rdd.filter(_._2 == inputUnit)
}
if (StringUtils.isNotBlank(inputCategory)) {
  rdd = rdd.filter(_._3 == inputCategory)
}
rdd.map(e => {
  val obj = new MyObj(e._1, e._2, e._3, e._4)
  (e._1, obj)
}).collectAsMap
I want to understand which solution is better, or whether both of them are poor. If both are poor, how do I make a good one? Personally, I think the second one is better, but I am not quite sure whether it is good style to declare an RDD as a var. (I am new to Spark; I am using Spark 1.5.0 and Scala 2.10.4 to write the code, and this is my first time asking a question on Stack Overflow, so feel free to edit if it is not well formatted.) Thanks.
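A hedged sketch of one way to keep Solution 2's conditional filtering without declaring the RDD as a var: collect only the predicates that apply and fold them over the RDD (MyObj, inputUnit and inputCategory as above):
val parsed = sc.textFile(file).map { line =>
  val Array(id, unit, category, amount) = line.split(",")
  (id, unit, category, amount)
}

// Build the list of applicable predicates from the (possibly blank) inputs.
val predicates = Seq(
  Option(inputUnit).filter(StringUtils.isNotBlank)
    .map(u => (r: (String, String, String, String)) => r._2 == u),
  Option(inputCategory).filter(StringUtils.isNotBlank)
    .map(c => (r: (String, String, String, String)) => r._3 == c)
).flatten

// Apply each predicate in turn, then build the (id, MyObj) map as before.
val result = predicates
  .foldLeft(parsed)((rdd, p) => rdd.filter(p))
  .map { case (id, unit, category, amount) => (id, new MyObj(id, unit, category, amount)) }
  .collectAsMap()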