Get file name using hadoopFile - apache-spark

I'm using Spark 2.2 along with Scala 2.11 to parse a directory and transform data inside.
To handle the ISO charset, I'm using hadoopFile like this:
val inputDirPath = "myDirectory"
sc.hadoopFile[LongWritable, Text, TextInputFormat](inputDirPath)
  .map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, "iso-8859-1"))
  .map(ProcessFunction(_))
  .toDF
How can I get the file name of each row into ProcessFunction?
ProcessFunction takes a String as a parameter and returns an object.
Thank you for your time

Answers

Including your function ProcessFunction:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

val inputDirPath = "dataset.txt"
val textRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](inputDirPath)
// cast to HadoopRDD to get access to the input split of each partition
val linesWithFileNames = textRdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    iterator.map(tuple => (file.getPath, new String(tuple._2.getBytes, 0, tuple._2.getLength, "iso-8859-1")))
  })
  .map { case (path, line) => (path, ProcessFunction(line)) }

val textRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](inputDirPath)
// cast to HadoopRDD
val linesWithFileNames = textRdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    iterator.map(tuple => (file.getPath, tuple._2))
  })
linesWithFileNames.foreach(println)
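If you also need the DataFrame from the original question, the file name can simply be carried alongside the decoded line before calling toDF. A minimal sketch (assumptions: Spark 2.x with a SparkSession named spark in scope; ProcessFunction can then be applied to the line column or in a later map):
import spark.implicits._

val df = textRdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val fileName = inputSplit.asInstanceOf[FileSplit].getPath.toString
    // decode with the ISO charset, as in the question
    iterator.map(tuple => (fileName, new String(tuple._2.getBytes, 0, tuple._2.getLength, "iso-8859-1")))
  })
  .toDF("file_name", "line")

df.show(5, false)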

Related

How to convert a Java ResultSet into a Spark DataFrame

I am trying to use a PreparedStatement with JDBC. It returns a ResultSet object, and I want to convert that into a Spark DataFrame.
import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet, ResultSetMetaData}

object JDBCRead {
  val tableName: String = "TABLENAME"
  val url: String = "jdbc:teradata://TERADATA_URL/user=USERNAME,password=PWD,charset=UTF8,TYPE=FASTEXPORT,SESSIONS=10"
  val selectTable: String = "SELECT * FROM " + tableName + " sample 10"
  val con: Connection = DriverManager.getConnection(url)
  val pstmt2: PreparedStatement = con.prepareStatement(selectTable)
  val rs: ResultSet = pstmt2.executeQuery
  val rsmd: ResultSetMetaData = rs.getMetaData

  // rs.next() returns a Boolean, so use it directly as the loop condition
  while (rs.next()) {
    for (i <- 1 to rsmd.getColumnCount) {
      print(" " + rs.getObject(i))
    }
    println()
  }
}
I want to call the above code from Spark so that I can load the data into a DataFrame and get the results faster, in a distributed way.
I must use a PreparedStatement. I cannot use spark.jdbc.load since Teradata's FASTEXPORT does not work with the JDBC load; it has to be used with a PreparedStatement.
How can I achieve this? How can I use a PreparedStatement with a SELECT statement to load the data into a Spark DataFrame?
AFAIK there are two options available for this kind of requirement:
1. DataFrame
2. JdbcRDD
I'd suggest JdbcRDD (since you are so specific about PreparedStatement), which uses prepareStatement internally in its compute method. Therefore you don't need to create and maintain the connection explicitly (which is error prone).
Later you can convert the result into a DataFrame (a sketch of that conversion follows the example below).
For speed you can configure the other parameters.
Example usage of JdbcRDD is below:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.JdbcRDD
import java.sql.{Connection, DriverManager, ResultSet}
object jdbcRddExample {
def main(args: Array[String]) {
// Connection String
val url = "jdbc:teradata://SERVER/demo"
val username = "demo"
val password = "Spark"
Class.forName("com.teradata.jdbc.Driver").newInstance
// Creating & Configuring Spark Context
val conf = new SparkConf().setAppName("App1").setMaster("local[2]").set("spark.executor.memory", "1g") // memory must be a String, e.g. "1g"
val sc = new SparkContext(conf)
println("Start...")
// Fetching data from Database
val myRDD = new JdbcRDD(sc,() => DriverManager.getConnection(url,username,password),
"select first_name, last_name, gender from person limit ?,?",
3,5,1,r => r.getString("last_name") + "," +r.getString("first_name"))
// Displaying the content
myRDD.foreach(println)
// Saving the content inside Text File
myRDD.saveAsTextFile("c://jdbcrdd")
println("End...")
}
}
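As for converting the result into a DataFrame, here is a minimal sketch (assumptions: a Spark 1.6-style SQLContext, matching the example above; the mapRow function is changed to return a tuple so that toDF can derive a schema):
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Map each row to a tuple instead of a concatenated string so toDF can infer the columns
val personRdd = new JdbcRDD(sc, () => DriverManager.getConnection(url, username, password),
  "select first_name, last_name, gender from person limit ?,?",
  3, 5, 1,
  r => (r.getString("first_name"), r.getString("last_name"), r.getString("gender")))

val personDF = personRdd.toDF("first_name", "last_name", "gender")
personDF.show()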

Serialization of transform function in checkpointing

I'm trying to understand Spark Streaming's RDD transformations and checkpointing in the context of serialization. Consider the following example Spark Streaming app:
private val helperObject = HelperObject()
private def createStreamingContext(): StreamingContext = {
val conf = new SparkConf()
.setAppName(Constants.SparkAppName)
.setIfMissing("spark.master", Constants.SparkMasterDefault)
implicit val streamingContext = new StreamingContext(
new SparkContext(conf),
Seconds(Constants.SparkStreamingBatchSizeDefault))
val myStream = StreamUtils.createStream()
myStream.transform(transformTest(_)).print()
streamingContext
}
def transformTest(rdd: RDD[String]): RDD[String] = {
rdd.map(str => helperObject.doSomething(str))
}
val ssc = StreamingContext.getOrCreate(Settings.progressDir,
createStreamingContext)
ssc.start()
while (true) {
helperObject.setData(...)
}
From what I've read in other SO posts, transformTest will be invoked on the driver program once for every batch after streaming starts. Assuming createStreamingContext is invoked (no checkpoint is available), I would expect that the instance of helperObject defined up top would be serialized out to workers once per batch, hence picking up the changes applied to it via helperObject.setData(...). Is this the case?
Now, if createStreamingContext is not invoked (a checkpoint is available), then I would expect that the instance of helperObject cannot possibly be picked up for each batch, since it can't have been captured if createStreamingContext is not executed. Spark Streaming must have serialized helperObject as part of the checkpoint, correct?
Is it possible to update helperObject throughout execution from the driver program when using checkpointing? If so, what's the best approach?
Is helperObject going to be serialized to each executor?
Ans: Yes.
val helperObject = Instantiate_SomeHow()
rdd.map{_.SomeFunctionUsing(helperObject)}
Spark Streaming must have serialized helperObject as part of the checkpoint, correct?
Ans: Yes.
If you wish to refresh your helperObject behaviour for each RDD operation, you can still do that by making your helperObject more intelligent: do not send the helperObject directly, but send it via a function with the signature () => helperObject_Class.
Since it is a function, it is serializable. This is a very common design pattern for sending objects that are not serializable, e.g. a database connection object, or for your use case here.
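A minimal sketch of that pattern against the question's transformTest (assumptions: HelperObject is a class with a companion apply(), as HelperObject() in the question suggests; the parameter name makeHelper is illustrative only):
// The factory function is serializable even when the helper it builds is not,
// so ship the factory and build the helper on the executors.
def transformTest(rdd: RDD[String], makeHelper: () => HelperObject): RDD[String] =
  rdd.mapPartitions { iter =>
    val helper = makeHelper()                    // built on the executor, once per partition
    iter.map(str => helper.doSomething(str))
  }

// on the driver:
// myStream.transform(rdd => transformTest(rdd, () => HelperObject())).print()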
An example is given from Kafka Exactly once semantics using database
package example
import kafka.serializer.StringDecoder
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import scalikejdbc._
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkContext, SparkConf, TaskContext}
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{KafkaUtils, HasOffsetRanges, OffsetRange}
/** exactly-once semantics from kafka, by storing offsets in the same transaction as the results
Offsets and results will be stored per-batch, on the driver
*/
object TransactionalPerBatch {
def main(args: Array[String]): Unit = {
val conf = ConfigFactory.load
val kafkaParams = Map(
"metadata.broker.list" -> conf.getString("kafka.brokers")
)
val jdbcDriver = conf.getString("jdbc.driver")
val jdbcUrl = conf.getString("jdbc.url")
val jdbcUser = conf.getString("jdbc.user")
val jdbcPassword = conf.getString("jdbc.password")
val ssc = setupSsc(kafkaParams, jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)()
ssc.start()
ssc.awaitTermination()
}
def setupSsc(
kafkaParams: Map[String, String],
jdbcDriver: String,
jdbcUrl: String,
jdbcUser: String,
jdbcPassword: String
)(): StreamingContext = {
val ssc = new StreamingContext(new SparkConf, Seconds(60))
SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
// begin from the offsets committed to the database
val fromOffsets = DB.readOnly { implicit session =>
sql"select topic, part, off from txn_offsets".
map { resultSet =>
TopicAndPartition(resultSet.string(1), resultSet.int(2)) -> resultSet.long(3)
}.list.apply().toMap
}
val stream: InputDStream[(String,Long)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, Long)](
ssc, kafkaParams, fromOffsets,
// we're just going to count messages per topic, don't care about the contents, so convert each message to (topic, 1)
(mmd: MessageAndMetadata[String, String]) => (mmd.topic, 1L))
stream.foreachRDD { rdd =>
// Note this block is running on the driver
// Cast the rdd to an interface that lets us get an array of OffsetRange
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
// simplest possible "metric", namely a count of messages per topic
// Notice the aggregation is done using spark methods, and results collected back to driver
val results = rdd.reduceByKey {
// This is the only block of code running on the executors.
// reduceByKey did a shuffle, but that's fine, we're not relying on anything special about partitioning here
_+_
}.collect
// Back to running on the driver
// localTx is transactional, if metric update or offset update fails, neither will be committed
DB.localTx { implicit session =>
// store metric results
results.foreach { pair =>
val (topic, metric) = pair
val metricRows = sql"""
update txn_data set metric = metric + ${metric}
where topic = ${topic}
""".update.apply()
if (metricRows != 1) {
throw new Exception(s"""
Got $metricRows rows affected instead of 1 when attempting to update metrics for $topic
""")
}
}
// store offsets
offsetRanges.foreach { osr =>
val offsetRows = sql"""
update txn_offsets set off = ${osr.untilOffset}
where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
""".update.apply()
if (offsetRows != 1) {
throw new Exception(s"""
Got $offsetRows rows affected instead of 1 when attempting to update offsets for
${osr.topic} ${osr.partition} ${osr.fromOffset} -> ${osr.untilOffset}
Was a partition repeated after a worker failure?
""")
}
}
}
}
ssc
}
}

Zeppelin twitter streaming example, unable to query the 'tweets' temp table

When following the Zeppelin tutorial for streaming tweets and querying them using Spark SQL, I am running into an error where the 'tweets' temp table is not found. The exact code being used and the links referred to are as follows.
Ref: https://zeppelin.apache.org/docs/0.6.2/quickstart/tutorial.html
import scala.collection.mutable.HashMap
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess
/** Configures the Oauth Credentials for accessing Twitter */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {
val configs = new HashMap[String, String] ++= Seq(
"apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
println("Configuring Twitter OAuth")
configs.foreach{ case(key, value) =>
if (value.trim.isEmpty) {
throw new Exception("Error setting authentication - value for " + key + " not set")
}
val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
System.setProperty(fullKey, value.trim)
println("\tProperty " + fullKey + " set as [" + value.trim + "]")
}
println()
}
// Configure Twitter credentials
val apiKey = "xxx"
val apiSecret = "xxx"
val accessToken = "xx-xxx"
val accessTokenSecret = "xxx"
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
@transient val ssc = new StreamingContext(sc, Seconds(2))
@transient val tweets = TwitterUtils.createStream(ssc, None)
@transient val twt = tweets.window(Seconds(60), Seconds(2))
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Tweet(createdAt:Long, text:String)
twt.map(status=>
Tweet(status.getCreatedAt().getTime()/1000, status.getText())).foreachRDD(rdd=>
// Below line works only in spark 1.3.0.
// For spark 1.1.x and spark 1.2.x,
// use rdd.registerTempTable("tweets") instead.
rdd.toDF().registerTempTable("tweets")
)
ssc.start()
In the next paragraph, I have the SQL select statement:
%sql select createdAt, count(1) from tweets group by createdAt order by createdAt
Which throws the following exception
org.apache.spark.sql.AnalysisException: Table not found: tweets;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:305)
I was able to get the above example running by making the following edits. I am not sure if this change was needed due to the Spark version upgrade (v1.6.3) or some other underlying architecture nuance I might be missing, but either way:
REF: SparkSQL error Table Not Found
In the second paragraph, instead of invoking the %sql syntax directly, try using the sqlContext as follows:
val my_df = sqlContext.sql("SELECT * from tweets LIMIT 5")
my_df.collect().foreach(println)
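If the table still appears to be missing, it is worth confirming that the streaming job has actually registered it in the same SQLContext before querying. A small sketch (assuming the Spark 1.x SQLContext from the tutorial):
// List the temp tables known to this SQLContext; "tweets" should appear
// once the first streaming batch has run and registerTempTable has executed.
println(sqlContext.tableNames().mkString(", "))

val counts = sqlContext.sql("SELECT createdAt, count(1) AS cnt FROM tweets GROUP BY createdAt ORDER BY createdAt")
counts.show()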

Spark and Drools integration (Reading rules from a drl file)

I am working on a Spark program that takes input from an RDD and runs a few Drools rules on it, reading the rules from a drl file.
In the drl file I have made a rule that wherever the hz attribute of the object is 0, it should increment the counter attribute by 1.
I have no clue why that is not working; it gives me an output of 0 for all the data in the stream (yes, there is data with the hz attribute equal to 0, and yes, I can print all the attributes and verify that even for those records the counter is 0).
I am using the KieSessionFactory class that I found in a GitHub project here: https://github.com/mganta/sprue/blob/master/src/main/java/com/cloudera/sprue/KieSessionFactory.java
But I am quite sure this part is not where the problem lies; it only reads from the drl file and applies the rules.
Below is my Scala code (I have marked the part where I think the problem lies, but please take a look at the drl file first):
package com.streams.Scala_Consumer
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.{ DStream, InputDStream, ConstantInputDStream }
import org.apache.spark.streaming.kafka.v09.KafkaUtils
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.sql.functions.avg
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.kafka.producer._
import org.apache.kafka.common.serialization.{ Deserializer, Serializer }
import org.apache.kafka.common.serialization.StringSerializer
import org.kie.api.runtime.StatelessKieSession
//import KieSessionFactory.getKieSession;
//import Sensor
object scala_consumer extends Serializable {
// schema for sensor data
class Sensor(resid_1: String, date_1: String, time_1: String, hz_1: Double, disp_1: Double, flo_1: Double, sedPPM_1: Double, psi_1: Double, chlPPM_1: Double, counter_1: Int) extends Serializable
{
var resid = resid_1
var date = date_1
var time = time_1
var hz = hz_1
var disp = disp_1
var flo = flo_1
var sedPPM = sedPPM_1
var psi = psi_1
var chlPPM = chlPPM_1
var counter = counter_1
def IncrementCounter (param: Int) =
{
counter = counter + param
}
}
// function to parse line of sensor data into Sensor class
def parseSensor(str: String): Sensor = {
val p = str.split(",")
//println("printing p: " + p)
new Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble, 0)
}
var counter = 0
val timeout = 10 // Terminate after N seconds
val batchSeconds = 2 // Size of batch intervals
def main(args: Array[String]): Unit = {
val brokers = "maprdemo:9092" // not needed for MapR Streams, needed for Kafka
val groupId = "testgroup"
val offsetReset = "latest"
val batchInterval = "2"
val pollTimeout = "1000"
val topics = "/user/vipulrajan/streaming/original:sensor"
val topica = "/user/vipulrajan/streaming/fail:test"
val xlsFileName = "./src/main/Rules.drl"
val sparkConf = new SparkConf().setAppName("SensorStream").setMaster("local[1]").set("spark.testing.memory", "536870912")
.set("spark.streaming.backpressure.enabled", "true")
.set("spark.streaming.receiver.maxRate", Integer.toString(2000000))
.set("spark.streaming.kafka.maxRatePerPartition", Integer.toString(2000000));
val ssc = new StreamingContext(sparkConf, Seconds(batchInterval.toInt))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG ->
"org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG ->
"org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> offsetReset,
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",
"spark.kafka.poll.time" -> pollTimeout
)
val producerConf = new ProducerConf(
bootstrapServers = brokers.split(",").toList
)
val messages = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)
val values: DStream[String] = messages.map(_._2)
println("message values received")
//values.print(10)
///////////*************************PART THAT COULD BE CAUSING A PROBLEM**************************/////////////
values.foreachRDD(x => try{
print("did 1\n") //markers for manual and minor debugging
val myData = x.mapPartitions(s => {s.map(sens => {parseSensor(sens)})})
//myData.collect().foreach(println)
//println(youData.date)
print("did 2\n")
val evalData = myData.mapPartitions(s => {
val ksession = KieSessionFactory.getKieSession(xlsFileName)
val retData = s.map(sens => {ksession.execute(sens); sens;})
retData
})
evalData.foreach(t => {println(t.counter)})
print("did 3\n")
}
catch{case e1: ArrayIndexOutOfBoundsException => println("exception in line " )})
///////////*************************PART THAT COULD BE CAUSING A PROBLEM**************************/////////////
println("filtered alert messages ")
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
}
}
the drl file
package droolsexample
import com.streams.Scala_Consumer.Sensor;
import scala.com.streams.Scala_Consumer.Sensor; //imported because my rules file lies in the src/main folder
//and code lies in src/main/scala
// declare any global variables here
dialect "java"
rule "Counter Incrementer"
when
sens : Sensor (hz == 0)
then
sens.IncrementCounter(1);
end
I have tried using an xls file instead of the drl file, and I have tried creating the class in Java and the object in Scala. I have tried a lot of other things, but all I get in the output is a warning:
6/06/27 16:38:30.462 Executor task launch worker-0 WARN AbstractKieModule: No files found for KieBase defaultKieBase
and when I print the counter values I get all zeroes. Anybody to the rescue?
When you are doing the spark-submit and passing your JAR for execution, please ensure that the other dependency JARs from KIE, etc., are also included within the same JAR, and then run it with spark-submit.
An alternative is to have two separate projects: one with your Spark program and another with your KIE project. You will then have two JARs, and you run it something like below:
nohup spark-submit --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/log4j.properties" \
--queue abc \
--master yarn \
--deploy-mode cluster \
--jars drools-kie-project-0.0.1-SNAPSHOT.jar --class com.abc.DroolsSparkJob SparkcallingDrools-0.0.1-SNAPSHOT.jar \
-inputfile /user/hive/warehouse/abc/* -output /user/hive/warehouse/drools-Op > app.log &
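Independently of how the JARs are packaged, the "No files found for KieBase defaultKieBase" warning usually means the rules never made it onto the runtime classpath. One way to check is to build the session from the classpath container and verify it; a sketch (assumptions: Drools 6.x, the drl packaged under src/main/resources with a META-INF/kmodule.xml that defines a default stateless session):
import org.kie.api.KieServices
import org.kie.api.builder.Message

// Load rules from the classpath (kmodule.xml + .drl packaged into the application JAR)
// instead of a local file path, and fail fast if the KieBase compiled empty or with errors.
val ks = KieServices.Factory.get()
val kContainer = ks.getKieClasspathContainer()
val results = kContainer.verify()
if (results.hasMessages(Message.Level.ERROR)) {
  sys.error("Rule compilation errors: " + results.getMessages)
}
// build the session like this inside mapPartitions on each executor
// and call kSession.execute(sensor), as the question already does
val kSession = kContainer.newStatelessKieSession()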

Spark Streaming not detecting new HDFS files

I am running the program below on Spark 1.3.1. Spark Streaming is watching a directory in HDFS for new files and should process them as they come in. I have read that the best way to do this is to move the files from an existing HDFS location so that the operation is atomic.
I start my streaming job, I add a bunch of small files to a random HDFS directory, then I move these files from the original HDFS directory to the watched HDFS directory (all with simple shell commands). But my streaming job is not recognizing these as new files and therefore not processing them.
Currently I am using textFileStream, but I am open to using fileStream. However, I am getting errors with this (a sketch of a working fileStream call is shown after the code below): val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///name/spark-streaming/data/", (p: Path) => true, false)
package com.com.spark.prototype
import java.io.FileInputStream
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark._
import org.apache.spark.streaming._
import com.twitter.algebird.HyperLogLogMonoid
import org.apache.hadoop.io._
object HLLStreamingHDFSTest {
def functionToCreateContext(): StreamingContext = {
val conf = new SparkConf().set("spark.executor.extraClassPath", "/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("/name/spark-streaming/checkpointing")
val lines = ssc.textFileStream("hdfs:///name/spark-streaming/data/")
val hll = new HyperLogLogMonoid(15)
var globalHll = hll.zero
val users = lines.map(_.toString().toCharArray.map(_.toByte))
val approxUsers = users.mapPartitions(ids => {
ids.map(id => hll(id))
}).reduce(_ + _)
approxUsers.foreachRDD(rdd => {
if (rdd.count() != 0) {
val partial = rdd.first()
globalHll += partial
println()
println()
println("Estimated distinct users this batch: %d".format(partial.estimatedSize.toInt))
println("Estimated distinct users this batch: %d".format(globalHll.estimatedSize.toInt))
println()
println("Approx distinct users this batch: %s".format(partial.approximateSize.toString))
println("Approx distinct users overall: %s".format(globalHll.approximateSize.toString))
}
})
ssc
}
def main(args: Array[String]): Unit = {
val context = StreamingContext.getOrCreate("hdfs:///name/spark-streaming/checkpointing", functionToCreateContext _)
context.start()
context.awaitTermination()
}
}
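The fileStream error mentioned in the question is most likely a type-bound issue: StreamingContext.fileStream expects a new-API input format (org.apache.hadoop.mapreduce), while the code above imports org.apache.hadoop.mapred.TextInputFormat. A minimal sketch of the call with the new-API import, offered as an assumption about the cause rather than a confirmed fix:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat // new-API format required by fileStream

// would replace the textFileStream line inside functionToCreateContext
val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///name/spark-streaming/data/", (p: Path) => true, newFilesOnly = false)
  .map(_._2.toString)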
