Spark streaming data sharing between batches

Spark Streaming processes data in micro-batches.
Each interval's data is processed in parallel using RDDs, without any data sharing between intervals.
But my use case needs to share data between intervals.
Consider the network word count example, which produces the count of all words received in that interval.
How would I produce the following word count?
Relative count for the words "hadoop" and "spark", i.e. the difference from the previous interval's count.
Normal word count for all other words.
Note: updateStateByKey does stateful processing, but it applies the function to every record rather than to particular records.
So, updateStateByKey doesn't fit this requirement.
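For reference, updateStateByKey keeps per-key state across batches, but its update function is invoked for every key in the state on every batch. A minimal sketch, assuming a DStream[(String, Int)] named pairs and a checkpoint directory already set via ssc.checkpoint(...):

// Runs for every key seen so far on every batch; there is no built-in way
// to restrict the stateful update to only "hadoop" and "spark".
val stateDstream = pairs.updateStateByKey[Int] { (newCounts: Seq[Int], state: Option[Int]) =>
  Some(newCounts.sum + state.getOrElse(0))
}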
Update:
Consider the following example.
Interval-1
Input:
Sample Input with Hadoop and Spark on Hadoop
Output:
hadoop 2
sample 1
input 1
with 1
and 1
spark 1
on 1
Interval-2
Input:
Another Sample Input with Hadoop and Spark on Hadoop and another hadoop another spark spark
Output:
another 3
hadoop 1
spark 2
and 2
sample 1
input 1
with 1
on 1
Explanation:
The 1st interval gives the normal word count of all words.
In the 2nd interval, hadoop occurred 3 times, but the output should be 1 (3 - 2).
spark occurred 3 times, but the output should be 2 (3 - 1).
For all other words, it should give the normal word count.
So, while processing the 2nd interval's data, it needs the 1st interval's word counts for hadoop and spark.
This is a simple illustrative example. In the actual use case, the fields that need data sharing are part of the RDD elements, and a huge number of values needs to be tracked: where this example tracks only the keywords hadoop and spark, nearly 100k keywords would need to be tracked.
Similar use cases in Apache Storm:
Distributed caching in storm
Storm TransactionalWords

This is possible by "remembering" the last RDD received and using a left join to merge that data with the next streaming batch. We use streamingContext.remember to keep the RDDs produced by the streaming process around for the time we need them.
We make use of the fact that dstream.transform is an operation that executes on the driver, and therefore we have access to all local object definitions. In particular, we want to update a mutable reference to the last RDD with the required value on each batch.
A piece of code probably makes the idea clearer:
// Configure the streaming context to remember the RDDs produced:
// choose at least 2x the time of the streaming interval.
ssc.remember(Seconds(xx)) // xx is a placeholder for your chosen duration

// Initialize "currentData" with an empty RDD of the expected type.
var currentData: RDD[(String, Int)] = sparkContext.emptyRDD

// Classic word count.
val w1dstream = dstream.map(elem => (elem, 1))
val count = w1dstream.reduceByKey(_ + _)

// Here's the key to making this work: note how we update the value of the last RDD after using it.
val diffCount = count.transform { rdd =>
  val interestingKeys = Set("hadoop", "spark")
  val interesting = rdd.filter { case (k, v) => interestingKeys(k) }
  // Keys absent from currentData keep their plain count (v2 defaults to 0).
  val countDiff = rdd.leftOuterJoin(currentData).map { case (k, (v1, v2)) => (k, v1 - v2.getOrElse(0)) }
  currentData = interesting
  countDiff
}

diffCount.print()
diffCount.print()

Related

Number of tasks in Apache Spark while writing into HDFS

I am trying to read a CSV file and then add some columns. After that I am trying to save it in ORC format.
I could not understand how Spark decides the number of tasks for different stages.
Why is the number of tasks for the CSV stage 1, while for the ORC stage it is 39?
val c1c8 = spark.read.option("header", true).csv("/user/DEEPAK_TEST/C1C6_NEW/")

val c1c8new = {
  c1c8.withColumnRenamed("c1c6_F", "c1c8")
    .withColumnRenamed("Network_Out", "c1c8_network")
    .withColumnRenamed("Access NE Out", "c1c8_access_ne")
    .withColumn("c1c8_signalling",
      when(col("signalling_Out") === "SIP Cl4", "SIP CL4")
        .when(col("signalling_Out") === "SIP cl4", "SIP CL4")
        .when(col("signalling_Out") === "Other", "other")
        .otherwise(col("signalling_Out")))
    .withColumnRenamed("access type Out", "c1c8_access_type")
    .withColumnRenamed("Type_of_traffic_C", "c1c8_typeoftraffic")
    .withColumnRenamed("BOS traffic type Out", "c1c8_bos_trafc_typ")
    .withColumnRenamed("Scope_Out", "c1c8_scope")
    .withColumnRenamed("Join with UP-DWN SIP cl5 T1T7 Out", "c1c8_join_indicator")
    .select("c1c8", "c1c8_network", "c1c8_access_ne", "c1c8_signalling", "c1c8_access_type",
      "c1c8_typeoftraffic", "c1c8_bos_trafc_typ", "c1c8_scope", "c1c8_join_indicator")
}

c1c8new.write.orc("/user/DEEPAK_TEST/C1C8_MAPPING_NEWT/")
Below is my understanding from looking at the Spark 2.x source code.
Stage 0 is a file scan that creates a FileScanRDD, an RDD that scans a list of file partitions. This stage can have more than one task when you are reading from multiple partitioned directories, such as a partitioned Hive table.
The number of tasks in Stage 1 will be equal to the number of RDD partitions. In your case c1c8new.rdd.getNumPartitions will be 39. This number is calculated from:
the config value spark.sql.files.maxPartitionBytes (128MB by default)
sparkContext.defaultParallelism, returned by the task scheduler (equal to the number of cores when running in local mode)
totalBytes, the total size of the input files
DataSourceScanExec.scala#L423
val defaultMaxSplitBytes =
  fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
val bytesPerCore = totalBytes / defaultParallelism

val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
  s"open cost is considered as scanning $openCostInBytes bytes.")
You can see the actual calculated values in the above log message if you set the log level to INFO: spark.sparkContext.setLogLevel("INFO").
In your case, I think the split size is 128MB, so the number of tasks/partitions is roughly 4.6GB / 128MB ≈ 37 (the per-file open cost and bin packing of the actual files account for the difference from the observed 39).
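As a rough back-of-the-envelope sketch of that formula (assuming a single ~4.6GB input and the default 128MB / 4MB values; the parallelism of 8 is an arbitrary placeholder):

// Not the exact Spark code path, just the arithmetic from the snippet above.
val defaultMaxSplitBytes = 128L * 1024 * 1024 // spark.sql.files.maxPartitionBytes
val openCostInBytes = 4L * 1024 * 1024        // spark.sql.files.openCostInBytes
val defaultParallelism = 8                    // placeholder: number of cores
val totalBytes = (4.6 * 1024 * 1024 * 1024).toLong + openCostInBytes
val bytesPerCore = totalBytes / defaultParallelism
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
println(s"estimated partitions: ${totalBytes / maxSplitBytes + 1}") // ~37 for 4.6GB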
As a side note, you can change the number of partitions (and hence the number of tasks in the subsequent stage) by using repartition() or coalesce() on the dataframe. More importantly, the number of partitions after a shuffle is determined by spark.sql.shuffle.partitions (200 by default). If you have a shuffle, it is better to use this configuration to control the number of tasks, because inserting repartition() or coalesce() between stages adds extra overhead.
For large Spark SQL workloads, setting optimal values for spark.sql.shuffle.partitions in each stage was always a pain point. Spark 3.x has better support for this if Adaptive Query Execution is enabled, but I haven't tried it for any production workloads.
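If you are on Spark 3.x, a minimal sketch of the relevant settings (the values are illustrative, not recommendations):

// Let AQE coalesce small shuffle partitions at runtime instead of
// hand-tuning spark.sql.shuffle.partitions for every stage.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
// Still acts as the upper bound on post-shuffle partitions before coalescing.
spark.conf.set("spark.sql.shuffle.partitions", "200")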

What is the best strategy to load huge datasets/data into Hive tables using Spark? [duplicate]

I am trying to move data from a table in PostgreSQL to a Hive table on HDFS. To do that, I came up with the following code:
val conf = new SparkConf().setAppName("Spark-JDBC")
  .set("spark.executor.heartbeatInterval", "120s")
  .set("spark.network.timeout", "12000s")
  .set("spark.sql.inMemoryColumnarStorage.compressed", "true")
  .set("spark.sql.orc.filterPushdown", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "512m")
  .set("spark.serializer", classOf[org.apache.spark.serializer.KryoSerializer].getName)
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
  .set("spark.yarn.driver.memoryOverhead", "7168")
  .set("spark.yarn.executor.memoryOverhead", "7168")
  .set("spark.sql.shuffle.partitions", "61")
  .set("spark.default.parallelism", "60")
  .set("spark.memory.storageFraction", "0.5")
  .set("spark.memory.fraction", "0.6")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "16g")
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")

val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()
def prepareFinalDF(splitColumns: List[String], textList: ListBuffer[String], allColumns: String,
                   dataMapper: Map[String, String], partition_columns: Array[String],
                   spark: SparkSession): DataFrame = {
  val colList = allColumns.split(",").toList
  val (partCols, npartCols) = colList.partition(p => partition_columns.contains(p.takeWhile(x => x != ' ')))
  val queryCols = npartCols.mkString(",") + ", 0 as " + flagCol + "," + partCols.reverse.mkString(",")
  val execQuery = s"select ${allColumns}, 0 as ${flagCol} from schema.tablename where period_year='2017' and period_num='12'"
  val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
    .option("dbtable", s"(${execQuery}) as year2017")
    .option("user", devUserName).option("password", devPassword)
    .option("partitionColumn", "cast_id")
    .option("lowerBound", 1).option("upperBound", 100000)
    .option("numPartitions", 70).load()
  val totalCols: List[String] = splitColumns ++ textList
  val cdt = new ChangeDataTypes(totalCols, dataMapper)
  hiveDataTypes = cdt.gpDetails()
  val fc = prepareHiveTableSchema(hiveDataTypes, partition_columns)
  val allColsOrdered = yearDF.columns.diff(partition_columns) ++ partition_columns
  val allCols = allColsOrdered.map(colname => org.apache.spark.sql.functions.col(colname))
  val resultDF = yearDF.select(allCols: _*)
  val stringColumns = resultDF.schema.fields.filter(x => x.dataType == StringType).map(s => s.name)
  val finalDF = stringColumns.foldLeft(resultDF) { (tempDF, colName) =>
    tempDF.withColumn(colName, regexp_replace(regexp_replace(col(colName), "[\r\n]+", " "), "[\t]+", " "))
  }
  finalDF
}
val dataDF = prepareFinalDF(splitColumns, textList, allColumns, dataMapper, partition_columns, spark)
val dataDFPart = dataDF.repartition(30)
dataDFPart.createOrReplaceTempView("preparedDF")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql(s"INSERT OVERWRITE TABLE schema.hivetable PARTITION(${prtn_String_columns}) select * from preparedDF")
The data is inserted into the Hive table, dynamically partitioned based on prtn_String_columns: source_system_name, period_year, period_num.
Spark-submit used:
SPARK_MAJOR_VERSION=2 spark-submit --conf spark.ui.port=4090 --driver-class-path /home/fdlhdpetl/jars/postgresql-42.1.4.jar --jars /home/fdlhdpetl/jars/postgresql-42.1.4.jar --num-executors 80 --executor-cores 5 --executor-memory 50G --driver-memory 20G --driver-cores 3 --class com.partition.source.YearPartition splinter_2.11-0.1.jar --master=yarn --deploy-mode=cluster --keytab /home/fdlhdpetl/fdlhdpetl.keytab --principal fdlhdpetl@FDLDEV.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Splinter --conf spark.executor.extraClassPath=/home/fdlhdpetl/jars/postgresql-42.1.4.jar
The following error messages are generated in the executor logs:
Container exited with a non-zero exit code 143.
Killed by external signal
18/10/03 15:37:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[SIGTERM handler,9,system]
java.lang.OutOfMemoryError: Java heap space
at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
at java.util.zip.ZipFile$ZipFileInflaterInputStream.<init>(ZipFile.java:393)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:374)
at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)
at java.util.jar.JarFile.getManifest(JarFile.java:180)
at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:944)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:99)
at sun.misc.Signal$1.run(Signal.java:212)
at java.lang.Thread.run(Thread.java:745)
I see in the logs that the read is executed properly with the given number of partitions, as below:
Scan JDBCRelation((select column_names from schema.tablename where period_year='2017' and period_num='12') as year2017) [numPartitions=50]
Below is the state of the executors in the stages:
The data is not being partitioned properly: one partition is smaller while the other one becomes huge. There is a skew problem here.
While inserting the data into the Hive table, the job fails at the line spark.sql(s"INSERT OVERWRITE TABLE schema.hivetable PARTITION(${prtn_String_columns}) select * from preparedDF"), but I understand this is happening because of the data skew problem.
I tried increasing the number of executors, the executor memory and the driver memory, and tried to just save as a CSV file instead of saving the dataframe into a Hive table, but nothing stops the execution from throwing the exception:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Is there anything in the code that I need to correct? Could anyone let me know how I can fix this problem?
Determine how many partitions you need given the amount of input data and your cluster resources. As a rule of thumb, it is better to keep partition input under 1GB unless strictly necessary, and strictly smaller than the block-size limit.
You've previously stated that you migrate 1TB of data; the values you use in different posts (5-70) are likely way too low to ensure a smooth process.
Try to use a value which won't require further repartitioning.
Know your data.
Analyze the columns available in the dataset to determine if there are any columns with high cardinality and a uniform distribution, to be distributed among the desired number of partitions. These are good candidates for the import process. Additionally, you should determine an exact range of values.
Aggregations with different centrality and skewness measures, as well as histograms and basic counts-by-key, are good exploration tools. For this part it is better to analyze the data directly in the database instead of fetching it to Spark.
Depending on the RDBMS, you might be able to use width_bucket (PostgreSQL, Oracle) or an equivalent function to get a decent idea of how the data will be distributed in Spark after loading with partitionColumn, lowerBound, upperBound, numPartitions:
s"""(SELECT width_bucket($partitionColum, $lowerBound, $upperBound, $numPartitons) AS bucket, COUNT(*)
FROM t
GROUP BY bucket) as tmp)"""
If there are no columns which satisfy the above criteria, consider:
Creating a custom one and exposing it via a view. Hashes over multiple independent columns are usually good candidates. Please consult your database manual to determine functions that can be used here (DBMS_CRYPTO in Oracle, pgcrypto in PostgreSQL)*.
Using a set of independent columns which, taken together, provide high enough cardinality.
Optionally, if you're going to write to a partitioned Hive table, you should consider including the Hive partitioning columns. It might limit the number of files generated later.
Prepare the partitioning arguments.
If the column selected or created in the previous steps is numeric (or date/timestamp in Spark >= 2.4), provide it directly as the partitionColumn and use the range values determined before to fill lowerBound and upperBound.
If the bound values don't reflect the properties of the data (min(col) for lowerBound, max(col) for upperBound), it can result in significant data skew, so tread carefully. In the worst-case scenario, when the bounds don't cover the range of the data, all records will be fetched by a single machine, making it no better than no partitioning at all; a sketch of fetching real bounds is shown below.
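A minimal sketch of fetching the real bounds from the source before the partitioned read, reusing the connectionUrl, devUserName and devPassword values from the question (the column and table names are placeholders):

// Fetch min/max of the partition column so lowerBound/upperBound cover the data.
val bounds = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", "(SELECT min(cast_id) AS lo, max(cast_id) AS hi FROM schema.tablename) AS b")
  .option("user", devUserName).option("password", devPassword)
  .load()
  .first()
// Adjust the getters to the column's actual type.
val (lowerBound, upperBound) = (bounds.getLong(0), bounds.getLong(1))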
If the column selected in the previous steps is categorical, or is a set of columns, generate a list of mutually exclusive predicates that fully cover the data, in a form that can be used in a SQL WHERE clause.
For example, if you have a column A with values {a1, a2, a3} and a column B with values {b1, b2, b3}:
val predicates = for {
  a <- Seq("a1", "a2", "a3")
  b <- Seq("b1", "b2", "b3")
} yield s"A = '$a' AND B = '$b'" // quote the values if A and B are string columns
Double-check that the conditions don't overlap and that all combinations are covered. If these conditions are not satisfied, you end up with duplicates or missing records, respectively.
Pass the data as the predicates argument of the jdbc call. Note that the number of partitions will be exactly equal to the number of predicates.
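A minimal sketch of that call, reusing the connectionUrl, devUserName and devPassword values from the question:

import java.util.Properties

val props = new Properties()
props.setProperty("user", devUserName)
props.setProperty("password", devPassword)

// One partition per predicate; together the predicates must cover every row exactly once.
val df = spark.read.jdbc(connectionUrl, "schema.tablename", predicates.toArray, props)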
Put the database in read-only mode (any ongoing writes can cause data inconsistency; if possible, you should lock the database before you start the whole process, although that might not be possible in your organization).
If the number of partitions matches the desired output, load the data without repartitioning and dump it directly to the sink; if not, you can try to repartition following the same rules as in step 1.
If you still experience problems, make sure that you've properly configured the Spark memory and GC options.
If none of the above works:
Consider dumping your data to network/distributed storage using tools like COPY TO and reading it directly from there.
Note that for standard database utilities you will typically need a POSIX-compliant file system, so HDFS usually won't do.
The advantage of this approach is that you don't need to worry about the column properties, and there is no need to put the data in read-only mode to ensure consistency.
Using dedicated bulk-transfer tools, like Apache Sqoop, and reshaping the data afterwards.
* Don't use pseudocolumns - Pseudocolumn in Spark JDBC.
In my experience there are 4 kinds of memory settings which make a difference:
A) [1] memory for storing data for processing purposes VS [2] heap space for holding the program stack
B) [1] executor VS [2] driver memory
Up to now, I was always able to get my Spark jobs running successfully by increasing the appropriate kind of memory:
A2-B2 would therefore be the memory available on the driver to hold the program stack, etc.
The property names are as follows:
A1-B1) executor-memory
A1-B2) driver-memory
A2-B1) spark.yarn.executor.memoryOverhead
A2-B2) spark.yarn.driver.memoryOverhead
Keep in mind that the sum of all *-B1 must be less than the memory available on your workers, and the sum of all *-B2 must be less than the memory on your driver node.
My bet would be that the culprit is one of the memory overhead settings (A2-B1 or A2-B2).
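A minimal sketch of how these four knobs map onto configuration properties (the sizes are illustrative placeholders, not recommendations):

import org.apache.spark.SparkConf

// A1-*: heap for executors/driver; A2-*: off-heap overhead on YARN.
// Note: driver memory generally has to be set before the driver JVM starts
// (e.g. via spark-submit --driver-memory) rather than in code.
val memConf = new SparkConf()
  .set("spark.executor.memory", "8g")                // A1-B1 (--executor-memory)
  .set("spark.driver.memory", "4g")                  // A1-B2 (--driver-memory)
  .set("spark.yarn.executor.memoryOverhead", "2048") // A2-B1, in MB
  .set("spark.yarn.driver.memoryOverhead", "1024")   // A2-B2, in MB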
There was another question of yours routed here as a duplicate:
'How to avoid data skew while reading huge datasets or tables into Spark?
The data is not being partitioned properly. One partition is smaller while the other one becomes huge on read.
I observed that one of the partitions has nearly 2 million rows, and while inserting there is a skew in partition.'
If the problem is dealing with how the data is partitioned in the dataframe after the read, have you played around with increasing the "numPartitions" value?
.option("numPartitions", 50)
lowerBound and upperBound form the partition strides for the generated WHERE clause expressions, and numPartitions determines the number of splits.
Say, for example, sometable has a column ID (we choose that as the partitionColumn); the value range we see in the table for column ID is from 1 to 1000, and we want to get all the records by running select * from sometable.
So we go with lowerBound = 1, upperBound = 1000 and numPartitions = 4.
This will produce a dataframe of 4 partitions, one per query, by building SQL from our inputs (lowerBound = 1, upperBound = 1000, numPartitions = 4):
select * from sometable where ID < 250
select * from sometable where ID >= 250 and ID < 500
select * from sometable where ID >= 500 and ID < 750
select * from sometable where ID >= 750
What if most of the records in our table fall within the range ID (500, 750)? That's the situation you are in.
When we increase numPartitions, the split happens even further and that reduces the volume of records in the same partition, but this is not a precise fix.
Instead of Spark splitting the partitionColumn based on the boundaries we provide, if you feed the splits yourself the data can be split evenly. You need to switch over to another JDBC method where, instead of (lowerBound, upperBound & numPartitions), we provide the predicates directly:
def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame
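A minimal sketch of hand-built predicates for the skewed example above, slicing the dense (500, 750) range finer so partitions carry similar row counts (jdbcUrl and the cut points are illustrative placeholders):

import java.util.Properties

// Finer strides over the dense ID range, coarser elsewhere.
// Note: a row with a NULL ID would be missed; add "OR ID IS NULL"
// to one predicate if the column is nullable.
val predicates = Array(
  "ID < 500",
  "ID >= 500 AND ID < 550",
  "ID >= 550 AND ID < 600",
  "ID >= 600 AND ID < 650",
  "ID >= 650 AND ID < 700",
  "ID >= 700 AND ID < 750",
  "ID >= 750")

val df = spark.read.jdbc(jdbcUrl, "sometable", predicates, new Properties())
// df has exactly predicates.length (here 7) partitions.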

Why are my count, distinct and distinct count so slow on a huge cluster in Spark?

I have a very large cluster of 20 m4.xlarge instances.
I have a file of size 20GB, and the count of records in the file is 193,944,092.
From this file I need three pieces of information:
1. Total number of records
2. Total number of distinct records
3. Total number of distinct records based on one column (FundamentalSeriesId)
When I run the code below it takes a very long time. Counting the total number of records took 7 minutes.
But for the total number of distinct records and the distinct count of the FundamentalSeriesId column, I cancelled the query because it was taking too long.
If anyone can improve my code that would be great. Can I use cache or something else to get the info faster?
This is what I am doing:
val rdd = sc.textFile("s3://kishore-my-bucket-trf/Fundamental.FundamentalAnalytic.FundamentalAnalytic.SelfSourcedPublic.2011.1.2018-02-18-1340.Full.txt.gz")
println("Total count="+rdd.count())
val header = rdd.filter(_.contains("FundamentalSeriesId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("FundamentalSeriesId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
println("distinct count="+data.distinct.count())
val data1=data.select($"FundamentalSeriesId")
println("count of distinct FundamentalSeriesId column="+data1.distinct.count())
My sample records are like this:
FundamentalSeriesId|^|FundamentalSeriesId.objectTypeId|^|FundamentalSeriesId.objectType_1|^|financialPeriodEndDate|^|financialPeriodType|^|lineItemId|^|analyticItemInstanceKey_1|^|AnalyticValue_1|^|AnalyticConceptCode_1|^|AnalyticValue.currencyId_1|^|AnalyticIsEstimated_1|^|AnalyticAuditabilityEquation_1|^|FinancialPeriodTypeId_1|^|AnalyticConceptId_1|^|sYearToDate|^|IsAnnual_1|^|TaxonomyId_1|^|InstrumentId_1|^|AuditID_1|^|PhysicalMeasureId_1|^|FFAction_1
Distinct is a common performance problem in Spark; use countApproxDistinct instead if you can.
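A minimal sketch of the approximate alternatives, reusing the rdd and data values from the question (assumes Spark 2.1+ for approx_count_distinct; the 0.01 relative error is an illustrative choice):

import org.apache.spark.sql.functions.approx_count_distinct

// RDD API: HyperLogLog-based estimate of the number of distinct rows.
val approxDistinctRows = rdd.countApproxDistinct(relativeSD = 0.01)

// DataFrame API: approximate distinct count of a single column.
val approxDistinctIds =
  data.agg(approx_count_distinct($"FundamentalSeriesId", 0.01)).first().getLong(0)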
A distinct count shuffles all the data across executors, so try increasing the executor memory to the maximum; it can reduce the time.
Try to cache the data, so you can eliminate the disk IO.
Try to use
val rdd = sc.textFile("s3://your_path").cache()
because when you compute .count(), Spark reads the file every time for every .count() call; once you use .cache(), it will read the file only once.

Spark RDD does not get processed in multiple nodes

I have a use case where I create an RDD from a Hive table. I wrote business logic that operates on every row of the Hive table. My assumption was that when I create the RDD and span a map process over it, it utilises all my Spark executors. But what I see in my log is that only one node processes the RDD while the rest of my 5 nodes sit idle. Here is my code:
val flow = hiveContext.sql("select * from humsdb.t_flow")
var x = flow.rdd.map { row =>
< do some computation on each row>
}
Any clue where I went wrong?
As specified here by @jaceklaskowski:
By default, a partition is created for each HDFS partition, which by
default is 64MB (from Spark's Programming Guide).
If your input data is less than 64MB (and you are using HDFS), then by default only one partition will be created.
Spark will use all the nodes when the data is big enough.
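A quick way to confirm this is to print the partition count before the map (a one-line diagnostic against the flow dataframe from the question):

// If this prints 1, only one task (and so at most one node) runs the map.
println(s"partitions = ${flow.rdd.getNumPartitions}")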
Could there be a possibility that your data is skewed?
To rule out this possibility, do the following and rerun the code.
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(200)
var x = flow.rdd.map { row =>
< do some computation on each row>
}
Further, if your map logic depends on a particular column, you can do the following:
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(col("yourColumnName"))
var x = flow.rdd.map { row =>
< do some computation on each row>
}
A good partition column could be a date column.

What is the exact difference between Spark transform in DStream and map?

I am trying to understand the transform operation on Spark DStreams in Spark Streaming.
I know that transform is far more capable than map, but
can someone give me a real-time or clear example that differentiates transform and map?
The transform function in Spark Streaming allows you to use any of Apache Spark's transformations on the underlying RDDs of the stream. map is used for an element-to-element transformation and could be implemented using transform. Essentially, map works on the elements of the DStream while transform allows you to work with the RDDs of the DStream. You may find http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams useful.
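To make that equivalence concrete, a minimal sketch (lines is an assumed DStream[String]):

// Identical results: map works per element, while transform hands you
// the whole RDD of each batch, on which you can call any RDD operation.
val viaMap = lines.map(_.toUpperCase)
val viaTransform = lines.transform(rdd => rdd.map(_.toUpperCase))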
map is an element-wise transformation, whereas transform is an RDD-to-RDD transformation.
map
map(func) : Return a new DStream by passing each element of the source
DStream through a function func.
Here is an example which demonstrates both the map operation and the transform operation on a DStream:
import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setMaster("local[*]").setAppName("StreamingTransformExample")
val ssc = new StreamingContext(conf, Seconds(5))

val rdd1 = ssc.sparkContext.parallelize(Array(1, 2, 3))
val rdd2 = ssc.sparkContext.parallelize(Array(4, 5, 6))

val rddQueue = new Queue[RDD[Int]]
rddQueue.enqueue(rdd1)
rddQueue.enqueue(rdd2)
val numsDStream = ssc.queueStream(rddQueue, true)

val plusOneDStream = numsDStream.map(x => x + 1)
plusOneDStream.print()
The map operation adds 1 to each element in all the RDDs within the DStream, giving the output shown below:
-------------------------------------------
Time: 1501135220000 ms
-------------------------------------------
2
3
4
-------------------------------------------
Time: 1501135225000 ms
-------------------------------------------
5
6
7
-------------------------------------------
transform
transform(func) : Return a new DStream by applying a RDD-to-RDD function
to every RDD of the source DStream. This can be used to do arbitrary
RDD operations on the DStream.
val commonRdd = ssc.sparkContext.parallelize(Array(0))
val combinedDStream = numsDStream.transform(rdd=>(rdd.union(commonRdd)))
combinedDStream.print()
transform allows you to perform RDD operations, such as join, union etc., on the RDDs within the DStream; the example code given here will produce the output below:
-------------------------------------------
Time: 1501135490000 ms
-------------------------------------------
1
2
3
0
-------------------------------------------
Time: 1501135495000 ms
-------------------------------------------
4
5
6
0
-------------------------------------------
Time: 1501135500000 ms
-------------------------------------------
0
-------------------------------------------
Time: 1501135505000 ms
-------------------------------------------
0
-------------------------------------------
Here the commonRdd, which contains the single element 0, is unioned with all the underlying RDDs within the DStream.
A DStream holds several RDDs, since every batch interval produces a different RDD.
So by using transform(), you get the chance to apply an RDD operation to the
entire DStream.
Example from Spark Docs:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
The transform function in Spark Streaming allows you to perform any transformation on the underlying RDDs of the stream. For example, you can join two RDDs in streaming using transform, where one RDD is made from a text file or a parallelized collection and the other RDD comes from a stream of text files/sockets etc.
map works on each element of the RDD in a particular batch and results in the RDD produced by applying the function passed to map. A sketch of the join case follows.
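A minimal sketch of that join, assuming a (word, count) pair DStream named wordCounts and a static blacklist built on the driver:

// Static reference data, built once and reused on every batch.
val blacklistRDD = ssc.sparkContext.parallelize(Seq(("spam", true)))

// Join each batch's RDD against the static RDD; map alone cannot express
// this, because it only ever sees one element at a time.
val cleaned = wordCounts.transform { rdd =>
  rdd.leftOuterJoin(blacklistRDD)
    .filter { case (_, (_, flagged)) => !flagged.getOrElse(false) }
    .map { case (word, (count, _)) => (word, count) }
}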
Example 1: men in a queue come into a room, change their dress, and then marry a woman of their choice.
1) Changing the dress is the map operation (they transform themselves in their own attributes).
2) Marrying a woman is a merge/filter operation on you, but under the influence of others, which we can call the real transform operation.
Example 2: students enter college; a few attend 2 lectures, a few others attend 4, and so on.
1) Attending lectures is the map operation; it is what the students themselves are doing.
2) But identifying what a lecturer taught them depends on the lecturer RDD's data, his agenda for the day.
Think of the transform operation as a dimension or static table that you want to filter or verify against to identify the right data for you, removing what may be junk.
If I have data "Hello How" from 0-1s and "Are You" for the next 1-2s, then the map and reduceByKey example shown above will produce the output (hello,1) and (how,1) for the first batch and (are,1) and (you,1) for the next batch. But for the next example, using the transform function, what will the difference in the output be?
