Filtering records for all values of an array in Spark - apache-spark

I am very new to Spark.
I have a very basic question. I have an array of values:
listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A)
I want to filter an RDD for all of these token values. I tried the following way:
val ECtokens = for (token <- listofECtokens) rddAll.filter(line => line.contains(token))
Output:
ECtokens: Unit = ()
I got an empty Unit even when there are records with these tokens. What am I doing wrong?

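A for loop without yield runs only for its side effects and returns Unit, which is why ECtokens is (). You can get the result more efficiently with a single filter, and the result is a filtered RDD: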
val filteredRDD = rddAll.filter(line => listofECtokens.exists(line.contains))
Then, to get the result as an array, call collect or take on filteredRDD:
// collect brings the whole RDD to the driver, so be careful: it can cause an OutOfMemoryError on that machine
val ECtokens = filteredRDD.collect()
// if you only need to look at a few elements of the RDD, take() is a safer approach
val firstTokens = filteredRDD.take(5)
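The difference between the two approaches can be reproduced without a cluster. The following is a minimal sketch that uses plain Scala collections as a stand-in for the RDD; the sample lines and the object name TokenFilter are made up for illustration.

```scala
object TokenFilter {
  // Hypothetical sample data standing in for listofECtokens and rddAll.
  val tokens: Array[String] = Array("EC-17A5206955089011B", "EC-17A5206955089011A")
  val lines: Seq[String] = Seq(
    "id=1 token=EC-17A5206955089011B",
    "id=2 token=EC-00000000000000000",
    "id=3 token=EC-17A5206955089011A"
  )

  // A for loop without yield discards each filtered result and returns Unit.
  val withoutYield: Unit =
    for (token <- tokens) lines.filter(line => line.contains(token))

  // Keep a line if it contains at least one of the tokens.
  val matching: Seq[String] =
    lines.filter(line => tokens.exists(token => line.contains(token)))

  def main(args: Array[String]): Unit = {
    println(withoutYield) // prints ()
    matching.foreach(println)
  }
}
```

The same predicate works unchanged inside rddAll.filter, since it only depends on the line and the token array.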


Create RDD from RDD entry inside foreach loop

I have some custom logic that looks at elements in an RDD and would like to conditionally write to a TempView via the UNION approach using foreach, as per below:
rddX.foreach{ x => {
// Do something, some custom logic
...
val y = create new RDD from this RDD element x
...
or something else
// UNION to TempView
...
}}
Something really basic that I do not get:
How can I convert the nth entry (x) of the RDD into an RDD of length 1?
Or convert the nth entry (x) directly to a DF?
I understand the set-based approaches, but here, for simplicity, I want to append immediately when a condition is met, i.e. at the level of an individual entry in the RDD.
Before this gets a -1 like SO 41356419: I am only suggesting this because I have a specific use case, and to mutate a TempView in Spark SQL I do need such an approach, at least that is my thinking. It is not a typical Spark use case, but it is what I am facing.
Thanks in advance
First of all, you can't create an RDD or DataFrame inside foreach() (or any other function) of another RDD or DataFrame/Dataset. But you can get the nth element from an RDD and create a new RDD containing just that element.
EDIT:
The solution, however, is much simpler:
import org.apache.spark.{SparkConf, SparkContext}
object Main {
val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(conf)
def main(args: Array[String]): Unit = {
val n = 534 // This is the input value (index of the element we're interested in)
sc.setLogLevel("ERROR")
// Creating dummy rdd
val rdd = sc.parallelize(0 to 999).cache()
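// zipWithIndex yields (element, index) pairs: match on the index, then keep only the element
val singletonRdd = rdd.zipWithIndex().filter(pair => pair._2 == n).map(_._1)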
}
}
Hope that helps!
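The index-based selection can be tried on an ordinary collection first. This is a minimal sketch with plain Scala collections rather than an RDD; the helper name singleton and the sample data are hypothetical. On an RDD, zipWithIndex produces Long indices, but the shape of the pairs is the same.

```scala
object NthElement {
  // Returns a collection holding only the element at position n (empty if n is out of range).
  def singleton[A](items: Seq[A], n: Int): Seq[A] =
    items.zipWithIndex.collect { case (item, index) if index == n => item }

  def main(args: Array[String]): Unit = {
    val items = Seq("a", "b", "c", "d") // hypothetical data
    println(singleton(items, 2)) // prints List(c)
  }
}
```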

merge spark dStream with variable to saveToCassandra()

I have a DStream[(String, Int)] with pairs of word counts, e.g. ("hello" -> 10). I want to write these counts to Cassandra with a step index. The index is initialized as var step = 1 and is incremented with each microbatch processed.
The Cassandra table was created as:
CREATE TABLE wordcounts (
step int,
word text,
count int,
primary key (step, word)
);
When trying to write the stream to the table...
stream.saveToCassandra("keyspace", "wordcounts", SomeColumns("word", "count"))
... I get java.lang.IllegalArgumentException: Some primary key columns are missing in RDD or have not been selected: step.
How can I prepend the step index to the stream in order to write the three columns together?
I'm using spark 2.0.0, scala 2.11.8, cassandra 3.4.0 and spark-cassandra-connector 2.0.0-M3.
As noted, while the Cassandra table expects something of the form (Int, String, Int), the wordCount DStream is of type DStream[(String, Int)], so for the call to saveToCassandra(...) to work, we need a DStream of type DStream[(Int, String, Int)].
The tricky part in this question is how to bring a local counter, that is by definition only known in the driver, up to the level of the DStream.
To do that, we need to do two things: "lift" the counter to a distributed level (in Spark, we mean "RDD" or "DataFrame") and join that value with the existing DStream data.
Departing from the classic Streaming word count example:
// Split each line into words
val words = lines.flatMap(_.split(" "))
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
We add a local var to hold the count of the microbatches:
@transient var batchCount = 0
It's declared transient, so that Spark doesn't try to close over its value when we declare transformations that use it.
Now the tricky bit: within the context of a DStream transformation, we make an RDD out of that single variable and join it with the underlying RDD of the DStream using a cartesian product:
val batchWordCounts = wordCounts.transform{ rdd =>
batchCount = batchCount + 1
val localCount = sparkContext.parallelize(Seq(batchCount))
rdd.cartesian(localCount).map{case ((word, count), batch) => (batch, word, count)}
}
Note that a simple map function would not work, as only the initial value of the variable would be captured and serialized. It would therefore look like the counter never increased when looking at the DStream data.
Finally, now that the data is in the right shape, save it to Cassandra:
batchWordCounts.saveToCassandra("keyspace", "wordcounts")
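To see what shape the cartesian step produces, here is a minimal sketch on plain Scala collections, not on a DStream. The helper tagWithBatch and the sample counts are hypothetical; pairing every record with a one-element collection is what the cartesian product with localCount amounts to.

```scala
object BatchTagging {
  // Pair every (word, count) with the single batch number, then reorder the fields
  // into (step, word, count), the column order of the Cassandra table.
  def tagWithBatch(counts: Seq[(String, Int)], batch: Int): Seq[(Int, String, Int)] =
    for {
      (word, count) <- counts
      b <- Seq(batch) // one-element collection, like the parallelized counter
    } yield (b, word, count)

  def main(args: Array[String]): Unit = {
    val counts = Seq("hello" -> 10, "world" -> 3) // hypothetical micro-batch
    println(tagWithBatch(counts, 2)) // prints List((2,hello,10), (2,world,3))
  }
}
```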
Spark provides the updateStateByKey function for handling global state.
For this case it could look something like the following:
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
val newCount: Int = runningCount.getOrElse(0) + 1
Some(newCount)
}
val step = stream.updateStateByKey(updateFunction _)
stream.join(step).map { case (key, (count, step)) => (step, key, count) }
.saveToCassandra("keyspace", "wordcounts")
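The update function itself is ordinary Scala, so its behaviour can be checked outside of Spark Streaming. The sketch below folds it over a few simulated micro-batches for a single key; the helper stateAfter and the batch values are hypothetical and only imitate how the state would be threaded from one batch to the next.

```scala
object StepCounter {
  // Same shape as the update function above: ignore the new values, bump the stored counter.
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
    Some(runningCount.getOrElse(0) + 1)

  // Feed the state of one key through a sequence of micro-batches (hypothetical helper).
  def stateAfter(batches: Seq[Seq[Int]]): Option[Int] =
    batches.foldLeft(Option.empty[Int])((state, values) => updateFunction(values, state))

  def main(args: Array[String]): Unit = {
    // Three micro-batches in which the key was seen with counts 10, 4 and 7.
    println(stateAfter(Seq(Seq(10), Seq(4), Seq(7)))) // prints Some(3)
  }
}
```

Keep in mind that this state is tracked separately for each key rather than as one global step counter.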
Since you are trying to save the RDD to an existing Cassandra table, you need to include all the primary key column values in the RDD.
Alternatively, you can save the RDD to a new table with one of these methods:
saveAsCassandraTable or saveAsCassandraTableEx
For more info look into this.

Spark 1.6.2's RDD caching seems to do weird things with filters in some cases

I have an RDD:
avroRecord: org.apache.spark.rdd.RDD[com.rr.eventdata.ViewRecord] = MapPartitionsRDD[75]
I then filter the RDD for a single matching value:
val siteFiltered = avroRecord.filter(_.getSiteId == 1200)
I now count how many distinct values I get for SiteId. Given the filter it should be "1". Here are two ways I do it, without cache and with cache:
val basic = siteFiltered.map(_.getSiteId).distinct.count
val cached = siteFiltered.cache.map(_.getSiteId).distinct.count
The result indicates that the cached version isn't filtered at all:
basic: Long = 1
cached: Long = 93
"93" isn't even the expected value if the filter was ignored completely (that answer is "522"). It also isn't a problem with "distinct" as the values are real ones.
It seems like the cached RDD has some odd partial version of the filter.
Anyone know what's going on here?
I suppose the problem is that you have to cache the RDD before running any action on it.
Spark builds a DAG that represents the execution of your program. Each node is a transformation or an action on your RDD. Without caching the RDD, each action forces Spark to execute the whole DAG from the beginning (or from the last cache invocation).
So your code should work if you make the following changes:
val siteFiltered =
avroRecord.filter(_.getSiteId == 1200)
.map(_.getSiteId).cache
val basic = siteFiltered.distinct.count
// Yes, I know, written this way the second count makes no sense at all
val cached = siteFiltered.distinct.count
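The recomputation described above can be illustrated by analogy with a lazy Scala view. This is only a sketch with plain collections, not Spark: the view plays the role of an uncached RDD, the materialized list plays the role of a cached one, and the counter and names are made up for the example.

```scala
object LazyRecompute {
  // Counts how often the "transformation" actually runs.
  var evaluations = 0

  def expensive(x: Int): Int = { evaluations += 1; x * 2 }

  // Returns the evaluation count after the uncached traversals and after all traversals.
  def run(): (Int, Int) = {
    evaluations = 0
    val pipeline = (1 to 5).view.map(expensive) // lazy, like an uncached RDD
    pipeline.sum
    pipeline.sum
    val uncached = evaluations // each traversal recomputed all 5 elements
    val materialized = pipeline.toList // computed once, like a cached RDD
    materialized.sum
    materialized.sum
    (uncached, evaluations)
  }

  def main(args: Array[String]): Unit = println(run()) // prints (10,15)
}
```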
There is no issue with your code; it should work fine.
I tried the same thing locally and it works without any discrepancies across multiple runs.
I have the following data:
Event1,11.4
Event2,82.0
Event3,53.8
Event4,31.0
Event5,22.6
Event6,43.1
Event7,11.0
Event8,22.1
Event8,22.1
Event8,22.1
Event8,22.1
Event9,3.2
Event10,13.1
Event9,3.2
Event10,13.1
Event9,3.2
Event10,13.1
Event11,3.22
Event12,13.11
I tried the same thing as you did; the following code works fine:
scala> var textrdd = sc.textFile("file:///data/pocs/blogs/eventrecords");
textrdd: org.apache.spark.rdd.RDD[String] = file:///data/pocs/blogs/eventrecords MapPartitionsRDD[123] at textFile at <console>:27
scala> var filteredRdd = textrdd.filter(_.split(",")(1).toDouble > 1)
filteredRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[124] at filter at <console>:29
scala> filteredRdd.map(x => x.split(",")(1)).distinct.count
res36: Long = 12
scala> filteredRdd.cache.map(x => x.split(",")(1)).distinct.count
res37: Long = 12

Vertically partition an RDD and write to separate locations

In spark 1.5+ how can I write each column of an "n"-tuple RDD to different locations?
For example if I had a RDD[(String, String)] I would like the first column to be written to s3://bucket/first-col and the second to s3://bucket/second-col
I could do the following
val pairRDD: RDD[(String, String)]
val cachedRDD = pairRDD.cache()
cachedRDD.map(_._1).saveAsTextFile("s3://bucket/first-col")
cachedRDD.map(_._2).saveAsTextFile("s3://bucket/second-col")
But this is far from ideal since it needs two passes over the RDD.
One way to do this is to convert the tuples into lists, then use map to create one RDD per column and save each of them, as follows:
val fileNames: List[String] // one output path per column
val input: RDD[(String, String)] // could be a tuple of any size
val numCols = fileNames.size
val columnIDs = 0 until numCols // list indices are 0-based
val unzippedValues = input.map(_.productIterator.toList).persist() // converts each tuple into a list
val columnRDDs = columnIDs.map(a => unzippedValues.map(_(a)))
columnRDDs.zip(fileNames).foreach { case (b, fName) => b.saveAsTextFile(fName) }
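The tuple-to-list conversion and the per-column extraction can be checked on a small in-memory collection. This is a minimal sketch with plain Scala collections instead of RDDs; the helper columns and the sample rows are hypothetical.

```scala
object ColumnSplit {
  // Turn each tuple into a list, then pull out one column per 0-based index.
  def columns(rows: Seq[Product], numCols: Int): Seq[Seq[Any]] = {
    val asLists = rows.map(_.productIterator.toList)
    (0 until numCols).map(i => asLists.map(_(i)))
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("a", "1"), ("b", "2")) // hypothetical two-column data
    println(columns(rows, 2)) // prints Vector(List(a, b), List(1, 2))
  }
}
```

Because productIterator yields Any, the element types are lost; that is fine for saving as text but would need a cast for typed processing.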

Apache Spark: Splitting Pair RDD into multiple RDDs by key to save values

I am using Spark 1.0.1 to process a large amount of data. Each row contains an ID number, some with duplicate IDs. I want to save all the rows with the same ID number in the same location, but I am having trouble doing it efficiently. I create an RDD[(String, String)] of (ID number, data row) pairs:
val mapRdd = rdd.map{ x=> (x.split("\\t+")(1), x)}
A way that works, but is not performant, is to collect the ID numbers, filter the RDD for each ID, and save the RDD of values with the same ID as a text file.
val ids = mapRdd.keys.distinct.collect
ids.foreach({ id =>
val dataRows = mapRdd.filter(_._1 == id).values
dataRows.saveAsTextFile(id)
})
I also tried a groupByKey or reduceByKey so that each tuple in the RDD contains a unique ID number as the key and a string of combined data rows separated by new lines for that ID number. I want to iterate through the RDD only once using foreach to save the data, but it can't give the values as an RDD.
groupedRdd.foreach({ tup =>
val data = sc.parallelize(List(tup._2)) //nested RDD does not work
data.saveAsTextFile(tup._1)
})
Essentially, I want to split an RDD into multiple RDDs by an ID number and save the values for that ID number into their own location.
I think this problem is similar to "Write to multiple outputs by key Spark - one Spark job".
Please refer to the answer there.
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
override def generateActualKey(key: Any, value: Any): Any =
NullWritable.get()
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
key.asInstanceOf[String]
}
object Split {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Split" + args(1))
val sc = new SparkContext(conf)
sc.textFile("input/path")
.map(a => (k, v)) // Your own implementation
.partitionBy(new HashPartitioner(num))
.saveAsHadoopFile("output/path", classOf[String], classOf[String],
classOf[RDDMultipleTextOutputFormat])
sc.stop()
}
}
I just saw a similar answer above, but we don't actually need customized partitions. MultipleTextOutputFormat will create a file for each key. It is OK for multiple records with the same key to fall into the same partition.
In new HashPartitioner(num), num is the number of partitions you want. If you have a large number of distinct keys, you can set num to a large value so that each partition does not open too many HDFS file handles.
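How keys end up in partitions can be sketched with the usual hash-partitioning rule, a non-negative modulo of the key's hash code. The code below is an illustration in plain Scala, not Spark's own class; partitionFor and the sample keys are hypothetical names.

```scala
object PartitionSketch {
  // Non-negative modulo of the key's hash code, the usual hash-partitioning rule.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val ids = Seq("id-1", "id-2", "id-3", "id-4") // hypothetical keys
    ids.foreach(id => println(s"$id -> partition ${partitionFor(id, 4)}"))
  }
}
```

All records with the same key land in the same partition, while several different keys may share one, which is why the output format rather than the partitioner decides the file per key.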
You can call saveAsTextFile directly on the grouped RDD. It saves the data by partition: if you have 4 distinct IDs and set the grouped RDD's number of partitions to 4, Spark stores each partition's data in one file, so you get one file per ID provided each ID lands in its own partition. You can even see the data as iterables for each ID in the file system.
This will save the data per user ID:
rdd.map { x => (x.split("\\t+")(1), x) }.groupByKey(numPartitions).saveAsObjectFile("file")
If you need to retrieve the data again based on user ID, you can do something like:
val userIdLookupTable = sc.objectFile[(String, Iterable[String])]("file").cache() // could use persist() if the data is too big for memory
val data = userIdLookupTable.lookup(id) // note this returns a sequence; in this case you can just take the first element
Note that there is no particular reason to save to a file in this case; I only did it because the OP asked for it. That said, saving to a file does allow you to load the RDD at any time after the initial grouping has been done.
One last thing: lookup is faster than filtering by ID, but if you're willing to work from a Spark pull request, you can check out this answer for a faster approach.
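The grouping step that these answers rely on can be tried on an in-memory collection. This is a minimal sketch with plain Scala collections, assuming tab-separated rows whose second field is the ID, as in the question; groupRows and the sample rows are hypothetical.

```scala
object GroupById {
  // Key each tab-separated row by its second field, then group rows sharing an ID.
  def groupRows(rows: Seq[String]): Map[String, Seq[String]] =
    rows.groupBy(row => row.split("\\t+")(1))

  def main(args: Array[String]): Unit = {
    val rows = Seq("x\t42\tfoo", "y\t7\tbar", "z\t42\tbaz") // hypothetical rows
    groupRows(rows).foreach { case (id, group) => println(s"$id: ${group.size} row(s)") }
  }
}
```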
