Microbatch lookup in an external database from Spark

I have a requirement where I need to process log lines using Spark. One of the processing steps is to look up a certain value in an external DB.
For example:
My log line contains multiple key-value pairs. One of the keys present in the log is "key1", and this key needs to be used for the lookup call.
I don't want to make multiple sequential lookup calls to the external DB, one for each value of "key1" in the RDD. Rather, I want to collect all values of "key1" present in the RDD and then make a single lookup call to the external DB.
My code to extract the key from each log line looks like this:
lines.foreachRDD { rdd =>
  rdd.map(line => extractKey(line))
  // next step is lookup
  // then further processing
}
The .map function is called for each log line, so I am not sure how I can build a list of keys that can be used for the external lookup.
Thanks

Use collect.
lines.foreachRDD { rdd =>
  val keys = rdd.map(line => extractKey(line)).collect()
  // here you can use the keys array
}
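For example, here is a minimal sketch of the whole flow: the keys are collected to the driver, looked up once, and the results broadcast back for the further processing. batchLookup is a hypothetical placeholder for whatever batched query your external-DB client exposes:
// hypothetical placeholder: issue one batched query (e.g. an IN (...) clause) against the external DB
def batchLookup(keys: Seq[String]): Map[String, String] = ???

lines.foreachRDD { rdd =>
  val keys = rdd.map(line => extractKey(line)).distinct().collect().toSeq
  // single driver-side lookup for all keys in this micro-batch
  val lookups = batchLookup(keys)
  // broadcast the results so executors can use them in the further-processing step
  val lookupsBc = rdd.sparkContext.broadcast(lookups)
  rdd.foreach { line =>
    val value = lookupsBc.value.get(extractKey(line))
    // ... further processing with `value` ...
  }
}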
You'll probably also want to work per partition, e.g. with foreachPartition:
lines.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    val keys = iter.map(line => extractKey(line)).toArray
    // here you can use the keys array
  }
}
This way there is one lookup call per partition, and it also avoids serialization problems.
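A rough sketch of that, reusing the hypothetical batchLookup placeholder from the previous answer for the external-DB call:
lines.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    val partitionLines = iter.toSeq                     // materialize this partition's lines
    val keys = partitionLines.map(extractKey).distinct
    val lookups = batchLookup(keys)                     // one batched DB call per partition
    partitionLines.foreach { line =>
      val value = lookups.get(extractKey(line))
      // ... further processing with `value` ...
    }
  }
}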

It looks like you want this:
lines.groupByKey().filter()
Could you provide more info?

Related

How to keep track of updated fields of an RDD in Spark?

There is an issue I am dealing with concerning keeping track of updated fields in a Spark RDD.
Assume that we have an RDD like this:
(1,2)
(2,10)
(5,9)
(3,8)
(8,15)
Based on some conditions, the value of some keys may change. For example, the value of key=2 changes from 10 to 11. Then the value of any key whose value equals the key of the updated row should be changed accordingly. For example, key=1 has value 2, and 2 is a key in another row; because the value of key=2 changes to 11, the value of key=1 should change to 11 too. After some execution the RDD looks like this:
(1,11)
(2,11)
(5,9)
(3,7)
(8,7)
Is there any efficient way to implement this?
Assuming you are talking about a DStream (of RDDs), you can use the updateStateByKey method.
To use updateStateByKey, you need to provide a function update(events, oldState) that takes the events that arrived for a key and its previous state, and returns a newState to store for it.
events: a list of events that arrived in the current batch (may be empty).
oldState: an optional state object, stored within an Option; it might be missing if there was no previous state for the key.
newState: returned by the function, is also an Option.
The result of updateStateByKey() will be a new DStream that contains an RDD of (key, state) pairs.
Basic Example:
def myUpdate(values: Seq[Long], state: Option[Long]): Option[Long] = {
  // select the new value; here we simply keep a running sum of the values seen so far
  Some(state.getOrElse(0L) + values.sum)
}

myDStream.updateStateByKey(myUpdate _)
Background taken from the book "Learning Spark" (O'Reilly).

How can you get around the 2GB buffer limit when using Dataset.groupByKey?

When using Dataset.groupByKey(_.key).mapGroups or Dataset.groupByKey(_.key).cogroup in Spark, I've run into a problem when one of the groupings results in more than 2GB of data.
I need to normalize the data by group before I can start to reduce it, and I would like to split up the groups into smaller subgroups so they distribute better. For example, here's one way I've attempted to split the groups:
val groupedInputs = inputData.groupByKey(_.key).mapGroups {
  case (key, inputSeries) => inputSeries.grouped(maxGroupSize).map(group => (key, group))
}
But unfortunately, however I try to work around it, my jobs always die with an error like this: java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 23816 because the size after growing exceeds size limitation 2147483632. When using Kryo serialization I get a different error, Kryo serialization failed: Buffer overflow, recommending that I increase spark.kryoserializer.buffer.max, but I've already increased it to the 2GB limit.
One solution that occurs to me is to add a random value to the keys before grouping them. This isn't ideal since it'll split up every group (not just the large ones), but I'm willing to sacrifice "ideal" for the sake of "working". That code would look something like this:
val splitInputs = inputData.map(record => (record, ThreadLocalRandom.current.nextInt(splitFactor)))
val groupedInputs = splitInputs.groupByKey { case (record, split) => (record.key, split) }.mapGroups {
  case ((key, _), inputSeries) => inputSeries.grouped(maxGroupSize).map(group => (key, group.map(_._1)))
}
Add a salt key and do the groupBy on your key plus the salt key:
import scala.util.Random
import org.apache.spark.sql.functions.{col, udf}

val start = 1
val end = 5
val randUdf = udf({ () => start + Random.nextInt((end - start) + 1) })

val saltGroupBy = skewDF.withColumn("salt_key", randUdf())
  .groupBy(col("name"), col("salt_key"))
This way all the skewed data for a key doesn't end up on one executor and hit the 2GB limit.
But you have to develop logic to aggregate the above result and finally remove the salt key at the end.
When you use groupBy, all the records with the same key go to one executor and a bottleneck occurs.
The above is one way to mitigate it.
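A rough sketch of that two-stage aggregation, assuming a numeric "value" column and a sum aggregate (both are illustrative assumptions, not taken from the question):
import org.apache.spark.sql.functions.{col, sum}

// stage 1: partial aggregation on (name, salt_key), so no single task sees every row of a skewed key
val partial = saltGroupBy.agg(sum(col("value")).as("partial_sum"))

// stage 2: final aggregation on name alone; the salt key disappears here
val result = partial
  .groupBy(col("name"))
  .agg(sum(col("partial_sum")).as("total"))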
For this case, where the dataset had a lot of skew and it was important to group the records into regularly-sized groups, I decided to process the dataset in two passes. First I used a window function to number the rows by key, and converted that to a "group index," based on a configurable "maxGroupSize":
// The "orderBy" doesn't seem necessary here,
// but the row_number function requires it.
val partitionByKey = Window.partitionBy(key).orderBy(key)
val indexedData = inputData.withColumn("groupIndex",
(row_number.over(partitionByKey) / maxGroupSize).cast(IntegerType))
.as[(Record, Int)]
Then I can group by key and index, and produce groups that are consistently sized--the keys with a lot of records get split up more, and the keys with few records may not be split up at all.
indexedData.groupByKey { case (record, groupIndex) => (record.key, groupIndex) }
  .mapGroups { case ((key, _), recordGroup) =>
    // Remove the index values before returning the groups
    (key, recordGroup.map(_._1))
  }

UPDATE prepared statement with Object

I have an Object that maps column names to values. The columns to be updated are not known beforehand and are decided at run-time.
e.g. map = {col1: "value1", col2: "value2"}.
I want to execute an UPDATE query, updating a table with those columns to the corresponding values. Can I do the following? If not, is there an elegant way of doing it without building the query manually?
db.none('UPDATE mytable SET $1 WHERE id = 99', map)
is there an elegant way of doing it without building the query manually?
Yes, there is, by using the helpers for SQL generation.
You can pre-declare a static object like this:
const cs = new pgp.helpers.ColumnSet(['col1', 'col2'], {table: 'mytable'});
And then use it like this, via helpers.update:
const sql = pgp.helpers.update(data, cs) + /* WHERE clause with the condition */;
// and then execute it:
db.none(sql).then(data => {}).catch(error => {})
This approach will work with both a single object and an array of objects, and you will just append the update condition accordingly.
See also: PostgreSQL multi-row updates in Node.js
What if the column names are not known beforehand?
For that see: Dynamic named parameters in pg-promise, and note that a proper answer would depend on how you intend to cast types of such columns.
Something like this:
map = {col1: "value1", col2: "value2", id: "existingId"}
db.none("UPDATE mytable SET col1=${col1}, col2=${col2} where id=${id}", map)

Spark Streaming Design Question

I am new to Spark. I want to set up Spark Streaming to retrieve key-value pairs from files of the format below:
file: info1
Note: Each info file will have around 1000 of these records, and our system is continuously generating these info files. Through Spark Streaming I want to map line numbers to info files and get an aggregated result.
Can we feed these kinds of files to a Spark cluster as input? I am interested in the "SF" and "DA" delimiters only; "SF" corresponds to the source file and "DA" corresponds to (line number, count).
Since this input data is not in a line-based format, is it a good idea to use these files as Spark input directly, or do I need an intermediate stage where I clean these files to generate new files that hold each record's information on a single line instead of in blocks?
Or can we achieve this in Spark itself? What would be the right approach?
What do I want to achieve?
I want to get line-level information, i.e. the line (as a key) and the info files (as values).
The final output I want looks like this:
line178 -> (info1, info2, info7.................)
line 2908 -> (info3, info90, ..., ... ,)
Do let me know if my explanation is not clear or if I am missing something.
Thanks & Regards,
Vinti
You could do something like this, given your DStream called stream:
// this gives you the DA & FP lines, with the line number as the key
val validLines = stream.map(_.split(":"))
  .filter(parts => Seq("DA", "FP").contains(parts(0)))
  .map(parts => parts(1).split(","))
  .map(fields => (fields(0), fields(1)))

// now you should accumulate the values per key
val state = validLines.updateStateByKey[Seq[String]](updateFunction _)
def updateFunction(newValues: Seq[String], runningValues: Option[Seq[String]]): Option[Seq[String]] = {
  // append the new values to whatever has been accumulated so far
  val newVals = runningValues match {
    case Some(list) => list ++ newValues
    case _ => newValues
  }
  Some(newVals)
}
This should accumulate, for each key, a sequence of the associated values, storing it in state.
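One practical note: updateStateByKey requires checkpointing to be enabled on the StreamingContext. A minimal example, where the variable name ssc and the path are just placeholders:
// updateStateByKey needs a checkpoint directory to persist the state between batches
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")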

How to process tab-separated files in Spark?

I have a file which is tab separated. The third column should be my key and the entire record should be my value (as per Map reduce concept).
val cefFile = sc.textFile("C:\\text1.txt")
val cefDim1 = cefFile.filter { line => line.startsWith("1") }
val joinedRDD = cefFile.map(x => x.split("\\t"))
joinedRDD.first().foreach { println }
I am able to get the value of the first column but not the third. Can anyone suggest how I could accomplish this?
After you've done the split with x.split("\\t"), your RDD (which in your example you called joinedRDD, but which I'm going to call parsedRDD since we haven't joined it with anything yet) is going to be an RDD of arrays. We could turn this into an RDD of key/value tuples by doing parsedRDD.map(r => (r(2), r)). That being said, you aren't limited to just map and reduce operations in Spark, so it's possible that another data structure might be better suited. Also, for tab-separated files, you could use spark-csv along with Spark DataFrames if that is a good fit for the eventual problem you are looking to solve.
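A minimal sketch of that keying step, assuming the third tab-separated field is the key (adjust the index to your data):
val cefFile = sc.textFile("C:\\text1.txt")
val parsedRDD = cefFile.map(_.split("\\t"))
// key each record by its third column and keep the whole record as the value
val keyedRDD = parsedRDD.map(r => (r(2), r))
// e.g. print the first key/value pair
keyedRDD.first() match {
  case (key, record) => println(key + " -> " + record.mkString("\t"))
}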
