Spark reduce by - apache-spark

I have streaming data coming in as follows:
id, date, value
i1, 12-01-2016, 10
i2, 12-02-2016, 20
i1, 12-01-2016, 30
i2, 12-05-2016, 40
I want to reduce by id to get aggregate value info by date, like this:
the output required from the RDD is, for a given id, a list (365 days).
I have to put each value in the list position given by its day of year (e.g. 12-01-2016 is day 336), and as there are two instances for device i1 with the same date, they should be aggregated:
id, List [0|1|2|3|...|336|337|...|340|...|365]
i1, 10+30 - this goes to position 336
i2, 20, 40 - these go to positions 337 and 340
Please suggest the reduce or group-by transformation to do this.

I'll provide a basic code snippet with a few assumptions, since you haven't specified the language, data source, or data format.
JavaDStream<String> lineStream = // your streaming data source
JavaPairDStream<String, Long> firstReduce = lineStream.mapToPair(line -> {
    String[] fields = line.split(",");
    String idDate = fields[0] + "," + fields[1]; // keep a separator so the key can be split again later
    Long value = Long.valueOf(fields[2]);
    return new Tuple2<String, Long>(idDate, value);
}).reduceByKey((v1, v2) -> v1 + v2);
firstReduce.map(idDateValueTuple -> {
    String idDate = idDateValueTuple._1();
    Long valueSum = idDateValueTuple._2();
    String id = idDate.split(",")[0];
    String date = idDate.split(",")[1];
    // TODO parse the date and put valueSum into the array position you need
    return new Tuple2<>(id, valueSum); // placeholder return value
});

I can only reach this far; I'm not sure how to add each element of the arrays in the final step. Hope this helps! If you work out the last step or an alternate way, I'd appreciate it if you post it here!
import java.time.LocalDate
import java.time.format.DateTimeFormatter

def getDateDifference(dateStr: String): Int = {
  val startDate = "01-01-2016"
  val formatter = DateTimeFormatter.ofPattern("MM-dd-yyyy")
  val oldDate = LocalDate.parse(startDate, formatter)
  val newDate = LocalDate.parse(dateStr, formatter)
  newDate.toEpochDay.toInt - oldDate.toEpochDay.toInt
}

def getArray(numberOfDays: Int, data: Int): Array[Int] = {
  val daysArray = new Array[Int](366)
  daysArray(numberOfDays) = data
  daysArray
}

val idRDD = <read from stream>
val idRDDMap = idRDD.map { rec =>
  ((rec.split(",")(0), rec.split(",")(1)),
   (getDateDifference(rec.split(",")(1)), rec.split(",")(2).toInt))
}
val idRDDconsiceMap = idRDDMap.map { rec => (rec._1._1, getArray(rec._2._1, rec._2._2)) }
val finalRDD = idRDDconsiceMap.reduceByKey((acc, value) => ???) // TODO: add the two arrays element-wise
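One way to finish that last step is to sum the arrays element-wise inside reduceByKey. A minimal sketch, assuming getArray returns Array[Int] as above:

val finalRDD = idRDDconsiceMap.reduceByKey { (acc, value) =>
  acc.zip(value).map { case (a, b) => a + b } // add the two day-arrays position by position
}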

Related

how to extract an integer range from a string

I have a string that contains different ranges, and I need to extract their values:
var str = "some text x = 1..14, y = 2..4 some text"
I used the substringBefore() and substringAfter() methods to get x and y, but I can't find a way to get the values, because the numbers could be one or two digits, or even negative.
One approach is to use a regex, e.g.:
val str = "some text x = 1..14, y = 2..4 some text"
val match = Regex("x = (-?\\d+[.][.]-?\\d+).* y = (-?\\d+[.][.]-?\\d+)")
.find(str)
if (match != null)
println("x=${match.groupValues[1]}, y=${match.groupValues[2]}")
// prints: x=1..14, y=2..4
\\d matches a single digit, so \\d+ matches one or more digits; -? matches an optional minus sign; [.] matches a dot; and (…) marks a group that you can then retrieve from the groupValues property. (groupValues[0] is the whole match, so the individual values start from index 1.)
You could easily add extra parens to pull out each number separately, instead of whole ranges.
(You may or may not find this as readable or maintainable as string-manipulation approaches…)
Does this solution fit your needs?
val str = "some text x = 1..14, y = 2..4 some text"
val result = str.replace(",", "").split(" ")
var x = ""; var y = ""
for (i in result.indices) {
    if (result[i] == "x") {
        x = result[i + 2]
    } else if (result[i] == "y") {
        y = result[i + 2]
    }
}
println(x)
println(y)
Using KotlinSpirit library
val rangeParser = object : Grammar<IntRange>() {
    private var first: Int = -1
    private var last: Int = -1

    override val result: IntRange
        get() = first..last

    override fun defineRule(): Rule<*> {
        return int {
            first = it
        } + ".." + int {
            last = it
        }
    }
}.toRule().compile()
val str = "some text x = 1..14, y = 2..4 some text"
val ranges = rangeParser.findAll(str)
https://github.com/tiksem/KotlinSpirit

Problem with Double, String and Integer conversion

Why is the value 4.49504794? It should be usdRate (a String) * sas * the input from the ddx EditText.
I want it to be 0.00000001 * the input from the EditText (from the user) * usdRate (the price of 1 BTC in USD).
It should be 0.00000001/44950 * x (user input = x) = 0.00002224694105.
I also want to limit usdRate to only 5 digits in total, or somehow remove the last four characters of the string.
var usdRate: String = JSONObject(json).getJSONObject("bpi").getJSONObject("USD")["rate"] as String
val text = usdRate.replace(",", "")
val text2 = text.replace(".", "")
val satosh: Int = text2.toInt()
val sas: Double = 0.00000001
val sas2: Double = sas * satosh.toDouble() // sas is already a Double, no conversion needed
val ddx: EditText = findViewById(R.id.editTextNumber2)
val sasEnty: Double = ddx.text.toString().toDouble() * sas2
// 1 satoshi value in USD
usdView.text = sasEnty.toString()
//Problem end
[Screenshot of the output in the application]
This code gave me the output I was looking for. When a user inputs 3 as the value, it returns 0.0013994405520000002.
// e.g. 45,000.01234
val usdRate: String = JSONObject(json).getJSONObject("bpi").getJSONObject("USD")["rate"] as String
val usdRateN1: String = usdRate.replace(",", "")
val sastoshi: Double = 0.00000001
val antalSatoshi = sastoshi * ddx.text.toString().toDouble()
val finalUsdCount = usdRateN1.toDouble() * antalSatoshi
val rounded = Math.round(finalUsdCount) // Math.round returns its result; it must be assigned to have any effect

I want to collect the data frame column values in an array list to conduct some computations, is it possible?

I am loading data from Phoenix through this:
val tableDF = sqlContext.phoenixTableAsDataFrame("Hbtable", Array("ID", "distance"), conf = configuration)
and I want to carry out the following computation on the values of the column distance:
val list = Array(10, 20, 30, 40, 10, 20, 0, 10, 20, 30, 40, 50, 60) // list of values from the column distance
val first = list(0)
val last = list(list.length - 1)
var m = 0
for (a <- 0 to list.length - 2) {
  if (list(a + 1) < list(a) && list(a + 1) >= 0) {
    m = m + list(a)
  }
}
val totalDist = m + last - first
You can do something like this; it returns an Array[Any]:
val array = df.select("distance").rdd.map(r => r(0)).collect()
If you want the proper data type, you can cast; this returns an Array[Int]:
val array = df.select("distance").rdd.map(r => r(0).asInstanceOf[Int]).collect()
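From there, the driver-side computation in the question can run directly on the collected array. A minimal sketch, assuming the distance column is Int-typed:

val list = df.select("distance").rdd.map(r => r(0).asInstanceOf[Int]).collect()
val first = list(0)
val last = list(list.length - 1)
var m = 0
for (a <- 0 to list.length - 2) {
  if (list(a + 1) < list(a) && list(a + 1) >= 0) m += list(a)
}
val totalDist = m + last - first // same formula as in the question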

How to compute the distance matrix in spark?

I have tried pairing the samples, but it costs a huge amount of memory, as 100 samples lead to 9,900 pairs. What could be a more effective way of computing the distance matrix in a distributed environment in Spark?
Here is a snippet of pseudocode showing what I'm trying:
val input = sc.textFile("AirPassengers.csv", (numPartitions / 2))
val i = input.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = i.zipWithIndex() // including the index of each sample
val indexedData = indexed.map { case (k, v) => (v, k) }
val pairedSamples = indexedData.cartesian(indexedData)
val filteredSamples = pairedSamples.filter { case (x, y) =>
  x._1.toInt > y._1.toInt // to consider only the upper or lower triangle
}
filteredSamples.cache
filteredSamples.count
The above code creates the pairs, but even if my dataset contains only 100 samples, pairing (filteredSamples above) results in 4,950 pairs, which could be very costly for big data.
I recently answered a similar question.
Basically, it comes down to computing n(n-1)/2 pairs, which would be 4950 computations in your example. However, what makes this approach different is that I use joins instead of cartesian. With your code, the solution would look like this:
val input = sc.textFile("AirPassengers.csv", (numPartitions / 2))
val i = input.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = i.zipWithIndex() // including the index of each sample
val indexedData = indexed.map { case (k, v) => (v, k) }

// prepare indices
val count = i.count
val indices = sc.parallelize(for (i <- 0L until count; j <- 0L until count; if i > j) yield (i, j))

val joined1 = indices.join(indexedData).map { case (i, (j, v)) => (j, (i, v)) }
val joined2 = joined1.join(indexedData).map { case (j, ((i, v1), v2)) => ((i, j), (v1, v2)) }

// after that, you can then compute the distance using your distFunc
val distRDD = joined2.mapValues { case (v1, v2) => distFunc(v1, v2) }
Try this method and compare it with the one you already posted. Hopefully, this can speed up your code a bit.
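distFunc is left undefined in the answer. As an assumed example (not part of the original), a Euclidean distance over MLlib vectors could look like this:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// hypothetical distFunc: Euclidean distance via the library's squared-distance helper
def distFunc(v1: Vector, v2: Vector): Double =
  math.sqrt(Vectors.sqdist(v1, v2))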
As far as I can see from checking various sources and the Spark MLlib clustering docs, Spark doesn't currently support distance or pdist matrices.
In my opinion, 100 samples will always output at least 4950 values, so manually creating a distributed matrix solver using a transformation (like .map) would be the best solution, as sketched below.
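A minimal sketch of that idea, assuming an RDD[(Long, Vector)] of indexed samples (the names and the Euclidean metric are illustrative, not from the original answer):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// pair every sample with each lower-indexed sample, then map each pair to a distance
def distanceMatrix(indexedData: RDD[(Long, Vector)]): RDD[((Long, Long), Double)] =
  indexedData.cartesian(indexedData)
    .filter { case ((i, _), (j, _)) => i > j } // keep one triangle of the matrix
    .map { case ((i, v1), (j, v2)) => ((i, j), math.sqrt(Vectors.sqdist(v1, v2))) }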
This can serve as the Java version of jtitusj's answer:
public JavaPairRDD<Tuple2<Long, Long>, Double> getDistanceMatrix(Dataset<Row> ds, String vectorCol, JavaSparkContext jsc) {
    JavaRDD<Vector> rdd = ds.toJavaRDD().map(new Function<Row, Vector>() {
        private static final long serialVersionUID = 1L;
        public Vector call(Row row) throws Exception {
            return row.getAs(vectorCol);
        }
    });
    // collect the vectors to the driver and compute one triangle of the matrix pairwise
    List<Vector> vectors = rdd.collect();
    long count = ds.count();
    List<Tuple2<Tuple2<Long, Long>, Double>> distanceList = new ArrayList<Tuple2<Tuple2<Long, Long>, Double>>();
    for (long i = 0; i < count; i++) {
        for (long j = 0; j < i; j++) {
            Tuple2<Long, Long> indexPair = new Tuple2<Long, Long>(i, j);
            double d = DistanceMeasure.getDistance(vectors.get((int) i), vectors.get((int) j)); // your distance helper
            distanceList.add(new Tuple2<Tuple2<Long, Long>, Double>(indexPair, d));
        }
    }
    // the original returned the list directly, which does not match the declared return type;
    // parallelizing it (jsc added as a parameter) fixes that
    return jsc.parallelizePairs(distanceList);
}

Scala split string and sort data

Hi, I am new to Scala, and I achieved the following in Scala. My string contains this data:
CLASS: Win32_PerfFormattedData_PerfProc_Process$$(null)|CreatingProcessID|Description|ElapsedTime|Frequency_Object|Frequency_PerfTime|Frequency_Sys100NS|HandleCount|IDProcess|IODataBytesPersec|IODataOperationsPersec|IOOtherBytesPersec|IOOtherOperationsPersec|IOReadBytesPersec|IOReadOperationsPersec|IOWriteBytesPersec|IOWriteOperationsPersec|Name|PageFaultsPersec|PageFileBytes|PageFileBytesPeak|PercentPrivilegedTime|PercentProcessorTime|PercentUserTime|PoolNonpagedBytes|PoolPagedBytes|PriorityBase|PrivateBytes|ThreadCount|Timestamp_Object|Timestamp_PerfTime|Timestamp_Sys100NS|VirtualBytes|VirtualBytesPeak|WorkingSet|WorkingSetPeak|WorkingSetPrivate$$(null)|0|(null)|8300717|0|0|0|0|0|0|0|0|0|0|0|0|0|Idle|0|0|0|100|100|0|0|0|0|0|8|0|0|0|0|0|24576|24576|24576$$(null)|0|(null)|8300717|0|0|0|578|4|0|0|0|0|0|0|0|0|System|0|114688|274432|17|0|0|0|0|8|114688|124|0|0|0|3469312|8908800|311296|5693440|61440$$(null)|4|(null)|8300717|0|0|0|42|280|0|0|0|0|0|0|0|0|smss|0|782336|884736|110|0|0|1864|10664|11|782336|3|0|0|0|5701632|19357696|1388544|1417216|700416$$(null)|372|(null)|8300715|0|0|0|1438|380|0|0|0|0|0|0|0|0|csrss|0|3624960|3747840|0|0|0|15008|157544|13|3624960|10|0|0|0|54886400|55345152|5586944|5648384|2838528$$(null)|424|(null)|8300714|0|0|0|71|432|0|0|0|0|0|0|0|0|csrss#1|0|8605696|8728576|0|0|0|8720|96384|13|8605696|9|0|0|0|50515968|50909184|7438336|9342976|4972544
Now I want to find the data whose keys are PercentProcessorTime, ElapsedTime, and so on. For this I first split the above string on $$, then split each part on |. In the split parts I search for the one containing PercentProcessorTime and get the index of that field; then I skip the first two parts produced by the $$ split and read the PercentProcessorTime value by index. It looks complicated, but I think the following code explains it:
// First split the string as below
val processData = winProcessData.split("\\$\\$")

// Get the index of each field of interest
val getIndex: Int = processData.find(_.contains("PercentProcessorTime")) match {
  case Some(s) => s.split("\\|").indexOf("PercentProcessorTime")
  case None => -1
}
val getIndexOfElapsedTime: Int = processData.find(_.contains("ElapsedTime")) match {
  case Some(s) => s.split("\\|").indexOf("ElapsedTime")
  case None => -1
}

// Now fetch the data at those indices as below
for (i <- 2 to (processData.length - 1)) {
  val getValues = processData(i).split("\\|")
  val getPercentProcessTime = getValues(getIndex).toFloat
  val getElapsedTime = getValues(getIndexOfElapsedTime).toFloat
  Logger.info("(" + getPercentProcessTime + "," + getElapsedTime + "),")
}
Now the problem is that with the above code I get the data for the given key by index, so my output was (8300717,100), (8300717,17), (8300717,110), ... I want to sort this data by getPercentProcessTime, so my output should be (8300717,110), (8300717,100), (8300717,17), ...
And the data should be in a list so I can pass it to a case class.
Do you want to find PercentProcessorTime or PercentPrivilegedTime?
Here it is
val str = "your very long string"
val heads = Seq("PercentPrivilegedTime", "ElapsedTime")
val Array(elap, perc) = str.split("\\$\\$").tail.map(_.split("\\|"))
.transpose.filter(x => heads.contains(x.head))
// elap: Array[String] = Array(ElapsedTime, 8300717, 8300717, 8300717, 8300715, 8300714)
// perc: Array[String] = Array(PercentPrivilegedTime, 100, 17, 110, 0, 0)
val res = (elap.tail, perc.tail).zipped.toList.sortBy(-_._2.toInt)
// res: List[(String, String)] = List((8300717,110), (8300717,100), (8300717,17), (8300715,0), (8300714,0))
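To get the sorted pairs into a list of case-class instances, as the question asks, one possibility (the Metric name and its fields are illustrative, not from the original):

case class Metric(elapsedTime: String, percentTime: String)

// res is the List[(String, String)] produced above
val metrics: List[Metric] = res.map { case (elapsed, percent) => Metric(elapsed, percent) }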
