Apache Spark Streaming: Median of windowed PairDStream by key

I want to calculate the median of the values of each key in a PairDStream.
I already tried the following, which is very inefficient:
JavaPairDStream<String, Iterable<Float>> groupedByKey = pairDstream.groupByKey();
JavaPairDStream<String, Float> medianPerPlug1h = groupedByKey.transformToPair(new Function<JavaPairRDD<String, Iterable<Float>>, JavaPairRDD<String, Float>>() {
    public JavaPairRDD<String, Float> call(JavaPairRDD<String, Iterable<Float>> v1) throws Exception {
        return v1.mapValues(new Function<Iterable<Float>, Float>() {
            public Float call(Iterable<Float> v1) throws Exception {
                List<Float> buffer = new ArrayList<Float>();
                long count = 0L;
                Iterator<Float> iterator = v1.iterator();
                while (iterator.hasNext()) {
                    buffer.add(iterator.next());
                    count++;
                }
                float[] values = new float[(int) count];
                for (int i = 0; i < buffer.size(); i++) {
                    values[i] = buffer.get(i);
                }
                Arrays.sort(values);
                float median;
                int startIndex;
                if (count % 2 == 0) {
                    startIndex = (int) (count / 2 - 1);
                    float a = values[startIndex];
                    float b = values[startIndex + 1];
                    median = (a + b) / 2.0f;
                } else {
                    startIndex = (int) (count / 2);
                    median = values[startIndex];
                }
                return median;
            }
        });
    }
});
medianPerPlug1h.print();
Can somebody help me with a more efficient transformation? I have about 1950 different keys, each of which can accumulate up to 3600 values (1 data point per second, window of 1 hour) for which I need to find the median.
Thank you!

First of all, I don't know why you are using Spark for this kind of task. It doesn't really look like a big-data problem, since you only have a few thousand values per key, and Spark may just make things more complicated. But let's assume you're planning to scale up to bigger datasets.
I would try to use a more optimized algorithm for finding the median than simply sorting the values; sorting an array runs in O(n log n) time.
You could instead use a selection algorithm that finds the median in linear time, such as median of medians (or, in practice, quickselect).
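A minimal, self-contained sketch of that idea (quickselect with a random pivot, expected linear time; this is my own illustration, not the full median-of-medians algorithm) might look like:
import java.util.Random;

public class QuickSelectMedian {
    private static final Random RND = new Random();

    // Returns the k-th smallest element (0-based) of the array in expected O(n) time.
    static float select(float[] a, int k) {
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi, lo + RND.nextInt(hi - lo + 1));
            if (k == p) {
                return a[k];
            } else if (k < p) {
                hi = p - 1;
            } else {
                lo = p + 1;
            }
        }
        return a[lo];
    }

    // Lomuto partition around the chosen pivot; returns the pivot's final index.
    private static int partition(float[] a, int lo, int hi, int pivotIdx) {
        float pivot = a[pivotIdx];
        swap(a, pivotIdx, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {
                swap(a, i, store++);
            }
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(float[] a, int i, int j) {
        float t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    // Lower median for even-length arrays; average the two middle elements if you need the exact median.
    static float median(float[] values) {
        return select(values, (values.length - 1) / 2);
    }
}
In the question's mapValues function, a call like QuickSelectMedian.median(values) could then replace the Arrays.sort step, so each key is processed in expected linear rather than O(n log n) time.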

1) Avoid using groupByKey; reduceByKey is more efficient than groupByKey.
2) reduceByKeyAndWindow(Function2, windowDuration, slideDuration) can serve you better.
example:
JavaPairDStream<String, String> merged = yourDStream.reduceByKeyAndWindow(new Function2<String, String, String>() {
    public String call(String arg0, String arg1) throws Exception {
        return arg0 + "," + arg1;
    }
}, Durations.seconds(windowDur), Durations.seconds(slideDur));
Assume the output from this DStream will look like this:
(key, 1,2,3,4,5,6,7)
(key, 1,2,3,4,5,6,7)
Now, for each key, you can parse the concatenated string to get all the values and their count. To get the median, sort the parsed values and take the middle element (summing them and dividing by the count would give you the mean instead).
Note: I used a String just to concatenate.
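A rough sketch of that parsing step (assuming the merged stream from the snippet above; names are illustrative) could look like:
JavaPairDStream<String, Float> medianPerKey = merged.mapValues(new Function<String, Float>() {
    public Float call(String concatenated) throws Exception {
        // Parse the comma-separated values produced by reduceByKeyAndWindow.
        String[] parts = concatenated.split(",");
        float[] values = new float[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Float.parseFloat(parts[i]);
        }
        // Sort and take the middle element (average the two middle elements for an even count).
        Arrays.sort(values);
        int mid = values.length / 2;
        if (values.length % 2 == 0) {
            return (values[mid - 1] + values[mid]) / 2.0f;
        }
        return values[mid];
    }
});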
I hope it helps :)

Related

Optimization (in terms of speed)

Is there any other way to optimize this code? Can anyone come up with a better approach? This is taking a lot of time in the main code. Thanks a lot ;)
HashMap<String, Integer> hmap = new HashMap<String, Integer>();
List<String> dup = new ArrayList<String>();
List<String> nondup = new ArrayList<String>();
for (String num : nums) {
    String result = num.toLowerCase();
    if (hmap.containsKey(result)) {
        hmap.put(result, hmap.get(result) + 1);
    } else {
        hmap.put(result, 1);
    }
}
for (String num : nums) {
    int count = hmap.get(num.toLowerCase());
    if (count == 1) {
        nondup.add(num);
    } else {
        dup.add(num);
    }
}
Input and output:
[A/tea, C/SEA.java, C/clock, aep, aeP, C/SEA.java]
Dups: [C/SEA.java, aep, aeP, C/SEA.java]
nondups: [A/tea, C/clock]
How much time is "a lot of time"? Is your input bigger than what you've actually shown us?
You could parallelize this with something like Arrays.stream(nums).parallel().collect(Collectors.groupingByConcurrent(k -> k, Collectors.counting())), which would get you a Map<String, Long>, but that would only speed up your code if you have a lot of input, which it doesn't look like you have right now.
You could parallelize the next step, if you liked, like so:
Map<String, Long> counts = Arrays.stream(nums).parallel()
    .collect(Collectors.groupingByConcurrent(k -> k, Collectors.counting()));
Map<Boolean, List<String>> hasDup =
    counts.entrySet().parallelStream()
        .collect(Collectors.partitioningBy(
            entry -> entry.getValue() > 1,
            Collectors.mapping(Entry::getKey, Collectors.toList())));
List<String> dup = hasDup.get(true);
List<String> nodup = hasDup.get(false);
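For completeness, a minimal driver exercising this sketch (the class name is just an illustration, and the sample array is taken from the question's data) might be:
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;

public class DupPartitionDemo {
    public static void main(String[] args) {
        String[] nums = {"A/tea", "C/SEA.java", "C/clock", "aep", "aeP", "C/SEA.java"};

        // Count occurrences in parallel, then split keys into duplicates and non-duplicates.
        Map<String, Long> counts = Arrays.stream(nums).parallel()
            .collect(Collectors.groupingByConcurrent(k -> k, Collectors.counting()));
        Map<Boolean, List<String>> hasDup = counts.entrySet().parallelStream()
            .collect(Collectors.partitioningBy(
                entry -> entry.getValue() > 1,
                Collectors.mapping(Entry::getKey, Collectors.toList())));

        System.out.println("Dups: " + hasDup.get(true));
        System.out.println("Nondups: " + hasDup.get(false));
    }
}
Note that, unlike the question's code, this groups by the exact string rather than its lowercased form, so case-insensitive duplicates are counted separately.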
The algorithms in the other answers can speed up execution using multiple threads.
This can theoretically reduce the processing time by a factor of M, where M is the maximum number of threads that your system can run concurrently. However, as M is a constant number, this does not change the order of complexity, which therefore remains O(N).
At a glance, I do not see a way to solve your problem in less than O(N), I am afraid.

Generate histogram from JavaRDD

I am trying to write code that converts the data in a Java RDD into a histogram so that I can bin the data in a certain way. For example, I want to create a histogram of sizes so that I can find out how many entries fall into each size range. I am able to get the values into different RDDs, but I am not sure what I am missing here.
Is there an easier way to do this?
0 - 1 GB - 2 entries
1 - 5GB - 4 entries
and so on
class EntryWithSize {
    long size;
    String entryId;
    String groupId;
}
JavaRDD<EntryWithSize> entries = getEntries();
JavaRDD<HistoSize> histoSizeJavaRDD = entries.keyBy(EntryWithSize::getGroupId)
    .combineByKey(
        HistoSize::new,
        (HistoSize h, EntryWithSize y) -> h.mergeWith(new HistoSize(y)),
        HistoSize::mergeWith
    ).values();
@Data
@AllArgsConstructor
static class HistoSize implements Serializable {
    int oneGB;
    int fiveGB;

    public HistoSize(EntryWithSize entry) {
        addSize(entry);
    }

    private void addSize(EntryWithSize entry) {
        long size = entry.getSize();
        if (size <= ONE_GB) {
            oneGB++;
        } else {
            fiveGB++;
        }
    }

    public HistoSize mergeWith(HistoSize other) {
        oneGB += other.oneGB;
        fiveGB += other.fiveGB;
        return this;
    }
}
I was able to get it working by using a reduce on the final pair RDD. My test data was wrong, which was causing a red herring in the output.
Function2<HistoSize, HistoSize, HistoSize> reduceSumFunc = (a, b) -> new HistoSize(
    a.oneGB + b.oneGB,
    a.fiveGB + b.fiveGB);
HistoSize finalSize = histoSizeJavaRDD.reduce(reduceSumFunc);

How to compute the distance matrix in spark?

I have tried pairing the samples, but it costs a huge amount of memory, as 100 samples lead to 9,900 pairs, which is too costly. What could be a more effective way of computing the distance matrix in a distributed environment in Spark?
Here is a snippet of pseudo code of what I'm trying:
val input = sc.textFile("AirPassengers.csv", numPartitions / 2)
val i = input.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = i.zipWithIndex() // including the index of each sample
val indexedData = indexed.map { case (k, v) => (v, k) }
val pairedSamples = indexedData.cartesian(indexedData)
val filteredSamples = pairedSamples.filter { case (x, y) =>
  x._1.toInt > y._1.toInt // to consider only the upper or lower triangle
}
filteredSamples.cache
filteredSamples.count
The above code creates the pairs, but even if my dataset contains only 100 samples, pairing them (filteredSamples above) results in 4,950 pairs, which could be very costly for big data.
I recently answered a similar question.
Basically, it comes down to computing n(n-1)/2 pairs, which would be 4,950 computations in your example. However, what makes this approach different is that I use joins instead of cartesian. With your code, the solution would look like this:
val input = (sc.textFile("AirPassengers.csv",(numPartitions/2)))
val i = input.map(s => (Vectors.dense(s.split(',').map(_.toDouble))))
val indexed = i.zipWithIndex()
// including the index of each sample
val indexedData = indexed.map { case (k,v) => (v,k) }
// prepare indices
val count = i.count
val indices = sc.parallelize(for(i <- 0L until count; j <- 0L until count; if i > j) yield (i, j))
val joined1 = indices.join(indexedData).map { case (i, (j, v)) => (j, (i,v)) }
val joined2 = joined1.join(indexedData).map { case (j, ((i,v1),v2)) => ((i,j),(v1,v2)) }
// after that, you can then compute the distance using your distFunc
val distRDD = joined2.mapValues{ case (v1, v2) => distFunc(v1, v2) }
Try this method and compare it with the one you already posted. Hopefully, this can speed up your code a bit.
As far as I can see from checking various sources and the Spark mllib clustering site, Spark doesn't currently support the distance or pdist matrices.
In my opinion, 100 samples will always output at least 4950 values; so manually creating a distributed matrix solver using a transformation (like .map) would be the best solution.
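A minimal Java sketch of that idea (cartesian plus a triangle filter, with Euclidean distance via Vectors.sqdist; the method and variable names are my own illustration) could look like this:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import scala.Tuple2;

public static JavaPairRDD<Tuple2<Long, Long>, Double> distanceMatrix(JavaRDD<Vector> vectors) {
    // Attach an index to every sample, keyed by index.
    JavaPairRDD<Long, Vector> indexed = vectors.zipWithIndex()
        .mapToPair(t -> new Tuple2<>(t._2(), t._1()));

    // Build all ordered pairs, keep only the lower triangle (i > j), and compute the distance.
    return indexed.cartesian(indexed)
        .filter(p -> p._1()._1() > p._2()._1())
        .mapToPair(p -> new Tuple2<>(
            new Tuple2<>(p._1()._1(), p._2()._1()),
            Math.sqrt(Vectors.sqdist(p._1()._2(), p._2()._2()))));
}
Unlike collecting everything to the driver, this keeps the pairwise computation distributed, at the cost of the shuffle that cartesian implies; the join-based variant above avoids materializing the full cartesian product.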
This can serve as the Java version of jtitusj's answer.
public List<Tuple2<Tuple2<Long, Long>, Double>> getDistanceMatrix(Dataset<Row> ds, String vectorCol) {
    JavaRDD<Vector> rdd = ds.toJavaRDD().map(new Function<Row, Vector>() {
        private static final long serialVersionUID = 1L;

        public Vector call(Row row) throws Exception {
            return row.getAs(vectorCol);
        }
    });
    List<Vector> vectors = rdd.collect();
    long count = ds.count();
    List<Tuple2<Tuple2<Long, Long>, Double>> distanceList = new ArrayList<Tuple2<Tuple2<Long, Long>, Double>>();
    for (long i = 0; i < count; i++) {
        for (long j = 0; j < count && i > j; j++) {
            Tuple2<Long, Long> indexPair = new Tuple2<Long, Long>(i, j);
            double d = DistanceMeasure.getDistance(vectors.get((int) i), vectors.get((int) j));
            distanceList.add(new Tuple2<Tuple2<Long, Long>, Double>(indexPair, d));
        }
    }
    return distanceList;
}

Reduce a string using grammar-like rules

I'm trying to find a suitable DP algorithm for simplifying a string. For example, I have the string a b a b and a list of rules:
a b -> b
a b -> c
b a -> a
c c -> b
The purpose is to get all single characters that can be obtained from the given string using these rules. For this example, it will be b, c. The length of the given string can be up to 200 symbols. Could you suggest an efficient algorithm?
Rules are always of the form 2 -> 1. I had the idea of building a tree whose root is the given string and where each child is the string after one transformation, but I'm not sure it's the best way.
If you read those rules from right to left, they look exactly like the rules of a context-free grammar, and have basically the same meaning. You could apply a chart-parsing algorithm like the Earley algorithm to your data, along with a suitable starting rule; something like
start <- start a
| start b
| start c
and then just examine the parse forest for the shortest chain of starts. The worst case remains O(n^3) of course, but Earley is fairly effective, these days.
You can also produce parse forests when parsing with derivatives. You might be able to efficiently check them for short chains of starts.
For a DP problem, you always need to understand how you can construct the answer for a big problem in terms of smaller sub-problems. Assume you have your function simplify which is called with an input of length n. There are n-1 ways to split the input in a first and a last part. For each of these splits, you should recursively call your simplify function on both the first part and the last part. The final answer for the input of length n is the set of all possible combinations of answers for the first and for the last part, which are allowed by the rules.
In Python, this can be implemented like so:
from functools import lru_cache

rules = {'ab': set('bc'), 'ba': set('a'), 'cc': set('b')}
all_chars = set(c for cc in rules.values() for c in cc)

@lru_cache(maxsize=None)  # memoize, so each sub-problem is solved only once
def simplify(s):
    if len(s) == 1:  # base case to end recursion
        return set(s)
    possible_chars = set()
    # iterate over all the possible splits of s
    for i in range(1, len(s)):
        head = s[:i]
        tail = s[i:]
        # check all possible combinations of answers of sub-problems
        for c1 in simplify(head):
            for c2 in simplify(tail):
                possible_chars.update(rules.get(c1 + c2, set()))
        # speed hack: stop early once every producible character has been found
        if possible_chars == all_chars:  # won't get any bigger
            return all_chars
    return possible_chars
Quick check:
In [53]: simplify('abab')
Out[53]: {'b', 'c'}
To make this fast enough for large strings (avoiding exponential behavior), you need memoization, which is what the lru_cache decorator above provides. This is a critical step in solving DP problems; otherwise you are just doing a brute-force calculation. A further tiny speedup can be obtained by returning from the function as soon as possible_chars == all_chars, since at that point you are already sure that you can generate all possible outcomes.
Analysis of running time: for an input of length n, there are 2 substrings of length n-1, 3 substrings of length n-2, ... n substrings of length 1, for a total of O(n^2) subproblems. Due to the memoization, the function is called at most once for every subproblem. Maximum running time for a single sub-problem is O(n) due to the for i in range(len(s)), so the overall running time is at most O(n^3).
Let N be the length of the given string and R the number of rules.
Expanding a tree in a top-down manner yields computational complexity O(NR^N) in the worst case (input string of the form aaa... and rules aa -> a).
Proof:
The root of the tree has (N-1)R children, which have (N-1)R^2 children, ..., down to (N-1)R^N children (leaves). So the total number of nodes is O((N-1)R + (N-1)R^2 + ... + (N-1)R^N) = O((N-1)(R + R^2 + ... + R^N)), and since this geometric series is dominated by its last term, it is O(NR^N).
Recursive Java implementation of this naive approach:
public static void main(String[] args) {
    Map<String, Character[]> rules = new HashMap<String, Character[]>() {{
        put("ab", new Character[]{'b', 'c'});
        put("ba", new Character[]{'a'});
        put("cc", new Character[]{'b'});
    }};
    System.out.println(simplify("abab", rules));
}

public static Set<String> simplify(String in, Map<String, Character[]> rules) {
    Set<String> result = new HashSet<String>();
    simplify(in, rules, result);
    return result;
}

private static void simplify(String in, Map<String, Character[]> rules, Set<String> result) {
    if (in.length() == 1) {
        result.add(in);
    }
    for (int i = 0; i < in.length() - 1; i++) {
        String two = in.substring(i, i + 2);
        Character[] rep = rules.get(two);
        if (rep != null) {
            for (Character c : rep) {
                simplify(in.substring(0, i) + c + in.substring(i + 2, in.length()), rules, result);
            }
        }
    }
}
Bas Swinckels's O(RN^3) Java implementation (with HashMap as a memoization cache):
public static Set<String> simplify2(final String in, Map<String, Character[]> rules) {
    Map<String, Set<String>> cache = new HashMap<String, Set<String>>();
    return simplify2(in, rules, cache);
}

private static Set<String> simplify2(final String in, Map<String, Character[]> rules, Map<String, Set<String>> cache) {
    final Set<String> cached = cache.get(in);
    if (cached != null) {
        return cached;
    }
    Set<String> ret = new HashSet<String>();
    if (in.length() == 1) {
        ret.add(in);
        return ret;
    }
    for (int i = 1; i < in.length(); i++) {
        String head = in.substring(0, i);
        String tail = in.substring(i, in.length());
        for (String c1 : simplify2(head, rules, cache)) {
            for (String c2 : simplify2(tail, rules, cache)) {
                Character[] rep = rules.get(c1 + c2);
                if (rep != null) {
                    for (Character c : rep) {
                        ret.add(c.toString());
                    }
                }
            }
        }
    }
    cache.put(in, ret);
    return ret;
}
Output in both approaches:
[b, c]

Sorting a string using another sorting order string [closed]

I saw this in an interview question:
Given a sorting-order string, you are asked to sort the input string based on the given sorting-order string.
For example, if the sorting-order string is dfbcae and the input string is abcdeeabc, the output should be dbbccaaee.
Any ideas on how to do this in an efficient way?
The Counting Sort option is pretty cool, and fast when the string to be sorted is long compared to the sort order string.
create an array where each index corresponds to a letter in the alphabet, this is the count array
for each letter in the sort target, increment the index in the count array which corresponds to that letter
for each letter in the sort order string
add that letter to the end of the output string a number of times equal to its count in the count array
Algorithmic complexity is O(n), where n is the length of the string to be sorted. As the Wikipedia article explains, we're able to beat the lower bound on standard comparison-based sorting because this isn't a comparison-based sort.
Here's some pseudocode.
char[26] countArray;

foreach(char c in sortTarget)
{
    countArray[c - 'a']++;
}

int head = 0;

foreach(char c in sortOrder)
{
    while(countArray[c - 'a'] > 0)
    {
        sortTarget[head] = c;
        head++;
        countArray[c - 'a']--;
    }
}
Note: this implementation requires that both strings contain only lowercase characters.
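A runnable Java version of this counting-sort approach (my own sketch, also assuming lowercase letters only) might look like:
public class CustomOrderSort {
    // Sorts 'target' according to the character order given in 'order'.
    // Assumes both strings contain only lowercase ASCII letters.
    static String sortByOrder(String target, String order) {
        int[] counts = new int[26];
        for (int i = 0; i < target.length(); i++) {
            counts[target.charAt(i) - 'a']++;
        }
        StringBuilder out = new StringBuilder(target.length());
        for (int i = 0; i < order.length(); i++) {
            char c = order.charAt(i);
            for (int n = counts[c - 'a']; n > 0; n--) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Example from the question: prints "dbbccaaee".
        System.out.println(sortByOrder("abcdeeabc", "dfbcae"));
    }
}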
Here's a nice, easy-to-understand algorithm with decent algorithmic complexity.
For each character in the sort order string
scan string to be sorted, starting at first non-ordered character (you can keep track of this character with an index or pointer)
when you find an occurrence of the specified character, swap it with the first non-ordered character
increment the index for the first non-ordered character
This is O(n*m), where n is the length of the string to be sorted and m is the length of the sort order string. We're able to beat the lower bound on comparison-based sorting because this algorithm doesn't really use comparisons. Like counting sort, it relies on the fact that you have a predefined, finite external ordering set.
Here's some pseudocode:
int head = 0;

foreach(char c in sortOrder)
{
    for(int i = head; i < sortTarget.length; i++)
    {
        if(sortTarget[i] == c)
        {
            // swap i with head
            char temp = sortTarget[head];
            sortTarget[head] = sortTarget[i];
            sortTarget[i] = temp;
            head++;
        }
    }
}
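For reference, a direct Java translation of this pseudocode (my sketch, operating in place on a char[]) could be:
public class SelectionBySortOrder {
    // Rearranges 'target' in place so its characters follow the order given by 'sortOrder'.
    // O(n*m), with n = target.length and m = sortOrder.length().
    static void sortByOrder(char[] target, String sortOrder) {
        int head = 0;
        for (int k = 0; k < sortOrder.length(); k++) {
            char c = sortOrder.charAt(k);
            for (int i = head; i < target.length; i++) {
                if (target[i] == c) {
                    // swap i with head
                    char temp = target[head];
                    target[head] = target[i];
                    target[i] = temp;
                    head++;
                }
            }
        }
    }

    public static void main(String[] args) {
        char[] input = "abcdeeabc".toCharArray();
        sortByOrder(input, "dfbcae");
        System.out.println(new String(input)); // prints dbbccaaee
    }
}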
In Python 2, you can just create an index and use it in a comparison function:
order = 'dfbcae'
input = 'abcdeeabc'
index = dict([ (y,x) for (x,y) in enumerate(order) ])
output = sorted(input, cmp=lambda x,y: index[x] - index[y])
print 'input=',''.join(input)
print 'output=',''.join(output)
gives this output:
input= abcdeeabc
output= dbbccaaee
Use binary search to find all the "split points" between different letters, then use the length of each segment directly. This will be asymptotically faster than the naive counting sort, but will be harder to implement:
Use an array of size 26*2 to store the begin and end of each letter;
Inspect the middle element, see if it is different from the element left to it. If so, then this is the begin for the middle element and end for the element before it;
Throw away the segment with identical begin and end (if there are any), recursively apply this algorithm.
Since there are at most 25 "splits", you won't have to do the search for more than 25 segments, and each search is O(log n). Since this is a constant times O(log n), the algorithm is O(n log n).
And of course, just using counting sort would be easier to implement:
Use an array of size 26 to record the number of different letters;
Scan the input string;
Output the string in the given sorting order.
This is O(n), n being the length of the string.
Interview questions are generally about thought process and don't usually care too much about language features, but I couldn't resist posting a VB.Net 4.0 version anyway.
"Efficient" can mean two different things. The first is "what's the fastest way to make a computer execute a task" and the second is "what's the fastest that we can get a task done". They might sound the same but the first can mean micro-optimizations like int vs short, running timers to compare execution times and spending a week tweaking every millisecond out of an algorithm. The second definition is about how much human time would it take to create the code that does the task (hopefully in a reasonable amount of time). If code A runs 20 times faster than code B but code B took 1/20th of the time to write, depending on the granularity of the timer (1ms vs 20ms, 1 week vs 20 weeks), each version could be considered "efficient".
Dim input = "abcdeeabc"
Dim sort = "dfbcae"
Dim SortChars = sort.ToList()
Dim output = New String((From c In input.ToList() Select c Order By SortChars.IndexOf(c)).ToArray())
Trace.WriteLine(output)
Here is my solution to the question
import java.util.*;
import java.io.*;

class SortString
{
    public static void main(String arg[]) throws IOException
    {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        // System.out.println("Enter 1st String :");
        // String s1 = br.readLine();
        // System.out.println("Enter 2nd String :");
        // String s2 = br.readLine();
        String s1 = "tracctor";
        String s2 = "car";
        String com = "";
        String uncom = "";
        for (int i = 0; i < s2.length(); i++)
        {
            if (s1.contains("" + s2.charAt(i)))
            {
                com = com + s2.charAt(i);
            }
        }
        System.out.println("Com :" + com);
        for (int i = 0; i < s1.length(); i++)
            if (!com.contains("" + s1.charAt(i)))
                uncom = uncom + s1.charAt(i);
        System.out.println("Uncom " + uncom);
        System.out.println("Combined " + (com + uncom));
        HashMap<String, Integer> h1 = new HashMap<String, Integer>();
        for (int i = 0; i < s1.length(); i++)
        {
            String m = "" + s1.charAt(i);
            if (h1.containsKey(m))
            {
                int val = (int) h1.get(m);
                val = val + 1;
                h1.put(m, val);
            }
            else
            {
                h1.put(m, new Integer(1));
            }
        }
        StringBuilder x = new StringBuilder();
        for (int i = 0; i < com.length(); i++)
        {
            if (h1.containsKey("" + com.charAt(i)))
            {
                int count = (int) h1.get("" + com.charAt(i));
                while (count != 0)
                {
                    x.append("" + com.charAt(i));
                    count--;
                }
            }
        }
        x.append(uncom);
        System.out.println("Sort " + x);
    }
}
Here is my version, which is O(n) in time. Instead of unordered_map, I could have just used a char array of constant size, i.e. char char_count[256] (and done ++char_count[ch - 'a']), assuming the input strings contain only lowercase ASCII characters.
#include <string>
#include <unordered_map>
using namespace std;

string SortOrder(const string& input, const string& sort_order) {
    unordered_map<char, int> char_count;
    for (auto ch : input) {
        ++char_count[ch];
    }
    string res = "";
    for (auto ch : sort_order) {
        unordered_map<char, int>::iterator it = char_count.find(ch);
        if (it != char_count.end()) {
            string s(it->second, it->first);
            res += s;
        }
    }
    return res;
}
private static String sort(String target, String reference) {
    final Map<Character, Integer> referencesMap = new HashMap<Character, Integer>();
    for (int i = 0; i < reference.length(); i++) {
        char key = reference.charAt(i);
        if (!referencesMap.containsKey(key)) {
            referencesMap.put(key, i);
        }
    }
    List<Character> chars = new ArrayList<Character>(target.length());
    for (int i = 0; i < target.length(); i++) {
        chars.add(target.charAt(i));
    }
    Collections.sort(chars, new Comparator<Character>() {
        @Override
        public int compare(Character o1, Character o2) {
            return referencesMap.get(o1).compareTo(referencesMap.get(o2));
        }
    });
    StringBuilder sb = new StringBuilder();
    for (Character c : chars) {
        sb.append(c);
    }
    return sb.toString();
}
In C#, I would just use the IComparer interface and leave the rest to Array.Sort:
void Main()
{
    // we define the IComparer class to define the sort order
    var sortOrder = new SortOrder("dfbcae");
    var testOrder = "abcdeeabc".ToCharArray();

    // sort the array using Array.Sort
    Array.Sort(testOrder, sortOrder);
    Console.WriteLine(new string(testOrder));
}

public class SortOrder : IComparer
{
    string sortOrder;

    public SortOrder(string sortOrder)
    {
        this.sortOrder = sortOrder;
    }

    public int Compare(object obj1, object obj2)
    {
        var obj1Index = sortOrder.IndexOf((char)obj1);
        var obj2Index = sortOrder.IndexOf((char)obj2);
        if (obj1Index == -1 || obj2Index == -1)
        {
            throw new Exception("character not found");
        }

        if (obj1Index > obj2Index)
        {
            return 1;
        }
        else if (obj1Index == obj2Index)
        {
            return 0;
        }
        else
        {
            return -1;
        }
    }
}
