Generate histogram from JavaRDD - apache-spark

I am trying to convert the data in a JavaRDD into a histogram so that I can bin the data in a certain way. For example, I want to build a histogram of entry sizes so I can find out how many entries fall into each size range. I am able to get the values into different RDDs, but I am not sure what I am missing here.
Is there an easier way to do this?
0 - 1 GB: 2 entries
1 - 5 GB: 4 entries
and so on
@Data // Lombok getters (getSize, getGroupId) are used below
class EntryWithSize implements Serializable {
    long size;
    String entryId;
    String groupId;
}
JavaRDD<EntryWithSize> entries = getEntries();
JavaRDD<HistoSize> histoSizeJavaRDD = entries.keyBy(EntryWithSize::getGroupId)
        .combineByKey(
                HistoSize::new,
                (HistoSize h, EntryWithSize y) -> h.mergeWith(new HistoSize(y)),
                HistoSize::mergeWith)
        .values();
@Data
@AllArgsConstructor
static class HistoSize implements Serializable {
    private static final long ONE_GB = 1L << 30; // assumed: 1 GB in bytes

    int oneGB;
    int fiveGB;

    public HistoSize(EntryWithSize entry) {
        addSize(entry);
    }

    private void addSize(EntryWithSize entry) {
        long size = entry.getSize();
        if (size <= ONE_GB) {
            oneGB++;
        } else {
            fiveGB++;
        }
    }

    public HistoSize mergeWith(HistoSize other) {
        oneGB += other.oneGB;
        fiveGB += other.fiveGB;
        return this;
    }
}

I was able to get it working by using a reduce on the final pair RDD. My test data was wrong, which was producing a red herring in the output.
Function2<HistoSize, HistoSize, HistoSize> reduceSumFunc = (a, b) -> new HistoSize(
        a.oneGB + b.oneGB,
        a.fiveGB + b.fiveGB);
HistoSize finalSize = histoSizeJavaRDD.reduce(reduceSumFunc);
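As an aside, if a single overall histogram (rather than one per groupId) is enough, Spark's built-in JavaDoubleRDD.histogram can do the binning directly. A minimal sketch reusing the entries RDD above; the bucket boundaries in bytes are my own illustration:
// Map sizes to doubles and let Spark count entries per bucket.
JavaDoubleRDD sizes = entries.mapToDouble(e -> (double) e.getSize());
double[] buckets = {0, 1L << 30, 5L * (1L << 30), Double.MAX_VALUE}; // 0-1 GB, 1-5 GB, 5 GB+
long[] counts = sizes.histogram(buckets); // counts[i] = number of entries in bucket i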

Related

Find position of item in list using Binary Search

The question is:
Given a list of String, find a specific string in the list and return
its index in the ordered list of String sorted by mergesort. There are
two cases:
The string is in the list, return the index it should be in, in the ordered list.
The String is NOT in the list, return the index it is supposed to be in, in the ordered list.
Here is my code; I assume that the given list is already ordered.
For the 2nd case, how do I use mergesort to find the supposed index? I would appreciate some clues.
I was thinking of getting a copy of the original list first, sorting it, and getting the index of the string in the copied list. Here I got stuck... do I use mergesort again to get the index of a non-existing string in the copied list?
public static int BSearch(List<String> s, String a) {
int size = s.size();
int half = size / 2;
int index = 0;
// base case?
if (half == 0) {
if (s.get(half) == a) {
return index;
} else {
return index + 1;
}
}
// with String a
if (s.contains(a)) {
// on the right
if (s.indexOf(s) > half) {
List<String> rightHalf = s.subList(half + 1, size);
index += half;
return BSearch(rightHalf, a);
} else {
// one the left
List<String> leftHalf = s.subList(0, half - 1);
index += half;
return BSearch(leftHalf, a);
}
}
return index;
}
When I run this code, the index is not updated. I wonder what is wrong here. I only get 0 or 1 when I test the code, even when the string is in the list.
Your code only returns 0 or 1 because you don't keep track of your index across recursive calls; instead it is reset to 0 each time. Also, to find where a non-existent element should be, consider the list {0,2,3,5,6}. If we were to run a binary search looking for 4 here, it should stop at the index where the element 5 is. Hope that's enough to get you started!
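To make that concrete, here is a minimal sketch of my own (not the asker's intended mergesort-based approach): an iterative binary search over an already sorted List<String> that returns either the index of the match or the index where the string would belong. It assumes java.util.List.
public static int bSearch(List<String> s, String a) {
    int lo = 0;
    int hi = s.size() - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = a.compareTo(s.get(mid)); // compare Strings with compareTo, not ==
        if (cmp == 0) {
            return mid;       // found: this is its index
        } else if (cmp < 0) {
            hi = mid - 1;     // keep searching the left half
        } else {
            lo = mid + 1;     // keep searching the right half
        }
    }
    return lo;                // not found: lo is where it would be inserted
}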

Apache Spark Streaming: Median of windowed PairDStream by key

I want to calculate the median value of a PairDStream for the values of each key.
I already tried the following, which is very inefficient:
JavaPairDStream<String, Iterable<Float>> groupedByKey = pairDstream.groupByKey();
JavaPairDStream<String, Float> medianPerPlug1h = groupedByKey.transformToPair(new Function<JavaPairRDD<String,Iterable<Float>>, JavaPairRDD<String,Float>>() {
public JavaPairRDD<String, Float> call(JavaPairRDD<String, Iterable<Float>> v1) throws Exception {
return v1.mapValues(new Function<Iterable<Float>, Float>() {
public Float call(Iterable<Float> v1) throws Exception {
List<Float> buffer = new ArrayList<Float>();
long count = 0L;
Iterator<Float> iterator = v1.iterator();
while(iterator.hasNext()) {
buffer.add(iterator.next());
count++;
}
float[] values = new float[(int)count];
for(int i = 0; i < buffer.size(); i++) {
values[i] = buffer.get(i);
}
Arrays.sort(values);
float median;
int startIndex;
if(count % 2 == 0) {
startIndex = (int)(count / 2 - 1);
float a = values[startIndex];
float b = values[startIndex + 1];
median = (a + b) / 2.0f;
} else {
startIndex = (int)(count/2);
median = values[startIndex];
}
return median;
}
});
}
});
medianPerPlug1h.print();
Can somebody help me with a more efficient transformation? I have about 1950 different keys, and each can accumulate up to 3600 values (1 data point per second, over a window of 1 hour) to find the median of.
Thank you!
The first thing is that I don't know why you are using Spark for this kind of task. It doesn't seem to be related to big data, considering you have just a few thousand values. It may make things more complicated. But let's assume you're planning to scale up to bigger datasets.
I would try to use a more optimized algorithm for finding the median than just sorting the values. Sorting an array of values runs in O(n log n) time.
You could think about using a linear-time selection algorithm like median of medians.
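To illustrate, here is a minimal quickselect sketch of my own (expected linear time; a median-of-medians pivot would make even the worst case linear), assuming the values for one key have already been collected into a float[]:
static float median(float[] a) {
    int n = a.length;
    if (n % 2 == 1) {
        return select(a, n / 2);
    }
    return (select(a, n / 2 - 1) + select(a, n / 2)) / 2.0f;
}

// Returns the k-th smallest element (0-based); partially reorders the array.
static float select(float[] a, int k) {
    java.util.Random rnd = new java.util.Random();
    int lo = 0, hi = a.length - 1;
    while (lo < hi) {
        float pivot = a[lo + rnd.nextInt(hi - lo + 1)];
        int i = lo, j = hi;
        while (i <= j) {                       // partition around the pivot
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                float tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++;
                j--;
            }
        }
        if (k <= j) {
            hi = j;                            // the k-th value is in the left part
        } else if (k >= i) {
            lo = i;                            // the k-th value is in the right part
        } else {
            return a[k];                       // between j and i everything equals the pivot
        }
    }
    return a[lo];
}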
1) Avoid using groupByKey; reduceByKey is more efficient than groupByKey.
2) reduceByKeyAndWindow(Function2, windowDuration, slideDuration) can serve you better.
example:
JavaPairDStream<String, String> merged = yourRDD.reduceByKeyAndWindow(
        new Function2<String, String, String>() {
            public String call(String arg0, String arg1) throws Exception {
                return arg0 + "," + arg1;
            }
        }, Durations.seconds(windowDur), Durations.seconds(slideDur));
Assume the output from this stream will look like this:
(key,1,2,3,4,5,6,7)
(key,1,2,3,4,5,6,7).
Now, for each key, you can parse this string; you will have all the values and their count, and you can compute the statistic you need from them (for the median, sort the parsed values and take the middle one).
Note: I used a String just to concatenate.
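A minimal sketch of that parsing step, as my own illustration; it assumes the merged stream from the snippet above and the comma-separated format shown:
JavaPairDStream<String, Float> medians = merged.mapValues(new Function<String, Float>() {
    public Float call(String joined) throws Exception {
        String[] parts = joined.split(",");
        float[] values = new float[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Float.parseFloat(parts[i]);
        }
        Arrays.sort(values);
        int mid = values.length / 2;
        // even count: average the two middle values; odd count: take the middle one
        return values.length % 2 == 0 ? (values[mid - 1] + values[mid]) / 2.0f : values[mid];
    }
});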
I hope it helps :)

Reduce a string using grammar-like rules

I'm trying to find a suitable DP algorithm for simplifying a string. For example, I have the string a b a b and a list of rules:
a b -> b
a b -> c
b a -> a
c c -> b
The purpose is to get all single chars that can be obtained from the given string using these rules. For this example it will be b, c. The length of the given string can be up to 200 symbols. Could you please suggest an efficient algorithm?
Rules are always 2 -> 1. My idea was to build a tree whose root is the given string and where each child is the string after one transformation, but I'm not sure it's the best way.
If you read those rules from right to left, they look exactly like the rules of a context-free grammar, and have basically the same meaning. You could apply a bottom-up parsing algorithm like the Earley algorithm to your data, along with a suitable starting rule; something like
start <- start a
| start b
| start c
and then just examine the parse forest for the shortest chain of starts. The worst case remains O(n^3), of course, but Earley parsers are fairly efficient these days.
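For the example above, reading each rule right to left gives these productions (same notation as the start rule):
b <- a b
c <- a b
a <- b a
b <- c c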
You can also produce parse forests when parsing with derivatives. You might be able to efficiently check them for short chains of starts.
For a DP problem, you always need to understand how you can construct the answer for a big problem in terms of smaller sub-problems. Assume you have your function simplify, which is called with an input of length n. There are n-1 ways to split the input into a first and a last part. For each of these splits, you recursively call simplify on both the first part and the last part. The final answer for the input of length n is the set of all combinations of answers for the first and the last part that are allowed by the rules.
In Python, this can be implemented like so:
from functools import lru_cache

rules = {'ab': set('bc'), 'ba': set('a'), 'cc': set('b')}
all_chars = set(c for cc in rules.values() for c in cc)

@lru_cache(maxsize=None)  # memoize, as explained below
def simplify(s):
    if len(s) == 1:  # base case to end recursion
        return set(s)
    possible_chars = set()
    # iterate over all the possible splits of s
    for i in range(1, len(s)):
        head = s[:i]
        tail = s[i:]
        # check all possible combinations of answers of sub-problems
        for c1 in simplify(head):
            for c2 in simplify(tail):
                possible_chars.update(rules.get(c1 + c2, set()))
        # speed hack
        if possible_chars == all_chars:  # won't get any bigger
            return all_chars
    return possible_chars
Quick check:
In [53]: simplify('abab')
Out[53]: {'b', 'c'}
To make this fast enough for large strings (to avoid exponential behavior), you should use a memoize decorator, as with @lru_cache above. This is a critical step in solving DP problems; otherwise you are just doing a brute-force calculation. A further tiny speedup can be obtained by returning from the function as soon as possible_chars == set('abc'), since at that point you are already sure that you can generate all possible outcomes.
Analysis of running time: for an input of length n, there are 2 substrings of length n-1, 3 substrings of length n-2, ... n substrings of length 1, for a total of O(n^2) subproblems. Due to the memoization, the function is called at most once for every subproblem. Maximum running time for a single sub-problem is O(n) due to the for i in range(len(s)), so the overall running time is at most O(n^3).
Let N be the length of the given string and R the number of rules.
Expanding the tree in a top-down manner yields computational complexity O(N·R^N) in the worst case (e.g. an input string of the form aaa... and rules like aa -> a).
Proof:
The root of the tree has at most (N-1)R children, which have at most (N-1)R^2 children, ..., down to (N-1)R^N leaves. So the total complexity is O((N-1)R + (N-1)R^2 + ... + (N-1)R^N) = O(N(R + R^2 + ... + R^N)) = O(N·R^N) (geometric series).
Recursive Java implementation of this naive approach:
public static void main(String[] args) {
Map<String, Character[]> rules = new HashMap<String, Character[]>() {{
put("ab", new Character[]{'b', 'c'});
put("ba", new Character[]{'a'});
put("cc", new Character[]{'b'});
}};
System.out.println(simplify("abab", rules));
}
public static Set<String> simplify(String in, Map<String, Character[]> rules) {
Set<String> result = new HashSet<String>();
simplify(in, rules, result);
return result;
}
private static void simplify(String in, Map<String, Character[]> rules, Set<String> result) {
if (in.length() == 1) {
result.add(in);
}
for (int i = 0; i < in.length() - 1; i++) {
String two = in.substring(i, i + 2);
Character[] rep = rules.get(two);
if (rep != null) {
for (Character c : rep) {
simplify(in.substring(0, i) + c + in.substring(i + 2, in.length()), rules, result);
}
}
}
}
Bas Swinckels's O(RN^3) Java implementation (with HashMap as a memoization cache):
public static Set<String> simplify2(final String in, Map<String, Character[]> rules) {
Map<String, Set<String>> cache = new HashMap<String, Set<String>>();
return simplify2(in, rules, cache);
}
private static Set<String> simplify2(final String in, Map<String, Character[]> rules, Map<String, Set<String>> cache) {
final Set<String> cached = cache.get(in);
if (cached != null) {
return cached;
}
Set<String> ret = new HashSet<String>();
if (in.length() == 1) {
ret.add(in);
return ret;
}
for (int i = 1; i < in.length(); i++) {
String head = in.substring(0, i);
String tail = in.substring(i, in.length());
for (String c1 : simplify2(head, rules, cache)) {
for (String c2 : simplify2(tail, rules, cache)) {
Character[] rep = rules.get(c1 + c2);
if (rep != null) {
for (Character c : rep) {
ret.add(c.toString());
}
}
}
}
}
cache.put(in, ret);
return ret;
}
Output in both approaches:
[b, c]

Count the frequency of different characters in a string

I am currently trying to create a small program where the user enters a string in a text area, clicks a button, and the program counts the frequency of each character in the string and shows the result in another text area.
E.g. Step 1: User enters aaabbbbbbcccdd
Step 2: User clicks the button
Step 3: a 3
b 6
c 3
d 2
This is what I've done so far....
public partial class Form1 : Form
{
Dictionary<string, int> dic = new Dictionary<string, int>();
string s = "";
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
s = textBox1.Text;
int count = 0;
for (int i = 0; i < s.Length; i++ )
{
textBox2.Text = Convert.ToString(s[i]);
if (dic.Equals(s[i]))
{
count++;
}
else
{
dic.Add(Convert.ToString(s[i]), count++);
}
}
}
}
Any ideas or help on how I can continue? Right now the program gives a runtime error when the same character appears more than once.
Thank you.
var lettersAndCounts = s.GroupBy(c=>c).Select(group => new {
Letter= group.Key,
Count = group.Count()
});
Instead of dic.Equals, use dic.ContainsKey. However, I would use this little LINQ query:
Dictionary<string, int> dict = textBox1.Text
.GroupBy(c => c)
.ToDictionary(g => g.Key.ToString(), g => g.Count());
You are attempting to compare the entire dictionary to a string, which doesn't tell you whether there is a key in the dictionary that corresponds to the string. As the dictionary is never equal to the string, your code will always think that it should add a new item even if one already exists, and that is the cause of the runtime error.
Use the ContainsKey method to check if the string exists as a key in the dictionary.
Instead of using a count variable, you would want to increase the numbers in the dictionary, and initialise new items with a count of one:
string key = s[i].ToString();
textBox2.Text = key;
if (dic.ContainsKey(key)) {
dic[key]++;
} else {
dic.Add(key, 1);
}
I'm going to suggest a different and somewhat simpler approach for doing this. Assuming you are using English strings, you can create an array with capacity = 26. Then depending on the character you encounter you would increment the appropriate index in the array. For example, if the character is 'a' increment count at index 0, if the character is 'b' increment the count at index 1, etc...
Your implementation will look something like this:
int[] count = new int[26];
for (int i = 0; i < s.Length; i++)
{
    count[Char.ToLower(s[i]) - 'a']++;
}
When this finishes you will have the number of 'a's in count[0] and the number of 'z's in count[25].

How do I assign a String at Array1[x] to an int at Array2[x]?

I'm trying to organize data given to me in a text file; there are 4 pieces of info on each line (city, country, population, and date). I wanted to have an array for each, so I first put it all into one big String array and started to separate them into 4 arrays, but I needed to change the population info to an int array and it says:
"Type mismatch: cannot convert from element type int to String"
//Separate the information by commas
while(sc.hasNextLine()){
String line = sc.nextLine();
input = line.split(",");
//Organize the data into 4 seperate arrays
for(int x=0; x<input.length;x++){
if(x%4==0){
cities[x] = input[x];
}
if(x%4==1){
countries[x] = input[x];
}
if(x%4==2){
population[x] = Integer.parseInt(input[x]);
}
if(x%4==3){
dates[x] = input[x];
}
}
}
And when I print out the arrays, they have a bunch of nulls in between the data. I'm planning to create objects that hold the 4 pieces of data so that I can then sort them by population, date, etc. I'm pretty new to working with objects, so if anyone has a better way of getting the 4 pieces of data into an object, I'd appreciate it, because I haven't figured one out yet. My end goal is to have an array of these objects that I can use different sorting methods on.
I would recommend doing something like this:
public class MyData {
private String city;
private String country;
private Integer population;
private String date;
public MyData(String city, String country, Integer population, String date) {
this.city = city;
this.country = country;
this.population = population;
this.date = date;
}
// Add getters and setters here
}
And then in the file you're posting about:
...
ArrayList<MyData> allData = new ArrayList<MyData>();
while(sc.hasNextLine()) {
String[] values = sc.nextLine().split(",");
allData.add(new MyData(values[0], values[1], Integer.parseInt(values[2]), values[3]));
}
...
You need an object to store the data in so that you keep the relationship between the values in each column.
Also, I'm just assuming you're using Java here. Which language we're talking about is something you should include in your question or as a tag.
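Once the getters mentioned in the class above exist, sorting by any field is a one-liner with java.util.Comparator; for example (getPopulation() here stands for whatever getter you add):
// Sort the collected rows by population.
allData.sort(Comparator.comparing(MyData::getPopulation));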
The problem is with your x index. If you look carefully at your for loop, you will see that it uses x, the position within the current line, as the index into all four arrays, so every line writes to the same few slots and everything else stays null.
try
int index = 0;
while(sc.hasNextLine()){
String line = sc.nextLine();
input = line.split(",");
//Organize the data into 4 seperate arrays
for(int x=0; x<input.length;x++){
if(x%4==0){
cities[index] = input[x];
}
if(x%4==1){
countries[index] = input[x];
}
if(x%4==2){
population[index] = Integer.parseInt(input[x]);
}
if(x%4==3){
dates[index] = input[x];
}
}
++index;
}
