Why does all data end up in one partition after reduceByKey?

I have this simple Spark program. I am wondering why all the data ends up in one partition.
val l = List((30002,30000), (50006,50000), (80006,80000),
(4,0), (60012,60000), (70006,70000),
(40006,40000), (30012,30000), (30000,30000),
(60018,60000), (30020,30000), (20010,20000),
(20014,20000), (90008,90000), (14,0), (90012,90000),
(50010,50000), (100008,100000), (80012,80000),
(20000,20000), (30010,30000), (20012,20000),
(90016,90000), (18,0), (12,0), (70016,70000),
(20,0), (80020,80000), (100016,100000), (70014,70000),
(60002,60000), (40000,40000), (60006,60000),
(80000,80000), (50008,50000), (60008,60000),
(10002,10000), (30014,30000), (70002,70000),
(40010,40000), (100010,100000), (40002,40000),
(20004,20000),
(10018,10000), (50018,50000), (70004,70000),
(90004,90000), (100004,100000), (20016,20000))
val l_rdd = sc.parallelize(l, 2)
// print each item and index of the partition it belongs to
l_rdd.mapPartitionsWithIndex((index, iter) => {
  iter.toList.map(x => (index, x)).iterator
}).collect.foreach(println)
// reduce on the second element of the list.
// alternatively you can use aggregateByKey
val l_reduced = l_rdd.map(x => {
  (x._2, List(x._1))
}).reduceByKey((a, b) => {b ::: a})
// print the reduced results along with its partition index
l_reduced.mapPartitionsWithIndex((index, iter) => {
  iter.toList.map(x => (index, x._1, x._2.size)).iterator
}).collect.foreach(println)
When you run this, you will see that the data (l_rdd) is distributed into two partitions. Once I reduce it, the resulting RDD (l_reduced) also has two partitions, but all the data is in one partition (index 0) and the other one is empty. This happens even if the data is huge (a few GBs). Shouldn't l_reduced also be distributed across two partitions?

val l_reduced = l_rdd.map(x => {
  (x._2, List(x._1))
}).reduceByKey((a, b) => {b ::: a})
With reference to the above snippet, you are partitioning by the second field of the RDD. All the numbers in the second field end with 0, so they are all even.
reduceByKey uses a HashPartitioner by default, where the partition number for a record is decided by the following function:
def getPartition(key: Any): Int = key match {
  case null => 0
  case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
}
And Utils.nonNegativeMod is defined as follows:
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}
Let us see what happens when we apply the above two pieces of logic to your input:
scala> l.map(_._2.hashCode % 2) // numPartitions = 2
res10: List[Int] = List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Therefore, all of your records end up in partition 0.
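The same arithmetic can be reproduced outside Spark. The following is a minimal Python sketch, not Spark code. It assumes that the hash code of an integer key is the integer itself (as it is for Int on the JVM), and the helper names non_negative_mod and get_partition are only stand-ins for the Scala functions quoted above.

```python
import math

def non_negative_mod(x, mod):
    # JVM-style remainder: the sign follows the dividend, unlike Python's %
    raw_mod = int(math.fmod(x, mod))
    return raw_mod + (mod if raw_mod < 0 else 0)

def get_partition(key, num_partitions):
    # Assumption: the key is an int whose hash code is its own value
    return 0 if key is None else non_negative_mod(key, num_partitions)

# The distinct keys (second tuple fields) that appear in the question
keys = list(range(0, 100001, 10000))

with_two = [get_partition(k, 2) for k in keys]
with_three = [get_partition(k, 3) for k in keys]

print(with_two)    # every key is even, so every index is 0
print(with_three)  # the same keys with three partitions
```

With two partitions every key lands on index 0, while with three partitions the same keys are spread over all three indexes.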
You can solve this problem with a repartition:
val l_reduced = l_rdd.map(x => {
  (x._2, List(x._1))
}).reduceByKey((a, b) => {b ::: a}).repartition(2)
which gives:
(0,100000,4)
(0,10000,2)
(0,0,5)
(0,20000,6)
(0,60000,5)
(0,80000,4)
(1,50000,4)
(1,30000,6)
(1,90000,4)
(1,70000,5)
(1,40000,4)
Alternatively, you can create a custom partitioner.

Unless you specify otherwise, the partitioning will be done based on the hashcode of the keys concerned, with the assumption that the hashcodes will result in a relatively even distribution. In this case, your hashcodes are all even, and therefore will all go into partition 0.
If this is truly representative of your data set, there is an overload for reduceByKey which takes the partitioner as well as the reduce function. I would suggest providing an alternative partitioning algorithm for a dataset like this.
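To illustrate what an alternative partitioning rule could look like, here is a hedged Python sketch. It is plain Python rather than Spark code, the name scaled_partition is hypothetical, and the step of 10000 is an assumption read off the sample keys. In Spark, a rule like this would sit in the getPartition method of a custom Partitioner passed to reduceByKey.

```python
from collections import defaultdict

def scaled_partition(key, num_partitions, step=10000):
    # Hypothetical rule: divide out the factor shared by all keys,
    # then take the remainder, so consecutive keys alternate partitions
    return (key // step) % num_partitions

# The distinct keys from the question
keys = list(range(0, 100001, 10000))

layout = defaultdict(list)
for k in keys:
    layout[scaled_partition(k, 2)].append(k)

for index in sorted(layout):
    print(index, layout[index])
```

This only helps when the keys really do share a known common factor. For arbitrary keys, a different partition count or a better hash may be the simpler choice.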

Related

Removing padding at end of rust vector?

I have found this neat way of padding vector messages, so that I know they will all be the same length:
let len = 20;
let mut msg = vec![1, 23, 34];
msg.resize(len, 0);
println!("msg {:?}", msg);
Nice, this pads any message with zeros. Running this code gives me:
msg [1, 23, 34, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
But let's say I send this message over some connection, and the other party receives it at their end.
How do I take a vector like this and strip off all the 0's at the end?
Note that the original message may be of variable length, but is always shorter than 20.
Another thing that could work for me is to have all the padding at the beginning of the vector, doing something like this:
let len = 20;
let msg = vec![1, 23, 34];
let mut payload = vec![0;len-msg.len()];
payload.extend(&msg);
println!("msg {:?}", payload);
And then just removing all the leading zeros.

As stated in the comments above, changing your protocol to include the length of the message is probably cleaner.
But here is a solution that removes padding in front of the message (provided the message itself doesn't start with zeros):
let msg: Vec<_> = payload.into_iter().skip_while(|x| *x == 0).collect();
Note that this allocates a new Vec for your msg. You could probably use the iterator directly instead.
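The trade-off between stripping padding and sending the length can be sketched in a few lines. This is an illustrative Python sketch rather than Rust. The function names pad, strip_trailing_zeros, frame and unframe are hypothetical, and the length-prefixed layout (first slot holds the message length) is only one possible protocol, assuming a slot can hold values up to the frame length.

```python
def pad(msg, length=20):
    # Append zeros until the message has the fixed length
    if len(msg) > length:
        raise ValueError("message longer than the fixed length")
    return msg + [0] * (length - len(msg))

def strip_trailing_zeros(payload):
    # Walk back from the end while the last kept element is zero
    end = len(payload)
    while end > 0 and payload[end - 1] == 0:
        end -= 1
    return payload[:end]

def frame(msg, length=20):
    # Assumed layout: [message length, message..., zero padding...]
    if len(msg) > length - 1:
        raise ValueError("message does not fit in the frame")
    return [len(msg)] + msg + [0] * (length - 1 - len(msg))

def unframe(payload):
    # Read the length from the first slot and slice out the message
    n = payload[0]
    return payload[1:1 + n]

print(strip_trailing_zeros(pad([1, 23, 34])))  # round-trips this message
print(strip_trailing_zeros(pad([1, 0])))       # loses the genuine trailing 0
print(unframe(frame([1, 0])))                  # the length prefix keeps it
```

Stripping zeros cannot tell padding from data that happens to be zero, which is the reason a length field tends to be the more robust choice.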

C#: assigning a random result from an array to a variable returns an out-of-range error

I'm trying to create a program that generates random numbers, indexes a table, stores the results in some lists, then shows the results from the lists at a later point. I have an int variable called FreqMod, to which I'm trying to assign a random result from an array called FrequencyModifier[]. This is throwing a runtime error "OutOfRange" as if it's getting a null value. But I don't see how. Let me see if I can post all the relevant code:
int TheHour = 0, TheMinute = 0, PlaceHolder = 0, FreqMod = 0;
int[] FrequencyModifier =
{
    -3, -2, -1, 0, 0, 0, 1, 2, 3, 4, 6
}; // int array FrequencyModifier
DiceResult = 0;
DiceResult = RollDice.TwoD6();
FreqMod = FrequencyModifier[DiceResult - 1];
This will be a modifier added to the results of later dice rolls.
I tried this after some research and still no joy:
int[] FrequencyModifier =
{
    -3, -2, -1, 0, 0, 0, 1, 2, 3, 4, 6
}; // int array FrequencyModifier
FrequencyModifier = new int[11];
DiceResult = 0;
DiceResult = RollDice.TwoD6();
FreqMod = FrequencyModifier[DiceResult - 1];
TwoD6 comes from a class called dice, of which RollDice is an object:
public int TwoD6()
{
    diceresult = 0;
    numdice = 2;
    for(int i = 1; i <= numdice; i++)
    {
        dieresult = 0;
        lowest = 1;
        highest = 6;
        diceresult = diceresult + (1 + Rolldie.Next(lowest-1, highest));
    }
    return diceresult;
}
DiceResult should be the random number returned from the method RollDice.TwoD6.
I am trying to use it to assign the corresponding number from the array to FreqMod. So that if the random number returned is 3, then it would assign -2.
Before Christmas it had been almost 20 years since I looked at C and C++; I'm now trying to learn C#.
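The index arithmetic in the posted code can be checked by hand. The sketch below is in Python rather than C#, and assumes Rolldie is a System.Random, whose Next(min, max) excludes the upper bound, so each die yields 1 to 6 and TwoD6 returns 2 to 12. The variable names are illustrative only.

```python
frequency_modifier = [-3, -2, -1, 0, 0, 0, 1, 2, 3, 4, 6]

# Every total that two six-sided dice can produce: 2 through 12
totals = sorted({a + b for a in range(1, 7) for b in range(1, 7)})

# Index used in the question (total - 1) versus an offset of 2
indexes_minus_one = [t - 1 for t in totals]
indexes_minus_two = [t - 2 for t in totals]

last_valid = len(frequency_modifier) - 1

# A roll of 12 gives index 11, one past the last valid index 10
print(max(indexes_minus_one), last_valid)
# With an offset of 2 the indexes run from 0 to 10
print(min(indexes_minus_two), max(indexes_minus_two))
# A roll of 3 then maps to -2, as the question expects
print(frequency_modifier[3 - 2])
```

Under these assumptions the 11 possible totals line up with the 11 array slots only when 2 is subtracted. Note also that the second attempt's FrequencyModifier = new int[11] replaces the table with an array of zeros.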

NuSMV: how to exclude a possible next state

I want to exclude a possible next case under specific conditions.
For example, I have:
token : array 1..2 of {0, 1, 2, 3, 4, 5, 6};
next(token[1]) := case
x : {1, 2, 3, 4, 5, 6};
TRUE : 0;
esac;
next(token[2]) := case
x : {1, 2, 3, 4, 5, 6};
TRUE : 0;
esac;
-- exclude state value 1 if !position1free
...
DEFINE position1free := token[1] != 1 & token[2] != 1;
...
The same for all the values 1..6.
Otherwise, I have to write out a lot of combinations to return only the positions that are free.
Has anyone an idea if this is possible?

A possible approach is to further constrain the space of states with
TRANS (!position1free) -> (next(token[1]) != 1 & next(token[2]) != 1);
Please beware that careless use of TRANS can result in a Finite State Machine that has no initial state, or one that contains some state s_i without any future state (source: nuXmv: Introduction).

Draw cube vertices with fewest number of steps

What's the fewest number of steps needed to draw all of the cube's vertices, without picking up the pen from the paper?
So far I have reduced it to 16 steps:
0, 0, 0
0, 0, 1
0, 1, 1
1, 1, 1
1, 1, 0
0, 1, 0
0, 0, 0
1, 0, 0
1, 0, 1
0, 0, 1
0, 1, 1
0, 1, 0
1, 1, 0
1, 0, 0
1, 0, 1
1, 1, 1
I presume it can be reduced to fewer than 16 steps, as there are only 12 edges to be drawn.
You can view a working example in three.js javascript here:
http://jsfiddle.net/kmturley/5aeucehf/show/

I coded a small brute-force solver for this. The best solution it found uses 16 vertexes, and it took about 11.6 sec to compute. Everything is in C++ (visualization by OpenGL).
First the cube representation:
//---------------------------------------------------------------------------
#define a 0.5
double pnt[]=
    {
    -a,-a,-a, // point 0
    -a,-a,+a,
    -a,+a,-a,
    -a,+a,+a,
    +a,-a,-a,
    +a,-a,+a,
    +a,+a,-a,
    +a,+a,+a, // point 7
    1e101,1e101,1e101, // end tag
    };
#undef a
int lin[]=
    {
    0,1,
    0,2,
    0,4,
    1,3,
    1,5,
    2,3,
    2,6,
    3,7,
    4,5,
    4,6,
    5,7,
    6,7,
    -1,-1, // end tag
    };
// int solution[]={ 0, 1, 3, 1, 5, 4, 0, 2, 3, 7, 5, 4, 6, 2, 6, 7, -1 }; // found polyline solution
//---------------------------------------------------------------------------
void draw_lin(double *pnt,int *lin)
    {
    glBegin(GL_LINES);
    for (int i=0;lin[i]>=0;)
        {
        glVertex3dv(pnt+(lin[i]*3)); i++;
        glVertex3dv(pnt+(lin[i]*3)); i++;
        }
    glEnd();
    }
//---------------------------------------------------------------------------
void draw_pol(double *pnt,int *pol)
    {
    glBegin(GL_LINE_STRIP);
    for (int i=0;pol[i]>=0;i++) glVertex3dv(pnt+(pol[i]*3));
    glEnd();
    }
//---------------------------------------------------------------------------
Now the solver:
//---------------------------------------------------------------------------
struct _vtx // vertex
    {
    List<int> i; // connected to (vertexes...)
    _vtx(){}; _vtx(_vtx& a){ *this=a; }; ~_vtx(){}; _vtx* operator = (const _vtx *a) { *this=*a; return this; }; /*_vtx* operator = (const _vtx &a) { ...copy... return this; };*/
    };
const int _max=16; // know solution size (do not bother to find longer solutions)
int use[_max],uses=0; // temp line usage flag
int pol[_max],pols=0; // temp solution
int sol[_max+2],sols=0; // best found solution
List<_vtx> vtx; // model vertexes + connection info
//---------------------------------------------------------------------------
void _solve(int a)
    {
    _vtx *v; int i,j,k,l,a0,a1,b0,b1;
    // add point to actual polyline
    pol[pols]=a; pols++; v=&vtx[a];
    // test for solution
    for (l=0,i=0;i<uses;i++) use[i]=0;
    for (a0=pol[0],a1=pol[1],i=1;i<pols;i++,a0=a1,a1=pol[i])
        for (j=0,k=0;k<uses;k++)
            {
            b0=lin[j]; j++;
            b1=lin[j]; j++;
            if (!use[k]) if (((a0==b0)&&(a1==b1))||((a0==b1)&&(a1==b0))) { use[k]=1; l++; }
            }
    if (l==uses) // better solution found
        if ((pols<sols)||(sol[0]==-1))
            for (sols=0;sols<pols;sols++) sol[sols]=pol[sols];
    // recursion only if pol not too big
    if (pols+1<sols) for (i=0;i<v->i.num;i++) _solve(v->i.dat[i]);
    // back to previous state
    pols--; pol[pols]=-1;
    }
//---------------------------------------------------------------------------
void solve(double *pnt,int *lin)
    {
    int i,j,a0,a1;
    // init sizes
    for (i=0;i<_max;i++) { use[i]=0; pol[i]=-1; sol[i]=-1; }
    for(i=0,j=0;pnt[i]<1e100;i+=3,j++); vtx.allocate(j); vtx.num=j;
    for(i=0;i<vtx.num;i++) vtx[i].i.num=0;
    // init connections
    for(uses=0,i=0;lin[i]>=0;uses++)
        {
        a0=lin[i]; i++;
        a1=lin[i]; i++;
        vtx[a0].i.add(a1);
        vtx[a1].i.add(a0);
        }
    // start actual solution (does not matter which vertex on cube is first)
    pols=0; sols=_max+1; _solve(0);
    sol[sols]=-1; if (sol[0]<0) sols=0;
    }
//---------------------------------------------------------------------------
Usage:
solve(pnt,lin); // call once to compute the solution
glColor3f(0.2,0.2,0.2); draw_lin(pnt,lin); // draw gray outline
glColor3f(1.0,1.0,1.0); draw_pol(pnt,sol); // overwrite by solution to visually check correctness (Z-buffer must pass also on equal values!!!)
List is just my own template for a dynamic array:
List<int> x is equivalent to int x[]
x.add(5) appends 5 to the end of the list
x.num is the used size of the list in entries
x.allocate(100) preallocates room for 100 entries (to avoid reallocation slowdowns)

The solve(pnt,lin) algorithm:
First it prepares the vertex data. Each vertex vtx[i] corresponds to the i-th point in the pnt table, and its i[] list contains the index of every vertex connected to it.
It then starts with vertex 0. On a cube the start point is irrelevant; otherwise there would be a for loop through every vertex as start point.

_solve(a):
It adds the vertex index a to the current solution pol[pols].
Then it tests how many lines are present in the current solution. If all lines from lin[] are drawn and the solution is shorter than the one already found, it is copied as the new best solution.
After the test, if the current solution is not too long, it recursively adds the next vertex, trying only the vertexes connected to the last vertex used, to limit the number of combinations.

At the end sol[] holds the solution as a list of vertex indexes, and sols is the number of vertexes used (the number of line segments drawn plus one).
[Notes]
the code is not very clean but it works (sorry for that)
hope I did not forget to copy something
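The result of 16 points can also be cross-checked without a brute-force search. The sketch below is in Python rather than C++, reuses the lin[] edge table and the commented-out solution[] from the listing above, and applies the standard Euler-trail parity argument as an independent check. It is not part of the original solver.

```python
from collections import Counter

# Edge table copied from lin[] and the polyline from the solution[] comment
edges = [(0, 1), (0, 2), (0, 4), (1, 3), (1, 5), (2, 3),
         (2, 6), (3, 7), (4, 5), (4, 6), (5, 7), (6, 7)]
solution = [0, 1, 3, 1, 5, 4, 0, 2, 3, 7, 5, 4, 6, 2, 6, 7]

def covers_all_edges(path, edges):
    # Every step must be a cube edge, and every cube edge must be drawn
    wanted = {frozenset(e) for e in edges}
    drawn = {frozenset(step) for step in zip(path, path[1:])}
    return drawn == wanted

# A single pen stroke can leave at most two vertexes with odd degree.
# Each repeated edge changes the parity of exactly two vertexes.
degree = Counter(v for e in edges for v in e)
odd = sum(1 for d in degree.values() if d % 2 == 1)
extra_segments = max(0, (odd - 2) // 2)
min_points = len(edges) + extra_segments + 1

print(covers_all_edges(solution, edges))
print(odd, extra_segments, min_points)
```

All 8 cube vertexes have degree 3, so at least 3 edges must be drawn twice. That gives 15 segments and 16 points, which agrees with the solver's result.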

Longest Substring Pair Sequence: is it Longest Common Subsequence or what?

I have a pair of strings, for example abcabcabc and abcxxxabc, and a List of Common Substring Pairs (LCSP). In this case the LCSP has 6 pairs, because three abc in the first string map to two abc in the second string. Now I need to find the longest valid (incrementing) sequence of pairs. In this case there are three equally long solutions: 0:0,3:6; 0:0,6:6; 3:0,6:6 (those numbers are the starting positions of each pair in the original strings; the length of each substring is 3, the length of "abc"). I would call it the Longest Substring Pair Sequence or LSPQ. (The Q is there so as not to confuse String and Sequence.)
Here is the LCSP for this example:
LCSP('abcabcabc', 'abcxxxabc') =
[ [ 6, 6, 3 ],
[ 6, 0, 3 ],
[ 3, 6, 3 ],
[ 0, 6, 3 ],
[ 3, 0, 3 ],
[ 0, 0, 3 ] ]
LSPQ(LCSP('abcabcabc', 'abcxxxabc'), 0, 0, 0) =
[ { a: 0, b: 0, size: 3 }, { a: 3, b: 6, size: 3 } ]
Right now I find it with brute force, recursively trying all combinations, so I am limited to about 25 pairs; beyond that it is impractical. Size=[10,15,20,25,26,30], Time ms = [0,15,300,1000,2000,19000]
Is there a way to do that in linear time, or at least not in quadratic complexity, so that a longer input LCSP (List of Common Substring Pairs) could be used?
This problem is similar to the "Longest Common Subsequence", but not exactly the same, because the input is not two strings but a list of common substrings sorted by their length. So I do not know where to look for existing solutions, or even whether they exist.
Here is my particular code (JavaScript):
function getChainSize(T) {
  var R = 0
  for (var i = 0; i < T.length; i++) R += T[i].size
  return R
}
function LSPQ(T, X, Y, id) {
  // X,Y are first unused character is str1,str2
  //id is current pair
  function findNextPossible() {
    var x = id
    while (x < T.length) {
      if (T[x][0] >= X && T[x][1] >= Y) return x
      x++
    }
    return -1
  }
  var id = findNextPossible()
  if (id < 0) return []
  var C = [{a:T[id][0], b:T[id][1], size:T[id][2] }]
  // with current
  var o = T[id]
  var A = C.concat(LSPQ(T, o[0]+o[2], o[1]+o[2], id+1))
  // without current
  var B = LSPQ(T, X, Y, id+1)
  if (getChainSize(A) < getChainSize(B)) return B
  return A
}
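One possible direction, offered only as a sketch and not taken from the post, is a dynamic program over the pairs sorted by their start in the first string. It is written in Python rather than JavaScript, the name longest_pair_chain is hypothetical, and it assumes a chain is valid when each pair starts at or after the end of the previous pair in both strings. It is quadratic in the number of pairs, so it does not meet the linear-time wish, but it avoids the exponential growth of trying all combinations.

```python
def longest_pair_chain(pairs):
    # pairs: iterable of (a, b, size) with size > 0
    order = sorted(pairs)                    # ascending by a, then b
    if not order:
        return []
    best = [size for _, _, size in order]    # best total size of a chain ending at i
    prev = [-1] * len(order)                 # predecessor index in that chain
    for i, (a_i, b_i, s_i) in enumerate(order):
        for j in range(i):
            a_j, b_j, s_j = order[j]
            fits = a_j + s_j <= a_i and b_j + s_j <= b_i
            if fits and best[j] + s_i > best[i]:
                best[i] = best[j] + s_i
                prev[i] = j
    # Walk back from the chain with the largest total size
    i = max(range(len(order)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(order[i])
        i = prev[i]
    return chain[::-1]

lcsp = [(6, 6, 3), (6, 0, 3), (3, 6, 3), (0, 6, 3), (3, 0, 3), (0, 0, 3)]
chain = longest_pair_chain(lcsp)
print(chain, sum(size for _, _, size in chain))
```

Sorting by the start in the first string guarantees that any valid predecessor of a pair appears before it, which is what lets a single left-to-right pass fill the table.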
