I am trying to learn Apache Spark. This is the code I am trying to run, using the PySpark API.
data = xrange(1, 10000)
xrangeRDD = sc.parallelize(data, 8)

def ten(value):
    """Return whether value is below ten.

    Args:
        value (int): A number.

    Returns:
        bool: Whether `value` is less than ten.
    """
    if (value < 10):
        return True
    else:
        return False

filtered = xrangeRDD.filter(ten)
print filtered.collect()
print filtered.take(8)
print filtered.collect() gives [1, 2, 3, 4, 5, 6, 7, 8, 9] as output.
As per my understanding, filtered.take(n) will take n elements from the RDD and print them.
I am trying two cases:
1) Giving a value of n less than or equal to the number of elements in the RDD
2) Giving a value of n greater than the number of elements in the RDD
I used the PySpark application UI to see the number of jobs that run in each case. In the first case only one job runs, but in the second, five jobs run.
I am not able to understand why this is happening. Thanks in advance.
RDD.take tries to evaluate as few partitions as possible.
If you take(9) it will fetch partition 0 (job 1) find 9 items and happily terminate.
If you take(10) it will fetch partition 0 (job 1) and find 9 items. It needs one more. Since partition 0 had 9, it thinks partition 1 will probably have at least one more (job 2). But it doesn't! In 2 partitions it has found 9 items. So 4.5 items per partition so far. The formula divides it by 1.5 for pessimism and decides 10 / (4.5 / 1.5) = 3 partitions will do it. So it fetches partition 2 (job 3). Still nothing. So 3 items per partition so far, divided by 1.5 means we need 10 / (3 / 1.5) = 5 partitions. It fetches partitions 3 and 4 (job 4). Nothing. We have 1.8 items per partition, 10 / (1.8 / 1.5) = 8. It fetches the last 3 partitions (job 5) and that's it.
The code for this algorithm is in RDD.scala. As you can see it's nothing but heuristics. It saves some work usually, but it can lead to unnecessarily many jobs in degenerate cases.
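To see the scan pattern concretely, here is a rough Python sketch of that loop (a simplification with made-up names, not Spark's actual code; the 1.5 factor mirrors the description above):

def simulate_take(n, matches_per_partition, scale=1.5):
    total = len(matches_per_partition)
    found = scanned = jobs = 0
    parts_to_try = 1
    while found < n and scanned < total:
        upto = min(scanned + parts_to_try, total)
        found += sum(matches_per_partition[scanned:upto])
        scanned = upto
        jobs += 1
        if found < n and scanned < total:
            if found > 0:
                # estimate how many partitions n items need, pessimistically
                needed = int(n / ((found / scanned) / scale))
                parts_to_try = max(needed - scanned, 1)
            else:
                parts_to_try = scanned * 4   # nothing found yet: grow aggressively
    return jobs

counts = [9, 0, 0, 0, 0, 0, 0, 0]   # partition 0 holds the 9 values below ten
print(simulate_take(9, counts))     # 1 job
print(simulate_take(10, counts))    # 5 jobs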
I am reading a dataframe from a JDBC source using partitioning as described here, using numPartitions, partitionColumn, upperBound, lowerBound. I've been using this quite often, but this time I noticed something weird. With numPartitions = 32 and 124 distinct partition column values, this split the data into 30 smaller chunks and 2 large ones:
Task 1 - partitions 1 .. 17 (17 values!)
Task 2 - partitions 18 .. 20 (3 values)
Task 3 - partitions 21 .. 23 (3 values)
Task 4 - partitions 24 .. 26 (3 values)
...
Task 30 - partitions 102 .. 104 (3 values)
Task 31 - partitions 105 .. 107 (3 values)
Task 32 - partitions 108 .. 124 (17 values!)
I'm just wondering whether this actually worked as expected, and what I can do to make it split into even chunks, apart from maybe experimenting with different values of numPartitions (note that the number of values can vary and I'm not always able to predict it).
I looked through the source code of JDBCRelation.scala and found out that this is exactly how it's implemented. It first calculates the stride as (upperBound - lowerBound) / numPartitions, which in my case is 124 / 32 = 3, and then the remaining values are allocated evenly to the first and last partitions.
I was a bit unlucky with the number of values, because if I had 4 more, then 128 / 32 = 4 and it would nicely align into 32 partitions of 4 values each.
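To make that arithmetic concrete, here's a rough sketch of the split (an illustration of the idea, not the JDBCRelation.scala source); with 124 values and 32 partitions it reproduces the 17 / 3 / ... / 3 / 17 pattern above:

num_values, num_partitions = 124, 32
stride = num_values // num_partitions              # 124 // 32 = 3
leftover = num_values - stride * num_partitions    # 124 - 96  = 28
# The 30 middle partitions get `stride` values each; the leftover is split
# between the first and the last partition.
sizes = ([stride + leftover // 2]
         + [stride] * (num_partitions - 2)
         + [stride + leftover - leftover // 2])
print(sizes[0], sizes[1], sizes[-1])   # 17 3 17
print(sum(sizes))                      # 124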
I ended up pre-querying the table for the exact range and then manually providing predicates:
val partRangeSql = "SELECT min(x), max(x) FROM table"
val (partMin, partMax) =
  spark.read.jdbc(jdbcUrl, s"($partRangeSql) _", props).as[(Int, Int)].head

val predicates = (partMin to partMax).map(p => s"x = $p").toArray
spark.read.jdbc(jdbcUrl, "table", predicates, props)
That makes it 124 partitions (one per value), so I need to be careful not to overload the database server (but I'm limiting the number of executors so that no more than 32 concurrent sessions are running).
I guess adjusting lowerBound/upperBound so that upperBound - lowerBound is a multiple of numPartitions would also do the trick.
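A quick sketch of that adjustment (hypothetical numbers matching the case above):

import math

lower, upper, num_partitions = 1, 124, 32
span = upper - lower
adjusted_upper = lower + math.ceil(span / num_partitions) * num_partitions
print(adjusted_upper)   # 129, so upperBound - lowerBound = 128, a multiple of 32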
I have a series of graph edges in the format src_node dst_node edge_weight. I want to read them from a .txt file and then use them for some algorithms. But before applying the algorithms, I need to clean up and group the data. I also want to use only RDDs.
The problem with the input data is that it contains loops. First I want to filter the loops out and then re-partition the RDD into 8 partitions. However, I also want access to each node's outgoing edges, etc., so for each node I need to store the set of (dst, weight) pairs. I also want good performance (I'm afraid that a repartition and a groupBy will cause two shuffles).
This is the code I have come up with so far, but I don't know how to finish it:
links = sc.textFile("./inputs/graph1.txt") \
    .map(lambda link: link.split(" ")) \
    .filter(lambda link: link[0] != link[1])
This is some sample input data:
1 2 5
1 3 2
1 1 2
2 3 1
3 1 2
3 2 4
What should I do?
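For what it's worth, here is a sketch of one possible continuation (only the names from the question are real; the rest is illustrative). groupByKey with numPartitions=8 partitions and groups in a single shuffle, so a separate repartition step shouldn't be needed:

links = (sc.textFile("./inputs/graph1.txt")
           .map(lambda line: line.split(" "))
           .filter(lambda e: e[0] != e[1])                    # drop self-loops
           .map(lambda e: (e[0], (e[1], int(e[2])))))         # (src, (dst, weight))

adjacency = links.groupByKey(numPartitions=8).mapValues(set)  # src -> {(dst, weight), ...}

print(adjacency.collect())
# e.g. [('1', {('2', 5), ('3', 2)}), ('2', {('3', 1)}), ('3', {('1', 2), ('2', 4)})]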
I am trying to find the capacity of a list with a function. One step involves subtracting 64 (on my machine) from the list's size in bytes and then dividing by 8 to get the capacity. What does this capacity value mean?
I tried reading the Python docs for the sys.getsizeof() method, but they still couldn't answer my doubts.
import sys
def disp(l1):
    print("Capacity", (sys.getsizeof(l1) - 64) // 8)  # What does this line mean, especially the //8 part?
    print("Length", len(l1))

mariya_list = []
mariya_list.append("Sugarisverysweetand it can be used for cooking sweets and also used in beverages ")
mariya_list.append("Choco")
mariya_list.append("bike")
disp(mariya_list)
print(mariya_list)
mariya_list.append("lemon")
print(mariya_list)
disp(mariya_list)
mariya_list.insert(1,"leomon Tea")
print(mariya_list)
disp(mariya_list)
Output:
Capacity 4
Length 1
['Choco']
['Choco', 'lemon']
Capacity 4
Length 2
['Choco', 'leomon Tea', 'lemon']
Capacity 4
Length 3
This is the output. I am unable to understand what capacity 4 means here. Why does it keep printing the same value, 4, even after I add more elements?
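Here is a small CPython-specific experiment along those lines (the 64 in the question is the size of an empty list on that machine; it can be 56 on other builds), just to show what the formula measures:

import sys

header = sys.getsizeof([])          # empty-list overhead: 64 on the question's machine
l = []
for i in range(10):
    l.append(i)
    capacity = (sys.getsizeof(l) - header) // 8   # 8 bytes per allocated pointer slot
    print(len(l), capacity)         # capacity grows in steps (e.g. 4, 8, 16, ...)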
I have the following df,
A id
[ObjectId('5abb6fab81c0')] 0
[ObjectId('5abb6fab81c3'),ObjectId('5abb6fab81c4')] 1
[ObjectId('5abb6fab81c2'),ObjectId('5abb6fab81c1')] 2
I'd like to flatten each list in A and assign its corresponding id to each element of the list, like:
A id
ObjectId('5abb6fab81c0') 0
ObjectId('5abb6fab81c3') 1
ObjectId('5abb6fab81c4') 1
ObjectId('5abb6fab81c2') 2
ObjectId('5abb6fab81c1') 2
I think the comment is coming from this question? You can use my original post or this one:
df.set_index('id').A.apply(pd.Series).stack().reset_index().drop('level_1',1)
Out[497]:
id 0
0 0 1.0
1 1 2.0
2 1 3.0
3 1 4.0
4 2 5.0
5 2 6.0
Or
pd.DataFrame({'id':df.id.repeat(df.A.str.len()),'A':df.A.sum()})
Out[498]:
A id
0 1 0
1 2 1
1 3 1
1 4 1
2 5 2
2 6 2
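For reference, here is a self-contained version of the first approach run on data shaped like the question's (plain strings stand in for the ObjectIds):

import pandas as pd

df = pd.DataFrame({
    'A': [['5abb6fab81c0'],
          ['5abb6fab81c3', '5abb6fab81c4'],
          ['5abb6fab81c2', '5abb6fab81c1']],
    'id': [0, 1, 2],
})

flat = (df.set_index('id').A
          .apply(pd.Series)          # one column per list position
          .stack()                   # long format; drops the NaN padding
          .reset_index()
          .drop('level_1', axis=1)
          .rename(columns={0: 'A'}))
print(flat)
#    id             A
# 0   0  5abb6fab81c0
# 1   1  5abb6fab81c3
# 2   1  5abb6fab81c4
# 3   2  5abb6fab81c2
# 4   2  5abb6fab81c1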
This probably isn't the most elegant solution, but it works. The idea here is to loop through df (which is why this is likely an inefficient solution), and then loop through each list in column A, appending each item and the id to new lists. Those two new lists are then turned into a new DataFrame.
a_list = []
id_list = []
for index, a, i in df.itertuples():
    for item in a:
        a_list.append(item)
        id_list.append(i)
df1 = pd.DataFrame(list(zip(a_list, id_list)), columns=['A', 'id'])
As I said, inelegant, but it gets the job done. There's probably at least one better way to optimize this, but hopefully it gets you moving forward.
EDIT (April 2, 2018)
I had the thought to run a timing comparison between mine and Wen's code, simply out of curiosity. The two variables are the length of column A, and the length of the list entries in column A. I ran a bunch of test cases, iterating by orders of magnitude each time. For example, I started with A length = 10 and ran through to 1,000,000, at each step iterating through randomized A entry list lengths of 1-10, 1-100 ... 1-1,000,000. I found the following:
Overall, my code is noticeably faster (especially at increasing A lengths) as long as the list lengths are less than ~1,000. As soon as the randomized list length hits the ~1,000 barrier, Wen's code takes over in speed. This was a huge surprise to me! I fully expected my code to lose every time.
Length of column A generally doesn't matter - it simply increases the overall execution time linearly. The only case in which it changed the results was for A length = 10. In that case, no matter the list length, my code ran faster (also strange to me).
Conclusion: If the list entries in A are on the order of a few hundred elements (or less) long, my code is the way to go. But if you're working with huge data sets, use Wen's! Also worth noting that as you hit the 1,000,000 barrier, both methods slow down drastically. I'm using a fairly powerful computer, and each was taking minutes by the end (it actually crashed on the A length = 1,000,000 and list length = 1,000,000 case).
Flattening and unflattening can be done using these functions:
def flatten(df, col):
    col_flat = pd.DataFrame(
        [[i, x] for i, y in df[col].apply(list).iteritems() for x in y],
        columns=['I', col])
    col_flat = col_flat.set_index('I')
    df = df.drop(col, 1)
    df = df.merge(col_flat, left_index=True, right_index=True)
    return df
Unflattening:
def unflatten(flat_df, col):
    return flat_df.groupby(level=0).agg({**{c: 'first' for c in flat_df.columns}, col: list})
After unflattening we get the same dataframe, except for column order:
(df.sort_index(axis=1) == unflatten(flatten(df, 'A'), 'A').sort_index(axis=1)).all().all()
>> True
To create unique index you can call reset_index after flattening
What is the best way in C++11 to perform a pairwise computation in multiple threads? What I mean is, I have a vector of elements, and I want to compute a function for each pair of distinct elements. The caveat is that I cannot use the same element in multiple threads at the same time, e.g. the elements have states that evolve during the computation, and the computation relies on that.
An easy way would be to group the pairs by offsets.
If v is a vector, then the elements N apart (mod v.size()) form two collections of pairs. Each of those collections of pairs contain no overlaps inside themselves.
Examine a 10 element vector 0 1 2 3 4 5 6 7 8 9. The pairs 1 apart are:
0 1, 1 2, 2 3, 3 4, 4 5, 5 6, 6 7, 7 8, 8 9, 9 0
if you split these by "parity" into two collections we get:
0 1, 2 3, 4 5, 6 7, 8 9
1 2, 3 4, 5 6, 7 8, 9 0
You can work, in parallel, on each of the above collections. When the collection is finished, sync up, then work on the next collection.
Similar tricks work for 2 apart.
0 2, 1 3, 4 6, 5 7
2 4, 3 5, 6 8, 7 9
with leftovers:
8 0, 9 1
For every offset from 1 to n/2 there are 2 "collections" and leftovers.
Here is offset of 4:
0 4, 1 5, 2 6, 3 7
4 8, 5 9, 6 0, 7 1
and leftovers
8 2, 9 3
(I naively think the size of leftovers is vector size mod offset)
Calculating these collections (and the leftovers) isn't hard; arranging to queue up threads and get the right tasks efficiently in the right threads is harder.
There are N choose 2, or (n^2 - n)/2, pairs. This split gives you about 1.5·n collections and leftovers, each of size at most n/2, and full parallelism within each collection.
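Just to illustrate the index bookkeeping, here is a quick sketch (in Python rather than C++, and rounds_for_offset is a made-up helper) that greedily splits one offset's pairs into conflict-free rounds; for offset 2 on 10 elements it reproduces the collections and leftovers above:

def rounds_for_offset(n, d):
    """Split the pairs (i, (i + d) % n) into rounds that share no elements."""
    pairs, seen = [], set()
    for i in range(n):
        p = tuple(sorted((i, (i + d) % n)))
        if p not in seen:                  # offset n/2 generates each pair twice
            seen.add(p)
            pairs.append(p)
    rounds = []                            # list of (used_elements, bucket)
    for a, b in pairs:
        for used, bucket in rounds:
            if a not in used and b not in used:
                used.update((a, b))
                bucket.append((a, b))
                break
        else:
            rounds.append(({a, b}, [(a, b)]))
    return [bucket for _, bucket in rounds]

n = 10
print(rounds_for_offset(n, 2))
# [[(0, 2), (1, 3), (4, 6), (5, 7)], [(2, 4), (3, 5), (6, 8), (7, 9)], [(0, 8), (1, 9)]]

# Offsets 1 .. n/2 together cover every unordered pair exactly once.
covered = [p for d in range(1, n // 2 + 1) for r in rounds_for_offset(n, d) for p in r]
assert len(covered) == len(set(covered)) == n * (n - 1) // 2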
If you have a situation where some elements are far more expensive than others, and thus waiting for each collection to finish idles threads too much, you could add fine-grained synchronization.
Maintain a vector of atomic bools. Use that to indicate that you are currently processing an element. Always "lock" (set to true, and check that it was false before you set it to true) the lower index one before the upper one.
If you manage to lock both, process away. Then clear them both.
If you fail, remember the task for later, and work on other tasks. When you have too many tasks queued, wait on a condition variable, trying to check and set the atomic bool you want to lock in the spin-lambda.
Periodically kick the condition variable when you clear the locks. How often you do this will depend on profiling. You can kick without acquiring the mutex mayhap (but you must sometimes acquire the mutex after clearing the bools to deal with a race condition that could starve a thread).
Queue the tasks in the order indicated by the above collection system, as that reduces the likelihood of threads colliding. But with this system, work can still progress even if there is one task that is falling behind.
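Here is a minimal sketch of the "lock the lower index first" rule (in Python, with non-blocking Lock.acquire standing in for the atomic bools; the requeueing and condition-variable machinery described above is left out):

import threading

elements = list(range(10))                    # placeholder data
busy = [threading.Lock() for _ in elements]   # one "in use" flag per element

def try_process(i, j, work):
    lo, hi = (i, j) if i < j else (j, i)
    # Always take the lower index first; a consistent order prevents deadlock.
    if busy[lo].acquire(blocking=False):
        if busy[hi].acquire(blocking=False):
            try:
                work(lo, hi)                  # both elements are exclusively ours
            finally:
                busy[hi].release()
                busy[lo].release()
            return True
        busy[lo].release()
    return False    # could not get both: keep the pair for later, try another task

print(try_process(7, 3, lambda a, b: print("processed", a, b)))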
It adds complexity and synchronization, which could easily make it slower than the pure collection/cohort one.