EXECUTE HIVE QUERIES IN TEZ AND SPARK
I have a Hive query running in Tez that takes 10 min.
I executed the same query in Spark using hivecontext.sql and it takes 13-14 min.
Hardware Info
Spark: 1.5.2
Cluster: YARN (client mode)
Nodes: 11, Memory: 192 GB, Cores: 24
To Achieve
I would like to execute it in Spark in less than 5 min without changing my query.
My Findings
| Executors | Driver Memory | Executor Memory | Cores | Start    | End      | Difference |
|-----------|---------------|-----------------|-------|----------|----------|------------|
| 4         | 6             | 15              | 6     | 17:12:15 | 17:26:03 | 0:13:48    |
| 4         | 6             | 15              | 10    | 17:37:55 | 17:49:24 | 0:11:29    |
| 4         | 6             | 7               | 10    | 17:53:40 | 18:07:12 | 0:13:32    |
| 1         | 6             | 7               | 4     | 21:54:10 | 22:08:16 | 0:14:06    |
| 1         | 6             | 7               | 6     | 23:44:15 | 23:57:25 | 0:13:10    |
| 3         | N/A           | 60              | 5     | 11:12:49 | 11:28:58 | 0:16:09    |
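(For reference, the sketch below shows how the settings in the first row would presumably be applied in PySpark; it is an illustration assumed from the table above, not the original submission.)

# Sketch (assumed, not from the original post): mapping the tuning knobs from the
# table above onto Spark 1.5 configuration when building the context in PySpark.
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setMaster("yarn-client")              # YARN in client mode
        .set("spark.executor.instances", "4")  # executors
        .set("spark.executor.memory", "15g")   # executor memory
        .set("spark.executor.cores", "6"))     # cores per executor
# Note: spark.driver.memory (6g in the table) cannot be set programmatically in
# client mode; it has to go through spark-submit --driver-memory or spark-defaults.conf.

sc = SparkContext(conf=conf)
hiveContext = HiveContext(sc)
# hiveContext.sql("...")  # the query itself stays unchanged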
However I set these values, the execution time is **not reduced**.
Can someone please let me know how to tune this?
This is because Spark 1.5.2 did not use the Tungsten engine.
Check out the discussion here.
I am trying to set up a loop that will run through a list of power readings and sort it into two separate lists based on the minimum temperature recorded that night (checking for air-con use), but I can't seem to get the list to sort.
Pwr_Usd = [0, 9, 13, 11, 13, 8.5, 12.5, 14, 10, 8.5, 7.5, 21, 7, 8, 12, 17, 6, 12, 11.25, 12.75, 8]
Temp_Min = [13.8, 14.1, 13.7, 16.9]
Ovr_Usd = Pwr_Usd[1::4]  # Overnight usage between 12am and 6am
## Air Conditioning
AirCon = []
## Standby Power
StbyPwr = []
for i in Ovr_Usd:
    if i > 16 in Temp_Min:
        AirCon.append(i)
    else:
        StbyPwr.append(i)

print(AirCon)
print(StbyPwr)
In a perfect world I would end up with two lists:
AirCon = [12]
StbyPwr = [9, 8.5, 8.5, 8]
but I am open to any suggestion that would let me run a temperature list against a power list and sort out the nights where it's too cold for air-con power usage to be a factor. It is worth noting that all the lists are organised sequentially over time, with Temp_Min covering a week and each Pwr_Usd reading being 6 hours apart, hence why Ovr_Usd only selects every 4th value for comparison.
For sorting lists in Python, try Python's .sort() list method:
pwr_usd = [0, 9, 13, 11, 13, 8.5, 12.5, 14, 10, 8.5, 7.5, 21, 7, 8, 12, 17, 6, 12, 11.25, 12.75, 8]
temp_min = [13.8, 14.1, 13.7, 16.9]
pwr_usd.sort()
temp_min.sort()
print(pwr_usd)
print(temp_min)
and the output will be:
[0, 6, 7, 7.5, 8, 8, 8.5, 8.5, 9, 10, 11, 11.25, 12, 12, 12.5, 12.75, 13, 13, 14, 17, 21]
[13.7, 13.8, 14.1, 16.9]
and then do your desired separation and classification using a list comprehension or good old for loops. In the case of a list comprehension:
air_con = [i for i in pwr_usd if i > 16 and i in temp_min]
Though this returns an empty list when I run it, because none of the values in pwr_usd are actually in temp_min (they have to be EXACTLY the same).
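If the goal is the split described in the question, one option is to pair each overnight reading with that night's minimum temperature. The sketch below is only an illustration: it assumes ovr_usd[i] and temp_min[i] refer to the same night and reuses the 16-degree threshold from the question's own condition; with the sample data the lists have different lengths (5 vs. 4), so zip() stops at the shorter one.

pwr_usd = [0, 9, 13, 11, 13, 8.5, 12.5, 14, 10, 8.5, 7.5, 21, 7, 8, 12, 17, 6, 12, 11.25, 12.75, 8]
temp_min = [13.8, 14.1, 13.7, 16.9]
ovr_usd = pwr_usd[1::4]  # overnight usage between 12am and 6am

# warm nights (min temp above 16): aircon may be a factor
air_con = [p for p, t in zip(ovr_usd, temp_min) if t > 16]
# cold nights: standby power only
stby_pwr = [p for p, t in zip(ovr_usd, temp_min) if t <= 16]

print(air_con)
print(stby_pwr)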
SIDENOTE: it is good practice to follow naming conventions for variables; for example, instead of Pwr_Usd, go with pwr_usd.
I have a custom keras.utils.Sequence which generates batches in a specific (and critical) order.
However, I need to parallelise batch generation across multiple cores. Does the name 'OrderedEnqueuer' imply that the order of batches in the resulting queue is guaranteed to be the same as the order of the original keras.utils.Sequence?
My reasons for thinking that this order is not guaranteed:
OrderedEnqueuer uses python multiprocessing's apply_async internally.
Keras' docs explicitly say that OrderedEnqueuer is guaranteed not to duplicate batches - but not that the order is guaranteed.
My reasons for thinking that it is:
The name!
I understand that keras.utils.Sequence objects are indexable.
I found test scripts on Keras' github which appear to be designed to verify order - although I could not find any documentation about whether these were passed, or whether they are truly conclusive.
If the order here is not guaranteed, I would welcome any suggestions on how to parallelise batch preparation while maintaining a guaranteed order, with the proviso that it must be able to parallelise arbitrary Python code - I believe e.g. the tf.data.Dataset API does not allow this (tf.py_function calls back to the original Python process).
Yes, it's ordered.
Check it yourself with the following test.
First, let's create a dummy Sequence that returns just the batch index after waiting a random time (the random wait is to ensure that the batches will not finish in order):
import time, random, datetime
import numpy as np
import tensorflow as tf

class DataLoader(tf.keras.utils.Sequence):
    def __len__(self):
        return 10

    def __getitem__(self, i):
        time.sleep(random.randint(1, 2))
        # you could add a print here to see that items finish out of order
        return i
Now let's create a test function that creates the enqueuer and uses it.
The function takes the number of workers and prints the time taken as well as the results as returned.
def test(workers):
    enq = tf.keras.utils.OrderedEnqueuer(DataLoader())
    enq.start(workers=workers)
    gen = enq.get()

    results = []
    start = datetime.datetime.now()
    for i in range(30):
        results.append(next(gen))
    enq.stop()

    print('test with', workers, 'workers took', datetime.datetime.now() - start)
    print("results:", results)
Results:
test(1)
test(8)
test with 1 workers took 0:00:45.093122
results: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
test with 8 workers took 0:00:09.127771
results: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Notice that:
8 workers is way faster than 1 worker -> it is parallelizing ok
results are ordered for both cases
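One note for the multi-core case: OrderedEnqueuer works with threads by default; for true process-based workers you can pass use_multiprocessing=True when constructing it. A minimal variation of the test above (a sketch, reusing the same DataLoader; with multiprocessing the Sequence must be picklable):

def test_mp(workers):
    # Same test, but with process-based workers instead of threads.
    enq = tf.keras.utils.OrderedEnqueuer(DataLoader(), use_multiprocessing=True)
    enq.start(workers=workers)
    gen = enq.get()

    results = [next(gen) for _ in range(30)]
    enq.stop()
    print("results:", results)  # expected to stay in order, as in the threaded case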
It seems one of my assumptions was incorrect regarding order in RDDs (related).
Suppose I wish to repartition an RDD after having sorted it.
import random

l = list(range(20))
random.shuffle(l)

spark.sparkContext\
    .parallelize(l)\
    .sortBy(lambda x: x)\
    .repartition(3)\
    .collect()
Which yields:
[16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
As we can see, the order is preserved within a partition but the total order is not preserved over all partitions.
I would like to preserve total order of the RDD, like so:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
I am having difficulty finding anything online which could be of assistance. Help would be appreciated.
It appears that we can provide the argument numPartitions=partitions to the sortBy function to partition the RDD and preserve total order:
import random

l = list(range(20))
random.shuffle(l)
partitions = 3

spark.sparkContext\
    .parallelize(l)\
    .sortBy(lambda x: x, numPartitions=partitions)\
    .collect()
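To confirm that the order holds across partition boundaries as well as within them, the partitions can be inspected with glom() (a quick sketch; the exact boundaries are chosen by Spark's range partitioner and may vary):

parts = spark.sparkContext\
    .parallelize(l)\
    .sortBy(lambda x: x, numPartitions=partitions)\
    .glom()\
    .collect()
print(parts)
# e.g. [[0, 1, ..., 6], [7, ..., 13], [14, ..., 19]] - concatenating the partitions
# in order gives back the fully sorted list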
I want to split an RDD into n parts of equal length using PySpark.
If the RDD is something like
data = range(0,20)
d_rdd = sc.parallelize(data)
d_rdd.glom().collect()
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
I want sets of any two random numbers together, like:
[[0,4],[6,11],[5,18],[3,14],[17,9],[12,8],[2,10],[1,15],[13,19],[7,16]]
Two methods:
set the number of partitions when calling parallelize, and use the distinct() function
data = range(0,20)
d_rdd = sc.parallelize(data, 10).distinct()
d_rdd.glom().collect()
using repartition() and distinct()
data = range(0,20)
d_rdd = sc.parallelize(data).repartition(10).distinct()
d_rdd.glom().collect()
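A quick way to sanity-check either method is to look at the resulting partition sizes (a sketch; how the values get paired comes from hash partitioning inside distinct(), so the exact pairs may differ from run to run):

sizes = d_rdd.glom().map(len).collect()
print(len(sizes), sizes)  # expecting 10 partitions with 2 elements each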
In Apache Spark there is an RDD.top() API, which can return the top k elements from an RDD. I'd like to know how this operation is implemented. Does it first sort the RDD and then return the top k values? Or does it use some other, more efficient implementation?
No, it doesn't sort the whole RDD; that operation would be too expensive.
It will rather select the top N elements from each partition separately using a priority queue, and then these queues are merged together in the reduce operation. That means only a small part of the whole RDD is shuffled across the network.
See RDD.scala for more details.
Example:
3 input partitions, RDD.top(2):

[3, 5, 7, 10]   [8, 6, 4, 12]   [9, 1, 2, 11]
      ||              ||              ||
   [10, 7]         [12, 8]         [11, 9]
================== reduce ==================
                 [12, 11]
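The same example can be reproduced quickly in PySpark (a small sketch, assuming a live SparkContext named sc):

# top() returns the k largest elements in descending order without a full sort
rdd = sc.parallelize([3, 5, 7, 10, 8, 6, 4, 12, 9, 1, 2, 11], 3)  # 3 partitions
print(rdd.top(2))  # [12, 11]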