Why do I see a spike of steps per second in TensorFlow training initially? - multithreading

Hi tensorflow experts,
I see the following training speed profile using the Dataset API and prefetching of 128, 256, 512, or 1024 batches (each of 128 examples):
INFO:tensorflow:Saving checkpoints for 0 into
INFO:tensorflow:loss = 0.969178, step = 0
INFO:tensorflow:global_step/sec: 70.3812
INFO:tensorflow:loss = 0.65544295, step = 100 (1.422 sec)
INFO:tensorflow:global_step/sec: 178.33
INFO:tensorflow:loss = 0.47716027, step = 200 (0.560 sec)
INFO:tensorflow:global_step/sec: 178.626
INFO:tensorflow:loss = 0.53073615, step = 300 (0.560 sec)
INFO:tensorflow:global_step/sec: 132.039
INFO:tensorflow:loss = 0.4849593, step = 400 (0.757 sec)
INFO:tensorflow:global_step/sec: 121.437
INFO:tensorflow:loss = 0.4055175, step = 500 (0.825 sec)
INFO:tensorflow:global_step/sec: 122.379
INFO:tensorflow:loss = 0.28230205, step = 600 (0.817 sec)
INFO:tensorflow:global_step/sec: 122.163
INFO:tensorflow:loss = 0.4917924, step = 700 (0.819 sec)
INFO:tensorflow:global_step/sec: 122.509
The initial spike of 178 steps per second is reproducible across multiple runs and different prefetching amounts. I am trying to understand the underlying multi-threading mechanism behind why that happens.
Additional information:
My CPU usage peaks at 1800% on a 48-core machine, and my GPU usage is consistently at only 9%, so neither of them is exhausted. I am wondering whether the mutex in queue_runner is preventing the CPU from realizing its full potential, as described here?
Thanks,
John
[update] I also observed the same spike when I use prefetch_to_device(gpu_device, ..), with similar buffer sizes. Surprisingly, prefetch_to_device only slows things down, by about 10%.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into
INFO:tensorflow:loss = 1.3881096, step = 0
INFO:tensorflow:global_step/sec: 52.3374
INFO:tensorflow:loss = 0.48779136, step = 100 (1.910 sec)
INFO:tensorflow:global_step/sec: 121.154
INFO:tensorflow:loss = 0.3451385, step = 200 (0.827 sec)
INFO:tensorflow:global_step/sec: 89.3222
INFO:tensorflow:loss = 0.37804496, step = 300 (1.119 sec)
INFO:tensorflow:global_step/sec: 80.4857
INFO:tensorflow:loss = 0.49938473, step = 400 (1.242 sec)
INFO:tensorflow:global_step/sec: 79.1798
INFO:tensorflow:loss = 0.5120025, step = 500 (1.263 sec)
INFO:tensorflow:global_step/sec: 81.2081

It's common to see a spike in steps per second at the start of a training run, because the CPU has had time to fill up the prefetch buffer before training begins. Your steps per second after the spike are very reasonable, but the low CPU usage might indicate a bottleneck.
The first question is whether you are using the Dataset API in combination with the Estimator class. From your terminal output I suspect you are; if not, I would start by changing your code to use the Estimator class. If you are already using it, then make sure you are following the best performance practices as documented here.
If you are doing all of the above already, then there is a bottleneck in your pipeline. Given the low CPU usage, I would guess you are experiencing an I/O bottleneck: your dataset may be on a slow medium (a hard drive), or you aren't using a serialized format and are saturating the IOPS (again, hard drive or network storage). In either case, start by using a serialized data format such as TFRecords, and upgrade your storage to an SSD or to multiple hard drives in RAID 0, 1, or 10, your pick.
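For reference, here is a minimal sketch of what such a Dataset + Estimator input pipeline could look like, assuming TFRecord files; tfrecord_files, parse_fn, and estimator are placeholders for your own file list, example decoder, and Estimator, not names from the question:

import tensorflow as tf

def input_fn():
    # tfrecord_files is a list of TFRecord file paths; parse_fn decodes one serialized example.
    dataset = tf.data.TFRecordDataset(tfrecord_files)
    dataset = dataset.map(parse_fn, num_parallel_calls=8)  # parse on several CPU threads
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(128)     # batch size from the question
    dataset = dataset.prefetch(512)  # one of the prefetch depths from the question
    return dataset

# estimator is your tf.estimator.Estimator instance
estimator.train(input_fn=input_fn)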

Related

What is openCostInBytes?

Can someone explain openCostInBytes in Apache Spark to me? I can see the definition in the documentation, but I don't understand how exactly it affects reading files. Should I really care about this, and if so, how should I tune it?
spark.sql.files.openCostInBytes affects how many partitions the input data will be read into. The exact calculation can be found in FilePartition.scala.
As it exists at the time of this answer, the calculation is the following:
def maxSplitBytes(
    sparkSession: SparkSession,
    selectedPartitions: Seq[PartitionDirectory]): Long = {
  val defaultMaxSplitBytes = sparkSession.sessionState.conf.filesMaxPartitionBytes
  val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
  val minPartitionNum = sparkSession.sessionState.conf.filesMinPartitionNum
    .getOrElse(sparkSession.leafNodeDefaultParallelism)
  val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
  val bytesPerCore = totalBytes / minPartitionNum

  Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
}
So that last line is the interesting one. We take the minimum of:
defaultMaxSplitBytes, which comes from spark.sql.files.maxPartitionBytes and is by default 128 * 1024 * 1024
the max of:
openCostInBytes, which comes from spark.sql.files.openCostInBytes and is by default 4 * 1024 * 1024 (4 MB)
bytesPerCore, which is totalBytes / minPartitionNum. In the default case minPartitionNum comes from spark.default.parallelism, which is equal to your total number of cores
So now we know this, we can try to understand the 3 edge cases of this calculation (taking into account default values of the parameters):
If the result is the value of defaultMaxSplitBytes, this is because bytesPerCore is larger than the other values. This only happens when we're handling big files: so big that if we split the data fairly over all the cores, each share would still be bigger than defaultMaxSplitBytes. So here we are limiting the size of each partition.
If the result is the value of bytesPerCore, then that means that it was smaller than 128MB but larger than 4MB. In this case, we are fairly splitting the data over all of the cores.
If the result is the value of openCostInBytes, then that means bytesPerCore was so small that it fell below 4MB. Since opening each partition has a cost, we want to limit the number of partitions that get created, so in this case we are limiting the number of partitions created.
From this we see that the value only has an effect if your data is small relative to your cluster (i.e. if your data size divided by the number of cores in your cluster is small).
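To make the three cases concrete, here is a quick back-of-the-envelope sketch in Python that mirrors the formula above with the default values and an assumed 10-core cluster (ignoring, for simplicity, the per-file openCostInBytes that gets added into totalBytes):

def max_split_bytes(total_bytes, cores=10,
                    default_max=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes default
                    open_cost=4 * 1024 * 1024):     # spark.sql.files.openCostInBytes default
    bytes_per_core = total_bytes // cores
    return min(default_max, max(open_cost, bytes_per_core))

print(max_split_bytes(10 * 1024**3))   # 10 GB total  -> 134217728 (capped at defaultMaxSplitBytes)
print(max_split_bytes(500 * 1024**2))  # 500 MB total -> 52428800  (fair split: 50 MB per core)
print(max_split_bytes(10 * 1024**2))   # 10 MB total  -> 4194304   (floored at openCostInBytes)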
Hope this helps!

Resolving performance issue with skewed partitions in PySpark Window function

I am attempting to calculate some moving averages in Spark, but am running into issues with skewed partitions. Here is the simple calculation I'm trying to perform:
Getting the base data
from pyspark.sql import functions as F, Window

# Variables
one_min = 60
one_hour = 60 * one_min
one_day = 24 * one_hour
seven_days = 7 * one_day
thirty_days = 30 * one_day

# Column variables
target_col = "target"
partition_col = "partition_col"

df_base = (
    spark
    .sql("SELECT * FROM {base}".format(base=base_table))
)

df_product1 = (
    df_base
    .where(F.col("product_id") == F.lit("1"))
    .select(
        F.col(target_col).astype("double").alias(target_col),
        F.unix_timestamp("txn_timestamp").alias("window_time"),
        "transaction_id",
        partition_col
    )
)

df_product1.persist()
Calculating running averages
window_lengths = {
    "1day": one_day,
    "7day": seven_days,
    "30day": thirty_days
}

# Create window specs for each type
part_windows = {
    time: Window.partitionBy(F.col(partition_col))
                .orderBy(F.col("window_time").asc())
                .rangeBetween(-secs, -one_min)
    for (time, secs) in window_lengths.items()
}

cols = [
    # Note: not using `avg` as I will be smoothing this at some point
    (F.sum(target_col).over(win) / F.count("*").over(win)).alias(
        "{time}_avg_target".format(time=time)
    )
    for time, win in part_windows.items()
]

sample_df = (
    df_product1
    .repartition(2000, partition_col)
    .sortWithinPartitions(F.col("window_time").asc())
    .select(
        "*",
        *cols
    )
)
Now, I can collect a limited subset of these data (say just 100 rows), but if I try to run the full query and, for example, aggregate the running averages, Spark gets stuck on some particularly large partitions. The vast majority of the partitions have fewer than 1 million records in them; only about 50 of them have more than 1M records, and only about 150 have more than 500K.
However, a small handful (~10) have more than 2.5M, and 3 of them have more than 5M records. These partitions have run for more than 12 hours and failed to complete. The skew in these partitions is a natural part of the data, reflecting larger activity for those particular values of the partitioning column, and I have no control over how the values of this partitioning column are defined.
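For reference, the per-key counts described above can be checked with something like the following (a sketch that assumes the df_product1 and partition_col defined earlier):

# Count records per value of the partitioning column to see the skew.
key_counts = (
    df_product1
    .groupBy(partition_col)
    .count()
    .orderBy(F.col("count").desc())
)
key_counts.show(20)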
I am using a SparkSession with dynamic allocation enabled, with 32G of RAM and 4 cores per executor and a minimum of 4 executors. I have attempted to increase this to 96G of RAM and 8 cores per executor with a minimum of 10 executors, but the job still does not complete.
This seems like a calculation which shouldn't take 13 hours to complete. The df_product1 DataFrame contains just shy of 300M records.
If there is other information that would be helpful in resolving this problem, please comment below.

Time Optimization for pandas dataframe reconstruction (random to fixed sampling)

I am very new to Python and pandas, and my limited experience has led me to an inefficient solution that makes my code too slow.
I have some data corresponding to stock market prices.
Sampling is random, at nanosecond resolution.
What I am trying to achieve is to transform it into a new dataset with a fixed sampling rate.
I am transforming my dataset as follows:
I'm setting a time_delta as a static time step of 0.5 seconds
I'm dropping records corresponding to the same nanosecond
I'm generating timestamps from my start_time to the calculated end_time
I'm iterating through my original dataframe, copying (and duplicating when needed) the last known record in my time_delta for each step into a new dataframe.
I believe my issue is that I am appending the records one by one to my new dataframe; however, I haven't been able to figure out a way to optimize my code using pandas built-ins.
Runtime is currently ~4min for a day's data (turning ~30K samples to 57600) when executing on Google Colab.
I've also tested locally without any improvement.
import numpy as np
from tqdm import tqdm

# ====================================================================
# Rate Re-Definition
# ====================================================================
SAMPLES_PER_SECOND = 2
dt = 1000000000 / SAMPLES_PER_SECOND  # Time delta in nanoseconds
SECONDS_IN_WORK_DAY = 28800  # 60 seconds * 60 minutes * 8 hours
TOTAL_SAMPLES = SECONDS_IN_WORK_DAY * SAMPLES_PER_SECOND
SAMPLING_PERIOD = dt * TOTAL_SAMPLES

# ceil_to_minute is a user-defined helper (not shown)
start_of_day_timestamp = ceil_to_minute(df['TimeStamp'].iloc[0])
end_of_day_timestamp = start_of_day_timestamp + SAMPLING_PERIOD
fixed_timestamps = np.arange(start_of_day_timestamp,
                             end_of_day_timestamp,
                             dt,
                             dtype=np.uint64)

# ====================================================================
# Drop records corresponding to the same timestamps
# ====================================================================
df1 = df.drop_duplicates(subset='TimeStamp', keep="last")

# ====================================================================
# Construct new dataframe
# ====================================================================
df2 = df1.iloc[0:1]
index_bounds_limit = df1.shape[0] - 1
index = 0
for i in tqdm(range(1, TOTAL_SAMPLES), desc="Constructing fixed sampling rate records... "):
    while index < index_bounds_limit and df1['TimeStamp'].iloc[index] < fixed_timestamps[i]:
        index += 1
    df2 = df2.append(df1.iloc[index], ignore_index=True)
df2['TimeStamp'] = fixed_timestamps
I need to reduce the time as much as possible (while maintaining readability/maintainability, no need to use "hacks").
I would appreciate any help and pointers in the right direction.
Thanks in advance
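One pandas built-in that appears to match what the loop above is doing is pd.merge_asof. Below is a rough sketch, not a drop-in replacement, reusing the df1 and fixed_timestamps from the question (and assuming the TimeStamp dtypes on both sides are compatible):

import pandas as pd

# One row per fixed timestamp; merge_asof requires both sides to be sorted on the key.
grid = pd.DataFrame({"TimeStamp": fixed_timestamps})

# direction="forward" picks the first record at or after each fixed timestamp,
# mirroring the while-loop above; direction="backward" would take the last
# known record before each step instead.
df2 = pd.merge_asof(grid, df1.sort_values("TimeStamp"),
                    on="TimeStamp", direction="forward")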

How to avoid sudden spikes in batch size in Spark streaming?

I am streaming data from Kafka and trying to limit the number of events per batch to 10 events. After processing 10-15 batches, there is a sudden spike in the batch size. Below are my settings:
spark.streaming.kafka.maxRatePerPartition=1
spark.streaming.backpressure.enabled=true
spark.streaming.backpressure.pid.minRate=1
spark.streaming.receiver.maxRate=2
Please check this image for the streaming behavior
This is a bug in Spark; please refer to: https://issues.apache.org/jira/browse/SPARK-18371
The pull request isn't merged yet, but you may pick it up and build Spark on your own.
To summarize the issue:
If you have the spark.streaming.backpressure.pid.minRate set to a number <= partition count, then an effective rate of 0 is calculated:
val totalLag = lagPerPartition.values.sum
...
val backpressureRate = Math.round(lag / totalLag.toFloat * rate)
...
(the second line calculates the rate per partition, where rate is the rate coming from the PID controller and defaults to minRate whenever the PID controller calculates something smaller)
As here: DirectKafkaInputDStream code
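To make the arithmetic concrete, here is a rough sketch of that calculation in Python, assuming 32 partitions with evenly distributed lag and the PID rate clamped to minRate = 1 (these numbers are assumptions, not from the question):

# rate is the value coming from the PID controller, clamped to
# spark.streaming.backpressure.pid.minRate (1 in the settings above).
rate = 1
num_partitions = 32
lag_per_partition = 1000                       # arbitrary, evenly distributed lag
total_lag = num_partitions * lag_per_partition

backpressure_rate = round(lag_per_partition / total_lag * rate)
print(backpressure_rate)                       # 0 -> the per-partition limits sum to 0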
When this results in 0, it causes a fallback to the (unreasonable) heads of the partitions:
...
if (effectiveRateLimitPerPartition.values.sum > 0) {
  val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
  Some(effectiveRateLimitPerPartition.map {
    case (tp, limit) => tp -> (secsPerBatch * limit).toLong
  })
} else {
  None
}
...
maxMessagesPerPartition(offsets).map { mmp =>
  mmp.map { case (tp, messages) =>
    val lo = leaderOffsets(tp)
    tp -> lo.copy(offset = Math.min(currentOffsets(tp) + messages, lo.offset))
  }
}.getOrElse(leaderOffsets)
As in DirectKafkaInputDStream#clamp
This makes backpressure essentially not work whenever your actual and minimum receive rate in messages per partition is smaller than or roughly equal to the partition count and you experience significant lag (e.g. messages come in spikes while your processing capacity stays constant).

Why is reporting the log perplexity of an LDA model so slow in Spark mllib?

I am fitting an LDA model in Spark mllib using the OnlineLDAOptimizer. It takes only ~200 seconds to fit 10 topics on 9M documents (tweets).
val numTopics = 10
val lda = new LDA()
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(math.min(1.0, mbf)))
  .setK(numTopics)
  .setMaxIterations(2)
  .setDocConcentration(-1)   // use default symmetric document-topic prior
  .setTopicConcentration(-1) // use default symmetric topic-word prior

val startTime = System.nanoTime()
val ldaModel = lda.run(countVectors)

/**
 * Print results
 */
// Print training time
println(s"Finished training LDA model. Summary:")
println(s"Training time (sec)\t$elapsed")
println(s"==========")
numTopics: Int = 10
lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@72678a91
startTime: Long = 11889875112618
ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@351e2b4c
Finished training LDA model. Summary:
Training time (sec) 202.640775542
However when I request the log perplexity of this model (looks like I need to cast it back to LocalLDAModel first), it takes a very long time to evaluate. Why? (I'm trying to get the log perplexity out so I can optimize k, the # of topics).
ldaModel.asInstanceOf[LocalLDAModel].logPerplexity(countVectors)
res95: Double = 7.006006572908673
Took 1212 seconds.
In general, calculating the perplexity is not a straightforward matter:
https://stats.stackexchange.com/questions/18167/how-to-calculate-perplexity-of-a-holdout-with-latent-dirichlet-allocation
Also setting the number of topics by only looking at perplexity might not be the right approach: https://www.quora.com/What-are-good-ways-of-evaluating-the-topics-generated-by-running-LDA-on-a-corpus
LDA models learned with the online optimizer are of type LocalLDAModel anyway, so there is no conversion happening. I have calculated perplexity on both local and distributed models: they both take quite some time. Looking at the code, they have nested map calls over the whole dataset.
Calling:
docBound += count * LDAUtils.logSumExp(Elogthetad + localElogbeta(idx, ::).t)
9M * (number of nonzero BOW entries) times can take quite some time. The code is from:
https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala line 312
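As a rough order-of-magnitude check in Python (the average number of distinct tokens per tweet is an assumption, not a figure from the question):

n_docs = 9000000                # corpus size from the question
avg_nonzero_terms = 10          # assumed average number of distinct tokens per tweet
k = 10                          # number of topics
print(n_docs * avg_nonzero_terms)      # ~90M logSumExp calls in total...
print(n_docs * avg_nonzero_terms * k)  # ...each over a k-dimensional topic vector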
Training LDA is fast in your case because you train for just 2 iterations with 9m/mbf update calls.
Btw. the default for docConcentration is Vectors.dense(-1) and not just an Int.
Btw. number 2: thanks for this question; I had trouble running my algorithm on a cluster just because I had this stupid perplexity calculation in it and didn't know it caused so much trouble.
