Rate limit with Apache Spark GCS connector

I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and get a lot of "rate limit" errors, as follows:
java.io.IOException: Error inserting: bucket: *****, object: *****
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1600)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:475)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 429 Too Many Requests
{
"code" : 429,
"errors" : [ {
"domain" : "usageLimits",
"message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
"reason" : "rateLimitExceeded"
} ],
"message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:472)
... 3 more
Does anyone know a solution for this?
Is there a way to control the read/write rate of Spark?
Is there a way to increase the rate limit for my Google Project?
Is there a way to use the local hard disk for temp files that don't have to be shared with other slaves?
Thanks!

Unfortunately, using GCS as the DEFAULT_FS can produce high rates of directory-object creation, whether GCS holds only intermediate directories or the final input/output directories. When GCS is the final output directory in particular, it's difficult to apply any Spark-side workaround that reduces the rate of redundant directory-creation requests.
The good news is that most of these directory requests are indeed redundant: the system is used to being able to do the equivalent of "mkdir -p" and cheaply return true if the directory already exists. In our case, it's possible to fix this on the GCS-connector side by catching these errors and then checking whether the directory was in fact created by some other worker in a race condition.
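To illustrate the idea of catching the error and then checking for the directory, here is a minimal Python sketch. It is not the connector's actual code (that lives in the linked Java commit); RateLimitError, FakeStore, create_dir and exists are all hypothetical names made up for this example.

```python
class RateLimitError(Exception):
    """Hypothetical stand-in for a 429 rateLimitExceeded response."""


def ensure_directory(store, path):
    """Create `path`, treating a rate-limit error as success if the directory exists.

    `store` is any object with create_dir(path) and exists(path); both
    method names are assumptions for this sketch.
    """
    try:
        store.create_dir(path)
        return True
    except RateLimitError:
        # Another worker may have won the race; if the directory is there,
        # the request was redundant and can be treated as a success.
        if store.exists(path):
            return True
        raise


class FakeStore:
    """In-memory store that rejects any repeated create for the same path."""

    def __init__(self):
        self.dirs = set()

    def create_dir(self, path):
        if path in self.dirs:
            raise RateLimitError(path)
        self.dirs.add(path)

    def exists(self, path):
        return path in self.dirs
```

Calling ensure_directory twice for the same path on a FakeStore succeeds both times, which mimics the cheap "mkdir -p" behaviour that callers expect. An error for a directory that really does not exist is still re-raised.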
This should be fixed now with https://github.com/GoogleCloudPlatform/bigdata-interop/commit/141b1efab9ef23b6b5f5910d8206fcbc228d2ed7
To test, just run:
git clone https://github.com/GoogleCloudPlatform/bigdata-interop.git
cd bigdata-interop
mvn -P hadoop1 package
# Or for Hadoop 2
mvn -P hadoop2 package
You should then find the files "gcs/target/gcs-connector-*-shaded.jar" available for use. To plug one into bdutil, run gsutil cp gcs/target/gcs-connector-*-shaded.jar gs://<your-bucket>/some-path/ and then edit bdutil/bdutil_env.sh for Hadoop 1 or bdutil/hadoop2_env.sh for Hadoop 2 to change:
GCS_CONNECTOR_JAR='https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.4.1-hadoop2.jar'
Point it at your gs://<your-bucket>/some-path/ path instead; bdutil automatically detects that you're using a gs://-prefixed URI and will do the right thing during deployment.
Please let us know if it fixes the issue for you!

Have you tried setting the spark.local.dir config parameter and attaching a disk (preferably an SSD) for that tmp space to your Google Compute Engine instances?
https://spark.apache.org/docs/1.2.0/configuration.html
You cannot change the rate limit for your project; what you would have to use instead is a back-off algorithm once the limit is reached. Since you mentioned that most of the reads/writes are for tmp files, try configuring Spark to use local disks for those.
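For what a back-off algorithm might look like, here is a rough Python sketch of exponential back-off with jitter. The function name, the parameters and the is_rate_limited predicate are my own assumptions for illustration, not part of Spark or the GCS connector.

```python
import random
import time


def retry_with_backoff(operation, is_rate_limited, max_attempts=5,
                       base_delay=1.0, max_delay=32.0, sleep=time.sleep):
    """Call operation(), retrying with exponential back-off on rate-limit errors.

    is_rate_limited(error) decides whether an exception is worth retrying;
    any other error, or a failure on the final attempt, is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_rate_limited(error):
                raise
            # Double the wait on every attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

You would wrap the write that hits the 429 in a function and pass a predicate that recognises the rate-limit error, e.g. retry_with_backoff(write_object, lambda e: "429" in str(e)), where write_object is whatever call you are making.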

Related

Log the size of the artifacts archive before upload attempt in gitlab-ci

I have an on-prem instance of GitLab and I started seeing the following error in one of the pipelines:
ERROR: Uploading artifacts as “archive” to coordinator…
413 Request Entity Too Large id=1390915 responseStatus=413 Request Entity Too Large status=413
Before I go to the admins and request that they increase the limit (as suggested here), I would like to see what is the size of the compressed artifact - perhaps I am packing too many things.
I am using:
gitlab-runner 14.7.0
windows
How can I log the size of the package into the output?
One way to achieve this is to pass the path of the artifact to the du command.
The du (disk usage) command estimates the space used by a file or directory.
The options -sh are (from man du):
-s, --summarize
display only a total for each argument
-h, --human-readable
print sizes in human readable format (e.g., 1K 234M 2G)
E.g.:
...
  script:
    ...
    - du -hs path/to/artifacts
  artifacts:
    paths:
      - path/to/artifacts
Output:
40M path/to/artifacts
On Windows, you can use gci and measure to get the size in bytes of a directory, using powershell.
So, in your GitLab job you could use the following method, assuming that .\dist is your artifact directory:
my_job:
  after_script:
    - powershell -c "gci -Recurse .\dist | Measure-Object -Property Length -sum"
You can omit the powershell -c and quotes if your runner uses powershell by default.
You'll see an output like:
Count : 1234
Average :
Sum : 1234567890 # <--- this is the size in bytes
Minimum :
Maximum :
Property : Length
If you're using Linux images on a docker executor on Windows, then you can use the method mentioned in the other answer.
Unfortunately, it's not practical to get the compressed size of the artifact from your end, in part because the compression algorithm is determined by the GitLab runner's settings. Additionally, some artifacts, like any under reports:, are not compressed at all. GitLab also generates metadata files for each individual artifact, which contribute to the overall storage size.
You can guess the approximate compressed size by first using gzip to compress your artifacts (as Go's compress/gzip is what is used by the runner) and then using the above method. That said, I wouldn't expect the compressed size to be significantly smaller unless your artifact data lends itself well to compression, so the uncompressed size should be reasonable for an admin to use when increasing your limits.
Keep in mind, you'll probably want to request a limit with some reasonable overhead in case your artifact size varies in the future.
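If Python is available on the runner, the gzip estimate described above could be sketched roughly like this. The function name artifact_sizes is made up for this example, and the result is only an approximation: it gzips each file separately and ignores archive and metadata overhead.

```python
import gzip
import os


def artifact_sizes(root):
    """Return (raw_bytes, gzip_bytes) summed over every file under `root`.

    Each file is read fully into memory and compressed on its own, so this
    is a rough estimate and not the size the runner will actually upload.
    """
    raw_total = 0
    gzip_total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as handle:
                data = handle.read()
            raw_total += len(data)
            gzip_total += len(gzip.compress(data))
    return raw_total, gzip_total
```

Printing the result of artifact_sizes("dist") in the job (assuming dist is your artifact directory) puts both numbers in the log, which you can then pass on to the admins.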

Reduce Service Fabric backup size

I'm trying to use Service Fabric backups with Actors:
var backupDescription = new BackupDescription(BackupOption.Full, BackupCallbackAsync);
await BackupAsync(backupDescription, TimeSpan.FromHours(1), cancellationToken);
But I've noticed that one backup may contain several files like:
edb0000036A.log 5120 KB
edb0000036B.log 5120 KB
edb00000366.log 5120 KB
...
I haven't found any info about these files, but it seems that they are just logs and I may not need to include them. Am I right, or must these files be included in the backup?
These files are quite heavy, so I'm trying to reduce the size of the backups.
UPDATE 1:
I have tried to use incremental backups, but it seems that Actors do not support incremental backup, as I have read on MSDN. I also tested it and got the exception "Invalid backup option. Parameter name: option".
Instead of doing full backups every hour, you can also use incremental backups, which will result in a smaller size (for example, do a full backup every day and incrementals every hour).
The log files are transaction logs; they are not optional for a restore. More info here.

How to prevent eventLog file of Spark stream jobs eating up space?

We have multiple run-forever streaming jobs generating huge eventLogs. These in-progress logs won't be removed until they reach the max age config (spark.history.fs.cleaner.maxAge).
Based on the Spark source code, "Only completed applications older than the specified max age will be deleted." https://github.com/apache/spark/blob/a45647746d1efb90cb8bc142c2ef110a0db9bc9f/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
So in-progress eventLogs will never be removed before completion, and they are eating up space. Does anyone have an idea how to prevent this?
We could have a script periodically remove old files, but that would be our last resort. We cannot modify the source code, only the configuration.

Running multiple Apache Nutch fetch map tasks on a Hadoop Cluster

I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
I am using the bin/crawl script and made the following tweaks to trigger a fetch with multiple map tasks, but without success.
Added maxNumSegments and numFetchers parameters to the generate phase.
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
Removed the topN parameter and removed the noParsing parameter because I want the parsing to happen at the time of fetch.
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
The generate phase is not generating more than one segment.
As a result, the fetch phase is not creating multiple map tasks. I also believe that, the way the script is written, it does not allow the fetch to fetch multiple segments even if the generate phase were to generate multiple segments.
Can someone please let me know how they got the script to run in a distributed Hadoop cluster? Or is there a different version of the script that should be used?
Thanks.
Are you using Nutch 1.xx for this? In this case, the Generator class looks for a flag called "mapred.job.tracker" and tries to see if it is local. This property has been deprecated in Hadoop2 and the default value is set to local. You will have to overwrite the value of the property to something other than local and the Generator will generate multiple partitions for the segments.
I've recently faced this problem and thought it'd be a good idea to build upon Keith's answer to provide a more thorough explanation about how to solve this issue.
I've tested this with Nutch 1.10 and Hadoop 2.4.0.
As Keith said, the if block on line 542 in Generator.java reads the mapred.job.tracker property and sets the value of the variable numLists to 1 if the property is local. This variable seems to control the number of reduce tasks and influences the number of map tasks.
Overwriting the value of said property in mapred-site.xml fixes this:
<property>
    <name>mapred.job.tracker</name>
    <value>distributed</value>
</property>
(Or any other value you like except local).
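To make the effect of that check concrete, here is a small Python paraphrase of the behaviour described above. It is only a sketch, not the actual Java from Generator.java; the function name effective_num_lists is made up, and the fallback to "local" reflects the default mentioned in the earlier answer.

```python
def effective_num_lists(conf, requested_num_lists):
    """Return the number of partitions the generate step would end up with.

    `conf` is a plain dict standing in for the Hadoop configuration; a
    missing mapred.job.tracker entry is treated as "local" here.
    """
    if conf.get("mapred.job.tracker", "local") == "local":
        # A local job tracker forces exactly one partition, whatever was requested.
        return 1
    return requested_num_lists
```

With an empty config, a request for 3 lists collapses to 1, while {"mapred.job.tracker": "distributed"} keeps all 3. That is why overriding the property matters.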
The problem is this wasn't enough in my case to generate more than one fetch map task. I also had to update the value of the numSlaves parameter in the runtime/deploy/bin/crawl script. I didn't find any mentions of this parameter in the Nutch 1.x docs so I stumbled upon it after a bit of trial and error.
#############################################
# MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
#############################################
# set the number of slaves nodes
numSlaves=3
# and the total number of available tasks
# sets Hadoop parameter "mapred.reduce.tasks"
numTasks=`expr $numSlaves \* 2`
...

Why does Spark job fail with "too many open files"?

I get "too many open files" during the shuffle phase of my Spark job. Why is my job opening so many files? What steps can I take to try to make my job succeed?
This has been answered on the spark user list:
The best way is definitely just to increase the ulimit if possible,
this is sort of an assumption we make in Spark that clusters will be
able to move it around.
You might be able to hack around this by decreasing the number of
reducers [or cores used by each node] but this could have some performance implications for your
job.
In general if a node in your cluster has C assigned cores and you run
a job with X reducers then Spark will open C*X files in parallel and
start writing. Shuffle consolidation will help decrease the total
number of files created but the number of file handles open at any
time doesn't change so it won't help the ulimit problem.
-Patrick Wendell
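As a back-of-the-envelope check of the C*X estimate quoted above, this Python sketch compares it against the open-file limit of the current process. The function names are made up for this example, and the resource module is only available on Unix-like systems.

```python
import resource


def shuffle_files_per_node(cores, reducers):
    """Files opened in parallel on one node, per the C*X estimate."""
    return cores * reducers


def fits_in_limit(cores, reducers, soft_limit=None):
    """Check the estimate against a soft open-file limit.

    If soft_limit is not given, the current process's RLIMIT_NOFILE soft
    limit is used.
    """
    if soft_limit is None:
        soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return shuffle_files_per_node(cores, reducers) <= soft_limit
```

For example, 8 cores and 200 reducers give 1600 files open in parallel, which does not fit under a soft limit of 1024.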
The default ulimit is 1024, which is ridiculously low for large-scale applications. HBase recommends up to 64K; modern Linux systems don't seem to have trouble with this many open files.
Use
ulimit -a
to see your current limits, including the maximum number of open files.
ulimit -n <new limit>
temporarily changes the maximum number of open files for the current shell; you need to update the system configuration files and per-user limits to make this permanent. On CentOS and RedHat systems, those can be found in
/etc/sysctl.conf
/etc/security/limits.conf
Another solution for this error is reducing your partitions.
Check to see if you've got a lot of partitions with:
someBigSDF.rdd.getNumPartitions()
Out[]: 200
# If you need to persist the repartition, do it like this
someBigSDF = someBigSDF.repartition(20)
# If you just need it for one transformation/action,
# you can do the repartition inline like this
# (count comes from pyspark.sql.functions)
from pyspark.sql.functions import count
someBigSDF.repartition(20).groupBy("SomeDt").agg(count("SomeQty")).orderBy("SomeDt").show()
