java.lang.NumberFormatException: in Pyspark when writing to S3 - apache-spark

I am trying to read a compressed log file from an S3 bucket using PySpark on an EC2 instance.
The EC2 instance has read permission on the S3 bucket, as I am able to download the file manually with the AWS CLI.
This is what my code looks like:
file_path= 's3a://<bucket_name>/<path_of_file>'
rdd1 = sc.textFile(file_path)
rdd1.take(3)
But I am getting the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o36.partitions.
: java.lang.NumberFormatException: For input string: "64M"
Can somebody help me out?

You are mixing a version of hadoop-common with an older version of hadoop-aws.
The s3a connector added support for a unit suffix when declaring the multipart block size in 2016, in https://issues.apache.org/jira/browse/HADOOP-13680.
hadoop-common JAR versions 2.8+ set it to "64M".
If the version of the s3a connector you are using can't cope with that, it predates that change, which makes it many years out of date.
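To illustrate the failure mode, here is a minimal Python sketch (not Hadoop's actual code) contrasting a strict numeric parse with a unit-aware one. The function names and the set of accepted suffixes are assumptions made for the example.

```python
import re

# Binary multipliers for the suffixes this sketch accepts (an assumption,
# not the full set any particular Hadoop release understands).
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_plain_long(value: str) -> int:
    """Strict parse: digits only, so a value such as "64M" is rejected."""
    if not re.fullmatch(r"[+-]?\d+", value):
        raise ValueError(f'For input string: "{value}"')
    return int(value)


def parse_size_with_unit(value: str) -> int:
    """Unit-aware parse: "64M" becomes 64 * 1024 * 1024 bytes."""
    match = re.fullmatch(r"(\d+)([KMGT]?)", value.strip().upper())
    if match is None:
        raise ValueError(f"Not a size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```

An old connector behaves like parse_plain_long and fails on the "64M" default supplied by a newer hadoop-common, while a connector of the matching version behaves like parse_size_with_unit.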
Please:
Upgrade your hadoop-* JARs to a recent version, ideally 3.3.0+.
Make sure they are all the same version, unless you enjoy seeing stack traces.
Use the exact same aws-sdk-bundle JAR which Hadoop was built with, unless you want to see different stack traces.
This is not an opinion; these are instructions from the hadoop-aws maintenance team.
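A quick way to spot mixed versions is to look at the JAR file names on the classpath. The following is a rough sketch that assumes the usual artifact-version.jar naming; the helper names are made up for this example, and it only inspects file names, not JAR contents.

```python
import re
from collections import defaultdict

# Matches names such as hadoop-aws-2.7.3.jar or hadoop-common-3.3.4.jar.
_JAR_RE = re.compile(
    r"^(hadoop-[a-z0-9-]+?)-(\d+(?:\.\d+)+(?:-[A-Za-z0-9.]+)?)\.jar$"
)


def hadoop_versions(jar_names):
    """Group hadoop-* artifacts by the version embedded in their file name."""
    versions = defaultdict(set)
    for name in jar_names:
        match = _JAR_RE.match(name)
        # Shaded third-party artifacts follow their own versioning, skip them.
        if match and not match.group(1).startswith("hadoop-shaded"):
            versions[match.group(2)].add(match.group(1))
    return dict(versions)


def is_consistent(jar_names):
    """True when every hadoop-* JAR found carries the same version."""
    return len(hadoop_versions(jar_names)) <= 1
```

For example, passing the output of os.listdir on a Spark jars directory to hadoop_versions would show at a glance whether hadoop-common and hadoop-aws disagree.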

Related

Reading CSV file with Spark runs sometimes forever

I'm using Spark 2.4.8 with the gcs-connector from com.google.cloud.bigdataoss in version hadoop2-2.1.8. For development I'm using a Compute Engine VM with my IDE. I try to consume some CSV files from a GCS bucket natively with Spark's .csv(...).load(...) functionality. Some files are loaded successfully, but some are not. In the Spark UI I can then see that the load job runs forever until a timeout fires.
The weird thing is that when I run the same application packaged as a fat JAR in a Dataproc cluster, all the same files can be consumed successfully.
What am I doing wrong?
@JanOels, as you mentioned in the comment, using gcs-connector version hadoop2-2.2.8 resolves this issue; the latest hadoop2 version at the time of writing is hadoop2-2.2.10.
For more information about all the hadoop2 versions of the gcs-connector from com.google.cloud.bigdataoss, refer to this document.

History Server running with different Spark versions

I have a use case where a Spark application runs on one Spark version, the event data is published to S3, and the history server is started from the same S3 path but with a different Spark version. Will this cause any problems?
No, it will not cause any problems as long as you can read from the S3 bucket using that specific format. Spark versions are mostly compatible. As long as you can figure out how to work with the specific version, you're good.
EDIT:
Spark writes to the S3 bucket in the data format that you specify. For example, if you create a txt file on a PC, any computer can open it. Similarly, once you've created a Parquet file on S3, any Spark version can open it; just the API may differ.

How to use the new Hadoop Parquet magic committer with a custom S3 server with Spark

I have Spark 2.4.0 and Hadoop 3.1.1. According to the Hadoop documentation, to use the new magic committer that allows Parquet files to be written to S3 consistently, I've set these values in conf/spark-defaults.conf:
spark.sql.sources.commitProtocolClass com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.magic.enabled true
When using this configuration I end up with the exception:
java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
My question is twofold. First, do I understand correctly that Hadoop 3.1.1 allows Parquet files to be written to S3 consistently?
Second, if I understood correctly, how do I use the new committer properly from Spark?
Edit:
OK, I have two server instances, one of them a bit old now. I've tried the latest version of minio with these parameters:
sc.hadoopConfiguration.set("hadoop.fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.multipart.size","128M")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.active.blocks","4")
sc.hadoopConfiguration.set("fs.s3a.committer.name","partitioned")
So far I'm able to write without trouble.
However, my Swift server, which is a bit older, with this config:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
does not seem to support the partitioned committer properly.
Regarding "Hadoop S3Guard":
It is not currently possible. Hadoop S3Guard, which keeps metadata of the S3 files, must be enabled in Hadoop. S3Guard, though, relies on DynamoDB, a proprietary Amazon service.
There is currently no alternative, such as a SQLite file or another DB system, to store the metadata.
So if you're using S3 with minio or any other S3 implementation, you're missing DynamoDB.
This article explains nicely how S3Guard works.
Kiwy: that's my code; I can help you with this. Some of the classes haven't made it into the ASF Spark releases, but you'll find them in the Hadoop JARs, and I could have a go at building the ASF release with the relevant dependencies in (I could put them in downstream; they used to be there).
You do not need S3Guard turned on to use the "staging committer"; it's only the "magic" variant which needs consistent object store listings during the commit phase.
All the documentation on configuring the new committers that I've read to date is missing one fundamental fact:
Spark 2.x.x does not have the support classes needed to make the new S3A committers function.
They promise those cloud integration libs will be bundled with Spark 3.0.0, but for now you have to add the libraries yourself.
Under the cloud integration Maven repos there are multiple distributions supporting the committers; I found one that works with the directory committer but not the magic one.
In general the directory committer is recommended over magic, as it has been well tested and tried. It requires a shared filesystem such as HDFS or NFS (we use AWS EFS) to coordinate Spark worker writes to S3; the magic committer does not require one, but needs S3Guard.
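For reference, here is a small Python sketch of what the committer settings might look like on a Spark 3.x build that includes the spark-hadoop-cloud module. That build is an assumption, as are the two org.apache.spark.internal.io.cloud class names (taken from Spark's cloud integration documentation, not from this thread); the helper functions are hypothetical.

```python
def s3a_committer_conf(committer="directory"):
    """Build Spark settings for an S3A committer.

    Assumes a Spark 3.x distribution with the spark-hadoop-cloud module
    on the classpath; without it the two classes below will not be found.
    """
    if committer not in {"directory", "partitioned", "magic"}:
        raise ValueError(f"Unknown committer: {committer}")
    conf = {
        "spark.hadoop.fs.s3a.committer.name": committer,
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }
    if committer == "magic":
        # Same switch as in the question's configuration.
        conf["spark.hadoop.fs.s3a.committer.magic.enabled"] = "true"
    return conf


def as_submit_args(conf):
    """Render the settings as --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```

The same key/value pairs could equally go into conf/spark-defaults.conf; whether a given S3-compatible store works with a given committer still has to be tested.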

spark LOCAL and alluxio client

I'm running spark in LOCAL mode and trying to get it to talk to alluxio. I'm getting the error:
java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found
I have looked at the page here:
https://www.alluxio.org/docs/master/en/Debugging-Guide.html#q-why-do-i-see-exceptions-like-javalangruntimeexception-javalangclassnotfoundexception-class-alluxiohadoopfilesystem-not-found
Which details the steps to take in this situation, but I'm not finding success.
According to the Spark documentation, I can instantiate a local Spark like so:
SparkSession.builder
.appName("App")
.getOrCreate
Then I can add the alluxio client library like so:
sparkSession.conf.set("spark.driver.extraClassPath", ALLUXIO_SPARK_CLIENT)
sparkSession.conf.set("spark.executor.extraClassPath", ALLUXIO_SPARK_CLIENT)
I have verified that the proper jar file exists in the right location on my local machine with:
logger.error(sparkSession.conf.get("spark.driver.extraClassPath"))
logger.error(sparkSession.conf.get("spark.executor.extraClassPath"))
But I still get the error. Is there anything else I can do to figure out why Spark is not picking the library up?
Please note I am not using spark-submit - I am aware of the methods for adding the client jar to a spark-submit job. My Spark instance is being created as local within my application and this is the use case I want to solve.
As an FYI there is another application in the cluster which is connecting to my alluxio using the fs client and that all works fine. In that case, though, the fs client is being packaged as part of the application through standard sbt dependencies.
Thanks
In the hopes that this helps someone else:
My problem here was not that the library wasn't getting loaded or wasn't on the classpath, it was that I was using the "fs" version of the client rather than the "hdfs" version.
I had been using a generic 1.4 client - at some point this client was split into a fs version and an hdfs version. When I updated this for 1.7 recently I mistakenly added the "fs" version.
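One way to catch this kind of mix-up is to check whether the client JAR actually contains the class named in the exception. A minimal sketch using only the Python standard library follows; the function name is made up for this example.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the JAR has an entry for the fully qualified class."""
    # A JAR is a zip archive; classes are stored as path/to/Name.class.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

Calling jar_contains_class with the path of the client JAR and "alluxio.hadoop.FileSystem" would show right away whether the "fs" or the "hdfs" flavor of the client was picked up.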

Unable to access S3 data using Spark 2.2

I get a lot of data uploaded to an S3 bucket that I want to analyze/visualize using Spark and Zeppelin. Yet, I am still stuck at loading data from S3.
I did some reading in order to get this together and spare me gory details. I am using the docker container p7hb/docker-spark as Spark installation and my basic test for reading data from S3 is derived from here:
I start the container and a master and a slave process within. I can validate this works by looking at the Spark Master WebUI, exposed on port 8080. This page does list the worker and keeps a log of all my failed attempts under the headline "Completed Applications". All of those are in the state FINISHED.
I open a bash inside that container and do the following:
a) export the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as suggested here.
b) start spark-shell. In order to access S3 one seems to need to load some extra packages. Browsing through SE I found this in particular, which taught me that I can use the --packages parameter to load said packages. Essentially I run spark-shell --packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5 (for arbitrary combinations of versions).
c) I run the following code
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-eu-central-1.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
val sonnets=sc.textFile("s3a://my-bucket/my.file")
val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
And then I get all kinds of different Error messages, depending on the versions I choose in 2b).
I suppose there is nothing wrong with 2a), b/c I get the error message Unable to load AWS credentials from any provider in the chain if I don't supply those. This is a known error new users seem to make.
While trying to solve the issue, I pick more or less random versions from here and there for the two extra packages. Somewhere on SE I read that hadoop-aws:2.7 is supposed to be the right choice, because Spark 2.2 is based on Hadoop 2.7. Supposedly one needs to use aws-java-sdk:1.7 with that version of hadoop-aws.
Whatever! I tried the following combinations:
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1, which yields the common Bad Request 400 error.
Many problems can lead to that error; my attempt as described above contains everything I was able to find on this page. The description above uses s3-eu-central-1.amazonaws.com as the endpoint, while other places use s3.eu-central-1.amazonaws.com. Reportedly, both endpoint names are supposed to work. I did try both.
--packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5, which are the most recent micro versions in either case, I get the error message java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.7.5, I also get java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.1, I get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.8.12,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.9.0, I also get java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
And, for completeness sake, when I don't provide the --packages parameter, I get java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
Currently nothing seems to work. Yet, there are so many Q/As on this topic, who knows what's the way du jour of doing this. This is all in local mode, so there is virtually no other source of error. My method of accessing S3 must be wrong. How is it done correctly?
Edit 1:
So I put another day into this, without any actual progress. As far as I can tell, starting from Hadoop 2.6, Hadoop doesn't have built-in support for S3 anymore; it has to be loaded through additional libraries, which are not part of core Hadoop and are managed separately. Besides all the clutter, the library I ultimately want seems to be hadoop-aws. It has a webpage here and it carries what I would call authoritative information:
The versions of hadoop-common and hadoop-aws must be identical.
The important thing about this information is, that hadoop-common actually does ship with a Hadoop installation. Every Hadoop installation has a corresponding jar file, so this is a solid starting point. My containers have a file /usr/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar so it is fair to assume 2.7.3 is the version I need for hadoop-aws.
After that it gets murky. Hadoop versions 2.7.x have something going on internally, so that they are not compatible with more recent versions of aws-java-sdk, which is a library required by hadoop-aws. The Internet is full of advice to use version 1.7.4, for example here, but other comments suggest using version 1.7.14 for 2.7.x.
So I did another run using hadoop-aws:2.7.3 and aws-java-sdk:1.7.x, with x ranging from 4 to 14. No results whatsoever; I always end up with error 400, Bad Request.
My Hadoop installation ships joda-time 2.9.4. I read the problem was resolved with Hadoop 2.8. I suppose I will just go ahead and build my own docker containers with more recent versions.
Edit 2
Moved to Hadoop 2.8.3. It just works now. Turns out you don't even have to mess around with JARs at all. Hadoop ships with what are supposed to be working JARs for accessing AWS S3. They are hidden in ${HADOOP_HOME}/share/hadoop/tools/lib and not added to the classpath by default. I simply load the JARs in that directory, execute my code as stated above, and now it works.
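As a rough illustration of "loading the JARs in that directory", the sketch below collects the S3-related JARs from share/hadoop/tools/lib into a comma-separated string. The helper name and the default file name patterns are assumptions; adjust them to whatever your Hadoop distribution actually ships.

```python
import glob
import os


def tools_lib_jars(hadoop_home,
                   patterns=("hadoop-aws-*.jar", "aws-java-sdk*.jar")):
    """Return a comma-separated list of matching JARs under tools/lib."""
    lib_dir = os.path.join(hadoop_home, "share", "hadoop", "tools", "lib")
    jars = []
    for pattern in patterns:
        # Sort each group so the result is stable across runs.
        jars.extend(sorted(glob.glob(os.path.join(lib_dir, pattern))))
    return ",".join(jars)
```

The resulting string has the comma-separated form that the spark-shell --jars option expects, so the JARs that Hadoop was built with are used instead of hand-picked versions.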
Mixing and matching AWS SDK JARs with anything else is an exercise in futility, as you've discovered. You need the version of the AWS JARs Hadoop was built with, and the version of Jackson AWS was built with. Oh, and don't try mixing any of (different amazon-* JARs, different hadoop-* JARs, different jackson-* JARs); they all go in lock-sync.
For Spark 2.2.0 and Hadoop 2.7, use the AWS 1.7.4 artifacts, and if you are on Java 8, make sure that Joda-Time is newer than 2.8.0, such as 2.9.4. An older Joda-Time can lead to 400 "bad auth" problems.
Otherwise, try Troubleshooting S3A
