configuration for `spark.hadoop.fs.s3` gets applied to `fs.s3a` not `fs.s3` - apache-spark

I have read the answer posted on how to configure S3 access keys for Dataproc, but I find it unsatisfactory.
The reason is that when I follow the steps and set the Hadoop conf via spark.hadoop.fs.s3, s3://... paths still have access issues, while s3a://... paths work. A test spark-shell run is shown below.
s3 vs. s3n vs. s3a is a topic of its own, though I assume we no longer need to worry about s3n. Still, I find it strange that configuration applied to s3 manifests under s3a.
Here are my questions:
Is this a Dataproc issue or a Spark issue? I'm assuming Spark, since spark-shell itself exhibits the problem.
Is there a way to configure s3 from spark-submit conf flags, without a code change?
Is this a bug, or is s3a now simply preferred over s3?
Thanks,
***#!!!:~$ spark-shell --conf spark.hadoop.fs.s3.awsAccessKeyId=CORRECT_ACCESS_KEY \
> --conf spark.hadoop.fs.s3.awsSecretAccessKey=SECRET_KEY
// Try to read existing path, which breaks...
scala> spark.read.parquet("s3://bucket/path/to/folder")
17/06/01 16:19:58 WARN org.apache.spark.sql.execution.datasources.DataSource: Error while looking for metadata directory.
java.io.IOException: /path/to/folder doesn't exist
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:170)
...
// notice `s3` not `s3a`
scala> spark.conf.get("spark.hadoop.fs.s3.awsAccessKeyId")
res3: String = CORRECT_ACCESS_KEY
scala> spark.conf.get("fs.s3.awsAccessKeyId")
java.util.NoSuchElementException: key not found: fs.s3.awsAccessKeyId
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
... 48 elided
scala> sc
res5: org.apache.spark.SparkContext = org.apache.spark.SparkContext@426bf2f2
scala> sc.hadoopConfiguration
res6: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, file:/etc/hive/conf.dist/hive-site.xml
scala> sc.hadoopConfiguration.get("fs.s3.access.key")
res7: String = null <--- ugh... why is this null?
scala> sc.hadoopConfiguration.get("fs.s3n.access.key")
res10: String = null <--- I understand this...
scala> sc.hadoopConfiguration.get("fs.s3a.access.key")
res8: String = CORRECT_ACCESS_KEY <--- But what is this???
// Successful file read
scala> spark.read.parquet("s3a://bucket/path/to/folder")
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
res9: org.apache.spark.sql.DataFrame = [whatev... ... 22 more fields]

There's some magic in spark-submit which picks up your AWS_ environment variables and sets them for the {s3, s3n, s3a} filesystems; that may be what's happening under the hood.
There is no magic copying of the s3a settings to the s3 or s3n options in the Hadoop JARs, or anywhere else, so it may be one of the -site.xml files that is defining them.
Do use s3a:// if you have the Hadoop 2.7.x JARs on your classpath anyway: better performance, V4 API auth (mandatory for some regions), and ongoing maintenance.
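On the question of configuring s3 purely from spark-submit conf flags: one workaround (an assumption on my part, not something confirmed in this thread; verify against your Hadoop version) is to map the bare s3:// scheme onto the S3A connector entirely at launch time, so no code change is needed:

```shell
# Sketch: point the s3:// scheme at the S3A filesystem implementation
# and supply s3a-style credentials. Key values are placeholders.
spark-shell \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.access.key=CORRECT_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=SECRET_KEY
```

With this mapping, s3://bucket/... paths are handled by the same connector as s3a://bucket/..., which sidesteps the old Jets3t-based s3 filesystem entirely.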

Related

java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html

I am trying to read data from Hudi but am getting the error below:
Caused by: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html
I am able to read the data from Hudi in my Jupyter notebook using the commands below:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.config(
"spark.sql.catalogImplementation", "hive"
).config(
"spark.serializer", "org.apache.spark.serializer.KryoSerializer"
).enableHiveSupport().getOrCreate
import org.apache.hudi.DataSourceReadOptions
val hudiIncQueryDF = spark.read.format("hudi").load(
"path"
)
import org.apache.spark.sql.functions._
hudiIncQueryDF.filter(col("column_name")===lit("2022-06-01")).show(10,false)
This jupyter notebook was opened using a cluster which was created with one of the below properties
--properties spark:spark.jars="gs://rdl-stage-lib/hudi-spark3-bundle_2.12-0.10.0.jar" \
However, when I try to run the job using spark-submit on the same cluster, I get the error above.
I have also added spark.serializer=org.apache.spark.serializer.KryoSerializer to my job properties. Not sure what the issue is.
Your application depends on the Hudi jar, and Hudi itself has dependencies of its own. When you add the Maven package to your session, Spark resolves the Hudi jar along with its dependencies; in your case, you provide only the Hudi jar file from a GCS bucket.
You can try this property instead:
--properties spark:spark.jars.packages="org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0" \
Or directly from your notebook:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.config(
"spark.sql.catalogImplementation", "hive"
).config(
"spark.serializer", "org.apache.spark.serializer.KryoSerializer"
).config(
"spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog"
).config(
"spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
).config(
"spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0"
).enableHiveSupport().getOrCreate
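The same dependency resolution can be sketched for spark-submit itself (the Maven coordinates match those suggested above; your_job.jar is a placeholder, not a value from the thread):

```shell
# Sketch: let Spark resolve the Hudi bundle and its transitive
# dependencies from Maven, instead of shipping a single jar from GCS.
spark-submit \
  --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  your_job.jar
```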

saveAsTable ends in failure in Spark-yarn cluster environment

I set up a spark-yarn cluster environment, and try spark-SQL with spark-shell:
spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=hdfs://hadoop_273_namenode_ip:namenode_port/spark-archive.zip
One thing to mention: Spark is running on Windows 7. After spark-shell starts up successfully, I execute the commands below:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val df_mysql_address = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://mysql_db_ip/db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "ADDRESS").option("user", "root").option("password", "root").load()
scala> df_mysql_address.show
scala> df_mysql_address.write.format("parquet").saveAsTable("address_local")
The "show" command returns the result set correctly, but "saveAsTable" fails. The error message says:
java.io.IOException: Mkdirs failed to create file:/C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse/address_local/_temporary/0/_temporary/attempt_20171018104423_0001_m_000000_0 (exists=false, cwd=file:/tmp/hadoop/nm-local-dir/usercache/hduser/appcache/application_1508319604173_0005/container_1508319604173_0005_01_000003)
I expected the table to be saved in the Hadoop cluster, but you can see that the dir (C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse) is a folder on my Windows 7 machine, not in HDFS, and not even on the Hadoop Ubuntu machine.
What should I do? Please advise. Thanks.
The way to get rid of the problem is to provide the "path" option prior to the "save" operation, as shown below:
scala> df_mysql_address.write.option("path", "/spark-warehouse").format("parquet").saveAsTable("address_local")
Thanks @philantrovert.
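As context for why the path option helps (my understanding, not verified against the Spark source): without an explicit path, saveAsTable creates a managed table under spark.sql.warehouse.dir, which here resolved to a local file: URI on the driver's machine. An alternative sketch is to point the warehouse at HDFS when launching (the HDFS path below is an assumed example):

```shell
# Sketch: set the warehouse dir to an HDFS location at launch,
# so managed tables land in the cluster instead of the local filesystem.
spark-shell --master yarn \
  --conf spark.sql.warehouse.dir=hdfs:///user/hive/warehouse
```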

How to add hbase-site.xml config file using spark-shell

I have the following simple code:
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.HBaseConfiguration
val hbaseconfLog = HBaseConfiguration.create()
val connectionLog = ConnectionFactory.createConnection(hbaseconfLog)
Which I'm running on spark-shell, and I'm getting the following error:
14:23:42 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected
error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:30)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
Many of these errors, actually, and a few of these every now and then:
14:23:46 WARN client.ZooKeeperRegistry: Can't retrieve clusterId from
Zookeeper org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
Through Cloudera's VM I'm able to solve this by simply restarting the hbase-master, regionserver, and thrift services, but here at my company I'm not allowed to do that. I also solved it once by copying the file hbase-site.xml to the Spark conf directory, but I can't do that either. Is there a way to set the path to this specific file in the spark-shell parameters?
1) Make sure that your ZooKeeper is running.
2) Copy hbase-site.xml to the /etc/spark/conf folder, just as we copy hive-site.xml to /etc/spark/conf to access Hive tables.
3) export SPARK_CLASSPATH=/a/b/c/hbase-site.xml:/d/e/f/hive-site.xml
as described in the Hortonworks forum.
Or, open spark-shell without adding hbase-site.xml and execute these three commands in spark-shell:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.addResource(new Path("/home/spark/development/hbase/conf/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, table_name) // table_name: the HBase table to read
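If modifying the code is also not an option, a launch-time alternative (a sketch only; flag behavior varies across Spark versions, so treat it as an assumption to verify) is to put the directory containing hbase-site.xml on both the driver and executor classpaths:

```shell
# Sketch: add the directory (not the file itself) holding hbase-site.xml
# to both classpaths, so HBaseConfiguration.create() picks it up.
spark-shell \
  --driver-class-path /home/spark/development/hbase/conf \
  --conf spark.executor.extraClassPath=/home/spark/development/hbase/conf
```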

spark on yarn java.io.IOException: No FileSystem for scheme: s3n

My English is poor, sorry, but I really need help.
I use spark-2.0.0-bin-hadoop2.7 and hadoop 2.7.3, reading logs from S3 and writing results to local HDFS. I can run the Spark driver successfully in standalone mode, but when I run the same driver in yarn mode, it throws:
17/02/10 16:20:16 ERROR ApplicationMaster: User class threw exception: java.io.IOException: No FileSystem for scheme: s3n
In hadoop-env.sh I added
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
Running hadoop fs -ls s3n://xxx/xxx/xxx can list the files.
I thought the cause should be that it can't find aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.3.jar.
What can I do?
I'm not using the same versions as you, but here is an extract of my [spark_path]/conf/spark-defaults.conf file that was necessary to get s3a working:
# hadoop s3 config
spark.driver.extraClassPath [path]/guava-16.0.1.jar:[path]/aws-java-sdk-1.7.4.jar:[path]/hadoop-aws-2.7.2.jar
spark.executor.extraClassPath [path]/guava-16.0.1.jar:[path]/aws-java-sdk-1.7.4.jar:[path]/hadoop-aws-2.7.2.jar
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key [key]
spark.hadoop.fs.s3a.secret.key [key]
spark.hadoop.fs.s3a.fast.upload true
Alternatively you can specify paths to the jars in a comma-separated format to the --jars option on job submit:
--jars [path]aws-java-sdk-[version].jar,[path]hadoop-aws-[version].jar
Notes:
Ensure the jars are in the same location on all nodes in your cluster
Replace [path] with your path
Replace s3a with your preferred protocol (last time I checked s3a was best)
I don't think guava is required to get s3a working, but I can't remember.
Stick the JARs into SPARK_HOME/lib, with the rest of the Spark bits.
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem isn't needed; the JAR will be auto-scanned and picked up.
Don't play with fast.output.enabled on 2.7.x unless you know what you are doing and are prepared to tune some of the thread pool options. Start without that option.
Add these jars to $SPARK_HOME/jars:
aws-java-sdk-1.7.4.jar, hadoop-aws-2.7.3.jar, jackson-annotations-2.7.0.jar, jackson-core-2.7.0.jar, jackson-databind-2.7.0.jar, joda-time-2.9.6.jar
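Putting the pieces together, a minimal yarn submission might look like the sketch below (the jar paths and your_job.jar are placeholders, not values from the thread):

```shell
# Sketch: ship the S3 connector jars explicitly when submitting to yarn,
# so executors can resolve the s3n/s3a filesystem schemes.
spark-submit --master yarn --deploy-mode cluster \
  --jars /opt/jars/aws-java-sdk-1.7.4.jar,/opt/jars/hadoop-aws-2.7.3.jar \
  your_job.jar
```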

Why Zeppelin notebook is not able to connect to S3

I have installed Zeppelin, on my aws EC2 machine to connect to my spark cluster.
Spark Version:
Standalone: spark-1.2.1-bin-hadoop1.tgz
I am able to connect to the Spark cluster, but I get the following error when trying to access a file in S3 in my use case.
Code:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","YOUR_SEC_KEY")
val file = "s3n://<bucket>/<key>"
val data = sc.textFile(file)
data.count
file: String = s3n://<bucket>/<key>
data: org.apache.spark.rdd.RDD[String] = s3n://<bucket>/<key> MappedRDD[1] at textFile at <console>:21
java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
I built Zeppelin with the following command:
mvn clean package -Pspark-1.2.1 -Dhadoop.version=1.0.4 -DskipTests
When I try to build with the hadoop profile "-Phadoop-1.0.4", it gives a warning that the profile doesn't exist.
I have also tried -Phadoop-1, as mentioned on the Spark website (which maps Hadoop 1.x to 2.1.x to the hadoop-1 profile), but got the same error.
Please let me know what I am missing here.
The following installation worked for me (I also spent many days on this problem):
Spark 1.3.1 prebuilt for Hadoop 2.3, set up on an EC2 cluster
git clone https://github.com/apache/incubator-zeppelin.git (date: 25.07.2015)
Installed Zeppelin via the following command (per the instructions at https://github.com/apache/incubator-zeppelin):
mvn clean package -Pspark-1.3 -Dhadoop.version=2.3.0 -Phadoop-2.3 -DskipTests
Changed the port to 8082 via "conf/zeppelin-site.xml" (Spark uses port 8080)
After these installation steps, my notebook worked with S3 files:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","xxx")
val file = "s3n://<<bucket>>/<<file>>"
val data = sc.textFile(file)
data.first
I think the S3 problem was not completely resolved in Zeppelin version 0.5.0, so cloning the current git repo did it for me.
Important information: the job only worked for me with the Zeppelin Spark interpreter setting master=local[*] (instead of spark://master:7777).
For me it worked in two steps:
1. Creating a sqlContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2. Reading S3 files like this:
val performanceFactor = sqlContext.read.parquet("s3n://<accessKey>:<secretKey>@mybucket/myfile/")
where you need to supply the access key and secret key.
In step 2, I am using the s3n protocol with the access and secret keys in the path itself.