Why is the spark-submit --files option not working? - apache-spark

The spark-submit --files option is not working as expected.
I am trying to use the following option with spark-submit:
--files FILES    Comma-separated list of files to be placed in the working
                 directory of each executor. File paths of these files in
                 executors can be accessed via SparkFiles.get(fileName).
sh-4.2$ spark-shell --files etl_emr_test_config.json
...
Spark session available as 'spark'.
Welcome to Spark version 2.4.0
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark._
import org.apache.spark._
scala> SparkFiles.get("etl_emr_test_config.json")
res0: String = /mnt/tmp/spark-770e7981-2a38-4b12-950d-3519e70bdbe0/userFiles-afa53bd8-45c9-4c30-a923-feb2f0927117/etl_emr_test_config.json
scala> spark.read.text(SparkFiles.get("etl_emr_test_config.json")).show()
org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://ip-100-69-166-111.ec2.internal:8020/mnt/tmp/spark-770e7981-2a38-4b12-950d-3519e70bdbe0/userFiles-afa53bd8-45c9-4c30-a923-feb2f0927117/etl_emr_test_config.json;
I was expecting etl_emr_test_config.json to be present at the path returned by SparkFiles.get("etl_emr_test_config.json"), but I get an error saying the file does not exist.
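For context, SparkFiles.get returns a path on the node-local filesystem, while spark.read with a scheme-less path is resolved against the cluster's default filesystem (HDFS on this EMR cluster), which is exactly what the AnalysisException above shows. A minimal sketch of one way to consume a small config file shipped with --files, reading it on the driver rather than through the DataFrame reader (variable names are illustrative):

import org.apache.spark.SparkFiles
import scala.io.Source

// SparkFiles.get returns a local path such as /mnt/tmp/spark-.../userFiles-.../etl_emr_test_config.json
val localPath = SparkFiles.get("etl_emr_test_config.json")

// For a small config file it is simplest to read it directly on the driver
val configJson = Source.fromFile(localPath).getLines().mkString("\n")
println(configJson)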

Related

Pyspark does not display the hive database

I am trying to connect to a Hive database via PySpark, but I can't see my databases (only default).
Welcome to Spark version 2.4.5
Using Python version 3.7.4 (default, Aug 13 2019 20:35:49)
SparkSession available as 'spark'.
>>> spark.sql('show databases')
DataFrame[databaseName: string]
>>> spark.sql('show databases').show()
+------------+
|databaseName|
+------------+
| default|
+------------+
But if I run the same command using Hive, I get the following:
hive> show databases;
OK
signals
default
test
Time taken: 0.973 seconds, Fetched: 3 row(s)
hive>
What should I do to connect to my Hive instance?
Please check whether you have configured Spark to use the Hive metastore.
Open SPARK_HOME/conf/hive-site.xml.
Check for the following property; if it's not there, add it.
<configuration>
<property>
<name>hive.metastore.uris</name>
<!-- hostname must point to the Hive metastore URI in your cluster -->
<value>thrift://hostname:port</value>
<description>URI for client to contact metastore server</description>
</property>
</configuration>
Note: If you don't know the hostname and port of your metastore, look in HIVE_HOME/conf/hive-site.xml; you can find those properties there.
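If editing Spark's hive-site.xml is not an option, here is a minimal sketch (Scala) of pointing a SparkSession at the metastore programmatically; the thrift URI is a placeholder, substitute your metastore host and port:

import org.apache.spark.sql.SparkSession

// thrift://hostname:9083 is a placeholder URI for the Hive metastore
val spark = SparkSession.builder()
  .appName("hive-metastore-check")
  .config("hive.metastore.uris", "thrift://hostname:9083")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()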

Google PubSub in Apache Spark 2.2.1

I'm trying to use Google Cloud PubSub within a Spark application. For simplicity let's just say that this application is Spark's shell. Trying to instantiate a Publisher throws a NoClassDefFoundError, which is most likely the result of dependency version conflicts. However, with a simple setup like this (just Spark and a Google Cloud PubSub dependency), I can't figure out how to resolve this issue.
bash-4.4# spark-shell --packages com.google.cloud:google-cloud-pubsub:1.105.0
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-2.2.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.google.cloud#google-cloud-pubsub added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.google.cloud#google-cloud-pubsub;1.105.0 in central
found io.grpc#grpc-api;1.28.1 in central
...
Spark session available as 'spark'.
Welcome to Spark version 2.2.1
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
scala> com.google.cloud.pubsub.v1.Publisher.newBuilder("topic").build
java.lang.NoClassDefFoundError: com/google/api/gax/grpc/InstantiatingGrpcChannelProvider
at com.google.cloud.pubsub.v1.stub.PublisherStubSettings.defaultGrpcTransportProviderBuilder(PublisherStubSettings.java:225)
at com.google.cloud.pubsub.v1.TopicAdminSettings.defaultGrpcTransportProviderBuilder(TopicAdminSettings.java:169)
at com.google.cloud.pubsub.v1.Publisher$Builder.<init>(Publisher.java:674)
at com.google.cloud.pubsub.v1.Publisher$Builder.<init>(Publisher.java:625)
at com.google.cloud.pubsub.v1.Publisher.newBuilder(Publisher.java:621)
... 48 elided
Caused by: java.lang.ClassNotFoundException: com.google.api.gax.grpc.InstantiatingGrpcChannelProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 53 more
Is there any way to get this to work? I could change the pubsub dependency version, but not the Spark version.
That is due to Google's Guava dependency conflict, which is known to occur whenever Spark is used together with Google libraries. The workaround (with Maven) is to use the maven-shade-plugin to relocate the conflicting packages.
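For reference, a sketch of the kind of relocation the maven-shade-plugin performs; the plugin version and the shaded package prefix below are illustrative, not taken from the question:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- Relocate Guava so the application's copy does not clash with the one on Spark's classpath -->
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>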

Apache Spark installation error.

I installed Apache Spark with the following set of commands on Ubuntu 16:
dpkg -i scala-2.12.1.deb
mkdir /opt/spark
tar -xvf spark-2.0.2-bin-hadoop2.7.tgz
cp -rv spark-2.0.2-bin-hadoop2.7/* /opt/spark
cd /opt/spark
Executing the Spark shell worked well:
./bin/spark-shell --master local[2]
It returned this output on the shell:
jai#jaiPC:/opt/spark$ ./bin/spark-shell --master local[2]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/05/15 19:00:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/15 19:00:55 WARN Utils: Your hostname, jaiPC resolves to a loopback address: 127.0.1.1; using 172.16.16.46 instead (on interface enp4s0)
18/05/15 19:00:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/05/15 19:00:55 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://172.16.16.46:4040
Spark context available as 'sc' (master = local[2], app id = local-1526391055793).
Spark session available as 'spark'.
Welcome to Spark version 2.0.2
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
But when I tried to access the Spark context Web UI at http://172.16.16.46:4040, it shows:
The page cannot be displayed
How can I resolve this problem? Please help.
Thanks and regards

NoSuchMethodError using Databricks Spark-Avro 3.2.0

I have a Spark master and worker running in Docker containers with Spark 2.0.2 and Hadoop 2.7. I'm trying to submit a job from PySpark in a different container (same network) by running:
df = spark.read.json("/data/test.json")
df.write.format("com.databricks.spark.avro").save("/data/test.avro")
But I'm getting this error:
java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
It makes no difference if I try interactively or with spark-submit. These are my loaded packages in spark:
com.databricks#spark-avro_2.11;3.2.0 from central in [default]
com.thoughtworks.paranamer#paranamer;2.7 from central in [default]
org.apache.avro#avro;1.8.1 from central in [default]
org.apache.commons#commons-compress;1.8.1 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
org.tukaani#xz;1.5 from central in [default]
org.xerial.snappy#snappy-java;1.1.1.3 from central in [default]
spark-submit --version output:
Welcome to Spark version 2.0.2
Branch
Compiled by user jenkins on 2016-11-08T01:39:48Z
Revision
Url
Type --help for more information.
scala version is 2.11.8
My pyspark command:
PYSPARK_PYTHON=ipython /usr/spark-2.0.2/bin/pyspark --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
My spark-submit command:
spark-submit script.py --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
I've read here that this can be caused by "an older version of avro being used" so I tried using 1.8.1, but I keep getting the same error. Reading avro works fine. Any help?
The cause of this error is that Apache Avro 1.7.4 is included in Hadoop by default, and if the SPARK_DIST_CLASSPATH environment variable lists the Hadoop common jars ($HADOOP_HOME/share/hadoop/common/lib/) before the ivy2 jars, the wrong version gets used instead of the version required by spark-avro (>= 1.7.6) and installed in ivy2.
To check if this is the case, open a spark-shell and run
sc.getClass().getResource("/org/apache/avro/generic/GenericData.class")
This should tell you the location of the class like so:
java.net.URL = jar:file:/lib/ivy/jars/org.apache.avro_avro-1.7.6.jar!/org/apache/avro/generic/GenericData.class
If that class is pointing to $HADOOP_HOME/share/hadoop/common/lib/, then you simply need to list your ivy2 jars before the Hadoop common jars in the SPARK_DIST_CLASSPATH environment variable.
For example, in a Dockerfile
ENV SPARK_DIST_CLASSPATH="/home/root/.ivy2/*:$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
Note: /home/root/.ivy2 is the default location for ivy2 jars. You can change it by setting spark.jars.ivy in your spark-defaults.conf, which is probably a good idea.
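For example, a one-line sketch of that spark-defaults.conf entry (the path is illustrative):

# conf/spark-defaults.conf
spark.jars.ivy   /home/root/.ivy2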
I have encountered a similar problem before.
Try using the --jars {path to spark-avro_2.11-3.2.0.jar} option with spark-submit.
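To make that concrete, a sketch of what the submit command could look like; the jar paths are placeholders:

spark-submit \
  --master spark://master:7077 \
  --jars /path/to/spark-avro_2.11-3.2.0.jar,/path/to/avro-1.8.1.jar \
  script.py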

error: not found: value sqlContext on EMR

I am on EMR using Spark 2. When I ssh into the master node and run spark-shell, I can't seem to have access to sqlContext. Is there something I'm missing?
[hadoop#ip-172-31-13-180 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/10 21:07:05 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/11/10 21:07:14 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://172.31.13.180:4040
Spark context available as 'sc' (master = yarn, app id = application_1478720853870_0003).
Spark session available as 'spark'.
Welcome to Spark version 2.0.1
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
scala> sqlContext
<console>:25: error: not found: value sqlContext
sqlContext
^
Since I'm getting the same error on my local computer, I've tried the following to no avail:
I exported SPARK_LOCAL_IP:
➜ play grep "SPARK_LOCAL_IP" ~/.zshrc
export SPARK_LOCAL_IP=127.0.0.1
➜ play source ~/.zshrc
➜ play spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/10 16:12:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/10 16:12:19 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://127.0.0.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1478812339020).
Spark session available as 'spark'.
Welcome to Spark version 2.0.1
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sqlContext
<console>:24: error: not found: value sqlContext
sqlContext
^
scala>
My /etc/hosts contains the following
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
Spark 2.0 doesn't use SQLContext anymore:
use SparkSession (initialized in spark-shell as spark).
For legacy applications you can use:
val sqlContext = spark.sqlContext
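Put together, a short sketch of the Spark 2.x equivalents inside spark-shell, where spark is already defined (the sample query is illustrative):

// Preferred in Spark 2.x: use the SparkSession directly
spark.sql("SELECT 1 AS one").show()

// For legacy code that still expects a SQLContext
val sqlContext = spark.sqlContext
sqlContext.sql("SELECT 1 AS one").show()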
