Trouble adding the spark-csv package in the Cloudera VM - apache-spark

I am using the Cloudera quickstart VM to test out some pyspark work. For one task, I need to add the spark-csv package. Here is what I did:
PYSPARK_DRIVER_PYTHON=ipython pyspark -- packages com.databricks:spark-csv_2.10:1.3.0
pyspark started up fine, but I did get these warnings:
16/02/09 17:41:22 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
16/02/09 17:41:22 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/02/09 17:41:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Then I ran my code in pyspark:
yelp_df = sqlCtx.load(
    source="com.databricks.spark.csv",
    header='true',
    inferSchema='true',
    path='file:///directory/file.csv')
But I am getting an error message:
Py4JJavaError: An error occurred while calling o19.load.: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv at scala.sys.package$.error(package.scala:27)
What could have gone wrong? Thanks in advance for your help.

Try this
PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.3.0
Note that there is no space between -- and packages; the extra space in your command was the typo.
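With the corrected flag, pyspark resolves the package from Maven and the original sqlCtx.load call should then succeed. As a rough cross-check on Spark 1.4 or later, the equivalent DataFrameReader form is sketched below; the file path is just the asker's placeholder.
# Sketch only: assumes pyspark was started with
# --packages com.databricks:spark-csv_2.10:1.3.0 and that sqlCtx is the
# SQLContext the pyspark shell creates.
yelp_df = sqlCtx.read.format("com.databricks.spark.csv") \
    .options(header="true", inferSchema="true") \
    .load("file:///directory/file.csv")
yelp_df.printSchema()  # quick sanity check that the header and types were picked up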

Related

Spark accessing remote master

I'm trying to run my code locally in a Jupyter notebook and connect to a Spark cluster on my own server, but without success. This is what I've tried:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('SparkApp').setMaster('spark://X.X.X.123:7077')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
and this way
spark = SparkSession.builder.master("spark://X.X.X.123:7077").getOrCreate()
I received this error [updated]
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
New error after opening port 7077:
21/11/23 20:39:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/11/23 20:39:33 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 192.168.0.123:7077
With 'local' it works normally.

Pyspark - Failed to get main class in JAR with error 'File file:/home/xpto/spark/, does not exist'

I'm using pyspark to write into Kafka.
When I run the command:
bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10-assembly_2.12:3.0.1,org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.2 --jars /home/xpto/spark/jars/spark-streaming-kafka-0-10-assembly_2.12-3.0.1.jar , /home/xpto/spark/jars/spark-sql-kafka-0-10_2.11-2.0.2.jar , /home/xpto/spark/jars/kafka-clients-2.6.0.jar --verbose --master local[2] /home/xavy/Documents/PersonalProjects/Covid19Analysis/pyspark_job_to_write_data_to_kafkatopic.py
I'm receiving an error:
:: retrieving :: org.apache.spark#spark-submit-parent-ad9bf9ab-6d6d-4edd-bd1f-4b3145c2457f
confs: [default]
0 artifacts copied, 7 already retrieved (0kB/3ms)
20/11/22 18:35:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'File file:/home/xpto/spark/, does not exist'. Please specify one with --class.
at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:936)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:457)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I don't know which class Spark is asking for...
I'm running this locally on my PC; I'm not sure if this is the right way to do it.
Can someone help and point me in the right direction?
So, spacing matters: make sure you don't put spaces in your file paths.
For example, you've put this in your --jars list:
, /home/xpto/spark/jars/spark-sql-kafka-0-10_2.11-2.0.2.jar
It's not clear why you give local file paths when getting them from Maven should work fine. However, you need to use consistent Spark versions: you've mixed 3.x and 2.x, as well as Scala 2.12 and 2.11.
You also shouldn't need both spark-streaming-kafka and spark-sql-kafka.
Regarding the error, the syntax spark-submit thinks you've tried to use is for Java:
spark-submit [options] --class MainClass application.jar
For Python applications, you might want to use --py-files instead.
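As a rough sketch of the Maven route (not the asker's exact job), a single version-consistent connector can also be requested from inside the script via spark.jars.packages. The coordinate below assumes Spark 3.0.1 with Scala 2.12, and the broker address and topic name are made-up placeholders.
from pyspark.sql import SparkSession

# Sketch only: one Kafka connector pulled from Maven, matching the Spark and
# Scala versions of the installation, instead of a mix of local jars.
spark = (SparkSession.builder
    .appName("kafka-write-sketch")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1")
    .getOrCreate())

# Hypothetical data; replace with the DataFrame the real job produces.
df = spark.createDataFrame([("k1", "some value")], ["key", "value"])

(df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("topic", "mytopic")                           # hypothetical topic name
    .save())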

Why does submitting a Spark application to Mesos fail with 'Failed to load native Mesos library'?

I'm getting the following exception when I'm trying to submit a Spark application to a Mesos cluster:
/home/knoldus/application/spark-2.2.0-rc4/conf/spark-env.sh: line 40: export: `/usr/local/lib/libmesos.so': not a valid identifier
/home/knoldus/application/spark-2.2.0-rc4/conf/spark-env.sh: line 41: export: `hdfs://spark-2.2.0-bin-hadoop2.7.tgz': not a valid identifier
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/30 14:17:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/30 14:17:31 WARN Utils: Your hostname, knoldus resolves to a loopback address: 127.0.1.1; using 192.168.0.111 instead (on interface wlp6s0)
17/09/30 14:17:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Failed to load native Mesos library from
java.lang.UnsatisfiedLinkError: Expecting an absolute path of the library:
at java.lang.Runtime.load0(Runtime.java:806)
at java.lang.System.load(System.java:1086)
at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:159)
at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:188)
at org.apache.mesos.MesosSchedulerDriver.<clinit>(MesosSchedulerDriver.java:61)
at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils$class.createSchedulerDriver(MesosSchedulerUtils.scala:104)
at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.createSchedulerDriver(MesosCoarseGrainedSchedulerBackend.scala:49)
at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.start(MesosCoarseGrainedSchedulerBackend.scala:170)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:103)
... 47 elided
I have built spark using
./build/mvn -Pmesos -DskipTests clean package
I have set the following properties in spark-env.sh:
export MESOS_NATIVE_JAVA_LIBRARY= /usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI= hdfs://spark-2.2.0-bin-hadoop2.7.tgz
And in spark-defaults.conf:
spark.executor.uri hdfs://spark-2.2.0-bin-hadoop2.7.tgz
I have resolved the issue.
The problem is that there must be no space after the equals sign when exporting the path.
export MESOS_NATIVE_JAVA_LIBRARY= /usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI= hdfs://spark-2.2.0-bin-hadoop2.7.tgz
For example
export foo = bar
the shell will interpret that as a request to export three names: foo, = and bar. = isn't a valid variable name, so the command fails. The variable name, the equals sign, and its value must not be separated by spaces for them to be processed as a single assignment and export.
Remove the spaces.
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://spark-2.2.0-bin-hadoop2.7.tgz

Getting an error when running spark-shell in CDH 5.7

I am new to Spark and using CDH 5.7 to run it, but I am getting this error when I run spark-shell in the terminal. I have started all Cloudera services, including Spark, via Launch Cloudera Express. Please help.
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated. Type :help for more information.
16/07/13 02:14:53 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 192.168.44.133 instead (on interface eth1)
16/07/13 02:14:53 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/07/13 02:19:28 ERROR spark.SparkContext: Error initializing SparkContext. org.apache.hadoop.security.AccessControlException: Permission denied: user=cloudera, access=WRITE, inode="/user/spark/applicationHistory":spark:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:281)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:262)

Can't access Hadoop with IP address

I'm following this guide for installing Hadoop on CentOS.
Everything works normally when I run Hadoop, and my setup matches the guide, but when I try to access it by IP address, e.g. 192.168.0.1:50070, nothing works.
Here is the output when I run it:
bash-4.2$ start-dfs.sh
14/10/15 16:28:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-secondarynamenode-localhost.localdomain.out
14/10/15 16:29:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Do you think I have to configure the IP somewhere to access them? My configuration is exactly the same as in the linked guide, even the XML files...
Did you try to disable the firewall or add an iptables rule for it on the master/slaves?
For CentOS 6.5, try:
service iptables stop
to disable the firewall. If everything works after that, you just need to add the appropriate allow rules to your iptables configuration.
Also, CentOS has SELinux. I would advise turning it off and checking whether the error persists.
