Spark accessing remote master - apache-spark

I'm trying to run my code in jupyter notebook locally, to access a spark cluster on my own server, but without success, so that's the code
I've tried this
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('SparkApp').setMaster('spark://X.X.X.123:7077')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
and this way
spark = SparkSession.builder.master("spark://X.X.X.123:7077").getOrCreate()
I received this error [updated]
### Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
###: java.lang.NullPointerException
new error after open port 7077
21/11/23 20:39:32 WARN NativeCodeLoader: Unable to load native-
hadoop library for your platform... using builtin-java classes
where applicable
21/11/23 20:39:33 WARN StandaloneAppClient$ClientEndpoint: Failed
to connect to master 192.168.0.123:7077
with 'local' work normally

Related

Local spark submit does not return any results

I am trying to the below script locally using spark submit it does not return. The same code works in spark shell. Not sure what am I missing?
Spark Submit
./bin/spark-submit \
--master local\
~/Desktop/projects/S3_Snowflake_Prototype/main.py
Code: main.py
spark = SparkSession\
.builder\
.appName("PythonPi")\
.getOrCreate()
df = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("~/Desktop/projects/S3_Snowflake_Prototype/csv/source.csv")
df.count()
Current Output
21/08/03 08:36:48 WARN Utils: Your hostname, vinays-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.0.3 instead (on interface en0)
21/08/03 08:36:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/08/03 08:36:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/08/03 08:36:49 INFO ShutdownHookManager: Shutdown hook called
21/08/03 08:36:49 INFO ShutdownHookManager: Deleting directory /private/var/folders/_l/r0yqws8j5hl5bsc5rvzjkm5c0000gn/T/spark-a3ea7970-7ef7-4edd-a539-f5c1f264b59d

Exception in thread "main" java.lang.IllegalStateException: Cannot retrieve files with 'spark' scheme without an active SparkEnv

I'm very new to spark and cassandra, got one sample from github and tried to run the application from the below link
spark-on-cassandra-quickstart
After jar file generated, Tried executing with the below syntax
C:\Users\user\Desktop\softwares\spark-2.4.3-bin-hadoop2.7\spark-2.4.3-bin-hadoop2.7\bin>spark-submit --class com.github.boneill42.JavaDemo --master spark://localhost:7077
C:\Users\user\git\spark-on-cassandra-quickstart\target/spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar spark://localhost:7077 localhost
Below is the issue I'm facing
19/06/08 22:59:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.IllegalStateException: Cannot retrieve files with 'spark' scheme without an active SparkEnv.
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:690)
at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:137)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:366)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Please help me in resolving the issue
In your case, It seems you want to start in standalone mode
spark://HOST:PORT Connect to the given Spark standalone cluster master.
The port must be whichever one your master is configured to use, which is 7077 by default.
Do you start spark master and worker first ?
launch master
./sbin/start-master.sh
launch worker
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -c 1 -m 512M
After start master and worker, then you can submit your job again.

How to add hbase-site.xml config file using spark-shell

I have the following simple code:
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.HBaseConfiguration
val hbaseconfLog = HBaseConfiguration.create()
val connectionLog = ConnectionFactory.createConnection(hbaseconfLog)
Which I'm running on spark-shell, and I'm getting the following error:
14:23:42 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected
error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:30)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
Many these errors actually, and a few of these every now and then:
14:23:46 WARN client.ZooKeeperRegistry: Can't retrieve clusterId from
Zookeeper org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
Through Cloudera's VM I'm able to solve this by simply restarting the hbase-master, regionserver and thrift, but here in my company I'm not allowed to do it, I also solved it once by copying the file hbase-site.xml to spark conf directory but I can't to it either, is there a way to set the path for this specific file in the spark-shell parameters?
1) make sure that your zookeeper is running
2) need to copy hbase-site.xml to /etc/spark/conf folder just like we copy hive-site.xml to /etc/spark/conf to access the Hive tables.
3) export SPARK_CLASSPATH=/a/b/c/hbase-site.xml;/d/e/f/hive-site.xml
just as described in hortonworks forum.. like this
or
open spark-shell with out adding hbase-site.xml
3 commands to execute in spark-shell
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/home/spark/development/hbase/conf/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, table_name)

Spark submit throws error while using Hive tables

i have a strange error, i am trying to write data to hive, it works well in spark-shell, but while i am using spark-submit, it throwing database/table not found in default error.
Following is the coding i am trying to write in spark-submit , i am using custom build of spark 2.0.0
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.table("spark_schema.iris_ori")
Following is the command i am using,
/home/ec2-user/Spark_Source_Code/spark/bin/spark-submit --class TreeClassifiersModels --master local[*] /home/ec2-user/Spark_Snapshots/Spark_2.6/TreeClassifiersModels/target/scala-2.11/treeclassifiersmodels_2.11-1.0.3.jar /user/ec2-user/Input_Files/defPath/iris_spark SPECIES~LBL+PETAL_LENGTH+PETAL_WIDTH RAN_FOREST 0.7 123 12
Following is the Error,
16/05/20 09:05:18 INFO SparkSqlParser: Parsing command: spark_schema.measures_20160520090502
Exception in thread "main" org.apache.spark.sql.AnalysisException: Database 'spark_schema' does not exist;
at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:37)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.tableExists(InMemoryCatalog.scala:195)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:360)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:464)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:458)
at TreeClassifiersModels$.main(TreeClassifiersModels.scala:71)
at TreeClassifiersModels.main(TreeClassifiersModels.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:726)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The issue was because of the deprecation happened on Spark Version 2.0.0. Hive Context was deprecated in Spark 2.0.0. To read/Write Hive tables on Spark 2.0.0 we need to use Spark session as follows.
val sparkSession = SparkSession.withHiveSupport(sc)

trouble in adding spark-csv package in Cloudera VM

I am using Cloudera quickstart VM to test out some pyspark work. For one task, I need to add spark-csv package. And here is what I did:
PYSPARK_DRIVER_PYTHON=ipython pyspark -- packages com.databricks:spark-csv_2.10:1.3.0
pyspark started up fine, however I did get warnings as:
**16/02/09 17:41:22 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
16/02/09 17:41:22 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/02/09 17:41:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable**
then I ran my code in pyspark:
yelp_df = sqlCtx.load(
source="com.databricks.spark.csv",
header = 'true',
inferSchema = 'true',
path = 'file:///directory/file.csv')
But I am getting an error message:
Py4JJavaError: An error occurred while calling o19.load.: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv at scala.sys.package$.error(package.scala:27)
What could have gone wrong?? Thanks in advance for your help.
Try this
PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.3.0
Without the space, there's a typo.

Resources