I have just started with Spark. I have CDH5 installed with Spark. However, when I try to use the SparkContext it gives the error below:
<console>:17: error: not found: value sc
val distdata = sc.parallelize(data)
I have researched this error (error: not found: value sc) and tried to start the Spark context with ./spark-shell, but that gives the error No such file or directory.
You can start spark-shell with ./spark-shell if you're in its directory, or with path/to/spark-shell if you're elsewhere.
Also, if you're running a script with spark-submit, you need to initialize sc as a SparkContext first:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
There is another Stack Overflow post that answers this question by getting sc (the SparkContext) from the SparkSession. I do it this way:
val spark = SparkSession.builder().appName("app_name").enableHiveSupport().getOrCreate()
val sc = spark.sparkContext
original answer here:
Retrieve SparkContext from SparkSession
Add Spark's bin directory to your PATH; then you can use spark-shell from anywhere.
If you are creating the context in a spark-submit job, add import org.apache.spark.SparkContext and create it with:
val sc = new SparkContext(conf)
where conf is already defined.
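As a minimal sketch of the PATH change (assuming Spark is unpacked at /opt/spark; adjust to your actual install location), added to something like ~/.bashrc:

```shell
# Hypothetical install location; replace with your actual Spark directory.
export SPARK_HOME=/opt/spark
# Put Spark's launcher scripts (spark-shell, spark-submit, pyspark) on PATH.
export PATH="$SPARK_HOME/bin:$PATH"
```

After sourcing the file (or opening a new terminal), spark-shell should resolve from any directory.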
Starting a new terminal fixed the problem in my case.
You need to run the Hadoop daemons first (run start-all.sh). Then try again.
As a workaround, you can also run this at the Spark (Scala) prompt, though allowing multiple contexts is generally discouraged:
conf.set("spark.driver.allowMultipleContexts", "true")
I have to run a Python script on an EMR instance using PySpark to query DynamoDB. I am able to do that by including the required jars when launching pyspark with the following command:
`pyspark --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar`
I ran the following Python 3 script to query the data using the pyspark Python module:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
start_time = time.time()
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://nn1:9083")
sparkSession = (SparkSession
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    .enableHiveSupport()
    .getOrCreate())
df_load = sparkSession.sql("SELECT * FROM example")
df_load.show()
print(time.time() - start_time)
This caused the following runtime exception for the missing jars:
java.lang.ClassNotFoundException Class org.apache.hadoop.hive.dynamodb.DynamoDBSerDe not found
How do I convert the pyspark --jars .. invocation to a Pythonic equivalent?
So far I have tried copying the jars from /usr/share/... to $SPARK_HOME/libs/jars and adding that path to the spark-defaults.conf external classpath, but that had no effect.
Use the spark-submit command to execute your Python script. Example:
spark-submit --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar script.py
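As an in-code alternative, here is a sketch (assuming the jar paths from the question and that pyspark is importable on the driver): the spark.jars property accepts the same comma-separated list that --jars does, as long as it is set before the session is created.

```python
# Jar locations taken from the question; adjust for your EMR release.
DDB_JARS = ",".join([
    "/usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar",
    "/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar",
])

def make_session():
    # Requires pyspark and a working cluster; defined here but not invoked.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName("ddb-query")
            .config("spark.jars", DDB_JARS)  # in-code equivalent of --jars
            .enableHiveSupport()
            .getOrCreate())
```

The key point is that spark.jars must be set on the builder before getOrCreate(); jar settings applied after the session exists are ignored.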
My SparkSession takes forever to initialize
from pyspark.sql import SparkSession
spark = (SparkSession
    .builder
    .appName('Huy')
    .getOrCreate())
sc = spark.sparkContext
I waited for hours without success.
I got the same problem. I resolved it by setting the environment variables; they can be set directly in Python code. You need a JDK installed (here under Program Files):
import os
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-19"
os.environ["SPARK_HOME"] = r"C:\Program Files\Spark\spark-3.3.1-bin-hadoop2"
When I try to initialize a SparkContext with a SparkConf as below:
from pyspark import *
from pyspark.streaming import *
cfg = SparkConf().setMaster('yarn').setAppName('MyApp')
sc = SparkContext(conf=cfg)
print(sc.getConf().getAll())
rdd = sc.parallelize(list('abcdefg')).map(lambda x:(x,1))
print(rdd.collect())
The output shows that it does not run on YARN:
[(u'spark.master', u'local[10]'), ...]
It used the config in $SPARK_HOME/conf/spark-defaults.conf:
spark.master local[10]
My environment: Python 2.7.2, Spark 2.1.0.
When I run the same code on Spark 2.0.2, SparkConf() works as expected.
So is it really a bug?
To use YARN, you should specify the deploy mode, i.e. whether the driver runs on the machine that submits the job or on one of the cluster's worker nodes.
yarn-client runs the driver on the submitting (client) machine:
SparkConf().setMaster('yarn-client')
yarn-cluster runs the driver on one of the worker nodes:
SparkConf().setMaster('yarn-cluster')
Here is an example for running in yarn-client mode.
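Note that in Spark 2.x the yarn-client and yarn-cluster master strings are deprecated; the equivalent settings split the master from the deploy mode. A sketch of the property names only (to be passed via SparkConf or spark-defaults.conf):

```python
# Equivalent of yarn-client mode, expressed in Spark 2.x terms.
YARN_CLIENT_CONF = {
    "spark.master": "yarn",
    "spark.submit.deployMode": "client",  # use "cluster" for yarn-cluster
}
```

These map directly onto SparkConf().setMaster("yarn").set("spark.submit.deployMode", "client").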
I get this error when I try to run my application using Spark 2.0. I tried downloading the package from https://github.com/spark-packages/dstream-mqtt, but the repositories don't exist. I also tried searching for the package at https://spark-packages.org/ , but couldn't find it. My program is very simple:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from dstream_mqtt import MQTTUtils
#from pyspark.streaming.mqtt import MQTTUtils
sc = SparkContext()
ssc = StreamingContext(sc, 6)
mqttStream = MQTTUtils.createStream(ssc,"tcp://192.168.4.54:1883","/test")
mqttStream.pprint()
mqttStream.saveAsTextFiles("test/status", "txt")
ssc.start()
ssc.awaitTermination()
ssc.stop()
I have downloaded and tried including the jar files spark-streaming-mqtt-assembly_2.11-1.6.2.jar and spark-streaming-mqtt_2.11-1.6.2.jar, but that does not help.
The same code and packages just work fine with Spark 1.6.
Any help will be appreciated.
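One possibility worth trying (an assumption on my part, not verified against your cluster): in Spark 2.x the MQTT connector was moved out of Spark itself into Apache Bahir, so pulling it with --packages may work where the old 1.6 jars do not:

```shell
# Bahir's MQTT DStream connector; the version here is an assumption --
# pick the Bahir release matching your Spark 2.x line.
spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 your_script.py
```

This resolves the connector from Maven Central at submit time, so the driver machine needs network access.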
I am running Spark 1.2 on CDH 5.3 and trying some simple code in spark-shell.
It fails on
val sqlContext = new SQLContext(sc)
with the error:
not found : type SQLContext
What is wrong with my environment?
Make sure you import it:
import org.apache.spark.sql.SQLContext