spark-shell commands throwing error: “error: not found: value spark”

Problem screenshot (error text transcribed below):
<console>:14: error: not found: value spark
import spark.implicits._
       ^
<console>:14: error: not found: value spark
import spark.sql
       ^
Here is my environment configuration. I have tried many different settings, but I keep getting this error. Does anyone know the reason? I saw a similar question, but the answers did not solve my problem.
JAVA_HOME : C:\Program Files\Java\jdk1.8.0_51
HADOOP_HOME : C:\Hadoop\winutils-master\hadoop-2.7.1
SPARK_HOME : C:\Hadoop\spark-2.2.0-bin-hadoop2.7
PATH :%JAVA_HOME%\bin;%SCALA_HOME%\bin;%HADOOP_HOME%\bin;%SPARK_HOME%\bin;
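One thing worth checking: the PATH above references %SCALA_HOME%, but no SCALA_HOME variable appears in the configuration listed. More generally, "not found: value spark" means spark-shell failed to create the SparkSession, and the first exception further up in the console output names the real cause. On a Windows setup like this one, a frequent culprit is the permissions of the tmp\hive scratch directory, which winutils can repair (a hedged suggestion, assuming the default C:\tmp\hive location):
C:\Hadoop\winutils-master\hadoop-2.7.1\bin\winutils.exe chmod -R 777 C:\tmp\hive
After that, restarting spark-shell should either define spark or surface a more specific error.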

Related

pyspark & hadoop mismatch in jupyter notebook

In Google Colab, I'm able to successfully set up a connection to S3 using pyspark and hadoop using the code below. When I try running the code in Jupyter Notebook, I get the error:
Py4JJavaError: An error occurred while calling o46.parquet. :
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
From what I've found on Stack Overflow, it seems this error occurs when the Hadoop and PySpark versions do not match, but in my setup I've specified that both are version 3.1.2.
Can someone tell me why I am getting this error and what I need to change for it to work? Code below:
In[1]
# Download AWS SDK libs
! rm -rf aws-jars
! mkdir -p aws-jars
! wget -c 'https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar'
! wget -c 'https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.2/hadoop-aws-3.1.2.jar'
! mv *.jar aws-jars
In[2]
! pip install pyspark==3.1.2
! pip install pyspark[sql]==3.1.2
In[3]
from pyspark.sql import SparkSession
AWS_JARS = '/content/aws-jars'
AWS_CLASSPATH = "{0}/hadoop-aws-3.1.2.jar:{0}/aws-java-sdk-bundle-1.11.271.jar".format(AWS_JARS)
spark = SparkSession.\
builder.\
appName("parquet").\
config("spark.driver.extraClassPath", AWS_CLASSPATH).\
config("spark.executor.extraClassPath", AWS_CLASSPATH).\
getOrCreate()
AWS_KEY_ID = "YOUR_KEY_HERE"
AWS_SECRET = "YOUR_SECRET_KEY_HERE"
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY_ID)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
In[4]
dataset1 = spark.read.parquet("s3a://bucket/dataset1/path.parquet")
dataset1.createOrReplaceTempView("dataset1")
dataset2 = spark.read.parquet("s3a://bucket/dataset2/path.parquet")
dataset2.createOrReplaceTempView("dataset2")
After In[4] is run, I receive the error:
Py4JJavaError: An error occurred while calling o46.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
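A hedged observation that may explain the mismatch: the pip-installed pyspark==3.1.2 wheel bundles Hadoop 3.2.0 client jars rather than Hadoop 3.1.2, so hadoop-aws-3.1.2.jar does not actually match the Hadoop classes on the classpath. A sketch of an alternative setup (not verified against this exact notebook) lets Spark resolve a matching hadoop-aws and its AWS SDK dependency itself:
from pyspark.sql import SparkSession

# Sketch: match hadoop-aws to the Hadoop version bundled with the pyspark
# wheel (3.2.0 for pyspark 3.1.2) and let Spark download it, together with
# its transitive aws-java-sdk-bundle dependency, at session start-up.
spark = (
    SparkSession.builder
    .appName("parquet")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    .getOrCreate()
)
With spark.jars.packages in place, the manual wget/classpath cells (In[1] and the extraClassPath configs) become unnecessary; the fs.s3a.* credential settings stay the same.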

An error occurred while calling o137.partitions. : org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://ip

I am trying to execute this GitHub project on an AWS EMR Spark cluster:
https://github.com/pran4ajith/spark-twitter-streaming.git
I've succeeded in running the first two scripts:
tweet_stream_producer.py
sparkml_train_model.py
But when I run the consumer part with the command
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0,io.delta:delta-core_2.12:0.7.0 tweet_stream_consumer.py
I get a file path error:
Py4JJavaError: An error occurred while calling o137.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://ip-10-0-0-61.ec2.internal:8020/home/hadoop/spark-twitter-streaming/TwitterStreaming/src/app/models/metadata
It seems that the problem lies in the mapping between the local file system path and the Hadoop file system path:
model_path = str(SRC_DIR / 'models')
pipeline_model = PipelineModel.load(model_path)
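One plausible fix (a sketch; SRC_DIR is defined elsewhere in the project): on EMR, unqualified paths are resolved against the default file system, HDFS, so a model saved on the driver's local disk needs an explicit file:// scheme:
from pyspark.ml import PipelineModel

# Sketch: qualify the path so Spark reads from the local file system
# instead of resolving it against HDFS (the default on EMR).
model_path = 'file://' + str(SRC_DIR / 'models')
pipeline_model = PipelineModel.load(model_path)
Note that on a multi-node cluster every executor must be able to see that local path; copying the model directory to HDFS or S3 and loading it from there is the more robust option.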

Unable to save model in Apache Spark -- Py4JJavaError

We're getting an error while trying to save a model:
model.save('DT')
Py4JJavaError: An error occurred while calling o822.save.
: org.apache.spark.SparkException: Job aborted.
Complete Error Stack --> http://dpaste.com/16Y07B9
Anything we missed here? It is creating the folder but not writing anything.
OS: Windows 10
TIA
It turns out I was using Spark 3.0.0-preview and ran into trouble. Switching to 2.4.5 resolved it.
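If the install is pip-managed, pinning the stable release is one way to apply that fix (an illustration; the asker may have used a standalone Spark download instead):
pip install pyspark==2.4.5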

Spark-shell is not working

When I run the spark-shell command, I see the following error:
# spark-shell
> SPARK_MAJOR_VERSION is set to 2, using Spark2
File "/usr/bin/hdp-select", line 249
print "Packages:"
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Packages:")?
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
at org.apache.spark.launcher.Main.main(Main.java:118)
The problem is that the HDP script /usr/bin/hdp-select is apparently run under Python 3, even though it contains incompatible Python 2-specific code.
You may port /usr/bin/hdp-select to Python 3 (a condensed sketch of the edits follows this list) by:
adding parentheses to the print statements
replacing the line "packages.sort()" with "packages = sorted(packages)"
replacing the line "os.mkdir(current, 0755)" with "os.mkdir(current, 0o755)"
You may also try to force HDP to run /usr/bin/hdp-select under Python2:
PYSPARK_DRIVER_PYTHON=python2 PYSPARK_PYTHON=python2 spark-shell
I had the same problem: setting HDP_VERSION before running Spark fixed it.
export HDP_VERSION=<your HDP version>
spark-shell

Creating a DataFrame in PySpark gives py4j error

I am working in a Windows 7 environment, running my code from the Windows command prompt. I am running a very simple piece of code right now:
data = [('Alice', 1), ('Bob', 2)]
df = sqlContext.createDataFrame(data)
Which gives me the errors
py4j.protocol.Py4JJavaError: An error occurred while calling o23.applySchemaToPythonRDD.
: java.lang.RuntimeException: java.lang.RuntimeException: Error while running command to get file permissions : ExitCodeException exitCode=-1073741515:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
at org.apache.hadoop.util.Shell.run(Shell.java:479)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.
There is much more error output following, but the actual error is the first line. I've looked up this error in other posts, but they don't concern actually creating a DataFrame.
I looked at the runtime exception as well and saw there was an error trying to get file permissions. I tried running the command prompt in administrator mode instead, but it didn't help.
Does anyone have any ideas about what could be causing this?
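For what it's worth, exit code -1073741515 is 0xC0000135 (STATUS_DLL_NOT_FOUND) on Windows, which typically means winutils.exe is missing a DLL it needs, often the Microsoft Visual C++ runtime, rather than anything being wrong with the DataFrame code itself. For anyone reproducing this, here is a self-contained version of the snippet (a sketch assuming Spark 2.x, where SparkSession supersedes the older SQLContext entry point):
from pyspark.sql import SparkSession

# Build the session explicitly instead of relying on a shell-provided
# sqlContext; the underlying file-permissions error, if any, will still
# surface at createDataFrame time.
spark = SparkSession.builder.master("local[*]").appName("example").getOrCreate()
data = [('Alice', 1), ('Bob', 2)]
df = spark.createDataFrame(data, ['name', 'value'])
df.show()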
