pyspark & hadoop mismatch in jupyter notebook - apache-spark

In Google Colab, I can successfully set up a connection to S3 with pyspark and hadoop using the code below. When I run the same code in a Jupyter Notebook, I get the error:
Py4JJavaError: An error occurred while calling o46.parquet. :
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
From what I've found on Stack Overflow, this error seems to occur when the hadoop and pyspark versions are not matched, but in my setup I've specified version 3.1.2 for both.
Can someone tell me why I am getting this error and what I need to change for it to work? Code below:
In[1]
# Download AWS SDK libs
! rm -rf aws-jars
! mkdir -p aws-jars
! wget -c 'https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar'
! wget -c 'https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.2/hadoop-aws-3.1.2.jar'
! mv *.jar aws-jars
In[2]
! pip install pyspark==3.1.2
! pip install pyspark[sql]==3.1.2
In[3]
from pyspark.sql import SparkSession
AWS_JARS = '/content/aws-jars'
AWS_CLASSPATH = "{0}/hadoop-aws-3.1.2.jar:{0}/aws-java-sdk-bundle-1.11.271.jar".format(AWS_JARS)
spark = SparkSession.\
    builder.\
    appName("parquet").\
    config("spark.driver.extraClassPath", AWS_CLASSPATH).\
    config("spark.executor.extraClassPath", AWS_CLASSPATH).\
    getOrCreate()
AWS_KEY_ID = "YOUR_KEY_HERE"
AWS_SECRET = "YOUR_SECRET_KEY_HERE"
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY_ID)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
In[4]
dataset1 = spark.read.parquet("s3a://bucket/dataset1/path.parquet")
dataset1.createOrReplaceTempView("dataset1")
dataset2 = spark.read.parquet("s3a://bucket/dataset2/path.parquet")
dataset2.createOrReplaceTempView("dataset2")
After In[4] is run, I receive the error:
Py4JJavaError: An error occurred while calling o46.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
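One sanity check grounded in the version-matching point above (not a confirmed fix): the pip-installed pyspark wheel bundles its own Hadoop client jars, and that bundled Hadoop version does not necessarily equal the pyspark version number, so the hadoop-aws jar needs to match the bundled Hadoop version rather than 3.1.2. A minimal sketch for printing both versions from the running session:
from pyspark.sql import SparkSession

# Start (or reuse) a session and ask the JVM which Hadoop client it actually bundles.
spark = SparkSession.builder.appName("version-check").getOrCreate()
print("Spark version: ", spark.version)
print("Hadoop version:", spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())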

Related

Getting scalatest to run from databricks

What's missing in order to get this simple scalatest example working in Databricks? (mostly as per their user guide)
I have a cluster into which I've uploaded the following latest scalatest jar and restarted the cluster:
https://oss.sonatype.org/content/groups/public/org/scalatest/scalatest_2.13/3.2.13/scalatest_2.13-3.2.13.jar
Following: https://www.scalatest.org/user_guide/using_the_scalatest_shell
I'm running their sample code:
import org.scalatest._
class ArithmeticSuite extends AnyFunSuite with matchers.should.Matchers {
  test("addition works") {
    1 + 1 should equal (2)
  }
}
run(new ArithmeticSuite)
The errors start with:
command-1362189841383727:3: error: not found: type AnyFunSuite
class ArithmeticSuite extends AnyFunSuite with matchers.should.Matchers {
^
command-1362189841383727:3: error: object should is not a member of package org.scalatest.matchers
class ArithmeticSuite extends AnyFunSuite with matchers.should.Matchers {
Databricks clusters use Scala 2.12, while you have installed a library compiled for Scala 2.13, which is binary incompatible with the Databricks cluster. Take the version built for Scala 2.12 (with the _2.12 suffix).
P.S. scalatest may already be part of the Databricks Runtime - check by running %sh ls -ls /databricks/jars/*scalatest*
We needed to use the scalatest test structure for 3.0.8
(https://www.scalatest.org/scaladoc/3.0.8/org/scalatest/FeatureSpec.html)
with
<artifactId>scalatest_2.12</artifactId>
<version>3.0.8</version>
to match the version 3.0.8 included with the Databricks runtime.
This resolved our errors.

Read XML file in Jupyter notebook

I tried to read an XML file in a Jupyter notebook and got the error below.
import os
from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.13:0.4.1 pyspark-shell'
df = spark.read.format('xml').option('rowTag', 'row').load('test.xml')
The error:
Py4JJavaError: An error occurred while calling o251.load.
: java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
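For context on the configuration shown above: PYSPARK_SUBMIT_ARGS is only read when the JVM behind the SparkSession is launched, so it has to be set before the session is created. A minimal sketch of declaring the package at session-creation time instead (the coordinate below is only an illustration and must match your Spark and Scala versions):
from pyspark.sql import SparkSession

# Declare the spark-xml package before the session (and its JVM) is created.
spark = (SparkSession.builder
         .appName("read-xml")
         .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.13.0")
         .getOrCreate())

df = spark.read.format("xml").option("rowTag", "row").load("test.xml")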

Reading file from s3 in pyspark using org.apache.hadoop:hadoop-aws

I'm trying to read files from S3 using hadoop-aws. The command used to run the code is shown below.
Please help me resolve this and understand what I am doing wrong.
# run using command
# time spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.1 connect_s3_using_keys.py
from pyspark import SparkContext, SparkConf
import ConfigParser
import pyspark
# create Spark context with Spark configuration
conf = SparkConf().setAppName("Deepak_1ST_job")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
hadoop_conf = sc._jsc.hadoopConfiguration()
config = ConfigParser.ConfigParser()
config.read("/home/deepak/Desktop/secure/awsCred.cnf")
accessKeyId = config.get("aws_keys", "access_key")
secretAccessKey = config.get("aws_keys", "secret_key")
hadoop_conf.set(
    "fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs3a.access.key", accessKeyId)
hadoop_conf.set("s3a.secret.key", secretAccessKey)
sqlContext = pyspark.SQLContext(sc)
df = sqlContext.read.json("s3a://bucket_name/logs/20191117log.json")
df.show()
EDIT 1:
As I am new to pyspark, I am unaware of these dependencies, and the error is not easy to understand.
I am getting this error:
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 98, in deco
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.json.
: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:816)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:792)
at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:747)
at org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.
I had the same issue with spark 3.0.0 / hadoop 3.2.
What worked for me was to replace the hadoop-aws-3.2.1.jar in spark-3.0.0-bin-hadoop3.2/jars with hadoop-aws-3.2.0.jar found here: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.2.0
Check your Spark guava jar version. If you downloaded Spark from Amazon like me, from the link (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz) in their documentation, you can see that the included guava version is guava-14.0.1.jar while their container is using guava-21.0.jar.
I have reported the issue to them and they will repack their Spark to include the correct version. If you are interested in the bug itself, here is the link: https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#ClassNotFoundException:_org.apache.hadoop.fs.s3a.S3AFileSystem
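A quick way to see which hadoop-aws, AWS SDK, and guava jars a Spark distribution actually ships, sketched here under the assumption that SPARK_HOME points at the unpacked distribution:
import glob
import os

# List the bundled jar versions that matter for S3A access.
spark_home = os.environ["SPARK_HOME"]
for pattern in ("hadoop-aws-*.jar", "aws-java-sdk*.jar", "guava-*.jar"):
    for jar in sorted(glob.glob(os.path.join(spark_home, "jars", pattern))):
        print(os.path.basename(jar))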

Spark-shell is not working

When I submit the spark-shell command, I see the following error:
# spark-shell
> SPARK_MAJOR_VERSION is set to 2, using Spark2
File "/usr/bin/hdp-select", line 249
print "Packages:"
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Packages:")?
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
at org.apache.spark.launcher.Main.main(Main.java:118)
The problem is that the HDP script /usr/bin/hdp-select is apparently being run under Python 3, whereas it contains incompatible Python 2-specific code.
You may port /usr/bin/hdp-select to Python 3 (a short sketch follows below) by:
adding parentheses to the print statements
replacing the line "packages.sort()" with "list(packages).sort()"
replacing the line "os.mkdir(current, 0755)" with "os.mkdir(current, 0o755)"
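For reference, the print and octal-mode changes look roughly like this (stand-in values only; this is not the actual hdp-select source, and the sort() change depends on how the script builds its package collection):
import os

# Python 2: print "Packages:"  ->  Python 3 requires parentheses
print("Packages:")

# Python 2: os.mkdir(current, 0755)  ->  Python 3 octal literals need the 0o prefix
current = "hdp-select-demo"          # stand-in path for this sketch
if not os.path.isdir(current):
    os.mkdir(current, 0o755)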
You may also try to force HDP to run /usr/bin/hdp-select under Python2:
PYSPARK_DRIVER_PYTHON=python2 PYSPARK_PYTHON=python2 spark-shell
I had the same problem: I set HDP_VERSION before running spark.
export HDP_VERSION=<your hadoop version>
spark-shell

Cannot get a hello world running in Apache Spark on Ubuntu

I am trying to set up Apache Spark on Ubuntu and output hello world.
I added these two lines to .bashrc
export SPARK_HOME=/home/james/spark
export PATH=$PATH:$SPARK_HOME/bin
(Where spark is a symbolic link to the current spark version I have).
I run
spark-shell
from the command line, and the shell starts up with hundreds of errors, for example:
Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test
connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app)
Caused by: ERROR XJ041: Failed to create database 'metastore_db', see the next exception for details.
Caused by: ERROR XBM0A: The database directory '/home/james/metastore_db' exists. However, it does not contain the expected 'service.properties' file. Perhaps Derby was brought down in the middle of creating this database. You may want to delete this directory and try creating the database again.
Caused by: java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
Caused by: org.apache.derby.iapi.error.StandardException: Failed to create database 'metastore_db', see the next exception for details.
My Java version:
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
I try to open a simple text file with scala:
scala> val textFile = sc.TextFile("test/README.md")
<console>:17: error: not found: value sc
I read I need to create a new SparkContext. So I try this:
scala> sc = SparkContext(appName = "foo")
<console>:19: error: not found: value sc
val $ires1 = sc
^
<console>:17: error: not found: value sc
sc = SparkContext(appName = "foo")
And this:
scala> val sc = new SparkContext(conf)
<console>:17: error: not found: type SparkContext
val sc = new SparkContext(conf)
^
<console>:17: error: not found: value conf
val sc = new SparkContext(conf)
I have no idea what is going on. All I need is the correct setup, and then there should be no problems.
Thank you in advance.
Try deleting the metastore_db folder and the derby.log file. I had the same problem, and deleting those fixed it. I'd assume there were some corrupted files in the db folder due to Spark not closing properly.
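If you'd rather do that cleanup from Python than by hand, here is a tiny sketch (it assumes metastore_db and derby.log sit in the directory you launch spark-shell from):
import os
import shutil

# Remove the leftover Derby metastore artifacts mentioned above.
for leftover in ("metastore_db", "derby.log"):
    if os.path.isdir(leftover):
        shutil.rmtree(leftover)
    elif os.path.isfile(leftover):
        os.remove(leftover)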
