Cannot get a hello world running in Apache Spark on Ubuntu - apache-spark

I am trying to set up Apache Spark on Ubuntu and print "hello world".
I added these two lines to .bashrc:
export SPARK_HOME=/home/james/spark
export PATH=$PATH:$SPARK_HOME/bin
(where spark is a symbolic link to the Spark version I currently have installed).
I run
spark-shell
from the command line, and the shell starts up with hundreds of errors, for example:
Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app)
Caused by: ERROR XJ041: Failed to create database 'metastore_db', see the next exception for details.
Caused by: ERROR XBM0A: The database directory '/home/james/metastore_db' exists. However, it does not contain the expected 'service.properties' file. Perhaps Derby was brought down in the middle of creating this database. You may want to delete this directory and try creating the database again.
Caused by: java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
Caused by: org.apache.derby.iapi.error.StandardException: Failed to create database 'metastore_db', see the next exception for details.
My Java version:
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
I try to open a simple text file with Scala:
scala> val textFile = sc.TextFile("test/README.md")
<console>:17: error: not found: value sc
I read I need to create a new SparkContext. So I try this:
scala> sc = SparkContext(appName = "foo")
<console>:19: error: not found: value sc
val $ires1 = sc
^
<console>:17: error: not found: value sc
sc = SparkContext(appName = "foo")
And this:
scala> val sc = new SparkContext(conf)
<console>:17: error: not found: type SparkContext
val sc = new SparkContext(conf)
^
<console>:17: error: not found: value conf
val sc = new SparkContext(conf)
I have no idea what is going on. All I need is a correct setup; after that there should be no problems.
Thank you in advance.

Try deleting the metastore_db folder and the derby.log file. I had the same problem and deleting them fixed it. I assume some files in the metastore_db folder were corrupted because Spark did not shut down properly.
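In case it helps, here is a minimal Python sketch of that cleanup. The helper name remove_derby_leftovers is made up for illustration, and it assumes the leftovers sit directly in the directory you pass in (the error above reports /home/james/metastore_db, so that would be /home/james here). Stop spark-shell before running it.

```python
import shutil
from pathlib import Path


def remove_derby_leftovers(work_dir):
    """Delete metastore_db/ and derby.log under work_dir; return the names removed."""
    work_dir = Path(work_dir)
    removed = []

    # The embedded Derby metastore is a directory tree.
    metastore = work_dir / "metastore_db"
    if metastore.is_dir():
        shutil.rmtree(metastore)
        removed.append(metastore.name)

    # derby.log is a plain file written next to it.
    derby_log = work_dir / "derby.log"
    if derby_log.is_file():
        derby_log.unlink()
        removed.append(derby_log.name)

    return removed
```

Calling remove_derby_leftovers("/home/james") and then starting spark-shell again from that directory should let Derby recreate the database from scratch.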

Related

Reading file from s3 in pyspark using org.apache.hadoop:hadoop-aws

I am trying to read files from S3 using hadoop-aws. The command used to run the code is shown in the comment below.
Please help me resolve this and understand what I am doing wrong.
# run using command
# time spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.1 connect_s3_using_keys.py
from pyspark import SparkContext, SparkConf
import ConfigParser
import pyspark
# create Spark context with Spark configuration
conf = SparkConf().setAppName("Deepak_1ST_job")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
hadoop_conf = sc._jsc.hadoopConfiguration()
config = ConfigParser.ConfigParser()
config.read("/home/deepak/Desktop/secure/awsCred.cnf")
accessKeyId = config.get("aws_keys", "access_key")
secretAccessKey = config.get("aws_keys", "secret_key")
hadoop_conf.set(
"fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", accessKeyId)
hadoop_conf.set("fs.s3a.secret.key", secretAccessKey)
sqlContext = pyspark.SQLContext(sc)
df = sqlContext.read.json("s3a://bucket_name/logs/20191117log.json")
df.show()
EDIT 1:
As I am new to PySpark, I am not familiar with these dependencies, and the error is not easy to understand.
The error I get is:
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 98, in deco
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.json.
: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:816)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:792)
at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:747)
at org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.
I had the same issue with Spark 3.0.0 / Hadoop 3.2.
What worked for me was to replace hadoop-aws-3.2.1.jar in spark-3.0.0-bin-hadoop3.2/jars with hadoop-aws-3.2.0.jar, found here: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.2.0
Check the version of the Guava jar bundled with your Spark. If, like me, you downloaded Spark from Amazon using the link in their documentation (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz), you can see that the included Guava version is guava-14.0.1.jar, while their container uses guava-21.0.jar.
I have reported the issue to them and they will repackage their Spark to include the correct version. If you are interested in the bug itself, here is the link: https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#ClassNotFoundException:_org.apache.hadoop.fs.s3a.S3AFileSystem
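Both suggestions above come down to checking which jar versions are actually bundled with Spark. A small Python sketch for that follows; the helper name bundled_versions and the choice of artifacts to look for are my own assumptions, and you would pass your own jars directory (for example the jars folder of the Spark distribution mentioned above).

```python
import re
from pathlib import Path

# Artifacts whose versions have to be compatible for S3A to work.
ARTIFACTS = ("guava", "hadoop-aws", "hadoop-common")


def bundled_versions(jars_dir):
    """Map each artifact name to the sorted list of versions found in jars_dir."""
    found = {name: [] for name in ARTIFACTS}
    for jar in Path(jars_dir).glob("*.jar"):
        for name in ARTIFACTS:
            # Match e.g. "guava-14.0.1.jar" and capture "14.0.1".
            match = re.fullmatch(re.escape(name) + r"-(\d[\w.\-]*)\.jar", jar.name)
            if match:
                found[name].append(match.group(1))
    return {name: sorted(versions) for name, versions in found.items()}
```

If hadoop-aws and hadoop-common show different versions, or guava is much older than what your hadoop-aws expects, that mismatch is a likely source of a NoSuchMethodError like the one in the question.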

How to access a file in windows using Spark and winutils?

I am running Spark on Windows using winutils.
In the Spark shell I am trying to load a CSV file, but it says "Path does not exist". I have a file at location E:/data.csv.
I am executing:
scala> val df = spark.read.option("header","true").csv("E:\\data.csv")
Error:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/E:/data.csv;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
I can't figure out why the path becomes "/E:" when it should be just "E:".
How should I access the file?
In my case, I am able to read the file as follows:
val input = spark.sqlContext.read.format("com.databricks.spark.csv").option("header", "true")
.option("delimiter", ";").option("quoteAll","true").option("inferSchema","false").load("C:/Work/test.csv").toDF()
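About the "/E:" part: as far as I can tell, that is just how a Windows drive path looks once it is written as a file: URI, so it is probably not the cause by itself. A minimal Python sketch of the conversion (the helper name windows_path_to_file_uri is made up for illustration, and it does not percent-encode special characters such as spaces):

```python
from pathlib import PureWindowsPath


def windows_path_to_file_uri(path):
    """Turn an absolute Windows path into a file: URI with forward slashes."""
    # as_posix() only swaps backslashes for forward slashes; the drive letter stays.
    return "file:///" + PureWindowsPath(path).as_posix()


print(windows_path_to_file_uri(r"E:\data.csv"))  # file:///E:/data.csv
```

The file:/E:/data.csv in the error message is the single-slash spelling of the same URI. So it is worth double-checking that the file really exists under that exact name and drive, and trying a forward-slash path like the one in the answer above.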

basedir must be absolute: ?/.ivy2/local

I'm writing here in a state of complete desperation...
I have 2 users:
1 local user, created in Linux. Works 100% fine; word count works perfectly. Kerberized cluster. Valid ticket.
1 Active Directory user, who can log in, but the pyspark instruction (the same word count) fails. Same KDC ticket as the one above.
Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: ?/.ivy2/local
at org.apache.ivy.util.Checks.checkAbsolute(Checks.java:48)
at org.apache.ivy.plugins.repository.file.FileRepository.setBaseDir(FileRepository.java:135)
at org.apache.ivy.plugins.repository.file.FileRepository.<init>(FileRepository.java:44)
at org.apache.spark.deploy.SparkSubmitUtils$.createRepoResolvers(SparkSubmit.scala:943)
at org.apache.spark.deploy.SparkSubmitUtils$.buildIvySettings(SparkSubmit.scala:1035)
at org.apache.spark.deploy.SparkSubmit$$anonfun$2.apply(SparkSubmit.scala:295)
at org.apache.spark.deploy.SparkSubmit$$anonfun$2.apply(SparkSubmit.scala:295)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:294)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The Code I'm running. Super simple.
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
It fails at the last instruction with the exception shown above.
?/.ivy2/local -> This is the problem, but I have no idea what's going on :(.
With the Linux user it works perfectly... but with the AD user, which doesn't exist in the local system but has /home/userFolder, I have this problem :(
Please help... I've reached the point of insanity... I've googled every corner of the internet but I haven't found any solution to this problem :( Stack Overflow is my last resort, heeeeeeeeeelp
Context
Ivy needs a directory called .ivy2, usually located in the home directory. You can also configure where .ivy2 should be by giving a configuration property when Spark starts, or when you execute spark-submit.
Where the problem comes from
In IvySettings.java (line 796 for the version 2.2.0 of ant-ivy) there is this line:
if (getVariable("ivy.home") != null) {
setDefaultIvyUserDir(Checks.checkAbsolute(getVariable("ivy.home"), "ivy.home"));
Message.verbose("using ivy.default.ivy.user.dir variable for default ivy user dir: " + defaultUserDir);
} else {
setDefaultIvyUserDir(new File(System.getProperty("user.home"), ".ivy2"));
Message.verbose("no default ivy user dir defined: set to " + defaultUserDir);
}
As you can see, if ivy.home is not set, Ivy falls back to user.home. If user.home does not resolve to an absolute path either (here it shows up as "?"), you get the error:
Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: ?/.ivy2/local
Solution 1 (spark-shell or spark-submit)
As Rocke Yang has mentioned, you can start spark-shell or spark-submit by setting the configuration property spark.jars.ivy. Example:
spark-shell --conf spark.jars.ivy=/tmp/.ivy
Solution 2 (spark-launcher or yarn-client)
A second solution would be to set the configuration property when calling the submit method programmatically:
sparkLauncher.setSparkHome("/path/to/SPARK_HOME")
.setAppResource("/path/to/jar/to/be/executed")
.setMainClass("MainClassName")
.setMaster("MasterType like yarn or local")
.setDeployMode("set deploy mode like cluster")
.setConf("spark.executor.cores","2")
.setConf("spark.jars.ivy","/tmp/.ivy")
Ticket opened
There is a ticket opened by Spark-Community
I ran into a similar issue.
SparkSubmit looks for the Ivy home directory directly and reports an error if it is not found. Note that the property name is slightly different on the Spark side:
class SparkSubmitArguments {
ivyRepoPath = sparkProperties.get("spark.jars.ivy").orNull
}
We can pass the Ivy home directory like this:
spark-shell --conf spark.jars.ivy=/tmp/.ivy
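Since the question creates the SparkContext from Python rather than through spark-shell, one way to pass the same setting is the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it launches the JVM. This is only a sketch: the helper name pyspark_submit_args is made up, /tmp/.ivy is just the example directory used above, and I have not tested it on a Kerberized cluster with an AD user.

```python
import os
import posixpath


def pyspark_submit_args(ivy_dir):
    """Build a PYSPARK_SUBMIT_ARGS value that sets spark.jars.ivy."""
    # Ivy rejects relative paths such as "?/.ivy2", so check up front.
    if not posixpath.isabs(ivy_dir):
        raise ValueError("ivy_dir must be absolute: " + ivy_dir)
    # PySpark expects the value to end with "pyspark-shell".
    return "--conf spark.jars.ivy={} pyspark-shell".format(ivy_dir)


# Set this before the SparkContext is created, because the JVM
# (and with it SparkSubmit) is launched at that point.
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args("/tmp/.ivy")
```

After that, the findspark.init() and SparkContext(conf=conf) lines from the question can follow unchanged.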

Error while trying to connect to SPARK from Rstudio using sparklyr package

I am using the command below to connect to Spark from RStudio:
sc <- spark_connect(master = "local", version = "2.0.0")
I have tried changing the Java version and path, but I still get the same issue.
Can someone please help with this?
Error in force(code) :
Failed while connecting to sparklyr to port (8880) for sessionid (5308): Gateway in port (8880) did not respond.
Path: C:\Users\....\Local\rstudio\spark\Cache\spark-2.0.0-bin-hadoop2.7\bin\spark-submit2.cmd
Parameters: --class, sparklyr.Backend, "C:\Users\......\R\win-library\3.4\sparklyr\java\sparklyr-2.0-2.11.jar", 8880, 5308
---- Output Log ----
'C:\Users....\AppData\Local\rstudio\spark\Cache\SPARK-~1.7\bin\SPARK-~4.CMD' is not recognized as an internal or external command,
operable program or batch file.
---- Error Log ----
The initial Spark/dplyr setup has probably not been done. Click the "New Connection" button in the 'Spark' tab, next to the 'Environment' tab in RStudio:
Select 'Master' as local and 'DB interface' as dplyr.
Select the proper versions of Spark and Hadoop, then click 'Connect'.
If Spark is not installed on your machine, a pop-up will ask you to install it.
After this you should be able to execute your code:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
Do let us know what you get after this.

SaveAsTextFile() results in Mkdirs failed - Apache Spark

I currently have a Spark cluster with 3 worker nodes. I also have an NFS server mounted on /var/nfs with 777 permissions for testing. I'm trying to run the following code to count the words in a text:
root#master:/home/usuario# MASTER="spark://10.0.0.1:7077" spark-shell
val inputFile = sc.textFile("/var/nfs/texto.txt")
val counts = inputFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.toDebugString
counts.cache()
counts.count()
counts.saveAsTextFile("/home/usuario/output");
But spark gives me the following error:
Caused by: java.io.IOException: Mkdirs failed to create
file:/var/nfs/output-4/_temporary/0/_temporary/attempt_20170614094558_0007_m_000000_20
(exists=false, cwd=file:/opt/spark/work/app-20170614093824-0005/2)
I have searched many websites but I cannot find a solution for my case. Any help is appreciated.
When you start spark-shell with MASTER set to a valid application-master URL, rather than local[*], Spark treats all paths as HDFS paths and performs IO operations only in the underlying HDFS, not in the local file system.
You have mounted these locations in the local file system, and those paths do not exist in HDFS.
That's why the error says: exists=false
I had the same issue. Check the ownership of your directory again:
sudo chown -R owner-user:owner-group directory
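Before changing ownership, it may help to check whether the user that runs the Spark executors can create the output directory at all. Below is a small Python sketch of such a check; the helper name can_create_output is made up, and it only tells you about the machine and user it is run as, so you would need to run it on every worker node.

```python
import os


def can_create_output(output_dir):
    """Return True if the current user could create output_dir on this machine."""
    parent = os.path.dirname(os.path.abspath(output_dir))
    # Walk up to the nearest ancestor that already exists.
    while not os.path.exists(parent):
        parent = os.path.dirname(parent)
    # That ancestor must be a directory we can write to and traverse.
    return os.path.isdir(parent) and os.access(parent, os.W_OK | os.X_OK)
```

For example, can_create_output("/var/nfs/output-4") returning False on one of the workers would point to a permission or mount problem on that node.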
