dataframe.describe() function on Spark 2.3.1 is throwing Py4JJavaError - python-3.x

I am using Spark 2.3.1 and Python 3.6.5 on Ubuntu. While running dataframe.describe() I am getting the error below in Jupyter Notebook.
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-19-ea8415b8a3ee> in <module>()
----> 1 df.describe()
~/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py in describe(self, *cols)
1052 if len(cols) == 1 and isinstance(cols[0], list):
1053 cols = cols[0]
-> 1054 jdf = self._jdf.describe(self._jseq(cols))
1055 return DataFrame(jdf, self.sql_ctx)
1056
~/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
~/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o132.describe.
: java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2073)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.sql.execution.stat.StatFunctions$.aggResult$lzycompute$1(StatFunctions.scala:273)
at org.apache.spark.sql.execution.stat.StatFunctions$.org$apache$spark$sql$execution$stat$StatFunctions$$aggResult$1(StatFunctions.scala:273)
at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$summary$2.apply$mcVI$sp(StatFunctions.scala:286)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at org.apache.spark.sql.execution.stat.StatFunctions$.summary(StatFunctions.scala:285)
at org.apache.spark.sql.Dataset.summary(Dataset.scala:2473)
at org.apache.spark.sql.Dataset.describe(Dataset.scala:2412)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:844)
This is the test code I am using:
import findspark
findspark.init('/home/pathirippilly/spark-2.3.1-bin-hadoop2.7')
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType,StructType,StructField,IntegerType
spark=SparkSession.builder.appName('Basics').getOrCreate()
df=spark.read.json('people.json')
df.describe() #not working
df.describe().show() #not working
I have installed the following versions of Java, Scala, Python and Spark:
pathirippilly@sparkBox:/usr/lib/jvm$ java -version
openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)
pathirippilly@sparkBox:/usr/lib/jvm$ scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
Python: 3.6.5
Spark: spark-2.3.1-bin-hadoop2.7
My environment variable setup is as below. I have saved all these variables in /etc/environment and source it from /etc/bash.bashrc:
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
PYSPARK_DRIVER_OPTION="jupyter"
PYSPARK_DRIVER_PYTHON_OPTS="notebook"
PYSPARK_PYTHON=python3
SPARK_HOME='/home/pathirippilly/spark-2.3.1-bin-hadoop2.7/'
PATH=$SPARK_HOME:$PATH
PYTHONPATH=$SPARK_HOME/python/
Also, I have not configured spark-env.sh. Is it necessary to have spark-env.sh configured?
Is this a compatibility issue, or am I doing something wrong here?
It would be really helpful if someone could point me in the right direction.
Note: df.show() works perfectly on the same dataframe.

This issue is fixed for me. I reconfigured the entire setup from the beginning and prepared my /etc/environment file as below:
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/$
export SPARK_HOME='/home/pathirippilly/spark-2.3.1-bin-hadoop2.7'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHONPATH=python3
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64"
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
And I have added below line to /etc/bash.bashrc
source /etc/environment
Notes:
* pyspark is available on my PYTHONPATH, and every time I open a terminal session /etc/bash.bashrc sources /etc/environment, which in turn exports all the environment variables.
* I have used java-1.8.0-openjdk-amd64 instead of Java 10 or 11. Java 10 or 11 may also work according to the PySpark 2.3.1 release documentation, but I am not sure.
* I have used Scala 2.11.12 only.
* The py4j module is also available on my PYTHONPATH.
I am not sure where I had messed up before, but with the setup above my PySpark 2.3.1 works fine with Java 1.8, Scala 2.11.12 and Python 3.6.5 (and without the findspark module).
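As a quick sanity check (a small sketch under the assumption that the /etc/environment values above are in effect; it is not part of the original setup), you can verify from inside a Jupyter session that the driver picks up the intended Java and Spark before calling describe():
import os, subprocess
from pyspark.sql import SparkSession

print(os.environ.get("JAVA_HOME"))    # expect the Java 8 install configured above
print(os.environ.get("SPARK_HOME"))   # expect .../spark-2.3.1-bin-hadoop2.7
# java -version prints to stderr, so capture that stream
print(subprocess.run(["java", "-version"], stderr=subprocess.PIPE).stderr.decode())

spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.read.json("people.json")
df.describe().show()                  # works once the driver JVM is Java 8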

OP, I had exactly the same setup as you; in fact, we are following the same Spark course on Udemy (setting up everything they say to the letter) and I encountered the same error at the same place. The only thing I changed to make it work was the Java version. When the course was made,
$ sudo apt-get install default-jre
installed 8, but now it installs 11. I then uninstalled that Java and ran
$ sudo apt-get install openjdk-8-jre
then changed the JAVA_HOME path to point to it and now it works.

I encountered the same error while following the same Spark course on Udemy. Below are the steps I followed to resolve the issue.
Remove OpenJDK 11:
1) sudo apt-get autoremove default-jdk openjdk-11-jdk
(it will ask for confirmation; confirm it)
2) sudo apt-get remove default-jre
Install JDK 8:
3) sudo apt-get install openjdk-8-jre
Point JAVA_HOME to the newly installed JDK:
4) export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64"
Following the approach above, the error from df.describe() was resolved.
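To double-check which Java the Spark JVM actually ended up on, here is a small sketch (not one of the steps above; it relies on the py4j gateway that PySpark already exposes):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Ask the driver JVM for its Java version through the py4j gateway
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))  # expect a 1.8.0_* value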

Related

Error while reading data from COSMOS DB using azure-cosmos-spark_3-2_2-12-4.10.0.jar on DataBricks Runtime 10.4 LTS

I have pyspark code deployed on azure databricks, which basically reads and writes data to and from cosmosDB.
In previous months, I used:
Databricks Runtime version: 6.4 (Extended Support) with Scala 2.11 and Spark 2.4.5
Azure-cosmosdb-spark connector: azure-cosmosdb-spark_2.4.0_2.11-3.7.0-uber.jar
The program runs fine without any errors.
But when I upgraded my runtime version to:
Databricks Runtime version: 10.4 LTS with Scala 2.12 and Spark 3.2.1
Azure-cosmosdb-spark connector: azure-cosmos-spark_3-2_2-12-4.10.0.jar
it throws an error saying:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-2873073042801780> in <module>
14 #Reading data from cosmosdb
15 #try:
---> 16 input_data_cosmosdb = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**ReadConfig_input_cosmos).load()
17 print("Cosmos DB columns ",input_data_cosmosdb.columns)
18 #except:
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
162 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
163 else:
--> 164 return self._df(self._jreader.load())
165
166 def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
Py4JJavaError: An error occurred while calling o663.load.
: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
at com.microsoft.azure.cosmosdb.spark.config.Config$.getOptionsFromConf(Config.scala:281)
at com.microsoft.azure.cosmosdb.spark.config.Config$.apply(Config.scala:229)
at com.microsoft.azure.cosmosdb.spark.DefaultSource.createRelation(DefaultSource.scala:55)
at com.microsoft.azure.cosmosdb.spark.DefaultSource.createRelation(DefaultSource.scala:40)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:356)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:323)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:323)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:222)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
I have tried all the solutions available on the internet. I know this error usually comes up when 2.11 libraries are used in a 2.12 project, but I am using a 2.12 project and the 2.12 Spark connector itself. Yet I still get the error.
Please let me know if anyone is using the same environment. I use Databricks on Azure and am looking for a solution.
Any answers related to Databricks are really appreciated.
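For reference, a hypothetical sketch of how a read typically looks with the newer azure-cosmos-spark_3 connector, which exposes a "cosmos.oltp" format with spark.cosmos.* options rather than the legacy com.microsoft.azure.cosmosdb.spark source; the endpoint, key, database and container below are placeholders, not values from the question:
# `spark` is the session already provided by the Databricks notebook
read_cfg = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",  # placeholder
    "spark.cosmos.accountKey": "<key>",                                            # placeholder
    "spark.cosmos.database": "<database>",                                         # placeholder
    "spark.cosmos.container": "<container>",                                       # placeholder
}
# Read through the Spark 3 connector's data source name instead of the legacy one
input_data_cosmosdb = spark.read.format("cosmos.oltp").options(**read_cfg).load()
print("Cosmos DB columns", input_data_cosmosdb.columns)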

ClassNotFoundException loading data from snowflake with pyspark

I am getting this error when I try to load data from snowflake into a dataframe with pyspark:
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.NoClassDefFoundError: net/snowflake/client/jdbc/internal/org/bouncycastle/jce/provider/BouncyCastleProvider
Here is some code to reproduce the error:
from pyspark.sql import SparkSession
from pyspark import SparkConf

conf = SparkConf()
conf.set('spark.jars.packages',
         'net.snowflake:spark-snowflake_2.11:2.8.4-spark_2.4,net.snowflake:snowflake-jdbc:3.12.17')
spark = SparkSession.builder.config(conf=conf).getOrCreate()

sf_reader_options = {'sfURL': 'example.snowflakecomputing.com', 'sfAccount': 'example_account',
                     'sfWarehouse': 'example_warehouse', 'sfRole': 'DATASCIENCE', 'sfUser': 'user',
                     'sfPassword': 'pass', 'sfDatabase': 'db_name', 'sfSchema': 'schema_name', 'sfTimezone': 'UTC'}

reader = (spark
          .read
          .format('net.snowflake.spark.snowflake')
          .options(**sf_reader_options))
result = reader.option('query', 'select * from TABLE_NAME').load()
The stacktrace for the error looks like this:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 172, in load
return self._df(self._jreader.load())
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.NoClassDefFoundError: net/snowflake/client/jdbc/internal/org/bouncycastle/jce/provider/BouncyCastleProvider
at net.snowflake.spark.snowflake.Parameters$.mergeParameters(Parameters.scala:202)
at net.snowflake.spark.snowflake.DefaultSource.createRelation(DefaultSource.scala:59)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: net.snowflake.client.jdbc.internal.org.bouncycastle.jce.provider.BouncyCastleProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 17 more
I am using spark 2.4.7 and spark-snowflake 2.8.4, with snowflake jdbc 3.12.17. I am on Mac OS X Big Sur. This happened after I upgraded to Big Sur, though I'm not sure whether that's related.
I have tried:
adding bouncy castle provider to my configuration as a package dependency
checking that JAVA_HOME points to Java 8 (it does)
reinstalling java 8 (with homebrew and adoptopenjdk)
adding bouncy castle as a security provider, per instructions here
updating spark-snowflake and snowflake-jdbc (was using 2.7.0 and 3.12.3 before, same error)
Any help would be much appreciated!
Ultimately, I was able to resolve this by:
* downloading Java straight from Oracle (rather than uninstalling and reinstalling with Homebrew),
* deleting Spark and downloading it again (from Apache, not via Homebrew), and setting up environment variables as described here (mostly... I use a virtual environment, so I didn't hardcode PYSPARK_PYTHON to the system python3),
* uninstalling pyspark and reinstalling it,
* quitting PyCharm and reopening it (this refreshed all the environment variables set in .zshrc, like JAVA_HOME).
There's almost certainly an easier way, but this worked.
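A minimal sketch of the JAVA_HOME piece from the Python side (an assumption, not taken from the list above; the JDK path is a placeholder): the variable just has to be set before the first SparkSession is created, because that is when spark-submit launches the driver JVM.
import os

# Must run before SparkSession.builder.getOrCreate() launches the JVM
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"  # placeholder path

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # the driver JVM now starts on the Java 8 above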

How do I get Spark working on Windows 10 without Hadoop?

I am trying to get Spark running on Windows 10, but I keep running into errors. I have researched thoroughly but am still running into issues; here is what I have done:
Installed JDK 1.8. (works fine)
Installed Anaconda3 (works fine)
Unzipped Spark 2.3.1
Downloaded winutils.exe from here and placed it in .\Hadoop\bin\ (apart from this one file, the rest of the Hadoop folder is empty; I was told I did not need Hadoop)
Set up environment variables as follows:
User Variable : PATH = .\Continuum\anaconda3
System Variable :
JAVA_HOME = .\Java\jdk1.8.0_161
HADOOP_HOME = .\Hadoop
PYSPARK_DRIVER_PYTHON = jupyter
PYSPARK_DRIVER_PYTHON_OPTS = notebook
Path = .\Java\jdk1.8.0_161\bin; .\Hadoop\bin; .\spark\bin; .\Hadoop\bin\
I created the folder C:\tmp\hive and ran the command winutils.exe chmod -R 777 \tmp\hive
When I run pyspark in the terminal, Jupyter starts fine and runs code. However, the following code keeps breaking. Based on my research, I am fairly certain this has to do with my installation.
from datetime import datetime
from pyspark.sql.types import Row
records = sc.parallelize([[1, "Alice", 50],[2, "Bob", 80]])
df = records.toDF()
Which results in the error:
Py4JJavaError Traceback (most recent call last)
~\spark\python\pyspark\sql\utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~\spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
Py4JJavaError: An error occurred while calling o24.applySchemaToPythonRDD.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.<init>(HiveSessionStateBuilder.scala:69)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:577)
at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:752)
at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:737)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:114)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:385)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
... 30 more
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 45 more
During handling of the above exception, another exception occurred:
AnalysisException Traceback (most recent call last)
<ipython-input-5-576476669b32> in <module>()
----> 1 df = records.toDF()
So I figured out what worked for me:
I deleted the Hadoop folder that contained only winutils.exe, downloaded https://github.com/steveloughran/winutils instead, pointed my HADOOP_HOME environment variable at its hadoop-2.7.1 folder, added the bin directory inside that folder to PATH, and restarted.
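A rough equivalent from inside Python, for anyone who prefers not to touch the system environment variables (a sketch that assumes the winutils checkout lives at C:\Hadoop\hadoop-2.7.1; the path is a placeholder): the variables only need to be set before the SparkSession starts the JVM.
import os

hadoop_home = r"C:\Hadoop\hadoop-2.7.1"                 # placeholder: folder containing bin\winutils.exe
os.environ["HADOOP_HOME"] = hadoop_home
os.environ["PATH"] = os.path.join(hadoop_home, "bin") + os.pathsep + os.environ["PATH"]

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()              # the JVM now finds winutils.exe via HADOOP_HOME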

Load Spark Dataframe into ElasticSearch using PySpark [duplicate]

I have a Spark dataframe that I am trying to push to AWS Elasticsearch, but first I was testing this sample code snippet to push to ES:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ES_indexer').getOrCreate()
df = spark.createDataFrame([{'num': i} for i in xrange(10)])
df = df.drop('_id')
df.write.format(
    'org.elasticsearch.spark.sql'
).option(
    'es.nodes', 'http://spark-data-push-adertadaltdpioy124.us-west-2.es.amazonaws.com'
).option(
    'es.port', 9200
).option(
    'es.resource', '%s/%s' % ('index_name', 'doc_type_name'),
).save()
I get an error saying:
java.lang.ClassNotFoundException: Failed to find data source: org.elasticsearch.spark.sql. Please find packages at http://spark.apache.org/third-party-projects.html
Any suggestions would be greatly appreciated.
Error Trace:
Traceback (most recent call last):
File "es_3.py", line 12, in <module>
'es.resource', '%s/%s' % ('index_name', 'doc_type_name'),
File "/usr/local/lib/python2.7/site-packages/pyspark/sql/readwriter.py", line 732, in save
self._jwrite.save()
File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/lib/python2.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o46.save.
: java.lang.ClassNotFoundException: Failed to find data source: org.elasticsearch.spark.sql. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:245)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.spark.sql.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
... 12 more
tl;dr Use pyspark --packages org.elasticsearch:elasticsearch-hadoop:7.2.0 and use format("es") to reference the connector.
Quoting Installation from the official documentation of the Elasticsearch for Apache Hadoop product:
Just like other libraries, elasticsearch-hadoop needs to be available in Spark’s classpath.
And later in Supported Spark SQL versions:
elasticsearch-hadoop supports both version Spark SQL 1.3-1.6 and Spark SQL 2.0 through two different jars: elasticsearch-spark-1.x-<version>.jar and elasticsearch-hadoop-<version>.jar
elasticsearch-spark-2.0-<version>.jar supports Spark SQL 2.0
That looks like an issue with the documentation (they name two different versions of the jar file), but it does mean that you have to have the proper jar file on the CLASSPATH of your Spark application.
And later in the same document:
Spark SQL support is available under org.elasticsearch.spark.sql package.
That simply says that the format (in df.write.format('org.elasticsearch.spark.sql')) is correct.
Further down the document you will find that you can even use the alias df.write.format("es") (!)
I found Apache Spark section in the project's repository on GitHub more readable and current.
Update: The current ES-hadoop package as of June 2020 is 7.7.1, so I used pyspark --packages org.elasticsearch:elasticsearch-hadoop:7.7.1 instead.
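Putting the pieces together, here is a sketch of the same write using spark.jars.packages and the "es" alias (the elasticsearch-hadoop version and the ES endpoint below are placeholders, not values from the question):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ES_indexer")
         .config("spark.jars.packages", "org.elasticsearch:elasticsearch-hadoop:7.2.0")  # placeholder version
         .getOrCreate())

df = spark.createDataFrame([{"num": i} for i in range(10)])
(df.write.format("es")                                # alias for org.elasticsearch.spark.sql
   .option("es.nodes", "http://<your-es-endpoint>")   # placeholder endpoint
   .option("es.port", "9200")
   .option("es.resource", "index_name/doc_type_name")
   .save())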
You have to specify the version of your Elasticsearch cluster at the end of the package, e.g. --packages org.elasticsearch:elasticsearch-hadoop:(version). In my case it was org.elasticsearch:elasticsearch-hadoop:7.0.0.

pyspark SparkContext issue "Another SparkContext is being constructed"

I installed Spark on my EC2 instance following this tutorial:
https://sparkour.urizone.net/recipes/installing-ec2/#03
but when I try to start pyspark shell, I get this error:
"Another SparkContext is being constructed"
Here is the full exception:
[ec2-user@ip-10-0-0-153 ~]$ pyspark
Python 2.7.12 (default, Sep 1 2016, 22:14:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/22 11:46:16 WARN spark.SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/opt/spark/python/pyspark/shell.py", line 54, in <module>
spark = SparkSession.builder.getOrCreate()
File "/opt/spark/python/pyspark/sql/session.py", line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/opt/spark/python/pyspark/context.py", line 334, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/opt/spark/python/pyspark/context.py", line 118, in __init__
conf, jsc, profiler_cls)
File "/opt/spark/python/pyspark/context.py", line 180, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/opt/spark/python/pyspark/context.py", line 273, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.internal.config.package$
at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:546)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:373)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
I googled a lot and tried everything, with no solution. I used this code to check the current Spark configuration:
>>> from pyspark import SparkConf
>>> conf = SparkConf()
>>> conf.getAll()
And I got this:
[(u'spark.master', u'local[*]'), (u'spark.submit.deployMode', u'client'), (u'spark.app.name', u'PySparkShell')]
Any ideas how can I solve this issue?
I encountered the same error while trying to run PySpark in a Jupyter Notebook (on macOS). The problem seems related to a Java version incompatibility; it only works with Java 8. I fixed the issue by changing the Java version:
Check the installed Java (JVM) versions. If you don't have Java 8 installed, install it following the instructions for your OS:
/usr/libexec/java_home -V
Switch to Java 8 (it is enough to do it for the current session):
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8....)
Check it:
java -version
Run PySpark; that should solve the issue.
I solved the problem by setting SPARK_MASTER_HOST=127.0.0.1 in the spark-env.sh file.
Navigate to the Spark config folder:
cd $SPARK_HOME/conf
Open the spark-env.sh file in an editor (if this file does not exist, copy it from the spark-env.sh.template file):
vi spark-env.sh
Edit the SPARK_LOCAL_IP value to your machine's IP (e.g. 205.210.42.205):
SPARK_LOCAL_IP="205.210.42.205"
Hope this helps.
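If you would rather not edit spark-env.sh, a rough equivalent from the Python side is to pin the driver address when building the session (a sketch, not from the answer above; spark.driver.bindAddress is the standard conf key that overrides SPARK_LOCAL_IP):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.bindAddress", "127.0.0.1")  # overrides SPARK_LOCAL_IP
         .config("spark.driver.host", "127.0.0.1")
         .getOrCreate())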
