I am getting IllegalArgumentException when creating a SparkSession - apache-spark

I am using pyspark in a Jupyter notebook on Spark 2.1.0 with Python 2.7. I am trying to create a new SparkSession using the code below:
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
spark = SparkSession \
    .builder \
    .appName("Bank Service Classifier") \
    .config("spark.sql.crossJoin.enabled", "true") \
    .getOrCreate()
sc = SparkContext()
sqlContext = SQLContext(sc)
However, I am getting the following error:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-40-2683a8d0ffcf> in <module>()
4 from pyspark.sql import SQLContext
5
----> 6 spark = SparkSession .builder .appName("example-spark") .config("spark.sql.crossJoin.enabled","true") .getOrCreate()
7
8 sc = SparkContext()
/srv/spark/python/pyspark/sql/session.py in getOrCreate(self)
177 session = SparkSession(sc)
178 for key, value in self._options.items():
--> 179 session._jsparkSession.sessionState().conf().setConfString(key, value)
180 for key, value in self._options.items():
181 session.sparkContext._conf.set(key, value)
/srv/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/srv/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
How do I fix this?

I ran into this same error. Downloading Spark pre-built for Hadoop 2.6 instead of 2.7 worked for me.
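If you want to confirm which Hadoop version your Spark distribution was actually built against before swapping downloads, a quick check is possible from pyspark through the py4j gateway. This is a minimal sketch; _jvm is an internal attribute, and it assumes a plain SparkContext can still be created:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Spark version of the running build
print(sc.version)

# Hadoop version bundled with this Spark build, via the JVM gateway
print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())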

Related

Hive query from pyspark

I want to connect to Hive from pyspark using the following code:
from pyspark import SparkContext, SparkConf
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext
sparkSession = (SparkSession
    .builder
    .master('spark://spark_host:7077')
    .appName('example-pyspark-read-and-write-from-hive')
    .config("spark.sql.warehouse.dir", "hdfs://spark_host:9000/user/hive/warehouse", conf=SparkConf())
    .enableHiveSupport()
    .getOrCreate()
)
Output:
raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
My other attempts have failed as well. Please help me configure pyspark and Hive so that they work together.
Spark version: 2.4.5, Hive version: 3.1.2
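For reference, here is a minimal sketch of a Hive-enabled session that points explicitly at the metastore. The hive.metastore.uris value and the metastore_host name below are assumptions; adapt them to your deployment, or put a matching hive-site.xml on Spark's classpath instead:
from pyspark.sql import SparkSession

# metastore_host:9083 is a placeholder; use your real metastore thrift URI
spark = (SparkSession
    .builder
    .master('spark://spark_host:7077')
    .appName('example-pyspark-read-and-write-from-hive')
    .config("hive.metastore.uris", "thrift://metastore_host:9083")
    .enableHiveSupport()
    .getOrCreate())

spark.sql("SHOW DATABASES").show()
Note that Spark 2.4.x bundles a Hive 1.2.1 client by default, so talking to a Hive 3.1.2 metastore may additionally require setting spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars.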

Read csv using pyspark

I am new to Spark, and I am trying to read a CSV file using pyspark. I referred to PySpark How to read CSV into Dataframe, and manipulate it, Get CSV to Spark dataframe, and many more. I tried to read it in two ways:
Attempt 1:
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.conf import SparkConf
sc = SparkContext.getOrCreate()
df = spark.read.csv('D:/Users/path/csv/test.csv')
df.show()
Attempt 2:
import pyspark
sc = pyspark.SparkContext()
sql = SQLContext(sc)
df = (sql.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("D:/Users/path/csv/test.csv"))
df.show()
Neither of these works. I am getting the following error:
Py4JJavaError Traceback (most recent call last)
<ipython-input-28-c6263cc7dab9> in <module>()
4
5 sc = SparkContext.getOrCreate()
----> 6 df = spark.read.csv('D:/Users/path/csv/test.csv')
7 df.show()
8
~\opt\spark\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\readwriter.py in csv(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode)
378 if isinstance(path, basestring):
379 path = [path]
--> 380 return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
381
382 #since(1.5)
~\opt\spark\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~\opt\spark\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~\opt\spark\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o663.csv.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.hive.execution.HiveFileFormat not found
at java.util.ServiceLoader.fail(ServiceLoader.java:239)
at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
I don't know why it is throwing a Hive-related exception: Py4JJavaError: An error occurred while calling o663.csv.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.hive.execution.HiveFileFormat not found. How do I resolve this HiveFileFormat not found error?
Can anyone guide me to resolve this error?
Have you tried sqlContext.read.csv? This is how I read CSVs in Spark 2.1:
from pyspark import sql, SparkConf, SparkContext
conf = SparkConf().setAppName("Read_CSV")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)
df = sqlContext.read.csv("path/to/data")
df.show()
First of all, Spark needs to be initialized with the following commands:
from pyspark import SparkConf, SparkContext
sc = SparkContext()
After that, create an SQLContext on top of it:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
Finally, you can read your CSV with the following command:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('path/to/your/file.csv')
Since SQLContext is deprecated as of PySpark 3.0.1, use a SparkSession to import a CSV file into PySpark:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python") \
    .getOrCreate()
df = spark.read.csv("/path/to/file/csv")
df.show()
Try specifying a local master explicitly by creating a configuration object. This removes any doubt about Spark trying to reach Hadoop or a remote cluster, as mentioned in a comment.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc.stop()  # stop the existing context first
conf = SparkConf().setMaster('local[*]')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
If this is not working, then don't use SQLContext for reading the file. Instead, create a SparkSession and try spark.read.csv("path/filename.csv"), as sketched below.
Also, Spark/Hadoop is generally a lot simpler to run on a Linux operating system.
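A minimal sketch of that SparkSession-based alternative (the path is the one from the question and may need adjusting):
from pyspark.sql import SparkSession

# Create (or reuse) a session bound to a local master
spark = (SparkSession.builder
    .master("local[*]")
    .appName("read_csv_example")
    .getOrCreate())

# Read the CSV directly through the session's reader
df = spark.read.csv("D:/Users/path/csv/test.csv", header=True, inferSchema=True)
df.show()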
The error most likely occurs because you are trying to access a local file. See below for how you should access it:
#Local File
spark.read.option("header","true").option("inferSchema","true").csv("file:///path")
#HDFS file
spark.read.option("header","true").option("inferSchema","true").csv("/path")
.csv(<path>) comes last.

Unable to build SparkSession in Python

I'm new to Spark, and I'm using it in a Jupyter notebook. I have the following code, which gives me an error:
from pyspark import SparkConf, SparkContext
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.master("local").appName("Epidemiology").config(conf = SparkConf()).getOrCreate()
I'm at a loss here, any suggestions as to what could be the problem?
The complete error is too long to post here, but this is part of it:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
C:\spark\spark\python\pyspark\sql\utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
C:\spark\spark\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
Py4JJavaError: An error occurred while calling o23.sessionState.
: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1053)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:129)
.
.
.
During handling of the above exception, another exception occurred:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-2-17a54aa52bc2> in <module>()
1 # Boilerplate Spark stuff
2 #conf = SparkConf().setMaster("local").setAppName("Epidemiology")
----> 3 spark = SparkSession.builder.master("local").appName("Epidemiology").config(conf = SparkConf()).getOrCreate()
4 #sc = SparkContext.getOrCreate(conf = conf)
5 #sc = SparkContext(conf = conf)
C:\spark\spark\python\pyspark\sql\session.py in getOrCreate(self)
177 session = SparkSession(sc)
178 for key, value in self._options.items():
--> 179 session._jsparkSession.sessionState().conf().setConfString(key, value)
180 for key, value in self._options.items():
181 session.sparkContext._conf.set(key, value)
C:\spark\spark\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
C:\spark\spark\python\pyspark\sql\utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"

Querying a spark streaming application from spark-shell (pyspark)

I am following this example in the pyspark console and everything works perfectly.
After that I wrote it as a PySpark application as follows:
# -*- coding: utf-8 -*-
import sys
import click
import logging
from pyspark.sql import SparkSession
from pyspark.sql.types import *
@click.command()
@click.option('--master')
def most_idiotic_bi_query(master):
    spark = SparkSession \
        .builder \
        .master(master) \
        .appName("stream-test") \
        .getOrCreate()
    spark.sparkContext.setLogLevel('ERROR')

    some_schema = ....  # Schema removed

    some_stream = spark \
        .readStream \
        .option("sep", ",") \
        .schema(some_schema) \
        .option("maxFilesPerTrigger", 1) \
        .csv("/data/some_stream", header=True)

    streaming_counts = (
        some_stream.groupBy(some_stream.field_1).count()
    )

    query = streaming_counts.writeStream \
        .format("memory") \
        .queryName("counts") \
        .outputMode("complete") \
        .start()

    query.awaitTermination()

if __name__ == "__main__":
    logging.getLogger("py4j").setLevel(logging.ERROR)
    most_idiotic_bi_query()
The app is executed as:
spark-submit test_stream.py --master spark://master:7077
Now, if I open a new Spark driver in another terminal:
pyspark --master spark://master:7077
And try to run:
spark.sql("select * from counts")
It fails with:
During handling of the above exception, another exception occurred:
AnalysisExceptionTraceback (most recent call last)
<ipython-input-3-732b22f02ef6> in <module>()
----> 1 spark.sql("select * from id_counts").show()
/usr/spark-2.0.2/python/pyspark/sql/session.py in sql(self, sqlQuery)
541 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
542 """
--> 543 return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
544
545 #since(2.0)
/usr/local/lib/python3.4/dist-packages/py4j-0.10.4-py3.4.egg/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/spark-2.0.2/python/pyspark/sql/utils.py in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: 'Table or view not found: counts; line 1 pos 14'
I don't understand what is happening.
This is expected behavior. If you check the documentation for the memory sink:
The output is stored in memory as an in-memory table. Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver’s memory. Hence, use it with caution.
As you can see, the memory sink doesn't create a persistent table or a global temporary view, but a local structure limited to the driver. Hence it cannot be queried from another Spark application.
So the memory output has to be queried from the driver in which it is written. For example, you could mimic console mode as shown below.
A dummy writer:
import pandas as pd
import numpy as np
import tempfile
import shutil
def producer(path):
    temp_path = tempfile.mkdtemp()

    def producer(i):
        df = pd.DataFrame({
            "group": np.random.randint(10, size=1000)
        })
        df["val"] = (
            np.random.randn(1000) +
            np.random.random(1000) * df["group"] +
            np.random.random(1000) * i % 7
        )
        f = tempfile.mktemp(dir=temp_path)
        df.to_csv(f, index=False)
        shutil.move(f, path)

    return producer
Spark application:
from pyspark.sql.types import IntegerType, DoubleType, StructType, StructField
schema = StructType([
    StructField("group", IntegerType()),
    StructField("val", DoubleType())
])

path = tempfile.mkdtemp()
query_name = "foo"

stream = (spark.readStream
    .schema(schema)
    .format("csv")
    .option("header", "true")
    .load(path))

query = (stream
    .groupBy("group")
    .avg("val")
    .writeStream
    .format("memory")
    .queryName(query_name)
    .outputMode("complete")
    .start())
And some events:
from rx import Observable
timer = Observable.timer(5000, 5000)
timer.subscribe(producer(path))
timer.skip(1).subscribe(lambda *_: spark.table(query_name).show())
query.awaitTermination()

Why does pyspark fail with "Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars."?

I am using a standalone Apache Spark 2.0.0 cluster with two nodes, and I have not installed Hive. I am getting the following error when creating a DataFrame.
from pyspark import SparkContext
from pyspark import SQLContext
sqlContext = SQLContext(sc)
l = [('Alice', 1)]
sqlContext.createDataFrame(l).collect()
---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<ipython-input-9-63bc4f21f23e> in <module>()
----> 1 sqlContext.createDataFrame(l).collect()
/home/mok/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
297 Py4JJavaError: ...
298 """
--> 299 return self.sparkSession.createDataFrame(data, schema, samplingRatio)
300
301 #since(1.3)
/home/mok/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio)
522 rdd, schema = self._createFromLocal(map(prepare, data), schema)
523 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
--> 524 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
525 df = DataFrame(jdf, self._wrapped)
526 df._schema = schema
/home/mok/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/home/mok/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: u'Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars.'
So should I install Hive, or edit the configuration?
I had the same issue and fixed it by using Java 8. Make sure you install JDK 8 and set the environment variables accordingly.
Do not use Java 11 with Spark / pyspark 2.4.
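To confirm which Java version the Spark JVM is actually running (as opposed to whatever java -version reports on your shell's PATH), here is a minimal check from pyspark, assuming an existing SparkContext sc:
# Ask the JVM behind the py4j gateway for its Java version (_jvm is an internal attribute)
print(sc._jvm.java.lang.System.getProperty("java.version"))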
If you have several Java versions installed, you'll have to figure out which one Spark is using. I did this by trial and error, starting with
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
and ending with
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64".
Setting
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
did the trick.
If you have multiple JDKs installed, you can list the Java homes as follows:
/usr/libexec/java_home -V
Matching Java Virtual Machines (3):
13.0.2, x86_64: "OpenJDK 13.0.2" /Library/Java/JavaVirtualMachines/adoptopenjdk-13.0.2.jdk/Contents/Home
11.0.6, x86_64: "AdoptOpenJDK 11" /Library/Java/JavaVirtualMachines/adoptopenjdk-11.jdk/Contents/Home
1.8.0_252, x86_64: "AdoptOpenJDK 8" /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
Now, to point JAVA_HOME at the 1.8 JDK, use:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
Please ensure that your JAVA_HOME environment variable is set.
For macOS I ran echo 'export JAVA_HOME=/Library/Java/Home' >> ~/.bash_profile and then source ~/.bash_profile (or open ~/.bash_profile and add that line by hand).
