How to Stop or Delete HiveContext in Pyspark? - apache-spark

I'm facing the following problem:
def my_func(table, usr, psswrd):
from pyspark import SparkContext, SQLContext, HiveContext, SparkConf
sconf = SparkConf()
sconf.setAppName('TEST')
sconf.set("spark.master", "local[2]")
sc = SparkContext(conf=sconf)
hctx = HiveContext(sc)
## Initialize variables
df = hctx.read.format("jdbc").options(url=url,
user=usr,
password=psswd,
driver=driver,
dbtable=table).load()
pd_df = df.toPandas()
sc.stop()
return pd_df
The problem here is the persistence of HiveContext (i.e if I do hctx._get_hive_ctx() it returns JavaObject id=Id)
So if I use my_func several times in the same script it will failed at the second time.
I would try remove the HiveContext which is apparently not deleted when I stop the SparkContext.
Thanks

Removing HiveContext is not possible as some state persists after sc.stop() that makes it not work in some cases.
But you could have a workaround for this (caution!! it's dangerous) if it's feasible for you. You have to delete the metastore_db everytime you start/stop your sparkContext. Again, see if it's feasible for you. The code is Java is below (in your case you have to modify it in Python).
File hiveLocalMetaStorePath = new File("metastore_db");
FileUtils.deleteDirectory(hiveLocalMetaStorePath);
You can better understand it from the following links.
https://issues.apache.org/jira/browse/SPARK-10872
https://issues.apache.org/jira/browse/SPARK-11924

Related

filter working in pyspark shell not spark-submit

df_filter = df.filter(~(col('word').isin(stop_words_list)))
df_filter.count()
27781
df.count()
31240
While submitting the same code to Spark cluster using spark-submit, the filter function is not working properly, the rows with col('word') in the stop_words_list are not filtered.
Why does this happen?
The filtering is working now after the col('word') is trimmed.
df_filter = df.filter(~(trim(col("word")).isin(stop_words_list)))
I still don't know why it works in pyspark shell, but not spark-submit. The only difference they have is: in pyspark shell, I used spark.read.csv() to read in the file, while in spark-submit, I used the following method.
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
session = pyspark.sql.SparkSession.builder.appName('test').getOrCreate()
sqlContext = SQLContext(session)
df = sqlContext.read.format("com.databricks.spark.csv").option('header','true').load()
I'm not sure if two different read-in methods are causing the discrepancy. Someone who is familiar with this can clarify.
Try using double quotes instead of single quotes.
from pyspark.sql.functions import col
df_filter = df.filter(~(col("word").isin(stop_words_list))).count()

Pyspark reads csv - NameError: name 'spark' is not defined

I am trying to run the following code in databricks in order to call a spark session and use it to open a csv file:
spark
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
And I get the following error:
NameError:name 'spark' is not defined
Any idea what might be wrong?
I have also tried to run:
from pyspark.sql import SparkSession
But got the following in response:
ImportError: cannot import name SparkSession
If it helps, I am trying to follow the following example (you will understand better if you watch it from from 17:30 on):
https://www.youtube.com/watch?v=K14plpZgy_c&list=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX
I got it worked by using the following imports:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
I got the idea by looking into the pyspark code as I found read csv was working in the interactive shell.
Please note the example code your are using is for Spark version 2.x
"spark" and "SparkSession" are not available on Spark 1.x. The error messages you are getting point to a possible version issue (Spark 1.x).
Check the Spark version you are using.

Cannot create Spark Phoenix DataFrames

I am trying to load data from Apache Phoenix into a Spark DataFrame.
I have been able to successfully create an RDD with the following code:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val foo: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
table = "FOO",
columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
foo.collect().foreach(x => println(x))
However I have not been so lucky trying to create a DataFrame. My current attempt is:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.phoenixTableAsDataFrame(
table = "FOO",
columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
df.select(df("ID")).show
Unfortunately the above code results in a ClassCastException:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
I am still very new to spark. If anyone can help it would be very much appreciated!
Although you haven't mentioned your spark version and details of the exception...
Please see PHOENIX-2287 which is fixed, which says
Environment: HBase 1.1.1 running in standalone mode on OS X *
Spark 1.5.0 Phoenix 4.5.2
Josh Mahonin added a comment - 23/Sep/15 17:56 Updated patch adds
support for Spark 1.5.0, and is backwards compatible back down to
1.3.0 (manually tested, Spark version profiles may be worth looking at in the future) In 1.5.0, they've gone and explicitly hidden the
GenericMutableRow data structure. Fortunately, we are able to the
external-facing 'Row' data type, which is backwards compatible, and
should remain compatible in future releases as well. As part of the
update, Spark SQL deprecated a constructor on their 'DecimalType'.
In updating this, I exposed a new issue, which is that we don't
carry-forward the precision and scale of the underlying Decimal type
through to Spark. For now I've set it to use the Spark defaults, but
I'll create another issue for that specifically. I've included an
ignored integration test in this patch as well.

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar in my $HDFS_USER/lib directory on the cluster and even included it using the --jar option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df =... line, but data is not read yet. Spark only starts reading and processing the data, when you ask for some kind of output (like a df.count(), or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one or other reason.
If anyone else runs into this problem, I finally solved it. I removed the CDH spark package and downloaded it from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issues was with the CDH version, but I'm not going to waste anymore time trying to figure it out.

Spark workflow with jar

I'm trying to understand the extend to which one must compile a jar to use Spark.
I'd normally write ad-hoc analysis code in an IDE, then run it locally against data with a single click (in the IDE). If my experiments with Spark are giving me the right indication then I have to compile my script into a jar, and send it to all the Spark nodes. I.e. my workflow would be
Writing analysis script, which will upload a jar of itself (created
below)
Go make the jar.
Run the script.
For ad-hoc iterative work this seems a bit much, and I don't understand how the REPL gets away without it.
Update:
Here's an example, which I couldn't get to work unless I compiled it into a jar and did sc.addJar. But the fact that I must do this seems odd, since there is only plain Scala and Spark code.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD
object Runner {
def main(args: Array[String]) {
val logFile = "myData.txt"
val conf = new SparkConf()
.setAppName("MyFirstSpark")
.setMaster("spark://Spark-Master:7077")
val sc = new SparkContext(conf)
sc.addJar("Analysis.jar")
sc.addFile(logFile)
val logData = sc.textFile(SparkFiles.get(logFile), 2).cache()
Analysis.run(logData)
}
}
object Analysis{
def run(logData: RDD[String]) {
val numA = logData.filter(line => line.contains("a")).count()
val numB = logData.filter(line => line.contains("b")).count()
println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))
}
}
You are creating an anonymous function in the use of 'filter':
scala> (line: String) => line.contains("a")
res0: String => Boolean = <function1>
That function's generated name is not available unless the jar is distributed to the workers. Did the stack trace on the worker highlight a missing symbol?
If you just want to debug locally without having to distribute the jar you could use the 'local' master:
val conf = new SparkConf().setAppName("myApp").setMaster("local")
While creating JARs is the most common way of handling long-running Spark jobs, for interactive development work Spark has shells available directly in Scala, Python & R. The current quick start guide ( https://spark.apache.org/docs/latest/quick-start.html ) only mentions the Scala & Python shells, but the SparkR guide discusses how to work with SparkR interactively as well (see https://spark.apache.org/docs/latest/sparkr.html ). Best of luck with your journeys into Spark as you find yourself working with larger datasets :)
You can use SparkContext.jarOfObject(Analysis.getClass) to automatically include the jar that you want to distribute without packaging it yourself.
Find the JAR from which a given class was loaded, to make it easy for
users to pass their JARs to SparkContext.
def jarOfClass(cls: Class[_]): Option[String]
def jarOfObject(obj: AnyRef): Option[String]
You want to do something like:
sc.addJar(SparkContext.jarOfObject(Analysis.getClass).get)
HTH!

Resources