How to parallel insert into Hive using pyspark

How to parallel insert into Hive using pyspark - apache-spark

I have a job which is splitted among workers, each worker outputs a dataframe which needs to be written into hive, I couldn't figure out how to access hive from workers without initializing another sparkcontext so I tried collecting their output and inserting it in one time like below
result = df.rdd.map(lambda rdd: predict_item_by_model(rdd, columns)).collect()
df_list = sc.parallelize(result).map(lambda df: hiveContext.createDataFrame(df)).collect() #throws error
mergedDF = reduce(DataFrame.union, df_list)
mergedDF.write.mode('overwrite').partitionBy("item_id").saveAsTable("items")
but now it throws this error
_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
is there away to access hive from workers directly ? if not, how can I collect the data and insert them once ?

.map(lambda df: hiveContext.createDataFrame(df))
Simply not possible in Spark this approach. Not how it works at all.
The first step of any Spark driver application is to create a SparkContext including Hive context, if required. Driver aspect only. As the message states.
Have a look here https://www.waitingforcode.com/apache-spark/serialization-issues-part-1/read to get yourself going on this serialization issue.

Related

How to access java runtime variables like java.lang.Runtime.getRuntime().maxMemory() for pyspark executors?

The question is all there is. I want a way to check the java runtime variables for the executor jvm created but I am working with pyspark. How can I access java.lang.Runtime.getRuntime().maxMemory() if I am working with pyspark?
based on the comment I have tried to run the following code but both approaches are unsuccessful
#created a RDD
l = sc.range(100)
Now, I have to run func = sc._gateway.jvm.java.lang.Runtime.getRuntime().maxMemory() on each executor. So, I do the following
l.map(lambda x:sc._gateway.jvm.java.lang.Runtime.getRuntime().maxMemory()).collect()
Which results in
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
The spark context can only be used on the driver
I also tried
func = sc._gateway.jvm.java.lang.Runtime.getRuntime()
l.map(lambda x:func.maxMemory()).collect()
which results in the following error
TypeError: cannot pickle '_thread.RLock' object

Does spark saveAsTable really create a table?

This may be a dumb question since lack of some fundamental knowledge of spark, I try this:
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").enableHiveSupport().getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("foo");
This creates table under 'default' database in Hive, and of course, I can fetch data from the table anytime I want.
I update above code to get rid of "enableHiveSupport",
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("bar");
The code runs fine, without any error, but when I try "select * from bar", spark says,
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'bar' not found in database 'default';
So I have 2 questions here,
1) Is it possible to create a 'raw' spark table, not hive table? I know Hive mantains the metadata in database like mysql, does spark also have similar mechanism?
2) In the 2nd code snippet, what does spark actually create when calling saveAsTable?
Many thanks.

Check answers below:
If you want to create raw table only in spark createOrReplaceTempView could help you. For second part, check next answer.
By default, if you call saveAsTable on your dataframe, it will persistent tables into Hive metastore if you use enableHiveSupport. And if we don't enableHiveSupport, tables will be managed by Spark and data will be under spark-warehouse location. You will loose these tables after restart spark session.

How many SparkSessions can a single application have?

I have found that as Spark runs, and tables grow in size (through Joins) that the spark executors will eventually run out of memory and the entire system crashes. Even if I try to write temporary results to Hive tables (on HDFS), the system still doesn't free much memory, and my entire system crashes after about 130 joins.
However, through experimentation, I realized that if I break the problem into smaller pieces, write temporary results to hive tables, and Stop/Start the Spark session (and spark context), then the system's resources are freed. I was able to join over 1,000 columns using this approach.
But I can't find any documentation to understand if this is considered a good practice or not (I know you should not acquire multiple sessions at once). Most systems acquire the session in the beginning and close it in the end. I could also break the application into smaller ones, and use a driver like Oozie to schedule these smaller applications on Yarn. But this approach would start and stop the JVM at each stage, which seems a bit heavy-weight.
So my question: is it bad practice to continually start/stop the spark session to free system resources during the run of a single spark application?
But can you elaborate on what you mean by a single SparkContext on a single JVM? I was able call sparkSession.sparkContext().stop(), and also stop the SparkSession. I then created a new SparkSession and used a new sparkContext. No error was thrown.
I was also able to use this on the JavaSparkPi without any problems.
I have tested this in yarn-client and a local spark install.
What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?

TL;DR You can have as many SparkSessions as needed.
You can have one and only one SparkContext on a single JVM, but the number of SparkSessions is pretty much unbounded.
But can you elaborate on what you mean by a single SparkContext on a single JVM?
It means that at any given time in the lifecycle of a Spark application the driver can only be one and only one which in turn means that there's one and only one SparkContext on that JVM available.
The driver of a Spark application is where the SparkContext lives (or it's the opposite rather where SparkContext defines the driver -- the distinction is pretty much blurry).
You can only have one SparkContext at one time. Although you can start and stop it on demand as many times you want, but I remember an issue about it that said you should not close SparkContext unless you're done with Spark (which usually happens at the very end of your Spark application).
In other words, have a single SparkContext for the entire lifetime of your Spark application.
There was a similar question What's the difference between SparkSession.sql vs Dataset.sqlContext.sql? about multiple SparkSessions that can shed more light on why you'd want to have two or more sessions.
I was able call sparkSession.sparkContext().stop(), and also stop the SparkSession.
So?! How does this contradict what I said?! You stopped the only SparkContext available on the JVM. Not a big deal. You could, but that's just one part of "you can only have one and only one SparkContext on a single JVM available", isn't it?
SparkSession is a mere wrapper around SparkContext to offer Spark SQL's structured/SQL features on top of Spark Core's RDDs.
From the point of Spark SQL developer, the purpose of a SparkSession is to be a namespace for query entities like tables, views or functions that your queries use (as DataFrames, Datasets or SQL) and Spark properties (that could have different values per SparkSession).
If you'd like to have the same (temporary) table name used for different Datasets, creating two SparkSessions would be what I'd consider the recommended way.
I've just worked on an example to showcase how whole-stage codegen works in Spark SQL and have created the following that simply turns the feature off.
// both where and select operators support whole-stage codegen
// the plan tree (with the operators and expressions) meets the requirements
// That's why the plan has WholeStageCodegenExec inserted
// You can see stars (*) in the output of explain
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
scala> q.explain
== Physical Plan ==
*Project [_2#89 AS c0#93]
+- *Filter (_1#88 = 0)
+- LocalTableScan [_1#88, _2#89, _3#90]
// Let's break the requirement of having up to spark.sql.codegen.maxFields
// I'm creating a brand new SparkSession with one property changed
val newSpark = spark.newSession()
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_MAX_NUM_FIELDS
newSpark.sessionState.conf.setConf(WHOLESTAGE_MAX_NUM_FIELDS, 2)
scala> println(newSpark.sessionState.conf.wholeStageMaxNumFields)
2
// Let's see what's the initial value is
// Note that I use spark value (not newSpark)
scala> println(spark.sessionState.conf.wholeStageMaxNumFields)
100
import newSpark.implicits._
// the same query as above but created in SparkSession with WHOLESTAGE_MAX_NUM_FIELDS as 2
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
// Note that there are no stars in the output of explain
// No WholeStageCodegenExec operator in the plan => whole-stage codegen disabled
scala> q.explain
== Physical Plan ==
Project [_2#122 AS c0#126]
+- Filter (_1#121 = 0)
+- LocalTableScan [_1#121, _2#122, _3#123]
I then created a new SparkSession and used a new SparkContext. No error was thrown.
Again, how does this contradict what I said about a single SparkContext being available? I'm curious.
What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?
You can no longer use it to run Spark jobs (to process large and distributed datasets) which is pretty much exactly the reason why you use Spark in the first place, doesn't it?
Try the following:
Stop SparkContext
Execute any processing using Spark Core's RDD or Spark SQL's Dataset APIs
An exception? Right! Remember that you close the "doors" to Spark so how could you have expected to be inside?! :)

Why can't we create an RDD using Spark session

We see that,
Spark context available as 'sc'.
Spark session available as 'spark'.
I read spark session includes spark context, streaming context, hive context ... If so, then why are we not able to create an rdd by using a spark session instead of a spark context.
scala> val a = sc.textFile("Sample.txt")
17/02/17 16:16:14 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
a: org.apache.spark.rdd.RDD[String] = Sample.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val a = spark.textFile("Sample.txt")
<console>:23: error: value textFile is not a member of org.apache.spark.sql.SparkSession
val a = spark.textFile("Sample.txt")
As shown above, sc.textFile succeeds in creating an RDD but not spark.textFile.

In Spark 2+, Spark Context is available via Spark Session, so all you need to do is:
spark.sparkContext().textFile(yourFileOrURL)
see the documentation on this access method here.
Note that in PySpark this would become:
spark.sparkContext.textFile(yourFileOrURL)
see the documentation here.

In earlier versions of spark, spark context was entry point for Spark. As RDD was main API, it was created and manipulated using context API’s.
For every other API,we needed to use different contexts.For streaming, we needed StreamingContext, for SQL sqlContext and for hive HiveContext.
But as DataSet and Dataframe API’s are becoming new standard API’s Spark need an entry point build for them. So in Spark 2.0, Spark have a new entry point for DataSet and Dataframe API’s called as Spark Session.
SparkSession is essentially combination of SQLContext, HiveContext and future StreamingContext.
All the API’s available on those contexts are available on spark session also. Spark session internally has a spark context for actual computation.
sparkContext still contains the method which it had in previous
version .
methods of sparkSession can be found here

It can be created in the following way-
val a = spark.read.text("wc.txt")
This will create a dataframe,If you want to convert it to RDD then use-
a.rdd Please refer the link below,on dataset API-
http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Dataset.html

Spark SQL Stackoverflow

I'm a newbie on spark and spark sql and I was trying to make the example that is on Spark SQL website, just a simple SQL query after loading the schema and data from a JSON files directory, like this:
import sqlContext.createSchemaRDD
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "/home/shaza90/Desktop/tweets_1428981780000"
val tweet = sqlContext.jsonFile(path).cache()
tweet.registerTempTable("tweet")
tweet.printSchema() //This one works fine
val texts = sqlContext.sql("SELECT tweet.text FROM tweet").collect().foreach(println)
The exception that I'm getting is this one:
java.lang.StackOverflowError
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
Update
I'm able to execute select * from tweet but whenever I use a column name instead of * I get the error.
Any Advice?

This is SPARK-5009 and has been fixed in Apache Spark 1.3.0.
The issue was that to recognize keywords (like SELECT) with any case, all possible uppercase/lowercase combinations (like seLeCT) were generated in a recursive function. This recursion would lead to the StackOverflowError you're seeing, if the keyword was long enough and the stack size small enough. (This suggests that if upgrading to Apache Spark 1.3.0 or later is not an option, you can use -Xss to increase the JVM stack size as a workaround.)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string