I am doing a self-join in my Spark application on one of the tables. The logic works fine when I don't set spark.sql.autoBroadcastJoinThreshold=-1, but it gives very strange results when I do set spark.sql.autoBroadcastJoinThreshold=-1 in my Spark config.
I am using Spark version 3.0.1 and running it on EMR 6.2.1.
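For reference, a stripped-down sketch of the kind of thing I mean (the table and column names here are made up, not my real ones):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Disable automatic broadcast joins for the whole session
# (same effect as passing spark.sql.autoBroadcastJoinThreshold=-1 in the Spark config)
spark = (
    SparkSession.builder
    .appName("self-join-example")
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)

df = spark.table("my_table")

# Alias both sides of the self-join so the columns can be disambiguated
left = df.alias("l")
right = df.alias("r")

joined = left.join(right, F.col("l.id") == F.col("r.parent_id"))
joined.select(F.col("l.id"), F.col("r.id")).show()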
Related
I have databricks-connect 6.6.0 installed, which comes with Spark 2.4.6. I have been using the Databricks cluster up to now, but I am trying to switch to a local Spark session for unit testing.
However, every time I run it, the job still shows up on the cluster Spark UI as well as on the local Spark UI at xxxxxx:4040.
I have tried initializing it with SparkConf(), SparkContext(), and SQLContext(), but they all do the same thing. I have also set the correct SPARK_HOME, HADOOP_HOME, and JAVA_HOME, downloaded winutils.exe separately, and none of these directories contain spaces. I have also tried running it from the console as well as from the terminal using spark-submit.
This is one of the pieces of sample code I tried:
from pyspark.sql import SparkSession

# Create (or reuse) a local session, build a small DataFrame, and convert it to pandas
spark = SparkSession.builder.master("local").appName("name").getOrCreate()
inp = spark.createDataFrame([('Person1', 12), ('Person2', 14)], ['person', 'age'])
op = inp.toPandas()
I am using:
Windows 10, databricks-connect 6.6.0, Spark 2.4.6, JDK 1.8.0_265, Python 3.7, PyCharm Community 2020.1.1
Do I have to override the default/global spark session to initiate a local one? How would I do that?
I might be missing something; the code itself runs fine, it's just a matter of running locally vs. on the cluster.
TIA
You can't run them side by side. I recommend having two virtual environments using Conda: one for databricks-connect, one for pyspark. Then just switch between the two as needed.
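In the pyspark-only environment, something along these lines should then give you a purely local session for unit tests (a rough sketch; the app name is arbitrary):

from pyspark.sql import SparkSession

# With no databricks-connect on the path, this builds a plain local session
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("unit-tests")
    .getOrCreate()
)

# Sanity check that the session is really local and not pointing at the cluster
assert spark.sparkContext.master.startswith("local")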
I built a 6-node Raspberry Pi cluster running Hadoop (HDFS) and Spark. I originally had Spark 2.4.3, Hadoop 3.2.1, and Scala 2.11 working like a charm. However, I recently upgraded Spark to 3.0.1 and Scala to 2.12; I left Hadoop alone.
When I run DataFrame operations everything works like a charm; however, when I try to run Spark SQL commands they end up erroring out, e.g. spark.sql("select count(*) from mytable"). The data set I am using is trivially small (a few KB), yet the Spark SQL query errors out. If I do the same operations using the DataFrame API syntax it works like a charm.
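To be concrete, the two forms I am comparing look roughly like this (the table name and path are just placeholders, not my real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()

df = spark.read.parquet("hdfs:///data/mytable")   # placeholder path
df.createOrReplaceTempView("mytable")

# DataFrame API form: works fine
print(df.count())

# Spark SQL form: errors out on the cluster
print(spark.sql("select count(*) from mytable").collect())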
I've attached the stderr output (link to a text file on Google Drive) for one such Spark SQL error. Any and all help would be greatly appreciated.
> 3.0.1 stderr log page for app-20201030212527-0001/2
Can you please let me know the steps to connect the Scala IDE I use for Spark development to Hive? Currently the output goes to HDFS and then I create an external table on top of it. But since Spark Streaming creates small files, performance is getting bad, and I want Spark to write directly to Hive instead. I am not sure what configuration I need on my PC for that to happen in my development setup.
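What I am after is roughly something like this (sketched in PySpark just for illustration; the metastore URI and table name are placeholders):

from pyspark.sql import SparkSession

# Point the session at the Hive metastore so writes go straight into Hive tables
# (the metastore URI below is a placeholder for the real one)
spark = (
    SparkSession.builder
    .appName("write-to-hive")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.range(10)   # stand-in for the streaming output
df.write.mode("append").saveAsTable("mydb.mytable")   # placeholder database.table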
I am trying to get access to HDFS files in Spark. Everything works fine when I run Spark in local mode, i.e.
SparkSession.master("local")
and get access to HDFS files by
hdfs://localhost:9000/$FILE_PATH
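For example, something along these lines works fine in local mode (shown in PySpark just for illustration; the file path is a placeholder):

from pyspark.sql import SparkSession

# Local-mode session that can read straight from the local HDFS namenode
spark = (
    SparkSession.builder
    .master("local")
    .appName("hdfs-local-read")
    .getOrCreate()
)

lines = spark.read.text("hdfs://localhost:9000/path/to/file.txt")
lines.show(5)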
But when I am trying to run Spark in standalone cluster mode, i.e.
SparkSession.master("spark://$SPARK_MASTER_HOST:7077")
the following error is thrown:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
So far I have only run
start-dfs.sh
in Hadoop and have not really configured anything in Spark. Do I need to run Spark with the YARN cluster manager instead, so that Spark and Hadoop use the same cluster manager and Spark can access the HDFS files?
I have tried configuring yarn-site.xml in Hadoop following the tutorialspoint guide https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm and specified HADOOP_CONF_DIR in spark-env.sh, but it does not seem to work and the same error is thrown. Am I missing some other configuration?
Thanks!
EDIT
The initial Hadoop version is 2.8.0 and the Spark version is 2.1.1 built for Hadoop 2.7. I tried downloading hadoop-2.7.4, but the same error still occurs.
The question here suggests this is a Java syntax issue rather than a Spark/HDFS issue. I will try this approach and see if it solves the error here.
Inspired by the post here, I solved the problem myself.
The map-reduce job depends on a Serializable class, so when running in Spark local mode this class can be found on the classpath and the job executes correctly.
When running in Spark standalone cluster mode, it is best to submit the application through spark-submit rather than running it from an IDE. I packaged everything into a jar and submitted it with spark-submit, and it works like a charm!
I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.
Following the instructions here https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started doesn't really help; I keep getting Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable errors when I set hive.execution.engine=spark.
Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.
This will probably not work with the jars in a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (-Phive), which is not suggested or supported by Hive on Spark.
If you really want to run Hive on Spark, you might want to try bringing your own Spark, compiled as described in the wiki, in an initialization action.
If you just want to move Hive off MapReduce on Dataproc, running Tez with this initialization action would probably be easier.