Cannot trigger skew join optimization of AQE in Spark 3.0.0 - apache-spark

from pyspark.sql.functions import *
spark.conf.set("spark.sql.autoBroadcastJoinThreshold","-1")
spark.conf.set("spark.sql.shuffle.partitions","3")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes","-1")
df1 = spark.range(10000000000).withColumn("id",lit("x"))
extravalues=spark.range(4).withColumn("id",lit("y"))
more = spark.range(4).withColumn("id",lit("z"))
df1=df1.union(extravalues).union(more)
df2 = spark.range(1000000).withColumn("id",lit("x"))
df2_extra = spark.range(10).withColumn("id",lit("y"))
df2_more = spark.range(10).withColumn("id",lit("z"))
df2=df2.union(df2_extra).union(df2_more)
output = df1.join(df2,df1.id==df2.id).select(df1.id)
output.write.parquet('s3a://...',mode='overwrite')
I also tried setting spark.sql.adaptive.skewJoin.skewedPartitionFactor = 2 and spark.sql.adaptive.localShuffleReader.enabled = true, but the skew join optimization still failed to trigger.

AQE is not enabled by default in Spark 3.0.0; you have to turn it on explicitly:
spark.sql.adaptive.enabled=true
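For example, here is a minimal sketch (reusing the session and DataFrames from the question; the threshold values are illustrative, not recommendations) that turns on AQE and its skew-join handling before running the join:

# Enable adaptive query execution and skew-join handling (illustrative values)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition is treated as skewed only if it is both larger than this threshold
# and skewedPartitionFactor times larger than the median partition size
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "10MB")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "2")
output = df1.join(df2, df1.id == df2.id).select(df1.id)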

Related

Execute PySpark code from a Java/Scala application

Is there a way to execute PySpark code from a Java/Scala application on an existing SparkSession?
Specifically, given PySpark code that receives and returns a PySpark DataFrame, is there a way to submit it to a Java/Scala SparkSession and get back the output DataFrame?
String pySparkCode = "def my_func(input_df):\n" +
    "    from pyspark.sql.functions import *\n" +
    "    return input_df.selectExpr(...)\n" +
    "        .drop(...)\n" +
    "        .withColumn(...)\n";
SparkSession spark = SparkSession.builder().master("local").getOrCreate();
Dataset<Row> inputDF = spark.sql("SELECT * from my_table");
Dataset<Row> outputDf = spark.<SUBMIT_PYSPARK_METHOD>(pySparkCode, inputDF);

LSH in Scala and Python API

I was following this SO post, Efficient string matching in Apache Spark, to do some string matching using the LSH algorithm. For some reason I get results through the Python API, but not in Scala, and I can't see what is missing in the Scala code.
Here are both versions of the code:
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH

query = spark.createDataFrame(["Bob Jones"], "string").toDF("text")
db = spark.createDataFrame(["Tim Jones"], "string").toDF("text")

model = Pipeline(stages=[
    RegexTokenizer(
        pattern="", inputCol="text", outputCol="tokens", minTokenLength=1
    ),
    NGram(n=3, inputCol="tokens", outputCol="ngrams"),
    HashingTF(inputCol="ngrams", outputCol="vectors"),
    MinHashLSH(inputCol="vectors", outputCol="lsh")
]).fit(db)

db_hashed = model.transform(db)
query_hashed = model.transform(query)

model.stages[-1].approxSimilarityJoin(db_hashed, query_hashed, 0.75).show()
And it returns:
+--------------------+--------------------+-------+
|            datasetA|            datasetB|distCol|
+--------------------+--------------------+-------+
|[Tim Jones, [t, i...|[Bob Jones, [b, o...|    0.6|
+--------------------+--------------------+-------+
However Scala returns nothing, and here is the code:
import org.apache.spark.ml.feature.RegexTokenizer
val tokenizer = new RegexTokenizer().setPattern("").setInputCol("text").setMinTokenLength(1).setOutputCol("tokens")
import org.apache.spark.ml.feature.NGram
val ngram = new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams")
import org.apache.spark.ml.feature.HashingTF
val vectorizer = new HashingTF().setInputCol("ngrams").setOutputCol("vectors")
import org.apache.spark.ml.feature.{MinHashLSH, MinHashLSHModel}
val lsh = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh")
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, ngram, vectorizer, lsh))
val query = Seq("Bob Jones").toDF("text")
val db = Seq("Tim Jones").toDF("text")
val model = pipeline.fit(db)
val dbHashed = model.transform(db)
val queryHashed = model.transform(query)
model.stages.last.asInstanceOf[MinHashLSHModel].approxSimilarityJoin(dbHashed, queryHashed, 0.75).show
I am using Spark 3.0. I know this is just a test, but I can't really test it on a different version, and I doubt there is a bug like that :)
This code will work in Spark 3.0.1 if you set numHashTables correctly.
val lsh = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh").setNumHashTables(3)
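For reference, the same parameter exists in the PySpark API as well; here is a minimal sketch of the pipeline from the question with numHashTables set explicitly (not the original poster's code):

model = Pipeline(stages=[
    RegexTokenizer(pattern="", inputCol="text", outputCol="tokens", minTokenLength=1),
    NGram(n=3, inputCol="tokens", outputCol="ngrams"),
    HashingTF(inputCol="ngrams", outputCol="vectors"),
    # more hash tables lower the chance of false negatives in approxSimilarityJoin
    MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=3)
]).fit(db)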

Process a 1/2 billion rows with PySpark creates shuffle read problems

I am apparently facing a shuffle read problem.
My PySpark script runs on a Hadoop cluster with 1 edge node and 12 data nodes, using YARN as the resource manager and Spark 1.6.2.
### [ini file containing the Spark conf]
spark.app.name = MY_PYSPARK_APP
spark.master = yarn-client
spark.yarn.queue = agr_queue
spark.executor.instances = 24
spark.executor.memory = 14g
spark.executor.cores = 3
#spark.storage.memoryFraction = 0.5
#spark.sql.shuffle.partitions = 2001
#spark.sql.shuffle.partitions = 1000
spark.sql.shuffle.partitions = 100
spark.shuffle.memoryFraction=0.5
spark.memory.offHeap.enabled = True
spark.serializer = org.apache.spark.serializer.KryoSerializer
#spark.driver.memory = 14g
spark.driver.maxResultSize = 20g
spark.python.worker.memory = 14g
spark.akka.heartbeat.interval = 100
spark.yarn.executor.memoryOverhead=2000
spark.yarn.driver.memoryOverhead=2000
spark.scheduler.mode = FIFO
spark.sql.tungsten.enabled = True
spark.default.parallelism = 200
spark.speculation = True
spark.speculation.interval = 1000ms
spark.speculation.multiplier = 2.0
Python script
sconf = SparkConf()
sc = SparkContext(conf=sconf)
hctx = HiveContext(sc)

dataframe1 = hctx.sql("SELECT * FROM DB1.TABLE1")
dataframe2 = hctx.sql("SELECT * FROM DB2.TABLE2")
df = dataframe1.join(dataframe2, conditions)

# No major problem at this count()
# it returns 550 000 000 rows
df.count()

# 288 elements in List_dtm_t
List_dtm_t = ['00:00:00', '00:05:00', ... '23:45:00', '23:50:00', '23:55:00']
dat_tm_bdcst = sc.broadcast(List_dtm_t)

global dat_tm_bdcst
def mapper(row):
    import datetime
    def ts_minus_5(tmstmp):
        import datetime
        return tmstmp - datetime.timedelta(minutes=5)
    lst_tuple = ()
    poids = row[9]
    for dtm in dat_tm_bdcst.value:
        t_minus = ts_minus_5(dtm)
        if (row[0] <= dtm) & (row[1] > t_minus):
            v1 = str(dtm)
            v2 = str(t_minus)
            v3 = row[2]
            v4 = row[3]
            v5 = row[4]
            v6 = row[5]
            v7 = row[6]
            v8 = row[7]
            v9 = row[8]
            v10 = row[10]
            v11 = poids * (min(dtm, row[1]) - max(t_minus, row[0])).total_seconds()
            v12 = poids
            if row[0] <= dtm <= row[1]:
                v13 = poids
            else:
                v13 = 0
            lst_tuple += (((v1, v2, v3, v4, v5, v6, v7, v8, v9, v10), (v11, v12, v13)),)
    return lst_tuple

global list_to_row
def list_to_row(keys, values):
    from pyspark.sql import Row
    row_dict = dict(zip(keys, values[0] + values[1]))
    return Row(**row_dict)

f_reduce = lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2])

# This flatMap takes an extremely long time
# It generally ends in failure because it retries more than 3 times
# or loses some shuffle paths
mapped_df = df.limit(10000000)\
    .flatMap(mapper)
reduced_rdd = mapped_df.reduceByKey(f_reduce)
reduced_rdd.count()

# header: list of column names for the output rows (defined elsewhere)
list_of_rows = reduced_rdd.map(lambda x: list_to_row(header, x))
df_to_exp = hctx.createDataFrame(list_of_rows)
## register df_to_exp as a tempTable, then write it into Hive
I tried different approaches:
- Resolving the skew problem by using repartition([keys]) to distribute the data by the keys used by the reducer (see the sketch after this list)
- Different values for spark.sql.shuffle.partitions, spark.default.parallelism and the memoryOverhead settings
- A partial DataFrame version using groupBy
- Persisting the data, even though I only pass over it once
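As an illustration of the first attempt, the repartition looked roughly like this (a hypothetical sketch: the key columns and partition counts are placeholders, since the real schema is not shown in the question):

# Hypothetical sketch: spread the reducer's keys over more partitions
df = dataframe1.join(dataframe2, conditions)
df = df.repartition(2001, "key_col1", "key_col2")   # placeholder key columns
mapped_df = df.limit(10000000).flatMap(mapper)
# reduceByKey also accepts an explicit partition count to raise reduce parallelism
reduced_rdd = mapped_df.reduceByKey(f_reduce, numPartitions=2001)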
I am looking for a solution that lets the job reach the end and also speeds up the process.
Two screenshots of the Spark UI:
List of Stages
ReduceByKey Task
We can see the ReduceByKey stage (I don't know whether it represents only the reduce task; it shows only 1 task?!)
and the shuffle read/records counter, which increases far too slowly (300,000 out of 100 million after 13 minutes).
I hope someone can help,
Thanks!

Use VectorAssembler and extract "features" as org.apache.spark.mllib.linalg.Vectors in Spark Scala

I want to use the Gaussian Mixture Model in Spark 1.5.1, which expects an RDD of org.apache.spark.mllib.linalg.Vector.
This is my code:
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.clustering.GaussianMixtureModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrameNaFunctions

dummy = dummy.na.drop
var colnames = dummy.columns
var df = dummy
for (x <- colnames) {
  if (dummy.select(x).dtypes(0)._2.equals("StringType") || dummy.select(x).dtypes(0)._2.equals("LongType")) {
    df = df.drop(x)
  }
}
var colnames = df.columns
var assembler = new VectorAssembler().setInputCols(colnames).setOutputCol("features")
var output = assembler.transform(df)
var temp = output.select("features")
The problem is that I am not able to convert the features column into an RDD of org.apache.spark.mllib.linalg.Vector.
Does anyone have an idea how to do this?
Spark >= 2.0
Either map:
temp.rdd.map(_.getAs[org.apache.spark.ml.linalg.Vector]("features"))
or use as:
temp
.select("features")
.as[Tuple1[org.apache.spark.ml.linalg.Vector]]
.rdd.map(_._1)
Spark < 2.0
Just map over RDD[Row] and extract the field:
temp.rdd.map(_.getAs[org.apache.spark.mllib.linalg.Vector]("features"))

How to create InputDStream with offsets in PySpark (using KafkaUtils.createDirectStream)?

How to use KafkaUtils.createDirectStream with the offsets for a particular topic in PySpark?
If you want to create an RDD from records in a Kafka topic, use a static set of offset ranges.
First, make all the necessary imports available:
from pyspark.streaming.kafka import KafkaUtils, OffsetRange
Then you create a dictionary of Kafka Brokers
kafkaParams = {"metadata.broker.list": "host1:9092,host2:9092,host3:9092"}
Then you create your offsets object
start = 0
until = 10
partition = 0
topic = 'topic'
offset = OffsetRange(topic,partition,start,until)
offsets = [offset]
Finally you create the RDD:
kafkaRDD = KafkaUtils.createRDD(sc, kafkaParams,offsets)
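A minimal usage sketch (assuming the messages are UTF-8 strings; with the default decoders createRDD returns (key, message) pairs):

messages = kafkaRDD.map(lambda kv: kv[1])   # keep only the message payload
print(messages.take(10))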
To create a stream with offsets, you need to do the following:
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
from pyspark.streaming import StreamingContext
Then you create your StreamingContext using your SparkContext:
ssc = StreamingContext(sc, 1)
Next we set up all of our parameters
kafkaParams = {"metadata.broker.list": "host1:9092,host2:9092,host3:9092"}
start = 0
partition = 0
topic = 'topic'
Then we create our fromOffsets dictionary:
topicPartion = TopicAndPartition(topic, partition)
fromOffset = {topicPartion: long(start)}  # note that we must cast the int to long
Finally we create the Stream
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams,
                                                  fromOffsets=fromOffset)
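As a hedged usage sketch (assuming you only want to inspect the consumed messages), you would then start the streaming context:

directKafkaStream.map(lambda kv: kv[1]).pprint()   # print a sample of each batch's messages
ssc.start()
ssc.awaitTermination()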
You can do:
from pyspark.streaming.kafka import TopicAndPartition
topic = "test"
brokers = "localhost:9092"
partition = 0
start = 0
topicpartion = TopicAndPartition(topic, partition)
fromoffset = {topicpartion: int(start)}
kafkaDStream = KafkaUtils.createDirectStream(spark_streaming, [topic], \
    {"metadata.broker.list": brokers}, fromOffsets=fromoffset)
Note: Spark 2.2.0, Python 3.6
