RDD of DataFrame cannot change partition number in Spark Structured Streaming (PySpark)

I have the following PySpark code in Spark Structured Streaming, which gets a DataFrame from Redis:
def process(stream_batch, batch_id):
    stream_batch.persist()
    length = stream_batch.count()
    record_rdd = stream_batch.rdd.map(lambda x: b_to_ndarray(x['data']))
    # b_to_ndarray is a single-threaded helper that converts the bytes stored in Redis to an ndarray
    record_rdd = record_rdd.coalesce(4)   # does not work
    print(record_rdd.getNumPartitions())  # prints 1
    # some other code
Why? How can I fix it? The code in main is:
loadedDf = spark.readStream.format('redis')...
query = loadedDf.writeStream \
    .foreachBatch(process) \
    .start()
query.awaitTermination()

Since the batch DataFrame starts out with a single partition, coalesce cannot help: coalesce only merges partitions into fewer ones, it never creates more. So no matter what number you pass, the result stays at 1 partition. Use repartition instead if you need to increase the partition count.
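If you actually need more partitions, a minimal sketch of the foreachBatch handler (reusing the question's b_to_ndarray helper) could look like this:

def process(stream_batch, batch_id):
    stream_batch.persist()

    # repartition() can both increase and decrease the partition count,
    # so the b_to_ndarray conversion below can run as 4 parallel tasks
    record_rdd = stream_batch.rdd.repartition(4) \
                                 .map(lambda x: b_to_ndarray(x['data']))
    print(record_rdd.getNumPartitions())  # now prints 4

    # ... rest of the batch processing ...
    stream_batch.unpersist()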

Related

How to process a large delta table with UDF?

I have a Delta table with about 300 billion rows. I am performing some operations on a column using a UDF and creating another column.
My code is something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_udf(data):
    pass  # placeholder for the real per-value transformation
udf_func = udf(my_udf, StringType())

data = spark.sql("""SELECT * FROM large_table""")
data = data.withColumn('new_column', udf_func(data.value))
The issue is that this takes a very long time, because Spark processes all 300 billion rows before writing any output. Is there a way to do some micro-batching and write the output of each batch to the destination Delta table regularly?
The first rule is usually to avoid UDFs as much as possible: what kind of transformation do you need that isn't available in Spark itself?
Second rule: if you can't avoid a UDF, at least use a Pandas UDF, which processes data in batches and has much lower serialization/deserialization overhead. Plain Python UDFs handle data row by row, encoding and decoding every single value.
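For illustration, a hedged sketch of what a Pandas UDF replacement could look like (the body is a placeholder, since the question doesn't show the real transformation; large_table and the value column come from the question):

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def my_pandas_udf(values: pd.Series) -> pd.Series:
    # placeholder: apply the real per-value logic to a whole batch of values at once
    return values.astype(str)

data = spark.sql("SELECT * FROM large_table")
data = data.withColumn('new_column', my_pandas_udf(data.value))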
If your table was built up over time and consists of many files, you can try Spark Structured Streaming with Trigger.AvailableNow (requires DBR 10.3 or 10.4), something like this:
maxNumFiles = 10  # max number of parquet files processed at once
df = spark.readStream \
    .option("maxFilesPerTrigger", maxNumFiles) \
    .table("large_table")
df = df.withColumn('new_column', udf_func(df.value))
df.writeStream \
    .option("checkpointLocation", "/some/path") \
    .trigger(availableNow=True) \
    .toTable("my_destination_table")
This will read the source table chunk by chunk, apply your transformation, and write the data into the destination table.

PySpark + AWS EMR: df.count() taking a long time to complete

I am using the action count() to trigger my UDF to run. This works, but long after my UDF appears to have finished running, df.count() is still going and takes days to complete. The DataFrame itself is not large, about 30k to 100k rows.
AWS Cluster Settings:
1 m5.4xlarge for the master node
2 m5.4xlarge for the worker nodes.
Spark Variables & Settings (These are the spark variables being used to run the script)
--executor-cores 4
--conf spark.sql.execution.arrow.enabled=true
'spark.sql.inMemoryColumnarStorage.batchSize', 2000000 (set inside pyspark script)
Pseudo code
Here is the actual structure of our script. The custom pandas UDF makes a call to our Postgres database for every row.
from pyspark.sql.functions import pandas_udf, PandasUDFType

# udf_schema: a function that returns the schema for the dataframe

def main():
    # Define the pandas udf for the calculation.
    # To perform this calculation, every row in the
    # dataframe needs information pulled from our Postgres DB,
    # which does take some time, ~2-3 hours
    @pandas_udf(udf_schema(), PandasUDFType.GROUPED_MAP)
    def calculate_values(local_df):
        local_df = run_calculation(local_df)
        return local_df

    # custom function that pulls data from our database and
    # creates the dataframe
    df = get_df()

    df = df\
        .groupBy('some_unique_id')\
        .apply(calculate_values)

    print(f'==> finished running calculation for {df.count()} rows!')
    return
I met the same issue as yours, but in my case it was due to a limitation of the JDBC read, not the cluster itself. So if, like me, you are using JDBC to access a database such as Impala or Postgres, you can try the following:
df = spark.read.option("numPartitions", 100).option("partitionColumn", $COLUMN_NAME).jdbc($THE_JDBC_SETTING)
instead of
df = spark.read.jdbc($THE_JDBC_SETTING)
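Note that when partitionColumn is used, Spark's JDBC reader also needs lowerBound and upperBound so it can split the column range into numPartitions queries. A hedged sketch of the full call (the URL, table, column and bounds are placeholders):

# Parallel JDBC read: Spark issues 100 range queries, one per partition,
# instead of a single huge single-threaded query.
df = spark.read.jdbc(
    url="jdbc:postgresql://db-host:5432/mydb",  # placeholder connection string
    table="my_table",                           # placeholder table name
    column="id",                                # numeric partition column (placeholder)
    lowerBound=1,                               # min value of the partition column
    upperBound=10000000,                        # max value of the partition column
    numPartitions=100,
    properties={"user": "...", "password": "..."},
)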

Spark streaming performance issue: every minute the processing time increases

I am facing a performance issue in my streaming application. I read data from a Kafka topic using DirectStream and convert it into a DataFrame. After doing some aggregation on the DataFrame, I save the result with registerTempTable. That temp table is used for comparison against the next minute's DataFrame; the compared result is saved to HDFS, and the data in the existing temp table is overwritten.
The problem: the first minute's batch runs in 15 seconds, the second in 18 seconds, the third in 20 seconds, and the processing time keeps increasing like that. Over time my streaming batches end up queued.
About my application:
Streaming runs every 60 seconds.
Spark version 2.1.1 (I am using PySpark).
The Kafka topic has four partitions.
To solve the issue, I have tried the steps below.
Step 1: while submitting my job I set spark.sql.shuffle.partitions=4.
Step 2: while saving my dataframe as a text file I use coalesce(4).
When I check the Spark UI after every minute's run, the save-as-file stages keep doubling: 14 stages in the first minute, 28 in the second, 42 in the third.
Spark UI Result.
Hi,
Thanks for the reply.
Sorry, I am new to Spark and not sure what exactly I need to change; can you please help me? Do I need to unpersist my "df_data"?
I already cache and unpersist the "df_data" DataFrame, but I still face the same issue.
Do I need to enable checkpointing, e.g. by adding ssc.checkpoint("/user/test/checkpoint")? Does createDirectStream support checkpointing, or do I need to manage the offsets instead? Please let me know what change is required here.
import pyspark.sql.functions as func
from pyspark import SparkContext
from pyspark.sql import Row, SQLContext
from pyspark.sql.types import DoubleType, LongType, StringType, StructField, StructType
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def CSV(rdd1):
    spark = getSparkSessionInstance(rdd1.context.getConf())
    psv_data = rdd1.map(lambda l: l.strip("\s").split("|"))
    data_1 = psv_data.map(lambda l: Row(A=l[0], B=l[1], C=l[2], D=l[3]))
    hasattr(data_1, "toDF")
    df_2 = data_1.toDF()
    # first time this will be empty; from the next minute onwards it has values
    df_last_min_data = sqlCtx.sql("select A, B, C, D from streaming_tbl")
    df_data = df_2.groupby(['A', 'B']).agg(func.sum('C').alias('C'), func.sum('D').alias('D'))
    df_data.registerTempTable("streaming_tbl")
    con = (df_data.A == df_last_min_data.A) & (df_data.B == df_last_min_data.B)
    path1 = "/user/test/str" + str(Starttime).replace(" ", "").replace("-", "").replace(":", "")
    df_last_min_data.join(df_data, con, "inner") \
        .select(df_last_min_data.A, df_last_min_data.B, df_data.C, df_data.D) \
        .write.csv(path=path1, mode="append")

if __name__ == "__main__":
    sc = SparkContext(appName="PythonSqlNetworkWordCount")
    sc.setLogLevel("ERROR")
    sqlCtx = SQLContext(sc)
    ssc = StreamingContext(sc, 60)
    zkQuorum = {"metadata.broker.list": "m2.hdp.com:9092"}
    topic = ["kpitmumbai"]
    kvs = KafkaUtils.createDirectStream(ssc, topic, zkQuorum)
    schema = StructType([StructField('A', StringType(), True),
                         StructField('B', LongType(), True),
                         StructField('C', DoubleType(), True),
                         StructField('D', LongType(), True)])
    first_empty_df = sqlCtx.createDataFrame(sc.emptyRDD(), schema)
    first_empty_df.registerTempTable("streaming_tbl")
    lines = kvs.map(lambda x: x[1])
    lines.foreachRDD(lambda rdd: empty_rdd() if rdd.count() == 0 else CSV(rdd))
    ssc.start()
    ssc.awaitTermination()
Once again thanks for the reply.
After doing some aggregation on the DataFrame, I save the result with registerTempTable. That temp table is used for comparison against the next minute's DataFrame; the compared result is saved to HDFS, and the data in the existing temp table is overwritten.
The most likely problem is that you don't checkpoint the data, so the lineage keeps growing with each iteration. That makes each iteration more and more expensive, especially when the data is not cached.
Overall, if you need stateful operations you should prefer the existing stateful transformations. Both the "old" DStream API and Structured Streaming come with their own variants, which cover a variety of scenarios.
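To make the checkpointing suggestion concrete, a minimal sketch based on the code above, assuming Spark 2.1+ where DataFrame.checkpoint is available:

# set once, before the streaming job starts
sc.setCheckpointDir("/user/test/checkpoint")

# inside the per-batch function, after computing the aggregation:
df_data = df_2.groupby(['A', 'B']).agg(func.sum('C').alias('C'), func.sum('D').alias('D'))

# checkpoint() materializes the result and truncates the logical plan / RDD lineage,
# so the next minute's join does not drag along every previous batch
df_data = df_data.checkpoint(eager=True)
df_data.registerTempTable("streaming_tbl")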

Spark streaming data sharing between batches

Spark Streaming processes data in micro-batches.
Each interval's data is processed in parallel using RDDs, without any data sharing between intervals.
But my use case needs to share data between intervals.
Consider the network word count example, which produces the count of all words received in each interval.
How would I produce the following word count?
A relative count for the words "hadoop" and "spark", i.e. the current count minus the previous interval's count.
A normal word count for all other words.
Note: updateStateByKey does stateful processing, but it applies the function to every record instead of only particular records.
So updateStateByKey doesn't fit this requirement.
Update:
consider the following example
Interval-1
Input:
Sample Input with Hadoop and Spark on Hadoop
output:
hadoop 2
sample 1
input 1
with 1
and 1
spark 1
on 1
Interval-2
Input:
Another Sample Input with Hadoop and Spark on Hadoop and another hadoop another spark spark
output:
another 3
hadoop 1
spark 2
and 2
sample 1
input 1
with 1
on 1
Explanation:
1st interval gives the normal word count of all words.
In the 2nd interval hadoop occurred 3 times but the output should be 1 (3-2)
Spark occurred 3 times but the output should be 2 (3-1)
For all other words it should give the normal word count.
So, while processing 2nd Interval data, it should have the 1st interval's word count of hadoop and spark
This is a simple example with illustration.
In the actual use case, the fields that need data sharing are part of the RDD elements, and a huge number of values needs to be tracked.
I.e., instead of just the hadoop and spark keywords in this example, nearly 100k keywords need to be tracked.
Similar use cases in Apache Storm:
Distributed caching in storm
Storm TransactionalWords
This is possible by "remembering" the last RDD received and using a left join to merge that data with the next streaming batch. We make use of streamingContext.remember to enable RDDs produced by the streaming process to be kept for the time we need them.
We make use of the fact that dstream.transform is an operation that executes on the driver and therefore we have access to all local object definitions. In particular we want to update the mutable reference to the last RDD with the required value on each batch.
Probably a piece of code makes that idea more clear:
// configure the streaming context to remember the RDDs produced
// choose at least 2x the time of the streaming interval
ssc.remember(Seconds(xx))

// Initialize "currentData" with an empty RDD of the expected type
var currentData: RDD[(String, Int)] = sparkContext.emptyRDD

// classic word count
val w1dstream = dstream.map(elem => (elem, 1))
val count = w1dstream.reduceByKey(_ + _)

// Here's the key to make this work. Look how we update the value of the last RDD after using it.
val diffCount = count.transform { rdd =>
  val interestingKeys = Set("hadoop", "spark")
  val interesting = rdd.filter { case (k, v) => interestingKeys(k) }
  val countDiff = rdd.leftOuterJoin(currentData)
                     .map { case (k, (v1, v2)) => (k, v1 - v2.getOrElse(0)) }
  currentData = interesting
  countDiff
}
diffCount.print()
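For reference, a rough PySpark sketch of the same idea, assuming dstream is a DStream of words and sc/ssc are the usual contexts (the transform function runs on the driver, so the mutable current_data reference can be swapped on each batch):

ssc.remember(120)  # keep generated RDDs for at least 2x the batch interval

# mutable reference to the previous batch's counts for the interesting keys
current_data = [sc.emptyRDD()]
interesting_keys = {"hadoop", "spark"}

counts = dstream.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

def diff_with_previous(rdd):
    # subtract the previous interval's count where one exists, otherwise keep the normal count
    diff = rdd.leftOuterJoin(current_data[0]) \
              .map(lambda kv: (kv[0], kv[1][0] - (kv[1][1] or 0)))
    # remember only the interesting keys for the next batch
    current_data[0] = rdd.filter(lambda kv: kv[0] in interesting_keys)
    return diff

diff_counts = counts.transform(diff_with_previous)
diff_counts.pprint()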

Spark DataFrames: where is partitionBy?

A common Spark processing flow we have is something like this:
Loading:
rdd = sqlContext.parquetFile("mydata/")
rdd = rdd.map(lambda row: (row.id, (some stuff)))
rdd = rdd.filter(....)
rdd = rdd.partitionBy(rdd.getNumPartitions())
Processing by id (this is why we do the partitionBy above!)
rdd.reduceByKey(....)
rdd.join(...)
However, Spark 1.3 changed sqlContext.parquetFile to return DataFrame instead of RDD, and it no longer has the partitionBy, getNumPartitions, and reduceByKey methods.
What do we do now with partitionBy?
We can replace the loading code with something like
rdd = sqlContext.parquetFile("mydata/").rdd
rdd = rdd.map(lambda row: (row.id, (some stuff)))
rdd = rdd.filter(....)
rdd = rdd.partitionBy(rdd.getNumPartitions())
df = rdd.map(lambda ...: Row(...)).toDF(???)
and use groupBy instead of reduceByKey.
Is this the right way?
PS. Yes, I understand that partitionBy is not necessary for groupBy et al. However, without a prior partitionBy, each join, groupBy, etc. may have to do cross-node operations. I am looking for a way to guarantee that all operations requiring grouping by my key will run locally.
It appears that, since version 1.6, repartition(self, numPartitions, *cols) does what I need:
.. versionchanged:: 1.6
    Added optional arguments to specify the partitioning columns.
    Also made numPartitions optional if partitioning columns are specified.
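For example, a hedged sketch of repartitioning by the key column before the key-based operations (the "id" column comes from the question; other_df and the partition count are made up):

df = sqlContext.read.parquet("mydata/")

# hash-partition the DataFrame on "id"; subsequent groupBy / join operations
# keyed on "id" can often reuse this partitioning instead of reshuffling
df = df.repartition(16, "id")

grouped = df.groupBy("id").count()
joined = df.join(other_df.repartition(16, "id"), on="id")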
Since DataFrame gives us a Table/Column abstraction on top of RDD, the most convenient way to manipulate a DataFrame is to use that abstraction together with the table-manipulation methods DataFrame provides.
On a DataFrame, we can:
transform the table schema with select() / udf() / as()
filter rows out with filter() or where()
fire an aggregation through groupBy() and agg()
run other analytic jobs using sample() / join() / union()
persist your results using saveAsTable() / saveAsParquet() / insertIntoJDBC()
Please refer to Spark SQL and DataFrame Guide for more details.
Therefore, a common job looks like:
val people = sqlContext.parquetFile("...")
val department = sqlContext.parquetFile("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), "gender")
.agg(avg(people("salary")), max(people("age")))
And for your specific requirements, this could look like:
val t = sqlContext.parquetFile()
t.filter().select().groupBy().agg()
