I am using the action count() to trigger my udf function to run. This works, but long after my udf function has completed running, the df.count() takes days to complete. The dataframe itself is not large, and has about 30k to 100k rows.
AWS Cluster Settings:
1 m5.4xlarge for the master node
2 m5.4xlarge for the worker nodes.
Spark Variables & Settings (These are the spark variables being used to run the script)
--executor-cores 4
--conf spark.sql.execution.arrow.enabled=true
'spark.sql.inMemoryColumnarStorage.batchSize', 2000000 (set inside pyspark script)
Pseudocode
Here is the actual structure of our script. The custom pandas UDF makes a call to a PostgreSQL database for every row.
from pyspark.sql.functions import pandas_udf, PandasUDFType

# udf_schema: A function that returns the schema for the dataframe

def main():
    # Define pandas udf for calculation.
    # To perform this calculation, every row in the
    # dataframe needs information pulled from our PostgreSQL DB,
    # which does take some time, ~2-3 hours
    @pandas_udf(udf_schema(), PandasUDFType.GROUPED_MAP)
    def calculate_values(local_df):
        local_df = run_calculation(local_df)
        return local_df

    # custom function that pulls data from our database and
    # creates the dataframe
    df = get_df()
    df = df\
        .groupBy('some_unique_id')\
        .apply(calculate_values)
    print(f'==> finished running calculation for {df.count()} rows!')
    return
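I ran into the same issue, but in my case the cause was a limitation of the JDBC read, not the cluster itself. By default a JDBC read runs as a single partition. So if, like me, you are reading from a database such as Impala or PostgreSQL over JDBC, you can try the following (partitionColumn has to be given together with lowerBound and upperBound; $LOWER_BOUND and $UPPER_BOUND are placeholders for the range of that column):
df = spark.read.option("numPartitions", 100).option("partitionColumn", $COLUMN_NAME).option("lowerBound", $LOWER_BOUND).option("upperBound", $UPPER_BOUND).jdbc($THE_JDBC_SETTING)
instead of
df = spark.read.jdbc($THE_JDBC_SETTING)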
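To make the effect of those options more concrete, here is a rough plain-Python sketch of how a partitioned JDBC read splits the column range into one WHERE clause per partition. It is a simplified approximation, not Spark's actual code: the function name jdbc_partition_predicates and the sample values are made up for illustration. The point it shows is that the bounds only decide the stride; rows outside the bounds still land in the first or last partition.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the per-partition WHERE clauses of a partitioned JDBC read.

    Hypothetical helper for illustration only; Spark's real logic handles
    more cases (dates, timestamps, rounding) than this sketch does.
    """
    if num_partitions < 2:
        # A single partition reads the whole table with no range filter.
        return ["TRUE"]
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower_bound + i * stride
        end = start + stride
        if i == 0:
            # First partition also picks up NULLs and values below the lower bound.
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition picks up everything above the final boundary.
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


predicates = jdbc_partition_predicates("id", 0, 100, 4)
for p in predicates:
    print(p)
```

Each clause becomes one query against the database, so the reads can run in parallel instead of pulling the whole table through one connection.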
I'm seeing strange performance results when comparing the two APIs in PySpark 3.2.1 that can run a pandas UDF on the grouped results of a Spark DataFrame:
df.groupBy().applyInPandas()
ps_df.groupby().apply() - the newer apply introduced in the pandas API on Spark (formerly Koalas)
First I run the following input generator code in local spark mode (Spark 3.2.1):
import pyspark.sql.types as types
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
import pyspark.pandas as ps
spark = SparkSession.builder \
    .config("spark.sql.execution.arrow.pyspark.enabled", True) \
    .getOrCreate()
ps.set_option("compute.default_index_type", "distributed")
spark.range(1000000).withColumn('group', (col('id') / 10).cast('int')) \
    .write.parquet('/tmp/sample_input', mode='overwrite')
Then I test the applyInPandas:
def getsum(pdf):
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = spark.read.parquet('/tmp/sample_input')
output_schema = types.StructType(
    df.schema.fields + [types.StructField('sum_in_group', types.FloatType())]
)
df.groupBy('group').applyInPandas(getsum, schema=output_schema) \
    .write.parquet('/tmp/schematest', mode='overwrite')
The code executes in under 30 seconds (on an i7-9750H CPU).
Then I try the new API, and while I really appreciate how nice the code looks:
def getsum(pdf) -> ps.DataFrame["id": int, "group": int, "sum_in_group": int]:
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = ps.read_parquet('/tmp/sample_input')
df.groupby('group').apply(getsum) \
    .to_parquet('/tmp/schematest', mode='overwrite')
... every time the execution time is at least 1m 40s on the same CPU, so more than 3x slower for this simple operation.
I am aware that adding sum_in_group can be done far more efficiently without involving pandas, but this is just a small minimal example. Any other operation is also at least 3 times slower.
Do you know what could be the reason for this slowdown? Maybe I'm missing some configuration parameter that would make the two execute in similar time?
I have the following PySpark code in Spark Structured Streaming that gets a DataFrame from Redis:
def process(stream_batch, batch_id):
    stream_batch.persist()
    length = stream_batch.count()
    record_rdd = stream_batch.rdd.map(lambda x: b_to_ndarray(x['data']))
    # b_to_ndarray is a single-threaded method that converts bytes from Redis to an ndarray
    record_rdd = record_rdd.coalesce(4)  # does not work
    print(record_rdd.getNumPartitions())  # output 1
    # some other code
Why? How can I fix it? The code in main is:
loadedDf = spark.readStream.format('redis')...
query = loadedDf.writeStream \
    .foreachBatch(process).start()
query.awaitTermination()
Since the number of partitions is 1 to begin with, coalesce cannot help: by default it only merges existing partitions, so it can reduce the partition count but not increase it. No matter how you call it, you will still have 1 partition. To get more partitions, use repartition instead.
My problem is as follows:
I have a large dataframe called customer_data_pk containing 230M rows and the other one containing 200M rows named customer_data_pk_isb.
Both have a column callID on which I would like to do a left join, the left dataframe being customer_data_pk.
What is the best possible way to achieve the join operation?
What have I tried?
The simple join, i.e.
customer_data_pk.join(customer_data_pk_isb,
                      customer_data_pk.btn == customer_data_pk_isb.btn, 'left')
either runs out of memory or times out with the error: Removing executor driver with no recent heartbeats: 468990 ms exceeds timeout 120000 ms.
After all this, the join still doesn't work. I am still learning PySpark so I might have misunderstood the fundamentals. If someone could shed light on this, it would be great.
I have tried this as well, but it didn't work and the code gets stuck:
customer_data_pk.persist(StorageLevel.DISK_ONLY)
Furthermore, on the configuration side I am using: --conf spark.sql.shuffle.partitions=5000
My complete code is below:
from pyspark import SparkContext
from pyspark import SQLContext
import time
import pyspark
sc = SparkContext("local", "Example")
sqlContext = SQLContext(sc)
customer_data_pk = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost/matchingqueryautomation',
    driver='com.mysql.jdbc.Driver',
    dbtable='customer_pk',
    user='XXXX',
    password='XXXX').load()
customer_data_pk.persist(pyspark.StorageLevel.DISK_ONLY)
customer_data_pk_isb = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost/lookupdb',
    driver='com.mysql.jdbc.Driver',
    dbtable='customer_pk_isb',
    user='XXXX',
    password='XXXX').load()
print('###########################',
      customer_data_pk.join(customer_data_pk_isb, customer_data_pk.btn == customer_data_pk_isb.btn, 'left').count(),
      '###########################')
I have the following code that simply does some joins and then outputs the data:
from pyspark.sql.functions import udf, struct
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark.sql.functions import broadcast
conf = SparkConf()
conf.set('spark.logConf', 'true')
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("Generate Parameters") \
    .getOrCreate()
spark.sparkContext.setLogLevel("OFF")
df1 = spark.read.parquet("/location/mydata")
df1 = df1.select([c for c in df1.columns if c in ['sender', 'receiver', 'ccc', 'cc', 'pr']])
df2 = spark.read.csv("/location/mydata2")
cond1 = [(df1.sender == df2._c1) | (df1.receiver == df2._c1)]
df3 = df1.join(broadcast(df2), cond1)
df3 = df3.select([c for c in df3.columns if c in ['sender', 'receiver', 'ccc', 'cc', 'pr']])
df1 has 1,862,412,799 rows and df2 has 8679 rows.
When I then call:
df3.count()
it just seems to sit there with the following:
[Stage 33:> (0 + 200) / 200]
Assumptions for this answer:
df1 is the dataframe containing 1,862,412,799 rows.
df2 is the dataframe containing 8679 rows.
df1.count() returns a value quickly (as per your comment)
There may be three areas where the slowdown is occurring:
The imbalance of data sizes (1,862,412,799 vs 8679):
Although Spark is great at handling large quantities of data, it doesn't deal well with very small sets. Unless told otherwise, Spark splits your data into multiple partitions, and for small files the number of partitions can be excessively high compared to the amount of data each one holds. I recommend trying the following to see if it improves speed.
df2 = spark.read.csv("/location/mydata2")
df2 = df2.repartition(2)
Note: The number 2 here is just an estimate, based on how many partitions would suit the number of rows in that set.
Broadcast Cost:
The delay in the count may be due to the actual broadcast step. Your data is saved and copied to every node in your cluster before the join, and all of this happens at once when count() is called. Depending on your infrastructure, this could take some time. If the above repartition doesn't help, try removing the broadcast call. If the broadcast turns out to be the cause of the delay, it is worth checking whether there are bottlenecks within your cluster and whether the broadcast is necessary at all.
Unexpected Merge Explosion:
I am not saying that this is the issue, but it is always good to check that the join condition you have set is not creating unexpected duplicates. It is possible that this is happening and causing the slowdown you are experiencing when df3 is actually processed.
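To illustrate what such an explosion looks like, here is a small plain-Python sketch of the join semantics (not Spark code). The toy rows and lookup keys are invented, and or_join is a hypothetical helper that mimics the condition (sender == _c1) | (receiver == _c1) from the question.

```python
from collections import Counter


def or_join(rows, lookup_keys):
    """Inner join: a row matches a key if its sender OR its receiver equals it."""
    return [
        (sender, receiver, key)
        for sender, receiver in rows
        for key in lookup_keys
        if sender == key or receiver == key
    ]


# Invented sample data: three "df1" rows and three "df2" keys.
rows = [("a", "b"), ("a", "x"), ("y", "z")]
lookup_keys = ["a", "b", "a"]  # note the duplicated key "a"

joined = or_join(rows, lookup_keys)

# How many output rows did each input row produce?
fan_out = Counter((sender, receiver) for sender, receiver, _ in joined)

# Which lookup keys appear more than once?
duplicate_keys = [key for key, n in Counter(lookup_keys).items() if n > 1]

print(len(rows), "input rows ->", len(joined), "joined rows")
print(fan_out)
print(duplicate_keys)
```

Here the row ("a", "b") comes out three times: once for the key matching the receiver and twice for the duplicated key matching the sender. On the real data, the equivalent checks would be comparing the row counts before and after the join on a small sample, and looking for repeated values in df2._c1.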
I am using PySpark in a Jupyter notebook. The following step takes up to 100 seconds, which is OK.
toydf = df.select("column_A").limit(20)
However, the following show() step takes 2-3 minutes. It only has 20 rows of lists of integers, and each list has no more than 60 elements. Why does it take so long?
toydf.show()
df is generated as follows:
spark = SparkSession.builder\
    .config(conf=conf)\
    .enableHiveSupport()\
    .getOrCreate()
df = spark.sql("""SELECT column_A
                  FROM datascience.email_aac1_pid_enl_pid_1702""")
In Spark there are two major concepts:
1: Transformations: when you apply withColumn, drop, join, or groupBy, nothing is actually evaluated; they just produce a new DataFrame or RDD.
2: Actions: actions such as count, show, display, or write are what actually do all the work of the preceding transformations. Every action internally calls Spark's runJob API to run the transformations as a job.
In your case, when you run toydf = df.select("column_A").limit(20), nothing happens yet.
But show() is an action, so it collects data from the cluster to your driver node, and only at that point is toydf = df.select("column_A").limit(20) actually evaluated.