Facing Issue while Loading Data through PySpark and performing joins - apache-spark

My problem is as follows:
I have a large dataframe called customer_data_pk containing 230M rows, and another one, named customer_data_pk_isb, containing 200M rows.
Both have a column callID on which I would like to do a left join, the left dataframe being customer_data_pk.
What is the best possible way to achieve the join operation?
What have I tried?
The simple join, i.e.
customer_data_pk.join(customer_data_pk_isb,
                      customer_data_pk.btn == customer_data_pk_isb.btn,
                      'left')
gives an out-of-memory error, or simply times out with: Error: Removing executor driver with no recent heartbeats: 468990 ms exceeds timeout 120000 ms.
After all this, the join still doesn't work. I am still learning PySpark so I might have misunderstood the fundamentals. If someone could shed light on this, it would be great.
I have tried this as well, but it didn't work and the code got stuck:
customer_data_pk.persist(StorageLevel.DISK_ONLY)
Furthermore, on the configuration side I am using: --conf spark.sql.shuffle.partitions=5000
My complete code is as follows:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import time
import pyspark

sc = SparkContext("local", "Example")
sqlContext = SQLContext(sc)

# load the 230M-row table over JDBC
customer_data_pk = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost/matchingqueryautomation',
    driver='com.mysql.jdbc.Driver',
    dbtable='customer_pk',
    user='XXXX',
    password='XXXX').load()

customer_data_pk.persist(pyspark.StorageLevel.DISK_ONLY)

# load the 200M-row table over JDBC
customer_data_pk_isb = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost/lookupdb',
    driver='com.mysql.jdbc.Driver',
    dbtable='customer_pk_isb',
    user='XXXX',
    password='XXXX').load()

print('###########################',
      customer_data_pk.join(customer_data_pk_isb,
                            customer_data_pk.btn == customer_data_pk_isb.btn,
                            'left').count(),
      '###########################')
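For reference, one common way to keep a JDBC read of this size manageable is to partition it so that Spark opens several parallel connections instead of one. The sketch below is only an illustration and assumes the table has a numeric column to split on; the column name 'id', the bounds, and the partition count are placeholders, not taken from the question:

customer_data_pk = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost/matchingqueryautomation',
    driver='com.mysql.jdbc.Driver',
    dbtable='customer_pk',
    user='XXXX',
    password='XXXX',
    partitionColumn='id',    # hypothetical numeric column to split the read on
    lowerBound='1',          # assumed minimum value of that column
    upperBound='230000000',  # assumed maximum value of that column
    numPartitions='100').load()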

Related

Why is Pandas-API-on-Spark's apply on groups way slower than the PySpark API?

I'm getting strange performance results when comparing the two APIs in PySpark 3.2.1 that provide the ability to run a pandas UDF on the grouped results of a Spark DataFrame:
df.groupBy().applyInPandas()
ps_df.groupby().apply() - a new way of apply introduced in Pandas-API-on-Spark AKA Koalas
First I run the following input generator code in local spark mode (Spark 3.2.1):
import pyspark.sql.types as types
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder \
    .config("spark.sql.execution.arrow.pyspark.enabled", True) \
    .getOrCreate()
ps.set_option("compute.default_index_type", "distributed")

spark.range(1000000).withColumn('group', (col('id') / 10).cast('int')) \
    .write.parquet('/tmp/sample_input', mode='overwrite')
Then I test the applyInPandas:
def getsum(pdf):
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = spark.read.parquet('/tmp/sample_input')
output_schema = types.StructType(
    df.schema.fields + [types.StructField('sum_in_group', types.FloatType())]
)
df.groupBy('group').applyInPandas(getsum, schema=output_schema) \
    .write.parquet('/tmp/schematest', mode='overwrite')
And the code executes in under 30 seconds (on an i7-9750H CPU).
Then I try the new API and, while I really appreciate how nice the code looks:
def getsum(pdf) -> ps.DataFrame["id": int, "group": int, "sum_in_group": int]:
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = ps.read_parquet('/tmp/sample_input')
df.groupby('group').apply(getsum) \
    .to_parquet('/tmp/schematest', mode='overwrite')
... every time the execution time is at least 1m 40s on the same CPU, so more than 3x slower for this simple operation.
I am aware that adding sum_in_group can be done far more efficiently with no pandas involvement, but this is just a small, minimal example; any other operation is also at least 3 times slower.
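(For reference, the pandas-free version alluded to above might look like the following window-function sketch; it reuses the column and path names from the example and is only an illustration.)

from pyspark.sql import functions as F, Window

# Sketch only: the same per-group sum with a plain Spark window function,
# no pandas UDF involved.
df = spark.read.parquet('/tmp/sample_input')
df.withColumn('sum_in_group', F.sum('id').over(Window.partitionBy('group'))) \
    .write.parquet('/tmp/schematest', mode='overwrite')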
Do you know what the reason for this slowdown could be? Maybe I'm missing some context parameter that would make these execute in a similar time?

How to move a large table from PSQL to Parquet on gcloud via Apache Spark?

I have a large table (around 300 GB), about 50 GB of RAM, and 8 CPUs.
I want to move my PSQL table into Google Cloud Storage using Spark and a JDBC connection, very similar to: How to convert an 500GB SQL table into Apache Parquet?.
I know my connections work, because I was able to move a small table. But with the large table I get memory issues. How can I optimize it?
import pyspark
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql import DataFrameReader

conf = pyspark.SparkConf().setAll([
    ("spark.driver.extraClassPath", "/usr/local/bin/postgresql-42.2.5.jar:/usr/local/jar/gcs-connector-hadoop2-latest.jar"),
    ("spark.executor.instances", "8"),
    ("spark.executor.cores", "4"),
    ("spark.executor.memory", "1g"),
    ("spark.driver.memory", "6g"),
    ("spark.memory.offHeap.enabled", "true"),
    ("spark.memory.offHeap.size", "40g")])
sc = pyspark.SparkContext(conf=conf)
sc.getConf().getAll()
sc._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "/home/user/analytics/gcloud_key_name.json")
sqlContext = SQLContext(sc)

url = 'postgresql://address:port/db_name'
properties = {
    'user': 'user',
    'password': 'password'}
df_users = sqlContext.read.jdbc(
    url='jdbc:%s' % url, table='users', properties=properties
)

gcloud_path = "gs://BUCKET/users"
df_users.write.mode('overwrite').parquet(gcloud_path)
Bonus question:
Can I partition now, or should I first save it as Parquet, then read it back and repartition it?
Bonus question 2:
If the answer to bonus question 1 is yes, can I sort it now, or should I first save it as Parquet, then read it back and sort it?
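(For what it's worth, a minimal sketch of what "partitioning now", before the write, could look like; the repartition count and the created_date partition column are hypothetical and not taken from the question.)

# Sketch only: repartition in memory, then partition the Parquet output
# by a hypothetical column; adjust names and counts to the real data.
df_users = df_users.repartition(64)
df_users.write.mode('overwrite') \
    .partitionBy('created_date') \
    .parquet(gcloud_path)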

pyspark df.count() taking a very long time (or not working at all)

I have the following code that is simply doing some joins and then outputting the data:
from pyspark.sql.functions import udf, struct
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark.sql.functions import broadcast

conf = SparkConf()
conf.set('spark.logConf', 'true')
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("Generate Parameters") \
    .getOrCreate()
spark.sparkContext.setLogLevel("OFF")

df1 = spark.read.parquet("/location/mydata")
df1 = df1.select([c for c in df1.columns if c in ['sender', 'receiver', 'ccc', 'cc', 'pr']])
df2 = spark.read.csv("/location/mydata2")

cond1 = [(df1.sender == df2._c1) | (df1.receiver == df2._c1)]
df3 = df1.join(broadcast(df2), cond1)
df3 = df3.select([c for c in df3.columns if c in ['sender', 'receiver', 'ccc', 'cc', 'pr']])
df1 is 1,862,412,799 rows and df2 is 8679 rows
When I then call:
df3.count()
it just seems to sit there with the following:
[Stage 33:> (0 + 200) / 200]
Assumptions for this answer:
df1 is the dataframe containing 1,862,412,799 rows.
df2 is the dataframe containing 8679 rows.
df1.count() returns a value quickly (as per your comment)
There may be three areas where the slowdown is occurring:
The imbalance of data sizes (1,862,412,799 vs 8679):
Although Spark is great at handling large quantities of data, it doesn't deal well with very small sets. If not specifically set, Spark attempts to partition your data into multiple parts, and on small files the number of partitions can be excessively high compared to the actual amount of data each part holds. I recommend trying the following and seeing if it improves speed.
df2 = spark.read.csv("/location/mydata2")
df2 = df2.repartition(2)
Note: the number 2 here is just an estimate, based on how many partitions would suit the number of rows in that set.
Broadcast Cost:
The delay in the count may be due to the actual broadcast step. Your data is being saved and copied to every node within your cluster before the join, all of which happens once count() is called. Depending on your infrastructure, this could take some time. If the repartition above doesn't help, try removing the broadcast call, as sketched below. If the broadcast turns out to be the delay, it is worth confirming that there are no bottlenecks within your cluster, and whether the broadcast is necessary at all.
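(A minimal sketch of the same join with the broadcast hint removed, reusing the names from the question, so that Spark chooses the join strategy itself:)

# Sketch only: drop the broadcast hint and let Spark decide
df3 = df1.join(df2, cond1)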
Unexpected Merge Explosion
I am not implying that this is the issue, but it is always good to check that the join condition you have set is not creating unexpected duplicates. It is possible that this is happening and causing the slowdown you experience when df3 is actually processed; a quick check is sketched below.
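(A minimal sketch of such a check, assuming the join keys from the question; any duplicate value of _c1 in df2 multiplies the matching rows of df1 in the output:)

# Sketch only: list join-key values that appear more than once in the small table
df2.groupBy('_c1').count().filter('count > 1').show()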

Deleting specific column in Cassandra from Spark

I was able to delete a specific column with the RDD API with:
sc.cassandraTable("books_ks", "books")
  .deleteFromCassandra("books_ks", "books", SomeColumns("book_price"))
I am struggling to do this with the Dataframe API.
Can someone please share an example?
You cannot delete via the DataFrame API, and it's unnatural via the RDD API. RDDs and DataFrames are immutable, meaning no modification; you can filter them to cut them down, but this generates a new RDD / DataFrame.
Having said that, what you can do is filter out the rows that you wish to delete and then build a C* client to carry out the deletion:
// imports for Spark and C* connection
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf

spark.setCassandraConf("Test Cluster", CassandraConnectorConf.ConnectionHostParam.option("localhost"))

val df = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "books_ks", "table" -> "books"))
  .load()
val dfToDelete = df.filter($"price" < 3).select($"price")
dfToDelete.show()

// import for C* client
import com.datastax.driver.core._

// build a C* client (part of the dependency of the Scala driver)
val clusterBuilder = Cluster.builder().addContactPoints("127.0.0.1")
val cluster = clusterBuilder.build()
val session = cluster.connect()

// loop over everything filtered into the DF and delete the specified rows
for (price <- dfToDelete.collect())
  session.execute("DELETE FROM books_ks.books WHERE price=" + price.get(0).toString)
A few warnings: this won't work well if you're trying to delete a large portion of rows. Using collect here means that all of the work is done in Spark's driver program, which is both a single point of failure and a bottleneck.
A better way would be to either a) define a DataFrame UDF to carry out the delete, the benefit being that you get parallelization, or b) drop to the RDD level and issue the delete as shown above.
Moral of the story: just because it can be done doesn't mean it should be done.

Why is .show() on a 20 row PySpark dataframe so slow?

I am using PySpark in a Jupyter notebook. The following step takes up to 100 seconds, which is OK.
toydf = df.select("column_A").limit(20)
However, the following show() step takes 2-3 minutes. It only has 20 rows of lists of integers, and each list has no more than 60 elements. Why does it take so long?
toydf.show()
df is generated as follows:
spark = SparkSession.builder \
    .config(conf=conf) \
    .enableHiveSupport() \
    .getOrCreate()
df = spark.sql("""SELECT column_A
                  FROM datascience.email_aac1_pid_enl_pid_1702""")
In Spark there are two major concepts:
1: Transformations: whenever you apply withColumn, drop, join or groupBy, nothing is actually evaluated; the call just produces a new DataFrame or RDD.
2: Actions: in the case of actions like count, show, display or write, Spark actually does all the work of the transformations. All these actions internally call the Spark runJob API to run the transformations as a job.
In your case, when you run toydf = df.select("column_A").limit(20), nothing is happening.
But when you call the show() method, which is an action, it collects data from the cluster to your driver node, and only at that point does it actually evaluate toydf = df.select("column_A").limit(20).
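(A minimal sketch illustrating the difference, with rough timings to show which line actually triggers the work; df is assumed to be the DataFrame built from the Hive query above.)

import time

t0 = time.time()
toydf = df.select("column_A").limit(20)   # transformation: returns immediately, no job runs
print("transformation took", time.time() - t0, "s")

t0 = time.time()
toydf.show()                              # action: the whole plan (Hive scan + limit) runs now
print("action took", time.time() - t0, "s")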
