Avoid loops and memory issues in PySpark - apache-spark

Let's say I have two data frames, df1 and df2. df1 holds the raw data, which is quite large (about 500 GB), and df2 holds filtering criteria that are used to select data from df1, do some computations, and save the output to Azure blobs. What is the best approach to do this without relying on Python loops? Currently what I have in place looks something like this:
for idx, row in df2.iterrows():
    filter1 = row["filter1"]
    filter2 = row["filter2"]
    df_filtered = df1.where((df1.col1 > filter1) & (df1.col2 < filter2))
    df_filtered.write.format('csv').option("delimiter", ",").save(outputpath, header=True)
The above solution is very slow and I am running into memory issues. I have thought of using a UDF, but a UDF has two problems in my scenario:
I cannot set the Spark configuration needed to save data to Azure blobs from within a UDF.
A UDF returns a result to be appended to the data frame it is applied to. Here, I need the function to be applied to df1, which has the raw data and is quite large, but the output needs to be appended to df2, which is the smaller dataset with the filtering criteria.
Are there any alternative ways to design my solution to avoid memory issues?


Spark scala partition dataframe for large cross joins

I have two dataframes that need to be cross joined on a 20-node cluster. However, because of their size, a simple cross join is failing. I want to partition the data and perform the cross join piece by piece, and I am looking for an efficient way to do it.
Simple Algorithm
Manually split file f1 into three and read the pieces into dataframes: df1A, df1B, df1C. Manually split file f2 into four and read the pieces into dataframes: df2A, df2B, df2C, df2D. Cross join df1A X df2A, df1A X df2B,..,df1A X df2D,...,df1C X df2D. Save each cross join in a file and manually put all the files together. This way Spark can perform each cross join in parallel and things should complete fairly quickly.
Is there a more efficient way of accomplishing this by reading both files into two dataframes, then partitioning each dataframe into 3 and 4 "pieces", and cross joining each partition of one dataframe with every partition of the other?
A DataFrame can be partitioned either by range or by hash.
val df1 = spark.read.csv("file1.txt")
val df2 = spark.read.csv("file2.txt")
val partitionedByRange1 = df1.repartitionByRange(3, $"k")
val partitionedByRange2 = df2.repartitionByRange(4, $"k")
val result = partitionedByRange1.crossJoin(partitionedByRange2)
Note: set the property spark.sql.crossJoin.enabled=true.
You can convert this into an RDD and then use the cartesian operation on that RDD. You should then be able to save that RDD to a file. Hope that helps.
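The block-wise scheme described in the question can be sanity-checked on small in-memory lists before trying it on a cluster. The sketch below is plain Python, not Spark: `chunk` and `blockwise_cross` are hypothetical helper names, and the piece counts of 3 and 4 are taken from the question. It only illustrates that crossing every piece of one side with every piece of the other covers the full Cartesian product exactly once, so each block could be computed and saved independently.

```python
from itertools import product


def chunk(items, n_pieces):
    """Split a list into n_pieces contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_pieces)
    pieces, start = [], 0
    for i in range(n_pieces):
        # The first `extra` pieces take one additional element.
        end = start + size + (1 if i < extra else 0)
        pieces.append(items[start:end])
        start = end
    return pieces


def blockwise_cross(left, right, left_pieces=3, right_pieces=4):
    """Yield one list of cross-joined pairs per (left piece, right piece)."""
    for lp, rp in product(chunk(left, left_pieces), chunk(right, right_pieces)):
        yield list(product(lp, rp))


# Toy stand-ins for the two inputs.
left = list(range(7))
right = list("abcdefghij")
blocks = list(blockwise_cross(left, right))
```

Each element of `blocks` corresponds to one of the df1A X df2A style joins from the question; how the pieces are sized and scheduled on a real cluster is a separate tuning question.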

Partitioning of Data Frame in Pyspark using Custom Partitioner

I am looking for information on using a custom partitioner in PySpark. I have a DataFrame holding data for various countries. If I repartition on the country column, Spark distributes the data into n partitions and keeps each country's rows together in a specific partition. This produces skewed partitions, as I can see using the glom() method.
Some countries, like USA and CHN, have a huge amount of data in this DataFrame. I want to repartition the DataFrame so that USA and CHN are each split further into about 10 partitions, while the other countries, such as IND, THA and AUS, keep their partitions as they are. Can we extend the Partitioner class in PySpark code?
I have read in the link below that we can extend the Scala Partitioner class in a Scala Spark application and modify it to use custom logic to repartition data as required, which is what I need. Please help me achieve this in PySpark. See the link: What is an efficient way to partition by column but maintain a fixed partition count?
I am using Spark version and below is my Dataframe structure:
datadf= spark.sql("""
from udb.sometable
The incoming data covers six countries: AUS, IND, THA, RUS, CHN and USA.
CHN and USA have skewed data.
So if I repartition on COUNTRY_CODE, two partitions contain a lot of data whereas the others are fine. I checked this using the glom() method.
newdf = datadf.repartition("COUNTRY_CODE")
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext, DataFrameWriter, DataFrame
newDF = datadf.repartitionByRange(3,"COUNTRY_CODE","USA")
I was trying to repartition my data into three partitions each for USA and CHN only, and would like to keep each of the other countries' data in a single partition.
This is what I am expecting
AUS - one partition
IND - one partition
THA - one partition
RUS - one partition
CHN - three partitions
USA - three partitions
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1182, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute
Try something like this with hashing:
newDf = oldDf.repartition(N, $"col1", $"coln")
or for ranging approach:
newDF = oldDF.repartitionByRange(N, $"col1", $"coln")
There is no custom partitioning for DataFrames yet.
In your case I would go for hashing, but there are no guarantees of an even spread.
If your data is skewed you may need some extra work; partitioning on two columns is the simplest approach.
For example, use an existing or new column that assigns a group number, 1 .. N, within a given country, and partition on the two columns.
Countries with a lot of data get N synthetic subdivisions; countries with little data get only one group number. This is not too hard, and both partitioning methods accept more than one column.
In my view, filling partitions perfectly uniformly takes a lot of effort and is not really attainable, but a next-best approach like this one can suffice. It amounts to custom partitioning to an extent.
Otherwise, using .withColumn on a DataFrame you can simulate custom partitioning by filling a new column according to those rules and then applying repartitionByRange. That is also not so hard.
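The group-number idea can be sketched as a small function that assigns each row a sub-group within its country. This is plain Python showing only the assignment logic, not Spark code: `SUBGROUPS`, `sub_group` and `row_key` are hypothetical names, and the counts of 3 for USA and CHN are taken from the question. In Spark the same rule would have to be expressed as a column and passed to repartition together with the country column.

```python
import zlib

# Assumed for illustration: which countries are skewed and how many
# sub-groups each should get. Countries not listed get a single group.
SUBGROUPS = {"USA": 3, "CHN": 3}


def sub_group(country, row_key):
    """Return a stable sub-group number in 0..N-1 for skewed countries, else 0."""
    n = SUBGROUPS.get(country, 1)
    # crc32 gives the same value in every process, unlike the built-in hash()
    # for strings, so a row always lands in the same sub-group.
    return zlib.crc32(str(row_key).encode("utf-8")) % n


# Toy rows: (country, some high-cardinality value such as a row ID).
rows = [("USA", i) for i in range(300)] + [("IND", i) for i in range(20)]
groups = {(country, sub_group(country, key)) for country, key in rows}
```

Partitioning on the pair (country, sub-group) then spreads USA and CHN over up to three groups each while every other country stays in one group. How evenly the big countries split depends on how well `row_key` varies across rows.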
There is no custom partitioner in the Structured API, so in order to use one you need to drop down to the RDD API. It takes three simple steps:
Convert Structured API to RDD API
dataRDD = dataDF.rdd
Apply custom partitioner in RDD API
import random

# Extract key from Row object
dataRDD = dataRDD.map(lambda r: (r[0], r))

def partitioner(key):
    if key == "CHN":
        return random.randint(1, 10)
    elif key == "USA":
        return random.randint(11, 20)
    # distinctCountryDict is a dict mapping distinct countries to distinct integers
    # these distinct integers should not overlap with 1..20 (randint is inclusive)
    return distinctCountryDict[key]

numPartitions = 100
dataRDD = dataRDD.partitionBy(numPartitions, partitioner)

# Remove key extracted previously
dataRDD = dataRDD.map(lambda r: r[1])
Convert RDD API back to Structured API
dataDF = dataRDD.toDF()
This way, you get the best of both worlds: Spark types and an optimized physical plan in the Structured API, as well as a custom partitioner in the low-level RDD API. And we drop down to the low-level API only when it's absolutely necessary.
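Because the partitioner is an ordinary Python function, it can be exercised on its own before handing it to partitionBy. The sketch below repeats the function from the answer and checks the IDs it produces; the contents of `distinctCountryDict` are an assumption made for illustration (the answer only describes the dict), and the seed is there purely to make the demonstration repeatable.

```python
import random

# Hypothetical mapping for the non-skewed countries. The IDs are chosen so
# they do not collide with 1..20, which are reserved for CHN and USA.
distinctCountryDict = {"AUS": 21, "IND": 22, "THA": 23, "RUS": 24}


def partitioner(key):
    """Map a country code to a partition ID, spreading CHN and USA over 10 IDs each."""
    if key == "CHN":
        return random.randint(1, 10)   # randint is inclusive on both ends
    elif key == "USA":
        return random.randint(11, 20)
    return distinctCountryDict[key]


random.seed(0)  # only to make this demonstration repeatable
chn_ids = {partitioner("CHN") for _ in range(1000)}
usa_ids = {partitioner("USA") for _ in range(1000)}
```

Note that the random choice means rows of the same skewed country are deliberately scattered, so this layout suits workloads that do not rely on all rows of a country being co-located.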
There is no direct way to apply a user-defined partitioner in PySpark. The shortcut is to create a new column with a UDF that assigns each record a partition ID based on the business logic, and then use the new column for partitioning so that the data gets spread evenly.
numPartitions = 3
# udf_country_hash is not defined in this answer; it stands for a UDF that returns an integer hash
df = df.withColumn("Hash#", udf_country_hash(df['Country']))
df = df.withColumn("Partition#", df["Hash#"] % numPartitions)
# repartition returns a new DataFrame, so keep the result
df = df.repartition(numPartitions, "Partition#")
In my experience converting DataFrame to RDD and back to DataFrame is a costly operation, better to avoid it.

Databricks Create a list of dataFrames with their size

I'm working on Databricks and I want to have a list of all my dataframes with their number of observations.
Is it possible to have the size (number of rows) for each dataframe in the DataLake?
I found how to list all dataframes:
I know how to count it.
Is it possible to have a list of my dataframes and the size?
You can create a DataFrame from the file listing and the row counts. The following code assumes all your tables are in Parquet format. If that's not the case, you need to change the reading code.
def namesAndRowCounts(root: String) =
  dbutils.fs.ls(root).map { info =>
    // One row per entry under root: its name and its row count
    (info.name, spark.read.load(info.path).count)
  }.toDF("name", "rows").orderBy('name)

Appending data to an empty dataframe

I am creating an empty dataframe and later trying to append another data frame to that. In fact I want to append many dataframes to the initially empty dataframe dynamically depending on number of RDDs coming.
The union() function works fine if I assign the result to a third dataframe.
val df3=df1.union(df2)
But I want to keep appending to the initial (empty) dataframe I created, because I want to store all the RDDs in one dataframe. The code below, however, does not show the right counts. It seems that it simply did not append:
df1.count() // this shows 0, although df2 has some data, which does show up if I assign the union to a third dataframe
If I do the below, I get a reassignment error since df1 is a val. And if I change it to a var, I get a Kafka "multithreading not safe" error.
Any idea how to add all the dynamically created dataframes to the one initially created dataframe?
Not sure if this is what you are looking for!
# Import the PySpark types used in the schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Define your schema
field = [StructField("Col1",StringType(), True), StructField("Col2", IntegerType(), True)]
schema = StructType(field)
# Your empty data frame
df = spark.createDataFrame(sc.emptyRDD(), schema)
l = []
for i in range(5):
    # Build and append to the list dynamically
    l = l + [([str(i), i])]
    # Create a temporary data frame similar to your original schema
    temp_df = spark.createDataFrame(l, schema)
    # Do the union with the original data frame
    df = df.union(temp_df)
DataFrames and other distributed data structures are immutable, so methods that operate on them always return a new object. There is no appending, no modification in place, and no ALTER TABLE equivalent.
And if I change it to var type, I get kafka multithreading not safe error.
Without the actual code it is impossible to give you a definitive answer, but it is unlikely to be related to the union code.
There are a number of known Spark bugs caused by incorrect internal implementation (SPARK-19185 and SPARK-23623, to name just a few).

How to divide a column by its sum in a Spark DataFrame

How can I divide a column by its own sum in a Spark DataFrame, efficiently and without immediately triggering a computation?
Suppose we have some data:
import pyspark
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as spf
spark = SparkSession.builder.master('local').getOrCreate()
data = spark.range(0, 100)
data # --> DataFrame[id: bigint]
I’d like to create a new column on this data frame called “normalized” that contains id / sum(id). One way to do it is to pre-compute the sum, like this:
s = data.select(spf.sum('id')).collect()[0][0]
data2 = data.withColumn('normalized', spf.col('id') / s)
data2 # --> DataFrame[id: bigint, normalized: double]
That works fine, but it immediately triggers a computation; if you're defining something similar for many columns it will cause multiple redundant passes over the data.
Another way to do it is with a windowing specification that includes the whole table:
w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
data3 = data.withColumn('normalized', spf.col('id') / spf.sum('id').over(w))
data3 # --> DataFrame[id: bigint, normalized: double]
In this case, it's fine to define data3, but once you try to actually compute it, Spark 2.2.0 will move all the data into a single partition, which typically causes the job to fail for large data sets.
What other approaches are there to solving this problem, that don't trigger an immediate computation and that will work with large data sets? I'm interested in any solutions, not necessarily solutions based on pyspark.
crossJoin with an aggregate is one approach:
data.crossJoin(
    data.select(spf.sum("id").alias("sum_id"))
).withColumn("normalized", spf.col("id") / spf.col("sum_id"))
but I wouldn't worry too much:
That works fine, but it immediately triggers a computation; if you're defining something similar for many columns it will cause multiple redundant passes over the data.
Just compute multiple statistics at once:
data2 = data.select(spf.rand(42).alias("x"), spf.randn(42).alias("y"))
mean_x, mean_y = data2.groupBy().mean().first()
and the rest is just an operation on local expressions:
data2.select(spf.col("x") - mean_x, spf.col("y") - mean_y)
