How can I repartition RDD by key and then pack it to shards? - apache-spark

I have many files containing millions of rows in format:
id, created_date, some_value_a, some_value_b, some_value_c
This way of repartitioning was super slow and created for me over million of small ~500b files:
rdd_df = rdd.toDF(["id", "created_time", "a", "b", "c"])
rdd_df.write.partitionBy("id").csv("output")
I would like to achieve output files, where each file contains like 10000 unique IDs and all their rows.
How could I achieve something like this?

You can repartition by adding a Random Salt key.
val totRows = rdd_df.count
val maxRowsForAnId = rdd_df.groupBy("id").count().agg(max("count"))
val numParts1 = totRows/maxRowsForAnId
val totalUniqueIds = rdd_df.select("id").distinct.count
val numParts2 = totRows/(10000*totalUniqueIds)
val numPart = numParts1.min(numParts2)
rdd_df
.repartition(numPart,col("id"),rand)
.csv("output")
The main concept is each partition will be written as 1 file. SO you would have bring your required rows in to 1 partition by repartition(numPart,col("id"),rand).
The first 4-5 operations is just to calculate how many partitions we need to achieve almost 10000 ids per file.
Calculate assuming 10000 ids per partition
Corner case : if a single id has too many rows and doesn't fit in the above calculated partition size.
Hence we calculate no of paritition according to the largest count of ID present
Take min of the 2 noOfPartitons
rand is necessary so, that we can bring multiple IDs in a single partition
NOTE : Although this will give you larger files and each file will contain a set of unique ids for sure. But this involves shuffling , due to which your operation actually might be slower than the code you have mentioned in question.

You would need something like this:
rdd_df.repartition(*number of partitions you want*).write.csv("output", header = True)
or honestly - just let the job decide the number partitions instead of repartitioning. In theory, that should be faster:
rdd_df.write.csv("output", header = True)

Related

Spark: problem with crossJoin (takes a tremendous amount of time)

First of all, I have to say that I've already tried everything I know or found on google (Including this Spark: How to use crossJoin which is exactly my problem).
I have to calculate the Cartesian product between two DataFrame - countries and units such that -
A.cache().count()
val units = A.groupBy("country")
.agg(sum("grade").as("grade"),
sum("point").as("point"))
.withColumn("AVR", $"grade" / $"point" * 1000)
.drop("point", "grade")
val countries = D.select("country").distinct()
val C = countries.crossJoin(units)
countries contains a countries name and its size bounded by 150. units is DataFrame with 3 rows - an aggregated result of other DataFrame. I checked 100 times the result and those are the sizes indeed - and it takes 5 hours to complete.
I know I missed something. I've tried caching, repartitioning, etc.
I would love to get some other ideas.
I have two suggestions for you:
Look at the explain plan and the spark properties, for the amount of data you have mentioned 5 hours is a really long time. My expectation is you have way too many shuffles, you can look at different properties like : spark.sql.shuffle.partitions
Instead of doing a cross join, you can maybe do a collect and explore broadcasts
https://sparkbyexamples.com/spark/spark-broadcast-variables/ but do this only on small amounts of data as this data is brought back to the driver.
What is the action you are doing afterwards with C?
Also, if these datasets are so small, consider collecting them to the driver, and doing these manupulation there, you can always spark.createDataFrame later again.
Update #1:
final case class Unit(country: String, AVR: Double)
val collectedUnits: Seq[Unit] = units.as[Unit].collect
val collectedCountries: Seq[String] = countries.collect
val pairs: Seq[(String, Unit)] = for {
unit <- units
country <- countries
} yield (country, unit)
I've finally understood the problem - Spark used too many excessive numbers of partitions, and thus the shuffle takes a lot of time.
The way to solve it is to change the default number -
sparkSession.conf.set("spark.sql.shuffle.partitions", 10)
And it works like magic.

Avoid lazy evaluation of code in spark without cache

How can I avoid lazy evaluation in spark. I have a data frame which needs to be populated at once, since I need to filter the data on the basis of random number generated for each row of data frame, say if random number generated > 0.5, it will be filtered as dataA and if random number generated < 0.5 it will be filtered as dataB.
val randomNumberDF = df.withColumn("num", Math.random())
val dataA = randomNumberDF.filter(col("num") >= 0.5)
val dataB = randomNumberDF.filter(col("num") < 0.5)
Since spark is doing lazy eval, while filtering there is no reliable distribution of rows which are being filtered as dataA and dataB(sometimes same row is being present in both dataA and dataB)
How can I avoid this re-computation of "num" column, I have tried using "cache", which worked, but given my data size is going to be big, I am ruling out that solution.
I have also tried using other actions on the randomNumberDF, like :
count
rdd.count
show
first
these didn't solve the problem.
Please suggest something different from cache/persist/writing data to HDFS and again reading it as solution.
References I have already checked :
How to force spark to avoid Dataset re-computation?
How to force Spark to only execute a transformation once?
How to force Spark to evaluate DataFrame operations inline
If all you're looking for is a way to ensure that the same values are in randomNumberDF.num, then you can generate random numbers with a seed (using org.apache.spark.sql.functions.rand()):
The below is using 112 as the seed value:
val randomNumberDF = df.withColumn("num", rand(112))
val dataA = randomNumberDF.filter(col("num") >= 0.5)
val dataB = randomNumberDF.filter(col("num") < 0.5)
That will ensure that the values in num are the same across the multiple evaluations of randomNumberDF.
besides using org.apache.spark.sql.functions.rand with a given seed, you coud use eager-checkpointing:
df.checkpoint(true)
This will materialize the dataframe to disk

spark save taking lot of time

I've 2 dataframes and I want to find the records with all columns equal except 2 (surrogate_key,current)
And then I want to save those records with new surrogate_key value.
Following is my code :
val seq = csvDataFrame.columns.toSeq
var exceptDF = csvDataFrame.except(csvDataFrame.as('a).join(table.as('b),seq).drop("surrogate_key","current"))
exceptDF.show()
exceptDF = exceptDF.withColumn("surrogate_key", makeSurrogate(csvDataFrame("name"), lit("ecc")))
exceptDF = exceptDF.withColumn("current", lit("Y"))
exceptDF.show()
exceptDF.write.option("driver","org.postgresql.Driver").mode(SaveMode.Append).jdbc(postgreSQLProp.getProperty("url"), tableName, postgreSQLProp)
This code gives correct results, but get stuck while writing those results to postgre.
Not sure what's the issue. Also is there any better approach for this??
Regards,
Sorabh
By Default spark-sql creates 200 partitions, which means when you are trying to save the datafrmae it will be saved in 200 parquet files. you can reduce the number of partitions for Dataframe using below techniques.
At application level. Set the parameter "spark.sql.shuffle.partitions" as follows :
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
Reduce the number of partition for a particular DataFrame as follows :
df.coalesce(10).write.save(...)
Using the var for dataframe are not suggested, You should always use val and create a new Dataframe after performing some transformation in dataframe.
Please remove all the var and replace with val.
Hope this helps!

Dataframe sample in Apache spark | Scala

I'm trying to take out samples from two dataframes wherein I need the ratio of count maintained. eg
df1.count() = 10
df2.count() = 1000
noOfSamples = 10
I want to sample the data in such a way that i get 10 samples of size 101 each( 1 from df1 and 100 from df2)
Now while doing so,
var newSample = df1.sample(true, df1.count() / noOfSamples)
println(newSample.count())
What does the fraction here imply? can it be greater than 1? I checked this and this but wasn't able to comprehend it fully.
Also is there anyway we can specify the number of rows to be sampled?
The fraction parameter represents the aproximate fraction of the dataset that will be returned. For instance, if you set it to 0.1, 10% (1/10) of the rows will be returned. For your case, I believe you want to do the following:
val newSample = df1.sample(true, 1D*noOfSamples/df1.count)
However, you may notice that newSample.count will return a different number each time you run it, and that's because the fraction will be a threshold for a random-generated value (as you can see here), so the resulting dataset size can vary. An workaround can be:
val newSample = df1.sample(true, 2D*noOfSamples/df1.count).limit(df1.count/noOfSamples)
Some scalability observations
You may note that doing a df1.count might be expensive as it evaluates the whole DataFrame, and you'll lose one of the benefits of sampling in the first place.
Therefore depending on the context of your application, you may want to use an already known number of total samples, or an approximation.
val newSample = df1.sample(true, 1D*noOfSamples/knownNoOfSamples)
Or assuming the size of your DataFrame as huge, I would still use a fraction and use limit to force the number of samples.
val guessedFraction = 0.1
val newSample = df1.sample(true, guessedFraction).limit(noOfSamples)
As for your questions:
can it be greater than 1?
No. It represents a fraction between 0 and 1. If you set it to 1 it will bring 100% of the rows, so it wouldn't make sense to set it to a number larger than 1.
Also is there anyway we can specify the number of rows to be sampled?
You can specify a larger fraction than the number of rows you want and then use limit, as I show in the second example. Maybe there is another way, but this is the approach I use.
To answer your question, is there anyway we can specify the number of rows to be sampled?
I recently needed to sample a certain number of rows from a spark data frame. I followed the below process,
Convert the spark data frame to rdd.
Example: df_test.rdd
RDD has a functionality called takeSample which allows you to give the number of samples you need with a seed number.
Example: df_test.rdd.takeSample(withReplacement, Number of Samples, Seed)
Convert RDD back to spark data frame using sqlContext.createDataFrame()
Above process combined to single step:
Data Frame (or Population) I needed to Sample from has around 8,000 records:
df_grp_1
df_grp_1
test1 = sqlContext.createDataFrame(df_grp_1.rdd.takeSample(False,125,seed=115))
test1 data frame will have 125 sampled records.
To answer if the fraction can be greater than 1. Yes, it can be if we have replace as yes. If a value greater than 1 is provided with replace false, then following exception will occur:
java.lang.IllegalArgumentException: requirement failed: Upper bound (2.0) must be <= 1.0.
I too find lack of sample by count functionality disturbing. If you are not picky about creating a temp view I find the code below useful (df is your dataframe, count is sample size):
val tableName = s"table_to_sample_${System.currentTimeMillis}"
df.createOrReplaceTempView(tableName)
val sampled = sqlContext.sql(s"select *, rand() as random from ${tableName} order by random limit ${count}")
sqlContext.dropTempTable(tableName)
sampled.drop("random")
It returns an exact count as long as your current row count is as large as your sample size.
The below code works if you want to do a random split of 70% & 30% of a data frame df,
val Array(trainingDF, testDF) = df.randomSplit(Array(0.7, 0.3), seed = 12345)
I use this function for random sampling when exact number of records are desirable:
def row_count_sample (df, row_count, with_replacement=False, random_seed=113170):
ratio = 1.08 * float(row_count) / df.count() # random-sample more as dataframe.sample() is not a guaranteed to give exact record count
# it could be more or less actual number of records returned by df.sample()
if ratio>1.0:
ratio = 1.0
result_df = (df
.sample(with_replacement, ratio, random_seed)
.limit(row_count) # since we oversampled, make exact row count here
)
return result_df
May be you want to try below code..
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

Ordered union on spark RDDs

I am trying to do a sort on key of key-record pairs using apache spark. The key is 10 bytes long and the value is about 90 bytes long. In other words I am trying to replicate the sort benchmark Databricks used to break the sorting record. One of the things I noticed from the documentation is that they sorted on key-line-number pairs as opposed to key-record pairs to probably be cache/tlb friendly. I tried to replicate this approach but have not found a suitable solution. Here is what I have tried:
var keyValueRDD_1 = input.map(x => (x.substring(0, 10), x.substring(12, 13)))
var keyValueRDD_2 = input.map(x => (x.substring(0, 10), x.substring(14, 98))
var result = keyValueRDD_1.sortByKey(true, 1) // assume partitions = 1
var unionResult = result.union(keyValueRDD_2)
var finalResult = unionResult.foldByKey("")(_+_)
When I do a union on the result RDD and keyValueRDD_2 RDD and print the output of the unionResultRDD, the result and keyValueRDD_2 are not interleaved. In other words, it looks like the unionResult RDD has the keyValueRDD_2 contents followed by the result RDD contents. However, when I do a foldByKey operation which combines the values of same key into a single key-value pair, the sorted order is destroyed. I need to do a fold by key operation in order to save the result as the original key-record pair. Is there an alternate rdd function that could be used to achieve this?
Any tips or suggestions would be quite useful.
Thanks
The union method just puts two RDDs one after the other, except if they have the same partitioner. Then it joins the partitions.
What you want to do is impossible.
When you have one RDD sorted (keyValueRDD_1) and another unsorted RDD with the same keys (keyValueRDD_2) then the only way to get the second RDD sorted is to sort it.
The existence of the sorted RDD does not help us sort the second RDD.
The Databricks article talks about an optimization that happens locally on the executors. After the shuffle step, the records are roughly sorted. Each partition now covers a range of keys, but the partitions are unsorted.
Now you have to sort each partition locally, and this is where the prefix optimization helps with cache locality.

Resources