Spark: problem with crossJoin (takes a tremendous amount of time) - apache-spark

First of all, I have to say that I've already tried everything I know or found on Google (including Spark: How to use crossJoin, which is exactly my problem).
I have to calculate the Cartesian product of two DataFrames, countries and units, like so:
A.cache().count()
val units = A.groupBy("country")
  .agg(sum("grade").as("grade"),
       sum("point").as("point"))
  .withColumn("AVR", $"grade" / $"point" * 1000)
  .drop("point", "grade")
val countries = D.select("country").distinct()
val C = countries.crossJoin(units)
countries contains country names and its size is bounded by 150. units is a DataFrame with 3 rows, the aggregated result of another DataFrame. I checked the result 100 times and those are indeed the sizes, yet it takes 5 hours to complete.
I know I missed something. I've tried caching, repartitioning, etc.
I would love to get some other ideas.

I have two suggestions for you:
Look at the explain plan and the Spark properties; for the amount of data you have mentioned, 5 hours is a really long time. My expectation is that you have way too many shuffles, so you can look at properties like spark.sql.shuffle.partitions.
Instead of doing a cross join, you can maybe do a collect and explore broadcasts:
https://sparkbyexamples.com/spark/spark-broadcast-variables/
But do this only on small amounts of data, as this data is brought back to the driver.
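Here is a minimal sketch of both suggestions (illustrative only, assuming the DataFrames above and the sparkSession from the question; it uses Spark SQL's broadcast join hint rather than a raw broadcast variable):
import org.apache.spark.sql.functions.broadcast

// Inspect the physical plan for unnecessary exchanges/shuffles.
countries.crossJoin(units).explain(true)

// The default spark.sql.shuffle.partitions is 200, far too many for a ~150 x 3 row result.
sparkSession.conf.set("spark.sql.shuffle.partitions", 10)

// Hint Spark to broadcast the tiny units side instead of shuffling it.
val C = countries.crossJoin(broadcast(units))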

What is the action you are doing afterwards with C?
Also, if these datasets are so small, consider collecting them to the driver and doing these manipulations there; you can always spark.createDataFrame later again.
Update #1:
final case class Unit(country: String, AVR: Double)

val collectedUnits: Seq[Unit] = units.as[Unit].collect
val collectedCountries: Seq[String] = countries.as[String].collect

val pairs: Seq[(String, Unit)] = for {
  unit    <- collectedUnits
  country <- collectedCountries
} yield (country, unit)
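If a DataFrame is needed again afterwards, the driver-side result can be turned back into one, as the answer suggests. A minimal sketch, assuming the sparkSession from the question:
import sparkSession.implicits._
// The Unit case class becomes a struct column holding country and AVR.
val C = pairs.toDF("country", "unit")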

I've finally understood the problem: Spark was using an excessive number of shuffle partitions, and thus the shuffle took a lot of time.
The way to solve it is to change the default number:
sparkSession.conf.set("spark.sql.shuffle.partitions", 10)
And it works like magic.

Related

How can I repartition an RDD by key and then pack it into shards?

I have many files containing millions of rows in the format:
id, created_date, some_value_a, some_value_b, some_value_c
This way of repartitioning was super slow and created over a million small ~500 B files for me:
rdd_df = rdd.toDF(["id", "created_time", "a", "b", "c"])
rdd_df.write.partitionBy("id").csv("output")
I would like the output files to each contain around 10,000 unique IDs and all of their rows.
How could I achieve something like this?
You can repartition by adding a random salt key:
import org.apache.spark.sql.functions.{col, max, rand}

val totRows = rdd_df.count
val maxRowsForAnId = rdd_df.groupBy("id").count()
  .agg(max("count")).first().getLong(0)
val numParts1 = (totRows / maxRowsForAnId).toInt

val totalUniqueIds = rdd_df.select("id").distinct.count
val numParts2 = (totalUniqueIds / 10000).toInt  // aiming for ~10,000 ids per file
val numPart = numParts1.min(numParts2).max(1)   // guard against zero partitions

rdd_df
  .repartition(numPart, col("id"), rand())
  .write.csv("output")
The main concept is that each partition will be written as one file, so you have to bring your required rows into one partition with repartition(numPart, col("id"), rand()).
The first few operations just calculate how many partitions we need to get roughly 10,000 ids per file:
Calculate assuming 10,000 ids per partition.
Corner case: a single id may have too many rows and not fit into the partition size calculated above.
Hence we also calculate the number of partitions according to the largest per-id count.
Take the min of the two partition counts.
rand() is necessary so that we can bring multiple IDs into a single partition.
NOTE: Although this gives you larger files and each file is sure to contain a set of unique ids, it involves shuffling, due to which your operation might actually be slower than the code you mentioned in the question.
You would need something like this:
rdd_df.repartition(*number of partitions you want*).write.csv("output", header = True)
Or, honestly, just let the job decide the number of partitions instead of repartitioning. In theory, that should be faster:
rdd_df.write.csv("output", header = True)

Spark DataFrame: count distinct values of every column

The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?
The describe method provides only the count but not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
In pySpark you could do something like this, using countDistinct():
from pyspark.sql.functions import col, countDistinct
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
Similarly, in Scala:
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
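For example, a sketch of the same per-column select with the approximate aggregate (assuming Spark 2.1+, where approx_count_distinct is available):
import org.apache.spark.sql.functions.{approx_count_distinct, col}

df.select(df.columns.map(c => approx_count_distinct(col(c)).alias(c)): _*)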
Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:
val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// | 2| 2| 3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) rely on the following clever observation.
If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.
For example, if we observe a number whose digits in binary form are of the form 0…(k times)…01…1, then we can estimate that there are in the order of 2^k elements in the set. This is a very crude estimate but it can be refined to great precision with a sketching algorithm.
A thorough explanation of the mechanics behind this algorithm can be found in the original paper.
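As a purely illustrative toy sketch of that observation (not Spark's actual HyperLogLog++ implementation), one could hash the values and derive a crude estimate from the maximum number of leading zeros:
import scala.util.hashing.MurmurHash3

// Crude estimate: 2^k, where k is the largest number of leading zeros seen among the hashes.
def crudeDistinctEstimate(values: Seq[String]): Long = {
  val maxLeadingZeros = values
    .map(v => Integer.numberOfLeadingZeros(MurmurHash3.stringHash(v)))
    .max
  1L << maxLeadingZeros
}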
Note: Starting with Spark 1.6, when Spark calls SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, each clause triggers a separate aggregation. This is different from SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df, where we aggregate once. Thus the performance won't be comparable when using count(distinct _) versus approxCountDistinct (or approx_count_distinct).
It's one of the behavior changes since Spark 1.6:
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5’s planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)
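If you do want the old plan back, the switch named in that release note can be set like this (a sketch, assuming a Spark 1.6 SQLContext named sqlContext):
sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")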
Reference: Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.
If you just want the count for a particular column, then the following could help. Although it's a late answer, it might help someone (tested on PySpark 2.2.0).
from pyspark.sql.functions import col, countDistinct
df.agg(countDistinct(col("colName")).alias("count")).show()
Adding to desaiankitb's answer, this would provide you with a more intuitive answer:
from pyspark.sql.functions import count
df.groupBy(colname).count().show()
You can use the count(column name) function of SQL
Alternatively, if you are doing data analysis and want a rough estimate rather than an exact count of each and every column, you can use the approx_count_distinct function:
approx_count_distinct(expr[, relativeSD])
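For instance, a hypothetical Scala call with an explicit relative standard deviation (df and col1 are illustrative names, not from the question):
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Allow roughly 1% relative standard deviation in the estimate.
df.select(approx_count_distinct(col("col1"), 0.01).alias("col1_distinct")).show()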
This is one way to create a dataframe with the distinct count for every column:
df = df.to_pandas_on_spark()
collect_df = []
for i in df.columns:
    collect_df.append({"field_name": i, "unique_count": df[i].nunique()})
uniquedf = spark.createDataFrame(collect_df)
The output would look like the below. I used this with another dataframe to compare values when the column names are the same; the other dataframe was created the same way and then joined:
df_prod_merged = uniquedf1.join(uniquedf2, on='field_name', how="left")
This is an easy way to do it. It might be expensive to process very large data (like 1 TB), but it is still quite efficient when using to_pandas_on_spark().

Dataframe sample in Apache Spark | Scala

I'm trying to take samples from two dataframes while maintaining the ratio of their counts, e.g.
df1.count() = 10
df2.count() = 1000
noOfSamples = 10
I want to sample the data in such a way that I get 10 samples of size 101 each (1 from df1 and 100 from df2).
Now while doing so,
var newSample = df1.sample(true, df1.count() / noOfSamples)
println(newSample.count())
What does the fraction here imply? Can it be greater than 1? I checked this and this but wasn't able to comprehend it fully.
Also, is there any way we can specify the number of rows to be sampled?
The fraction parameter represents the approximate fraction of the dataset that will be returned. For instance, if you set it to 0.1, 10% (1/10) of the rows will be returned. For your case, I believe you want to do the following:
val newSample = df1.sample(true, 1D*noOfSamples/df1.count)
However, you may notice that newSample.count will return a different number each time you run it, and that's because the fraction is used as a threshold for a randomly generated value (as you can see here), so the resulting dataset size can vary. A workaround can be:
val newSample = df1.sample(true, 2D*noOfSamples/df1.count).limit(df1.count/noOfSamples)
Some scalability observations
You may note that doing a df1.count might be expensive as it evaluates the whole DataFrame, and you'll lose one of the benefits of sampling in the first place.
Therefore, depending on the context of your application, you may want to use an already known total row count, or an approximation.
val newSample = df1.sample(true, 1D*noOfSamples/knownNoOfSamples)
Or, if your DataFrame is huge, I would still use a fraction and use limit to force the number of samples:
val guessedFraction = 0.1
val newSample = df1.sample(true, guessedFraction).limit(noOfSamples)
As for your questions:
can it be greater than 1?
No. It represents a fraction between 0 and 1. If you set it to 1 it will bring 100% of the rows, so it wouldn't make sense to set it to a number larger than 1.
Also is there anyway we can specify the number of rows to be sampled?
You can specify a larger fraction than the number of rows you want and then use limit, as I show in the second example. Maybe there is another way, but this is the approach I use.
To answer your question "is there any way we can specify the number of rows to be sampled?":
I recently needed to sample a certain number of rows from a Spark data frame. I followed the process below:
Convert the spark data frame to rdd.
Example: df_test.rdd
RDD has a functionality called takeSample which allows you to give the number of samples you need with a seed number.
Example: df_test.rdd.takeSample(withReplacement, Number of Samples, Seed)
Convert RDD back to spark data frame using sqlContext.createDataFrame()
The above process combined into a single step:
The data frame (or population) I needed to sample from, df_grp_1, has around 8,000 records:
test1 = sqlContext.createDataFrame(df_grp_1.rdd.takeSample(False, 125, seed=115))
The test1 data frame will have 125 sampled records.
To answer whether the fraction can be greater than 1: yes, it can be, if withReplacement is true. If a value greater than 1 is provided with withReplacement set to false, the following exception will occur:
java.lang.IllegalArgumentException: requirement failed: Upper bound (2.0) must be <= 1.0.
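A small illustration of that bound check, with a hypothetical DataFrame df:
df.sample(withReplacement = true, fraction = 2.0)   // allowed: rows may be picked more than once
df.sample(withReplacement = false, fraction = 2.0)  // throws: Upper bound (2.0) must be <= 1.0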
I too find the lack of a sample-by-count functionality disturbing. If you are not picky about creating a temp view, I find the code below useful (df is your dataframe, count is the sample size):
val tableName = s"table_to_sample_${System.currentTimeMillis}"
df.createOrReplaceTempView(tableName)
val sampled = sqlContext.sql(s"select *, rand() as random from ${tableName} order by random limit ${count}")
sqlContext.dropTempTable(tableName)
sampled.drop("random")
It returns an exact count as long as your current row count is as large as your sample size.
The code below works if you want to do a random 70% / 30% split of a data frame df:
val Array(trainingDF, testDF) = df.randomSplit(Array(0.7, 0.3), seed = 12345)
I use this function for random sampling when an exact number of records is desired:
def row_count_sample(df, row_count, with_replacement=False, random_seed=113170):
    # Over-sample a little, since dataframe.sample() is not guaranteed to give the exact
    # record count; it could return more or fewer records than requested.
    ratio = 1.08 * float(row_count) / df.count()
    if ratio > 1.0:
        ratio = 1.0
    result_df = (df
                 .sample(with_replacement, ratio, random_seed)
                 .limit(row_count)  # since we oversampled, enforce the exact row count here
                 )
    return result_df
Maybe you want to try the code below:
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

Awfully slow execution on small datasets – where to start debugging?

I do some experimentation on a MacBook (i5, 2.6 GHz, 8 GB RAM) with a Zeppelin notebook and Spark in standalone mode. spark.executor.memory and spark.driver.memory both get 2g. I have also set spark.serializer org.apache.spark.serializer.KryoSerializer in spark-defaults.conf, but that seems to be ignored by Zeppelin.
ALS model
I have trained an ALS model with ~400k (implicit) ratings and want to get recommendations with val allRecommendations = model.recommendProductsForUsers(1)
Sample set
Next I take a sample to play around with
val sampledRecommendations = allRecommendations.sample(false, 0.05, 1234567).cache
This contains 3600 recommendations.
Remove product recommendations that users own
Next I want to remove all ratings for products that a given user already owns; I hold that list in an RDD of the form (user_id, Set[product_ids]): RDD[(Long, scala.collection.mutable.HashSet[Int])]
val productRecommendations = (sampledRecommendations
  // add user portfolio to the list, but convert the key from Long to Int first
  .join(usersProductsFlat.map( up => (up._1.toInt, up._2) ))
  .mapValues(
    // (user, (ratings: Array[Rating], usersOwnedProducts: HashSet[Long]))
    r => (r._1
      .filter( rating => !r._2.contains(rating.product))
      .filter( rating => rating.rating > 0.5)
      .toList
    )
  )
  // In case there is no recommendation (left), remove the entry
  .filter(rating => !rating._2.isEmpty)
).cache
Question 1
Calling this (productRecommendations.count) on the cached sample set generates a stage that includes flatMap at MatrixFactorizationModel.scala:278 with 10,000 tasks, 263.6 MB of input data and 196.0 MB shuffle write. Shouldn't the tiny and cached RDD be used instead and what is going (wr)on(g) here? The execution of the count takes almost 5 minutes!
Question 2
Calling usersProductsFlat.count, which is fully cached according to the "Storage" view in the application UI, takes ~60 seconds each time. It's 23 MB in size – shouldn't that be a lot faster?
Map to readable form
Next I bring this into a readable form, replacing IDs with names from a broadcast lookup Map, to put into a DF/table:
val readableRatings = (productRecommendations
  .flatMapValues(x => x)
  .map( r => (r._1, userIdToMailBC.value(r._1), r._2.product.toInt, productIdToNameBC.value(r._2.product), r._2.rating))
).cache
val readableRatingsDF = readableRatings.toDF("user","email", "product_id", "product", "rating").cache
readableRatingsDF.registerTempTable("recommendations")
Select … with patience
The insane part starts here. Doing a SELECT takes several hours (I could never wait for one to finish):
%sql
SELECT COUNT(user) AS usr_cnt, product, AVG(rating) AS avg_rating
FROM recommendations
GROUP BY product
I don't know where to look to find the bottlenecks; there is obviously some huge kerfuffle going on here! Where can I start looking?
Your number of partitions may be too large. I think you should use about 200 when running in local mode rather than 10000. You can set the number of partitions in different ways. I suggest you edit the spark.default.parallelism flag in the Spark configuration file.
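For example, a sketch of that suggestion with illustrative values (assuming Spark 2.x, where a SparkSession is available; adjust the numbers for your machine):
// Either add a line like this to conf/spark-defaults.conf:
//   spark.default.parallelism   200
// or set it programmatically when building the session:
val spark = org.apache.spark.sql.SparkSession.builder()
  .master("local[*]")
  .config("spark.default.parallelism", "200")
  .config("spark.sql.shuffle.partitions", "200")  // also affects the SQL GROUP BY above
  .getOrCreate()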
