I want to add a column of random values to a dataframe (has an id for each row) for something I am testing. I am struggling to get reproducible results across Spark sessions - same random value against each row id. I am able to reproduce the results by using
from pyspark.sql.functions import rand
new_df = my_df.withColumn("rand_index", rand(seed = 7))
but it only works when I am running it in same Spark session. I am not getting same results once I relaunch Spark and run my script.
I also tried defining a udf, testing to see if i can generate random values (integers) within an interval and using random from Python with random.seed set
import random
random.seed(7)
spark.udf.register("getRandVals", lambda x, y: random.randint(x, y), LongType())
but to no avail.
Is there a way to ensure reproducible random number generation across Spark sessions such that a row id gets same random value? I would really appreciate some guidance :)
Thanks for the help!
I suspect that you are getting the same common values for the seed, but in different order based on your partitioning which is influenced by the data distribution when reading from disk and there could be more or less data per time. But I am not privy to your code in reality.
The rand function generates the same random data (what is the point of the seed otherwise) and somehow the partitions get a slice of it. If you look you should guess the pattern!
Here is an an example of 2 different cardinality dataframes. You can see that the seed gives the same or a superset of results. So, ordering and partitioning play a role imo.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.functions import col
df1 = spark.range(1, 5).select(col("id").cast("double"))
df1 = df1.withColumn("rand_index", rand(seed = 7))
df1.show()
df1.rdd.getNumPartitions()
print('Partitioning distribution: '+ str(df1.rdd.glom().map(len).collect()))
returns:
+---+-------------------+
| id| rand_index|
+---+-------------------+
|1.0|0.06498948189958098|
|2.0|0.41371264720975787|
|3.0|0.12030715258495939|
|4.0| 0.2731073068483362|
+---+-------------------+
8 partitions & Partitioning distribution: [0, 1, 0, 1, 0, 1, 0, 1]
The same again with more data:
...
df1 = spark.range(1, 10).select(col("id").cast("double"))
...
returns:
+---+-------------------+
| id| rand_index|
+---+-------------------+
|1.0| 0.9147159860432812|
|2.0|0.06498948189958098|
|3.0| 0.7069655052310547|
|4.0|0.41371264720975787|
|5.0| 0.1982919638208397|
|6.0|0.12030715258495939|
|7.0|0.44292918521277047|
|8.0| 0.2731073068483362|
|9.0| 0.7784518091224375|
+---+-------------------+
8 partitions & Partitioning distribution: [1, 1, 1, 1, 1, 1, 1, 2]
You can see 4 common random values - within a Spark session or out of session.
I know it's a bit late, but have you considered using hashing of IDs, dates etc. that are deterministic, instead of using built-in random functions? I'm encountering similar issue but I believe my problem can be solved using for example xxhash64, which is a PySpark built-in hash function. You can then use the last few digits, or normalize if you know the total range of the hash values, which I couldn't find in its documentations.
Related
How can I avoid lazy evaluation in spark. I have a data frame which needs to be populated at once, since I need to filter the data on the basis of random number generated for each row of data frame, say if random number generated > 0.5, it will be filtered as dataA and if random number generated < 0.5 it will be filtered as dataB.
val randomNumberDF = df.withColumn("num", Math.random())
val dataA = randomNumberDF.filter(col("num") >= 0.5)
val dataB = randomNumberDF.filter(col("num") < 0.5)
Since spark is doing lazy eval, while filtering there is no reliable distribution of rows which are being filtered as dataA and dataB(sometimes same row is being present in both dataA and dataB)
How can I avoid this re-computation of "num" column, I have tried using "cache", which worked, but given my data size is going to be big, I am ruling out that solution.
I have also tried using other actions on the randomNumberDF, like :
count
rdd.count
show
first
these didn't solve the problem.
Please suggest something different from cache/persist/writing data to HDFS and again reading it as solution.
References I have already checked :
How to force spark to avoid Dataset re-computation?
How to force Spark to only execute a transformation once?
How to force Spark to evaluate DataFrame operations inline
If all you're looking for is a way to ensure that the same values are in randomNumberDF.num, then you can generate random numbers with a seed (using org.apache.spark.sql.functions.rand()):
The below is using 112 as the seed value:
val randomNumberDF = df.withColumn("num", rand(112))
val dataA = randomNumberDF.filter(col("num") >= 0.5)
val dataB = randomNumberDF.filter(col("num") < 0.5)
That will ensure that the values in num are the same across the multiple evaluations of randomNumberDF.
besides using org.apache.spark.sql.functions.rand with a given seed, you coud use eager-checkpointing:
df.checkpoint(true)
This will materialize the dataframe to disk
My question is triggered by the use case of calculating the differences between consecutive rows in a spark dataframe.
For example, I have:
>>> df.show()
+-----+----------+
|index| col1|
+-----+----------+
| 0.0|0.58734024|
| 1.0|0.67304325|
| 2.0|0.85154736|
| 3.0| 0.5449719|
+-----+----------+
If I choose to calculate these using "Window" functions, then I can do that like so:
>>> winSpec = Window.partitionBy(df.index >= 0).orderBy(df.index.asc())
>>> import pyspark.sql.functions as f
>>> df.withColumn('diffs_col1', f.lag(df.col1, -1).over(winSpec) - df.col1).show()
+-----+----------+-----------+
|index| col1| diffs_col1|
+-----+----------+-----------+
| 0.0|0.58734024|0.085703015|
| 1.0|0.67304325| 0.17850411|
| 2.0|0.85154736|-0.30657548|
| 3.0| 0.5449719| null|
+-----+----------+-----------+
Question: I explicitly partitioned the dataframe in a single partition. What is the performance impact of this and, if there is, why is that so and how could I avoid it? Because when I do not specify a partition, I get the following warning:
16/12/24 13:52:27 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
In practice performance impact will be almost the same as if you omitted partitionBy clause at all. All records will be shuffled to a single partition, sorted locally and iterated sequentially one by one.
The difference is only in the number of partitions created in total. Let's illustrate that with an example using simple dataset with 10 partitions and 1000 records:
df = spark.range(0, 1000, 1, 10).toDF("index").withColumn("col1", f.randn(42))
If you define frame without partition by clause
w_unpart = Window.orderBy(f.col("index").asc())
and use it with lag
df_lag_unpart = df.withColumn(
"diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
)
there will be only one partition in total:
df_lag_unpart.rdd.glom().map(len).collect()
[1000]
Compared to that frame definition with dummy index (simplified a bit compared to your code:
w_part = Window.partitionBy(f.lit(0)).orderBy(f.col("index").asc())
will use number of partitions equal to spark.sql.shuffle.partitions:
spark.conf.set("spark.sql.shuffle.partitions", 11)
df_lag_part = df.withColumn(
"diffs_col1", f.lag("col1", 1).over(w_part) - f.col("col1")
)
df_lag_part.rdd.glom().count()
11
with only one non-empty partition:
df_lag_part.rdd.glom().filter(lambda x: x).count()
1
Unfortunately there is no universal solution which can be used to address this problem in PySpark. This just an inherent mechanism of the implementation combined with distributed processing model.
Since index column is sequential you could generate artificial partitioning key with fixed number of records per block:
rec_per_block = df.count() // int(spark.conf.get("spark.sql.shuffle.partitions"))
df_with_block = df.withColumn(
"block", (f.col("index") / rec_per_block).cast("int")
)
and use it to define frame specification:
w_with_block = Window.partitionBy("block").orderBy("index")
df_lag_with_block = df_with_block.withColumn(
"diffs_col1", f.lag("col1", 1).over(w_with_block) - f.col("col1")
)
This will use expected number of partitions:
df_lag_with_block.rdd.glom().count()
11
with roughly uniform data distribution (we cannot avoid hash collisions):
df_lag_with_block.rdd.glom().map(len).collect()
[0, 180, 0, 90, 90, 0, 90, 90, 100, 90, 270]
but with a number of gaps on the block boundaries:
df_lag_with_block.where(f.col("diffs_col1").isNull()).count()
12
Since boundaries are easy to compute:
from itertools import chain
boundary_idxs = sorted(chain.from_iterable(
# Here we depend on sequential identifiers
# This could be generalized to any monotonically increasing
# id by taking min and max per block
(idx - 1, idx) for idx in
df_lag_with_block.groupBy("block").min("index")
.drop("block").rdd.flatMap(lambda x: x)
.collect()))[2:] # The first boundary doesn't carry useful inf.
you can always select:
missing = df_with_block.where(f.col("index").isin(boundary_idxs))
and fill these separately:
# We use window without partitions here. Since number of records
# will be small this won't be a performance issue
# but will generate "Moving all data to a single partition" warning
missing_with_lag = missing.withColumn(
"diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
).select("index", f.col("diffs_col1").alias("diffs_fill"))
and join:
combined = (df_lag_with_block
.join(missing_with_lag, ["index"], "leftouter")
.withColumn("diffs_col1", f.coalesce("diffs_col1", "diffs_fill")))
to get desired result:
mismatched = combined.join(df_lag_unpart, ["index"], "outer").where(
combined["diffs_col1"] != df_lag_unpart["diffs_col1"]
)
assert mismatched.count() == 0
The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?
The describe method provides only the count but not the distinct count, and I wonder if there is a a way to get the distinct count for all (or some selected) columns.
In pySpark you could do something like this, using countDistinct():
from pyspark.sql.functions import col, countDistinct
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
Similarly in Scala :
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approxating distinct count:
val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// | 2| 2| 3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) relies on the following clever observation.
If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.
For example, if we observe a number whose digits in binary form are of the form 0…(k times)…01…1, then we can estimate that there are in the order of 2^k elements in the set. This is a very crude estimate but it can be refined to great precision with a sketching algorithm.
A thorough explanation of the mechanics behind this algorithm can be found in the original paper.
Note: Starting Spark 1.6, when Spark calls SELECT SOME_AGG(DISTINCT foo)), SOME_AGG(DISTINCT bar)) FROM df each clause should trigger separate aggregation for each clause. Whereas this is different than SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df where we aggregate once. Thus the performance won't be comparable when using a count(distinct(_)) and approxCountDistinct (or approx_count_distinct).
It's one of the changes of behavior since Spark 1.6 :
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5’s planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)
Reference : Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.
if you just want to count for particular column then following could help. Although its late answer. it might help someone. (pyspark 2.2.0 tested)
from pyspark.sql.functions import col, countDistinct
df.agg(countDistinct(col("colName")).alias("count")).show()
Adding to desaiankitb's answer, this would provide you a more intuitive answer :
from pyspark.sql.functions import count
df.groupBy(colname).count().show()
You can use the count(column name) function of SQL
Alternatively if you are using data analysis and want a rough estimation and not exact count of each and every column you can use approx_count_distinct function
approx_count_distinct(expr[, relativeSD])
This is one way to create dataframe with every column counts :
> df = df.to_pandas_on_spark()
> collect_df = []
> for i in df.columns:
> collect_df.append({"field_name": i , "unique_count": df[i].nunique()})
> uniquedf = spark.createDataFrame(collect_df)
Output would like below. I used this with another dataframe to compare values if columns names are same.Other dataframe was also created way then joined.
df_prod_merged = uniquedf1.join(uniquedf2, on='field_name', how="left")
This is easy way to do it might be expensive on very huge data like 1 tb to process but still very efficient when used to_pandas_on_spark()
Context: I have a dataset too large to fit in memory I am training a Keras RNN on. I am using PySpark on an AWS EMR Cluster to train the model in batches that are small enough to be stored in memory. I was not able to implement the model as distributed using elephas and I suspect this is related to my model being stateful. I'm not entirely sure though.
The dataframe has a row for every user and days elapsed from the day of install from 0 to 29. After querying the database I do a number of operations on the dataframe:
query = """WITH max_days_elapsed AS (
SELECT user_id,
max(days_elapsed) as max_de
FROM table
GROUP BY user_id
)
SELECT table.*
FROM table
LEFT OUTER JOIN max_days_elapsed USING (user_id)
WHERE max_de = 1
AND days_elapsed < 1"""
df = read_from_db(query) #this is just a custom function to query our database
#Create features vector column
assembler = VectorAssembler(inputCols=features_list, outputCol="features")
df_vectorized = assembler.transform(df)
#Split users into train and test and assign batch number
udf_randint = udf(lambda x: np.random.randint(0, x), IntegerType())
training_users, testing_users = df_vectorized.select("user_id").distinct().randomSplit([0.8,0.2],123)
training_users = training_users.withColumn("batch_number", udf_randint(lit(N_BATCHES)))
#Create and sort train and test dataframes
train = df_vectorized.join(training_users, ["user_id"], "inner").select(["user_id", "days_elapsed","batch_number","features", "kpi1", "kpi2", "kpi3"])
train = train.sort(["user_id", "days_elapsed"])
test = df_vectorized.join(testing_users, ["user_id"], "inner").select(["user_id","days_elapsed","features", "kpi1", "kpi2", "kpi3"])
test = test.sort(["user_id", "days_elapsed"])
The problem I am having is that I cannot seem to be able to filter on batch_number without caching train. I can filter on any of the columns that are in the original dataset in our database, but not on any column I have generated in pyspark after querying the database:
This: train.filter(train["days_elapsed"] == 0).select("days_elapsed").distinct.show() returns only 0.
But, all of these return all of the batch numbers between 0 and 9 without any filtering:
train.filter(train["batch_number"] == 0).select("batch_number").distinct().show()
train.filter(train.batch_number == 0).select("batch_number").distinct().show()
train.filter("batch_number = 0").select("batch_number").distinct().show()
train.filter(col("batch_number") == 0).select("batch_number").distinct().show()
This also does not work:
train.createOrReplaceTempView("train_table")
batch_df = spark.sql("SELECT * FROM train_table WHERE batch_number = 1")
batch_df.select("batch_number").distinct().show()
All of these work if I do train.cache() first. Is that absolutely necessary or is there a way to do this without caching?
Spark >= 2.3 (? - depending on a progress of SPARK-22629)
It should be possible to disable certain optimization using asNondeterministic method.
Spark < 2.3
Don't use UDF to generate random numbers. First of all, to quote the docs:
The user-defined functions must be deterministic. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.
Even if it wasn't for UDF, there are Spark subtleties, which make it almost impossible to implement this right, when processing single records.
Spark already provides rand:
Generates a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
and randn
Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
which can be used to build more complex generator functions.
Note:
There can be some other issues with your code but this makes it unacceptable from the beginning (Random numbers generation in PySpark, pyspark. Transformer that generates a random number generates always the same number).
The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?
The describe method provides only the count but not the distinct count, and I wonder if there is a a way to get the distinct count for all (or some selected) columns.
In pySpark you could do something like this, using countDistinct():
from pyspark.sql.functions import col, countDistinct
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
Similarly in Scala :
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approxating distinct count:
val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// | 2| 2| 3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) relies on the following clever observation.
If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.
For example, if we observe a number whose digits in binary form are of the form 0…(k times)…01…1, then we can estimate that there are in the order of 2^k elements in the set. This is a very crude estimate but it can be refined to great precision with a sketching algorithm.
A thorough explanation of the mechanics behind this algorithm can be found in the original paper.
Note: Starting Spark 1.6, when Spark calls SELECT SOME_AGG(DISTINCT foo)), SOME_AGG(DISTINCT bar)) FROM df each clause should trigger separate aggregation for each clause. Whereas this is different than SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df where we aggregate once. Thus the performance won't be comparable when using a count(distinct(_)) and approxCountDistinct (or approx_count_distinct).
It's one of the changes of behavior since Spark 1.6 :
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5’s planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)
Reference : Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.
if you just want to count for particular column then following could help. Although its late answer. it might help someone. (pyspark 2.2.0 tested)
from pyspark.sql.functions import col, countDistinct
df.agg(countDistinct(col("colName")).alias("count")).show()
Adding to desaiankitb's answer, this would provide you a more intuitive answer :
from pyspark.sql.functions import count
df.groupBy(colname).count().show()
You can use the count(column name) function of SQL
Alternatively if you are using data analysis and want a rough estimation and not exact count of each and every column you can use approx_count_distinct function
approx_count_distinct(expr[, relativeSD])
This is one way to create dataframe with every column counts :
> df = df.to_pandas_on_spark()
> collect_df = []
> for i in df.columns:
> collect_df.append({"field_name": i , "unique_count": df[i].nunique()})
> uniquedf = spark.createDataFrame(collect_df)
Output would like below. I used this with another dataframe to compare values if columns names are same.Other dataframe was also created way then joined.
df_prod_merged = uniquedf1.join(uniquedf2, on='field_name', how="left")
This is easy way to do it might be expensive on very huge data like 1 tb to process but still very efficient when used to_pandas_on_spark()