Filtering Spark DataFrame on new column - apache-spark

Context: I am training a Keras RNN on a dataset too large to fit in memory. I am using PySpark on an AWS EMR cluster to train the model in batches that are small enough to be stored in memory. I was not able to implement the model as a distributed model using elephas, and I suspect this is related to my model being stateful. I'm not entirely sure, though.
The dataframe has a row for every user and every day elapsed since the day of install, from 0 to 29. After querying the database, I do a number of operations on the dataframe:
query = """WITH max_days_elapsed AS (
SELECT user_id,
max(days_elapsed) as max_de
FROM table
GROUP BY user_id
)
SELECT table.*
FROM table
LEFT OUTER JOIN max_days_elapsed USING (user_id)
WHERE max_de = 1
AND days_elapsed < 1"""
df = read_from_db(query) #this is just a custom function to query our database
#Create features vector column
assembler = VectorAssembler(inputCols=features_list, outputCol="features")
df_vectorized = assembler.transform(df)
#Split users into train and test and assign batch number
udf_randint = udf(lambda x: np.random.randint(0, x), IntegerType())
training_users, testing_users = df_vectorized.select("user_id").distinct().randomSplit([0.8,0.2],123)
training_users = training_users.withColumn("batch_number", udf_randint(lit(N_BATCHES)))
#Create and sort train and test dataframes
train = df_vectorized.join(training_users, ["user_id"], "inner").select(["user_id", "days_elapsed","batch_number","features", "kpi1", "kpi2", "kpi3"])
train = train.sort(["user_id", "days_elapsed"])
test = df_vectorized.join(testing_users, ["user_id"], "inner").select(["user_id","days_elapsed","features", "kpi1", "kpi2", "kpi3"])
test = test.sort(["user_id", "days_elapsed"])
The problem I am having is that I cannot filter on batch_number without caching train. I can filter on any of the columns that are in the original dataset in our database, but not on any column I have generated in PySpark after querying the database:
This: train.filter(train["days_elapsed"] == 0).select("days_elapsed").distinct().show() returns only 0.
But, all of these return all of the batch numbers between 0 and 9 without any filtering:
train.filter(train["batch_number"] == 0).select("batch_number").distinct().show()
train.filter(train.batch_number == 0).select("batch_number").distinct().show()
train.filter("batch_number = 0").select("batch_number").distinct().show()
train.filter(col("batch_number") == 0).select("batch_number").distinct().show()
This also does not work:
train.createOrReplaceTempView("train_table")
batch_df = spark.sql("SELECT * FROM train_table WHERE batch_number = 1")
batch_df.select("batch_number").distinct().show()
All of these work if I do train.cache() first. Is that absolutely necessary or is there a way to do this without caching?

Spark >= 2.3 (? - depending on the progress of SPARK-22629)
It should be possible to disable certain optimizations using the asNondeterministic method.
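For example, a minimal sketch of that approach, reusing udf_randint, N_BATCHES and training_users from the question (this assumes Spark 2.3+, where asNondeterministic is available; it is not a guaranteed fix):
import numpy as np
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import IntegerType
# Marking the UDF as nondeterministic tells the optimizer it must not collapse
# or duplicate its invocations.
udf_randint = udf(lambda x: int(np.random.randint(0, x)), IntegerType()).asNondeterministic()
training_users = training_users.withColumn("batch_number", udf_randint(lit(N_BATCHES)))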
Spark < 2.3
Don't use a UDF to generate random numbers. First of all, to quote the docs:
The user-defined functions must be deterministic. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.
Even setting the UDF aside, there are Spark subtleties that make it almost impossible to implement this correctly when processing single records.
Spark already provides rand:
Generates a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
and randn
Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
which can be used to build more complex generator functions.
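For instance, a hedged sketch of assigning the question's batch_number from a seeded rand instead of a UDF (reusing training_users and N_BATCHES from the question; the seed value is arbitrary):
from pyspark.sql.functions import rand, floor
# Scale a seeded uniform sample to [0, N_BATCHES) and floor it to get an integer batch id.
training_users = training_users.withColumn(
    "batch_number", floor(rand(seed=123) * N_BATCHES).cast("int")
)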
Note:
There may be some other issues with your code, but this alone makes it unacceptable from the beginning (Random numbers generation in PySpark, pyspark. Transformer that generates a random number generates always the same number).

Related

AWS Glue dynamic frames not getting updated

We are currently facing an issue where we cannot insert more than 600K records into an Oracle DB using AWS Glue. We are getting a connection reset error and the DBAs are currently looking into it. As a temporary solution we thought of adding the data in chunks, by splitting the dataframe into multiple dataframes and looping over this list of dataframes to add the data. We are sure the splitting algorithm works fine, and here is the code we use:
from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile
def split_by_row_index(df, num_partitions=10):
    # Let's assume you don't have a row_id column that has the row order
    t = df.withColumn('_row_id', monotonically_increasing_id())
    # Using ntile() because monotonically_increasing_id is discontinuous across partitions
    t = t.withColumn('_partition', ntile(num_partitions).over(Window.orderBy(t._row_id)))
    return [t.filter(t._partition == i + 1) for i in range(num_partitions)]
Here each DF has unique data, but somehow when we convert these DFs to dynamic frames in a loop, we get common data in each dynamic frame. Here is a small snippet for this example:
df_trns_details_list = split_by_row_index(df_trns_details, int(df_trns_details.count() / 100000))
trnsDetails1 = DynamicFrame.fromDF(df_trns_details_list[0], glueContext, "trnsDetails1")
trnsDetails2 = DynamicFrame.fromDF(df_trns_details_list[1], glueContext, "trnsDetails2")
print(df_trns_details_list[0].count())# counts are same
print(trnsDetails1.count())
print('-------------------------------')
print(df_trns_details_list[1].count()) # counts are same
print(trnsDetails2.count())
print('-------------------------------')
subDf1 = trnsDetails1.toDF().select(col("id"), col("details_id"))
subDf2 = trnsDetails2.toDF().select(col("id"), col("details_id"))
common = subDf1.intersect(subDf2)
# ------------------ common data exists----------------
print(common.count())
subDf3 = df_trns_details_list[0].select(col("id"), col("details_id"))
subDf4 = df_trns_details_list[1].select(col("id"), col("details_id"))
#------------------0 common data----------------
common1 = subDf3.intersect(subDf4)
print(common1.count())
Here the id and details_id combination will be unique.
We used this logic in multiple areas where it worked; we are not sure why this is happening here.
We are also quite new to Python and AWS Glue, so any suggestions to improve it are also welcome. Thanks

PySpark apply function on 2 dataframes and write to csv for billions of rows on small hardware

I am trying to apply a levenshtein function for each string in dfs against each string in dfc and write the resulting dataframe to csv. The issue is that I'm creating so many rows by using the cross join and then applying the function, that my machine is struggling to write anything (taking forever to execute).
Trying to improve write performance:
I'm filtering out a few things on the result of the cross join, i.e. rows where the LevenshteinDistance is less than 15% of the target word's length.
Using bucketing on the first letter of each target word, i.e. a, b, c, etc.; still no luck (the job runs for hours and doesn't generate any results).
from datetime import datetime
from config import config
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
def fuzzy_match(dfs, dfc, path_summary):
    """
    Implements the Levenshtein and Soundex algorithms and returns a fuzzy matched DataFrame.
    Filters out those where resulting LS distance is less than 15% of SF name length.
    """
    # Apply Levenshtein and Soundex functions
    dfs = dfs.withColumn("OrganisationNameKeyLen", F.length("OrganisationNameKey"))
    df = dfc\
        .crossJoin(dfs)\
        .withColumn("LevenshteinDistance", F.levenshtein(F.lower("OrganisationNameKey"), F.lower("CompanyNameKey")))\
        .withColumn("HasSameSoundex", F.soundex("OrganisationNameKey") == F.soundex("CompanyNameKey"))\
        .where("LevenshteinDistance < OrganisationNameKeyLen * 0.15")\
        .orderBy("OrganisationName", "CompanyName")
    return df  # returned so main() can pass the result on
def fuzzy_match_approve(df, path_fuzzy_match_approved, path_fuzzy_match_rejected, path_summary):
    """
    Filters fuzzy matching DataFrame results on approved/rejected based on a set of conditions:
    - If there is only 1 match against the SF name
    - If more than 1 match, then take the one with an LS distance of 1
    - If more than 1 match and multiple LS distances of 1, then take the one where the Soundex codes are the same
    Writes results and summary to CSV.
    """
    def write_with_bucket(df, bucket_col, path):
        df.write\
            .mode("overwrite")\
            .bucketBy(26, bucket_col)\
            .option("path", path)\
            .option("header", True)\
            .saveAsTable("bucket", format="csv")
    # Add window function columns:
    # OrganisationNameMatchCount: Count AccountID per OrganisationName
    # LevenshteinDistance1Count: Count AccountID per OrganisationName where LevenshteinDistance = 1
    windowSpec = Window.partitionBy("OrganisationName")
    df = df\
        .select("AccountID", "OrganisationName", "OrganisationNameKey", "CompanyNumber", "CompanyName", "LevenshteinDistance", "HasSameSoundex")\
        .withColumn("OrganisationNameMatchCount", F.count("AccountID").over(windowSpec))\
        .withColumn("LevenshteinDistance1Count", F.count(F.when(F.col("LevenshteinDistance") == 1, F.col("AccountID"))).over(windowSpec))
    # Add bucket key column
    df = df.withColumn("OrganisationNameBucketKey", F.substring(col("OrganisationNameKey"), 0, 1))
    # Define fuzzy match approved conditions
    is_approved_1 = (F.col("OrganisationNameMatchCount") == 1)
    is_approved_2 = ((F.col("OrganisationNameMatchCount") > 1) & (F.col("LevenshteinDistance1Count") == 1) & (F.col("LevenshteinDistance") == 1))
    is_approved_3 = ((F.col("OrganisationNameMatchCount") > 1) & (F.col("LevenshteinDistance1Count") > 1) & (F.col("HasSameSoundex") == 'true'))
    is_approved = is_approved_1 | is_approved_2 | is_approved_3
    # Split fuzzy match results into approved and rejected
    df_approved = df.filter(is_approved)
    df_rejected = df.filter(~is_approved)
    # Export results
    # df_approved.write.csv(path_fuzzy_match_approved, mode="overwrite", header=True, quoteAll=True)
    # df_rejected.write.csv(path_fuzzy_match_rejected, mode="overwrite", header=True, quoteAll=True)
    write_with_bucket(df_approved, "OrganisationNameBucketKey", path_fuzzy_match_approved)
    write_with_bucket(df_rejected, "OrganisationNameBucketKey", path_fuzzy_match_rejected)
def main():
    spark = SparkSession...
    # Apply fuzzy match
    dfs = spark.read...
    dfc = spark.read...
    path_summary = ...
    df_fuzzy_match = fuzzy_match(dfs, dfc, path_summary)
    # Export results
    path_fuzzy_match_approved = ...
    path_fuzzy_match_rejected = ...
    fuzzy_match_approve(df_fuzzy_match, path_fuzzy_match_approved, path_fuzzy_match_rejected, path_summary)
main()
Other info:
df.rdd.getNumPartitions() is 2
dfs.count() is 12,515
dfc.count() is 5,110,430
How can I improve performance here and get the results into a CSV successfully?
There are a couple of things you can do to improve your computation:
Improve parallelism
As Nithish mentioned in the comments, you don't have enough partitions in your input data frames to make use of all your CPU cores. You're not using all your CPU capability and this will slow you down.
To increase your parallelism, repartition dfc to at least your number of cores:
dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism)
You need to do this because your crossJoin is run as a BroadcastNestedLoopJoin which doesn't reshuffle your large input dataframe.
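A quick way to check which strategy Spark chose (not from the original answer, just a generic inspection step) is to print the physical plan and look for BroadcastNestedLoopJoin in the output:
# Inspect the physical plan of the cross join before running it.
dfc.crossJoin(dfs).explain()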
Separate your computation stages
A Spark dataframe/RDD is conceptually just a directed acyclic graph (DAG) of operations to run on your input data; it does not hold data itself. One consequence of this behavior is that, by default, you'll rerun your computations as many times as you reuse your dataframe.
In your fuzzy_match_approve function, you run 2 separate filters on your df, which means you rerun the whole cross-join operation twice. You really don't want this!
One easy way to avoid this is to use cache() on your fuzzy_match result which should be fairly small given your inputs and matching criteria.
def fuzzy_match_running(dfs, dfc, path_summary):
    """
    Implements the Levenshtein and Soundex algorithms and returns a fuzzy matched DataFrame.
    Filters out those where resulting LS distance is less than 15% of SF name length.
    """
    # Apply Levenshtein and Soundex functions
    dfs = dfs.withColumn("OrganisationNameKeyLen", F.length("OrganisationNameKey")).cache()
    dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism).cache()
    df = dfc.crossJoin(dfs) \
        .withColumn("LevenshteinDistance", F.levenshtein(F.lower("OrganisationNameKey"), F.lower("CompanyNameKey"))) \
        .withColumn("HasSameSoundex", F.soundex("OrganisationNameKey") == F.soundex("CompanyNameKey")) \
        .where("LevenshteinDistance < OrganisationNameKeyLen * 0.15") \
        .orderBy("OrganisationName", "CompanyName") \
        .cache()
    return df
If I run my fuzzy_match_running on some example data frames on my 8-core/16-thread i9-9980HK laptop (Spark in local[*] mode with 8 GB driver memory):
dfc rowcount : 572494
dfs rowcount : 17728
fuzzy_match rowcount: 7228499
Duration: 679.5572581291199 seconds
Matches/core/sec: 933436.210726889
The job takes about 12 min to do 572494 * 17728 ~ 10 billion row comparisons, at roughly 933k comparisons/second/core. Since your job does about 64 billion row comparisons, I would expect it to take about 80 min on my laptop.
You should run a similar experiment on your computer with a smaller sample to get an idea of your actual computing speed.
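For instance, a hypothetical sketch of such a scaled-down run, reusing fuzzy_match_running from above (the sample fractions and seed are arbitrary):
# Benchmark on a fraction of each input first to estimate your throughput.
dfs_sample = dfs.sample(fraction=0.1, seed=42)
dfc_sample = dfc.sample(fraction=0.01, seed=42)
df_sample_match = fuzzy_match_running(dfs_sample, dfc_sample, path_summary=None)
df_sample_match.count()  # force evaluation so the run can be timed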
Going further: maximizing matches/sec
To go faster, we need to adjust the computation and increase the number of comparisons that can be done per second.
A few things stand out in the function:
you filter your output by comparing the levenshtein distance, an integer, to a decimal calculation. This means Spark will cast your integer to a decimal and operate on decimals. Comparing decimals is much slower than comparing integers, and it's unnecessary here; you can cast the bound to an int beforehand.
your levenshtein operates on lowercased versions of your keys; this means that, for each row comparison, Spark will lowercase the column values again and again, wasting CPU cycles on redundant work. You can preprocess this before your join.
I updated the function like this:
def fuzzy_match(dfs: DataFrame, dfc: DataFrame, path_summary: str) -> DataFrame:
    dfs = dfs.withColumn("OrganisationNameKeyLower", F.lower("OrganisationNameKey"))\
        .withColumn("MatchingTolerance", F.ceil(F.length("OrganisationNameKey") * 0.15).cast("int"))\
        .cache()
    dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism)\
        .withColumn("CompanyNameKeyLower", F.lower("CompanyNameKey"))\
        .cache()
    df = dfc.crossJoin(dfs)\
        .withColumn("LevenshteinDistance", F.levenshtein(F.col("OrganisationNameKeyLower"), F.col("CompanyNameKeyLower")).cast("int")) \
        .where("LevenshteinDistance < MatchingTolerance")\
        .drop("MatchingTolerance")\
        .cache()
    # clean unnecessary caches before returning
    dfs.unpersist()
    dfc.unpersist()
    return df
When running the updated version on the same inputs as before, on the same computer, I get nearly twice the performance of the first implementation:
dfc rowcount : 572494
dfs rowcount : 17728
fuzzy_match rowcount: 7228499
Duration: 356.23311281204224 seconds
Matches/core/sec: 1780641.1846241967
If that is still too slow for your needs, you'll need to find conditions on your data that you can use as a join condition, but that's highly data- and use-case-specific.
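As an illustration only, here is a hedged sketch of one such condition: blocking on the first character of the lowercased keys so that only candidates sharing that character are compared (column names are taken from the question; this assumes matching names always share a first character, so it is not equivalent to the full cross join):
dfs_b = dfs.withColumn("BlockKey", F.substring(F.lower("OrganisationNameKey"), 1, 1))
dfc_b = dfc.withColumn("BlockKey", F.substring(F.lower("CompanyNameKey"), 1, 1))
# Equi-joining on the blocking key drastically reduces the number of pairs to score.
candidates = dfc_b.join(dfs_b, on="BlockKey", how="inner")\
    .withColumn("LevenshteinDistance", F.levenshtein(F.lower("OrganisationNameKey"), F.lower("CompanyNameKey")))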

Network bound transformation and threading

I am trying to use a REST API to enrich data I have in a spark dataframe. The REST API isn't built by me and requires a single input at a time (no batch option). Unfortunately the REST API latency is slower than I would like so my spark applications seem to spend a lot of time waiting for the API to iterate over each row. Although my REST API has higher latency, it does have very high throughput/capacity which does not seem to get fully used by my spark application.
Since my application appears to be network bound, I was wondering if it would make sense to use threading to help improve the speed of my application. Is Spark already capable of doing this internally? If using threads does make sense, is there an easy way to accomplish this? Has anybody successfully done this?
I’ve encountered the same problem when fetching data from a blob storage.
Below is a small self-contained dummy example that I think you can easily modify for your needs.
In the example, you should be able to see that it takes a lot longer to construct df_slow than to construct df_fast.
It works by making each worker process a list of rows in parallel, instead of processing one row at a time sequentially.
You might be able to just swap the slowAdd function with your own Row transforming function. The slowAdd function simulates network latency by sleeping 0.1 seconds.
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import Row
# Just some dataframe with numbers
data = [(i,) for i in range(0, 1000)]
df = spark.createDataFrame(data, ["Data"], T.IntegerType())
# Get an rdd that contains 'list of Rows' instead of 'Row'
standardRdd = df.rdd # contains [row1, row2, row3,...]
number_of_partitions = 10
repartionedRdd = standardRdd.repartition(number_of_partitions) # contains [row1, row2, row3,...] but repartitioned to increase parallelism
glomRdd = repartionedRdd.glom() # contains roughly [[row1, row2, row3,..., row100], [row101, row102, row103, ...], ...]
# where the number of sublists corresponds to the number of partitions
# Define a transformation function with an artificial delay.
# Substitute this with your own transformation function.
import time
def slowAdd(r):
    d = r.asDict()
    d["Data"] = d["Data"] + 100
    time.sleep(0.1)
    return Row(**d)
# Define a function that maps the slowAdd function from 'list of Rows' to 'list of Rows' in parallel
import concurrent.futures
def slowAdd_with_thread_pool(list_of_rows):
    thread_pool = concurrent.futures.ThreadPoolExecutor(max_workers=100)
    return [result for result in thread_pool.map(slowAdd, list_of_rows)]
# Perform a fast mapping from 'list of Rows' to 'Rows'.
transformed_fast_rdd = glomRdd.flatMap(slowAdd_with_thread_pool)
# For reference, perform a slow mapping from 'Rows' to 'Rows'
transformed_slow_rdd = repartionedRdd.map(slowAdd)
# Convert the rdds back to dataframes
df_fast = spark.createDataFrame(transformed_fast_rdd)
#This sum operation will be fast (~100 threads sleeping in parallel on each worker)
df_fast.agg(F.sum(F.col("Data"))).show()
df_slow = spark.createDataFrame(transformed_slow_rdd)
#This sum operation will be slow (1 thread sleeping at a time on each worker)
df_slow.agg(F.sum(F.col("Data"))).show()
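For what it's worth, the same thread-pool idea can also be expressed with mapPartitions instead of glom(); a minimal sketch reusing slowAdd and repartionedRdd from above:
# mapPartitions hands each worker an iterator over its partition's rows,
# so the explicit glom()/flatMap step can be skipped.
def slowAdd_with_thread_pool_partition(rows):
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
        return list(pool.map(slowAdd, list(rows)))
df_fast_alt = spark.createDataFrame(repartionedRdd.mapPartitions(slowAdd_with_thread_pool_partition))
df_fast_alt.agg(F.sum(F.col("Data"))).show()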

Avoid lazy evaluation of code in spark without cache

How can I avoid lazy evaluation in Spark? I have a data frame which needs to be populated at once, since I need to filter the data on the basis of a random number generated for each row of the data frame: if the random number generated is > 0.5, the row should go into dataA, and if the random number generated is < 0.5, it should go into dataB.
val randomNumberDF = df.withColumn("num", Math.random())
val dataA = randomNumberDF.filter(col("num") >= 0.5)
val dataB = randomNumberDF.filter(col("num") < 0.5)
Since Spark does lazy evaluation, when filtering there is no reliable distribution of rows between dataA and dataB (sometimes the same row is present in both dataA and dataB).
How can I avoid this re-computation of the "num" column? I have tried using "cache", which worked, but given that my data size is going to be big, I am ruling out that solution.
I have also tried using other actions on the randomNumberDF, like :
count
rdd.count
show
first
these didn't solve the problem.
Please suggest something different from cache/persist or writing the data to HDFS and reading it back as a solution.
References I have already checked :
How to force spark to avoid Dataset re-computation?
How to force Spark to only execute a transformation once?
How to force Spark to evaluate DataFrame operations inline
If all you're looking for is a way to ensure that the same values are in randomNumberDF.num, then you can generate random numbers with a seed (using org.apache.spark.sql.functions.rand()):
The below is using 112 as the seed value:
val randomNumberDF = df.withColumn("num", rand(112))
val dataA = randomNumberDF.filter(col("num") >= 0.5)
val dataB = randomNumberDF.filter(col("num") < 0.5)
That will ensure that the values in num are the same across the multiple evaluations of randomNumberDF.
Besides using org.apache.spark.sql.functions.rand with a given seed, you could use eager checkpointing:
df.checkpoint(true)
This will materialize the dataframe to disk (note that a checkpoint directory has to be set beforehand via SparkContext.setCheckpointDir).

Spark DataFrame: count distinct values of every column

The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?
The describe method provides only the count, not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
In pySpark you could do something like this, using countDistinct():
from pyspark.sql.functions import col, countDistinct
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
Similarly in Scala:
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
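For example, a minimal PySpark sketch of the approximate variant, using the approx_count_distinct function from pyspark.sql.functions (rsd is the allowed relative standard deviation of the estimate):
from pyspark.sql.functions import approx_count_distinct, col
# Approximate distinct counts for every column in a single aggregation.
df.agg(*(approx_count_distinct(col(c), rsd=0.05).alias(c) for c in df.columns)).show()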
Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:
val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// | 2| 2| 3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) rely on the following clever observation.
If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.
For example, if we observe a number whose digits in binary form are of the form 0…(k times)…01…1, then we can estimate that there are in the order of 2^k elements in the set. This is a very crude estimate but it can be refined to great precision with a sketching algorithm.
A thorough explanation of the mechanics behind this algorithm can be found in the original paper.
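To make the observation concrete, here is a toy Python sketch of the leading-zeros estimator (illustration only; this is deliberately crude and is not how Spark implements HyperLogLog++, which averages many such estimates over registers and applies bias correction):
import hashlib
def crude_distinct_estimate(values, bits=32):
    max_leading_zeros = 0
    for v in values:
        # Hash each value so the results are spread roughly uniformly over [0, 2^bits).
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) % (1 << bits)
        leading_zeros = bits - h.bit_length()  # zeros before the first 1 bit
        max_leading_zeros = max(max_leading_zeros, leading_zeros)
    return 2 ** max_leading_zeros  # order-of-magnitude estimate only
print(crude_distinct_estimate(range(100000)))  # very rough; expect it to be off by a factor of a few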
Note: Starting with Spark 1.6, when Spark calls SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, each clause should trigger a separate aggregation. This is different from SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df, where we aggregate once. Thus the performance won't be comparable between count(distinct _) and approxCountDistinct (or approx_count_distinct).
It's one of the behavior changes since Spark 1.6:
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5’s planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)
Reference : Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.
If you just want the count for a particular column, then the following could help. Although it's a late answer, it might help someone. (Tested on PySpark 2.2.0.)
from pyspark.sql.functions import col, countDistinct
df.agg(countDistinct(col("colName")).alias("count")).show()
Adding to desaiankitb's answer, this would provide you a more intuitive answer:
from pyspark.sql.functions import count
df.groupBy(colname).count().show()
You can use the count(column name) function of SQL
Alternatively, if you are doing data analysis and want a rough estimate rather than the exact count of each and every column, you can use the approx_count_distinct function:
approx_count_distinct(expr[, relativeSD])
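For instance, a hedged sketch of calling it through Spark SQL (the view name and column name here are hypothetical):
# Register the DataFrame as a temp view, then use the SQL form with a relative error of 5%.
df.createOrReplaceTempView("my_table")
spark.sql("SELECT approx_count_distinct(col1, 0.05) AS col1_distinct FROM my_table").show()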
This is one way to create a dataframe with the distinct count of every column:
df = df.to_pandas_on_spark()
collect_df = []
for i in df.columns:
    collect_df.append({"field_name": i, "unique_count": df[i].nunique()})
uniquedf = spark.createDataFrame(collect_df)
The output would look like below. I used this with another dataframe to compare values when the column names are the same. The other dataframe was created the same way and then joined:
df_prod_merged = uniquedf1.join(uniquedf2, on='field_name', how="left")
This is an easy way to do it. It might be expensive on very huge data (say 1 TB to process), but it is still quite effective when using to_pandas_on_spark().
