How to compare 2 columns in a pyspark dataframe using assert functions - apache-spark

I am using the code below to compare two columns in a data frame. I don't want to do it in pandas. Can someone explain how to compare them using Spark data frames?
df1 = context.spark.read.option("header", True).csv("./test/input/test/Book1.csv")
df1 = df1.withColumn("Curated", dataclean.clean_email(col("email")))
df1.show()
assert_array_almost_equal(df1['expected'], df1['Curated'], verbose=True)

You can do it either through:
the pyspark-test library, which is inspired by the pandas testing module and built for Spark, as in this documentation, or
exceptAll, as in this documentation. After calling it, check whether the resulting count is greater than zero; if it is, the two tables are not the same (a short sketch follows below).
Good luck!
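For illustration, here is a minimal sketch of the exceptAll check, reusing the question's df1 and its 'expected' and 'Curated' columns; the error message is just an example:
# Note: exceptAll compares the two columns' values as multisets, not row by row
diff = df1.select("expected").exceptAll(df1.select("Curated"))
assert diff.count() == 0, "Curated values do not match the expected column"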

One efficient way is to identify the first difference as soon as possible and stop there. A left-anti join works when comparing two DataFrames; for two columns of the same DataFrame you can simply filter for mismatching rows and assert that none exist:
assert df1.filter(df1['expected'] != df1['Curated']).first() is None

Related

sort values in pyspark > Does it give good and reliable results?

I am new to PySpark.
I need to analyze big CSV files.
At the beginning of the analysis I have to sort the data by ID and TIME.
I tried to use Dask for this, but I found that it gives wrong answers and often gets stuck in the middle. So Dask is not good at sorting values, as mentioned in the link below, apparently because it works in a parallel way.
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.sort_values.html
My question is: how does PySpark handle this issue?
Does it give good and reliable results?
If the answer is yes, I would like to know how Spark sorts data in a parallel way and why that is difficult for Dask.
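For reference, the sort itself is expressed like this in PySpark; this is a minimal sketch that only assumes a header CSV with ID and TIME columns (the file name is hypothetical), not a claim about correctness or performance:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", True).csv("rides.csv")   # hypothetical input file
sorted_df = df.orderBy("ID", "TIME")                       # distributed, range-partitioned sort
sorted_df.show(5)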

Pyspark UDF experience

Hi there,
I am very new to PySpark and I am learning about UDFs on my own. I realize a UDF can sometimes slow down your code, and I want to hear about your experience. What UDFs have you written that could not be achieved with built-in PySpark code alone? Are there any useful UDFs that helped you clean data? Apart from the PySpark documentation, is there any other source that can help me learn to write UDFs?
You can find most of the functionality you need within Spark's standard library functions.
import pyspark.sql.functions - check the docs here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions
Sometimes you do have to create a custom UDF, but be aware that it slows things down, since Spark has to evaluate it for every DataFrame row.
Try to avoid this as much as you can.
When you have no other option, use one, but try to minimize its complexity and the external libraries it uses (a small comparison sketch follows below).
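To make the difference concrete, here is a hypothetical sketch contrasting a built-in-functions version with a Python UDF version of the same cleaning step; the toy data and column names are assumptions:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(" Alice ",), ("BOB",)], ["name"])   # toy data

# Preferred: built-in functions, fully optimised by Catalyst
cleaned = df.withColumn("name_clean", F.lower(F.trim(F.col("name"))))

# Same result with a Python UDF: it works, but every row crosses the
# JVM/Python boundary, which is usually noticeably slower
@F.udf(returnType=StringType())
def clean_name(s):
    return s.strip().lower() if s is not None else None

cleaned_udf = df.withColumn("name_clean", clean_name("name"))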
Another approach is to use the RDD API: convert your DataFrame to an RDD (MYDF.rdd)
and then call mapPartitions or map, which accept a function that manipulates your data.
mapPartitions hands you one chunk at a time, as an iterator of Spark Row objects.
Read more about mapPartitions vs map here: https://sparkbyexamples.com/spark/spark-map-vs-mappartitions-transformation/
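As a rough, hypothetical sketch of that pattern (reusing the df from the snippet above):
def per_partition(rows):
    # 'rows' is an iterator over the pyspark.sql.Row objects of one partition
    for row in rows:
        yield (row["name"], len(row["name"]))

pairs = df.rdd.mapPartitions(per_partition)
print(pairs.take(5))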

Easiest way to count distinct number of rows in Pandas dataframe?

I just did:
len(my_df.drop_duplicates())
Is there not a more elegant way to do this?
In R you can do:
nrow(distinct(my_df))
which to me is very readable. drop_duplicates() feels worrying, because as a new Python user I get lost over which operations happen in place and which ones require storing/overwriting a copy for the change to persist.
The fact that searching on Google didn't give me a clear one-click answer for what I thought was a simple function worried me a bit...
Thanks!
In pandas you can also do it another way, with groupby or with duplicated plus sum:
df.groupby(list(df)).ngroups
(~df.duplicated()).sum()
Also, as an R and Python user, I know it is hard to switch from R to pandas, but the most common way is drop_duplicates.
len(pd.unique(my_df))
You are looking for unique, I guess. Note, though, that pd.unique expects a Series or 1-d array, so the line above works on a single column rather than on a whole DataFrame.

Is spark able to read only column values satisfying some condition from parquet file?

I have this code:
val count = spark.read.parquet("data.parquet").select("foo").where("foo > 3").count
I'm interested in whether Spark is able to push the filter down somehow and read from the Parquet file only the values satisfying the where condition. Can we avoid a full scan in this case?
The short answer is yes in this case, but not in all cases.
You can try .explain and see for yourself (a sketch follows below).
This is an excellent reference document, freely available on the Internet, that I have learnt a few things from in the past: https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example
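A minimal PySpark sketch of that check (the question's snippet is Scala, but the plan output is the same idea); the file path comes from the question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

filtered = spark.read.parquet("data.parquet").select("foo").where(col("foo") > 3)

# Look for a PushedFilters entry such as [IsNotNull(foo), GreaterThan(foo,3)]
# in the physical plan: it means the predicate is handed to the Parquet reader,
# which can skip row groups using column statistics instead of scanning everything.
filtered.explain()
print(filtered.count())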

Apply a custom function to a spark dataframe group

I have a very big table of time series data that has these columns:
Timestamp
LicensePlate
UberRide#
Speed
Each LicensePlate/UberRide collection of data should be processed considering the whole set of its rows. In other words, I do not need to process the data row by row, but all rows grouped by (LicensePlate, UberRide) together.
I am planning to use Spark with the DataFrame API, but I am confused about how to perform a custom calculation over a Spark grouped DataFrame.
What I need to do is:
Get all the data
Group by some columns
For each Spark DataFrame group, apply a function f(x) and return a custom object per group
Get the results by applying g(x), returning a single custom object
How can I do steps 3 and 4? Any hints on which Spark API (DataFrame, Dataset, RDD, maybe pandas...) I should use?
What you are looking for has existed since Spark 2.3: pandas vectorized UDFs. They allow you to group a DataFrame and apply custom transformations with pandas, distributed over each group:
df.groupBy("groupColumn").apply(myCustomPandasTransformation)
It is very easy to use, so I will just put a link to Databricks' presentation of pandas UDFs; a minimal sketch follows below.
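Here is a minimal sketch, assuming the question's df with columns named LicensePlate, UberRide and Speed, and Spark 3.x, where applyInPandas is the current spelling of this grouped transformation:
import pandas as pd

def mean_speed(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows of one (LicensePlate, UberRide) group as a pandas DataFrame
    return pd.DataFrame({
        "LicensePlate": [pdf["LicensePlate"].iloc[0]],
        "UberRide": [pdf["UberRide"].iloc[0]],
        "mean_speed": [pdf["Speed"].mean()],
    })

result = (
    df.groupBy("LicensePlate", "UberRide")
      .applyInPandas(mean_speed, schema="LicensePlate string, UberRide int, mean_speed double")
)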
However, I don't know of such a practical way to do grouped transformations in Scala yet, so any additional advice is welcome.
EDIT: in Scala, you can achieve the same thing since earlier versions of Spark, using Dataset's groupByKey + mapGroups/flatMapGroups.
While Spark provides some ways to integrate with Pandas, it doesn't make Pandas distributed. So whatever you do with Pandas in Spark is simply a local operation (either on the driver, or on an executor when used inside a transformation).
If you're looking for a distributed system with a Pandas-like API, you should take a look at dask.
You can define User Defined Aggregate Functions or Aggregators to process grouped Datasets, but this part of the API is directly accessible only in Scala. It is not that hard to write a Python wrapper once you have created one.
The RDD API provides a number of functions which can be used to perform operations in groups, starting with the low-level repartition / repartitionAndSortWithinPartitions and ending with a number of *byKey methods (combineByKey, groupByKey, reduceByKey, etc.).
Which one is applicable in your case depends on the properties of the function you want to apply (is it associative and commutative, can it work on streams, does it expect a specific order?).
The most general but inefficient approach can be summarized as follows:
h(rdd.keyBy(f).groupByKey().mapValues(g).collect())
where f maps from value to key, g corresponds to per-group aggregation and h is a final merge. Most of the time you can do much better than that, so it should be used only as a last resort (a concrete sketch follows below).
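A hypothetical sketch of that pattern, using toy (LicensePlate, UberRide, Speed) tuples and mean speed as the per-group aggregation:
rdd = spark.sparkContext.parallelize(
    [("ABC123", 1, 55.0), ("ABC123", 1, 60.0), ("XYZ789", 2, 40.0)]
)

def f(rec):                      # value -> key
    return (rec[0], rec[1])      # group by (LicensePlate, UberRide)

def g(records):                  # per-group aggregation: mean speed
    speeds = [r[2] for r in records]
    return sum(speeds) / len(speeds)

def h(pairs):                    # final merge on the driver
    return dict(pairs)

result = h(rdd.keyBy(f).groupByKey().mapValues(g).collect())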
Relatively complex logic can be expressed using DataFrames / Spark SQL and window functions.
See also Applying UDFs on GroupedData in PySpark (with functioning python example)
