Calculating Kernel Density of every column in a Spark DataFrame - apache-spark

Is there a way to calculate the KDE of every column of a DataFrame?
I have a DataFrame where each column holds the values of one feature. The KDE function of Spark MLlib needs an RDD[Double] of the sample values. The problem is that I need to do this without collecting the values of each column, because that would slow the program down too much.
Does anyone have an idea how I could solve this? Sadly, all my attempts have failed so far.

You could probably create a new RDD per column using the sample function (refer here) and then run the estimation on that sample to get reasonable performance.
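A minimal sketch of that idea, assuming a DataFrame df of numeric feature columns and the pyspark.mllib.stat.KernelDensity estimator; the evaluation points, sample fraction, and bandwidth below are placeholders:

from pyspark.mllib.stat import KernelDensity

eval_points = [0.0, 1.0, 2.0]  # placeholder points at which to evaluate each density

densities = {}
for name in df.columns:
    # sample the column on the executors instead of collecting it to the driver
    sample_rdd = (df.select(name)
                    .sample(withReplacement=False, fraction=0.1)
                    .rdd
                    .map(lambda row: float(row[0])))
    kd = KernelDensity()
    kd.setSample(sample_rdd)
    kd.setBandwidth(1.0)
    densities[name] = kd.estimate(eval_points)

Each iteration still launches its own Spark job, but only the estimated densities (one array per column) come back to the driver.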

Related

In tensorflow, How to compute mean for each columns of a batch generated from a csv that has NaNs in multiple columns?

I am reading in a CSV in batches, and each batch has NaNs in various places. I don't want to use TensorFlow Transform, as it requires loading the entire dataset into memory. Currently I cannot ignore the NaNs in each column while computing the means if I process the whole batch at once. I can loop through each column and compute the per-column mean that way, but that seems like an inelegant solution.
Can somebody help me find the right way to compute the per-column mean of a CSV batch that has NaNs in multiple columns? Also, [1, 2, np.nan] should produce 1.5, not 1.
I am currently doing this, given a tensor a of rank 2:
tf.math.divide_no_nan(
    tf.reduce_sum(tf.where(tf.math.is_finite(a), a, 0.), axis=0),
    tf.reduce_sum(tf.cast(tf.math.is_finite(a), tf.float32), axis=0))
Let me know if somebody has a better option.
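A self-contained sketch of that masked-mean approach (the example tensor is made up to mirror the [1, 2, np.nan] case):

import numpy as np
import tensorflow as tf

a = tf.constant([[1., 4.],
                 [2., np.nan],
                 [np.nan, 6.]], dtype=tf.float32)

finite = tf.math.is_finite(a)                              # mask of non-NaN entries
col_sums = tf.reduce_sum(tf.where(finite, a, 0.), axis=0)  # NaNs contribute 0 to the sums
col_counts = tf.reduce_sum(tf.cast(finite, tf.float32), axis=0)
col_means = tf.math.divide_no_nan(col_sums, col_counts)    # -> [1.5, 5.0]
# a column that is entirely NaN comes out as 0.0 rather than NaN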

What is the best way to create a new Spark dataframe column based on an existing column that requires an external API call?

I have a dataframe that I am working with in a Python-based Jupyter notebook. I want to add an additional column based on the content of an existing column, where the new column's content is derived from running an external API call on the original column.
The solution I attempted was to use a Python based UDF. The first cell contains something like this:
from pyspark.sql.functions import udf

def analysis(old_column):
    new_column = myapi.analyze(text=old_column)
    return new_column

analysis_udf = udf(analysis)
and the second cell this:
df2 = df1.withColumn("col2", analysis_udf('col1'))
df2.select('col2').show(n=5)
My dataframe is relatively large, with some 70,000 rows, and col1 can contain 100 to 10,000+ characters of text. When I ran the code in cell 2, it seemed to run fairly quickly (minutes) and dumped out the 5 rows of the df2 dataframe, so I thought I was in business. However, my next cell had the following code:
df2.cache()
df2.filter(col('col2').isNull()).count()
The intent of this code is to cache the contents of the new dataframe to improve access time, and then to count how many entries in the dataframe have null values generated by the UDF. This surprisingly (to me) took many hours to run and eventually produced an output of 6.
It's not clear to me why the second cell ran quickly while the third was slow. I would have thought that the df2.select('col2').show(n=5) call would have caused the UDF to run on all of the rows, so that call would have been the slow one and subsequent accesses to the new column would be quick. Since that wasn't the case, I supposed that the cache call was what actually caused the UDF to run on all of the rows, so any subsequent calls should now be quick. So I added another cell with:
df2.show(n=5)
I assumed it would run quickly, but again it took much longer than I expected, and it seems like the UDF was perhaps running again. (?)
My questions are:
Which Spark API calls actually cause the UDF to run (or re-run), and how should I structure the calls so that the UDF runs only once and the new column is created with the text output by the UDF's Python function?
I have read that Python UDFs should be avoided because they are slow (which seems correct), so what alternatives do I have when I need an API call to generate the new column?
I would have thought that the df2.select('col2').show(n=5) call would have caused the UDF to run on all of the rows
That is not a correct assumption. Spark evaluates as little data as possible, given the limitations of the API. Because you use a Python udf, it evaluates the minimum number of partitions required to collect 5 rows.
Which Spark API calls actually cause the UDF to run (or re-run), and how should I structure the calls so that the UDF runs only once and the new column is created with the text output by the UDF's Python function?
Any evaluation, if the data is no longer cached (evicted from memory).
Possibly any usage of the resulting column, unless the udf is marked as non-deterministic.
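A minimal sketch of one way to make the UDF run only once, reusing df1 and analysis_udf from the question and persisting to memory and disk so that eviction does not trigger re-computation:

from pyspark import StorageLevel
from pyspark.sql.functions import col

df2 = df1.withColumn("col2", analysis_udf("col1"))
df2.persist(StorageLevel.MEMORY_AND_DISK)
df2.count()                               # full evaluation happens here, exactly once
df2.filter(col("col2").isNull()).count()  # served from the persisted data
df2.show(n=5)                             # also served from the persisted data

Writing df2 out to storage (e.g. Parquet) and reading it back is another common way to guarantee a single evaluation.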
I have read that Python UDFs should be avoided because they are slow (which seems correct), so what alternatives do I have when I need an API call to generate the new column?
Unless you want to switch to Scala or the RDD API, the only alternative is pandas_udf, which is somewhat more efficient but supports only a limited subset of types.
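For example, a Spark 3 style pandas_udf sketch, reusing the hypothetical myapi client from the question; each value still triggers one API call, but the values arrive in batches as pandas Series, which cuts the per-row serialization overhead:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def analysis_pandas_udf(texts: pd.Series) -> pd.Series:
    # myapi.analyze is the external API call from the question
    return texts.apply(lambda t: myapi.analyze(text=t))

df2 = df1.withColumn("col2", analysis_pandas_udf("col1"))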

How do I create index on pyspark df?

I have a bunch of Hive tables.
I want to:
Pull the tables into a pyspark DF.
Run a UDF on them.
Join 4 tables based on customer id.
Is there a concept of indexing in spark to speed up the operation?
If so, what's the command?
How do I create an index on a dataframe?
I understand your problem, but the thing is that you acquire the data at the same time you process it. Therefore, calculating an index before joining is useless, as it would take more time to first create the index.
If you have several write operations, you may want to cache your data to speed things up, but otherwise an index is not the solution to investigate.
There is one more thing you can try: df.repartition.
This will partition your df according to one column, but I have no idea whether it will help.
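A minimal sketch of repartitioning by the join key before the join (the table and column names are made up):

# hypothetical Hive table and column names
customers = spark.table("customers").repartition("customer_id")
orders = spark.table("orders").repartition("customer_id")
joined = customers.join(orders, on="customer_id")

Whether this avoids any extra shuffle depends on the query plan, so it is worth checking with explain() before relying on it.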

A more efficient way of getting the nlargest values of a Pyspark Dataframe

I am trying to get the top 5 values of a column of my dataframe.
A sample of the dataframe is given below. In fact, the original dataframe has thousands of rows.
Row(item_id=u'2712821', similarity=5.0)
Row(item_id=u'1728166', similarity=6.0)
Row(item_id=u'1054467', similarity=9.0)
Row(item_id=u'2788825', similarity=5.0)
Row(item_id=u'1128169', similarity=1.0)
Row(item_id=u'1053461', similarity=3.0)
The solution I came up with is to sort the entire dataframe and then take the first 5 values (the code below does that):
items_of_common_users.sort(items_of_common_users.similarity.desc()).take(5)
I am wondering if there is a faster way of achieving this.
Thanks
You can use the RDD.top method with a key:
from operator import attrgetter
df.rdd.top(5, attrgetter("similarity"))
There is significant overhead in the DataFrame-to-RDD conversion, but it should be worth it.
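For example, applied to the items_of_common_users dataframe from the question:

from operator import attrgetter

top5 = items_of_common_users.rdd.top(5, key=attrgetter("similarity"))
# top5 is a plain Python list of the 5 Row objects with the highest similarity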

Fill missing value in Spark dataframe

I'm trying to fill missing values in a Spark dataframe using PySpark, but I can't find a proper way to do it. My task is to fill the missing values of some rows with respect to their previous or following rows. Concretely, I would change the 0.0 value of a row to the value of the previous row, while doing nothing to a non-zero row. I did see the Window functions in Spark, but they only support simple operations like max, min, and mean, which are not suitable for my case. It would be ideal if we could have a user-defined function sliding over a given Window.
Does anybody have a good idea?
Use the Spark window API to access previous-row data. If you work with time series data, see also this package for missing data imputation.
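A minimal sketch of that approach, assuming hypothetical columns ts (for ordering) and value (holding the 0.0 placeholders):

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.orderBy("ts")
df_filled = df.withColumn(
    "value",
    F.when(F.col("value") == 0.0, F.lag("value", 1).over(w))
     .otherwise(F.col("value")))
# Note: a single lag only fills isolated zeros; runs of consecutive zeros would
# need something like F.last(..., ignorenulls=True) over an unbounded window.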
