Spark randomly drop rows - apache-spark

I'm testing a classifier on missing data and want to randomly delete rows in Spark.
I want to do something like for every nth row, delete 20 rows.
What would be the best way to do this?

If it is random you can use sample this method lets you take a fraction of a DataFrame. However, if your idea is to split your data into training and validation you can use randomSplit.
Another option which is less elegant is to convert your DataFrame into an RDD and use zipWithIndex and filter by index, maybe something like:
df.rdd.zipWithIndex().filter(lambda x: x[-1] % 20 != 0)

Related

In tensorflow, How to compute mean for each columns of a batch generated from a csv that has NaNs in multiple columns?

I am reading in a csv in batches and each batch has nulls in various place. I dont want to use tensorflow transform as it requires loading the entire data in memory. Currently i cannot ignore the NaNs present in each column while computing means if i am to try to do it for the entire batch at once. I can loop through each column and then find the mean per columns that way but that seems to be an inelegant solution.
Can somebody help in finding the right way to compute the mean per column of a csv batch that has NaNs present in multiple columns. Also, [1,2,np.nan] should produce 1.5 not 1.
I am currently doing this: given tensor a of rank 2 tf.math.divide_no_nan(tf.reduce_sum(tf.where(tf.math.is_finite(a),a,0.),axis=0),tf.reduce_sum(tf.cast(tf.math.is_finite(a),tf.float32),axis=0))
Let me know somebody has a better option

Find row number is a large pandas dataframe is very slow using np.where

i am trying to find the row number of a dataframe having 55k rows. I am having multiple index and hence i tried to search using np.where. But this is very very slow. Is there a better solution to achieve in less than a millisecond. I tried loc, iloc but seems of not better options.
lprowcount = np.where(dfreglist['lineports'] == lineport)[0]
Also any best method to locate and update a row in a large dataframe other loc, iloc methods.

More efficient way to Iterate & compute over columns [duplicate]

This question already has answers here:
Spark columnar performance
(2 answers)
Closed 5 years ago.
I have a very wide dataframe > 10,000 columns and I need to compute the percent of nulls in each. Right now I am doing:
threshold=0.9
for c in df_a.columns[:]:
if df_a[df_a[c].isNull()].count() >= (df_a.count()*threshold):
# print(c)
df_a=df_a.drop(c)
Of course this is a slow process and crashes on occasion. Is there a more efficient method I am missing?
Thanks!
There are few strategies you can take depending upon the size of the dataframe. The code looks good to me. You need to go through each column and count the number of null values.
One strategy is to cache the input dataframe. That will enable faster filtering. This however works if the dataframe is not huge
Also
df_a=df_a.drop(c)
I am little skeptical with this as this is changing the dataframe in the loop. Better to keep the null column names and drop from the dataframe later in a separate loop.
If the dataframe is huge and you can't cache it completely you can partition the dataframe in to some finite manageable columns. Like take 100 column each and cache that smaller dataframe and run the analysis 100 times in a loop.
Now you might want to keep track of the analyzed column list separate from the yet to be analyzed columns in this case. That way even if the job fails you can start the analysis from the rest of the columns.
You should avoid iterating when using pyspark, since it does not distribute the computations anymore.
Using count on a column will compute the count of non-null elements.
threshold = 0.9
import pyspark.sql.functions as psf
count_df = df_a\
.agg(*[psf.count("*").alias("count")]+ [psf.count(c).alias(c) for c in df_a.columns])\
.toPandas().transpose()
The first element is the number of lines in the dataframe:
total_count = count_df.iloc[0, 0]
kept_cols = count_df[count_df[0] > (1 - threshold)*total_count].iloc[1:,:]
df_a.select(list(kept_cols.index))

A more efficient way of getting the nlargest values of a Pyspark Dataframe

I am trying to get the top 5 values of a column of my dataframe.
A sample of the dataframe is given below. In fact the original dataframe has thousands of rows.
Row(item_id=u'2712821', similarity=5.0)
Row(item_id=u'1728166', similarity=6.0)
Row(item_id=u'1054467', similarity=9.0)
Row(item_id=u'2788825', similarity=5.0)
Row(item_id=u'1128169', similarity=1.0)
Row(item_id=u'1053461', similarity=3.0)
The solution I came up with is to sort all of the dataframe and then to take the first 5 values. (the code below does that)
items_of_common_users.sort(items_of_common_users.similarity.desc()).take(5)
I am wondering if there is a faster way of achieving this.
Thanks
You can use RDD.top method with key:
from operator import attrgetter
df.rdd.top(5, attrgetter("similarity"))
There is a significant overhead of DataFrame to RDD conversion but it should be worth it.

Spark Delete Rows

I have a DataFrame containing roughly 20k rows.
I want to delete 186 rows randomly in the dataset.
To understand the context - I am testing a classification model on missing data, and each row has a unix timestamp. 186 rows corresponds to 3 seconds (there are 62 rows of data per second.)
My aim for this is, when data is streaming, it is likely that data will
be missing for a number of seconds. I am extracting features from a time window, so I want to see how missing data effects model performance.
I think the best approach to this would be to convert to an rdd and use the filter function, something like this, and put the logic inside the filter function.
dataFrame.rdd.zipWithIndex().filter(lambda x: )
But I am stuck with the logic - how do I implement this? (using PySpark)
Try to do like this:
import random
startVal = random.randint(0,dataFrame.count() - 62)
dataFrame.rdd.zipWithIndex()\
.filter(lambda x: not x[<<index>>] in range(startVal, startVal+62))
This should work!

Resources