Spark Delete Rows - apache-spark

I have a DataFrame containing roughly 20k rows.
I want to delete 186 rows randomly in the dataset.
To understand the context: I am testing a classification model on missing data, and each row has a Unix timestamp. 186 rows correspond to 3 seconds (there are 62 rows of data per second).
My reasoning is that when data is streaming, it is likely that data will be missing for a number of seconds. I am extracting features from a time window, so I want to see how missing data affects model performance.
I think the best approach would be to convert to an RDD, use zipWithIndex, and put the logic inside a filter function, something like this:
dataFrame.rdd.zipWithIndex().filter(lambda x: )
But I am stuck with the logic - how do I implement this? (using PySpark)

Try doing it like this:
import random

# zipWithIndex appends the index as the second element of each tuple, so x[1]
# is the row index; drop a contiguous block of 186 rows (3 seconds of data).
startVal = random.randint(0, dataFrame.count() - 186)
dataFrame.rdd.zipWithIndex() \
    .filter(lambda x: x[1] not in range(startVal, startVal + 186))
This should work!
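For completeness, a minimal end-to-end sketch of the same idea, assuming a SparkSession named spark and that the rows are already ordered by timestamp; it removes one random 186-row block and turns the result back into a DataFrame:
import random

n_rows = dataFrame.count()
window = 186  # 3 seconds at 62 rows per second

# Choose a random starting index for the block of rows to remove.
start = random.randint(0, n_rows - window)

# zipWithIndex yields (row, index); keep rows outside the chosen block,
# strip the index, and rebuild a DataFrame with the original schema.
kept_rdd = (dataFrame.rdd
            .zipWithIndex()
            .filter(lambda x: not (start <= x[1] < start + window))
            .map(lambda x: x[0]))
result = spark.createDataFrame(kept_rdd, schema=dataFrame.schema)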

Related

In TensorFlow, how to compute the mean of each column of a batch generated from a CSV that has NaNs in multiple columns?

I am reading in a CSV in batches, and each batch has nulls in various places. I don't want to use TensorFlow Transform, as it requires loading the entire dataset into memory. Currently I cannot ignore the NaNs present in each column while computing the means if I try to do it for the entire batch at once. I could loop through each column and find the mean per column that way, but that seems like an inelegant solution.
Can somebody help me find the right way to compute the per-column mean of a CSV batch that has NaNs present in multiple columns? Also, [1, 2, np.nan] should produce 1.5, not 1.
I am currently doing this, given a rank-2 tensor a:
tf.math.divide_no_nan(
    tf.reduce_sum(tf.where(tf.math.is_finite(a), a, 0.), axis=0),
    tf.reduce_sum(tf.cast(tf.math.is_finite(a), tf.float32), axis=0))
Let me know if somebody has a better option.
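As a quick sanity check, here is a minimal, self-contained sketch of that masked-mean approach (assuming TensorFlow 2.x eager execution; the column_mean_ignoring_nan helper name is just for illustration):
import numpy as np
import tensorflow as tf

def column_mean_ignoring_nan(a):
    # Mask out non-finite entries, sum the finite values per column,
    # and divide by the per-column count of finite values.
    finite = tf.math.is_finite(a)
    sums = tf.reduce_sum(tf.where(finite, a, 0.), axis=0)
    counts = tf.reduce_sum(tf.cast(finite, tf.float32), axis=0)
    return tf.math.divide_no_nan(sums, counts)

batch = tf.constant([[1., 4.],
                     [2., np.nan],
                     [np.nan, 6.]])
print(column_mean_ignoring_nan(batch).numpy())  # [1.5 5. ]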

Why is row count different when using spark.table().count() and df.count()?

I am trying to use Spark to read data stored in a very large table (181,843,820 rows and 50 columns), which is my training set. However, when I use spark.table() I noticed that the row count is different from the row count returned by the DataFrame's count(). I am currently using PyCharm.
I want to preprocess the data in the table before I can use it further as a training set for a model I need to train.
When loading the table, I found that the DataFrame I'm loading it into is much smaller (about 10% of the data in this case).
What I have tried:
raising the spark.kryoserializer.buffer.max capacity;
loading a smaller table (70k rows) into a DataFrame, where I found no difference between the count() outputs.
This sample is very similar to the code I ran to investigate the problem:
df = spark.table('myTable')
print(spark.table('myTable').count()) # output: 181,843,820
print(df.count()) # output: 18,261,961
I expect both outputs to be the same (the original ~181M rows), yet they are not, and I don't understand why.

Pyspark dataframe.limit is slow

I am working with a large dataset, but I just want to play around with a small part of it. Each operation takes a long time, and I only want to look at the head or limit of the DataFrame.
So, for example, I call a UDF (user defined function) to add a column, but I only care to do so on the first, say, 10 rows.
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
sum_cols = F.udf(lambda x: x[0] + x[1], IntegerType())
df_with_sum = df.limit(10).withColumn('C', sum_cols(F.array('A', 'B')))
However, this still takes the same long time it would take if I did not use limit.
If you only want to work with 10 rows first, I think it is better to create a new DataFrame and cache it:
df2 = df.limit(10).cache()
df_with_sum = df2.withColumn('C',sum_cols(F.array('A','B')))
limit will first try to get the required data from a single partition. If it does not find all of the data in that partition, it will fetch the remaining data from the next partition.
So please check how many partitions you have by using df.rdd.getNumPartitions().
To verify this, I would suggest first coalescing your DataFrame to one partition and then doing a limit. You should see that the limit is faster this time, as it is reading data from a single partition.
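A minimal sketch of that suggestion, assuming the df, sum_cols, and imports from the question:
# Check how the data is split across partitions.
print(df.rdd.getNumPartitions())

# Take the small slice once, cache it, and only then apply the UDF, so later
# actions reuse the cached 10 rows instead of rescanning the full DataFrame.
df_small = df.coalesce(1).limit(10).cache()
df_with_sum = df_small.withColumn('C', sum_cols(F.array('A', 'B')))
df_with_sum.show()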

More efficient way to Iterate & compute over columns [duplicate]

I have a very wide DataFrame (more than 10,000 columns) and I need to compute the percentage of nulls in each column. Right now I am doing:
threshold = 0.9
for c in df_a.columns[:]:
    if df_a[df_a[c].isNull()].count() >= (df_a.count() * threshold):
        # print(c)
        df_a = df_a.drop(c)
Of course this is a slow process and crashes on occasion. Is there a more efficient method I am missing?
Thanks!
There are a few strategies you can take, depending on the size of the DataFrame. The code looks good to me: you need to go through each column and count the number of null values.
One strategy is to cache the input DataFrame, which will enable faster filtering. This, however, only works if the DataFrame is not huge.
Also, I am a little skeptical about
df_a = df_a.drop(c)
as this changes the DataFrame inside the loop. It is better to collect the names of the mostly-null columns and drop them from the DataFrame later in a single step.
If the DataFrame is huge and you can't cache it completely, you can partition it into manageable groups of columns: for example, take 100 columns at a time, cache that smaller DataFrame, and run the analysis in a loop, as sketched below.
In that case you might want to keep the list of already analyzed columns separate from the columns yet to be analyzed, so that even if the job fails you can resume the analysis from the remaining columns.
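A minimal sketch of that chunked approach (the chunk size of 100 and the to_drop list are illustrative, not from the original answer):
threshold = 0.9
chunk_size = 100
total = df_a.count()
to_drop = []

for i in range(0, len(df_a.columns), chunk_size):
    chunk = df_a.columns[i:i + chunk_size]
    # Cache only the current slice of columns so the repeated counts are cheap.
    df_chunk = df_a.select(chunk).cache()
    for c in chunk:
        if df_chunk.filter(df_chunk[c].isNull()).count() >= total * threshold:
            to_drop.append(c)
    df_chunk.unpersist()

# Drop all the mostly-null columns in one go, outside the loop.
df_a = df_a.drop(*to_drop)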
You should avoid iterating over columns like this in PySpark, since every count() call launches a separate job instead of computing everything in one distributed pass.
Using count on a column computes the number of non-null elements, so you can get the counts for all columns in a single aggregation:
import pyspark.sql.functions as psf

threshold = 0.9
count_df = df_a \
    .agg(*([psf.count("*").alias("count")] + [psf.count(c).alias(c) for c in df_a.columns])) \
    .toPandas().transpose()
The first element is the number of rows in the DataFrame:
total_count = count_df.iloc[0, 0]
kept_cols = count_df[count_df[0] > (1 - threshold) * total_count].iloc[1:, :]
df_a = df_a.select(list(kept_cols.index))

Spark randomly drop rows

I'm testing a classifier on missing data and want to randomly delete rows in Spark.
I want to do something like this: for every nth row, delete 20 rows.
What would be the best way to do this?
If it is random, you can use sample; this method lets you take a fraction of a DataFrame. However, if your idea is to split your data into training and validation sets, you can use randomSplit.
Another, less elegant option is to convert your DataFrame into an RDD, use zipWithIndex, and filter by index, maybe something like:
df.rdd.zipWithIndex().filter(lambda x: x[-1] % 20 != 0)
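For illustration, a brief sketch of the sample and randomSplit options mentioned above (the fraction and seed values are arbitrary):
# Keep roughly 99% of the rows, i.e. randomly drop about 1% of them.
df_dropped = df.sample(withReplacement=False, fraction=0.99, seed=42)

# Or split into training and validation sets (80/20 here).
train_df, valid_df = df.randomSplit([0.8, 0.2], seed=42)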
