Is there a way to slice dataframe based on index in pyspark? - apache-spark

In python or R, there are ways to slice DataFrame using index.
For example, in pandas:
df.iloc[5:10,:]
Is there a similar way in pyspark to slice data based on location of rows?

Short Answer
If you already have an index column (suppose it was called 'id') you can filter using pyspark.sql.Column.between:
from pyspark.sql.functions import col
df.where(col("id").between(5, 10))
If you don't already have an index column, you can add one yourself and then use the code above. For the index to be meaningful, your data should have some inherent ordering based on one or more other columns (orderBy("someColumn")). One way to add such an index is shown below.
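Here is a minimal sketch of adding a 0-based index with row_number over a window, assuming your ordering is defined by a column named someColumn (a placeholder):
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# note: a window with no partitionBy moves all rows into a single partition
w = Window.orderBy("someColumn")
df_indexed = df.withColumn("id", row_number().over(w) - 1)  # 0-based row index

df_indexed.where(col("id").between(5, 10))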
Full Explanation
No, it is not easily possible to slice a Spark DataFrame by index, unless the index is already present as a column.
Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas). Each row is treated as an independent collection of structured data, and that is what allows for distributed parallel processing. Thus, any executor can take any chunk of the data and process it without regard for the order of the rows.
Now, obviously it is possible to perform operations that do involve ordering (lead, lag, etc.), but these will be slower because they require Spark to shuffle data between the executors. (Shuffling data is typically one of the slowest parts of a Spark job.)
Related/Further Reading
PySpark DataFrames - way to enumerate without converting to Pandas?
PySpark - get row number for each row in a group
how to add Row id in pySpark dataframes

You can convert your Spark DataFrame to a Koalas DataFrame.
Koalas is a DataFrame library by Databricks that provides an almost pandas-like interface on top of Spark DataFrames. See https://pypi.org/project/koalas/
import databricks.koalas as ks
kdf = ks.DataFrame(your_spark_df)
kdf[0:500] # your indexes here

Related

How to efficiently perform a lookup from a column in a large spark dataframe into a small (broadcastable) array

I have a smallish (a couple of thousand) list/array of pairs of doubles and a very large (> 100 million rows) Spark dataframe. In the large dataframe I have a column containing an integer which I want to use to index into the smaller list. I want to return a dataframe with all the original columns plus the related two values from the list.
I could obviously create a dataframe from the list and do an inner join, but that seems inefficient, as the optimiser doesn't know it only needs to get the single pair from the small list, or that it can index directly into the list using the integer column from the large dataframe.
What's the most efficient way of doing this? Happy for answers using any api - scala, pyspark, sql, dataframe or rdd.
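One common pattern for this kind of lookup is to broadcast the small list and index into it inside a UDF. The sketch below is not from the question; small_pairs and the column name "idx" are hypothetical:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# small_pairs is the local list of (double, double) pairs
pairs_bc = spark.sparkContext.broadcast(small_pairs)

@udf(returnType=ArrayType(DoubleType()))
def lookup(i):
    # index directly into the broadcast list on each executor
    a, b = pairs_bc.value[i]
    return [a, b]

result = df.withColumn("pair", lookup("idx"))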

In Pyspark, what happens when you groupBy the same column as you used in partitionBy?

I have a dataset that was partitioned by column ID and written to disk. This results in each partition getting its own folder in the filesystem. Now I am reading this data back in and would like to call groupBy('ID') followed by calling a pandas_udf function. My question is, since the data was partitioned by ID, is groupBy('ID') any faster than if it hadn't been partitioned? Would it be better to e.g. read one ID at a time using the folder structure? I worry the groupBy operation is looking through every record even though they've already been partitioned.
You have partitioned by ID and saved to disk
You read it again and want to groupby and apply a pandas udf
It is obvious that the groupBy will look through every record, and so will most functions. But using a pandas_udf with groupBy("ID") is going to be expensive because it will go through an unnecessary shuffle.
You can optimize performance by grouping on spark_partition_id() instead, since you have already partitioned the data by the column you want to group on. A sketch of that idea is shown below.
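A minimal sketch of that suggestion, assuming Spark 3.x; the path, the columns and the output schema are hypothetical:
from pyspark.sql.functions import spark_partition_id

df = spark.read.parquet("/data/partitioned_by_id")  # hypothetical path

def summarize(pdf):
    # plain pandas logic applied to each group's data
    return pdf.assign(row_count=len(pdf))

result = (df.groupBy(spark_partition_id())
            .applyInPandas(summarize, schema="ID long, value double, row_count long"))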
EDIT:
If you want file names, you can try:
from pyspark.sql.functions import input_file_name
df.withColumn("filename", input_file_name())

What's the difference between RDD and Dataframe in Spark? [duplicate]

This question already has answers here:
Difference between DataFrame, Dataset, and RDD in Spark
(14 answers)
Closed 3 years ago.
Hi, I am relatively new to Apache Spark. I wanted to understand the difference between RDDs, DataFrames and Datasets.
For example, I am pulling data from an S3 bucket.
df=spark.read.parquet("s3://output/unattributedunattributed*")
In this case, when I am loading data from S3, what would be the RDD? Also, since an RDD is immutable and I can change the value of df, df couldn't be an RDD.
I'd appreciate it if someone could explain the difference between RDDs, DataFrames and Datasets.
df=spark.read.parquet("s3://output/unattributedunattributed*")
With this statement, you are creating a DataFrame.
To create an RDD, use the SparkContext instead:
rdd = spark.sparkContext.textFile("s3://output/unattributedunattributed*")
RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records and the fundamental data structure of Spark. It allows a programmer to perform in-memory computations across the cluster.
In a DataFrame, data is organized into named columns, like a table in a relational database. It is an immutable distributed collection of data. A DataFrame in Spark allows developers to impose a structure onto a distributed collection of data, allowing a higher-level abstraction.
If you want to apply a map or a filter to the whole dataset, use an RDD.
If you want to work on an individual column or perform operations/calculations on a column, use a DataFrame.
For example, if you want to replace 'A' with 'B' throughout the data, then an RDD is useful:
rdd = rdd.map(lambda x: x.replace('A', 'B'))
If you want to update the data type of a column, then use a DataFrame:
dff = dff.withColumn("LastmodifiedTime_timestamp", col('LastmodifiedTime_time').cast('timestamp'))
An RDD can be converted into a DataFrame and vice versa, as sketched below.
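A minimal sketch of converting in both directions (the column names and values are illustrative):
from pyspark.sql import Row

# RDD -> DataFrame
rdd = spark.sparkContext.parallelize([Row(name="a", value=1), Row(name="b", value=2)])
df = spark.createDataFrame(rdd)

# DataFrame -> RDD of Row objects
rdd_again = df.rdd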

pyspark: isin vs join

What are general best-practices to filtering a dataframe in pyspark by a given list of values? Specifically:
Depending on the size of the given list of values, when is it best with respect to runtime to use isin vs. an inner join vs. a broadcast?
This question is the spark analogue of the following question in Pig:
Pig: efficient filtering by loaded list
Additional context:
Pyspark isin function
Considering
import pyspark.sql.functions as psf
There are two types of broadcasting:
sc.broadcast() to copy Python objects to every node for a more efficient use of psf.isin
psf.broadcast inside a join to copy your pyspark dataframe to every node when the dataframe is small: df1.join(psf.broadcast(df2)). It is usually used for cartesian products (CROSS JOIN in Pig).
In the context question, the filtering was done using the column of another dataframe, hence the possible solution with a join.
Keep in mind that if your filtering list is relatively big, the operation of searching through it will take a while, and since it has to be done for each row it can quickly get costly.
Joins, on the other hand, involve two dataframes that will be sorted before matching, so if your list is small enough you might not want to have to sort a huge dataframe just for a filter. Sketches of both approaches are shown below.
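Hedged sketches of the two approaches, where df, values and the column name "id" are placeholders:
import pyspark.sql.functions as psf

values = [1, 2, 3]

# 1) filter with isin against the (small) local list
filtered = df.where(psf.col("id").isin(values))

# 2) turn the list into a small DataFrame and broadcast-join it
values_df = spark.createDataFrame([(v,) for v in values], ["id"])
joined = df.join(psf.broadcast(values_df), on="id", how="inner")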
Both join and isin work well for all my daily use cases.
isin works well for both small and moderately large (~1M) lists of values.
Note: if you have a large dataset (say ~500 GB) and you want to filter it and then process the filtered data, then with isin the amount of data read/processed is significantly lower, and it is fast: the whole 500 GB will not be loaded, because you have already filtered down to the smaller dataset with .isin.
But in the join case, the whole 500 GB will be loaded and processed, so the processing time will be much higher.
In my case, after filtering using isin, then processing and converting to a pandas DataFrame, it took < 60 seconds; with a join, then processing and converting to a pandas DataFrame, it took > 1 hour.

A more efficient way of getting the nlargest values of a Pyspark Dataframe

I am trying to get the top 5 values of a column of my dataframe.
A sample of the dataframe is given below. In fact the original dataframe has thousands of rows.
Row(item_id=u'2712821', similarity=5.0)
Row(item_id=u'1728166', similarity=6.0)
Row(item_id=u'1054467', similarity=9.0)
Row(item_id=u'2788825', similarity=5.0)
Row(item_id=u'1128169', similarity=1.0)
Row(item_id=u'1053461', similarity=3.0)
The solution I came up with is to sort the whole dataframe and then take the first 5 values (the code below does that):
items_of_common_users.sort(items_of_common_users.similarity.desc()).take(5)
I am wondering if there is a faster way of achieving this.
Thanks
You can use the RDD.top method with a key:
from operator import attrgetter
df.rdd.top(5, attrgetter("similarity"))
There is a significant overhead of DataFrame to RDD conversion but it should be worth it.
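A self-contained usage sketch (the sample rows are illustrative):
from operator import attrgetter
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(item_id=u'2712821', similarity=5.0),
    Row(item_id=u'1728166', similarity=6.0),
    Row(item_id=u'1054467', similarity=9.0),
])

top5 = df.rdd.top(5, attrgetter("similarity"))
# top5 is a Python list of Row objects, ordered by similarity descending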
