Spark DataFrame Removing duplicates via GroupBy keep first - apache-spark

I am using the groupBy function to remove duplicates from a Spark DataFrame. For each group I simply want to take the first row, which will be the most recent one.
I don't want to perform a max() aggregation because I know the results are already stored sorted in Cassandra, and I want to avoid unnecessary computation. See this approach using pandas; it's exactly what I'm after, except in Spark.
df = sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="table", keyspace="keyspace")\
.load()\
.groupBy("key")\
#what goes here?

Just dropDuplicates should do the job.
Try df.dropDuplicates(Seq("column")).show.
Check this question for more details.
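The snippet above is the Scala form; in PySpark, dropDuplicates takes a list of column names instead of a Seq. A minimal sketch, using the "key" column from the question's code:

# keep one row per distinct value of "key"
deduped = df.dropDuplicates(["key"])
deduped.show()

Note that dropDuplicates does not guarantee which row of each group survives, so it only matches the "keep the most recent" requirement if you handle the ordering yourself.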

Related

Pyspark groupBy with custom partitioner

I want to apply some custom partitioning when working with a given DataFrame. I found that the RDD groupBy provides me with the desired functionality. Now when I say
dataframe.rdd.groupBy(lambda row: row[1:3], numPartitions, partitioner)
I end up with a PythonRDD that has a tuple as a key and a ResultIterable as the value. What I want to do next is convert this back to a DataFrame, since I want to use the apply function of GroupedData. I have attempted multiple things but have been unlucky so far.
Any help would be appreciated!
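For reference, a minimal sketch of one way the round trip might look. The example data, column names, and the default hash partitioning are placeholders, not the asker's actual setup, and the custom partitioner argument is omitted:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical example data; row[1:3] groups on the two middle columns
df = spark.createDataFrame(
    [(1, "a", "x", 10), (2, "a", "x", 20), (3, "b", "y", 30)],
    ["id", "c1", "c2", "val"],
)

# groupBy on the RDD yields (key, ResultIterable) pairs
grouped = df.rdd.groupBy(lambda row: row[1:3], numPartitions=4)

# flatten each ResultIterable back into Rows and rebuild a DataFrame with the old schema
flattened = grouped.flatMap(lambda kv: list(kv[1]))
df_again = spark.createDataFrame(flattened, schema=df.schema)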

In Pyspark, what happens when you groupBy the same column as you used in partitionBy?

I have a dataset that was partitioned by column ID and written to disk. This results in each partition getting its own folder in the filesystem. Now I am reading this data back in and would like to call groupBy('ID') followed by calling a pandas_udf function. My question is, since the data was partitioned by ID, is groupBy('ID') any faster than if it hadn't been partitioned? Would it be better to e.g. read one ID at a time using the folder structure? I worry the groupBy operation is looking through every record even though they've already been partitioned.
You have partitioned by ID and saved to disk
You read it again and want to groupby and apply a pandas udf
It is obvious that the groupby will look through every record, and so will most functions. But using a pandas_udf with groupBy("ID") is going to be expensive, because it goes through an unnecessary shuffle.
You can optimize performance by grouping on spark_partition_id() instead, since you have already partitioned the data by the column you want to group on.
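A rough sketch of what that suggestion might look like; df stands for the dataframe read back from the partitioned files, and my_pandas_fn and output_schema are placeholders for your own pandas function and its result schema:

from pyspark.sql.functions import spark_partition_id

# tag each row with the partition it already lives in, then group on that id
# (this only matches groupBy("ID") if each input partition holds exactly one ID)
result = (
    df.withColumn("pid", spark_partition_id())
      .groupBy("pid")
      .applyInPandas(my_pandas_fn, schema=output_schema)
)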
EDIT:
If you want file names, you can try:
from pyspark.sql.functions import input_file_name
df.withColumn("filename", input_file_name())

Is there a way to slice dataframe based on index in pyspark?

In python or R, there are ways to slice DataFrame using index.
For example, in pandas:
df.iloc[5:10,:]
Is there a similar way in pyspark to slice data based on location of rows?
Short Answer
If you already have an index column (suppose it was called 'id') you can filter using pyspark.sql.Column.between:
from pyspark.sql.functions import col
df.where(col("id").between(5, 10))
If you don't already have an index column, you can add one yourself and then use the code above. The index should be based on some ordering that already exists in your data (e.g. orderBy("someColumn")).
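If you need to create that index column yourself, a sketch using row_number over a window might look like this (assuming "someColumn" provides the ordering you care about):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# build a 0-based index from the existing ordering; note that an un-partitioned
# window pulls all rows into a single partition, which can be slow on large data
w = Window.orderBy("someColumn")
df_indexed = df.withColumn("id", row_number().over(w) - 1)

df_indexed.where(col("id").between(5, 10)).show()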
Full Explanation
No, it is not easily possible to slice a Spark DataFrame by index unless the index is already present as a column.
Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas). Each row is treated as an independent collection of structured data, and that is what allows for distributed parallel processing. Thus, any executor can take any chunk of the data and process it without regard for the order of the rows.
Now obviously it is possible to perform operations that do involve ordering (lead, lag, etc.), but these will be slower because they require Spark to shuffle data between the executors. (The shuffling of data is typically one of the slowest components of a Spark job.)
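For example, an ordering-dependent operation like lag needs a window specification, which is what forces the sort/shuffle (the column names here are illustrative):

from pyspark.sql import Window
from pyspark.sql.functions import lag

# lag only makes sense relative to an ordering, so Spark must sort the data first
w = Window.orderBy("someColumn")
df_with_prev = df.withColumn("prev_value", lag("value").over(w))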
Related/Further Reading
PySpark DataFrames - way to enumerate without converting to Pandas?
PySpark - get row number for each row in a group
how to add Row id in pySpark dataframes
You can convert your Spark DataFrame to a Koalas DataFrame.
Koalas is a library from Databricks that gives an almost pandas-like interface to Spark DataFrames. See https://pypi.org/project/koalas/
import databricks.koalas as ks
kdf = ks.DataFrame(your_spark_df)
kdf[0:500] # your indexes here

How do I create index on pyspark df?

I have a bunch of Hive tables.
I want to:
Pull the tables into PySpark DataFrames.
Apply a UDF to them.
Join 4 tables based on customer id.
Is there a concept of indexing in Spark to speed up the operation?
If so, what's the command?
How do I create an index on a DataFrame?
I understand your problem, but the thing is, you acquire the data at the same time you process it. Therefore, calculating an index before joining is useless, as it will take more time to first create the index.
If you have several write operations, you may want to cache your data to speed things up, but otherwise the index is not the solution to investigate.
There is maybe another thing you can try: df.repartition.
This will partition your df according to one column, but I have no idea whether it helps here.
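A small sketch of those two ideas in the context of the question; "table_a", "table_b" and "customer_id" are placeholders for the real Hive tables and join key:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# repartition each table on the join key before joining (the df.repartition idea above)
a = spark.table("table_a").repartition("customer_id")
b = spark.table("table_b").repartition("customer_id")

# cache only if the same data is reused by several downstream actions or writes
a = a.cache()

joined = a.join(b, "customer_id")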

Join Spark dataframe with Cassandra table [duplicate]

Dataframe A (millions of records): its columns include create_date and modified_date.
Dataframe B (500 records) has start_date and end_date.
Current approach:
Select a.*,b.* from a join b on a.create_date between start_date and end_date
The above job takes half an hour or more to run.
How can I improve the performance?
The DataFrame API currently doesn't have a way to do direct joins like that. It will fully read both tables before performing the join.
https://issues.apache.org/jira/browse/SPARK-16614
You can use the RDD API to take advantage of the joinWithCassandraTable function
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
As others suggested, one approach is to broadcast the smaller dataframe. This can also be done automatically by configuring the parameter below.
spark.sql.autoBroadcastJoinThreshold
If the dataframe size is smaller than the value specified here, Spark automatically broadcasts the smaller dataframe instead of performing a shuffle join. You can read more about this here.
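For an explicit version of the same idea, PySpark also exposes a broadcast hint. The sketch below assumes the two dataframes from the question are loaded as df_a and df_b and that spark is the active SparkSession; the 50 MB threshold is just an illustrative value:

from pyspark.sql.functions import broadcast, col

# raise the automatic broadcast threshold (in bytes)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# or force the broadcast explicitly on the small (500-row) dataframe
joined = df_a.join(
    broadcast(df_b),
    (col("create_date") >= col("start_date")) & (col("create_date") <= col("end_date")),
)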
