Pyspark groupBy with custom partitioner - apache-spark

I want to apply some custom partitioning when working with a given DataFrame. I found that the RDD groupBy provides me with the desired functionality. Now when I call
dataframe.rdd.groupBy(lambda row: row[1:3], numPartitions, partitioner)
I end up with a PythonRDD that has a tuple as the key and a ResultIterable as the value. What I want to do next is convert this back to a DataFrame, since I want to use the apply function of GroupedData. I have attempted multiple things but have had no luck so far.
Any help would be appreciated!
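For reference, a minimal sketch of one way to convert back (it assumes a SparkSession named spark and that the grouped values should simply be flattened back into rows with the original schema):

grouped = dataframe.rdd.groupBy(lambda row: row[1:3], numPartitions, partitioner)

# Each element of `grouped` is (key_tuple, ResultIterable of Rows);
# drop the key, flatten the Rows back out, and rebuild a DataFrame
# with the original schema so it can be grouped again and fed to apply.
flattened = grouped.flatMap(lambda kv: kv[1])
result_df = spark.createDataFrame(flattened, schema=dataframe.schema)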

Related

Create XML request from each record of a dataframe

I have tried many options, including withColumn, udf, lambda, foreach, and map, but am not getting the expected output. At most, I am able to transform only the first record. The inputfile.json will keep growing, and the expected output should give the XML in the desired structure. I will later produce the expected output on Kafka.
Spark 2.3, Python 2.7. This needs to be done in PySpark.
Edit 1:
I am able to add a column to the main DataFrame which has the required XML. I used withColumn and functions.format_string and was able to add strings (the XML structures) to columns of the DataFrame.
Now my next target is to produce just the value of that new column to Kafka. I am using df.foreachPartition(send_to_kafka) and have created the function below:
def send_to_kafka(rows):
    kafka = SimpleClient('localhost:9092')
    producer = SimpleProducer(kafka)
    for row in rows:
        producer.send_messages('test', str(row.asDict()))
But unfortunately it does two things:
a. It produces the record on Kafka as {'newColumn': u'myXMLPayload'}. I do not want that; I want only myXMLPayload to be produced on Kafka.
b. It adds the u'' prefix because the value is a unicode string.
I want to get rid of these two parts and I would be good to go.
Any help would be appreciated.
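For what it's worth, a minimal sketch of a fix for both points (assuming the added column is called newColumn, as in the Kafka output above, and that the payload should be sent as UTF-8 bytes):

from kafka import SimpleClient, SimpleProducer

def send_to_kafka(rows):
    kafka = SimpleClient('localhost:9092')
    producer = SimpleProducer(kafka)
    for row in rows:
        # Send only the column's value, encoded to bytes, instead of
        # the repr of the whole Row dict (which adds the u'' prefix).
        producer.send_messages('test', row['newColumn'].encode('utf-8'))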

Is there a way to slice dataframe based on index in pyspark?

In Python or R, there are ways to slice a DataFrame by index.
For example, in pandas:
df.iloc[5:10,:]
Is there a similar way in pyspark to slice data based on location of rows?
Short Answer
If you already have an index column (suppose it is called 'id'), you can filter using pyspark.sql.Column.between:
from pyspark.sql.functions import col
df.where(col("id").between(5, 10))
If you don't already have an index column, you can add one yourself and then use the code above. You will need some ordering built into your data, based on some other column (orderBy("someColumn")); one way to add such an index is sketched below.
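A minimal sketch of adding such an index (it assumes a column, here called someColumn, that defines the row order; note that a window without partitionBy pulls all rows into a single partition, so this only suits modestly sized data):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Assign a 0-based position to each row based on the ordering column,
# then slice on that position.
w = Window.orderBy("someColumn")
df_indexed = df.withColumn("id", row_number().over(w) - 1)
df_indexed.where(col("id").between(5, 10)).show()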
Full Explanation
No, it is not easily possible to slice a Spark DataFrame by index, unless the index is already present as a column.
Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas.) Each row is treated as an independent collection of structured data, and that is what allows for distributed parallel processing. Thus, any executor can take any chunk of the data and process it without regard for the order of the rows.
Now, obviously it is possible to perform operations that do involve ordering (lead, lag, etc.), but these will be slower because they require Spark to shuffle data between the executors. (The shuffling of data is typically one of the slowest components of a Spark job.)
Related/Further Reading
PySpark DataFrames - way to enumerate without converting to Pandas?
PySpark - get row number for each row in a group
how to add Row id in pySpark dataframes
You can convert your Spark DataFrame to a Koalas DataFrame.
Koalas is a DataFrame library by Databricks that gives an almost pandas-like interface to Spark DataFrames. See https://pypi.org/project/koalas/
import databricks.koalas as ks
kdf = ks.DataFrame(your_spark_df)
kdf[0:500] # your indexes here

A more efficient way of getting the nlargest values of a Pyspark Dataframe

I am trying to get the top 5 values of a column of my dataframe.
A sample of the dataframe is given below. In fact the original dataframe has thousands of rows.
Row(item_id=u'2712821', similarity=5.0)
Row(item_id=u'1728166', similarity=6.0)
Row(item_id=u'1054467', similarity=9.0)
Row(item_id=u'2788825', similarity=5.0)
Row(item_id=u'1128169', similarity=1.0)
Row(item_id=u'1053461', similarity=3.0)
The solution I came up with is to sort the whole DataFrame and then take the first 5 values (the code below does that).
items_of_common_users.sort(items_of_common_users.similarity.desc()).take(5)
I am wondering if there is a faster way of achieving this.
Thanks
You can use the RDD.top method with a key:
from operator import attrgetter
df.rdd.top(5, attrgetter("similarity"))
There is a significant overhead of DataFrame to RDD conversion but it should be worth it.
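For example, applied to the DataFrame in the question (a sketch; top returns a plain Python list of Row objects, ordered by the key descending):

from operator import attrgetter

# Returns the 5 rows with the highest similarity, as a list of Rows,
# e.g. [Row(item_id=u'1054467', similarity=9.0), ...]
top5 = items_of_common_users.rdd.top(5, key=attrgetter("similarity"))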

Spark DataFrame Removing duplicates via GroupBy keep first

I am using the groupBy function to remove duplicates from a Spark DataFrame. For each group I simply want to take the first row, which will be the most recent one.
I don't want to perform a max() aggregation because I know the results are already stored sorted in Cassandra and I want to avoid unnecessary computation. See this approach using pandas; it's exactly what I'm after, except in Spark.
df = sqlContext.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="table", keyspace="keyspace") \
    .load() \
    .groupBy("key")
    # what goes here?
Just dropDuplicates should do the job.
Try df.dropDuplicates(["column"]).show() (that is the PySpark form; in Scala it would be df.dropDuplicates(Seq("column")).show).
Check this question for more details.
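If keeping the most recent row per key specifically matters, a window-function sketch is another common pattern (not from the answer above; it assumes a timestamp column, here called ts, that defines recency):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank the rows within each key by recency and keep only the newest one.
w = Window.partitionBy("key").orderBy(col("ts").desc())
deduped = (df.withColumn("rn", row_number().over(w))
             .where(col("rn") == 1)
             .drop("rn"))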

How is the Spark select-explode idiom implemented?

Assume we have a DataFrame with a string column, col1, and an array column, col2. I was wondering what happens behind the scenes in the Spark operation:
df.select('col1', explode('col2'))
It seems that select takes a sequence of Column objects as input, and explode returns a Column, so the types match. But the column returned by explode('col2') is logically of a different length than col1, so I was wondering how select knows to "sync" them when constructing its output DataFrame. I tried looking at the Column class for clues but couldn't really find anything.
The answer is simple: there is no such data structure as a Column. While Spark SQL uses columnar storage for caching and can leverage data layout for some low-level operations, columns are just descriptions of data and transformations, not data containers. So, simplifying things a bit, explode is yet another flatMap on the Dataset[Row].
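As a rough illustration of that point, here is a sketch of an RDD-level analogue of df.select('col1', explode('col2')) (it mimics the semantics only, not how Spark actually plans or executes the query; it assumes col2 holds an array and a SparkSession named spark):

from pyspark.sql import Row

exploded = df.rdd.flatMap(
    # One output Row per element of the array column, each paired with col1.
    lambda row: [Row(col1=row['col1'], col=item) for item in row['col2']]
)
exploded_df = spark.createDataFrame(exploded)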
