How to process pyspark dataframe as group by column value - apache-spark

I have a huge dataframe containing many different item_id values and their related data. I need to process each item_id group separately, in parallel. I tried to repartition the dataframe by item_id using the code below, but it seems the data is still being processed as a whole rather than in chunks:
data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns
result = data.repartition('ITEM_ID') \
    .rdd \
    .mapPartitions(lambda iter: pd.DataFrame(list(iter), columns=columns)) \
    .mapPartitions(scan_item_best_model) \
    .collect()
Also, is repartition the correct approach, or is there something I am doing wrong?

After looking around I found a question that addresses a similar problem; in the end I solved it like this:
import pandas as pd
import pyspark.sql.functions as F

data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns
# Pack all columns into a struct, then collect each item's rows into one list per ITEM_ID
df = data.select("ITEM_ID", F.struct(columns).alias("df"))
df = df.groupBy('ITEM_ID').agg(F.collect_list('df').alias('data'))
# Turn each group into a pandas DataFrame and process it independently
df = df.rdd \
    .map(lambda row: (row['ITEM_ID'], pd.DataFrame.from_records(row['data'], columns=columns))) \
    .map(scan_item_best_model)
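
As a side note: on Spark 3.0+ the same per-group processing can be written with groupBy().applyInPandas, which hands each ITEM_ID group to your function as a pandas DataFrame without dropping to the RDD API. A minimal sketch, assuming scan_item_best_model can be adapted to take and return a pandas DataFrame, and with a purely hypothetical result schema:

import pandas as pd

# Hypothetical result schema; replace it with whatever scan_item_best_model actually returns
result_schema = "ITEM_ID string, best_model string, score double"

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains all rows for a single ITEM_ID as a pandas DataFrame
    return scan_item_best_model(pdf)

result = data.groupBy("ITEM_ID").applyInPandas(process_group, schema=result_schema)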

Related

How to partition dataframe by column in pyspark for further processing?

I need to partition my dataframe by column. I know that it is possible when saving to separate files, but I need to partition for further processing (I need to sort each partition in a certain order and apply a UDF to the ordered partitions).
My code is:
df = spark.createDataFrame([(2,), (1,), (2,), (1,), (2,)], ("name",)) \
    .repartitionByRange(2, "name") \
    .rdd.glom().collect()
print(df)
# [[Row(name=2), Row(name=1), Row(name=2), Row(name=1), Row(name=2)], []]
I need to get something like that:
[[(2,), (2,), (2,)], [(1,), (1,)]]
You can use repartition instead of repartitionByRange:
df = spark.createDataFrame([(2,), (1,), (2,), (1,), (2,)], ("name",)) \
    .repartition(2, "name") \
    .rdd.glom().collect()
print(df)
# [[Row(name=2), Row(name=2), Row(name=2)], [Row(name=1), Row(name=1)]]
repartitionByRange uses sampling to estimate range boundaries, so the resulting split can be skewed (including an empty partition), as you observed. repartition on a column uses hash partitioning, which keeps all rows with the same value in the same partition.
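If you want to verify which partition each row ends up in without collecting everything, spark_partition_id can help; a small sketch of the same example:

from pyspark.sql import functions as F

df = spark.createDataFrame([(2,), (1,), (2,), (1,), (2,)], ("name",)) \
    .repartition(2, "name")

# spark_partition_id() reports the physical partition each row lives in
df.withColumn("pid", F.spark_partition_id()).show()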

apply window function to multiple columns

I have a DF with over 20 columns. For each column I need to find the lead value and add it to the result.
I've been doing it using withColumn:
df
  .withColumn("lead_col1", lead("col1").over(window))
  .withColumn("lead_col2", lead("col2").over(window))
  .withColumn("lead_col3", lead("col3").over(window))
and 17 more lines like that. Is there a way to do it using less code? I tried using this example, but it doesn't work.
Check the code below; it is faster than foldLeft.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val windowSpec = ...

// Build the lead columns from (alias, source column) pairs
val leadColumns = Seq(
  ("lead_col1", "col1"),
  ("lead_col2", "col2"),
  ("lead_col3", "col3")
).map(c => lead(col(c._2), 1).over(windowSpec).as(c._1))

// Keep the original columns and append the lead columns
val allColumns = df.columns.map(col) ++ leadColumns
Applying allColumns to the DataFrame:
df.select(allColumns: _*).show(false)
Like Sath suggested, foldLeft also works:
val columns = df.columns
columns.foldLeft(df) { (tempDF, colName) =>
  tempDF.withColumn("lag_" + colName, lag($"$colName", 1).over(window))
}
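
The same pattern works in PySpark: build the list of lead columns with a comprehension and add them in a single select. A minimal sketch, assuming window is an existing WindowSpec and the source columns are named col1 through col20:

from pyspark.sql import functions as F

source_cols = ["col{}".format(i) for i in range(1, 21)]
lead_cols = [F.lead(c, 1).over(window).alias("lead_" + c) for c in source_cols]
df = df.select("*", *lead_cols)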

Check for empty row within spark dataframe?

I am running over several CSV files doing some checks, and for one file I am getting a NullPointerException; I suspect that it contains some empty rows.
So I am running the following, and for some reason it gives me an OK output:
from pyspark.sql import functions as sf
from pyspark.sql.types import BooleanType
# Returns True only when every field in the row is None
check_empty = lambda row: not any([False if k is None else True for k in row])
check_empty_udf = sf.udf(check_empty, BooleanType())
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()
Am I missing something within the filter function, or is it that we can't extract empty rows from dataframes this way?
You could use df.dropna(how='all') to drop the rows in which every column is null, and then compare the counts.
Something like:
df_clean = df.dropna(how='all')
num_empty_rows = df.count() - df_clean.count()
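If you want to see the empty rows themselves rather than just count them, the filter can be built from plain column expressions instead of a UDF; a small sketch:

from functools import reduce
from pyspark.sql import functions as F

# A row counts as "empty" when every column is null
all_null = reduce(lambda a, b: a & b, [F.col(c).isNull() for c in df.columns])
df.filter(all_null).show()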
Alternatively, you could use a built-in CSV reader option for dealing with such rows at load time:
val df = spark.read
.format("csv")
.option("header", "true")
.option("mode", "DROPMALFORMED") // Drop empty/malformed rows
.load("hdfs:///path/file.csv")
Check this reference - https://docs.databricks.com/spark/latest/data-sources/read-csv.html#reading-files
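The same option is available from the Python API; a minimal PySpark sketch of the same read:

df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .load("hdfs:///path/file.csv")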

pyspark df.count() taking a very long time (or not working at all)

I have the following code, which is simply doing some joins and then outputting the data:
from pyspark.sql.functions import udf, struct, broadcast
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.logConf', 'true')
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("Generate Parameters") \
    .getOrCreate()
spark.sparkContext.setLogLevel("OFF")

df1 = spark.read.parquet("/location/mydata")
df1 = df1.select([c for c in df1.columns if c in ['sender', 'receiver', 'ccc', 'cc', 'pr']])
df2 = spark.read.csv("/location/mydata2")

cond1 = [(df1.sender == df2._c1) | (df1.receiver == df2._c1)]
df3 = df1.join(broadcast(df2), cond1)
df3 = df3.select([c for c in df3.columns if c in ['sender', 'receiver', 'ccc', 'cc', 'pr']])
df1 is 1,862,412,799 rows and df2 is 8679 rows
When I then call:
df3.count()
it just seems to sit there, showing the following:
[Stage 33:> (0 + 200) / 200]
Assumptions for this answer:
df1 is the dataframe containing 1,862,412,799 rows.
df2 is the dataframe containing 8679 rows.
df1.count() returns a value quickly (as per your comment)
There may be three areas where the slowdown is occurring:
The imbalance of data sizes (1,862,412,799 vs 8679):
Although Spark is excellent at handling large quantities of data, it does not deal well with very small sets. If not set explicitly, Spark splits your data into multiple partitions, and for a small file the partition count can be excessively high relative to the amount of data each partition actually holds. I recommend trying the following and seeing if it improves speed.
df2 = spark.read.csv("/location/mydata2")
df2 = df2.repartition(2)
Note: The number 2 here is just an estimated number, based on how many partitions would suit the amount of rows that are in that set.
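To see how many partitions the small dataframe actually gets, and whether the repartition takes effect, you can inspect the partition count directly; a quick sketch:

# Number of partitions Spark chose when reading the file
print(df2.rdd.getNumPartitions())
# After repartitioning down to a couple of partitions
print(df2.repartition(2).rdd.getNumPartitions())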
Broadcast Cost:
The delay in the count may be due to the actual broadcast step. Your data is collected and copied to every node in the cluster before the join, and all of this happens only once count() is called. Depending on your infrastructure, this could take some time. If the repartition above doesn't help, try removing the broadcast call. If that turns out to be the delay, it is worth confirming that there are no bottlenecks within your cluster, or whether the broadcast is necessary at all.
Unexpected Join Explosion
I do not imply that this is an issue, but it is always good to check that the join condition you have set is not creating unexpected duplicates. Because your condition matches df2._c1 against either sender or receiver, it is possible that this is happening and causing the slowdown you experience when the processing of df3 is actually triggered.
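One quick way to check this (a sketch, not part of the original answer): count how often each join key occurs in df2, since keys that appear more than once will multiply the matching df1 rows.

from pyspark.sql import functions as F

dupes = df2.groupBy("_c1").count().filter(F.col("count") > 1)
dupes.show()
print(dupes.count(), "join keys appear more than once out of", df2.count(), "rows")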

Comparing two rows at a time in PySpark

I am new to Spark, and am looking for help with best practices. I have a large DataFrame, and need to feed two rows at a time into a function that compares them.
actual_data is a DataFrame with an id column, and several value columns.
rows_to_compare is a DataFrame with two columns: left_id and right_id.
For each pair in rows_to_compare, I'd like to feed the two corresponding rows from actual_data into a function.
My actual data is quite large (~30GB) and has many columns, so I've reduced it to this simpler example:
import pandas as pd
from pyspark.sql import SQLContext
from pyspark.sql.functions import col
import builtins

sqlContext = SQLContext(sc)

# Build DataFrame of actual data
data = {
    'id': [1, 2, 3, 4, 5],
    'value': [11, 12, 13, 14, 15]}
actual_data_df = sqlContext.createDataFrame(
    pd.DataFrame(data, columns=data.keys()))

# Build DataFrame of rows to compare
rows_to_compare = {
    'left_id': [1, 2, 3, 4, 5],
    'right_id': [1, 1, 1, 1, 1]}
rows_to_compare_df = sqlContext.createDataFrame(
    pd.DataFrame(rows_to_compare, columns=rows_to_compare.keys()))

result = (
    rows_to_compare_df
    .join(
        actual_data_df.alias('a'),
        col('left_id') == col('a.id'))
    .join(
        actual_data_df.alias('b'),
        col('right_id') == col('b.id'))
    .withColumn(
        'total',
        builtins.sum(
            [col('a.value'),
             col('b.value')]))
    .select('a.id', 'b.id', 'total')
    .collect())
This returns the desired output:
[Row(id=2, id=1, total=23), Row(id=5, id=1, total=26), Row(id=4, id=1, total=25), Row(id=1, id=1, total=22), Row(id=3, id=1, total=24)]
When I run this, it seems quite slow, even for this toy problem. Is this the best way of approaching this problem? The clearest alternative approach I can think of is to make each row of my DataFrame contain the values for both rows I'd like to compare. I'm concerned about this approach though since it will involve a tremendous amount of data duplication.
Any help is much appreciated, thank you.
