Keep only the newest data with Spark structured streaming - apache-spark

I am streaming data like this: time, id, value
I want to keep only one record for each id, with the newest value. What is the best way to deal with this problem?
I'd prefer to use PySpark.

from pyspark.sql import Window
from pyspark.sql.functions import rank, col, monotonically_increasing_id

# order by time descending so that rank 1 is the newest record per id
window = Window.partitionBy("id").orderBy(col("time").desc(), "tiebreak")

(df_s
 .withColumn("tiebreak", monotonically_increasing_id())
 .withColumn("rank", rank().over(window))
 .filter(col("rank") == 1)
 .drop("rank", "tiebreak")
 .show())
Ordering the window by time descending makes rank 1 the newest record per id; the tiebreak column breaks ties between rows that share the same time within a partition, so exactly one row per id survives the filter.
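For reference, here is a minimal batch sketch of the same pattern on made-up sample data (the values are purely illustrative):
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import rank, col, monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()
# hypothetical sample rows: (time, id, value)
df_s = spark.createDataFrame(
    [(1, "a", 10.0), (2, "a", 20.0), (1, "b", 30.0)],
    ["time", "id", "value"])
window = Window.partitionBy("id").orderBy(col("time").desc(), "tiebreak")
(df_s
 .withColumn("tiebreak", monotonically_increasing_id())
 .withColumn("rank", rank().over(window))
 .filter(col("rank") == 1)
 .drop("rank", "tiebreak")
 .show())
# keeps (2, "a", 20.0) and (1, "b", 30.0): the newest row per id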

Related

Why is Pandas-API-on-Spark's apply on groups way slower than the pyspark API?

I'm seeing strange performance results when comparing the two APIs in pyspark 3.2.1 that provide the ability to run pandas UDFs on grouped results of a Spark DataFrame:
df.groupBy().applyInPandas()
ps_df.groupby().apply() - the new apply introduced in the Pandas API on Spark, AKA Koalas
First I run the following input generator code in local spark mode (Spark 3.2.1):
import pyspark.sql.types as types
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
import pyspark.pandas as ps
spark = SparkSession.builder \
    .config("spark.sql.execution.arrow.pyspark.enabled", True) \
    .getOrCreate()
ps.set_option("compute.default_index_type", "distributed")
spark.range(1000000).withColumn('group', (col('id') / 10).cast('int')) \
    .write.parquet('/tmp/sample_input', mode='overwrite')
Then I test the applyInPandas:
def getsum(pdf):
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = spark.read.parquet('/tmp/sample_input')
output_schema = types.StructType(
    df.schema.fields + [types.StructField('sum_in_group', types.FloatType())]
)
df.groupBy('group').applyInPandas(getsum, schema=output_schema) \
    .write.parquet('/tmp/schematest', mode='overwrite')
And the code executes in under 30 seconds (on an i7-9750H CPU).
Then I try the new API, and while I really appreciate how nice the code looks:
def getsum(pdf) -> ps.DataFrame["id": int, "group": int, "sum_in_group": int]:
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = ps.read_parquet('/tmp/sample_input')
df.groupby('group').apply(getsum) \
    .to_parquet('/tmp/schematest', mode='overwrite')
... every time the execution takes at least 1m 40s on the same CPU, so more than 3x slower for this simple operation.
I am aware that adding sum_in_group could be done far more efficiently with no pandas involvement, but this is just a small minimal example. Any other operation is also at least 3 times slower.
Do you know the reason for this slowdown? Maybe I'm missing some context parameter that would make these execute in a similar time?
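One way to start investigating (a sketch; it reuses the names defined in the snippets above) is to compare the physical plans Spark generates for the two versions:
# plan for the applyInPandas version (df and output_schema from the first snippet)
df.groupBy('group').applyInPandas(getsum, schema=output_schema).explain()

# plan for the pandas-on-Spark version; .spark.explain() exposes the underlying Spark plan
ps_df = ps.read_parquet('/tmp/sample_input')
ps_df.groupby('group').apply(getsum).spark.explain()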

A quick way to get the mean of each position in a large RDD

I have a large RDD (more than 1,000,000 lines), where each line is a tuple of four elements (A, B, C, D). A head scan of the RDD looks like:
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example, for the data above I need an output list with the values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of the RDD. Is there any quick way to get the output? Thank you very much.
I don't think there is anything faster than calculating the mean (or sum) for each column
If you are using the DataFrame API you can simply aggregate multiple columns:
import os
from pyspark.sql import functions as f
from pyspark.sql import SparkSession

# start local spark session
spark = SparkSession.builder.getOrCreate()

# load the text files as an rdd
def localpath(path):
    return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)

rdd = spark.sparkContext.textFile(localpath('myPosts/'))

# parse each text line of the form "(492,3440,4215,794)" into a numeric tuple
parsed = rdd.map(lambda line: tuple(float(x) for x in line.strip().strip('()').split(',')))

# create data frame from rdd and average every column
df = spark.createDataFrame(parsed)
means_df = df.agg(*[f.avg(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be derived from the default Spark column names ('_1', '_2', ...), so they come out as 'avg(_1)' and so on. If you want more descriptive column names, you can pass them as an argument to the createDataFrame command.
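If you would rather stay in the RDD API, pyspark.mllib's Statistics.colStats computes per-column statistics in a single pass. A sketch, assuming an RDD of numeric tuples such as parsed above:
from pyspark.mllib.stat import Statistics

# parsed: RDD of numeric tuples like (492, 3440, 4215, 794)
summary = Statistics.colStats(parsed)
print(summary.mean())  # numpy array with the mean of each position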

pyspark df.count() taking a very long time (or not working at all)

I have the following code that simply does some joins and then outputs the data:
from pyspark.sql.functions import udf, struct
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark.sql.functions import broadcast
conf = SparkConf()
conf.set('spark.logConf', 'true')
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("Generate Parameters") \
    .getOrCreate()
spark.sparkContext.setLogLevel("OFF")
df1 = spark.read.parquet("/location/mydata")
df1 = df1.select([c for c in df1.columns if c in ['sender','receiver','ccc','cc','pr']])
df2 = spark.read.csv("/location/mydata2")
cond1 = [(df1.sender == df2._c1) | (df1.receiver == df2._c1)]
df3 = df1.join(broadcast(df2), cond1)
df3 = df3.select([c for c in df3.columns if c in['sender','receiver','ccc','cc','pr']])
df1 is 1,862,412,799 rows and df2 is 8679 rows
when I then call:
df3.count()
It just seems to sit there with the following:
[Stage 33:> (0 + 200) / 200]
Assumptions for this answer:
df1 is the dataframe containing 1,862,412,799 rows.
df2 is the dataframe containing 8679 rows.
df1.count() returns a value quickly (as per your comment)
There may be three areas where the slowdown is occurring:
The imbalance of data sizes (1,862,412,799 vs 8679):
Although Spark is amazing at handling large quantities of data, it doesn't deal well with very small sets. If not specifically set, Spark attempts to partition your data into multiple parts, and on small files the partition count can be excessively high in comparison to the actual amount of data each partition holds. I recommend trying the following and seeing if it improves speed.
df2 = spark.read.csv("/location/mydata2")
df2 = df2.repartition(2)
Note: The number 2 here is just an estimate, based on how many partitions would suit the number of rows in that set.
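To see how many partitions df2 actually has before and after, you can check with getNumPartitions:
print(df2.rdd.getNumPartitions())  # before
df2 = df2.repartition(2)
print(df2.rdd.getNumPartitions())  # after: 2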
Broadcast Cost:
The delay in the count may be due to the actual broadcast step. Your data is being saved and copied to every node within your cluster before the join; because Spark evaluates lazily, all of this happens only once count() is called. Depending on your infrastructure, this could take some time. If the above repartition doesn't work, try removing the broadcast call. If the broadcast turns out to be the delay, it is worth confirming there are no bottlenecks within your cluster, or reconsidering whether the broadcast is necessary at all.
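For reference, dropping the broadcast hint from the question's join is simply:
df3 = df1.join(df2, cond1)  # let Spark choose the join strategy itself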
Unexpected Merge Explosion
This is not necessarily an issue, but it is always good to check that the join condition you have set is not creating unexpected duplicates. It is possible that this is happening and causing the slowdown you experience when the processing of df3 is finally triggered.
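A quick, hedged way to check for such an explosion (reusing the column names from the question) is to run the join on a small sample and compare row counts:
# join ~0.1% of df1 and see whether the row count blows up
sample = df1.sample(0.001)
matched = sample.join(broadcast(df2),
                      (sample.sender == df2._c1) | (sample.receiver == df2._c1))
print(sample.count(), matched.count())
# a matched count far above the sample count means the condition is multiplying rows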

Pyspark sql count returns different number of rows than pure sql

I've started using pyspark in one of my projects. I was testing different commands to explore functionalities of the library and I found something that I don't understand.
Take this code:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.dataframe import DataFrame

sc = SparkContext()
hc = HiveContext(sc)
hc.sql("use test_schema")
hc.table("diamonds").count()
The last count() operation returns 53941 records. If I instead run a select count(*) from diamonds in Hive, I get 53940.
Is that pyspark count including the header?
I've tried to look into:
df = hc.sql("select * from diamonds").collect()
df[0]
df[1]
to see if header was included:
df[0] --> Row(carat=None, cut='cut', color='color', clarity='clarity', depth=None, table=None, price=None, x=None, y=None, z=None)
df[1] --> Row(carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55, price=326, x=3.95, y=3.98, z=2.43)
The 0th element doesn't look like the header.
Anyone has an explanation for this?
Thanks!
Ale
Hive can give incorrect counts when stale statistics are used to speed up calculations. To see if this is the problem, in Hive try:
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM diamonds;
Alternatively, refresh the statistics. If your table is not partitioned:
ANALYZE TABLE diamonds COMPUTE STATISTICS;
SELECT COUNT(*) FROM diamonds;
If it is partitioned:
ANALYZE TABLE diamonds PARTITION(partition_column) COMPUTE STATISTICS;
Also take another look at your first row (df[0] in your question). It does look like an improperly formatted header row.
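If that first row really is a header that was loaded as data, a hedged workaround on the PySpark side is to exclude it before counting, e.g. by filtering on a column that is NULL in the malformed row (carat, in your df[0] output):
hc.table("diamonds").filter("carat IS NOT NULL").count()  # should then match Hive's 53940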

User defined function over all rows in window

I have a set of timestamped location data with a set of string feature ids attached to each location. I'd like to use a Window in Spark to pull together an array of all of these feature id strings across the previous N and next N rows, like so:
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func
windowSpec = Window \
    .partitionBy(df['userid']) \
    .orderBy(df['timestamp']) \
    .rowsBetween(-50, 50)
dataFrame = sqlContext.table("locations")
featureIds = featuresCollector(dataFrame['featureId']).over(windowSpec)
dataFrame.select(
    dataFrame['product'],
    dataFrame['category'],
    dataFrame['revenue'],
    featureIds.alias("allFeatureIds"))
Is this possible with Spark and if so, how do I write a function like featuresCollector that can collect all the feature ids in the window?
Spark UDFs cannot be used for aggregations. Spark provides a number of tools (UserDefinedAggregateFunctions, Aggregators, AggregateExpressions) which can be used for custom aggregations, and some of these can be used with windowing, but none can be defined in Python.
If all you want is to collect records, collect_list should do the trick. Please keep in mind that it is a very expensive operation.
from pyspark.sql.functions import collect_list
featureIds = collect_list('featureId').over(windowSpec)
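Plugged into the select from the question, with the same windowSpec, that would look like:
dataFrame.select(
    dataFrame['product'],
    dataFrame['category'],
    dataFrame['revenue'],
    collect_list('featureId').over(windowSpec).alias("allFeatureIds"))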
