User defined function over all rows in window - apache-spark

I have a set of timestamped location data with a set of string feature IDs attached to each location. I'd like to use a Window in Spark to pull together an array of all of these feature ID strings across the previous N and next N rows, à la:
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func

windowSpec = Window \
    .partitionBy(df['userid']) \
    .orderBy(df['timestamp']) \
    .rowsBetween(-50, 50)

dataFrame = sqlContext.table("locations")
featureIds = featuresCollector(dataFrame['featureId']).over(windowSpec)

dataFrame.select(
    dataFrame['product'],
    dataFrame['category'],
    dataFrame['revenue'],
    featureIds.alias("allFeatureIds"))
Is this possible with Spark, and if so, how do I write a function like featuresCollector that can collect all the feature IDs in the window?

Spark UDFs cannot be used for aggregations. Spark provides a number of tools (UserDefinedAggregateFunctions, Aggregators, AggregateExpressions) which can be used for custom aggregations, and some of these can be used with windowing, but none can be defined in Python.
If all you want is to collect records, collect_list should do the trick. Please keep in mind that it is a very expensive operation.
from pyspark.sql.functions import collect_list
featureIds = collect_list('featureId').over(windowSpec)
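Plugging this back into the question's frame, a minimal end-to-end sketch (assuming the same sqlContext, locations table, and userid / timestamp / featureId columns from the question) might look like:

from pyspark.sql.window import Window
from pyspark.sql.functions import collect_list

dataFrame = sqlContext.table("locations")

# same +/- 50 row frame as in the question, per user, ordered by timestamp
windowSpec = (Window
              .partitionBy(dataFrame['userid'])
              .orderBy(dataFrame['timestamp'])
              .rowsBetween(-50, 50))

# collect_list gathers every featureId in the frame into one array column
featureIds = collect_list(dataFrame['featureId']).over(windowSpec)

dataFrame.select(
    dataFrame['userid'],
    dataFrame['timestamp'],
    featureIds.alias("allFeatureIds")).show()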

Related

Why is Pandas-API-on-Spark's apply on groups a way slower than pyspark API?

I'm having strange performance results when comparing the two APIs in pyspark 3.2.1 that provide the ability to run a pandas UDF on grouped results of a Spark DataFrame:
df.groupBy().applyInPandas()
ps_df.groupby().apply() - the new way of applying a function introduced in Pandas-API-on-Spark, a.k.a. Koalas
First, I run the following input-generator code in local Spark mode (Spark 3.2.1):
import pyspark.sql.types as types
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder \
    .config("spark.sql.execution.arrow.pyspark.enabled", True) \
    .getOrCreate()
ps.set_option("compute.default_index_type", "distributed")

spark.range(1000000).withColumn('group', (col('id') / 10).cast('int')) \
    .write.parquet('/tmp/sample_input', mode='overwrite')
Then I test the applyInPandas:
def getsum(pdf):
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = spark.read.parquet(f'/tmp/sample_input')
output_schema = types.StructType(
    df.schema.fields + [types.StructField('sum_in_group', types.FloatType())]
)
df.groupBy('group').applyInPandas(getsum, schema=output_schema) \
    .write.parquet('/tmp/schematest', mode='overwrite')
And the code executes in under 30 seconds (on an i7-9750H CPU).
Then I try the new API, and while I really appreciate how nice the code looks:
def getsum(pdf) -> ps.DataFrame["id": int, "group": int, "sum_in_group": int]:
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = ps.read_parquet(f'/tmp/sample_input')
df.groupby('group').apply(getsum) \
    .to_parquet('/tmp/schematest', mode='overwrite')
... every time the execution time is at least 1m 40s on the same CPU, so more than 3x slower for this simple operation.
I am aware that adding sum_in_group can be done much more efficiently with no pandas involvement, but this is just a small minimal example. Any other operation is also at least 3 times slower.
Do you know what the reason for this slowdown could be? Maybe I'm missing some configuration parameter that would make these execute in a similar time?

Keep only the newest data with Spark structured streaming

I am streaming data like this: time, id, value
I want to keep only one record for each id, with the newest value. What is the best way to deal with this problem?
I prefer to use PySpark.
from pyspark.sql import Window
from pyspark.sql.functions import rank, col, monotonically_increasing_id

# order by time descending so that rank 1 is the newest record per id
window = Window.partitionBy("id").orderBy(col("time").desc(), 'tiebreak')

(df_s
 .withColumn('tiebreak', monotonically_increasing_id())
 .withColumn('rank', rank().over(window))
 .filter(col('rank') == 1).drop('rank', 'tiebreak')
 .show())
The rank and the tiebreak column are added to remove duplicates and break ties within each window partition.
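A small self-contained illustration of the same pattern on a static DataFrame (the sample rows below are made up purely for demonstration):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import rank, col, monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# made-up sample: two records per id with different timestamps
df_s = spark.createDataFrame(
    [(1, 'a', 10.0), (2, 'a', 20.0), (1, 'b', 5.0), (3, 'b', 7.0)],
    ['time', 'id', 'value'])

window = Window.partitionBy("id").orderBy(col("time").desc(), 'tiebreak')

(df_s
 .withColumn('tiebreak', monotonically_increasing_id())
 .withColumn('rank', rank().over(window))
 .filter(col('rank') == 1).drop('rank', 'tiebreak')
 .show())
# keeps only the newest row per id: (2, 'a', 20.0) and (3, 'b', 7.0)

With ascending order instead of desc(), rank 1 would keep the oldest record per id instead.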

A quick way to get the mean of each position in large RDD

I have a large RDD (more than 1,000,000 lines), where each line is a tuple of four elements A, B, C, D. A head scan of the RDD looks like:
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example, for the data above I need an output list with the values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of the RDD. Is there any quick way for me to get the output? Thank you very much.
I don't think there is anything faster than calculating the mean (or sum) for each column.
If you are using the DataFrame API, you can simply aggregate multiple columns:
import os
from pyspark.sql import functions as f
from pyspark.sql import SparkSession

# start local spark session
spark = SparkSession.builder.getOrCreate()

# load as rdd
def localpath(path):
    return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)

rdd = spark.sparkContext.textFile(localpath('myPosts/'))

# create data frame from rdd
# (each text line needs to be parsed into a tuple/Row of numbers first)
df = spark.createDataFrame(rdd)
means_df = df.agg(*[f.avg(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be the default Spark column names ('0', '1', ...). If you want more meaningful column names, you can pass them as an argument to the createDataFrame call.
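As a quick sanity check against the question's four sample rows, a minimal sketch with explicit data and column names (A, B, C, D are illustrative names, not taken from the actual file) reproduces the expected numbers:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

# the four sample tuples from the question
sample = [(492, 3440, 4215, 794),
          (6507, 6163, 2196, 1332),
          (7561, 124, 8558, 3975),
          (423, 1190, 2619, 9823)]

# with explicit column names the result dictionary keys are readable
df = spark.createDataFrame(sample, ['A', 'B', 'C', 'D'])
means = df.agg(*[f.avg(c) for c in df.columns]).first().asDict()
print(means)
# {'avg(A)': 3745.75, 'avg(B)': 2729.25, 'avg(C)': 4397.0, 'avg(D)': 3981.0}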

Deleting specific column in Cassandra from Spark

I was able to delete a specific column with the RDD API with:
sc.cassandraTable("books_ks", "books")
  .deleteFromCassandra("books_ks", "books", SomeColumns("book_price"))
I am struggling to do this with the DataFrame API.
Can someone please share an example?
You cannot delete via the DF API, and it's unnatural via the RDD API. RDDs and DFs are immutable, meaning no modification. You can filter them to cut them down, but this generates a new RDD / DF.
Having said that, what you can do is filter out the rows you wish to delete and then build a C* client to carry out that deletion:
// imports for Spark and C* connection
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf
import spark.implicits._ // for the $"..." column syntax (already in scope in spark-shell)

spark.setCassandraConf("Test Cluster", CassandraConnectorConf.ConnectionHostParam.option("localhost"))

val df = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "books_ks", "table" -> "books"))
  .load()
val dfToDelete = df.filter($"price" < 3).select($"price")
dfToDelete.show()

// import for C* client
import com.datastax.driver.core._

// build a C* client (part of the dependency of the scala driver)
val clusterBuilder = Cluster.builder().addContactPoints("127.0.0.1")
val cluster = clusterBuilder.build()
val session = cluster.connect()

// loop over everything that was filtered in the DF and delete the specified rows
for (price <- dfToDelete.collect())
  session.execute("DELETE FROM books_ks.books WHERE price=" + price.get(0).toString)
A few warnings: this won't work well if you're trying to delete a large portion of rows. Using collect here means that the work is done in Spark's driver program, which is a single point of failure and a bottleneck.
A better way to do this would be either a) to define a DF UDF to carry out the delete (the benefit being that you get parallelization), or b) to drop to the RDD level and do the delete as you've shown above.
Moral of the story: just because it can be done doesn't mean it should be done.

Pyspark sql count returns different number of rows than pure sql

I've started using pyspark in one of my projects. I was testing different commands to explore the functionality of the library, and I found something that I don't understand.
Take this code:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.dataframe import DataFrame

sc = SparkContext()
hc = HiveContext(sc)

hc.sql("use test_schema")
hc.table("diamonds").count()
The last count() operation returns 53941 records. If I instead run select count(*) from diamonds in Hive, I get 53940.
Is the pyspark count including the header?
I've tried to look into:
df = hc.sql("select * from diamonds").collect()
df[0]
df[1]
to see if the header was included:
df[0] --> Row(carat=None, cut='cut', color='color', clarity='clarity', depth=None, table=None, price=None, x=None, y=None, z=None)
df[1] -- > Row(carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55, price=326, x=3.95, y=3.98, z=2.43)
The 0th element doesn't look like the header.
Does anyone have an explanation for this?
Thanks!
Ale
Hive can give incorrect counts when stale statistics are used to speed up calculations. To see if this is the problem, in Hive try:
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM diamonds;
Alternatively, refresh the statistics. If your table is not partitioned:
ANALYZE TABLE diamonds COMPUTE STATISTICS;
SELECT COUNT(*) FROM diamonds;
If it is partitioned:
ANALYZE TABLE diamonds PARTITION(partition_column) COMPUTE STATISTICS;
Also take another look at your first row (df[0] in your question). It does look like an improperly formatted header row.
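To check from the pyspark side whether that header-like row accounts for the extra record, a quick sketch reusing the hc HiveContext from the question (just an illustration, assuming the row really does contain the literal column names):

# count rows where the string columns repeat the column names,
# i.e. the suspected header row loaded as data
hc.sql("SELECT COUNT(*) FROM diamonds WHERE cut = 'cut'").show()

# count with that row excluded; if the header is the culprit,
# this should match Hive's 53940
hc.sql("SELECT COUNT(*) FROM diamonds WHERE cut != 'cut'").show()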
