I'm trying to find a PySpark equivalent of the following snippet (reference), which creates a unique id for every unique combination of values from two columns.
Pandas approach:
df['my_id'] = df.groupby(['foo', 'bar'], sort=False).ngroup() + 1
I tried the following, but it's creating more ids than required:
df = df.withColumn("my_id", F.row_number().over(Window.orderBy('foo', 'bar')))
Instead of row_number, use dense_rank. Unlike row_number, dense_rank assigns the same rank to rows with identical (foo, bar) values, so every unique combination gets exactly one id:
from pyspark.sql import functions as F, Window
df = spark.createDataFrame(
    [('r1', 'ph1'),
     ('r1', 'ph1'),
     ('r1', 'ph2'),
     ('s4', 'ph3'),
     ('s3', 'ph2'),
     ('s3', 'ph2')],
    ['foo', 'bar'])
df = df.withColumn("my_id", F.dense_rank().over(Window.orderBy('foo', 'bar')))
df.show()
# +---+---+-----+
# |foo|bar|my_id|
# +---+---+-----+
# | r1|ph1| 1|
# | r1|ph1| 1|
# | r1|ph2| 2|
# | s3|ph2| 3|
# | s3|ph2| 3|
# | s4|ph3| 4|
# +---+---+-----+
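As a side note (not part of the original answer), Window.orderBy without a partitionBy moves all rows into a single partition, so on large data it can help to rank only the distinct combinations and join the ids back; a rough sketch:
# Rank only the distinct (foo, bar) pairs, then join the ids back onto df.
combos = (
    df.select('foo', 'bar').distinct()
      .withColumn('my_id', F.dense_rank().over(Window.orderBy('foo', 'bar')))
)
df_with_id = df.join(combos, on=['foo', 'bar'], how='left')
df_with_id.show()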
There are many similar questions that actually ask something different, namely how to avoid duplicate columns in a join; that is not what I am asking here.
Given that I already have a DataFrame with ambiguous columns, how do I remove a specific column?
For example, given:
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

df = spark.createDataFrame(
    spark.sparkContext.parallelize([
        [1, 0.0, "ext-0.0"],
        [1, 1.0, "ext-1.0"],
        [2, 1.0, "ext-2.0"],
        [3, 2.0, "ext-3.0"],
        [4, 3.0, "ext-4.0"],
    ]),
    StructType([
        StructField("id", IntegerType(), True),
        StructField("shared", DoubleType(), True),
        StructField("shared", StringType(), True),
    ])
)
I wish to retain only the numeric columns.
However, attempting to do something like df.select("id", "shared").show() results in:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'shared' is ambiguous, could be: shared, shared.;"
Many related solutions to this problem amount to "avoid ever getting into this situation in the first place", e.g. by using ['joinkey'] instead of a.joinkey == b.joinkey in the join. I reiterate that this is not the situation here; this relates to a dataframe that has already been converted into this form.
The metadata from the DF disambiguates these columns:
$ df.dtypes
[('id', 'int'), ('shared', 'double'), ('shared', 'string')]
$ df.schema
StructType(List(StructField(id,IntegerType,true),StructField(shared,DoubleType,true),StructField(shared,StringType,true)))
So the data is retained internally... I just can't see how to use it.
How do I pick one column over the other?
I expected to be able to use something like col('shared#11'), but I can't see anything like that.
Is this simply not possible in Spark?
To answer this question, please post either a) a working code snippet that solves the problem above, or b) a link to something official from the Spark developers saying that this simply isn't supported.
The easiest solution to this problem is to rename the columns using df.toDF(...<new-col-names>...). But if you don't want to change the column names, you can group the duplicated columns by their type into a struct<type1, type2>, as shown below.
Please note that the solution below is written in Scala, but logically similar code can be implemented in Python. It also works for any number of duplicate columns in the dataframe.
1. Load the test data
val df = Seq((1, 2.0, "shared")).toDF("id", "shared", "shared")
df.show(false)
df.printSchema()
/**
* +---+------+------+
* |id |shared|shared|
* +---+------+------+
* |1 |2.0 |shared|
* +---+------+------+
*
* root
* |-- id: integer (nullable = false)
* |-- shared: double (nullable = false)
* |-- shared: string (nullable = true)
*/
2. get all the duplicated column names
// 1. get all the duplicated column names
val findDupCols = (cols: Array[String]) => cols.map((_ , 1)).groupBy(_._1).filter(_._2.length > 1).keys.toSeq
val dupCols = findDupCols(df.columns)
println(dupCols.mkString(", "))
// shared
3. Rename the duplicate cols like shared => shared:double, shared:string, without touching the other column names
// 2. rename duplicate cols like shared => shared:double, shared:string
val renamedDF = df
  .toDF(df.schema
    .map { case StructField(name, dt, _, _) =>
      if (dupCols.contains(name)) s"$name:${dt.simpleString}" else name
    }: _*)
4. Create a struct of all cols
// 3. create struct of all cols
val structCols = df.schema.map(f => f.name -> f).groupBy(_._1)
  .map { case (name, seq) =>
    if (seq.length > 1)
      struct(
        seq.map { case (_, StructField(fName, dt, _, _)) =>
          expr(s"`$fName:${dt.simpleString}` as ${dt.simpleString}")
        }: _*
      ).as(name)
    else col(name)
  }.toSeq
val structDF = renamedDF.select(structCols: _*)
structDF.show(false)
structDF.printSchema()
/**
* +-------------+---+
* |shared |id |
* +-------------+---+
* |[2.0, shared]|1 |
* +-------------+---+
*
* root
* |-- shared: struct (nullable = false)
* | |-- double: double (nullable = false)
* | |-- string: string (nullable = true)
* |-- id: integer (nullable = false)
*/
5. Get a column by its type using <column_name>.<datatype>
// Use the dataframe without losing any columns
structDF.selectExpr("id", "shared.double as shared").show(false)
/**
* +---+------+
* |id |shared|
* +---+------+
* |1 |2.0 |
* +---+------+
*/
Hope this is useful to someone!
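For readers who want the same idea in PySpark, here is a minimal sketch of just the positional-rename step (assuming the df with the two shared columns from the question; the suffixing scheme is illustrative, not the answer's code):
from pyspark.sql import functions as F
# toDF takes new names positionally, so duplicated names can be disambiguated
# by appending the data type reported by df.dtypes.
renamed = df.toDF(*[
    f"{name}_{dtype}" if df.columns.count(name) > 1 else name
    for name, dtype in df.dtypes
])
renamed.select("id", F.col("shared_double").alias("shared")).show()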
It seems this is possible by rebuilding the dataframe with a new schema using .rdd.toDF() on the dataframe.
However, I'll still accept any answer that is less convoluted and annoying than the one below:
import random
import string
from pyspark.sql.types import DoubleType, LongType

def makeId():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(6))

def makeUnique(column):
    return "%s---%s" % (column.name, makeId())

def makeNormal(column):
    return column.name.split("---")[0]

unique_schema = list(map(makeUnique, df.schema))
df_unique = df.rdd.toDF(schema=unique_schema)
df_unique.show()

numeric_cols = filter(lambda c: c.dataType.__class__ in [LongType, DoubleType], df_unique.schema)
numeric_col_names = list(map(lambda c: c.name, numeric_cols))

df_filtered = df_unique.select(*numeric_col_names)
df_filtered.show()

normal_schema = list(map(makeNormal, df_filtered.schema))
df_fixed = df_filtered.rdd.toDF(schema=normal_schema)
df_fixed.show()
Gives:
+-----------+---------------+---------------+
|id---chjruu|shared---aqboua|shared---ehjxor|
+-----------+---------------+---------------+
| 1| 0.0| ext-0.0|
| 1| 1.0| ext-1.0|
| 2| 1.0| ext-2.0|
| 3| 2.0| ext-3.0|
| 4| 3.0| ext-4.0|
+-----------+---------------+---------------+
+-----------+---------------+
|id---chjruu|shared---aqboua|
+-----------+---------------+
| 1| 0.0|
| 1| 1.0|
| 2| 1.0|
| 3| 2.0|
| 4| 3.0|
+-----------+---------------+
+---+------+
| id|shared|
+---+------+
| 1| 0.0|
| 1| 1.0|
| 2| 1.0|
| 3| 2.0|
| 4| 3.0|
+---+------+
Workaround: Simply rename the columns (in order) and then do whatever you wanted to do!
renamed_df = df.toDF("id", "shared_double", "shared_string")
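From there the ambiguity is gone, so selecting the numeric columns works as usual, e.g.:
renamed_df.select("id", "shared_double").show()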
I am trying to create a UDF to generalize the problem in my code, but I run into issues: it seems I cannot pass a dataframe to the function.
Input DataFrame:
df = sqlContext.createDataFrame(
    [('1', 201001, 3400, 1600, 65, 320, 400,),
     ('1', 201002, 5200, 1600, 65, 320, 400,),
     ('1', 201003, 65, 1550, 32, 320, 400,),
     ('2', 201505, 3200, 1800, 12, 1, 40,),
     ('2', 201508, 3200, 3200, 12, 1, 40,),
     ('3', 201412, 40, 40, 12, 1, 3,)],
    ['ColA', 'Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6'])
+----+------+----+----+----+----+----+
|ColA| Col1|Col2|Col3|Col4|Col5|Col6|
+----+------+----+----+----+----+----+
| 1|201001|3400|1600| 65| 320| 400|
| 1|201002|5200|1600| 65| 320| 400|
| 1|201003| 65|1550| 32| 320| 400|
| 2|201505|3200|1800| 12| 1| 40|
| 2|201508|3200|3200| 12| 1| 40|
| 3|201412| 40| 40| 12| 1| 3|
+----+------+----+----+----+----+----+
Expected Output:
df = sqlContext.createDataFrame(
    [(1, ['201001', '201002', '201003'], [3400, 5200, 65], [1600, 1600, 1550], [65, 32], [320], [400],),
     (2, ['201505', '201508'], [3200, 3200], [1800, 3200], [12], [1], [40],),
     (3, ['201412'], [40], [40], [12], [1], [3],)],
    ['ColA', 'Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6'])
df.show()
+----+--------------------+----------------+------------------+--------+-----+-----+
|ColA| Col1| Col2| Col3| Col4| Col5| Col6|
+----+--------------------+----------------+------------------+--------+-----+-----+
| 1|[201001, 201002, ...|[3400, 5200, 65]|[1600, 1600, 1550]|[65, 32]|[320]|[400]|
| 2| [201505, 201508]| [3200, 3200]| [1800, 3200]| [12]| [1]| [40]|
| 3| [201412]| [40]| [40]| [12]| [1]| [3]|
+----+--------------------+----------------+------------------+--------+-----+-----+
This is the code that works when written inline (i.e. not wrapped in a function):
groupBy = ['ColA']
convert_to_list = ['Col1', 'Col2', 'Col3']
convert_to_set = ['Col4', 'Col5', 'Col6']
exprs = [F.collect_list(F.col(c)).alias(c) for c in convert_to_list] \
    + [F.collect_set(F.col(c)).alias(c) for c in convert_to_set]
df = df.groupby(*groupBy).agg(*exprs)
When I try to create a udf, I get this error:
@F.udf
def aggregation(df, groupby_column, cols_to_list, cols_to_set):
    exprs = [F.collect_list(F.col(c)).alias(c) for c in cols_to_list] \
        + [F.collect_set(F.col(c)).alias(c) for c in cols_to_set]
    return df.groupby(*groupby_column).agg(*exprs)
groupby_column = ['ColA']
cols_to_list = ['Col1', 'Col2', 'Col3']
cols_to_set = ['Col4', 'Col5', 'Col6']
df = aggregation(df, groupby_column, cols_to_list, cols_to_set)
TypeError: Invalid argument, not a string or column: DataFrame
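For what it's worth, the error appears because an F.udf-wrapped function only accepts column arguments, not a DataFrame. A plain Python helper (a rough sketch; aggregate_df is a hypothetical name, not code from this thread) can take the DataFrame directly:
from pyspark.sql import functions as F
def aggregate_df(df, groupby_cols, cols_to_list, cols_to_set):
    # A regular function, not a UDF: collect_list keeps duplicates, collect_set de-duplicates.
    exprs = [F.collect_list(c).alias(c) for c in cols_to_list] \
        + [F.collect_set(c).alias(c) for c in cols_to_set]
    return df.groupby(*groupby_cols).agg(*exprs)
result = aggregate_df(df, ['ColA'], ['Col1', 'Col2', 'Col3'], ['Col4', 'Col5', 'Col6'])
result.show()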
I have a dataframe similar to:
+---+-----+-----+
|key|thing|value|
+---+-----+-----+
| u1| foo| 1|
| u1| foo| 2|
| u1| bar| 10|
| u2| foo| 10|
| u2| foo| 2|
| u2| bar| 10|
+---+-----+-----+
And want to get a result of:
+---+-----+---------+----+
|key|thing|sum_value|rank|
+---+-----+---------+----+
| u1| bar| 10| 1|
| u1| foo| 3| 2|
| u2| foo| 12| 1|
| u2| bar| 10| 2|
+---+-----+---------+----+
Currently, there is code similar to:
val df = Seq(("u1", "foo", 1), ("u1", "foo", 2), ("u1", "bar", 10), ("u2", "foo", 10), ("u2", "foo", 2), ("u2", "bar", 10)).toDF("key", "thing", "value")
// calculate sums per key and thing
val aggregated = df.groupBy("key", "thing").agg(sum("value").alias("sum_value"))
// get topk items per key
val k = lit(10)
val topk = aggregated.withColumn("rank", rank over Window.partitionBy("key").orderBy(desc("sum_value"))).filter('rank < k)
However, this code is very inefficient. A window function generates a total order of items and causes a gigantic shuffle.
How can I calculate top-k items more efficiently?
Maybe by using approximate functions (sketches), similar to https://datasketches.github.io/ or https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html.
This is a classic recommender-systems problem.
case class Rating(thing: String, value: Int) extends Ordered[Rating] {
  // order descending by value
  def compare(that: Rating): Int = -this.value.compare(that.value)
}
case class Recommendation(key: String, ratings: Seq[Rating]) {
  def keep(n: Int) = this.copy(ratings = ratings.sorted.take(n))
}

val TOPK = 10

df.groupBy('key)
  .agg(collect_list(struct('thing, 'value)) as "ratings")
  .as[Recommendation]
  .map(_.keep(TOPK))
You can also check the source code at:
Spotify Big Data Rosetta Code / TopItemsPerUser.scala, with several solutions for Spark or Scio
Spark MLLib / TopByKeyAggregator.scala, considered the best practice when using their recommendation algorithm, although it looks like their examples still use RDDs:
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

sc.parallelize(Array(("u1", ("foo", 1)), ("u1", ("foo", 2)), ("u1", ("bar", 10)),
                     ("u2", ("foo", 10)), ("u2", ("foo", 2)), ("u2", ("bar", 10))))
  .topByKey(10)(Ordering.by(_._2))
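For reference, the collect-and-trim idea above can also be written in PySpark without ranking every row; a rough sketch, assuming Spark 2.4+ for F.slice and an aggregated frame with key, thing and sum_value columns as in the question:
from pyspark.sql import functions as F
TOPK = 10
# Collect (sum_value, thing) structs per key; sorting the array descending
# orders the structs by sum_value first, and slice keeps only the top TOPK.
topk = (
    aggregated
    .groupBy('key')
    .agg(F.slice(
        F.sort_array(F.collect_list(F.struct('sum_value', 'thing')), asc=False),
        1, TOPK
    ).alias('top'))
)
topk.show(truncate=False)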
RDDs to the rescue
aggregated
  .as[(String, String, Long)]
  .rdd
  .groupBy(_._1)
  .map { case (key, it) =>
    // keep the top 1 (thing, sum_value) pair per key, sorted descending by the sum
    (key, it.map(e => (e._2, e._3)).toList.sortBy(sorter => -sorter._2).take(1))
  }
  .toDF.show
+---+----------+
| _1|        _2|
+---+----------+
| u1|[[bar,10]]|
| u2|[[foo,12]]|
+---+----------+
This can most likely be improved further, as suggested in the comments, by not starting from aggregated but from df directly. That could look similar to:
df.as[(String, String, Long)].rdd.groupBy(_._1).map { case (key, it) =>
  // sum the values per thing, then keep the top 1 by descending sum
  val aggregatedInner = it.groupBy(e => e._2).mapValues(events => events.map(_._3).sum)
  val topk = aggregatedInner.toArray.sortBy(sorter => -sorter._2).take(1)
  (key, topk)
}.toDF.show
Starting from the following Spark dataframe:
from io import StringIO
import pandas as pd
from pyspark.sql.functions import col
pd_df = pd.read_csv(StringIO("""device_id,read_date,id,count
device_A,2017-08-05,4041,3
device_A,2017-08-06,4041,3
device_A,2017-08-07,4041,4
device_A,2017-08-08,4041,3
device_A,2017-08-09,4041,3
device_A,2017-08-10,4041,1
device_A,2017-08-10,4045,2
device_A,2017-08-11,4045,3
device_A,2017-08-12,4045,3
device_A,2017-08-13,4045,3"""),infer_datetime_format=True, parse_dates=['read_date'])
df = spark.createDataFrame(pd_df).withColumn('read_date', col('read_date').cast('date'))
df.show()
Output:
+--------------+----------+----+-----+
|device_id | read_date| id|count|
+--------------+----------+----+-----+
| device_A|2017-08-05|4041| 3|
| device_A|2017-08-06|4041| 3|
| device_A|2017-08-07|4041| 4|
| device_A|2017-08-08|4041| 3|
| device_A|2017-08-09|4041| 3|
| device_A|2017-08-10|4041| 1|
| device_A|2017-08-10|4045| 2|
| device_A|2017-08-11|4045| 3|
| device_A|2017-08-12|4045| 3|
| device_A|2017-08-13|4045| 3|
+--------------+----------+----+-----+
I would like to find the most frequent id for each (device_id, read_date) combination, over a 3 day rolling window. For each group of rows selected by the time window, I need to find the most frequent id by summing up the counts per id, then return the top id.
Expected Output:
+--------------+----------+----+
|device_id | read_date| id|
+--------------+----------+----+
| device_A|2017-08-05|4041|
| device_A|2017-08-06|4041|
| device_A|2017-08-07|4041|
| device_A|2017-08-08|4041|
| device_A|2017-08-09|4041|
| device_A|2017-08-10|4041|
| device_A|2017-08-11|4045|
| device_A|2017-08-12|4045|
| device_A|2017-08-13|4045|
+--------------+----------+----+
I am starting to think this is only possible using a custom aggregation function. Since Spark 2.3 is not out yet, I will have to write this in Scala or use collect_list. Am I missing something?
Add a window:
from pyspark.sql.functions import window, sum as sum_, date_add, col
df_w = df.withColumn(
    "read_date", window("read_date", "3 days", "1 day")["start"].cast("date")
)
# Then handle the counts
df_w = df_w.groupBy('device_id', 'read_date', 'id').agg(sum_('count').alias('count'))
Then use one of the solutions from Find maximum row per group in Spark DataFrame, for example:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
rolling_window = 3
top_df = (
df_w
.withColumn(
"rn",
row_number().over(
Window.partitionBy("device_id", "read_date")
.orderBy(col("count").desc())
)
)
.where(col("rn") == 1)
.orderBy("read_date")
.drop("rn")
)
# results are calculated on the start of the time window - adjust read_date as needed
final_df = top_df.withColumn('read_date', date_add('read_date', rolling_window - 1))
final_df.show()
# +---------+----------+----+-----+
# |device_id| read_date| id|count|
# +---------+----------+----+-----+
# | device_A|2017-08-05|4041| 3|
# | device_A|2017-08-06|4041| 6|
# | device_A|2017-08-07|4041| 10|
# | device_A|2017-08-08|4041| 10|
# | device_A|2017-08-09|4041| 10|
# | device_A|2017-08-10|4041| 7|
# | device_A|2017-08-11|4045| 5|
# | device_A|2017-08-12|4045| 8|
# | device_A|2017-08-13|4045| 9|
# | device_A|2017-08-14|4045| 6|
# | device_A|2017-08-15|4045| 3|
# +---------+----------+----+-----+
I managed to find a very inefficient solution. Hopefully someone can spot improvements that avoid the Python udf and the call to collect_list.
from collections import Counter
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, first, udf
from pyspark.sql.types import IntegerType

def top_id(ids, counts):
    # Sum the counts per id and return the id with the highest total
    c = Counter()
    for cnid, count in zip(ids, counts):
        c[cnid] += count
    return c.most_common(1)[0][0]
rolling_window = 3
days = lambda i: i * 86400
# Define a rolling calculation window based on time
window = (
    Window()
    .partitionBy("device_id")
    .orderBy(col("read_date").cast("timestamp").cast("long"))
    .rangeBetween(-days(rolling_window - 1), 0)
)
# Use window and collect_list to store data matching the window definition on each row
df_collected = df.select(
    'device_id', 'read_date',
    collect_list(col('id')).over(window).alias('ids'),
    collect_list(col('count')).over(window).alias('counts')
)
# Get rid of duplicate rows where necessary
df_grouped = df_collected.groupBy('device_id', 'read_date').agg(
    first('ids').alias('ids'),
    first('counts').alias('counts'),
)
# Register and apply udf to return the most frequently seen id
top_id_udf = udf(top_id, IntegerType())
df_mapped = df_grouped.withColumn('top_id', top_id_udf(col('ids'), col('counts')))
df_mapped.show(truncate=False)
returns:
+---------+----------+------------------------+------------+------+
|device_id|read_date |ids |counts |top_id|
+---------+----------+------------------------+------------+------+
|device_A |2017-08-05|[4041] |[3] |4041 |
|device_A |2017-08-06|[4041, 4041] |[3, 3] |4041 |
|device_A |2017-08-07|[4041, 4041, 4041] |[3, 3, 4] |4041 |
|device_A |2017-08-08|[4041, 4041, 4041] |[3, 4, 3] |4041 |
|device_A |2017-08-09|[4041, 4041, 4041] |[4, 3, 3] |4041 |
|device_A |2017-08-10|[4041, 4041, 4041, 4045]|[3, 3, 1, 2]|4041 |
|device_A |2017-08-11|[4041, 4041, 4045, 4045]|[3, 1, 2, 3]|4045 |
|device_A |2017-08-12|[4041, 4045, 4045, 4045]|[1, 2, 3, 3]|4045 |
|device_A |2017-08-13|[4045, 4045, 4045] |[3, 3, 3] |4045 |
+---------+----------+------------------------+------------+------+
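If the main goal is just to drop the Python UDF, the collected arrays can also be re-aggregated natively (a rough sketch building on the df_grouped frame above; it still relies on collect_list):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Explode the collected arrays back into (id, count) rows, re-sum the counts
# per (device_id, read_date, id), and keep the id with the highest total.
exploded = (
    df_grouped
    .select('device_id', 'read_date', 'counts',
            F.posexplode('ids').alias('pos', 'id'))
    .withColumn('count', F.expr('counts[pos]'))
)
summed = (
    exploded
    .groupBy('device_id', 'read_date', 'id')
    .agg(F.sum('count').alias('total'))
)
w = Window.partitionBy('device_id', 'read_date').orderBy(F.col('total').desc())
top = (
    summed
    .withColumn('rn', F.row_number().over(w))
    .where('rn = 1')
    .drop('rn', 'total')
)
top.show()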