Bucketing a Spark dataframe - pyspark - python-3.x

I have a Spark dataframe with a column (age). I need to write a PySpark script to bucket the dataframe into 10-year age ranges (for example age 11-20, age 21-30, ...) and find the count of entries in each age span. I need guidance on how to get through this.
For example, I have the following dataframe:
+-----+
|age |
+-----+
| 21|
| 23|
| 35|
| 39|
+-----+
After bucketing (expected):
+-----+------+
|age | count|
+-----+------+
|21-30| 2 |
|31-40| 2 |
+-----+------+

An easy way to run such a calculation would be to compute the histogram on the underlying RDD.
Given known age ranges (fortunately, this is easy to put together - here, using 1, 11, 21, etc.), it's fairly easy to produce the histogram:
hist = df.rdd \
    .map(lambda l: l['age']) \
    .histogram([1, 11, 21, 31, 41, 51, 61, 71, 81, 91])
This will return a tuple with the bucket boundaries and their respective observation counts; for a larger dataset it might look like:
([1, 11, 21, 31, 41, 51, 61, 71, 81, 91],
[10, 10, 10, 10, 10, 10, 10, 10, 11])
Then you can convert that back to a dataframe:
from pyspark.sql import Row

# Use zip to link the age ranges to their counts
countTuples = zip(hist[0], hist[1])
# Build a list of Rows from that
ageList = list(map(lambda l: Row(age_range=l[0], count=l[1]), countTuples))
sc.parallelize(ageList).toDF()
For more info, check the histogram function's documentation in the RDD API
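If you would rather stay in the DataFrame API, here is a minimal sketch of the same bucketing, assuming integer ages and the same 1-based buckets (1-10, 11-20, 21-30, ...) as above; a label such as "21-30" is derived from each age and then counted per label:
from pyspark.sql import functions as F

# Lower bound of each 10-year bucket: 1, 11, 21, ...
lower = (F.floor((F.col("age") - 1) / 10) * 10 + 1).cast("int")

bucketed = (df
    .withColumn("age_range",
                F.concat(lower.cast("string"), F.lit("-"), (lower + 9).cast("string")))
    .groupBy("age_range")
    .count())
bucketed.show()
For the four-row example above this yields the expected two buckets, 21-30 and 31-40, with a count of 2 each.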

Related

Get unique elements for every array-based row

I have a dataset which looks somewhat like this:
idx | attributes
--------------------------
101 | ['a','b','c']
102 | ['a','b','d']
103 | ['b','c']
104 | ['c','e','f']
105 | ['a','b','c']
106 | ['c','g','h']
107 | ['b','d']
108 | ['d','g','i']
I wish to transform the above dataframe into something like this:
idx | attributes
--------------------------
101 | [0,1,2]
102 | [0,1,3]
103 | [1,2]
104 | [2,4,5]
105 | [0,1,2]
106 | [2,6,7]
107 | [1,3]
108 | [3,6,8]
Here, 'a' is replaced by 0, 'b' is replaced by 1, and so on. Essentially, I want to find all unique elements and assign them numbers so that integer operations can be performed on them. My current approach uses RDDs to maintain a single set and loop across rows, but it is highly memory- and time-intensive. Is there another method for this in PySpark?
Thanks in advance
Annotated code
from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F

# Explode the dataframe by `attributes`
df1 = df.selectExpr('idx', "explode(attributes) as attributes")
# Create a StringIndexer to encode the labels
idx = StringIndexer(inputCol='attributes', outputCol='encoded', stringOrderType='alphabetAsc')
df1 = idx.fit(df1).transform(df1)
# Group the encoded column by idx and aggregate with `collect_list`
df1 = df1.groupBy('idx').agg(F.collect_list(F.col('encoded').cast('int')).alias('attributes'))
Result
df1.show()
+---+----------+
|idx|attributes|
+---+----------+
|101| [0, 1, 2]|
|102| [0, 1, 3]|
|103| [1, 2]|
|104| [2, 4, 5]|
|105| [0, 1, 2]|
|106| [2, 6, 7]|
|107| [1, 3]|
|108| [3, 6, 8]|
+---+----------+
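A hedged side note: if you also want to see which label got which code, keep the fitted model around instead of calling fit().transform() inline; labels is an attribute of the fitted StringIndexerModel.
# Illustrative only: fit on the exploded frame and inspect the label -> code mapping
model = idx.fit(df.selectExpr('idx', "explode(attributes) as attributes"))
print(model.labels)   # position in this list is the encoded value, e.g. labels[0] -> 0.0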
This can be done in Spark 2.4 as a one-liner using expr.
In Spark 3.x this can be done without expr (a sketch follows the explanation below).
from pyspark.sql.functions import expr

df = spark.createDataFrame(data=[(101, ['a', 'b', 'c']),
                                 (102, ['a', 'b', 'd']),
                                 (103, ['b', 'c']),
                                 (104, ['c', 'e', 'f']),
                                 (105, ['a', 'b', 'c']),
                                 (106, ['c', 'g', 'h']),
                                 (107, ['b', 'd']),
                                 (108, ['d', 'g', 'i'])],
                           schema=["idx", "attributes"])
df.select(df.idx, expr("transform(attributes, x -> ascii(x) - 96)").alias("attributes")).show()
+---+----------+
|idx|attributes|
+---+----------+
|101| [1, 2, 3]|
|102| [1, 2, 4]|
|103| [2, 3]|
|104| [3, 5, 6]|
|105| [1, 2, 3]|
|106| [3, 7, 8]|
|107| [2, 4]|
|108| [4, 7, 9]|
+---+----------+
The tricky bit: expr("transform(attributes, x -> ascii(x) - 96)")
expr says this is a SQL expression and wraps it as a Column.
transform takes an array column and applies a function to each element; x is the lambda parameter for an element of the array, -> starts the function body, and the closing ) ends it.
ascii(x) - 96 converts each letter's ASCII code into an integer ('a' -> 1, 'b' -> 2, ...).
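For reference, a hedged sketch of the expr-free version mentioned above (pyspark.sql.functions.transform is available in recent Spark 3.x releases); subtracting 97 instead of 96 would give the 0-based mapping ('a' -> 0) the question asked for:
from pyspark.sql import functions as F

df.select(
    df.idx,
    # Apply the same per-element function without a SQL string
    F.transform("attributes", lambda x: F.ascii(x) - 96).alias("attributes")
).show()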
If you are considering performance, compare the explain plan of my answer with the other one provided so far:
df1.groupBy('idx').agg(collect_list(col('encoded').cast('int')).alias('attributes')).explain()
== Physical Plan ==
ObjectHashAggregate(keys=[idx#24L], functions=[collect_list(cast(encoded#140 as int), 0, 0)])
+- Exchange hashpartitioning(idx#24L, 200)
+- ObjectHashAggregate(keys=[idx#24L], functions=[partial_collect_list(cast(encoded#140 as int), 0, 0)])
+- *(1) Project [idx#24L, UDF(attributes#132) AS encoded#140]
+- Generate explode(attributes#25), [idx#24L], false, [attributes#132]
+- Scan ExistingRDD[idx#24L,attributes#25]
my answer:
df.select(df.idx, expr("transform( attributes, x -> ascii(x)-96)").alias("attributes") ).explain()
== Physical Plan ==
Project [idx#24L, transform(attributes#25, lambdafunction((ascii(lambda x#128) - 96), lambda x#128, false)) AS attributes#127]

Get distinct count from an array in each row using pyspark

I am looking for the distinct count of the array in each row of a pyspark dataframe:
input:
col1
[1,1,1]
[3,4,5]
[1,2,1,2]
output:
1
3
2
I used the code below, but it gives me the length of the array instead:
slen = udf(lambda s: len(s), IntegerType())
count = df.withColumn("Count", slen(df.col1))
count.show()
output:
3
3
4
Please help me achieve this with a Python PySpark dataframe.
Thanks in advance!
For Spark 2.4+, you can use array_distinct and then take the size of the result to get the count of distinct values in each array. A UDF will be very slow and inefficient for big data; always try to use Spark's built-in functions.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
(welcome to SO)
df.show()
+------------+
| col1|
+------------+
| [1, 1, 1]|
| [3, 4, 5]|
|[1, 2, 1, 2]|
+------------+
df.withColumn("count", F.size(F.array_distinct("col1"))).show()
+------------+-----+
| col1|count|
+------------+-----+
| [1, 1, 1]| 1|
| [3, 4, 5]| 3|
|[1, 2, 1, 2]| 2|
+------------+-----+
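The same built-ins can also be written as a plain SQL expression if you prefer selectExpr; a small sketch of the same logic:
# size(array_distinct(...)) counts the distinct values per row
df.selectExpr("col1", "size(array_distinct(col1)) as count").show()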

collect_set by preserving order

I was referring to this question here; however, it works for collect_list and not collect_set.
I have a dataframe like this
data = [("ID1", 9),
        ("ID1", 9),
        ("ID1", 8),
        ("ID1", 7),
        ("ID1", 5),
        ("ID1", 5)]
df = spark.createDataFrame(data, ["ID", "Values"])
df.show()
+---+------+
| ID|Values|
+---+------+
|ID1| 9|
|ID1| 9|
|ID1| 8|
|ID1| 7|
|ID1| 5|
|ID1| 5|
+---+------+
I am trying to create a new column, collecting the values as a set:
df = df.groupBy('ID').agg(collect_set('Values').alias('Value_set'))
df.show()
+---+------------+
| ID| Value_set|
+---+------------+
|ID1|[9, 5, 7, 8]|
+---+------------+
But the order is not maintained; my order should be [9, 8, 7, 5].
I solved it like this:
from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import ArrayType, IntegerType

df = df.groupby('ID').agg(collect_list('Values').alias('Values_List'))
df.show()

# dict.fromkeys drops duplicates while keeping first-seen order
def my_function(x):
    return list(dict.fromkeys(x))

udf_set = udf(my_function, ArrayType(IntegerType()))
df = df.withColumn("Values_Set", udf_set("Values_List"))
df.show(truncate=False)
+---+------------------+------------+
|ID |Values_List |Values_Set |
+---+------------------+------------+
|ID1|[9, 9, 8, 7, 5, 5]|[9, 8, 7, 5]|
+---+------------------+------------+
From the pyspark source code, the documentation for collect_set:
_collect_set_doc = """
Aggregate function: returns a set of objects with duplicate elements eliminated.
.. note:: The function is non-deterministic because the order of collected results depends
on order of rows which may be non-deterministic after a shuffle.
>>> df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df2.agg(collect_set('age')).collect()
[Row(collect_set(age)=[5, 2])]
"""
This means you will get unordered sets, which are backed by a hash table; see the notes on the 'order' of unordered Python sets for more background.
If your data is relatively small, you can coalesce it to 1 partition and sort it before collecting.
E.g., with columns relation,index:
cook,3
jone,1
sam,7
zack,4
tim,2
singh,9
ambani,5
ram,8
jack,0
nike,6
df.coalesce(1).sort("index").agg(collect_list("relation").alias("relation_list")).show(truncate=False)
relation_list
[jack, jone, tim, cook, zack, ambani, nike, sam, ram, singh]
Alternatively, you can apply the array_sort() function to your column if you use Spark 2.4 or above:
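A minimal sketch, starting again from the original df and assuming the descending order [9, 8, 7, 5] from the question is wanted (array_sort orders ascending, so the result is reversed):
from pyspark.sql import functions as F

df = df.groupBy('ID').agg(F.collect_set('Values').alias('Value_set'))
# array_sort orders ascending; reverse to get [9, 8, 7, 5]
df = df.withColumn('Value_set', F.reverse(F.array_sort('Value_set')))
df.show()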

How to get the minimum of nested lists in PySpark

See the following data frame for example,
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
df = spark.createDataFrame([[[1, 2, 3, 4]],[[0, 2, 4]],[[]],[[3]]])
df.show()
Then we have
+------------+
| _1|
+------------+
|[1, 2, 3, 4]|
| [0, 2, 4]|
| []|
| [3]|
+------------+
Then I want to find the minimum of each list, using -1 for an empty list. I tried the following, which does not work:
import pyspark.sql.functions as F
sim_col = F.col('_1')
df.withColumn('min_turn_sim', F.when(F.size(sim_col)==0, -1.0).otherwise(F.min(sim_col))).show()
The error is:
AnalysisException: "cannot resolve 'CASE WHEN (_1 IS NULL) THEN -1.0D ELSE min(_1) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;\n'Aggregate [_1#404, CASE WHEN isnull(_1#404) THEN -1.0 ELSE min(_1#404) END AS min_turn_sim#411]\n+- LogicalRDD [_1#404], false\n"
The size function works; I don't understand why 'min' does not:
df.withColumn('min_turn_sim', F.when(F.size(sim_col)==0, -1.0).otherwise(F.size(sim_col))).show()
+------------+------------+
| _1|min_turn_sim|
+------------+------------+
|[1, 2, 3, 4]| 4.0|
| [0, 2, 4]| 3.0|
| []| -1.0|
| [3]| 1.0|
+------------+------------+
min is an aggregate function - it operates on columns, not values. Therefore, min(sim_col) means the minimum array value across all rows in the scope (according to array ordering), not the minimum value in each row.
To find a minimum for each row you'll need a non-aggregate function. In the latest Spark versions (2.4.0 and later) this would be array_min (similarly array_max to get the maximum value):
df.withColumn("min_turn_sim", F.coalesce(F.array_min(sim_col), F.lit(-1)))
Earlier versions require a UDF:
@F.udf("long")
def long_array_min(xs):
    return min(xs) if xs else -1

df.withColumn("min_turn_sim", F.coalesce(long_array_min(sim_col), F.lit(-1)))

Filtering nested arrays based on values PySpark

I'm trying to filter GA sessions in PySpark based on their customDimensions. The data looks like:
+--------------------+--------------------+
| fullVisitorId| cd|
+--------------------+--------------------+
| 5823179578207509663|[[1, app_tv], [36...|
| 5220700153870728639|[[107, live], [10...|
|16421406313456036559|[[1, app_tv], [36...|
|18135892068782985696|[[1, app_tv], [36...|
| 5865612025708664451|[[1, app_tv], [36...|
| 8103574485485735385|[[1, web], [36, d...|
| 6603732532553270294|[[1, web], [36, m...|
| 70498423600813735|[[1, web], [36, d...|
| 5017675391641460547|[[1, web], [36, d...|
+--------------------+--------------------+
Using the GA schema, the cd (customDimensions) column holds an array of (index, value) pairs.
How can I efficiently select the fullVisitorIds that have, for example, an entry with index = 107 and value = 'live', as in the second row of the example?
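One hedged sketch, assuming cd is an array of structs with index and value fields as in the usual GA export schema: the exists higher-order function (available as a SQL expression since Spark 2.4) checks each row's array without exploding it.
from pyspark.sql import functions as F

# Keep visitors that have at least one custom dimension with index 107 and value 'live'
live_visitors = (df
    .filter(F.expr("exists(cd, x -> x.index = 107 AND x.value = 'live')"))
    .select("fullVisitorId"))
live_visitors.show(truncate=False)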

Resources