How to find the average length of each column in PySpark? [duplicate] - apache-spark

This question already has an answer here:
Apply a transformation to multiple columns pyspark dataframe
(1 answer)
Closed 4 years ago.
I have created data frame like below:
from pyspark.sql import Row
l = [('Ankit','25','Ankit','Ankit'),('Jalfaizy','2.2','Jalfaizy',"aa"),('saurabh','230','saurabh',"bb"),('Bala','26',"aa","bb")]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], ages=x[1],lname=x[2],mname=x[3]))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.show()
+----+--------+-----+--------+
|ages| lname|mname| name|
+----+--------+-----+--------+
| 25| Ankit|Ankit| Ankit|
| 2.2|Jalfaizy| aa|Jalfaizy|
| 230| saurabh| bb| saurabh|
| 26| aa| bb| Bala|
+----+--------+-----+--------+
I want to find the average length of each column, for all columns, i.e. the total number of characters in a particular column divided by the number of rows. My expected output is below:
+----+--------+-----+--------+
|ages| lname|mname| name|
+----+--------+-----+--------+
|2.5 | 5.5 | 2.75 | 6 |
+----+--------+-----+--------+

This is actually pretty straightforward. We will use a projection for the column length and an aggregation for the average:
from pyspark.sql.functions import length, col, avg
selection = ['lname','mname','name']
schemaPeople \
.select(*(length(col(c)).alias(c) for c in selection)) \
.agg(*(avg(col(c)).alias(c) for c in selection)).show()
# +-----+-----+----+
# |lname|mname|name|
# +-----+-----+----+
# | 5.5| 2.75| 6.0|
# +-----+-----+----+
This way, you'll be able to pass the names of the columns dynamically.
What we are doing here is unpacking the argument list (built from selection) with the * operator.
Reference : Control Flow Tools - Unpacking Argument Lists.
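The expected averages can be cross-checked without Spark. The snippet below is a minimal plain-Python sketch, not part of the answer above: it recomputes "total characters / number of rows" for every column of the question's sample data, including ages, which the selection above leaves out. The helper name average_lengths is made up for illustration.

```python
# Spark-free cross-check of the expected averages, using the question's sample rows.
rows = [
    ("Ankit", "25", "Ankit", "Ankit"),
    ("Jalfaizy", "2.2", "Jalfaizy", "aa"),
    ("saurabh", "230", "saurabh", "bb"),
    ("Bala", "26", "aa", "bb"),
]
# Same field order as the Row(name=x[0], ages=x[1], lname=x[2], mname=x[3]) call.
columns = ["name", "ages", "lname", "mname"]


def average_lengths(rows, columns):
    """Return {column: total characters in the column / number of rows}."""
    return {
        name: sum(len(row[i]) for row in rows) / len(rows)
        for i, name in enumerate(columns)
    }


avg_len = average_lengths(rows, columns)
print(avg_len)
# {'name': 6.0, 'ages': 2.5, 'lname': 5.5, 'mname': 2.75}
```

The numbers agree with the expected output in the question and with the three columns shown in the Spark result.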

I think you can just create new columns for the individual lengths and then aggregate over the whole dataframe. Then you would end up with something like:
from pyspark.sql.functions import length, col, avg
df_new = spark.createDataFrame([
( "25","Ankit","Ankit","Ankit"),( "2.2","Jalfaizy","aa","Jalfaizy"),
("230","saurabh","bb","saurabh") ,( "26","aa","bb","Bala")
], ("age", "lname","mname","name"))
df_new.withColumn("len_age",length(col("age"))).withColumn("len_lname",length(col("lname")))\
.withColumn("len_mname",length(col("mname"))).withColumn("len_name",length(col("name")))\
.groupBy().agg(avg("len_age"),avg("len_lname"),avg("len_mname"),avg("len_name")).show()
Result:
+------------+--------------+--------------+-------------+
|avg(len_age)|avg(len_lname)|avg(len_mname)|avg(len_name)|
+------------+--------------+--------------+-------------+
| 2.5| 5.5| 2.75| 6.0|
+------------+--------------+--------------+-------------+

In Scala it can be done this way; I assume the author can convert it to Python:
val averageColumnList = List("age", "lname", "mname", "name")
val columns = averageColumnList.map(name => avg(length(col(name))))
val result = df.select(columns: _*)

Related

Spark: use value of a groupBy column as a name for an aggregate column [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 3 years ago.
I want to give the aggregate column a name that contains the value of one of the groupBy columns:
dataset
.groupBy("user", "action")
.agg(collect_list("timestamp").name($"action" + "timestamps"))
The part .name($"action" + "timestamps") does not work because name expects a String, not a Column.
Based on: How to pivot Spark DataFrame?
val df = spark.createDataFrame(Seq(("U1","a",1), ("U2","b",2))).toDF("user", "action", "timestamp")
val res = df.groupBy("user", "action").pivot("action").agg(collect_list("timestamp"))
res.show()
+----+------+---+---+
|user|action| a| b|
+----+------+---+---+
| U1| a|[1]| []|
| U2| b| []|[2]|
+----+------+---+---+
Now the fun part: column renaming. We should rename all but the first 2 columns:
val renames = res.schema.names.drop(2).map (n => col(n).as(n + "_timestamp"))
res.select((col("user") +: renames): _*).show
+----+-----------+-----------+
|user|a_timestamp|b_timestamp|
+----+-----------+-----------+
| U1| [1]| []|
| U2| []| [2]|
+----+-----------+-----------+
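The renaming rule itself is independent of Spark. As a rough illustration in plain Python (the list of names is copied from the pivoted output above; rename_map and selected are hypothetical names), it amounts to this:

```python
# Column names of the pivoted result shown above.
names = ["user", "action", "a", "b"]

# Every column after the first two came from the pivot, so it gets the suffix.
rename_map = {n: n + "_timestamp" for n in names[2:]}

# The final select keeps "user" plus the renamed pivot columns.
selected = ["user"] + list(rename_map.values())
print(selected)
# ['user', 'a_timestamp', 'b_timestamp']
```

Note that, as in the Scala select above, the action column is dropped from the final projection.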

How to capture frequency of words after group by with pyspark

I have tabular data with keys and values, and the keys are not unique.
For example:
+-----+------+
| key | value|
+-----+------+
| 1 | the |
| 2 | i |
| 1 | me |
| 1 | me |
| 2 | book |
| 1 |table |
+-----+------+
Now assume this table is distributed across the different nodes in spark cluster.
How do I use PySpark to calculate the frequencies of the words with respect to the different keys? For instance, in the above example I wish to output:
+-----+------+-------------+
| key | value| frequencies |
+-----+------+-------------+
| 1 | the | 1/4 |
| 2 | i | 1/2 |
| 1 | me | 2/4 |
| 2 | book | 1/2 |
| 1 |table | 1/4 |
+-----+------+-------------+
Not sure if you can combine multi-level operations with DFs, but doing it in 2 steps and leaving concat to you, this works:
# Running in Databricks, not all stuff required
# You may want to convert values to upper or lower case first for better results.
from pyspark.sql import Row
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
data = [("1", "the"), ("2", "I"), ("1", "me"),
("1", "me"), ("2", "book"), ("1", "table")]
rdd = sc.parallelize(data)
someschema = rdd.map(lambda x: Row(c1=x[0], c2=x[1]))
df = sqlContext.createDataFrame(someschema)
df1 = df.groupBy("c1", "c2") \
.count()
df2 = df1.groupBy('c1') \
.sum('count')
df3 = df1.join(df2,'c1')
df3.show()
returns:
+---+-----+-----+----------+
| c1| c2|count|sum(count)|
+---+-----+-----+----------+
| 1|table| 1| 4|
| 1| the| 1| 4|
| 1| me| 2| 4|
| 2| I| 1| 2|
| 2| book| 1| 2|
+---+-----+-----+----------+
You can reformat the last 2 columns, but I am curious whether we can do it all in one go. In normal SQL I suspect we would use inline views and combine them.
This works across the cluster as standard, which is what Spark is generally all about. The groupBy takes it all into account.
As it is rather hot outside, I looked into this in a little more depth. This is a good overview: http://stevendavistechnotes.blogspot.com/2018/06/apache-spark-bi-level-aggregation.html. After reading this and experimenting I could not get it any more elegant, reducing to 5 rows of output all in 1 go appears not to be possible.
Another viable option is with window functions.
First, count the occurrences per (key, value) pair and per key. Then add another column with the fraction (the fractions will be reduced):
from pyspark.sql import Row
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import *
from fractions import Fraction
from pyspark.sql.functions import udf
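# Fraction reduces automatically, e.g. Fraction(2, 4) -> 1/2
@udf(StringType())
def getFraction(value_occurrence, key_occurrence):
    return str(Fraction(value_occurrence, key_occurrence))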
schema = StructType([StructField("key", IntegerType(), True),
StructField("value", StringType(), True)])
data = [(1, "the"), (2, "I"), (1, "me"),
(1, "me"), (2, "book"), (1, "table")]
spark = SparkSession.builder.appName('myPython').getOrCreate()
input_df = spark.createDataFrame(data, schema)
(input_df.withColumn("key_occurrence",
F.count(F.lit(1)).over(Window.partitionBy(F.col("key"))))
.withColumn("value_occurrence", F.count(F.lit(1)).over(Window.partitionBy(F.col("value"), F.col('key'))))
.withColumn("frequency", getFraction(F.col("value_occurrence"), F.col("key_occurrence"))).dropDuplicates().show())
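One detail is worth checking outside Spark: the question's expected output shows unreduced fractions such as 2/4, while Python's Fraction always reduces (2/4 becomes 1/2). The following plain-Python sketch, which assumes the same six sample rows and uses made-up variable names, computes both forms so the difference is visible:

```python
from collections import Counter
from fractions import Fraction

data = [(1, "the"), (2, "I"), (1, "me"), (1, "me"), (2, "book"), (1, "table")]

pair_counts = Counter(data)                    # occurrences of each (key, value)
key_counts = Counter(key for key, _ in data)   # occurrences of each key

# For every distinct (key, value): (unreduced string, reduced Fraction string).
frequencies = {
    (key, value): (f"{n}/{key_counts[key]}", str(Fraction(n, key_counts[key])))
    for (key, value), n in pair_counts.items()
}

print(frequencies[(1, "me")])
# ('2/4', '1/2')
```

If the unreduced form is what is wanted, plain string formatting of the two counts is enough and Fraction is not needed.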

MySQL sum over a window that contains a null value returns null

I am trying to get the sum of Revenue over the last 3 Month rows (excluding the current row) for each Client. Minimal example with current attempt in Databricks:
import numpy as np
import pandas as pd
cols = ['Client','Month','Revenue']
df_pd = pd.DataFrame([['A',201701,100],
['A',201702,101],
['A',201703,102],
['A',201704,103],
['A',201705,104],
['B',201701,201],
['B',201702,np.nan],
['B',201703,203],
['B',201704,204],
['B',201705,205],
['B',201706,206],
['B',201707,207]
])
df_pd.columns = cols
spark_df = spark.createDataFrame(df_pd)
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql
""")
df_out.show()
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| NaN| 201.0|
| B|201703| 203.0| NaN|
| B|201704| 204.0| NaN|
| B|201705| 205.0| NaN|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
As you can see, if a null value exists anywhere in the 3 month window, a null value is returned. I would like to treat nulls as 0, hence the ifnull attempt, but this does not seem to work. I have also tried a case statement to change NULL to 0, with no luck.
Just coalesce outside sum:
df_out = sqlContext.sql("""
select *, coalesce(sum(Revenue) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding), 0) as Total_Sum3
from df_sql
""")
It is Apache Spark, my bad! (am working in Databricks and I thought it was MySQL under the hood). Is it too late to change the title?
@Barmar, you are right that IFNULL() doesn't treat NaN as null. I managed to figure out the fix thanks to @user6910411 from here: SO link. I had to change the numpy NaNs to Spark nulls. The corrected code, starting after the sample df_pd is created:
spark_df = spark.createDataFrame(df_pd)
from pyspark.sql.functions import isnan, col, when
#this converts all NaNs in numeric columns to null:
spark_df = spark_df.select([
when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c
for c, t in spark_df.dtypes])
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql order by Client,Month
""")
df_out.show()
which then gives the desired:
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| null| 201.0|
| B|201703| 203.0| 201.0|
| B|201704| 204.0| 404.0|
| B|201705| 205.0| 407.0|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
Is sqlContext the best way to approach this or would it be better / more elegant to achieve the same result via pyspark.sql.window?
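To see why the NaN has to be handled explicitly, here is a small plain-Python sketch (no Spark; trailing_sum is a hypothetical helper) that mimics the frame "rows between 3 preceding and 1 preceding" for client B. A NaN propagates through an ordinary sum, whereas replacing it with 0 reproduces the desired Total_Sum3 column:

```python
import math

# Revenue for client B, months 201701..201707, with the missing value as NaN.
revenue_b = [201.0, float("nan"), 203.0, 204.0, 205.0, 206.0, 207.0]


def trailing_sum(values, preceding=3):
    """Sum the `preceding` rows before each row (current row excluded), NaN as 0."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - preceding):i]
        if not window:
            out.append(None)  # empty frame: no value, like the null in the first row
        else:
            out.append(sum(0.0 if math.isnan(v) else v for v in window))
    return out


print(math.isnan(sum(revenue_b[0:3])))  # True: one NaN poisons a plain sum
result = trailing_sum(revenue_b)
print(result)
# [None, 201.0, 201.0, 404.0, 407.0, 612.0, 615.0]
```

This only illustrates the arithmetic; it says nothing about whether the SQL string or the pyspark.sql.window API is the better way to express it in Spark.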

How to find max value Alphabet from DataFrame apache spark?

I am trying to get the max-value letter from a DataFrame as a whole. I am not interested in which row or column it came from, only in the single max value within the DataFrame.
This is what it looks like:
id conditionName
1 C
2 b
3 A
4 A
5 A
expected result is:
+--+-------------+
|id|conditionName|
+--+-------------+
| 3| A |
| 4| A |
| 5| A |
+--+-------------+
because 'A' is the first letter of the alphabet
df= df.withColumn("conditionName", col("conditionName").cast("String"))
.groupBy("id,conditionName").max("conditionName");
df.show(false);
Exception: "conditionName" is not a numeric column. Aggregation function can only be applied on a numeric column.;
I need the max letter across the entire DataFrame.
What should I use to get the desired result?
Thanks in advance!
You can sort your DataFrame by your string column, grab the first value and use it to filter your original data:
from pyspark.sql.functions import lower, desc, first
# we need lower() because ordering strings is case sensitive
first_letter = df.orderBy((lower(df["condition"]))) \
.groupBy() \
.agg(first("condition").alias("condition")) \
.collect()[0][0]
df.filter(df["condition"] == first_letter).show()
#+---+---------+
#| id|condition|
#+---+---------+
#| 3| A|
#| 4| A|
#| 5| A|
#+---+---------+
Or more elegantly using Spark SQL:
df.registerTempTable("table")
# a multi-line SQL string needs triple quotes in Python
sqlContext.sql("""SELECT *
FROM table
WHERE lower(condition) = (SELECT min(lower(condition))
FROM table)
""")
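The same "find the smallest lower-cased value, then filter on it" logic can be sketched in plain Python to confirm the expected rows. This assumes the five sample rows from the question and is only a check of the logic, not Spark code:

```python
# (id, conditionName) rows from the question.
rows = [(1, "C"), (2, "b"), (3, "A"), (4, "A"), (5, "A")]

# Case-insensitive minimum, mirroring min(lower(condition)) in the SQL version.
smallest = min(name.lower() for _, name in rows)

# Keep every row whose lower-cased value equals that minimum.
matches = [(i, name) for i, name in rows if name.lower() == smallest]
print(matches)
# [(3, 'A'), (4, 'A'), (5, 'A')]
```

Comparing on the lower-cased value matters because plain string ordering puts every uppercase letter before every lowercase one.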

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('Foo',41,'US',3),
('Foo',39,'UK',1),
('Bar',57,'CA',2),
('Bar',72,'CA',2),
('Baz',22,'US',6),
('Baz',36,'US',6)])
What I would like to do is remove duplicate rows based on the values of the first, third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
and either row 5 or row 6 will be removed
But how do I remove duplicate rows based on columns 1, 3 and 4 only? I.e. remove either one of these:
('Baz',22,'US',6)
('Baz',36,'US',6)
In pandas, this could be done by specifying columns with .drop_duplicates(). How can I achieve the same in Spark/PySpark?
Pyspark does include a dropDuplicates() method, which was introduced in 1.4. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html
>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
| 10| 80|Alice|
+---+------+-----+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
+---+------+-----+
From your question, it is unclear as to which columns you want to use to determine duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then, you can use the reduceByKey or reduce operations to eliminate duplicates.
Here is some code to get you started:
def get_key(x):
    return "{0}{1}{2}".format(x[0],x[2],x[3])
m = data.map(lambda x: (get_key(x),x))
Now, you have a key-value RDD that is keyed by columns 1,3 and 4.
The next step would be either a reduceByKey or groupByKey and filter.
This would eliminate duplicates.
r = m.reduceByKey(lambda x,y: (x))
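One caveat with a concatenated string key is that different rows can collide. The plain-Python sketch below is an illustration under assumptions, not the answer's code: it uses a tuple key instead and keeps the first row seen per key, whereas Spark's reduceByKey does not promise which duplicate survives. The names string_key and tuple_key are hypothetical.

```python
data = [("Foo", 41, "US", 3), ("Foo", 39, "UK", 1), ("Bar", 57, "CA", 2),
        ("Bar", 72, "CA", 2), ("Baz", 22, "US", 6), ("Baz", 36, "US", 6)]


def string_key(x):
    # Same scheme as get_key above: values concatenated with no separator.
    return "{0}{1}{2}".format(x[0], x[2], x[3])


def tuple_key(x):
    # A tuple keeps the three values distinct, so no accidental collisions.
    return (x[0], x[2], x[3])


# Two different rows that the string key cannot tell apart ("a123" both times).
collides = string_key(("a", 0, "1", 23)) == string_key(("a", 0, "12", 3))

deduped = {}
for row in data:
    deduped.setdefault(tuple_key(row), row)  # keep the first row per key
result = list(deduped.values())

print(collides)
# True
print(result)
# [('Foo', 41, 'US', 3), ('Foo', 39, 'UK', 1), ('Bar', 57, 'CA', 2), ('Baz', 22, 'US', 6)]
```

In an RDD pipeline the same idea would mean returning a tuple from the key function; tuples are hashable, so they are a natural key for this kind of grouping.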
I know you already accepted the other answer, but if you want to do this as a
DataFrame, just use groupBy and agg. Assuming you had a DF already created (with columns named "col1", "col2", etc) you could do:
myDF.groupBy($"col1", $"col3", $"col4").agg($"col1", max($"col2"), $"col3", $"col4")
Note that in this case, I chose the Max of col2, but you could do avg, min, etc.
Agree with David. To add on, it may not be the case that we want to groupBy all columns other than the column(s) in aggregate function i.e, if we want to remove duplicates purely based on a subset of columns and retain all columns in the original dataframe. So the better way to do this could be using dropDuplicates Dataframe api available in Spark 1.4.0
For reference, see: https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrame
I used inbuilt function dropDuplicates(). Scala code given below
val data = sc.parallelize(List(("Foo",41,"US",3),
("Foo",39,"UK",1),
("Bar",57,"CA",2),
("Bar",72,"CA",2),
("Baz",22,"US",6),
("Baz",36,"US",6))).toDF("x","y","z","count")
data.dropDuplicates(Array("x","count")).show()
Output :
+---+---+---+-----+
| x| y| z|count|
+---+---+---+-----+
|Baz| 22| US| 6|
|Foo| 39| UK| 1|
|Foo| 41| US| 3|
|Bar| 57| CA| 2|
+---+---+---+-----+
The program below drops fully duplicated rows, and also shows how to drop duplicates based on certain columns only:
import org.apache.spark.sql.SparkSession
object DropDuplicates {
def main(args: Array[String]) {
val spark =
SparkSession.builder()
.appName("DataFrame-DropDuplicates")
.master("local[4]")
.getOrCreate()
import spark.implicits._
// create an RDD of tuples with some data
val custs = Seq(
(1, "Widget Co", 120000.00, 0.00, "AZ"),
(2, "Acme Widgets", 410500.00, 500.00, "CA"),
(3, "Widgetry", 410500.00, 200.00, "CA"),
(4, "Widgets R Us", 410500.00, 0.0, "CA"),
(3, "Widgetry", 410500.00, 200.00, "CA"),
(5, "Ye Olde Widgete", 500.00, 0.0, "MA"),
(6, "Widget Co", 12000.00, 10.00, "AZ")
)
val customerRows = spark.sparkContext.parallelize(custs, 4)
// convert RDD of tuples to DataFrame by supplying column names
val customerDF = customerRows.toDF("id", "name", "sales", "discount", "state")
println("*** Here's the whole DataFrame with duplicates")
customerDF.printSchema()
customerDF.show()
// drop fully identical rows
val withoutDuplicates = customerDF.dropDuplicates()
println("*** Now without duplicates")
withoutDuplicates.show()
val withoutPartials = customerDF.dropDuplicates(Seq("name", "state"))
println("*** Now without partial duplicates too")
withoutPartials.show()
}
}
This is my DataFrame; the value 4 is repeated twice, so dropDuplicates will remove the repeated value.
scala> df.show
+-----+
|value|
+-----+
| 1|
| 4|
| 3|
| 5|
| 4|
| 18|
+-----+
scala> val newdf=df.dropDuplicates
scala> newdf.show
+-----+
|value|
+-----+
| 1|
| 3|
| 5|
| 4|
| 18|
+-----+
