spark groupby on several columns at same time - apache-spark

I’m using Spark 2.0.
I have a dataframe with several columns, such as id, latitude, longitude, and time.
I want to do a groupBy and keep "latitude" and "longitude" always together.
Could I do the following?
df.groupBy('id', ['latitude', 'longitude'], 'time')
I want to count the number of records for each user, at each different time, and at each different ("latitude", "longitude") location.

You can combine the "latitude" and "longitude" columns into one and then use groupBy. The sample below uses Scala.
val df = Seq(("1","33.33","35.35","8:00"),("2","31.33","39.35","9:00"),("1","33.33","35.35","8:00")).toDF("id","latitude","longitude","time")
df.show()
val df1 = df.withColumn("lat-long",array($"latitude",$"longitude"))
df1.show()
val df2 = df1.groupBy("id","lat-long","time").count()
df2.show()
The output will look like this:
+---+--------------+----+-----+
| id|      lat-long|time|count|
+---+--------------+----+-----+
|  2|[31.33, 39.35]|9:00|    1|
|  1|[33.33, 35.35]|8:00|    2|
+---+--------------+----+-----+

You can just use:
df.groupBy('id', 'latitude', 'longitude', 'time').agg(...)
This will work as expected without any additional steps.
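Since the thread mixes Scala and PySpark, here is a plain-Python sketch (toy data, not Spark code) of what a multi-column groupBy followed by count does: it counts rows per tuple of key values, so listing 'latitude' and 'longitude' as separate grouping keys already keeps each (latitude, longitude) pair together.

```python
from collections import Counter

# Toy rows mirroring the Scala sample: (id, latitude, longitude, time)
rows = [
    ("1", "33.33", "35.35", "8:00"),
    ("2", "31.33", "39.35", "9:00"),
    ("1", "33.33", "35.35", "8:00"),
]

# A multi-column groupBy + count is just counting by the tuple of key values.
counts = Counter((id_, lat, lon, t) for id_, lat, lon, t in rows)

for key, n in sorted(counts.items()):
    print(key, n)
```

The same rows that the Scala answer groups into counts of 2 and 1 come out identically here, because the composite key is the whole tuple.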

Related

Extract values from a complex column in PySpark

I have a PySpark dataframe with a complex column; see the value below:
ID  value
1   [{"label":"animal","value":"cat"},{"label":null,"value":"George"}]
I want to add a new column to the PySpark dataframe that converts this into a list of strings. If label is null, the string should contain only the value; if label is not null, the string should be "label:value". For the example above, the output should look like:
ID  new_column
1   ["animal:cat", "George"]
You can use transform to transform each array element into a string, which is constructed using concat_ws:
df2 = df.selectExpr(
    'id',
    "transform(value, x -> concat_ws(':', x['label'], x['value'])) as new_column"
)
df2.show()
+---+--------------------+
| id|          new_column|
+---+--------------------+
|  1|[animal:cat, George]|
+---+--------------------+
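For readers unfamiliar with concat_ws, here is a plain-Python sketch (illustrative only, not Spark code) of what the transform expression computes per array element: concat_ws(':', ...) joins its non-null arguments with ':', so a null label leaves just the bare value.

```python
# Toy input mirroring the question's complex column value.
value = [
    {"label": "animal", "value": "cat"},
    {"label": None, "value": "George"},
]

# concat_ws skips NULL arguments, so a null label produces no separator.
new_column = [
    ":".join(s for s in (x["label"], x["value"]) if s is not None)
    for x in value
]

print(new_column)  # ['animal:cat', 'George']
```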

How to select distinct rows from a Spark Window partition

I have a sample DF with duplicate rows like this:
+-------------------+--------------------+---+----------+---+----------+
|ID                 |CL_ID               |NBR|DT        |TYP|KEY       |
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477|10 |2019-04-24|N  |0000000000|
|1000031075_20190422|10017157594301072477|10 |2019-04-24|N  |0000000000|
|1006473016_20190421|10577157412800147475|11 |2019-04-21|N  |0000000000|
|1006473016_20190421|10577157412800147475|11 |2019-04-21|N  |0000000000|
+-------------------+--------------------+---+----------+---+----------+
val w = Window.partitionBy($"ENCOUNTER_ID")
Using the above Spark Window partition, is it possible to select distinct rows? I am expecting the output DF as:
+-------------------+--------------------+---+----------+---+----------+
|ID                 |CL_ID               |NBR|DT        |TYP|KEY       |
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477|10 |2019-04-24|N  |0000000000|
|1006473016_20190421|10577157412800147475|11 |2019-04-21|N  |0000000000|
+-------------------+--------------------+---+----------+---+----------+
I don't want to use DF.DISTINCT or DF.DROPDUPLICATES as it would involve shuffling.
I prefer not to use lag or lead because, in real-time, the order of rows can't be guaranteed.
A window function also shuffles data, so if all of your columns are duplicated, df.dropDuplicates is the better option. If your use case calls for a window function, you can use the approach below.
scala> df.show()
+-------------------+--------------------+---+----------+---+----------+
| ID| CL_ID|NBR| DT|TYP| KEY|
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477| 10|2019-04-24| N|0000000000|
|1000031075_20190422|10017157594301072477| 10|2019-04-24| N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21| N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21| N|0000000000|
+-------------------+--------------------+---+----------+---+----------+
// Put the columns that need to be checked for duplicates in partitionBy, and use a suitable orderBy as well; this is just a sample window.
scala> val W = Window.partitionBy(col("ID"),col("CL_ID"),col("NBR"),col("DT"), col("TYP"), col("KEY")).orderBy(lit(1))
scala> df.withColumn("duplicate", when(row_number.over(W) === lit(1), lit("Y")).otherwise(lit("N")))
.filter(col("duplicate") === lit("Y"))
.drop("duplicate")
.show()
+-------------------+--------------------+---+----------+---+----------+
| ID| CL_ID|NBR| DT|TYP| KEY|
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477| 10|2019-04-24| N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21| N|0000000000|
+-------------------+--------------------+---+----------+---+----------+
An answer to your question that scales up well with big data:
df.dropDuplicates(your key columns here; ID in this case)
A window function shuffles data too, but if you have duplicate entries and want to choose which one to keep, or want to sum the values of the duplicates, then a window function is the way to go:
w = Window.partitionBy('id')
df.select(first(value_col).over(w))  # you can use max, min, sum, first, or last, depending on how you want to treat duplicates
An interesting third possibility, if you want to keep the values of the duplicates (for the record), is to add the following beforehand:
df.withColumn('dup_values', collect_list(value_col).over(w))
This creates an extra column with an array per row, so the duplicate values are kept even after you get rid of the duplicate rows.
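The row_number-over-window trick above boils down to "keep the first row per partition key". A plain-Python sketch of those semantics (toy rows, not Spark code):

```python
# Toy rows mirroring the question's duplicated data (a subset of columns).
rows = [
    ("1000031075_20190422", "10017157594301072477", 10),
    ("1000031075_20190422", "10017157594301072477", 10),
    ("1006473016_20190421", "10577157412800147475", 11),
]

# Partition on all columns (as in the sample window); keep only the row
# whose row_number within its partition would be 1, i.e. the first seen.
seen = set()
deduped = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

print(len(deduped))  # 2
```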

Grouping by name and then adding up the number of another column [duplicate]

I am using pyspark to read a parquet file like below:
my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')
Then when I do my_df.take(5), it will show [Row(...)], instead of a table format like when we use the pandas data frame.
Is it possible to display the data frame in a table format like pandas data frame? Thanks!
The show method does what you're looking for.
For example, given the following dataframe of 3 rows, I can print just the first two rows like this:
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
df.show(n=2)
which yields:
+---+---+
| k| v|
+---+---+
|foo| 1|
|bar| 2|
+---+---+
only showing top 2 rows
As mentioned by @Brent in the comment of @maxymoo's answer, you can try
df.limit(10).toPandas()
to get a prettier table in Jupyter. But this can take some time to run if you are not caching the spark dataframe. Also, .limit() will not keep the order of original spark dataframe.
Let's say we have the following Spark DataFrame:
df = sqlContext.createDataFrame(
    [
        (1, "Mark", "Brown"),
        (2, "Tom", "Anderson"),
        (3, "Joshua", "Peterson")
    ],
    ('id', 'firstName', 'lastName')
)
There are typically three different ways to print the content of a dataframe:
Print Spark DataFrame
The most common way is to use show() function:
>>> df.show()
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
| 1| Mark| Brown|
| 2| Tom|Anderson|
| 3| Joshua|Peterson|
+---+---------+--------+
Print Spark DataFrame vertically
Say that you have a fairly large number of columns and your dataframe doesn't fit in the screen. You can print the rows vertically - For example, the following command will print the top two rows, vertically, without any truncation.
>>> df.show(n=2, truncate=False, vertical=True)
-RECORD 0-------------
id | 1
firstName | Mark
lastName | Brown
-RECORD 1-------------
id | 2
firstName | Tom
lastName | Anderson
only showing top 2 rows
Convert to Pandas and print Pandas DataFrame
Alternatively, you can convert your Spark DataFrame into a Pandas DataFrame using .toPandas() and finally print() it.
>>> df_pd = df.toPandas()
>>> print(df_pd)
   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson
Note that this is not recommended when you have to deal with fairly large dataframes, as Pandas needs to load all the data into memory. If this is the case, the following configuration will help when converting a large spark dataframe to a pandas one:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
For more details you can refer to my blog post Speeding up the conversion between PySpark and Pandas DataFrames
Yes: call the toPandas method on your dataframe and you'll get an actual pandas dataframe !
By default, the show() function prints 20 records of the DataFrame. You can set how many rows to print by passing an argument to show(). Since you never know in advance how many rows the DataFrame will have, you can pass df.count() as the argument, which prints all records.
df.show() --> prints 20 records by default
df.show(30) --> prints 30 records according to argument
df.show(df.count()) --> get total row count and pass it as argument to show
If you are using Jupyter, this is what worked for me:
[1]
df= spark.read.parquet("s3://df/*")
[2]
dsp = df
[3]
%%display
dsp
This shows a well-formatted HTML table; you can also draw some simple charts on it straight away. For more documentation of %%display, type %%help.
Maybe something like this is a tad more elegant:
df.display()
# OR
df.select('column1').display()

Python Spark: difference between .distinct().count() and countDistinct() [duplicate]

Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count?
Is the difference the same as between select count(distinct member_id) and select distinct count(member_id)?
Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count?
Because .distinct.count is the same as:
SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM table)
while ..agg(countDistinct("member_id") as "count") is
SELECT COUNT(DISTINCT member_id) FROM table
and COUNT(*) uses different rules than COUNT(column) when nulls are encountered.
df.agg(countDistinct("member_id") as "count")
returns the number of distinct values of the member_id column, ignoring all other columns, while
df.distinct.count
will count the number of distinct records in the DataFrame - where "distinct" means identical in values of all columns.
So, for example, the DataFrame:
+-----------+---------+
|member_name|member_id|
+-----------+---------+
| a| 1|
| b| 1|
| b| 1|
+-----------+---------+
has only one distinct member_id value but two distinct records, so the agg option would return 1 while the latter would return 2.
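That difference can be sketched in plain Python (toy data, not Spark code): one expression counts distinct values of a single column, the other counts distinct whole rows.

```python
# Toy rows mirroring the example DataFrame above.
rows = [
    {"member_name": "a", "member_id": 1},
    {"member_name": "b", "member_id": 1},
    {"member_name": "b", "member_id": 1},
]

# countDistinct("member_id"): distinct values of one column, other columns ignored.
count_distinct_ids = len({r["member_id"] for r in rows})

# df.distinct().count(): distinct whole rows, i.e. identical in every column.
distinct_rows = len({tuple(sorted(r.items())) for r in rows})

print(count_distinct_ids, distinct_rows)  # 1 2
```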
1st command:
DF.agg(countDistinct("member_id") as "count")
returns the same as select count(distinct member_id) from DF.
2nd command:
DF.distinct.count
actually gets the distinct records (i.e. removes all duplicates from the DF) and then takes the count.

Summing multiple columns in Spark

How can I sum multiple columns in Spark? For example, in SparkR the following code works to get the sum of one column, but if I try to get the sum of both columns in df, I get an error.
# Create SparkDataFrame
df <- createDataFrame(faithful)
# Use agg to sum total waiting times
head(agg(df, totalWaiting = sum(df$waiting)))
##This works
# Use agg to sum total of waiting and eruptions
head(agg(df, total = sum(df$waiting, df$eruptions)))
##This doesn't work
Either SparkR or PySpark code will work.
For PySpark, if you don't want to explicitly type out the columns:
from operator import add
from functools import reduce
from pyspark.sql import functions as F
new_df = df.withColumn('total', reduce(add, [F.col(x) for x in numeric_col_list]))
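A plain-Python sketch of what that reduce builds (toy row, not Spark code): folding + over the column list is the same as writing col_a + col_b + ... by hand; in Spark the fold runs over Column expressions, here it runs over a row's values.

```python
from operator import add
from functools import reduce

# Hypothetical row and column list, standing in for the DataFrame's
# numeric columns (numeric_col_list in the snippet above).
row = {"col_a": 1, "col_b": 10, "col_c": 100}
numeric_col_list = ["col_a", "col_b", "col_c"]

# reduce(add, [...]) folds left-to-right: ((1 + 10) + 100)
total = reduce(add, [row[c] for c in numeric_col_list])
print(total)  # 111
```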
you can do something like the below in pyspark
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([("a",1,10), ("b",2,20), ("c",3,30), ("d",4,40)], ["col1", "col2", "col3"])
>>> df.groupBy("col1").agg(F.sum(df.col2+df.col3)).show()
+----+------------------+
|col1|sum((col2 + col3))|
+----+------------------+
| d| 44|
| c| 33|
| b| 22|
| a| 11|
+----+------------------+
org.apache.spark.sql.functions.sum(Column e)
Aggregate function: returns the sum of all values in the expression.
As you can see, sum takes just one column as input, so sum(df$waiting, df$eruptions) won't work. Since you want to sum up the numeric fields, you can do sum(df("waiting") + df("eruptions")). If you want to sum values for individual columns, you can use df.agg(sum(df$waiting), sum(df$eruptions)).show
sparkR code:
library(SparkR)
df <- createDataFrame(sqlContext,faithful)
w <- list(agg(df, sum(df$waiting)), agg(df, sum(df$eruptions)))
head(w[[1]])
head(w[[2]])
You can use expr():
import pyspark.sql.functions as f
numeric_cols = ['col_a','col_b','col_c']
df = df.withColumn('total', f.expr('+'.join(numeric_cols)))
PySpark expr() is a SQL function to execute SQL-like expressions.
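A quick plain-Python check (not Spark code) of what expr() receives: joining the column names produces a single SQL expression string, which Spark then parses and evaluates per row.

```python
# Hypothetical column names, as in the snippet above.
numeric_cols = ['col_a', 'col_b', 'col_c']

# '+'.join(...) builds the SQL expression text that expr() parses.
sql_expr = '+'.join(numeric_cols)
print(sql_expr)  # col_a+col_b+col_c
```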
