Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count?
Is the difference the same as between select count(distinct member_id) and select distinct count(member_id)?
Because .distinct.count is equivalent to:
SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM table)
while ..agg(countDistinct("member_id") as "count") is
SELECT COUNT(DISTINCT member_id) FROM table
and COUNT(*) uses different rules than COUNT(column) when nulls are encountered.
df.agg(countDistinct("member_id") as "count")
returns the number of distinct values of the member_id column, ignoring all other columns, while
df.distinct.count
will count the number of distinct records in the DataFrame, where "distinct" means rows with identical values in all columns.
So, for example, the DataFrame:
+-----------+---------+
|member_name|member_id|
+-----------+---------+
|          a|        1|
|          b|        1|
|          b|        1|
+-----------+---------+
has only one distinct member_id value but two distinct records, so the agg option would return 1 while distinct.count would return 2.
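A minimal sketch of the difference, assuming a spark-shell session (so toDF is available) and the sample data above:
import org.apache.spark.sql.functions.countDistinct

// sample data matching the table above
val df = Seq(("a", 1), ("b", 1), ("b", 1)).toDF("member_name", "member_id")

// counts distinct member_id values only -> 1
df.agg(countDistinct("member_id") as "count").show()

// counts distinct whole rows: (a,1) and (b,1) -> 2
println(df.distinct.count)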
1st command :
DF.agg(countDistinct("member_id") as "count")
returns the same as SELECT COUNT(DISTINCT member_id) FROM DF.
2nd command :
DF.distinct.count
actually gets the distinct records, i.e. removes all duplicates from the DF, and then takes the count.
Related
I have a use case which I am trying to implement using a Spark solution in AWS Glue.
I have one table which has a query stored as a column value, which I need to run from the script.
For example: Select src_query from table;
This gives me another query, mentioned below:
select tabl2.col1, tabl3.col2 from table2 join table3;
Now I want to collect the result of this second query in a dataframe and proceed further.
source_df = spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("dbtable", "table1").option("user", Oracle_Username).option("password", Oracle_Password).load()
Now when we run this, the data from table1 gets stored in source_df. One of the columns of table1 stores a SQL query, for example: select col1,col2 from tabl2;
Now I want to run the query mentioned above and store its result in a dataframe, something like:
final_df2 = spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("query", "select col1,col2 from tabl2").option("user", Oracle_Username).option("password", Oracle_Password).load()
How can I get the query from the dataframe and run it to fetch the result into another dataframe?
You can use the code below when the source table has a small number of rows, since we use collect to bring all the queries from the source table to the driver.
import org.apache.spark.sql.functions._
// created sample secondary tables found in the query
Seq(("A","01/01/2022",1), ("AXYZ","02/01/2022",1), ("AZYX","03/01/2022",1),("AXYZ","04/01/2022",0), ("AZYX","05/01/2022",0),("AB","06/01/2022",1), ("A","07/01/2022",0) )
.toDF("Category", "date", "Indictor")
.write.mode("overwrite").saveAsTable("table1")
Seq(("A","01/01/2022",1), ("b","02/01/2022",0), ("c","03/01/2022",1) )
.toDF("Category", "date", "Indictor")
.write.mode("overwrite").saveAsTable("table2")
//create the source dataframe
val df=Seq( (1,"select Category from table1"), (2,"select date from table2") )
.toDF("Sno", "Query")
//extract the query from the source table.
val qrys = df.select("Query").collect()
qrys.foreach(println)
//execute the queries in column and save it as tables
qrys.map(elm=>spark.sql(elm.mkString).write.mode("overwrite").saveAsTable("newtbl"+qrys.indexOf(elm)))
//select from the new tables.
spark.sql("select * from newtbl0").show
spark.sql("select * from newtbl1").show
Output:
[select Category from table1]
[select date from table2]
+--------+
|Category|
+--------+
|       A|
|    AXYZ|
|    AZYX|
|    AXYZ|
|    AZYX|
|      AB|
|       A|
+--------+
+----------+
|      date|
+----------+
|01/01/2022|
|02/01/2022|
|03/01/2022|
+----------+
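If the stored queries need to run against the source database itself rather than against tables registered in Spark, a similar loop can push each collected query down through the JDBC reader. This is only a sketch under the same small-source-table assumption; Oracle_jdbc_url, Oracle_Username and Oracle_Password are the placeholder connection settings from the question:
// run each stored query on the source database via the JDBC "query" option
val queries = df.select("Query").collect().map(_.getString(0))
val resultDfs = queries.map { q =>
  spark.read.format("jdbc")
    .option("url", Oracle_jdbc_url)
    .option("query", q)
    .option("user", Oracle_Username)
    .option("password", Oracle_Password)
    .load()
}
resultDfs.foreach(_.show())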
I have a pyspark dataframe like this:
+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
|333| null|   CT|
+---+-----+-----+
For a given id, I would like to keep a record even if its "name" is null, as long as that id is not repeated. If the id is repeated, I want to make sure the name column does not contain duplicates within that id, and also remove rows where "name" is null, but only for the repeated ids. Below is the desired output:
+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
+---+-----+-----+
How can I achieve this in PySpark?
You can do this by grouping by the id column and counting the number of names in each group. Null values are ignored by count in Spark, so any group with a count of 0 should keep its rows. We can then filter away the nulls in groups with a count larger than 0.
In Scala this can be done with a window function as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

val w = Window.partitionBy("id")
val df2 = df.withColumn("gCount", count($"name").over(w))
  .filter($"name".isNotNull or $"gCount" === 0)
  .drop("gCount")
The PySpark equivalent:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count

w = Window.partitionBy("id")
df2 = (df.withColumn("gCount", count("name").over(w))
         .filter((col("name").isNotNull()) | (col("gCount") == 0))
         .drop("gCount"))
The above will not remove rows that have multiple nulls for the same id (all these will be kept).
If these should be removed as well, keeping only a single row with name==null, an easy way would be to use .dropDuplicates(['id','name']) before or after running the above code. Note that this also will remove any other duplicates (in which case .dropDuplicates(['id','name', 'state']) could be preferable).
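For reference, here is a self-contained sketch of the window approach in Scala, assuming a spark-shell session and data matching the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

// sample data matching the question; the missing names are real nulls
val df = Seq[(Int, String, String)](
  (111, null, "CT"), (222, "name1", "CT"), (222, "name2", "CT"),
  (333, "name3", "CT"), (333, "name4", "CT"), (333, null, "CT")
).toDF("id", "name", "state")

val w = Window.partitionBy("id")
df.withColumn("gCount", count(col("name")).over(w))
  .filter(col("name").isNotNull or col("gCount") === 0)
  .drop("gCount")
  .show()
// keeps the null row for id 111 (no non-null name in that group)
// and drops the null row for id 333 (that group already has names)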
I think you can do that in two steps. First, count values by id
import pyspark.sql.window as psw
import pyspark.sql.functions as psf

w = psw.Window.partitionBy("id")
df = df.withColumn("n", psf.sum(psf.lit(1)).over(w))
Then filter to remove nulls when n > 1:
df.filter(~((psf.col('name').isNull()) & (psf.col('n') > 1)))
Edit
As mentioned by @Shubham Jain, if you have several null values for name (duplicates), the above filter will keep them. In that case, the solution proposed by @Shaido is useful: add a post-treatment using .dropDuplicates(['id','name']), or .dropDuplicates(['id','name','state']), depending on your preference.
I have a dataframe having records like below:
+---+----+----+
|id |L1 |L2 |
+---+----+----+
|101|202 |null|
|101|null|303 |
+---+----+----+
Is there a simple way to groupBy and get the result like below in Spark SQL:
+---+----+----+
|id |L1 |L2 |
+---+----+----+
|101|202 |303 |
+---+----+----+
Thanks.
Use max or min to aggregate the data. Since you only have a single valid value, this is the one that will be selected. Note that it's not possible to use first here (which is faster) since that can still return null values.
When the columns are of numeric types it can be solved as follows:
df.groupBy("id").agg(max($"L1").as("L1"), max($"L2").as("L2"))
However, if you are dealing with strings, you need to collect all values as a list (or set) and then use coalesce:
df.groupBy("id")
.agg(coalesce(collect_list($"L1")).as("L1"), coalesce(collect_list($"L2")).as("L2"))
Of course, this assumes that the nulls are not strings but actual nulls.
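A short sketch of the numeric case, assuming a spark-shell session (so toDF and $ are available) and data matching the question; since max ignores nulls, the single non-null value in each group is what survives the aggregation:
import org.apache.spark.sql.functions.max

val df = Seq((101, Some(202), Option.empty[Int]), (101, Option.empty[Int], Some(303)))
  .toDF("id", "L1", "L2")

df.groupBy("id")
  .agg(max($"L1").as("L1"), max($"L2").as("L2"))
  .show()
// +---+---+---+
// | id| L1| L2|
// +---+---+---+
// |101|202|303|
// +---+---+---+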
I have a sample DF with duplicate rows like this:
+-------------------+--------------------+---+----------+---+----------+
|                 ID|               CL_ID|NBR|        DT|TYP|       KEY|
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
+-------------------+--------------------+---+----------+---+----------+
val w = Window.partitionBy($"ENCOUNTER_ID")
Using the above Spark Window partition, is it possible to select distinct rows? I am expecting the output DF as:
+-------------------+--------------------+---+----------+---+----------+
|                 ID|               CL_ID|NBR|        DT|TYP|       KEY|
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
+-------------------+--------------------+---+----------+---+----------+
I don't want to use df.distinct or df.dropDuplicates as it would involve shuffling.
I prefer not to use lag or lead because, in real-time, the order of rows can't be guaranteed.
Window functions also shuffle data. So if all of your columns are duplicated, df.dropDuplicates is the better option to use. If your use case requires a window function, you can use the approach below.
scala> df.show()
+-------------------+--------------------+---+----------+---+----------+
|                 ID|               CL_ID|NBR|        DT|TYP|       KEY|
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
+-------------------+--------------------+---+----------+---+----------+
//Use the columns that need to be checked for duplicates in partitionBy, and an appropriate orderBy if one applies; here a constant ordering is used as a sample Window
scala> val W = Window.partitionBy(col("ID"),col("CL_ID"),col("NBR"),col("DT"), col("TYP"), col("KEY")).orderBy(lit(1))
scala> df.withColumn("duplicate", when(row_number.over(W) === lit(1), lit("Y")).otherwise(lit("N")))
.filter(col("duplicate") === lit("Y"))
.drop("duplicate")
.show()
+-------------------+--------------------+---+----------+---+----------+
|                 ID|               CL_ID|NBR|        DT|TYP|       KEY|
+-------------------+--------------------+---+----------+---+----------+
|1000031075_20190422|10017157594301072477| 10|2019-04-24|  N|0000000000|
|1006473016_20190421|10577157412800147475| 11|2019-04-21|  N|0000000000|
+-------------------+--------------------+---+----------+---+----------+
An answer to your question that scales up well with big data:
df.dropDuplicates(['ID'])  # put your key columns here; ID in this case
A window function also shuffles data, but if you have duplicate entries and want to choose which one to keep, for example, or want to sum the values of the duplicates, then a window function is the way to go:
from pyspark.sql.window import Window
from pyspark.sql.functions import first, collect_list
w = Window.partitionBy('id')
df.withColumn('value_agg', first('value_col').over(w))  # you can use max, min, sum, first, last depending on how you want to treat duplicates
An interesting third possibility, if you want to keep the values of the duplicates (for the record), is shown below:
df.withColumn('dup_values', collect_list('value_col').over(w))
this will create an extra column holding an array per row, so the duplicate values are kept even after you get rid of the duplicate rows
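A hedged Scala sketch of that third option, assuming a spark-shell session; value_col is a hypothetical column standing in for whatever you want to preserve (in the question above the duplicate rows are fully identical):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, lit, row_number}

// hypothetical data: "value_col" stands in for the values you want to keep
val df = Seq(("1000031075_20190422", 10), ("1000031075_20190422", 20),
  ("1006473016_20190421", 11)).toDF("ID", "value_col")

val wAll = Window.partitionBy(col("ID"))                 // unordered: frame covers the whole partition
val wOrd = Window.partitionBy(col("ID")).orderBy(lit(1)) // ordered window required by row_number

df.withColumn("dup_values", collect_list(col("value_col")).over(wAll)) // keep all values of the group
  .withColumn("rn", row_number().over(wOrd))
  .filter(col("rn") === 1)                                             // one row per ID
  .drop("rn")
  .show(false)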
I'm using Spark 2.0.
I have a dataframe with several columns like id, latitude, longitude, time.
I want to do a groupBy and keep ["latitude", "longitude"] always together.
Could I do the following?
df.groupBy('id', ["latitude", "longitude"], 'time')
I want to calculate the number of records for each user, at each different time, with each different location ["latitude", "longitude"].
You can combine the "latitude" and "longitude" columns and then use groupBy. The sample below uses Scala.
val df = Seq(("1","33.33","35.35","8:00"),("2","31.33","39.35","9:00"),("1","33.33","35.35","8:00")).toDF("id","latitude","longitude","time")
df.show()
val df1 = df.withColumn("lat-long",array($"latitude",$"longitude"))
df1.show()
val df2 = df1.groupBy("id","lat-long","time").count()
df2.show()
Output will be like below.
+---+--------------+----+-----+
| id|      lat-long|time|count|
+---+--------------+----+-----+
|  2|[31.33, 39.35]|9:00|    1|
|  1|[33.33, 35.35]|8:00|    2|
+---+--------------+----+-----+
You can just use:
df.groupBy('id', 'latitude', 'longitude','time').agg(...)
This will work as expected without any additional steps.
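Since the goal is the number of records per user, time and location, the aggregation can simply be a count. A minimal Scala sketch, reusing the sample df from the answer above:
// count() gives the number of records per (id, latitude, longitude, time) group
df.groupBy("id", "latitude", "longitude", "time").count().show()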