I'm trying to display a distinct count of a couple of different columns in a Spark dataframe, and also the record count after grouping the first column.
So if I had col1, col2, and col3, I want to groupBy col1, and then display a distinct count of col2 and also a distinct count of col3.
Then, I would like to display the record count after that same groupBy of col1.
And finally, I'd like to do all of this in one agg statement.
Any ideas?
Below is the code you are looking for:
df.groupBy("COL1").agg(countDistinct("COL2"),countDistinct("COL3"),count($"*")).show
=======Tested Below============
scala> val lst = List(("a","x","d"),("b","D","s"),("ss","kk","ll"),("a","y","e"),("b","c","y"),("a","x","y"));
lst: List[(String, String, String)] = List((a,x,d), (b,D,s), (ss,kk,ll), (a,y,e), (b,c,y), (a,x,y))
scala> val rdd=sc.makeRDD(lst);
rdd: org.apache.spark.rdd.RDD[(String, String, String)] = ParallelCollectionRDD[7] at makeRDD at <console>:26
scala> val df = rdd.toDF("COL1","COL2","COL3");
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]
scala> df.printSchema
root
|-- COL1: string (nullable = true)
|-- COL2: string (nullable = true)
|-- COL3: string (nullable = true)
scala> df.groupBy("COL1").agg(countDistinct("COL2"),countDistinct("COL3"),count($"*")).show
+----+--------------------+--------------------+--------+
|COL1|count(DISTINCT COL2)|count(DISTINCT COL3)|count(1)|
+----+--------------------+--------------------+--------+
| ss| 1| 1| 1|
| b| 2| 2| 2|
| a| 2| 3| 3|
+----+--------------------+--------------------+--------+
scala>
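If you want friendlier names on the output columns, you can alias each aggregate inside that same agg call. A minimal sketch (the alias names here are my own, not from the original answer):
import org.apache.spark.sql.functions.{count, countDistinct, lit}

// Same aggregation, with explicit aliases so the output columns have readable names
df.groupBy("COL1")
  .agg(
    countDistinct("COL2").as("distinct_col2"),  // distinct COL2 values per COL1 group
    countDistinct("COL3").as("distinct_col3"),  // distinct COL3 values per COL1 group
    count(lit(1)).as("row_count")               // number of rows per COL1 group
  )
  .show()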
After running the ALS algorithm in PySpark over a dataset, I have a final dataframe with a person column and a recommendation column.
The recommendation column is an array type; now I want to split this column so that each ID becomes its own column holding its rating.
Can anyone suggest which PySpark function can be used to form this dataframe?
Schema of the dataframe
root
|-- person: string (nullable = false)
|-- recommendation: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- rating: float (nullable = true)
Assuming ID doesn't duplicate in each array, you can try the following:
import pyspark.sql.functions as f
df.withColumn('recommendation', f.explode('recommendation'))\
    .withColumn('ID', f.col('recommendation').getItem('ID'))\
    .withColumn('rating', f.col('recommendation').getItem('rating'))\
    .groupby('person')\
    .pivot('ID')\
    .agg(f.first('rating')).show()
+------+---+---+---+
|person| a| b| c|
+------+---+---+---+
| xyz|0.4|0.3|0.3|
| abc|0.5|0.3|0.2|
| def|0.3|0.2|0.5|
+------+---+---+---+
Or transform with RDD:
from pyspark.sql import Row

df.rdd.map(lambda r: Row(
    person=r.person, **{s.ID: s.rating for s in r.recommendation})
).toDF().show()
+------+-------------------+-------------------+-------------------+
|person| a| b| c|
+------+-------------------+-------------------+-------------------+
| abc| 0.5|0.30000001192092896|0.20000000298023224|
| def|0.30000001192092896|0.20000000298023224| 0.5|
| xyz| 0.4000000059604645|0.30000001192092896|0.30000001192092896|
+------+-------------------+-------------------+-------------------+
Data schema:
root
|-- id: string (nullable = true)
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|id|col1 |col2 |
|1 |["x","y","z"]|[123,"null","null"]|
From the above data I want to filter where x exists in col1 and get the respective value for x from col2.
(The values of col1 and col2 are ordered: if x is at index 2 in col1, its value is at index 2 in col2.)
Result (col1 and col2 need to be of array type):
|id |col1 |col2 |
|1 |["x"]|[123]|
If x is not present in col1, then I need a result like:
|id| col1 |col2 |
|1 |["null"] |["null"]|
I tried:
val df1 = df.withColumn("result",when($"col1".contains("x"),"X").otherwise("null"))
The trick is to transform your data from dumb string columns into a more usable data structure. Once col1 and col2 are rebuilt as arrays (or as a map, as your desired output suggests they should be), you can use Spark's built-in functions rather than a messy UDF as suggested by @baitmbarek.
To start, use trim and split to convert col1 and col2 to arrays:
scala> val df = Seq(
| ("1", """["x","y","z"]""","""[123,"null","null"]"""),
| ("2", """["a","y","z"]""","""[123,"null","null"]""")
| ).toDF("id","col1","col2")
df: org.apache.spark.sql.DataFrame = [id: string, col1: string ... 1 more field]
scala> val df_array = df.withColumn("col1", split(trim($"col1", "[\"]"), "\"?,\"?"))
.withColumn("col2", split(trim($"col2", "[\"]"), "\"?,\"?"))
df_array: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]
scala> df_array.show(false)
+---+---------+-----------------+
|id |col1 |col2 |
+---+---------+-----------------+
|1 |[x, y, z]|[123, null, null]|
|2 |[a, y, z]|[123, null, null]|
+---+---------+-----------------+
scala> df_array.printSchema
root
|-- id: string (nullable = true)
|-- col1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
From here, you should be able to achieve what you want using array_position to find the index of 'x' (if any) in col1 and retrieve the matching data from col2 (a sketch of that route appears after the map-based version below). However, converting the two arrays into a map first makes it clearer what the code is doing:
scala> val df_map = df_array.select(
$"id",
map_from_entries(arrays_zip($"col1", $"col2")).as("col_map")
)
df_map: org.apache.spark.sql.DataFrame = [id: string, col_map: map<string,string>]
scala> df_map.show(false)
+---+--------------------------------+
|id |col_map |
+---+--------------------------------+
|1 |[x -> 123, y -> null, z -> null]|
|2 |[a -> 123, y -> null, z -> null]|
+---+--------------------------------+
scala> val df_final = df_map.select(
$"id",
when(isnull(element_at($"col_map", "x")),
array(lit("null")))
.otherwise(
array(lit("x")))
.as("col1"),
when(isnull(element_at($"col_map", "x")),
array(lit("null")))
.otherwise(
array(element_at($"col_map", "x")))
.as("col2")
)
df_final: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]
scala> df_final.show
+---+------+------+
| id| col1| col2|
+---+------+------+
| 1| [x]| [123]|
| 2|[null]|[null]|
+---+------+------+
scala> df_final.printSchema
root
|-- id: string (nullable = true)
|-- col1: array (nullable = false)
| |-- element: string (containsNull = false)
|-- col2: array (nullable = false)
| |-- element: string (containsNull = true)
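For completeness, here is a rough sketch of the array_position route mentioned above (assuming Spark 2.4+, where array_position and element_at are available; an untested sketch that reuses the df_array dataframe built earlier):
import org.apache.spark.sql.functions._

// array_position returns the 1-based index of "x" in col1, or 0 when it is absent
val df_pos = df_array
  .withColumn("pos", array_position($"col1", "x").cast("int"))
  .select(
    $"id",
    when($"pos" > 0, array(lit("x"))).otherwise(array(lit("null"))).as("col1"),
    when($"pos" > 0, array(element_at($"col2", $"pos")))
      .otherwise(array(lit("null"))).as("col2")
  )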
I am getting many duplicated columns after joining two dataframes;
now I want to drop the columns that come last. Below is my printSchema:
root
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- test: string (nullable = true)
|-- details: string (nullable = true)
|-- test: string (nullable = true)
|-- value: string (nullable = true)
Now I want to drop the last two columns:
|-- test: string (nullable = true)
|-- value: string (nullable = true)
I tried df.dropDuplicates(), but it drops everything.
How do I drop the duplicated columns that come last?
You have to use varargs syntax to pass the column names from an array and drop them.
Check below:
scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)
scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)
scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update1:
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update2:
To remove the columns dynamically, check the below solution.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]
scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+
scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)
scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)
scala> val df4 = rem2.foldLeft(df3) { (acc: DataFrame, colName: String) => acc.drop(col(colName)) }
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update3:
Renaming/aliasing in one go.
scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val dfa2 = dfa.columns.foldLeft(dfa) { (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")}
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]
scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
df.dropDuplicates() works only for rows.
You can use df1.drop(df2.col("value")).
You can also specify the columns you want to keep with df.select, passing a Seq of columns; see the sketch below.
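A minimal sketch of both of those ideas, using placeholder names (df1, df2 and the join key "id" are assumptions based on the schema in the question; untested):
import org.apache.spark.sql.functions.col

// Join with aliases so the duplicated columns stay addressable afterwards
val joined = df1.alias("t1").join(df2.alias("t2"), Seq("id"))

// Drop the df2-side duplicates by qualified reference
val dropped = joined.drop(col("t2.value")).drop(col("t2.test"))

// Or keep only the columns you want, passing a Seq of columns as varargs
val keep = Seq("id", "t1.value", "t1.test", "t1.details")
val selected = joined.select(keep.map(col): _*)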
Suppose you have two dataframes, DF1 and DF2.
You can use either of these ways to join on particular columns:
1. DF1.join(DF2, Seq("column1","column2"))
2. DF1.join(DF2, DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))
So to drop the duplicate columns you can use:
1. DF1.join(DF2, Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column2"))
2. DF1.join(DF2, DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")).drop(DF1("column1")).drop(DF1("column2"))
In either case you can use drop("columnname") to drop whatever columns you need; it doesn't matter which dataframe they come from, since the values are equal in this case.
I wasn't completely satisfied with the answers here. For the most part, especially @stack0114106's answers, they hint at the right way and at the complexity of doing it cleanly, but they seem incomplete. To me, a clean, automated way of doing this is to use df.columns to get the columns as a list of strings, and then use sets to find either the common columns to drop or the unique columns to keep, depending on your use case. However, if you use select you will have to alias the dataframes so it knows which of the non-unique columns to keep. Anyway, in pseudocode (a rough Scala version follows below):
common_cols = df_b.columns.toSet().intersection(df_a.columns.toSet())
df_a.join(df_b.drop(*common_cols))
The select version of this looks similar but you have to add in the aliasing.
unique_b_cols = df_b.columns.toSet().difference(df_a.columns.toSet()).toList
a_cols_aliased = df_a.columns.foreach(cols => "a." + cols)
keep_columns = a_cols_aliased.toList + unique_b_cols.toList
df_a.alias("a")
.join(df_b.alias("b"))
.select(*keep_columns)
I prefer the drop way, but having written a bunch of Spark code, I find that a select statement can often lead to cleaner code.
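For what it's worth, here is a rough Scala version of that pseudocode (df_a, df_b and the join key "id" are placeholders; treat it as an untested sketch):
import org.apache.spark.sql.functions.col

// Drop variant: remove the overlapping columns from df_b (keeping the join key) before joining
val commonCols = df_b.columns.toSet.intersect(df_a.columns.toSet) - "id"
val joinedDrop = df_a.join(df_b.drop(commonCols.toSeq: _*), Seq("id"))

// Select variant: keep every column of df_a plus only the columns unique to df_b
val uniqueBCols = df_b.columns.toSet.diff(df_a.columns.toSet).toSeq
val keepCols = df_a.columns.toSeq.map(c => col("a." + c)) ++ uniqueBCols.map(c => col("b." + c))
val joinedSelect = df_a.alias("a")
  .join(df_b.alias("b"), col("a.id") === col("b.id"))
  .select(keepCols: _*)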
How to get the MAX in the below dataframe?
val df_n = df.select($"ID").filter(($"READ") === "" && ($"ACT"!==""))
I have to find the MAX of ID, and if ID is NULL, I have to replace it with 0.
What about the following?
Test Dataset
scala> val df = Seq("0", null, "5", null, null, "-8").toDF("id")
df: org.apache.spark.sql.DataFrame = [id: string]
scala> df.printSchema
root
|-- id: string (nullable = true)
scala> df.withColumn("idAsLong", $"id" cast "long").printSchema
root
|-- id: string (nullable = true)
|-- idAsLong: long (nullable = true)
scala> val testDF = df.withColumn("idAsLong", $"id" cast "long")
testDF: org.apache.spark.sql.DataFrame = [id: string, idAsLong: bigint]
scala> testDF.show
+----+--------+
| id|idAsLong|
+----+--------+
| 0| 0|
|null| null|
| 5| 5|
|null| null|
|null| null|
| -8| -8|
+----+--------+
Solution
scala> testDF.agg(max("idAsLong")).show
+-------------+
|max(idAsLong)|
+-------------+
| 5|
+-------------+
Using na Operator
What if you had only negative values and nulls, so that a null (treated as 0) should become the maximum? Use the na operator on the Dataset.
val withNulls = Seq("-1", "-5", null, null, "-333", null)
.toDF("id")
.withColumn("asInt", $"id" cast "int") // <-- column of type int with nulls
scala> withNulls.na.fill(Map("asInt" -> 0)).agg(max("asInt")).show
+----------+
|max(asInt)|
+----------+
| 0|
+----------+
Without na replacing the nulls, it simply won't give you 0 as the maximum:
scala> withNulls.agg(max("asInt")).show
+----------+
|max(asInt)|
+----------+
| -1|
+----------+
See na: DataFrameNaFunctions.
If you want to find out the maximum ID in this dataframe, you just have to add
.agg(max($"ID"))
However, I do not understand why you would want to replace the maximum ID with 0 when there is no further grouping. Anyway, if you feel more comfortable with SQL, you can always use the SQL interface:
df.createOrReplaceTempView("DF")
spark.sql("select max(id) from DF").show
It's CDH with Spark 1.6.
I am trying to import this hypothetical CSV into an Apache Spark DataFrame:
$ hadoop fs -cat test.csv
a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a
I use the databricks spark-csv jar.
val textData = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", ",")
.option("dateFormat", "yyyy-MM-dd HH:mm:ss")
.option("inferSchema", "true")
.option("nullValue", "null")
.load("test.csv")
I use inferSchema to build the schema for the resulting DataFrame. The printSchema() function gives me the following output for the code above:
scala> textData.printSchema()
root
|-- C0: string (nullable = true)
|-- C1: string (nullable = true)
|-- C2: string (nullable = true)
|-- C3: string (nullable = true)
|-- C4: string (nullable = true)
|-- C5: timestamp (nullable = true)
|-- C6: string (nullable = true)
scala> textData.show()
+---+---+---+----------+---+--------------------+---+
| C0| C1| C2| C3| C4| C5| C6|
+---+---+---+----------+---+--------------------+---+
| a| b| c|2016-09-09| a|2016-11-11 09:09:...| a|
| a| b| c|2016-09-10| a|2016-11-11 09:09:...| a|
+---+---+---+----------+---+--------------------+---+
The C3 column has String type. I want C3 to have date type. To get it to a date type, I tried the following code:
val textData = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", ",")
.option("dateFormat", "yyyy-MM-dd")
.option("inferSchema", "true")
.option("nullValue", "null")
.load("test.csv")
scala> textData.printSchema
root
|-- C0: string (nullable = true)
|-- C1: string (nullable = true)
|-- C2: string (nullable = true)
|-- C3: timestamp (nullable = true)
|-- C4: string (nullable = true)
|-- C5: timestamp (nullable = true)
|-- C6: string (nullable = true)
scala> textData.show()
+---+---+---+--------------------+---+--------------------+---+
| C0| C1| C2| C3| C4| C5| C6|
+---+---+---+--------------------+---+--------------------+---+
| a| b| c|2016-09-09 00:00:...| a|2016-11-11 00:00:...| a|
| a| b| c|2016-09-10 00:00:...| a|2016-11-11 00:00:...| a|
+---+---+---+--------------------+---+--------------------+---+
The only difference between this code and the first block is the dateFormat option line (I use "yyyy-MM-dd" instead of "yyyy-MM-dd HH:mm:ss"). Now I get both C3 and C5 as timestamps (C3 is still not a date). But for C5, the HH:mm:ss part is ignored and shows up as zeroes in the data.
Ideally I want C3 to be of type date, and C5 to be of type timestamp with its HH:mm:ss part not ignored. My workaround right now looks like this: I build the CSV by pulling data in parallel from my DB and make sure that I pull all dates as timestamps (not ideal). So the test CSV now looks like this:
$ hadoop fs -cat new-test.csv
a,b,c,2016-09-09 00:00:00,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10 00:00:00,a,2016-11-11 09:09:10.0,a
This is my final working code:
val textData = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", ",")
.option("dateFormat", "yyyy-MM-dd HH:mm:ss")
.schema(finalSchema)
.option("nullValue", "null")
.load("new-test.csv")
Here, I use the complete timestamp format ("yyyy-MM-dd HH:mm:ss") in dateFormat. I manually create the finalSchema instance, where C3 is DateType and C5 is TimestampType (Spark SQL types), and apply it with the schema() function. The output looks as follows:
scala> finalSchema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(C0,StringType,true), StructField(C1,StringType,true), StructField(C2,StringType,true), StructField(C3,DateType,true), StructField(C4,StringType,true), StructField(C5,TimestampType,true), StructField(C6,StringType,true))
scala> textData.printSchema()
root
|-- C0: string (nullable = true)
|-- C1: string (nullable = true)
|-- C2: string (nullable = true)
|-- C3: date (nullable = true)
|-- C4: string (nullable = true)
|-- C5: timestamp (nullable = true)
|-- C6: string (nullable = true)
scala> textData.show()
+---+---+---+----------+---+--------------------+---+
| C0| C1| C2| C3| C4| C5| C6|
+---+---+---+----------+---+--------------------+---+
| a| b| c|2016-09-09| a|2016-11-11 09:09:...| a|
| a| b| c|2016-09-10| a|2016-11-11 09:09:...| a|
+---+---+---+----------+---+--------------------+---+
Is there an easier or out-of-the-box way to parse a CSV file (that has both date and timestamp types) into a Spark DataFrame?
Relevant Links:
http://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options
https://github.com/databricks/spark-csv
With the inferSchema option, non-trivial cases will probably not return the expected result. As you can see in InferSchema.scala:
if (field == null || field.isEmpty || field == nullValue) {
typeSoFar
} else {
typeSoFar match {
case NullType => tryParseInteger(field)
case IntegerType => tryParseInteger(field)
case LongType => tryParseLong(field)
case DoubleType => tryParseDouble(field)
case TimestampType => tryParseTimestamp(field)
case BooleanType => tryParseBoolean(field)
case StringType => StringType
case other: DataType =>
throw new UnsupportedOperationException(s"Unexpected data type $other")
It will only try to match each column with a timestamp type, not a date type, so an "out of the box" solution is not possible for this case. In my experience, the "easier" solution is to define the schema directly with the needed types; it avoids the infer option setting a type that only matches the data it evaluated rather than the entire dataset. Your final schema is an efficient solution.
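A sketch of spelling out such a schema explicitly (it simply mirrors the finalSchema already shown in the question):
import org.apache.spark.sql.types._

// Explicit schema: C3 parsed as a date, C5 as a timestamp, everything else as plain strings
val finalSchema = StructType(Seq(
  StructField("C0", StringType, nullable = true),
  StructField("C1", StringType, nullable = true),
  StructField("C2", StringType, nullable = true),
  StructField("C3", DateType, nullable = true),
  StructField("C4", StringType, nullable = true),
  StructField("C5", TimestampType, nullable = true),
  StructField("C6", StringType, nullable = true)
))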
It's not really elegant, but you can convert from timestamp to date like this (check the last line):
val textData = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", ",")
.option("dateFormat", "yyyy-MM-dd")
.option("inferSchema", "true")
.option("nullValue", "null")
.load("test.csv")
.withColumn("C4", expr("""to_date(C4)"""))