How to delete the first few rows in a DataFrame in Scala/Spark? - apache-spark

I have a DataFrame and I want to delete the first and second rows. What should I do?
This is my input:
+-----+
|value|
+-----+
| 1|
| 4|
| 3|
| 5|
| 4|
| 18|
+-----+
This is the expected result:
+-----+
|value|
+-----+
| 3|
| 5|
| 4|
| 18|
+-----+

In my opinion it does not make sense to speak about a first or second record if you cannot define an ordering of your dataframe. The ordering of the records produced by the show statement is "arbitrary" and depends on the partitioning of your data.
Assuming you have a column over which you can order your records, you can use window functions. Starting with this dataframe:
+----+-----+
|year|value|
+----+-----+
|2007| 1|
|2008| 4|
|2009| 3|
|2010| 5|
|2011| 4|
|2012| 18|
+----+-----+
You can do:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

df
  .withColumn("rn", row_number().over(Window.orderBy($"year")))
  .where($"rn" > 2)
  .drop("rn")
  .show
Note that Window.orderBy without a partitionBy moves all data to a single partition, so this is only suitable for reasonably small dataframes.

A simple and easy way is to assign an id to each row and filter on it:
import org.apache.spark.sql.functions.monotonically_increasing_id

val df = Seq(1, 4, 3, 5, 4, 18).toDF("value")
df.withColumn("id", monotonically_increasing_id()).filter($"id" > 1).drop("id")
Edit: Since monotonically_increasing_id() does not guarantee consecutive ids, you can use zipWithIndex as below:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val rows = df.rdd.zipWithIndex().map {
  case (row, id) => Row.fromSeq(row.toSeq :+ id)
}
val df1 = spark.createDataFrame(rows, StructType(df.schema.fields :+ StructField("id", LongType, nullable = false)))
df1.filter($"id" > 1).drop("id")
Output:
+-----+
|value|
+-----+
| 3|
| 5|
| 4|
| 18|
+-----+
This approach also lets you drop the nth row of a dataframe (see the sketch below).
Hope this helps!
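For example, dropping only the nth row could look like this (a minimal sketch, not part of the original answer, reusing the 0-based id column built above; n = 3 is a hypothetical choice):
// Drop only the nth row (0-based) by filtering out its id.
val n = 3L
df1.filter($"id" =!= n).drop("id").show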

Related

How to set the value of a Pyspark column based on two conditions of the value of another column

Say I have a dataframe:
+-----+-----+-----+
|id   |foo  |bar  |
+-----+-----+-----+
| 1| baz| 0|
| 2| baz| 0|
| 3| 333| 2|
| 4| 444| 1|
+-----+-----+-----+
I want to set the 'foo' column to a value depending on the value of bar:
if bar is 2, set the value of foo for that row to 'X';
else if bar is 1, set the value of foo for that row to 'Y';
and if neither condition is met, leave the foo value as it is.
pyspark.sql.functions.when seems like the closest method, but it doesn't seem to work based on another column's value.
when can work with other columns. You can use F.col to get the value of the other column and provide an appropriate condition:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'foo',
    F.when(F.col('bar') == 2, 'X')
     .when(F.col('bar') == 1, 'Y')
     .otherwise(F.col('foo'))
)
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
We can solve this using when or a UDF in Spark to set the new column based on a condition.
Create Sample DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('AddConditionalColumn').getOrCreate()
data = [(1,"baz",0),(2,"baz",0),(3,"333",2),(4,"444",1)]
columns = ["id","foo","bar"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3|333| 2|
| 4|444| 1|
+---+---+---+
Using When:
from pyspark.sql.functions import when

df2 = df.withColumn("foo", when(df.bar == 2, "X")
                           .when(df.bar == 1, "Y")
                           .otherwise(df.foo))
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
Using UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def executeRule(value):
    if value == 2:
        return 'X'
    elif value == 1:
        return 'Y'
    else:
        return value

# Converting function to UDF
ruleUDF = F.udf(executeRule, StringType())

df3 = df.withColumn("foo", ruleUDF("bar"))
df3.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1| 0| 0|
| 2| 0| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
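Note that because the UDF only receives bar, the else branch returns bar itself, which is why foo becomes 0 instead of baz in the first two rows above. A minimal sketch that keeps the original foo by passing both columns to the UDF (executeRule2 and df4 are hypothetical names, not from the original answer):
def executeRule2(foo, bar):
    # Keep the original foo when neither condition matches.
    if bar == 2:
        return 'X'
    elif bar == 1:
        return 'Y'
    return foo

ruleUDF2 = F.udf(executeRule2, StringType())
df4 = df.withColumn("foo", ruleUDF2("foo", "bar"))
df4.show()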

How to use a Spark window function to cascade changes from the previous row to the next row

I tried to use a window function to calculate the current value based on the previous value in a dynamic way:
rowID | value
------------------
1 | 5
2 | 7
3 | 6
Logic:
If value > previous value, then use the previous (already updated) value.
So in row 2, since 7 > 5, the value becomes 5.
The final result should be
rowID | value
------------------
1 | 5
2 | 5
3 | 5
However using lag().over(w) gave the result as
rowID | value
------------------
1 | 5
2 | 5
3 | 6
It compares the third row's value 6 against the original "7", not the updated value "5".
Any suggestion how to achieve this?
# example dataframe
df.show()
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 7|
| 3| 6|
| 4| 9|
| 5| 4|
| 6| 3|
+-----+-----+
Your required logic is too dynamic for window functions; therefore, we have to go row by row, updating our values. One solution is to use a normal Python UDF on a collected list and then explode once the UDF has been applied. If you have relatively small data, this should be fine (Spark 2.4+ only, because of arrays_zip).
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

def add_one(a):
    # Cascade: cap each element at the previous (already updated) element.
    for i in range(1, len(a)):
        if a[i] > a[i-1]:
            a[i] = a[i-1]
    return a

udf1 = F.udf(add_one, ArrayType(IntegerType()))

df.agg(F.collect_list("rowID").alias("rowID"), F.collect_list("value").alias("value"))\
  .withColumn("value", udf1("value"))\
  .withColumn("zipped", F.explode(F.arrays_zip("rowID", "value")))\
  .select("zipped.*").show()
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 5| 4|
| 6| 3|
+-----+-----+
UPDATE:
Better yet, since you have groups of 5000, using a Pandas vectorized UDF (grouped map) should help a lot with processing, and you do not have to collect_list 5000 integers and explode or pivot. I think this should be the optimal solution. Pandas grouped-map UDFs are available for Spark 2.3+.
The groupBy below is empty, but you can add your grouping column to it.
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def grouped_map(df1):
    for i in range(1, len(df1)):
        if df1.loc[i, 'value'] > df1.loc[i-1, 'value']:
            df1.loc[i, 'value'] = df1.loc[i-1, 'value']
    return df1

df.groupby().apply(grouped_map).show()
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 5| 4|
| 6| 3|
+-----+-----+
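On Spark 3.x, the same grouped-map logic can also be expressed with applyInPandas instead of the GROUPED_MAP decorator. A minimal sketch, not from the original answer, assuming the same df as above (cap_at_previous is a hypothetical name):
def cap_at_previous(pdf):
    # Same row-by-row cascade as grouped_map above, applied to a pandas DataFrame.
    for i in range(1, len(pdf)):
        if pdf.loc[i, 'value'] > pdf.loc[i-1, 'value']:
            pdf.loc[i, 'value'] = pdf.loc[i-1, 'value']
    return pdf

df.groupby().applyInPandas(cap_at_previous, schema=df.schema).show()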

How to do a conditional aggregation after a groupby in pyspark dataframe?

I'm trying to group by an ID column in a pyspark dataframe and sum a column depending on the value of another column.
To illustrate, consider the following dummy dataframe:
+-----+-------+---------+
| ID| type| amount|
+-----+-------+---------+
| 1| a| 55|
| 2| b| 1455|
| 2| a| 20|
| 2| b| 100|
| 3| null| 230|
+-----+-------+---------+
My desired output is:
+-----+--------+----------+----------+
| ID| sales| sales_a| sales_b|
+-----+--------+----------+----------+
| 1| 55| 55| 0|
| 2| 1575| 20| 1555|
| 3| 230| 0| 0|
+-----+--------+----------+----------+
So basically, sales will be the sum of amount, while sales_a and sales_b are the sum of amount when type is a or b respectively.
For sales, I know this could be done like this:
from pyspark.sql import functions as F
df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
For the others, I'm guessing F.when would be useful but I'm not sure how to go about it.
You could create two columns before the aggregation, based on the value of type:
df.withColumn("sales_a", F.when(col("type") == "a", col("amount"))) \
.withColumn("sales_b", F.when(col("type") == "b", col("amount"))) \
.groupBy("ID") \
.agg(F.sum("amount").alias("sales"),
F.sum("sales_a").alias("sales_a"),
F.sum("sales_b").alias("sales_b"))
from pyspark.sql import functions as F

dfSales = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
dfPivot = df.filter("type is not null").groupBy("ID").pivot("type").agg(F.sum("amount").alias("sales"))
res = dfSales.join(dfPivot, "ID", how='left')
Then replace null with 0.
This is a generic solution that will work irrespective of the values in the type column, so if a type c is added to the dataframe, it will create a corresponding pivoted column for c.
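For example, the null replacement mentioned above could look like this (a small sketch, assuming the res dataframe from the previous snippet):
# Fill the nulls produced by the pivot / left join (types missing for an ID) with 0.
res.na.fill(0).show()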

How to add a new column not based on an existing column in a dataframe with Scala/Spark? [duplicate]

This question already has answers here:
pyspark add new column field with the data frame row number
(1 answer)
Spark Dataframe :How to add a index Column : Aka Distributed Data Index
(7 answers)
Primary keys with Apache Spark
(4 answers)
DataFrame-ified zipWithIndex
(9 answers)
Closed 5 years ago.
I have a DataFrame and I want to add a new column that is not based on an existing column. What should I do?
This is my dataframe:
+----+
|time|
+----+
| 1|
| 4|
| 3|
| 2|
| 5|
| 7|
| 3|
| 5|
+----+
This is my expect result:
+----+-----+
|time|index|
+----+-----+
| 1| 1|
| 4| 2|
| 3| 3|
| 2| 4|
| 5| 5|
| 7| 6|
| 3| 7|
| 5| 8|
+----+-----+
Using rdd.zipWithIndex may be what you want.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

val newRdd = yourDF.rdd.zipWithIndex.map { case (r: Row, id: Long) => Row.fromSeq(r.toSeq :+ id) }
val schema = StructType(Array(StructField("time", IntegerType, nullable = true), StructField("index", LongType, nullable = true)))
val newDF = spark.createDataFrame(newRdd, schema)
newDF.show
+----+-----+
|time|index|
+----+-----+
| 1| 0|
| 4| 1|
| 3| 2|
| 2| 3|
| 5| 4|
| 7| 5|
| 3| 6|
| 5| 7|
+----+-----+
I assume your time column is IntegerType here. Note that zipWithIndex starts at 0, so add 1 to the id if you need the 1-based index shown in the expected result.
Rather than using a window function or converting to an RDD and using zipWithIndex, which are slower, you can use the built-in function monotonically_increasing_id:
import org.apache.spark.sql.functions._
df.withColumn("index", monotonically_increasing_id())
Hope this helps!
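Note that monotonically_increasing_id() guarantees increasing but not consecutive values, so it will not reproduce the exact 1..8 index from the expected output. If you need consecutive numbers, one option is to rank by that id (a sketch, not from the original answer; the window has no partitioning, so it moves all rows to a single partition and is only suitable for small data):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

df.withColumn("mid", monotonically_increasing_id())
  .withColumn("index", row_number().over(Window.orderBy($"mid")))
  .drop("mid")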

Changing Nulls Ordering in Spark SQL

I need to be able to sort columns in ascending and descending order and also allow nulls to be first or nulls to be last. Using RDDs I could use the sortByKey method with a custom comparator. I was wondering if there is a corresponding approach using the Dataset API. I see how to add desc/asc to columns, but I have no clue on the nulls ordering.
You can also do it with the dataset API:
scala> val df = Seq("a", "b", null).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: string]
scala> df.select('*).orderBy('x.asc_nulls_last).show
+----+
| x|
+----+
| a|
| b|
|null|
+----+
scala> df.select('*).orderBy('x.asc_nulls_first).show
+----+
| x|
+----+
|null|
| a|
| b|
+----+
Same thing works with desc_nulls_last and desc_nulls_first.
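For example, with the same toy dataframe (an extra example, not in the original answer):
scala> df.orderBy('x.desc_nulls_first).show
+----+
|   x|
+----+
|null|
|   b|
|   a|
+----+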
As mentioned by Oleksandr, there was a pull request for this. Now you can optionally use "nulls first" or "nulls last" in Spark SQL:
scala> spark.sql("select * from spark_10747 order by col3 nulls last").show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 6| 7| 4|
| 6| 11| 4|
| 6| 15| 8|
| 6| 15| 8|
| 6| 7| 8|
| 6| 12| 10|
| 6| 9| 10|
| 6| 13|null|
| 6| 10|null|
+----+----+----+
