Spark - Python: Select rows and dates - apache-spark

I have the following DataFrame in Spark (Python).
I am trying to select, for each grupo_edad, the first day on which the "datos_acumulados" column exceeds 20480. In this case, the output should be a table like the following (keeping the rows with nulls for groups that never reach that value):
Results:
+----------+----------+------------+----------------+
|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|         1|2020-08-04|        4864|           20921|
|         4|      null|        null|            null|
+----------+----------+------------+----------------+
Dataframe: df_datos_acumulados
+----------+----------+------------+----------------+
|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|         1|2020-08-01|        6185|            6185|
|         1|2020-08-02|        5854|           12039|
|         1|2020-08-03|        4018|           16057|
|         1|2020-08-04|        4864|           20921|
|         1|2020-08-05|        5526|           26447|
|         1|2020-08-06|        4818|           31265|
|         1|2020-08-07|        5359|           36624|
|         4|2020-08-01|         674|             674|
|         4|2020-08-02|         744|            1418|
|         4|2020-08-03|         490|            1908|
|         4|2020-08-04|         355|            2263|
|         4|2020-08-05|        1061|            3324|
|         4|2020-08-06|         752|            4076|
|         4|2020-08-07|         560|            4636|
+----------+----------+------------+----------------+
Thanks!
Update: thanks to pasha701's answer I could get the final table, but it does not show the null rows that I also need:
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

grupoDistinctDF = df_datos_acumulados.withColumn("grupo_edad", col("grupo_edad"))
grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")
df_datos_acumulados = df_datos_acumulados.where(col("datos_acumulados") >= 20480) \
    .withColumn("row_number", row_number().over(grupoWindow)) \
    .where(col("row_number") == 1) \
    .drop("row_number")
grupoDistinctDF = grupoDistinctDF.join(df_datos_acumulados, ["grupo_edad"], "left")
Output:
+----------+----------+------------+----------------+
|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|         1|2020-08-04|        4864|           20921|
+----------+----------+------------+----------------+

If the first row per group where "datos_acumulados" > 20480 is required, the window function row_number() can be used to get that first row, which is then left-joined with the distinct "grupo_edad" values (Scala):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val df = Seq(
  (1, "2020-08-01", 6185, 6185),
  (1, "2020-08-02", 5854, 12039),
  (1, "2020-08-03", 4018, 16057),
  (1, "2020-08-04", 4864, 20921),
  (1, "2020-08-05", 5526, 26447),
  (1, "2020-08-06", 4818, 31265),
  (1, "2020-08-07", 5359, 36624),
  (4, "2020-08-01", 674, 674),
  (4, "2020-08-02", 744, 1418),
  (4, "2020-08-03", 490, 1908),
  (4, "2020-08-04", 355, 2263),
  (4, "2020-08-05", 1061, 3324),
  (4, "2020-08-06", 752, 4076),
  (4, "2020-08-07", 560, 4636)
).toDF("grupo_edad", "fecha", "acumuladosMB", "datos_acumulados")

val grupoDistinctDF = df.select("grupo_edad").distinct()
val grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")
val firstMatchingRowDF = df
  .where($"datos_acumulados" > 20480)
  .withColumn("row_number", row_number().over(grupoWindow))
  .where($"row_number" === 1)
  .drop("row_number")

grupoDistinctDF.join(firstMatchingRowDF, Seq("grupo_edad"), "left").show(false)
Output:
+----------+----------+------------+----------------+
|grupo_edad|fecha |acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|4 |null |null |null |
|1 |2020-08-04|4864 |20921 |
+----------+----------+------------+----------------+
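The same approach carries over directly to PySpark. Below is a minimal sketch, assuming the df_datos_acumulados DataFrame from the question; note that grupoDistinctDF is built from a distinct select on grupo_edad (as in the Scala answer) before the left join, so the null row for group 4 is preserved:

from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# One row per group key, so every grupo_edad survives the left join.
grupoDistinctDF = df_datos_acumulados.select("grupo_edad").distinct()

# First date per group on which datos_acumulados exceeds the threshold.
grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")
firstMatchingDF = (df_datos_acumulados
    .where(col("datos_acumulados") > 20480)
    .withColumn("row_number", row_number().over(grupoWindow))
    .where(col("row_number") == 1)
    .drop("row_number"))

# Groups that never reach the threshold come back with null columns.
grupoDistinctDF.join(firstMatchingDF, ["grupo_edad"], "left").show()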

Related

spark date_format results showing null

I have data source like below:
order_id,order_date,order_customer_id,order_status
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
I am trying to convert the date to MM/dd/yyyy, only for CLOSED orders, using the queries below, but I am getting null as output. Can you please assist with getting the required date format using the DSL or the Spark SQL method?
closed_df = ord_df.select(date_format(to_date('order_date', 'yyyy-mm-dd hh:mm:SS.a'), 'mm/dd/yyyy')
                          .alias("formate_date")).show()
# output:
+------------+
|formate_date|
+------------+
|        null|
|        null|
+------------+
ord_df.createOrReplaceTempView("orders")
cld_df = spark.sql("""select order_id, date_format(to_date("order_date", "yyyy-mm-dd hh:mm:ss.a"), 'mm/dd/yyyy') as order_date,
                             order_customer_id, order_status
                      from orders where order_status = 'CLOSED'""").show()
# output:
+--------+----------+-----------------+------------+
|order_id|order_date|order_customer_id|order_status|
+--------+----------+-----------------+------------+
|       1|      null|            11599|      CLOSED|
|       4|      null|             8827|      CLOSED|
+--------+----------+-----------------+------------+
The pattern to parse the string 2013-07-25 00:00:00.0 is yyyy-MM-dd HH:mm:SS.s (month is uppercase MM, hour of day is HH), and likewise the output pattern for date_format is MM/dd/yyyy, not mm/dd/yyyy. See the Spark datetime pattern documentation for more information.
from pyspark.sql.functions import to_date, date_format

data = [(1, "2013-07-25 00:00:00.0", 11599, "CLOSED"),
        (2, "2013-07-25 00:00:00.0", 256, "PENDING_PAYMENT"),
        (3, "2013-07-25 00:00:00.0", 12111, "COMPLETE"),
        (4, "2013-07-25 00:00:00.0", 8827, "CLOSED")]
ord_df = spark.createDataFrame(data, ("order_id", "order_date", "order_customer_id", "order_status"))

closed_df = (ord_df.where("order_status = 'CLOSED'")
             .select(date_format(to_date('order_date', 'yyyy-MM-dd HH:mm:SS.s'), 'MM/dd/yyyy')
                     .alias("formate_date"))).show()
"""
+------------+
|formate_date|
+------------+
| 07/25/2013|
| 07/25/2013|
+------------+
"""
ord_df.createOrReplaceTempView("orders")
cld_df = spark.sql("""select order_id, date_format(to_date(order_date, "yyyy-MM-dd HH:mm:SS.s"), "MM/dd/yyyy") as order_date,
                             order_customer_id, order_status
                      from orders where order_status = 'CLOSED'""").show()
"""
+--------+----------+-----------------+------------+
|order_id|order_date|order_customer_id|order_status|
+--------+----------+-----------------+------------+
| 1|07/25/2013| 11599| CLOSED|
| 4|07/25/2013| 8827| CLOSED|
+--------+----------+-----------------+------------+
"""

Spark Aggregating multiple columns (possible to array) from join output

I've below datasets
Table1
Table2
Now I would like to get the dataset below. I tried a left outer join on Table1.id == Table2.departmentid, but I am not getting the desired output.
Later, I need to use this table to get several counts and convert the data into XML. I will be doing that conversion using map.
Any help would be appreciated.
Joining alone is not enough to get the desired output. You are probably missing that the last element of each nested array should be the departmentid. Assuming the last element of each nested array is the departmentid, I generated the output in the following way:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.collect_list

case class department(id: Integer, deptname: String)
case class employee(employeid: Integer, empname: String, departmentid: Integer)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val department_df = Seq(department(1, "physics"),
  department(2, "computer")).toDF()
val emplyoee_df = Seq(employee(1, "A", 1),
  employee(2, "B", 1),
  employee(3, "C", 2),
  employee(4, "D", 2)).toDF()

val result = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left")
  .selectExpr("id", "deptname", "employeid", "empname")
  .rdd.map {
    case Row(id: Integer, deptname: String, employeid: Integer, empname: String) =>
      (id, deptname, Array(employeid.toString, empname, id.toString))
  }.toDF("id", "deptname", "arrayemp")
  .groupBy("id", "deptname")
  .agg(collect_list("arrayemp").as("emplist"))
  .orderBy("id", "deptname")
The output looks like this:
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
Explanation: if I break the last DataFrame transformation down into multiple steps, it will probably make clearer how the output is generated.
Left outer join between department_df and emplyoee_df:
val df1 = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left")
  .selectExpr("id", "deptname", "employeid", "empname")
df1.show()
+---+--------+---------+-------+
| id|deptname|employeid|empname|
+---+--------+---------+-------+
| 1| physics| 2| B|
| 1| physics| 1| A|
| 2|computer| 4| D|
| 2|computer| 3| C|
+---+--------+---------+-------+
Creating an array from some of the columns of the df1 DataFrame:
val df2 = df1.rdd.map {
  case Row(id: Integer, deptname: String, employeid: Integer, empname: String) =>
    (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp")
df2.show()
+---+--------+---------+
| id|deptname| arrayemp|
+---+--------+---------+
| 1| physics|[2, B, 1]|
| 1| physics|[1, A, 1]|
| 2|computer|[4, D, 2]|
| 2|computer|[3, C, 2]|
+---+--------+---------+
Creating a new list by aggregating the arrays of the df2 DataFrame:
val result = df2.groupBy("id", "deptname")
  .agg(collect_list("arrayemp").as("emplist"))
  .orderBy("id", "deptname")
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
An alternative is to build the per-employee array with array() and aggregate it with collect_list, without going through an RDD:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val df = spark.sparkContext.parallelize(Seq(
  (1, "Physics"),
  (2, "Computer"),
  (3, "Maths")
)).toDF("ID", "Dept")

val schema = List(
  StructField("EMPID", IntegerType, true),
  StructField("EMPNAME", StringType, true),
  StructField("DeptID", IntegerType, true)
)
val data = Seq(
  Row(1, "A", 1),
  Row(2, "B", 1),
  Row(3, "C", 2),
  Row(4, "D", 2),
  Row(5, "E", null)
)
val df_emp = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

val newdf = df_emp.withColumn("CONC", array($"EMPID", $"EMPNAME", $"DeptID"))
  .groupBy($"DeptID")
  .agg(expr("collect_list(CONC) as emplist"))

df.join(newdf, df.col("ID") === newdf.col("DeptID")).select($"ID", $"Dept", $"emplist").show()
+---+--------+--------------------+
| ID|    Dept|             emplist|
+---+--------+--------------------+
|  1| Physics|[[1, A, 1], [2, B...|
|  2|Computer|[[3, C, 2], [4, D...|
+---+--------+--------------------+

How to remove words that have less than three letters in PySpark?

I have a 'text' column in which arrays of tokens are stored. How can I filter all these arrays so that only tokens at least three letters long are kept?
from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

columns = ['id', 'text']
vals = [
    (1, ['I', 'am', 'good']),
    (2, ['You', 'are', 'ok']),
]
df = spark.createDataFrame(vals, columns)
df.show()

# Had tried this but got TypeError: Column is not iterable
# df_clean = df.select('id', regexp_replace('text', [len(word) >= 3 for word
#                      in col('text')], ''))
# df_clean.show()
I expect to see:
id | text
1 | [good]
2 | [You, are]
This does it; you can decide whether or not to exclude rows that end up empty. I added an extra column and then filtered, but the options are yours:
from pyspark.sql import functions as f

columns = ['id', 'text']
vals = [
    (1, ['I', 'am', 'good']),
    (2, ['You', 'are', 'ok']),
    (3, ['ok'])
]
df = spark.createDataFrame(vals, columns)
# df.show()

df2 = df.withColumn("text_left_over", f.expr("filter(text, x -> not(length(x) < 3))"))
df2.show()

# This is the actual piece of logic you are looking for.
df3 = df.withColumn("text_left_over", f.expr("filter(text, x -> not(length(x) < 3))")) \
        .where(f.size(f.col("text_left_over")) > 0).drop("text")
df3.show()
returns:
+---+--------------+--------------+
| id| text|text_left_over|
+---+--------------+--------------+
| 1| [I, am, good]| [good]|
| 2|[You, are, ok]| [You, are]|
| 3| [ok]| []|
+---+--------------+--------------+
+---+--------------+
| id|text_left_over|
+---+--------------+
| 1| [good]|
| 2| [You, are]|
+---+--------------+
This is the solution:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

filter_length_udf = udf(lambda row: [x for x in row if len(x) >= 3], ArrayType(StringType()))
df_final_words = df_stemmed.withColumn('words_filtered', filter_length_udf(col('words')))
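For instance, applied to the df from the question (which has id and text columns rather than the answerer's df_stemmed and words), the same UDF would be used like this; a minimal sketch:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

# Keep only the tokens that are at least three characters long.
filter_length_udf = udf(lambda row: [x for x in row if len(x) >= 3], ArrayType(StringType()))

df_clean = df.withColumn('text', filter_length_udf(col('text')))
df_clean.show()
# Per the expected output in the question: id 1 -> [good], id 2 -> [You, are]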

pyspark. zip arrays in a dataframe

I have the following PySpark DataFrame:
+------+----------------+
| id| data |
+------+----------------+
| 1| [10, 11, 12]|
| 2| [20, 21, 22]|
| 3| [30, 31, 32]|
+------+----------------+
At the end, I want to have the following DataFrame
+--------+----------------------------------+
| id | data |
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+
In order to do this, I first extract the data arrays as follows:
tmp_array = df_test.select("data").rdd.flatMap(lambda x: x).collect()
a0 = tmp_array[0]
a1 = tmp_array[1]
a2 = tmp_array[2]
samples = zip(a0, a1, a2)
samples1 = sc.parallelize(samples)
In this way, I have in samples1 an RDD with the content
[[10,20,30],[11,21,31],[12,22,32]]
Question 1: Is that a good way to do it?
Question 2: How to include that RDD back into the dataframe?
Here is a way to get your desired output without serializing to rdd or using a udf. You will need two constants:
The number of rows in your DataFrame (df.count())
The length of data (given)
Use pyspark.sql.functions.collect_list() and pyspark.sql.functions.array() in a double list comprehension to pick out the elements of "data" in the order you want using pyspark.sql.Column.getItem():
import pyspark.sql.functions as f

dataLength = 3
numRows = df.count()

df.select(
    f.collect_list("id").alias("id"),
    f.array(
        [
            f.array(
                [f.collect_list("data").getItem(j).getItem(i)
                 for j in range(numRows)]
            )
            for i in range(dataLength)
        ]
    ).alias("data")
).show(truncate=False)
#+---------+------------------------------------------------------------------------------+
#|id |data |
#+---------+------------------------------------------------------------------------------+
#|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
#+---------+------------------------------------------------------------------------------+
You can simply use a udf for the zip, but before that you will have to use the collect_list function:
from pyspark.sql import functions as f
from pyspark.sql import types as t

def zipUdf(array):
    # zip returns an iterator in Python 3, so materialize it as a list
    return list(zip(*array))

zipping = f.udf(zipUdf, t.ArrayType(t.ArrayType(t.IntegerType())))

df.select(
    f.collect_list(df.id).alias('id'),
    zipping(f.collect_list(df.data)).alias('data')
).show(truncate=False)
which would give you
+---------+------------------------------------------------------------------------------+
|id |data |
+---------+------------------------------------------------------------------------------+
|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
+---------+------------------------------------------------------------------------------+

Column name when using concat

I am concatenating some columns in Spark SQL using the concat function. Here is some dummy code:
import org.apache.spark.sql.functions.{concat, lit}
val data1 = sc.parallelize(Array(2, 0, 3, 4, 5))
val data2 = sc.parallelize(Array(4, 0, 0, 6, 7))
val data3 = sc.parallelize(Array(1, 2, 3, 4, 10))
val dataJoin = data1.zip(data2).zip(data3).map((x) => (x._1._1, x._1._2, x._2 )).toDF("a1","a2","a3")
val dataConcat = dataJoin.select($"a1",concat(lit("["),$"a1", lit(","),$"a2", lit(","),$"a3", lit("]")))
Is there a way to specify or change the label of this column in order to avoid the default name, which is not very practical?
+---+------------------------+
| a1|concat([,a1,,,a2,,,a3,])|
+---+------------------------+
| 2| [2,4,1]|
| 0| [0,0,2]|
| 3| [3,0,3]|
| 4| [4,6,4]|
| 5| [5,7,10]|
+---+------------------------+
Use the as or alias method on the Column returned by concat to give it a name.
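For example, in PySpark the equivalent is Column.alias; a minimal sketch, assuming a DataFrame dataJoin with the same a1, a2, a3 columns as above (the a_concat label is just an illustrative choice):

from pyspark.sql.functions import concat, lit, col

# Name the concatenated column explicitly instead of keeping the generated concat(...) label.
dataConcat = dataJoin.select(
    col("a1"),
    concat(lit("["), col("a1"), lit(","), col("a2"), lit(","), col("a3"), lit("]")).alias("a_concat")
)
dataConcat.show()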

Resources