Remove empty string from list (Spark Dataframe) [duplicate] - apache-spark

This question already has answers here:
How to remove nulls with array_remove Spark SQL Built-in Function
(5 answers)
Closed 3 years ago.
Current Dataframe
+-----------------+--------------------+
|__index_level_0__| Text_obj_col|
+-----------------+--------------------+
| 1| [ ,entrepreneurs]|
| 2|[eat, , human, poop]|
| 3| [Manafort, case]|
| 4| [Sunar, Khatris, ]|
| 5|[become, arrogant, ]|
| 6| [GPS, get, name, ]|
| 7|[exactly, reality, ]|
+-----------------+--------------------+
I want the empty strings removed from the lists.
This is test data; the actual data is pretty big. How can I do this in PySpark?

You could use a udf for this task:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def filter_empty(l):
    # keep only non-null, non-empty strings
    return [x for x in l if x is not None and len(x) > 0]

filter_empty_udf = udf(filter_empty, ArrayType(StringType()))
df.select(filter_empty_udf("Text_obj_col").alias("Text_obj_col")).show(10, False)
Tested on a few rows from your sample:
+------------------+
|Text_obj_col |
+------------------+
|[entrepreneurs] |
|[eat, human, poop]|
+------------------+
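If you are on Spark 2.4 or later, a UDF may not be needed at all. A minimal sketch using built-in array functions (assuming the blank entries are empty or whitespace-only strings):
from pyspark.sql.functions import array_remove, expr

# drop elements equal to the empty string
df.select(array_remove("Text_obj_col", "").alias("Text_obj_col")).show(10, False)

# or, if the blanks may contain whitespace, keep only non-blank entries
df.select(expr("filter(Text_obj_col, x -> trim(x) != '')").alias("Text_obj_col")).show(10, False)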

Related

Spark: use value of a groupBy column as a name for an aggregate column [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 3 years ago.
I want to give aggregate column name which contains a value of one of the groupBy columns:
dataset
.groupBy("user", "action")
.agg(collect_list("timestamp").name($"action" + "timestamps"))
this part: .name($"action") does not work because name expects a String, not a Column.
Based on: How to pivot Spark DataFrame?
val df = spark.createDataFrame(Seq(("U1","a",1), ("U2","b",2))).toDF("user", "action", "timestamp")
val res = df.groupBy("user", "action").pivot("action").agg(collect_list("timestamp"))
res.show()
+----+------+---+---+
|user|action| a| b|
+----+------+---+---+
| U1| a|[1]| []|
| U2| b| []|[2]|
+----+------+---+---+
The fun part is the column renaming: we need to rename all but the first two columns.
val renames = res.schema.names.drop(2).map (n => col(n).as(n + "_timestamp"))
res.select((col("user") +: renames): _*).show
+----+-----------+-----------+
|user|a_timestamp|b_timestamp|
+----+-----------+-----------+
| U1| [1]| []|
| U2| []| [2]|
+----+-----------+-----------+
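For readers working in PySpark, a rough equivalent sketch of the same pivot-and-rename approach (assuming a DataFrame df with the same user, action and timestamp columns):
from pyspark.sql.functions import col, collect_list

res = df.groupBy("user", "action").pivot("action").agg(collect_list("timestamp"))
# rename every pivoted column (everything after the first two) to <action>_timestamp
renames = [col(n).alias(n + "_timestamp") for n in res.columns[2:]]
res.select(col("user"), *renames).show()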

How to combine and sort different dataframes into one?

Given two dataframes, which may have completely different schemas except for an index column (timestamp in this case), such as df1 and df2 below:
df1:
timestamp | length | width
1 | 10 | 20
3 | 5 | 3
df2:
timestamp | name | length
0 | "sample" | 3
2 | "test" | 6
How can I combine these two dataframes into one that would look something like this:
df3:
timestamp | df1 | df2
| length | width | name | length
0 | null | null | "sample" | 3
1 | 10 | 20 | null | null
2 | null | null | "test" | 6
3 | 5 | 3 | null | null
I am extremely new to spark, so this might not actually make a lot of sense. But the problem I am trying to solve is: I need to combine these dataframes so that later I can convert each row to a given object. However, they have to be ordered by timestamp, so when I write these objects out, they are in the correct order.
So for example, given the df3 above, I would be able to generate the following list of objects:
objs = [
ObjectType1(timestamp=0, name="sample", length=3),
ObjectType2(timestamp=1, length=10, width=20),
ObjectType1(timestamp=2, name="test", length=6),
ObjectType2(timestamp=3, length=5, width=3)
]
Perhaps combining the dataframes does not make sense, but how could I sort the dataframes individually and somehow grab the Rows from each one of them ordered by timestamp globally?
P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns of same name and type, but represent completely different data, so merging schema is not a possibility.
What you need is a full outer join, possibly renaming one of the columns; something like df1.join(df2.withColumnRenamed("length","length2"), Seq("timestamp"), "full_outer").
See this example, built from yours (just less typing):
// data shaped as your example
case class t1(ts:Int, width:Int,l:Int)
case class t2(ts:Int, name:String, l:Int)
// create data frames
val df1 = Seq(t1(1,10,20),t1(3,5,3)).toDF
val df2 = Seq(t2(0,"sample",3),t2(2,"test",6)).toDF
df1.join(df2.withColumnRenamed("l","l2"),Seq("ts"),"full_outer").sort("ts").show
+---+-----+----+------+----+
| ts|width| l| name| l2|
+---+-----+----+------+----+
| 0| null|null|sample| 3|
| 1| 10| 20| null|null|
| 2| null|null| test| 6|
| 3| 5| 3| null|null|
+---+-----+----+------+----+
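The same idea sketched in PySpark, assuming df1 and df2 are the DataFrames from the question (the overlapping length column is renamed on one side before joining):
df3 = (df1
       .join(df2.withColumnRenamed("length", "length2"), on="timestamp", how="full_outer")
       .sort("timestamp"))
df3.show()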

How to find the avg length of each column in pyspark? [duplicate]

This question already has an answer here:
Apply a transformation to multiple columns pyspark dataframe
(1 answer)
Closed 4 years ago.
I have created a data frame like below:
from pyspark.sql import Row
l = [('Ankit','25','Ankit','Ankit'),('Jalfaizy','2.2','Jalfaizy',"aa"),('saurabh','230','saurabh',"bb"),('Bala','26',"aa","bb")]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], ages=x[1],lname=x[2],mname=x[3]))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.show()
+----+--------+-----+--------+
|ages| lname|mname| name|
+----+--------+-----+--------+
| 25| Ankit|Ankit| Ankit|
| 2.2|Jalfaizy| aa|Jalfaizy|
| 230| saurabh| bb| saurabh|
| 26| aa| bb| Bala|
+----+--------+-----+--------+
I want to find the average length of each column, i.e. the total number of characters in a column divided by the number of rows. Below is my expected output:
+----+--------+-----+--------+
|ages| lname|mname| name|
+----+--------+-----+--------+
|2.5 | 5.5 | 2.75 | 6 |
+----+--------+-----+--------+
This is actually pretty straightforward. We will be using a projection for the column length and an aggregation for the avg:
from pyspark.sql.functions import length, col, avg
selection = ['lname','mname','name']
schemaPeople \
.select(*(length(col(c)).alias(c) for c in selection)) \
.agg(*(avg(col(c)).alias(c) for c in selection)).show()
# +-----+-----+----+
# |lname|mname|name|
# +-----+-----+----+
# | 5.5| 2.75| 6.0|
# +-----+-----+----+
This way, you'll be able to pass the names of the columns dynamically.
What we are doing here is unpacking the argument list (selection).
Reference: Control Flow Tools - Unpacking Argument Lists.
I think you can just create new columns for the individual lengths and then group and aggregate the dataframe. You would end up with something like:
from pyspark.sql.functions import avg, col, length

df_new = spark.createDataFrame([
( "25","Ankit","Ankit","Ankit"),( "2.2","Jalfaizy","aa","Jalfaizy"),
("230","saurabh","bb","saurabh") ,( "26","aa","bb","Bala")
], ("age", "lname","mname","name"))
df_new.withColumn("len_age",length(col("age"))).withColumn("len_lname",length(col("lname")))\
.withColumn("len_mname",length(col("mname"))).withColumn("len_name",length(col("name")))\
.groupBy().agg(avg("len_age"),avg("len_lname"),avg("len_mname"),avg("len_name")).show()
Result:
+------------+--------------+--------------+-------------+
|avg(len_age)|avg(len_lname)|avg(len_mname)|avg(len_name)|
+------------+--------------+--------------+-------------+
| 2.5| 5.5| 2.75| 6.0|
+------------+--------------+--------------+-------------+
In Scala it can be done this way; I guess it can be converted to Python by the author:
val averageColumnList = List("age", "lname", "mname", "name")
val columns = averageColumnList.map(name => avg(length(col(name))))
val result = df.select(columns: _*)
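A possible Python translation of that Scala snippet, as a sketch (assuming the df_new DataFrame from the previous answer):
from pyspark.sql.functions import avg, col, length

average_column_list = ["age", "lname", "mname", "name"]
columns = [avg(length(col(name))) for name in average_column_list]
df_new.select(*columns).show()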

Update Pyspark rows for a column based on other column [duplicate]

This question already has answers here:
Spark Equivalent of IF Then ELSE
(4 answers)
PySpark: modify column values when another column value satisfies a condition
(2 answers)
Closed 4 years ago.
I have a data frame in pyspark like below.
df.show()
+---+----+
| id|name|
+---+----+
| 1| sam|
| 2| Tim|
| 3| Jim|
| 4| sam|
+---+----+
Now I added a new column to the df like below
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
new_df = df.withColumn('new_column', lit(None).cast(StringType()))
Now when I query the new_df
new_df.show()
+---+----+----------+
| id|name|new_column|
+---+----+----------+
| 1| sam| null|
| 2| Tim| null|
| 3| Jim| null|
| 4| sam| null|
+---+----+----------+
Now I want to update the value in new_column based on a condition.
I am trying to write the below condition but unable to do so.
if name is sam then new_column should be tested else not_tested
if name == sam:
then update new_column to tested
else:
new_column == not_tested
How can I achieve this in PySpark?
Edit
I am not looking for an if-else statement, but for how to update the values of a column in PySpark.
@user9367133 Thank you for reaching out. If you follow my answer on the similar question you pointed to, it's pretty much the same logic:
from pyspark.sql.functions import *
new_df\
.drop(new_df.new_column)\
.withColumn('new_column',when(new_df.name == "sam","tested").otherwise('not_tested'))\
.show()
You don't necessarily have to add new_column beforehand as null if you are just going to replace it with proper values immediately, but I wasn't sure about the use case, so I kept it in my example (a shortened version is sketched below).
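A minimal sketch of that shorter version, assuming the original df from the question; new_column is derived directly with when/otherwise, without pre-creating the null column:
from pyspark.sql.functions import col, when

df.withColumn("new_column", when(col("name") == "sam", "tested").otherwise("not_tested")).show()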
hope this helps, cheers!

How do I solve this query in Hive and Spark? [duplicate]

This question already has answers here:
Explode in PySpark
(2 answers)
Closed 5 years ago.
Write a Hive SQL query to produce the output shown below from this input:
id name dob
-------------------------
1 anjan 10-16-1989
output:
id name dob
-------------------------
1 a 10-16-1989
1 n 10-16-1989
1 j 10-16-1989
1 a 10-16-1989
1 n 10-16-1989
Also, solve the above scenario in Spark and display the same output.
Assuming you have a dataframe (name it data) that comes from Hive like this:
+---+-----+----------+
| id| name| dob|
+---+-----+----------+
| 1|anjan|10-16-1989|
+---+-----+----------+
you can define a user-defined function in Spark that transforms a string into an array:
val toArray = udf((name: String) => name.toArray.map(_.toString))
Having that, we can apply it to the name column:
val df = data.withColumn("name", toArray(data("name")))
+---+---------------+----------+
| id| name| dob|
+---+---------------+----------+
| 1|[a, n, j, a, n]|10-16-1989|
+---+---------------+----------+
We can now use the explode function on the name column:
df.withColumn("name", explode(df("name")))
+---+----+----------+
| id|name| dob|
+---+----+----------+
| 1| a|10-16-1989|
| 1| n|10-16-1989|
| 1| j|10-16-1989|
| 1| a|10-16-1989|
| 1| n|10-16-1989|
+---+----+----------+
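A rough PySpark sketch of the same idea, assuming a DataFrame data with the id, name and dob columns: split the name into characters and explode.
from pyspark.sql.functions import col, explode, split

(data
 # splitting on an empty pattern yields one array element per character;
 # depending on the Spark version this can leave a trailing empty element,
 # so blanks are filtered out afterwards
 .withColumn("name", split(col("name"), ""))
 .withColumn("name", explode(col("name")))
 .filter(col("name") != "")
 .show())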
