This question already has answers here:
How to melt Spark DataFrame?
(6 answers)
Unpivot in Spark SQL / PySpark
(2 answers)
Closed 4 years ago.
I am using Spark SQL 2.2.0 and the DataFrame/Dataset API.
I need to explode several columns, one per row.
I have:
+------+------+------+------+------+
|col1 |col2 |col3 |col4 |col5 |
+------+------+------+------+------+
|val11 |val21 |val31 |val41 |val51 |
|val12 |val22 |val32 |val42 |val52 |
+------+------+------+------+------+
And I need to have the following:
+------+------+---------+---------+
|col1 |col2 |col_num |col_new |
+------+------+---------+---------+
|val11 |val21 |col3 |val31 |
|val11 |val21 |col4 |val41 |
|val11 |val21 |col5 |val51 |
|val12 |val22 |col3 |val32 |
|val12 |val22 |col4 |val42 |
|val12 |val22 |col5 |val52 |
+------+------+---------+---------+
I managed to combine the columns into an array and explode it like this:
val df2 = df.select(col("col1"), col("col2"), array(col("col3"), col("col4"), col("col5")) as "array")
val df3 = df2.withColumn("array", explode(col("array")))
This works, but it does not add the col_num column (which I need). I tried to do it with flatMap using a custom map function, but it fails.
Could you please help me with how to do this?
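For reference, a minimal PySpark sketch of one common approach (the snippet above is Scala, but the same idea carries over): build an array of structs that pairs each column name with its value, then explode that array. The intermediate name kv is only illustrative.
from pyspark.sql import functions as F

cols_to_explode = ["col3", "col4", "col5"]

result = df.select(
    "col1", "col2",
    # one struct per source column: (literal column name, column value)
    F.explode(F.array(*[
        F.struct(F.lit(c).alias("col_num"), F.col(c).alias("col_new"))
        for c in cols_to_explode
    ])).alias("kv")
).select("col1", "col2", "kv.col_num", "kv.col_new")

result.show()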
This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 3 years ago.
I want to give the aggregate column a name that contains a value from one of the groupBy columns:
dataset
.groupBy("user", "action")
.agg(collect_list("timestamp").name($"action" + "timestamps"))
This part, .name($"action" + "timestamps"), does not work because name expects a String, not a Column.
Based on: How to pivot Spark DataFrame?
val df = spark.createDataFrame(Seq(("U1","a",1), ("U2","b",2))).toDF("user", "action", "timestamp")
val res = df.groupBy("user", "action").pivot("action").agg(collect_list("timestamp"))
res.show()
+----+------+---+---+
|user|action| a| b|
+----+------+---+---+
| U1| a|[1]| []|
| U2| b| []|[2]|
+----+------+---+---+
The fun part is the column renaming. We should rename all but the first 2 columns:
val renames = res.schema.names.drop(2).map(n => col(n).as(n + "_timestamp"))
res.select((col("user") +: renames): _*).show
+----+-----------+-----------+
|user|a_timestamp|b_timestamp|
+----+-----------+-----------+
| U1| [1]| []|
| U2| []| [2]|
+----+-----------+-----------+
This question already has answers here:
How to remove nulls with array_remove Spark SQL Built-in Function
(5 answers)
Closed 3 years ago.
Current DataFrame:
+-----------------+--------------------+
|__index_level_0__| Text_obj_col|
+-----------------+--------------------+
| 1| [ ,entrepreneurs]|
| 2|[eat, , human, poop]|
| 3| [Manafort, case]|
| 4| [Sunar, Khatris, ]|
| 5|[become, arrogant, ]|
| 6| [GPS, get, name, ]|
| 7|[exactly, reality, ]|
+-----------------+--------------------+
I want the empty strings removed from the lists.
This is test data; the actual data is pretty big. How can I do this in PySpark?
You could use a udf for this task:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def filter_empty(l):
    # keep only non-null, non-empty elements; list() ensures the UDF returns
    # a concrete list rather than a lazy filter object under Python 3
    return list(filter(lambda x: x is not None and len(x) > 0, l))

filter_empty_udf = udf(filter_empty, ArrayType(StringType()))
df.select(filter_empty_udf("Text_obj_col").alias("Text_obj_col")).show(10, False)
Tested on a few rows from your sample:
+------------------+
|Text_obj_col |
+------------------+
|[entrepreneurs] |
|[eat, human, poop]|
+------------------+
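If your Spark version allows it, a built-in function avoids the Python udf entirely: array_remove (the function named in the duplicate above) drops every element equal to a given value, but it was only added in Spark 2.4, and the question does not state a version. This sketch also assumes the offending elements are empty strings rather than whitespace:
from pyspark.sql import functions as F

# remove all elements equal to the empty string from each array
df.select(F.array_remove("Text_obj_col", "").alias("Text_obj_col")).show(10, False)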
This question already has an answer here:
Apply a transformation to multiple columns pyspark dataframe
(1 answer)
Closed 4 years ago.
I have created a data frame like below:
from pyspark.sql import Row
l = [('Ankit','25','Ankit','Ankit'),('Jalfaizy','2.2','Jalfaizy',"aa"),('saurabh','230','saurabh',"bb"),('Bala','26',"aa","bb")]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], ages=x[1],lname=x[2],mname=x[3]))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.show()
+----+--------+-----+--------+
|ages| lname|mname| name|
+----+--------+-----+--------+
| 25| Ankit|Ankit| Ankit|
| 2.2|Jalfaizy| aa|Jalfaizy|
| 230| saurabh| bb| saurabh|
| 26| aa| bb| Bala|
+----+--------+-----+--------+
I want to find the average length of each column, i.e. the total number of characters in a column divided by the number of rows. Below is my expected output:
+----+--------+-----+--------+
|ages|   lname|mname|    name|
+----+--------+-----+--------+
| 2.5|     5.5| 2.75|       6|
+----+--------+-----+--------+
This is actually pretty straightforward. We will be using a projection for the column lengths and an aggregation for the average:
from pyspark.sql.functions import length, col, avg
selection = ['lname','mname','name']
schemaPeople \
.select(*(length(col(c)).alias(c) for c in selection)) \
.agg(*(avg(col(c)).alias(c) for c in selection)).show()
# +-----+-----+----+
# |lname|mname|name|
# +-----+-----+----+
# | 5.5| 2.75| 6.0|
# +-----+-----+----+
This way, you'll be able to pass the column names dynamically.
What we are doing here is unpacking the argument list (selection).
Reference: Control Flow Tools - Unpacking Argument Lists.
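To make the unpacking concrete, a small sketch of what the select(*(...)) call expands to (column names as in the question):
from pyspark.sql.functions import col, length

selection = ['lname', 'mname', 'name']
exprs = [length(col(c)).alias(c) for c in selection]  # a list of Column expressions

# select(*exprs) passes each Column as a separate argument,
# exactly as if you had written select(exprs[0], exprs[1], exprs[2])
schemaPeople.select(*exprs).show()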
I think you can just create new columns holding the individual lengths and then aggregate over the whole dataframe. Then you would end up with something like:
df_new = spark.createDataFrame([
( "25","Ankit","Ankit","Ankit"),( "2.2","Jalfaizy","aa","Jalfaizy"),
("230","saurabh","bb","saurabh") ,( "26","aa","bb","Bala")
], ("age", "lname","mname","name"))
df_new.withColumn("len_age",length(col("age"))).withColumn("len_lname",length(col("lname")))\
.withColumn("len_mname",length(col("mname"))).withColumn("len_name",length(col("name")))\
.groupBy().agg(avg("len_age"),avg("len_lname"),avg("len_mname"),avg("len_name")).show()
Result:
+------------+--------------+--------------+-------------+
|avg(len_age)|avg(len_lname)|avg(len_mname)|avg(len_name)|
+------------+--------------+--------------+-------------+
| 2.5| 5.5| 2.75| 6.0|
+------------+--------------+--------------+-------------+
In Scala it can be done this way; I guess it can be converted to Python by the author:
val averageColumnList = List("age", "lname", "mname", "name")
val columns = averageColumnList.map(name => avg(length(col(name))))
val result = df.select(columns: _*)
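A hedged sketch of that Python conversion, applied to the question's schemaPeople (note its ages column name):
from pyspark.sql.functions import avg, col, length

average_column_list = ["ages", "lname", "mname", "name"]
columns = [avg(length(col(name))).alias(name) for name in average_column_list]
schemaPeople.select(*columns).show()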
This question already has answers here:
How do I add a new column to a Spark DataFrame (using PySpark)?
(10 answers)
Primary keys with Apache Spark
(4 answers)
Closed 4 years ago.
I want to add a column to a Spark dataframe which has been registered as a table. This column needs to hold an auto-incrementing long.
df = spark.sql(query)
df.createOrReplaceTempView("user_stories")
df = spark.sql("ALTER TABLE user_stories ADD COLUMN rank int AUTO_INCREMENT")
df.show(5)
This throws the following error:
Py4JJavaError: An error occurred while calling o72.sql.
: org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'ALTER TABLE user_stories ADD COLUMN'(line 1, pos 29)
== SQL ==
ALTER TABLE user_stories ADD COLUMN rank int AUTO_INCREMENT
-----------------------------^^^
What am I missing here?
If you want to add a new incremental column to your DataFrame, you could do it as follows.
df.show()
+-------+
| name|
+-------+
|gaurnag|
+-------+
from pyspark.sql.functions import monotonically_increasing_id
new_df = df.withColumn("id", monotonically_increasing_id())
new_df.show()
+-------+---+
| name| id|
+-------+---+
|gaurnag| 0|
+-------+---+
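One caveat: monotonically_increasing_id() guarantees unique, increasing values, but not consecutive ones. If strictly consecutive numbers are needed, a hedged sketch using row_number over a window is below; ordering by name is only an example, and without a partitionBy Spark will pull all rows into a single partition.
from pyspark.sql import Window
from pyspark.sql.functions import row_number

w = Window.orderBy("name")  # example ordering; no partitionBy => single partition
new_df = df.withColumn("id", row_number().over(w) - 1)  # subtract 1 to start at 0
new_df.show()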
This question already has answers here:
Multiple Aggregate operations on the same column of a spark dataframe
(6 answers)
Closed 5 years ago.
To group a Spark data frame with PySpark, I use a command like this:
df2 = df.groupBy('_c1','_c3').agg({'_c4':'max', '_c2' : 'avg'})
As a result I get output like this:
+-----------------+-------------+------------------+--------+
| _c1| _c3| avg(_c2)|max(_c4)|
+-----------------+-------------+------------------+--------+
| Local-gov| HS-grad| 644952.5714285715| 9|
| Local-gov| Assoc-acdm|365081.64285714284| 12|
| Never-worked| Some-college| 462294.0| 10|
| Local-gov| Bachelors| 398296.35| 13|
| Federal-gov| HS-grad| 493293.0| 9|
| Private| 12th| 632520.5454545454| 8|
| State-gov| Assoc-voc| 412814.0| 11|
| ?| HS-grad| 545870.9230769231| 9|
| Private| Prof-school|340322.89130434784| 15|
+-----------------+-------------+------------------+--------+
Which is nice, but there are two things that I miss:
I would like to have control over the column names. For example, I want the new column to be named avg_c2 instead of avg(_c2).
I want to aggregate the same column in different ways. For example, I might want to know the minimum and maximum of column _c4. I tried the following and it does not work:
df2 = df.groupBy('_c1','_c3').agg({'_c4':('min','max'), '_c2' : 'avg'})
Is there a way to achieve what I need?
You have to use the withColumn API to generate new columns or replace the old ones.
Or you can use alias to get the required column name instead of the default avg(_c2).
I haven't used PySpark yet, but in Scala I do something like this:
import org.apache.spark.sql.functions._
df2 = df.groupBy("_c1","_c3").agg(max(col("_c4")).alias("max_c4"), min(col("_c4")).alias("min_c4"), avg(col("_c2")).alias("avg_c2"))