pyspark max string length for each column in the dataframe - apache-spark

I am trying this in Azure Databricks with PySpark.
Please let me know which PySpark libraries need to be imported, and the code to get the output below.
Example:
Input dataframe:
+-----------------+---------------+----------+-----------+
| column1         | column2       | column3  | column4   |
+-----------------+---------------+----------+-----------+
| a               | bbbbb         | cc       | >dddddddd |
| >aaaaaaaaaaaaaa | bb            | c        | dddd      |
| aa              | >bbbbbbbbbbbb | >ccccccc | ddddd     |
| aaaaa           | bbbb          | ccc      | d         |
+-----------------+---------------+----------+-----------+
Output dataframe:
+---------+-----------+
| column  | maxLength |
+---------+-----------+
| column1 | 14        |
| column2 | 12        |
| column3 | 7         |
| column4 | 8         |
+---------+-----------+

>>> from pyspark.sql import functions as sf
>>> df = sc.parallelize([['a','bbbbb','ccc','ddd'],['aaaa','bbb','ccccccc', 'dddd']]).toDF(["column1", "column2", "column3", "column4"])
>>> df1 = df.select([sf.length(col).alias(col) for col in df.columns])
>>> df1.groupby().max().show()
+------------+------------+------------+------------+
|max(column1)|max(column2)|max(column3)|max(column4)|
+------------+------------+------------+------------+
| 4| 5| 7| 4|
+------------+------------+------------+------------+
Then melt (unpivot) the resulting single-row dataframe into the desired two-column format (the original answer linked to another post for that step).
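Since the linked melt post is not reproduced here, a minimal sketch of that step (assuming the df1 aggregation above) using the SQL stack() generator rather than the linked approach:

# the single-row dataframe of maxima computed above
maxes = df1.groupby().max()

# unpivot it into (column, maxLength) pairs with the SQL stack() generator
stack_args = ", ".join("'{0}', `max({0})`".format(c) for c in df.columns)
maxes.selectExpr("stack({}, {}) as (column, maxLength)".format(len(df.columns), stack_args)).show()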
Edit: (From Iterate through each column and find the max length)
Single line select
from pyspark.sql.functions import col, length, max

df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
Output:
+-------+-------+-------+-------+
|column1|column2|column3|column4|
+-------+-------+-------+-------+
|     14|     12|      7|      8|
+-------+-------+-------+-------+
As Rows
from pyspark.sql import Row

df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
row = df.first().asDict()
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
Output:
+-------+------+
|    col|length|
+-------+------+
|column1|    14|
|column2|    12|
|column3|     7|
|column4|     8|
+-------+------+

Related

Collapse DataFrame using Window functions

I would like to collapse the rows in a dataframe based on an ID column and count the number of records per ID using window functions. Doing this, I would like to avoid partitioning the window by ID, because this would result in a very large number of partitions.
I have a dataframe of the form
+----+-----------+-----------+-----------+
| ID | timestamp | metadata1 | metadata2 |
+----+-----------+-----------+-----------+
| 1  | 09:00     | ABC       | apple     |
| 1  | 08:00     | NULL      | NULL      |
| 1  | 18:00     | XYZ       | apple     |
| 2  | 07:00     | NULL      | banana    |
| 5  | 23:00     | ABC       | cherry    |
+----+-----------+-----------+-----------+
where I would like to keep only the records with the most recent timestamp per ID, such that I have
+----+-----------+-----------+-----------+-------+
| ID | timestamp | metadata1 | metadata2 | count |
+----+-----------+-----------+-----------+-------+
| 1  | 18:00     | XYZ       | apple     | 3     |
| 2  | 07:00     | NULL      | banana    | 1     |
| 5  | 23:00     | ABC       | cherry    | 1     |
+----+-----------+-----------+-----------+-------+
I have tried:
import sys
from pyspark.sql import Window
from pyspark.sql.functions import asc, desc, first, count, col, row_number

window = Window.orderBy([asc('ID'), desc('timestamp')])
window_count = Window.orderBy([asc('ID'), desc('timestamp')]).rowsBetween(-sys.maxsize, sys.maxsize)
columns_metadata = ['metadata1', 'metadata2']
df = df.select(
    *(first(col_name, ignorenulls=True).over(window).alias(col_name) for col_name in columns_metadata),
    count(col('ID')).over(window_count).alias('count')
)
df = df.withColumn("row_tmp", row_number().over(window)).filter(col('row_tmp') == 1).drop(col('row_tmp'))
which is in part based on How to select the first row of each group?
However, without the use of pyspark.sql.Window.partitionBy, this does not give the desired output.
I only read that you wanted to avoid partitioning by ID after I had posted this; this is the only approach I could think of.
Your dataframe:
df = sqlContext.createDataFrame(
    [
        ('1', '09:00', 'ABC', 'apple'),
        ('1', '08:00', '', ''),
        ('1', '18:00', 'XYZ', 'apple'),
        ('2', '07:00', '', 'banana'),
        ('5', '23:00', 'ABC', 'cherry')
    ],
    ['ID', 'timestamp', 'metadata1', 'metadata2']
)
We can use rank over a window partitioned by ID and ordered by timestamp descending:
from pyspark.sql.window import Window
import pyspark.sql.functions as F

w1 = Window.partitionBy(df['ID']).orderBy(F.desc('timestamp'))
w2 = Window.partitionBy(df['ID'])

df\
    .withColumn("rank", F.rank().over(w1))\
    .withColumn("count", F.count('ID').over(w2))\
    .filter(F.col('rank') == 1)\
    .select('ID', 'timestamp', 'metadata1', 'metadata2', 'count')\
    .show()
+---+---------+---------+---------+-----+
| ID|timestamp|metadata1|metadata2|count|
+---+---------+---------+---------+-----+
| 1| 18:00| XYZ| apple| 3|
| 2| 07:00| | banana| 1|
| 5| 23:00| ABC| cherry| 1|
+---+---------+---------+---------+-----+
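As a side note, not part of the answer above: if the only goal is to avoid Window.partitionBy, a plain groupBy plus a join back to the original dataframe gives the same result. A minimal sketch, assuming (ID, timestamp) uniquely identifies a row and that the timestamps compare correctly as strings:

import pyspark.sql.functions as F

# latest timestamp and record count per ID, computed without any window
latest = df.groupBy('ID').agg(
    F.max('timestamp').alias('timestamp'),
    F.count('ID').alias('count')
)

# join back to recover the metadata columns of the latest row per ID
latest.join(df, on=['ID', 'timestamp'])\
    .select('ID', 'timestamp', 'metadata1', 'metadata2', 'count')\
    .show()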

Get max length in column for each column in a dataframe [duplicate]

This question already has answers here:
In spark iterate through each column and find the max length
(3 answers)
Closed 3 years ago.
I have a Spark dataframe like this:
+-----------------+---------------+----------+-----------+
| column1         | column2       | column3  | column4   |
+-----------------+---------------+----------+-----------+
| a               | bbbbb         | cc       | >dddddddd |
| >aaaaaaaaaaaaaa | bb            | c        | dddd      |
| aa              | >bbbbbbbbbbbb | >ccccccc | ddddd     |
| aaaaa           | bbbb          | ccc      | d         |
+-----------------+---------------+----------+-----------+
I would like to find the length of the longest element in each column, to obtain something like this:
+---------+-----------+
| column  | maxLength |
+---------+-----------+
| column1 | 14        |
| column2 | 12        |
| column3 | 7         |
| column4 | 8         |
+---------+-----------+
I know how to do it column by column, but I don't know how to tell Spark to do it for all columns.
I am using Scala Spark.
You can use the agg function with max and length to achieve it:
import org.apache.spark.sql.functions.{col, length, max}
import spark.implicits._

val x = df.columns.map(colName => {
  (colName, df.agg(max(length(col(colName)))).head().getAs[Integer](0))
}).toSeq.toDF("column", "maxLength")
Output:
+-------+---------+
|column |maxLength|
+-------+---------+
|column1|14       |
|column2|12       |
|column3|7        |
|column4|8        |
+-------+---------+
Another way is:
df.select(df.columns.map(c => max(length(col(c))).as(s"max_${c}")): _*)
Output:
+-----------+-----------+-----------+-----------+
|max_column1|max_column2|max_column3|max_column4|
+-----------+-----------+-----------+-----------+
|14         |12         |7          |8          |
+-----------+-----------+-----------+-----------+

Combine dataframes columns consisting of multiple values - Spark

I have two Spark dataframes that share the same ID column:
df1:
+------+---------+---------+
| ID   | Name1   | Name2   |
+------+---------+---------+
| 1    | A       | B       |
| 2    | C       | D       |
| 3    | E       | F       |
+------+---------+---------+
df2:
+------+-------+
| ID   | key   |
+------+-------+
| 1    | w     |
| 1    | x     |
| 2    | y     |
| 3    | z     |
+------+-------+
Now, I want to create a new column in df1 that contains all key values denoted in df2. So, I aim for the result:
+------+---------+---------+---------+
| ID   | Name1   | Name2   | keys    |
+------+---------+---------+---------+
| 1    | A       | B       | w,x     |
| 2    | C       | D       | y       |
| 3    | E       | F       | z       |
+------+---------+---------+---------+
Ultimately, I want to find a solution for an arbitrary number of keys.
My attempt in PySpark:
def get_keys(id):
    x = df2.where(df2.ID == id).select('key')
    return x

df_keys = df1.withColumn("keys", get_keys(col('ID')))
In the above code, x is a dataframe. Since the second argument of withColumn needs to be a Column, I am not sure how to convert x correctly.
You are looking for the collect_list function.
from pyspark.sql.functions import collect_list
df3 = df1.join(df2, df1.ID == df2.ID).drop(df2.ID)
df3.groupBy('ID','Name1','Name2').agg(collect_list('key').alias('keys')).show()
#+---+-----+-----+------+
#| ID|Name1|Name2|  keys|
#+---+-----+-----+------+
#|  1|    A|    B|[w, x]|
#|  3|    E|    F|   [z]|
#|  2|    C|    D|   [y]|
#+---+-----+-----+------+
If you want only unique keys, you can use collect_set.
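A minimal sketch of that variant, reusing df3 from above and also rendering the keys as the comma-separated string shown in the desired output (note that the order inside a collected set is not guaranteed):

from pyspark.sql.functions import collect_set, concat_ws

# deduplicate the keys per group and join them into a single comma-separated string
df3.groupBy('ID', 'Name1', 'Name2')\
    .agg(concat_ws(',', collect_set('key')).alias('keys'))\
    .show()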

Getting a column as concatenated column from a reference table and primary id's from a Dataset

I'm trying to get concatenated data as a single column using the datasets below.
Sample DS:
val df = sc.parallelize(Seq(
  ("a", 1, 2, 3),
  ("b", 4, 6, 5)
)).toDF("value", "id1", "id2", "id3")
+-------+-----+-----+-----+
| value | id1 | id2 | id3 |
+-------+-----+-----+-----+
| a     | 1   | 2   | 3   |
| b     | 4   | 6   | 5   |
+-------+-----+-----+-----+
From the reference dataset:
+----+----------+--------+
| id | descr    | parent |
+----+----------+--------+
| 1  | apple    | fruit  |
| 2  | banana   | fruit  |
| 3  | cat      | animal |
| 4  | dog      | animal |
| 5  | elephant | animal |
| 6  | Flight   | object |
+----+----------+--------+
val ref = sc.parallelize(Seq(
  (1, "apple", "fruit"),
  (2, "banana", "fruit"),
  (3, "cat", "animal"),
  (4, "dog", "animal"),
  (5, "elephant", "animal"),
  (6, "Flight", "object")
)).toDF("id", "descr", "parent")
I am trying to get the desired output below:
+-----------------------+------------------------+
| desc                  | parent                 |
+-----------------------+------------------------+
| apple+banana+cat/M    | fruit+fruit+animal/M   |
| dog+Flight+elephant/M | animal+object+animal/M |
+-----------------------+------------------------+
Also, I need to concatenate id2 and id3 only if they are not null; otherwise only id1 should be used.
I am breaking my head over the solution.
Exploding the first dataframe df, joining it to ref, and then grouping should work as you expect:
val dfNew = df.withColumn("id", explode(array("id1", "id2", "id3")))
  .select("id", "value")

ref.join(dfNew, Seq("id"))
  .groupBy("value")
  .agg(
    concat_ws("+", collect_list("descr")) as "desc",
    concat_ws("+", collect_list("parent")) as "parent"
  )
  .drop("value")
  .show()
Output:
+-------------------+--------------------+
|desc |parent |
+-------------------+--------------------+
|Flight+elephant+dog|object+animal+animal|
|apple+cat+banana |fruit+animal+fruit |
+-------------------+--------------------+
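The null condition from the question is not handled above. A hedged PySpark sketch of the same explode-and-join idea (the Scala version would be analogous) that simply drops null ids before joining:

from pyspark.sql import functions as F

# explode the id columns, discard null ids, then join to the reference table
dfNew = df.withColumn("id", F.explode(F.array("id1", "id2", "id3")))\
    .where(F.col("id").isNotNull())\
    .select("id", "value")

ref.join(dfNew, ["id"])\
    .groupBy("value")\
    .agg(F.concat_ws("+", F.collect_list("descr")).alias("desc"),
         F.concat_ws("+", F.collect_list("parent")).alias("parent"))\
    .drop("value")\
    .show(truncate=False)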

Spark SQL Column Manipulation

I have a Dataset which has the columns below.
df.show();
+--------+---------+---------+---------+---------+
| Col1   | Col2    | Expend1 | Expend2 | Expend3 |
+--------+---------+---------+---------+---------+
| Value1 | Cvalue1 | 123     | 2254    | 22      |
| Value1 | Cvalue2 | 124     | 2255    | 23      |
+--------+---------+---------+---------+---------+
I want that to be changed to the below format using joins, cube, or any other operations.
1.
+--------+---------+------+
| Value1 | Cvalue1 | 123  |
| Value1 | Cvalue1 | 2254 |
| Value1 | Cvalue1 | 22   |
| Value1 | Cvalue2 | 124  |
| Value1 | Cvalue2 | 2255 |
| Value1 | Cvalue2 | 23   |
+--------+---------+------+
Or, better, this format:
2.
+--------+---------+---------+------+
| Value1 | Cvalue1 | Expend1 | 123  |
| Value1 | Cvalue1 | Expend2 | 2254 |
| Value1 | Cvalue1 | Expend3 | 22   |
| Value1 | Cvalue2 | Expend1 | 124  |
| Value1 | Cvalue2 | Expend2 | 2255 |
| Value1 | Cvalue2 | Expend3 | 23   |
+--------+---------+---------+------+
Can I achieve either of the two formats above? And in case of #1, can I get the column name for each value, i.e. whether it came from Expend1, Expend2, or Expend3?
Functions map and then explode can be used:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

val data = List(
  ("Value1", "Cvalue1", 123, 2254, 22),
  ("Value1", "Cvalue2", 124, 2255, 23)
)
val df = data.toDF("Col1", "Col2", "Expend1", "Expend2", "Expend3")

// build an alternating (name, value) list of columns for the map() function
val unpivotedColumns = List("Expend1", "Expend2", "Expend3")
val columnMapping = unpivotedColumns.foldLeft(new ArrayBuffer[Column]())((acc, current) => {
  acc += lit(current)
  acc += col(current)
})

val mapped = df.select($"Col1", $"Col2", map(columnMapping: _*).alias("result"))
val result = mapped.select($"Col1", $"Col2", explode($"result"))
result.show(false)
Result is:
+------+-------+-------+-----+
|Col1 |Col2 |key |value|
+------+-------+-------+-----+
|Value1|Cvalue1|Expend1|123 |
|Value1|Cvalue1|Expend2|2254 |
|Value1|Cvalue1|Expend3|22 |
|Value1|Cvalue2|Expend1|124 |
|Value1|Cvalue2|Expend2|2255 |
|Value1|Cvalue2|Expend3|23 |
+------+-------+-------+-----+
You can do this with the Hive function stack:
df.selectExpr(
  "col1",
  "col2",
  "stack(3, 'Expend1', Expend1, 'Expend2', Expend2, 'Expend3', Expend3) as (Expend, Value)"
).show(false)
+------+-------+-------+-----+
|col1 |col2 |Expend |Value|
+------+-------+-------+-----+
|Value1|Cvalue1|Expend1|123 |
|Value1|Cvalue1|Expend2|2254 |
|Value1|Cvalue1|Expend3|22 |
|Value1|Cvalue2|Expend1|124 |
|Value1|Cvalue2|Expend2|2255 |
|Value1|Cvalue2|Expend3|23 |
+------+-------+-------+-----+
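Since the page's main question is about PySpark, it may be worth noting that the same stack() expression also works from PySpark through selectExpr (a sketch assuming the column names above):

# the same Hive stack() unpivot, called from PySpark
df.selectExpr(
    "Col1",
    "Col2",
    "stack(3, 'Expend1', Expend1, 'Expend2', Expend2, 'Expend3', Expend3) as (Expend, Value)"
).show(truncate=False)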
You can convert the three columns to an array and explode it:
import org.apache.spark.sql.functions._

df.withColumn("Expend", explode(array("Expend1", "Expend2", "Expend3")))
  .drop("Expend1", "Expend2", "Expend3")
To keep the column name along with the value, you can do as below:
df.withColumn("Expend1", concat_ws(":", lit("Expend1"), $"Expend1"))
  .withColumn("Expend2", concat_ws(":", lit("Expend2"), $"Expend2"))
  .withColumn("Expend3", concat_ws(":", lit("Expend3"), $"Expend3"))
  .withColumn("Expend", explode(array("Expend1", "Expend2", "Expend3")))
  .drop("Expend1", "Expend2", "Expend3")
  .withColumn("ExpendColumn", split($"Expend", ":")(0))
  .withColumn("Expend", split($"Expend", ":")(1))
  .show(false)
I hope this was helpful
Using a udf function you can achieve your 2nd required dataframe.
val columns = df.select("Expend1","Expend2","Expend3").columns
import org.apache.spark.sql.functions._
def arrayStructUdf = udf((columnNames: collection.mutable.WrappedArray[String], columnValues: collection.mutable.WrappedArray[String]) => columnNames.zip(columnValues).map(x => (x._1, x._2)).toArray)
Then call the udf, drop the three original columns, explode the newly formed column, and finally select the desired columns:
df.withColumn("new", arrayStructUdf(array(columns.map(x => lit(x)): _*), array(columns.map(col): _*)))
  .drop("Expend1", "Expend2", "Expend3")
  .withColumn("new", explode(col("new")))
  .select("Col1", "Col2", "new.*")
You should have the 2nd required dataframe
+------+-------+-------+----+
|Col1 |Col2 |_1 |_2 |
+------+-------+-------+----+
|Value1|Cvalue1|Expend1|123 |
|Value1|Cvalue1|Expend2|2254|
|Value1|Cvalue1|Expend3|22 |
|Value1|Cvalue2|Expend1|124 |
|Value1|Cvalue2|Expend2|2255|
|Value1|Cvalue2|Expend3|23 |
+------+-------+-------+----+
