In Spark dataframe how to Transpose rows to columns? - apache-spark

This may be a very simple question. I want to transpose all the rows of a dataframe into columns, converting the input DF below into the output DF shown further down. What are the ways to achieve this in Spark?
Note: the input DF has a single column.
import sparkSession.sqlContext.implicits._
val df = Seq(("row1"), ("row2"), ("row3"), ("row4"), ("row5")).toDF("COLUMN_NAME")
df.show(false)
Input DF:
+-----------+
|COLUMN_NAME|
+-----------+
|row1       |
|row2       |
|row3       |
|row4       |
|row5       |
+-----------+
Output DF:
+----+----+----+----+----+
|row1|row2|row3|row4|row5|
+----+----+----+----+----+

Does this help you? Tag every row with the same group key, pivot COLUMN_NAME into column headers, and drop the helper column:
import org.apache.spark.sql.functions.{first, lit}
df.withColumn("group", lit(1)).groupBy("group")
  .pivot("COLUMN_NAME").agg(first("COLUMN_NAME"))
  .drop("group").show(false)

Related

Round all columns in dataframe - two decimal place pyspark

I use this command, once per column, to round the columns of my dataframe to 2 decimal places:
data = data.withColumn("columnName1", func.round(data["columnName1"], 2))
I have no idea how to round the whole dataframe with one command (rather than each column separately). Could somebody help me, please? I don't want to repeat the same command 50 times with different column names.
There is no single function or command that applies the rounding to all columns at once, but you can iterate over them:
+-----+-----+
| col1| col2|
+-----+-----+
|1.111|2.222|
+-----+-----+
from pyspark.sql.functions import round

df = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv")
for c in df.columns:
    df = df.withColumn(c, round(c, 2))
df.show()
+----+----+
|col1|col2|
+----+----+
|1.11|2.22|
+----+----+
To avoid touching non-floating-point columns:
import pyspark.sql.functions as F
for c_name, c_type in df.dtypes:
    if c_type in ('double', 'float'):
        df = df.withColumn(c_name, F.round(c_name, 2))
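The same filtering can also be done in a single select rather than repeated withColumn calls; a minimal sketch, assuming df holds the data read above:
import pyspark.sql.functions as F

# Round only the floating-point columns in one select; other columns pass
# through unchanged.
df = df.select([
    F.round(F.col(c), 2).alias(c) if t in ("double", "float") else F.col(c)
    for c, t in df.dtypes
])
df.show()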

Dataframe in Pyspark

I was just dropping a column from a dataframe. It looked like it was dropped, but after calling the show method again the column still appears in the dataframe.
Code:
df.drop('Salary').show()
+-----+
| Name|
+-----+
| Arun|
| Joe|
|Jerry|
+-----+
df.show()
+-----+------+
| Name|Salary|
+-----+------+
| Arun| 5000|
| Joe| 6300|
|Jerry| 9600|
+-----+------+
I am using Spark version 2.4.4. Could you please tell me why it's not dropped? I thought it would work like dropping a column from a table in an Oracle database.
The drop method returns a new DataFrame. The original df is not changed by this transformation, so calling df.show() a second time still returns the original data with your Salary column.
You need to keep the dataframe returned by drop:
df2 = df.drop('Salary')
df2.show()
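To make the immutability explicit, here is a minimal sketch assuming a SparkSession named spark and the sample data from the question:
# DataFrames are immutable: transformations return new DataFrames.
df = spark.createDataFrame(
    [("Arun", 5000), ("Joe", 6300), ("Jerry", 9600)], ["Name", "Salary"]
)

df2 = df.drop("Salary")  # drop returns a new DataFrame
df2.show()               # only the Name column
df.show()                # the original df still has Salary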

How to add multiple empty columns to a PySpark Dataframe at specific locations

I have researched this a lot but I am unable to find a way to add multiple columns to a PySpark dataframe at specific positions.
I have the dataframe that looks like this:
Customer_id First_Name Last_Name
I want to add 3 empty columns at 3 different positions and my final resulting dataframe needs to look like this:
Customer_id Address First_Name Email_address Last_Name Phone_no
Is there an easy way to do this, like the way you can with reindex in pandas?
# Creating a DataFrame.
from pyspark.sql.functions import col, lit

df = sqlContext.createDataFrame(
    [('1', 'Moritz', 'Schulz'), ('2', 'Sandra', 'Schröder')],
    ('Customer_id', 'First_Name', 'Last_Name')
)
df.show()
+-----------+----------+---------+
|Customer_id|First_Name|Last_Name|
+-----------+----------+---------+
| 1| Moritz| Schulz|
| 2| Sandra| Schröder|
+-----------+----------+---------+
You can use the lit() function to add empty columns, and once they are created you can use select to reorder the columns in the order you wish.
df = df.withColumn('Address', lit('')) \
       .withColumn('Email_address', lit('')) \
       .withColumn('Phone_no', lit('')) \
       .select(
           'Customer_id', 'Address', 'First_Name',
           'Email_address', 'Last_Name', 'Phone_no'
       )
df.show()
+-----------+-------+----------+-------------+---------+--------+
|Customer_id|Address|First_Name|Email_address|Last_Name|Phone_no|
+-----------+-------+----------+-------------+---------+--------+
| 1| | Moritz| | Schulz| |
| 2| | Sandra| | Schröder| |
+-----------+-------+----------+-------------+---------+--------+
As suggested by user @Pault, a more concise and succinct way:
df = df.select(
    "Customer_id", lit('').alias("Address"), "First_Name",
    lit("").alias("Email_address"), "Last_Name", lit("").alias("Phone_no")
)
df.show()
+-----------+-------+----------+-------------+---------+--------+
|Customer_id|Address|First_Name|Email_address|Last_Name|Phone_no|
+-----------+-------+----------+-------------+---------+--------+
| 1| | Moritz| | Schulz| |
| 2| | Sandra| | Schröder| |
+-----------+-------+----------+-------------+---------+--------+
If you want something even more succinct, which I feel is shorter:
from pyspark.sql import functions as F

for col in ["mycol1", "mycol2", "mycol3", "mycol4", "mycol5", "mycol6"]:
    df = df.withColumn(col, F.lit(None))
You can then select the same list of names to set the column order.
(edit) Note: withColumn in a for loop is usually quite slow. Don't do that for a large number of columns; prefer a single select statement, like:
select_statement = []
for col in ["mycol1", "mycol2", "mycol3", "mycol4", "mycol5", "mycol6"]:
    select_statement.append(F.lit(None).alias(col))
df = df.select(*df.columns, *select_statement)
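If the target positions are known, the select list can also be built programmatically so the new columns land exactly where you want them. A sketch, assuming df again holds only the original three columns (Customer_id, First_Name, Last_Name) and using a hypothetical position-to-name mapping:
from pyspark.sql import functions as F

# Insert empty columns at chosen positions by building the select list.
new_cols = {1: "Address", 3: "Email_address", 5: "Phone_no"}  # position -> name

select_list = [F.col(c) for c in df.columns]
for pos in sorted(new_cols):
    select_list.insert(pos, F.lit("").alias(new_cols[pos]))

df = df.select(*select_list)
df.show()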

Easy way to center a column in a Spark DataFrame

I want to center a column in a Spark DataFrame, i.e., subtract each element in the column by the mean of the column. Currently, I do it manually, i.e., first calculate the mean of a column, get the value out of the reduced DataFrame, and then subtract the column by the average. I wonder whether there is an easy way to do this in Spark? Any built-in function to do it?
There is no built-in function for this, but you can use a user-defined function (udf) as below:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{mean, udf}
import spark.implicits._

val df = spark.sparkContext.parallelize(List(
  (2.06, 0.56),
  (1.96, 0.72),
  (1.70, 0.87),
  (1.90, 0.64))).toDF("c1", "c2")

def subMean(mean: Double) = udf[Double, Double]((value: Double) => value - mean)

def getCenterDF(df: DataFrame, col: String): DataFrame = {
  val avg = df.select(mean(col)).first().getAs[Double](0)
  df.withColumn(col, subMean(avg)(df(col)))
}
scala> df.show(false)
+----+----+
|c1 |c2 |
+----+----+
|2.06|0.56|
|1.96|0.72|
|1.7 |0.87|
|1.9 |0.64|
+----+----+
scala> getCenterDF(df, "c2").show(false)
+----+--------------------+
|c1 |c2 |
+----+--------------------+
|2.06|-0.13750000000000007|
|1.96|0.022499999999999853|
|1.7 |0.17249999999999988 |
|1.9 |-0.05750000000000011|
+----+--------------------+
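For comparison, the same centering can be done without a UDF. A PySpark sketch using a window that spans all rows, assuming a dataframe df with the columns c1 and c2 shown above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Subtract the column mean computed over a window covering every row.
# Note: an unpartitioned window moves all rows into a single partition, so
# this is fine for modest data but costly for very large dataframes.
w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn("c2", F.col("c2") - F.avg("c2").over(w))
df.show()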

Trim in a Pyspark Dataframe

I have a PySpark dataframe (original dataframe) with the data below (all columns have string datatype). In my use case I am not sure which columns are present in this input dataframe; the user just passes me the dataframe and asks me to trim all of its columns. Data in a typical dataframe looks like this:
id  Value    Value1
1   "Text "  "Avb"
2   1504     " Test"
3   1        2
Is there any way I can do this without depending on which columns are present in the dataframe, and get all of its columns trimmed? Data after trimming all the columns should look like this:
id  Value   Value1
1   "Text"  "Avb"
2   1504    "Test"
3   1       2
Can someone help me out? How can I achieve this with a PySpark dataframe? Any help will be appreciated.
Input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1|Text | Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Code:
import pyspark.sql.functions as func
for col in df.columns:
    df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1| Text| Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Using the trim() function from @osbon123's answer:
from pyspark.sql.functions import col, trim

for c_name in df.columns:
    df = df.withColumn(c_name, trim(col(c_name)))
You should avoid using withColumn in a loop because each call creates a new DataFrame, which is time-consuming for very large dataframes. I created the following function based on this solution; it works with any dataframe, even one that has both string and non-string columns.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def trim_string_columns(of_data: DataFrame) -> DataFrame:
    data_trimmed = of_data.select([
        (F.trim(c.name).alias(c.name) if isinstance(c.dataType, StringType) else c.name)
        for c in of_data.schema
    ])
    return data_trimmed
This is the cleanest (and most computationally efficient) way I've seen to strip the spaces from all column names. If you want underscores to replace the spaces, simply replace "" with "_".
# Standardize column names: remove spaces (use "_" instead of "" for underscores)
new_column_name_list = list(map(lambda x: x.replace(" ", ""), df.columns))
df = df.toDF(*new_column_name_list)
You can use the dtypes function in the DataFrame API to get the list of column names along with their datatypes, and then for all string columns use the trim function to trim the values.
Regards,
Neeraj
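A minimal sketch of that approach, assuming df is the dataframe to trim:
from pyspark.sql import functions as F

# Trim only the columns that df.dtypes reports as string; pass the rest through.
df = df.select([
    F.trim(F.col(c)).alias(c) if t == "string" else F.col(c)
    for c, t in df.dtypes
])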
