Round all columns in dataframe - two decimal place pyspark - apache-spark

I have this command for all columns in my dataframe to round to 2 decimal places:
data = data.withColumn("columnName1", func.round(data["columnName1"], 2))
I have no idea how to round all Dataframe by the one command (not every column separate). Could somebody help me, please? I don't want to have the same command 50times with different column name.

There is not a function or command for applying all functions to the columns but you can iterate.
+-----+-----+
| col1| col2|
+-----+-----+
|1.111|2.222|
+-----+-----+
df = spark.read.option("header","true").option("inferSchema","true").csv("test.csv")
for c in df.columns:
df = df.withColumn(c, round(c, 2))
df.show()
+----+----+
|col1|col2|
+----+----+
|1.11|2.22|
+----+----+

To avoid converting non-FP columns:
import pyspark.sql.functions as F
for c_name, c_type in df.dtypes:
if c_type in ('double', 'float'):
df = df.withColumn(c_name, F.round(c_name, 2))

Related

Pyspark replace string in every column name

I am converting Pandas commands into Spark ones. I bumped into wanting to convert this line into Apache Spark code:
This line replaces every two spaces into one.
df = df.columns.str.replace(' ', ' ')
Is it possible to replace a string from all columns using Spark?
I came into this, but it is not quite right.
df = df.withColumnRenamed('--', '-')
To be clear I want this
//+---+----------------------+-----+
//|id |address__test |state|
//+---+----------------------+-----+
to this
//+---+----------------------+-----+
//|id |address_test |state|
//+---+----------------------+-----+
You can apply the replace method on all columns by iterating over them and then selecting, like so:
df = spark.createDataFrame([(1, 2, 3)], "id: int, address__test: int, state: int")
df.show()
+---+-------------+-----+
| id|address__test|state|
+---+-------------+-----+
| 1| 2| 3|
+---+-------------+-----+
from pyspark.sql.functions import col
new_cols = [col(c).alias(c.replace("__", "_")) for c in df.columns]
df.select(*new_cols).show()
+---+------------+-----+
| id|address_test|state|
+---+------------+-----+
| 1| 2| 3|
+---+------------+-----+
On the sidenote: calling withColumnRenamed makes Spark create a Projection for each distinct call, while a select makes just single Projection, hence for large number of columns, select will be much faster.
Here's a suggestion.
We get all the target columns:
columns_to_edit = [col for col in df.columns if "__" in col]
Then we use a for loop to edit them all one by one:
for column in columns_to_edit:
new_column = column.replace("__", "_")
df = df.withColumnRenamed(column, new_column)
Would this solve your issue?

Pyspark - Find sub-string from a column of data-frame with another data-frame

I have two different dataframes in Pyspark of String type. First dataframe is of single work while second is a string of words i.e., sentences. I have to check existence of first dataframe column from the second dataframe column. For example,
df2
+------+-------+-----------------+
|age|height| name| Sentences |
+---+------+-------+-----------------+
| 10| 80| Alice| 'Grace, Sarah'|
| 15| null| Bob| 'Sarah'|
| 12| null| Tom|'Amy, Sarah, Bob'|
| 13| null| Rachel| 'Tom, Bob'|
+---+------+-------+-----------------+
Second dataframe
df1
+-------+
| token |
+-------+
| 'Ali' |
|'Sarah'|
|'Bob' |
|'Bob' |
+-------+
So, how can I search for each token of df1 from df2 Sentence column. I need count for each word and add as a new column in df1
I have tried this solution, but work for a single word i.e., not for a complete column of dataframe
Considering the dataframe in the prev answer
from pyspark.sql.functions import explode,explode_outer,split, length,trim
df3 = df2.select('Sentences',explode(split('Sentences',',')).alias('friends'))
df3 = df3.withColumn("friends", trim("friends")).withColumn("length_of_friends", length("friends"))
display(df3)
df3 = df3.join(df1, df1.token == df3.friends,how='inner').groupby('friends').count()
display(df3)
You could use pyspark udf to create the new column in df1.
Problem is you cannot access a second dataframe inside udf (view here).
As advised in the referenced question, you could get sentences as broadcastable varaible.
Here is a working example :
from pyspark.sql.types import *
from pyspark.sql.functions import udf
# Instanciate df2
cols = ["age", "height", "name", "Sentences"]
data = [
(10, 80, "Alice", "Grace, Sarah"),
(15, None, "Bob", "Sarah"),
(12, None, "Tom", "Amy, Sarah, Bob"),
(13, None, "Rachel", "Tom, Bob")
]
df2 = spark.createDataFrame(data).toDF(*cols)
# Instanciate df1
cols = ["token"]
data = [
("Ali",),
("Sarah",),
("Bob",),
("Bob",)
]
df1 = spark.createDataFrame(data).toDF(*cols)
# Creating broadcast variable for Sentences column of df2
lstSentences = [data[0] for data in df2.select('Sentences').collect()]
sentences = spark.sparkContext.broadcast(lstSentences)
def countWordInSentence(word):
# Count if sentence contains word
return sum(1 for item in lstSentences if word in item)
func_udf = udf(countWordInSentence, IntegerType())
df1 = df1.withColumn("COUNT",
func_udf(df1["token"]))
df1.show()

Merge two column in spark dataframe to form single column

I have a Spark dataframe with two columns; src_edge and dest_edge. I simply want to create new spark dataframe so that it contains a single column id with values from src_edge and dest_edge.
src dst
1 2
1 3
I want to create df2 as:
id
1
1
2
3
If possible, I would also like to create df2 with no duplicates values. Does anyone have any idea how to do this?
id
1
2
3
Update
The simplest thing may be to select each column, union them, and call distinct:
from pyspark.sql.functions import col
df2 = df.select(col("src").alias("id")).union(df.select(col("dst").alias("id"))).distinct()
df2.show()
#+---+
#| id|
#+---+
#| 1|
#| 3|
#| 2|
#+---+
You can also accomplish this with an outer join:
df2 = df.select(col("src").alias("id"))\
.join(
df.select(col("dst").alias("id")),
on="id",
how="outer"
)\
.distinct()
Create a new column using array and explode to combine and flatten the two columns. Then, to remove duplicates use dropDuplicates:
from pyspark.sql.functions import array, explode
df2 = df.select(explode(array("src", "dst")).alias("id"))
.dropDuplicates()

Pyspark Replicate Row based on column value

I would like to replicate all rows in my DataFrame based on the value of a given column on each row, and than index each new row. Suppose I have:
Column A Column B
T1 3
T2 2
I want the result to be:
Column A Column B Index
T1 3 1
T1 3 2
T1 3 3
T2 2 1
T2 2 2
I was able to to something similar with fixed values, but not by using the information found on the column. My current working code for fixed values is:
idx = [lit(i) for i in range(1, 10)]
df = df.withColumn('Index', explode(array( idx ) ))
I tried to change:
lit(i) for i in range(1, 10)
to
lit(i) for i in range(1, df['Column B'])
and add it into my array() function:
df = df.withColumn('Index', explode(array( lit(i) for i in range(1, df['Column B']) ) ))
but it does not work (TypeError: 'Column' object cannot be interpreted as an integer).
How should I implement this?
Unfortunately you can't iterate over a Column like that. You can always use a udf, but I do have a non-udf hack solution that should work for you if you're using Spark version 2.1 or higher.
The trick is to take advantage of pyspark.sql.functions.posexplode() to get the index value. We do this by creating a string by repeating a comma Column B times. Then we split this string on the comma, and use posexplode to get the index.
df.createOrReplaceTempView("df") # first register the DataFrame as a temp table
query = 'SELECT '\
'`Column A`,'\
'`Column B`,'\
'pos AS Index '\
'FROM ( '\
'SELECT DISTINCT '\
'`Column A`,'\
'`Column B`,'\
'posexplode(split(repeat(",", `Column B`), ",")) '\
'FROM df) AS a '\
'WHERE a.pos > 0'
newDF = sqlCtx.sql(query).sort("Column A", "Column B", "Index")
newDF.show()
#+--------+--------+-----+
#|Column A|Column B|Index|
#+--------+--------+-----+
#| T1| 3| 1|
#| T1| 3| 2|
#| T1| 3| 3|
#| T2| 2| 1|
#| T2| 2| 2|
#+--------+--------+-----+
Note: You need to wrap the column names in backticks since they have spaces in them as explained in this post: How to express a column which name contains spaces in Spark SQL
You can try this:
from pyspark.sql.window import Window
from pyspark.sql.functions import *
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql import functions as F
df = spark.read.csv('/FileStore/tables/stack1.csv', header = 'True', inferSchema = 'True')
w = Window.orderBy("Column A")
df = df.select(row_number().over(w).alias("Index"), col("*"))
n_to_array = udf(lambda n : [n] * n ,ArrayType(IntegerType()))
df2 = df.withColumn('Column B', n_to_array('Column B'))
df3= df2.withColumn('Column B', explode('Column B'))
df3.show()

Trim in a Pyspark Dataframe

I have a Pyspark dataframe(Original Dataframe) having below data(all columns have string datatype). In my use case i am not sure of what all columns are there in this input dataframe. User just pass me the name of dataframe and ask me to trim all the columns of this dataframe. Data in a typical dataframe looks like as below:
id Value Value1
1 "Text " "Avb"
2 1504 " Test"
3 1 2
Is there anyway i can do it without being dependent on what all columns are present in this dataframe and get all the column trimmed in this dataframe. Data after trimming aall the columns of dataframe should look like.
id Value Value1
1 "Text" "Avb"
2 1504 "Test"
3 1 2
Can someone help me out? How can i achieve it using Pyspark dataframe? Any help will be appreciated.
input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1|Text | Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Code:
import pyspark.sql.functions as func
for col in df.columns:
df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1| Text| Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Using trim() function in #osbon123's answer.
from pyspark.sql.functions import trim
for c_name in df.columns:
df = df.withColumn(c_name, trim(col(c_name)))
You should avoid using withColumn because it creates a new DataFrame which is time-consuming for very large dataframes. I created the following function based on this solution, but now it works with any dataframe even when it has string and non-string columns.
from pyspark.sql import functions as F
def trim_string_columns(of_data: DataFrame) -> DataFrame:
data_trimmed = of_data.select([
(F.trim(c.name).alias(c.name) if isinstance(c.dataType, StringType) else c.name) for c in of_data.schema
])
return data_trimmed
This is the cleanest (and most computationally efficient) way I've seen it done to trim all spaces in all columns. If you want underscores to replace spaces, simply replace "" with "_".
# Standardize Column names no spaces to underscore
new_column_name_list = list(map(lambda x: x.replace(" ", ""), df.columns))
df = df.toDF(*new_column_name_list)
You can use dtypes function in DataFrame API to get the list of Cloumn Names along with their Datatypes and then for all string columns use "trim" function to trim the values.
Regards,
Neeraj

Resources