I have a PySpark dataframe (the original dataframe) with the data below (all columns have string datatype). In my use case I don't know in advance which columns this input dataframe contains; the user just passes me the dataframe and asks me to trim all of its columns. Data in a typical dataframe looks like this:
id Value Value1
1 "Text " "Avb"
2 1504 " Test"
3 1 2
Is there any way I can do this without depending on which columns are present in the dataframe, so that every column gets trimmed? After trimming all the columns, the data should look like this:
id Value Value1
1 "Text" "Avb"
2 1504 "Test"
3 1 2
Can someone help me out? How can I achieve this with a PySpark dataframe? Any help will be appreciated.
Input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1|Text | Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Code:
import pyspark.sql.functions as func
for col in df.columns:
    df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1| Text| Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Using the trim() function from @osbon123's answer:
from pyspark.sql.functions import col, trim

for c_name in df.columns:
    df = df.withColumn(c_name, trim(col(c_name)))
You should avoid calling withColumn in a loop, because each call creates a new DataFrame with its own projection, which becomes time-consuming for very wide dataframes. I created the following function based on this solution; it works with any dataframe, even one that mixes string and non-string columns.
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import StringType

def trim_string_columns(of_data: DataFrame) -> DataFrame:
    # Trim the string columns only; pass every other column through untouched
    data_trimmed = of_data.select([
        (F.trim(F.col(c.name)).alias(c.name) if isinstance(c.dataType, StringType) else F.col(c.name))
        for c in of_data.schema
    ])
    return data_trimmed
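A quick usage sketch (assuming a dataframe named df like the one above):

# Only the string columns of df get trimmed; non-string columns are returned as-is
df_trimmed = trim_string_columns(df)
df_trimmed.show()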
This is the cleanest (and most computationally efficient) way I've seen to strip the spaces out of all column names. If you want underscores in place of the spaces, simply replace "" with "_".
# Standardize column names: remove spaces (use "_" instead of "" to turn them into underscores)
new_column_name_list = list(map(lambda x: x.replace(" ", ""), df.columns))
df = df.toDF(*new_column_name_list)
You can use the dtypes attribute of the DataFrame API to get the list of column names along with their datatypes, and then apply the "trim" function to the values of all string columns.
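A minimal sketch of that idea (assuming the dataframe is called df; string columns report the dtype "string"):

from pyspark.sql import functions as F

# df.dtypes is a list of (column_name, type_string) tuples
df_trimmed = df.select([
    F.trim(F.col(name)).alias(name) if dtype == "string" else F.col(name)
    for name, dtype in df.dtypes
])
df_trimmed.show()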
Regards,
Neeraj
Related
I have a Spark dataframe with approximately 100 columns. NULL instances are currently recorded as \N. I want to replace all instances of \N with NULL; however, because the backslash is an escape character, I'm having difficulty. I've found this article that uses regex for a single column, but I need to iterate over all columns.
I've even tried the solution from the article on a single column, but I still cannot get it to work. Ordinarily I'd use R, and I am able to solve this in R using the following code:
df <- sapply(df,function(x) {x <- gsub("\\\\N",NA,x)})
However, given I'm new to Pyspark I'm having quite a lot of difficulty.
Will this work for you?
import pyspark.sql.functions as psf
data = [ ('0','\\N','\\N','3')]
df = spark.createDataFrame(data, ['col1','col2','col3','col4'])
print('before:')
df.show()
for col in df.columns:
    df = df.withColumn(col, psf.when(psf.col(col) == psf.lit('\\N'), psf.lit(None)).otherwise(psf.col(col)))
print('after:')
df.show()
before:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 0| \N| \N| 3|
+----+----+----+----+
after:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 0|null|null| 3|
+----+----+----+----+
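On a side note, if your Spark version supports passing None as the replacement value to DataFrame.replace (an assumption on my part, I believe 2.3+), the loop can be avoided entirely, since replace does a literal, non-regex match:

# Replace the literal two-character string \N with null in every column
df = df.na.replace('\\N', None)
df.show()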
I am converting Pandas commands into Spark ones. I bumped into wanting to convert this line into Apache Spark code:
This line replaces every double space in the column names with a single space:
df = df.columns.str.replace('  ', ' ')
Is it possible to do such a replacement across all column names using Spark?
I came up with this, but it is not quite right:
df = df.withColumnRenamed('--', '-')
To be clear, I want to go from this
//+---+----------------------+-----+
//|id |address__test |state|
//+---+----------------------+-----+
to this
//+---+----------------------+-----+
//|id |address_test |state|
//+---+----------------------+-----+
You can apply the replace method on all columns by iterating over them and then selecting, like so:
df = spark.createDataFrame([(1, 2, 3)], "id: int, address__test: int, state: int")
df.show()
+---+-------------+-----+
| id|address__test|state|
+---+-------------+-----+
| 1| 2| 3|
+---+-------------+-----+
from pyspark.sql.functions import col
new_cols = [col(c).alias(c.replace("__", "_")) for c in df.columns]
df.select(*new_cols).show()
+---+------------+-----+
| id|address_test|state|
+---+------------+-----+
| 1| 2| 3|
+---+------------+-----+
On a side note: each call to withColumnRenamed makes Spark create a separate projection, whereas a single select creates just one projection, so for a large number of columns select will be much faster.
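As a sketch of that single-projection idea, you could also build the new names up front and rename everything in one pass with toDF (assuming the only change you need is __ to _):

# One projection: rename every column at once instead of chaining withColumnRenamed
df_renamed = df.toDF(*[c.replace("__", "_") for c in df.columns])
df_renamed.show()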
Here's a suggestion.
We get all the target columns:
columns_to_edit = [col for col in df.columns if "__" in col]
Then we use a for loop to edit them all one by one:
for column in columns_to_edit:
    new_column = column.replace("__", "_")
    df = df.withColumnRenamed(column, new_column)
Would this solve your issue?
I have a dataframe named DF, like this:
[image: Dataframe DF]
I have the below code
from pyspark.sql import Row

def func(row):
    temp = row.asDict()
    temp["concat_val"] = "|".join([str(x) for x in row])
    put = Row(**temp)
    return put

DF.show()
row_rdd = DF.rdd.map(func)
concat_df = row_rdd.toDF().show()
I am getting a result like this
However, I want an output that removes the id and nm column values from the concat_val column.
The table should look like the one below.
Please suggest a way to remove the id and nm values.
So here you are trying to concatenate the columns txt and uppertxt, with the values delimited by "|". You can try the code below.
# Load required libraries
from pyspark.sql.functions import *
# Create DataFrame
df = spark.createDataFrame([(1,"a","foo","qwe"), (2,"b","bar","poi"), (3,"c","mnc","qwe")], ["id", "nm", "txt", "uppertxt"])
# Concat column txt and uppertxt delimited by "|"
# Approach - 1 : using concat function.
df1 = df.withColumn("concat_val", concat(df["txt"] , lit("|"), df["uppertxt"]))
# Approach - 2 : Using concat_ws function
df1 = df.withColumn("concat_val", concat_ws("|", df["txt"] , df["uppertxt"]))
# Display Output
df1.show()
Output
+---+---+---+--------+----------+
| id| nm|txt|uppertxt|concat_val|
+---+---+---+--------+----------+
| 1| a|foo| qwe| foo|qwe|
| 2| b|bar| poi| bar|poi|
| 3| c|mnc| qwe| mnc|qwe|
+---+---+---+--------+----------+
You can find more info on concat and concat_ws in the Spark docs. One difference worth noting: concat returns NULL if any input is NULL, whereas concat_ws simply skips NULL values.
I hope this helps.
Using PySpark 2.2.
I have a Spark DataFrame with multiple columns. I need to pass 2 columns to a UDF and return a 3rd column.
Input:
+-----+------+
|col_A| col_B|
+-----+------+
| abc|abcdef|
| abc| a|
+-----+------+
Both col_A and col_B are StringType()
Desired output:
+-----+------+-------+
|col_A| col_B|new_col|
+-----+------+-------+
| abc|abcdef| abc|
| abc| a| a|
+-----+------+-------+
I want new_col to be a substring of col_A with the length of col_B.
I tried
udf_substring = F.udf(lambda x: F.substring(x[0],0,F.length(x[1])), StringType())
df.withColumn('new_col', udf_substring([F.col('col_A'),F.col('col_B')])).show()
But it gives the TypeError: Column is not iterable.
Any idea how to do such manipulation?
There are two major things wrong here.
First, you defined your udf to take one input parameter when it should take two.
Second, you can't use the API functions within the udf. (Calling the udf serializes the data to Python, so you need to use Python syntax and functions.)
Here's a proper udf implementation for this problem:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def my_substring(a, b):
    # You should add in your own error checking
    return a[:len(b)]

udf_substring = F.udf(lambda x, y: my_substring(x, y), StringType())
And then call it by passing in the two columns as arguments:
df.withColumn('new_col', udf_substring(F.col('col_A'),F.col('col_B')))
However, in this case you can do this without a udf using the method described in this post.
df.withColumn(
    'new_col',
    F.expr("substring(col_A, 0, length(col_B))")
)
I am trying to get the max alphabetical value from a dataframe as a whole. I am not interested in which row or column it came from; I am just interested in a single max value within the dataframe.
This is what it looks like:
id conditionName
1 C
2 b
3 A
4 A
5 A
expected result is:
+---+-------------+
| id|conditionName|
+---+-------------+
|  3|            A|
|  4|            A|
|  5|            A|
+---+-------------+
because 'A' is the first letter of the alphabet
df= df.withColumn("conditionName", col("conditionName").cast("String"))
.groupBy("id,conditionName").max("conditionName");
df.show(false);
Exception: "conditionName" is not a numeric column. Aggregation function can only be applied on a numeric column.;
I need the max alphabet character from the entire dataframe.
What should I use to get the desired results?
Thanks in advance!
You can sort your DataFrame by your string column, grab the first value and use it to filter your original data:
from pyspark.sql.functions import lower, desc, first
# we need lower() because ordering strings is case sensitive
first_letter = df.orderBy(lower(df["condition"])) \
    .groupBy() \
    .agg(first("condition").alias("condition")) \
    .collect()[0][0]
df.filter(df["condition"] == first_letter).show()
#+---+---------+
#| id|condition|
#+---+---------+
#| 3| A|
#| 4| A|
#| 5| A|
#+---+---------+
Or more elegantly using Spark SQL:
df.registerTempTable("table")
sqlContext.sql("""
    SELECT *
    FROM table
    WHERE lower(condition) = (SELECT min(lower(condition))
                              FROM table)
""")