Pyspark substring of one column based on the length of another column - apache-spark

Using Pyspark 2.2
I have a spark DataFrame with multiple columns. I need to input 2 columns to a UDF and return a 3rd column
Input:
+-----+------+
|col_A| col_B|
+-----+------+
| abc|abcdef|
| abc| a|
+-----+------+
Both col_A and col_B are StringType()
Desired output:
+-----+------+-------+
|col_A| col_B|new_col|
+-----+------+-------+
| abc|abcdef| abc|
| abc| a| a|
+-----+------+-------+
I want new_col to be a substring of col_A with the length of col_B.
I tried
udf_substring = F.udf(lambda x: F.substring(x[0],0,F.length(x[1])), StringType())
df.withColumn('new_col', udf_substring([F.col('col_A'),F.col('col_B')])).show()
But it gives the TypeError: Column is not iterable.
Any idea how to do such manipulation?

There are two major things wrong here.
First, you defined your udf to take one input parameter when it should take two.
Second, you can't use the DataFrame API functions inside the udf. (Calling the udf serializes the data to Python, so you need to use Python syntax and functions.)
Here's a proper udf implementation for this problem:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def my_substring(a, b):
    # You should add your own error checking here
    return a[:len(b)]

udf_substring = F.udf(lambda x, y: my_substring(x, y), StringType())
And then call it by passing in the two columns as arguments:
df.withColumn('new_col', udf_substring(F.col('col_A'),F.col('col_B')))
However, in this case you can do this without a udf using the method described in this post.
df.withColumn(
    'new_col',
    F.expr("substring(col_A, 0, length(col_B))")
)
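For reference, here is a minimal end-to-end sketch of the expr approach (the sample DataFrame below is assumed, mirroring the question's input):
import pyspark.sql.functions as F

# Hypothetical sample data matching the question's input
df = spark.createDataFrame([("abc", "abcdef"), ("abc", "a")], ["col_A", "col_B"])

df.withColumn("new_col", F.expr("substring(col_A, 0, length(col_B))")).show()
# +-----+------+-------+
# |col_A| col_B|new_col|
# +-----+------+-------+
# |  abc|abcdef|    abc|
# |  abc|     a|      a|
# +-----+------+-------+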

Related

Pyspark replace string in every column name

I am converting Pandas commands into Spark ones. I bumped into wanting to convert this line into Apache Spark code:
This line replaces every two spaces with one:
df = df.columns.str.replace('  ', ' ')
Is it possible to replace a string from all columns using Spark?
I came across this, but it is not quite right.
df = df.withColumnRenamed('--', '-')
To be clear I want this
//+---+----------------------+-----+
//|id |address__test |state|
//+---+----------------------+-----+
to this
//+---+----------------------+-----+
//|id |address_test |state|
//+---+----------------------+-----+
You can apply the replace method on all columns by iterating over them and then selecting, like so:
df = spark.createDataFrame([(1, 2, 3)], "id: int, address__test: int, state: int")
df.show()
+---+-------------+-----+
| id|address__test|state|
+---+-------------+-----+
| 1| 2| 3|
+---+-------------+-----+
from pyspark.sql.functions import col
new_cols = [col(c).alias(c.replace("__", "_")) for c in df.columns]
df.select(*new_cols).show()
+---+------------+-----+
| id|address_test|state|
+---+------------+-----+
| 1| 2| 3|
+---+------------+-----+
As a side note: each call to withColumnRenamed makes Spark create a separate Projection, while a single select creates just one Projection, so for a large number of columns select will be much faster.
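A related alternative is to reassign all column names in one pass with toDF (a minimal sketch, reusing the df defined above):
# Rename every column at once by passing the full list of new names
df_renamed = df.toDF(*[c.replace("__", "_") for c in df.columns])
df_renamed.show()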
Here's a suggestion.
We get all the target columns:
columns_to_edit = [col for col in df.columns if "__" in col]
Then we use a for loop to edit them all one by one:
for column in columns_to_edit:
    new_column = column.replace("__", "_")
    df = df.withColumnRenamed(column, new_column)
Would this solve your issue?

How to find if a spark column contains a certain value?

I have the following spark dataframe -
+----+----+
|col1|col2|
+----+----+
| a| 1|
| b|null|
| c| 3|
+----+----+
Is there a way in spark API to detect if col2 contains, say, 3? Please note that the answer should be just one indicator value - yes/no - and not the set of records that have 3 in col2.
The recommended PySpark way of finding whether a DataFrame contains a particular value is to use the pyspark.sql.Column.contains API. You can wrap the result in bool() to get a True/False value.
For your example:
bool(df.filter(df.col2.contains(3)).collect())
# Output: True
bool(df.filter(df.col2.contains(100)).collect())
# Output: False
Source : https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.Column.contains.html
By counting the number of values in col2 that are equal to 3:
import pyspark.sql.functions as f
df.agg(f.expr('sum(case when col2 = 3 then 1 else 0 end)')).first()[0] > 0
You can use when as a conditional statement:
from pyspark.sql.functions import when, col

df.select(
    when(col("col2") == '3', 'yes')
    .otherwise('no')
    .alias('col3')
)
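Note that this returns one value per row; to collapse it into the single yes/no indicator the question asks for, you could simply check whether any matching row exists (a minimal sketch):
from pyspark.sql import functions as F

# True if at least one row has col2 == 3, otherwise False
df.where(F.col("col2") == 3).limit(1).count() > 0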

Add an index to a dataframe. Pyspark 2.4.4 [duplicate]

This question already has answers here:
Spark Dataframe :How to add a index Column : Aka Distributed Data Index
(7 answers)
Closed 2 years ago.
There are a lot of posts that all give the same basic example:
dfWithIndex = df.withColumn('f_index', \
    pyspark.sql.functions.lit(1).cast(pyspark.sql.types.LongType()))
rdd = df.rdd.zipWithIndex().map(lambda row, rowId: (list(row) + [rowId + 1]))
dfIndexed = sqlContext.createDataFrame(rdd, schema=dfWithIndex.schema)
I'm really new to working with these lambdas, but printSchema-ing that rdd with a plain zipWithIndex() gave me a two-column dataframe: _1 (a struct) and _2 (a long for the index itself). That's what the lambda appears to be referencing. However I'm getting this error:
TypeError: <lambda>() missing 1 required positional argument: 'rowId'
You're close. You just need to modify the lambda function slightly. It should take a single argument, a tuple like (Row, id), and return a single Row object.
from pyspark.sql import Row
from pyspark.sql.types import StructField, LongType
df = spark.createDataFrame([['a'],['b'],['c']],['val'])
df2 = df.rdd.zipWithIndex().map(
    lambda r: Row(*r[0], r[1])
).toDF(df.schema.add(StructField('id', LongType(), False)))
df2.show()
+---+---+
|val| id|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
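If a unique but not necessarily consecutive id is acceptable, monotonically_increasing_id avoids the RDD round trip entirely (a minimal sketch):
from pyspark.sql import functions as F

# Ids are unique and increasing, but not guaranteed to be consecutive across partitions
df.withColumn('id', F.monotonically_increasing_id()).show()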

Pyspark parallelized loop of dataframe column

I have a raw PySpark DataFrame with a column that encapsulates other columns. I need to loop over all of those columns to unwrap them. I don't know the column names in advance and they could change, so I need a generic algorithm. The problem is that I can't use a classic for loop, because I need the code to run in parallel.
Example of Data:
Timestamp | Layers
1456982 | [[1, 2],[3,4]]
1486542 | [[3,5], [5,5]]
Layers is a column that contains other columns (with their own column names). My goal is to have something like this:
Timestamp | label | number1 | text | value
1456982 | 1 | 2 |3 |4
1486542 | 3 | 5 |5 |5
How can I loop over the columns with PySpark functions?
Thanks for the advice.
You can use the reduce function for this. I don't know exactly what you want to do, but let's suppose you want to add 1 to all columns:
from functools import reduce
from pyspark.sql import functions as F

def add_1(df, col_name):
    # Using the same column name updates the column in place
    return df.withColumn(col_name, F.col(col_name) + 1)

reduce(add_1, df.columns, df)
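For example, applied to a small two-column DataFrame (the sample data here is hypothetical):
# Hypothetical sample data to illustrate
df = spark.createDataFrame([(1, 10), (2, 20)], ["a", "b"])
reduce(add_1, df.columns, df).show()
# +---+---+
# |  a|  b|
# +---+---+
# |  2| 11|
# |  3| 21|
# +---+---+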
Edit:
I am not sure how to solve it without converting to an RDD. Maybe this can be helpful:
from pyspark.sql import Row

# Flatten a list of lists into a single flat list
flatF = lambda col: [item for l in col for item in l]

# col_names is the list of target column names, e.g. ['label', 'number1', 'text', 'value']
df \
    .rdd \
    .map(lambda row: Row(timestamp=row['timestamp'],
                         **dict(zip(col_names, flatF(row['layers']))))) \
    .toDF()

Trim in a Pyspark Dataframe

I have a PySpark dataframe (the original dataframe) with the data below (all columns have string datatype). In my use case I am not sure which columns are present in this input dataframe. The user just passes me the name of the dataframe and asks me to trim all of its columns. Data in a typical dataframe looks like this:
id Value Value1
1 "Text " "Avb"
2 1504 " Test"
3 1 2
Is there any way I can do this without depending on which columns are present in the dataframe, and get all of its columns trimmed? Data after trimming all the columns should look like this:
id Value Value1
1 "Text" "Avb"
2 1504 "Test"
3 1 2
Can someone help me out? How can I achieve this with a PySpark dataframe? Any help will be appreciated.
Input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1|Text | Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Code:
import pyspark.sql.functions as func

for col in df.columns:
    df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1| Text| Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Using the trim() function from @osbon123's answer:
from pyspark.sql.functions import col, trim

for c_name in df.columns:
    df = df.withColumn(c_name, trim(col(c_name)))
You should avoid using withColumn in a loop, because each call creates a new DataFrame, which is time-consuming for very large dataframes. I created the following function based on this solution, and it works with any dataframe, even one that has both string and non-string columns.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def trim_string_columns(of_data: DataFrame) -> DataFrame:
    data_trimmed = of_data.select([
        # Trim only the string columns; pass the rest through untouched
        (F.trim(c.name).alias(c.name) if isinstance(c.dataType, StringType) else c.name)
        for c in of_data.schema
    ])
    return data_trimmed
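Usage is then a single call (assuming df is the dataframe from the question):
df = trim_string_columns(df)
df.show()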
This is the cleanest (and most computationally efficient) way I've seen to remove all spaces from all column names. If you want underscores in place of the spaces, simply replace "" with "_".
# Standardize column names: strip out spaces
new_column_name_list = list(map(lambda x: x.replace(" ", ""), df.columns))
df = df.toDF(*new_column_name_list)
You can use the dtypes attribute of the DataFrame API to get the list of column names along with their datatypes, and then use the trim function on all string columns to trim their values.
Regards,
Neeraj
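A minimal sketch of that dtypes-based approach (the variable names here are my own):
from pyspark.sql import functions as F

# dtypes returns (column_name, type_string) pairs, e.g. [('Value', 'string'), ...]
string_cols = [name for name, dtype in df.dtypes if dtype == 'string']

for name in string_cols:
    df = df.withColumn(name, F.trim(F.col(name)))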
