I have a DataFrame named DF with columns id, nm, txt and uppertxt, and the below code:
from pyspark.sql import Row

def func(row):
    temp = row.asDict()
    temp["concat_val"] = "|".join([str(x) for x in row])
    put = Row(**temp)
    return put

DF.show()
row_rdd = DF.rdd.map(func)
concat_df = row_rdd.toDF()
concat_df.show()
I am getting a result where concat_val contains every column value joined together, including id and nm. However, I want an output where the id and nm column values are removed from the concat_val column, so that it contains only txt and uppertxt joined by "|". Please suggest a way to remove the id and nm values.
So here you are trying to concatenate the columns txt and uppertxt, with the values delimited by "|". You can try the code below.
# Load required libraries
from pyspark.sql.functions import *
# Create DataFrame
df = spark.createDataFrame([(1,"a","foo","qwe"), (2,"b","bar","poi"), (3,"c","mnc","qwe")], ["id", "nm", "txt", "uppertxt"])
# Concat column txt and uppertxt delimited by "|"
# Approach - 1 : using concat function.
df1 = df.withColumn("concat_val", concat(df["txt"] , lit("|"), df["uppertxt"]))
# Approach - 2 : Using concat_ws function
df1 = df.withColumn("concat_val", concat_ws("|", df["txt"] , df["uppertxt"]))
# Display Output
df1.show()
Output
+---+---+---+--------+----------+
| id| nm|txt|uppertxt|concat_val|
+---+---+---+--------+----------+
| 1| a|foo| qwe| foo|qwe|
| 2| b|bar| poi| bar|poi|
| 3| c|mnc| qwe| mnc|qwe|
+---+---+---+--------+----------+
You can find more info on concat and concat_ws in the Spark docs.
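If your real DataFrame has more columns and you just want to exclude a few from the concatenation (here id and nm), one option is to build the column list dynamically. A minimal sketch, assuming the same df as above:
# Concatenate every column except id and nm, delimited by "|"
cols_to_skip = ["id", "nm"]
df2 = df.withColumn(
    "concat_val",
    concat_ws("|", *[df[c] for c in df.columns if c not in cols_to_skip])
)
df2.show()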
I hope this helps.
I am converting Pandas commands into Spark ones. I bumped into wanting to convert this line into Apache Spark code:
This line replaces every run of two spaces in the column names with a single space:
df = df.columns.str.replace('  ', ' ')
Is it possible to replace a substring in all column names using Spark?
I came up with this, but it is not quite right.
df = df.withColumnRenamed('--', '-')
To be clear, I want to go from this:
//+---+----------------------+-----+
//|id |address__test |state|
//+---+----------------------+-----+
to this
//+---+----------------------+-----+
//|id |address_test |state|
//+---+----------------------+-----+
You can apply the replace to all column names by iterating over them and then selecting, like so:
df = spark.createDataFrame([(1, 2, 3)], "id: int, address__test: int, state: int")
df.show()
+---+-------------+-----+
| id|address__test|state|
+---+-------------+-----+
| 1| 2| 3|
+---+-------------+-----+
from pyspark.sql.functions import col
new_cols = [col(c).alias(c.replace("__", "_")) for c in df.columns]
df.select(*new_cols).show()
+---+------------+-----+
| id|address_test|state|
+---+------------+-----+
| 1| 2| 3|
+---+------------+-----+
On a side note: each call to withColumnRenamed makes Spark create a separate Projection, while a single select creates just one Projection; hence, for a large number of columns, select will be much faster.
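If you want to verify this yourself, you can compare the analyzed logical plans; a rough illustration (exact plan output varies by Spark version):
from pyspark.sql.functions import col

# Chained withColumnRenamed: the analyzed logical plan stacks one Project per rename
renamed = df
for c in df.columns:
    renamed = renamed.withColumnRenamed(c, c.replace("__", "_"))
renamed.explain(True)

# Single select: just one Project over the source
df.select([col(c).alias(c.replace("__", "_")) for c in df.columns]).explain(True)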
Here's a suggestion.
We get all the target columns:
columns_to_edit = [col for col in df.columns if "__" in col]
Then we use a for loop to edit them all one by one:
for column in columns_to_edit:
    new_column = column.replace("__", "_")
    df = df.withColumnRenamed(column, new_column)
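If you prefer, the same loop can be written as a one-liner with functools.reduce (equivalent behavior):
from functools import reduce

df = reduce(
    lambda acc, column: acc.withColumnRenamed(column, column.replace("__", "_")),
    columns_to_edit,
    df
)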
Would this solve your issue?
I have two different DataFrames in PySpark, both of string type. The first DataFrame holds single words, while the second holds strings of words, i.e., sentences. I have to check for the existence of each value from the first DataFrame's column in the second DataFrame's column. For example,
df2
+---+------+------+-----------------+
|age|height|  name|        Sentences|
+---+------+------+-----------------+
| 10|    80| Alice|   'Grace, Sarah'|
| 15|  null|   Bob|          'Sarah'|
| 12|  null|   Tom|'Amy, Sarah, Bob'|
| 13|  null|Rachel|       'Tom, Bob'|
+---+------+------+-----------------+
Second dataframe
df1
+-------+
|  token|
+-------+
|  'Ali'|
|'Sarah'|
|  'Bob'|
|  'Bob'|
+-------+
So, how can I search for each token of df1 in the df2 Sentences column? I need a count for each word, added as a new column in df1.
I have tried this solution, but it only works for a single word, i.e., not for a complete DataFrame column.
Considering the DataFrames (df1 and df2) defined in the other answer:
from pyspark.sql.functions import explode,explode_outer,split, length,trim
df3 = df2.select('Sentences',explode(split('Sentences',',')).alias('friends'))
df3 = df3.withColumn("friends", trim("friends")).withColumn("length_of_friends", length("friends"))
display(df3)
df3 = df3.join(df1, df1.token == df3.friends,how='inner').groupby('friends').count()
display(df3)
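The groupby above gives the count per matched word. To attach those counts back onto df1 as a new column, as the question asks, a rough sketch using the names from the snippets above:
# Left-join the per-word counts back onto df1; tokens with no match get 0.
# Note: duplicate tokens in df1 (e.g. 'Bob' twice) inflate the grouped counts,
# so you may want to dropDuplicates(['token']) before the join above.
df1_with_count = (
    df1.join(df3, df1.token == df3.friends, how="left")
       .drop("friends")
       .fillna(0, subset=["count"])
)
display(df1_with_count)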
You could use a PySpark UDF to create the new column in df1.
The problem is that you cannot access a second DataFrame inside a UDF (see here).
As advised in the referenced question, you can collect the sentences into a broadcast variable.
Here is a working example:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

# Instantiate df2
cols = ["age", "height", "name", "Sentences"]
data = [
    (10, 80, "Alice", "Grace, Sarah"),
    (15, None, "Bob", "Sarah"),
    (12, None, "Tom", "Amy, Sarah, Bob"),
    (13, None, "Rachel", "Tom, Bob")
]
df2 = spark.createDataFrame(data).toDF(*cols)

# Instantiate df1
cols = ["token"]
data = [
    ("Ali",),
    ("Sarah",),
    ("Bob",),
    ("Bob",)
]
df1 = spark.createDataFrame(data).toDF(*cols)

# Create a broadcast variable from the Sentences column of df2
lstSentences = [row[0] for row in df2.select('Sentences').collect()]
sentences = spark.sparkContext.broadcast(lstSentences)

def countWordInSentence(word):
    # Count the sentences that contain the word, reading from the broadcast value
    return sum(1 for item in sentences.value if word in item)

func_udf = udf(countWordInSentence, IntegerType())
df1 = df1.withColumn("COUNT", func_udf(df1["token"]))
df1.show()
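Note that word in item is a substring check, so a token like "Bob" would also count a sentence containing "Bobby". If you need exact matches, a small variation (a sketch over the same assumed data) is to split each sentence on the commas first:
def countExactWord(word):
    # Count sentences in which the token appears as a whole comma-separated entry
    return sum(1 for item in sentences.value
               if word in [w.strip() for w in item.split(",")])

exact_udf = udf(countExactWord, IntegerType())
df1.withColumn("COUNT", exact_udf(df1["token"])).show()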
I have researched this a lot, but I am unable to find a way to add multiple columns to a PySpark DataFrame at specific positions.
I have the dataframe that looks like this:
Customer_id First_Name Last_Name
I want to add 3 empty columns at 3 different positions and my final resulting dataframe needs to look like this:
Customer_id Address First_Name Email_address Last_Name Phone_no
Is there an easy way to do this, like you can with reindex in pandas?
# Creating a DataFrame.
from pyspark.sql.functions import col, lit
df = sqlContext.createDataFrame(
[('1','Moritz','Schulz'),('2','Sandra','Schröder')],
('Customer_id','First_Name','Last_Name')
)
df.show()
+-----------+----------+---------+
|Customer_id|First_Name|Last_Name|
+-----------+----------+---------+
| 1| Moritz| Schulz|
| 2| Sandra| Schröder|
+-----------+----------+---------+
You can use the lit() function to add empty columns, and once they are created you can use select to reorder the columns in the order you wish.
df = df.withColumn('Address',lit(''))\
.withColumn('Email_address',lit(''))\
.withColumn('Phone_no',lit(''))\
.select(
'Customer_id', 'Address', 'First_Name',
'Email_address', 'Last_Name', 'Phone_no'
)
df.show()
+-----------+-------+----------+-------------+---------+--------+
|Customer_id|Address|First_Name|Email_address|Last_Name|Phone_no|
+-----------+-------+----------+-------------+---------+--------+
| 1| | Moritz| | Schulz| |
| 2| | Sandra| | Schröder| |
+-----------+-------+----------+-------------+---------+--------+
As suggested by user @Pault, here is a more concise and succinct way:
df = df.select(
"Customer_id", lit('').alias("Address"), "First_Name",
lit("").alias("Email_address"), "Last_Name", lit("").alias("Phone_no")
)
df.show()
+-----------+-------+----------+-------------+---------+--------+
|Customer_id|Address|First_Name|Email_address|Last_Name|Phone_no|
+-----------+-------+----------+-------------+---------+--------+
| 1| | Moritz| | Schulz| |
| 2| | Sandra| | Schröder| |
+-----------+-------+----------+-------------+---------+--------+
If you want something even more succinct, which I feel is shorter:
for col in ["mycol1", "mycol2", "mycol3", "mycol4", "mycol5", "mycol6"]:
df = df.withColumn(col, F.lit(None))
You can then use a select with the same list of names to set the column order (see the sketch below).
(edit) Note: withColumn in a for loop is usually quite slow. Don't do that for a large number of columns; prefer a select statement, like:
select_statement = []
for col in ["mycol1", "mycol2", "mycol3", "mycol4", "mycol5", "mycol6"]:
    select_statement.append(F.lit(None).alias(col))
df = df.select(*df.columns, *select_statement)
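Since the appended columns land at the end, you can finish with one more select that lists the columns in the positions you want (illustrative column names):
desired_order = ["Customer_id", "mycol1", "First_Name", "mycol2",
                 "Last_Name", "mycol3", "mycol4", "mycol5", "mycol6"]
df = df.select(*desired_order)
df.show()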
I would like to replicate all rows in my DataFrame based on the value of a given column in each row, and then index each new row. Suppose I have:
Column A Column B
T1 3
T2 2
I want the result to be:
Column A Column B Index
T1 3 1
T1 3 2
T1 3 3
T2 2 1
T2 2 2
I was able to do something similar with fixed values, but not by using the information found in the column. My current working code for fixed values is:
idx = [lit(i) for i in range(1, 10)]
df = df.withColumn('Index', explode(array( idx ) ))
I tried to change:
lit(i) for i in range(1, 10)
to
lit(i) for i in range(1, df['Column B'])
and add it into my array() function:
df = df.withColumn('Index', explode(array( lit(i) for i in range(1, df['Column B']) ) ))
but it does not work (TypeError: 'Column' object cannot be interpreted as an integer).
How should I implement this?
Unfortunately you can't iterate over a Column like that. You can always use a udf, but I do have a non-udf hack solution that should work for you if you're using Spark version 2.1 or higher.
The trick is to take advantage of pyspark.sql.functions.posexplode() to get the index value. We do this by creating a string by repeating a comma Column B times. Then we split this string on the comma, and use posexplode to get the index.
df.createOrReplaceTempView("df") # first register the DataFrame as a temp table
query = 'SELECT '\
'`Column A`,'\
'`Column B`,'\
'pos AS Index '\
'FROM ( '\
'SELECT DISTINCT '\
'`Column A`,'\
'`Column B`,'\
'posexplode(split(repeat(",", `Column B`), ",")) '\
'FROM df) AS a '\
'WHERE a.pos > 0'
newDF = sqlCtx.sql(query).sort("Column A", "Column B", "Index")
newDF.show()
#+--------+--------+-----+
#|Column A|Column B|Index|
#+--------+--------+-----+
#| T1| 3| 1|
#| T1| 3| 2|
#| T1| 3| 3|
#| T2| 2| 1|
#| T2| 2| 2|
#+--------+--------+-----+
Note: You need to wrap the column names in backticks since they have spaces in them as explained in this post: How to express a column which name contains spaces in Spark SQL
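If you prefer the DataFrame API over a temp table, roughly the same thing can be written with expr; a sketch, assuming the same df and Spark 2.1+:
from pyspark.sql.functions import expr

newDF = (
    df.select(
        df["Column A"],
        df["Column B"],
        expr('posexplode(split(repeat(",", `Column B`), ","))')
    )
    .where("pos > 0")
    .withColumnRenamed("pos", "Index")
    .drop("col")
    .sort("Column A", "Column B", "Index")
)
newDF.show()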
You can try this:
from pyspark.sql.window import Window
from pyspark.sql.functions import *
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql import functions as F
df = spark.read.csv('/FileStore/tables/stack1.csv', header = 'True', inferSchema = 'True')
w = Window.orderBy("Column A")
df = df.select(row_number().over(w).alias("Index"), col("*"))
n_to_array = udf(lambda n: [n] * n, ArrayType(IntegerType()))
df2 = df.withColumn('Column B', n_to_array('Column B'))
df3 = df2.withColumn('Column B', explode('Column B'))
df3.show()
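Note that the row_number here is computed before the explode, so every replicated row keeps the same Index (1 for T1, 2 for T2). If the Index should run 1..n within each original row, as in the expected output, one possible fix (a sketch) is to number the rows again after the explode:
w2 = Window.partitionBy("Index").orderBy("Column A")
df4 = df3.withColumn("Index", row_number().over(w2))
df4.show()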
I have a PySpark DataFrame (the original DataFrame) with the data below (all columns have string datatype). In my use case I am not sure which columns are present in this input DataFrame. The user just passes me the name of the DataFrame and asks me to trim all of its columns. Data in a typical DataFrame looks like this:
id Value Value1
1 "Text " "Avb"
2 1504 " Test"
3 1 2
Is there any way I can do this without depending on which columns are present in the DataFrame, and get all of its columns trimmed? Data after trimming all the columns of the DataFrame should look like this:
id Value Value1
1 "Text" "Avb"
2 1504 "Test"
3 1 2
Can someone help me out? How can I achieve this with a PySpark DataFrame? Any help will be appreciated.
input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1|Text | Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Code:
import pyspark.sql.functions as func
for col in df.columns:
    df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1| Text| Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Using the trim() function from @osbon123's answer:
from pyspark.sql.functions import col, trim

for c_name in df.columns:
    df = df.withColumn(c_name, trim(col(c_name)))
You should avoid using withColumn because it creates a new DataFrame, which is time-consuming for very large DataFrames. I created the following function based on this solution, and it works with any DataFrame, even one with both string and non-string columns.
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import StringType

def trim_string_columns(of_data: DataFrame) -> DataFrame:
    data_trimmed = of_data.select([
        (F.trim(c.name).alias(c.name) if isinstance(c.dataType, StringType) else c.name)
        for c in of_data.schema
    ])
    return data_trimmed
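Usage is then a single call; non-string columns pass through unchanged:
df_trimmed = trim_string_columns(df)
df_trimmed.show()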
This is the cleanest (and most computationally efficient) way I've seen to remove all spaces from all column names. If you want underscores in place of the spaces, simply replace "" with "_".
# Standardize column names: remove spaces
new_column_name_list = list(map(lambda x: x.replace(" ", ""), df.columns))
df = df.toDF(*new_column_name_list)
You can use the dtypes attribute of the DataFrame API to get the list of column names along with their datatypes, and then for all string columns use the trim function to trim the values.
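A minimal sketch of that idea, assuming a DataFrame named df:
from pyspark.sql.functions import col, trim

# dtypes returns (column name, type name) pairs; trim only the string columns
string_cols = [c for c, t in df.dtypes if t == "string"]
df = df.select([
    trim(col(c)).alias(c) if c in string_cols else col(c)
    for c in df.columns
])
df.show()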
Regards,
Neeraj