Pyspark DataFrame: Split column with multiple values into rows

Pyspark DataFrame: Split column with multiple values into rows - apache-spark

I have a dataframe (with more rows and columns) as shown below.
Sample DF:
from pyspark import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([Row(col1 = 'z1', col2 = '[a1, b2, c3]', col3 = 'foo')])
# +------+-------------+------+
# | col1| col2| col3|
# +------+-------------+------+
# | z1| [a1, b2, c3]| foo|
# +------+-------------+------+
df
# DataFrame[col1: string, col2: string, col3: string]
What I want:
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| z1| a1| foo|
| z1| b2| foo|
| z1| c3| foo|
+-----+-----+-----+
I tried to replicate the RDD solution provided here: Pyspark: Split multiple array columns into rows
(df
.rdd
.flatMap(lambda row: [(row.col1, col2, row.col3) for col2 in row.col2)])
.toDF(["col1", "col2", "col3"]))
However, it is not giving the required result
Edit: The explode option does not work because it is currently stored as string and the explode function expects an array

You can use explode but first you'll have to convert the string representation of the array into an array.
One way is to use regexp_replace to remove the leading and trailing square brackets, followed by split on ", ".
from pyspark.sql.functions import col, explode, regexp_replace, split
df.withColumn(
"col2",
explode(split(regexp_replace(col("col2"), "(^\[)|(\]$)", ""), ", "))
).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| z1| a1| foo|
#| z1| b2| foo|
#| z1| c3| foo|
#+----+----+----+

Pault's solution should work perfectly fine although here is another solution which uses regexp_extract instead (you don't really need to replace anything in this case) and it can handle arbitrary number of spaces:
from pyspark.sql.functions import col, explode, regexp_extract,regexp_replace, split
df.withColumn("col2",
explode(
split(
regexp_extract(
regexp_replace(col("col2"), "\s", ""), "^\[(.*)\]$", 1), ","))) \
.show()
Explanation:
Initially regexp_replace(col("col2"), "\s", "") will replace all spaces with empty string.
Next regexp_extract will extract the content of the column which start with [ and ends with ].
Then we execute split for the comma separated values and finally explode.

Related

Replace all strings with escape characters across all columns with NULLs in Pyspark

I have a spark dataframe with approximately 100 columns. NULL instances are currently recorded as \N. I want to replace all instances of \N with NULL however, because the backslash is an escape character, I'm having difficulty. I've found this article that uses regex for a single column, however, I need to iterate over all columns
I've even tried the solution in the article on a single column, however, I still cannot get it to work. Ordinarily, I'd use R and am to solve this issue in R using the following code:
df <- sapply(df,function(x) {x <- gsub("\\\\N",NA,x)})
However, given I'm new to Pyspark I'm having quite a lot of difficulty.

Will this work for you?
import pyspark.sql.functions as psf
data = [ ('0','\\N','\\N','3')]
df = spark.createDataFrame(data, ['col1','col2','col3','col4'])
print('before:')
df.show()
for col in df.columns:
df = df.withColumn(col, psf.when(psf.col(col)==psf.lit('\\N'), psf.lit(None)).otherwise(psf.col(col)))
print('after:')
df.show()
before:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 0| \N| \N| 3|
+----+----+----+----+
after:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 0|null|null| 3|
+----+----+----+----+

Pyspark replace string in every column name

I am converting Pandas commands into Spark ones. I bumped into wanting to convert this line into Apache Spark code:
This line replaces every two spaces into one.
df = df.columns.str.replace(' ', ' ')
Is it possible to replace a string from all columns using Spark?
I came into this, but it is not quite right.
df = df.withColumnRenamed('--', '-')
To be clear I want this
//+---+----------------------+-----+
//|id |address__test |state|
//+---+----------------------+-----+
to this
//+---+----------------------+-----+
//|id |address_test |state|
//+---+----------------------+-----+

You can apply the replace method on all columns by iterating over them and then selecting, like so:
df = spark.createDataFrame([(1, 2, 3)], "id: int, address__test: int, state: int")
df.show()
+---+-------------+-----+
| id|address__test|state|
+---+-------------+-----+
| 1| 2| 3|
+---+-------------+-----+
from pyspark.sql.functions import col
new_cols = [col(c).alias(c.replace("__", "_")) for c in df.columns]
df.select(*new_cols).show()
+---+------------+-----+
| id|address_test|state|
+---+------------+-----+
| 1| 2| 3|
+---+------------+-----+
On the sidenote: calling withColumnRenamed makes Spark create a Projection for each distinct call, while a select makes just single Projection, hence for large number of columns, select will be much faster.

Here's a suggestion.
We get all the target columns:
columns_to_edit = [col for col in df.columns if "__" in col]
Then we use a for loop to edit them all one by one:
for column in columns_to_edit:
new_column = column.replace("__", "_")
df = df.withColumnRenamed(column, new_column)
Would this solve your issue?

How to filter using concatenation of two columns in pyspark

I have read in a parquet file and I would like to filter
rows using prepared dict. There are two columns in the dataframe called col1 and col2 which are of type string. My dictionary has a set of strings in it and I want rows where the concatenation of the strings in columns col1 and col2 are in the dictionary. I tried
df.filter((df['col1']+df['col2']) in my_dict)
but it seems that df['col1']+df['col2'] is not a string even though that is the type of columns.
I also tried
df.filter(lambda x: (x['col1']+df['col2']) in my_dict)
What's the right way to do this?

So, there are 2 components in your issue:
The string column concatenation
The filtering using a dictionary
Regarding the first part - here is an example of string column concatenation using a toy dataframe:
spark.version
# u'2.1.1'
from pyspark.sql.functions import concat, col, lit
df = spark.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.show()
# +---+---+
# | k| v|
# +---+---+
# |foo| 1|
# |bar| 2|
# +---+---+
df2 = df.select(concat(col("k"), lit(" "), col("v")).alias('joined_colname'))
df2.show()
# +--------------+
# |joined_colname|
# +--------------+
# | foo 1|
# | bar 2|
# +--------------+
Regarding the second part, you need the .isin method - not sure it will work with dictionaries, but it definitely works with lists (['foo 1', 'foo 2']) or sets ({'foo 1', 'foo 2'}):
df2.filter(col('joined_colname').isin({'foo 1', 'foo 2'})).show() # works with lists, too
# +--------------+
# |joined_colname|
# +--------------+
# | foo 1|
# +--------------+
Hope this is helpful enough...
EDIT (after comment): to keep the joined column together with the columns of your initial df:
df3 = df.withColumn('joined_colname', concat(col("k"), lit(" "), col("v")))
df3.show()
# +---+---+--------------+
# | k| v|joined_colname|
# +---+---+--------------+
# |foo| 1| foo 1|
# |bar| 2| bar 2|
# +---+---+--------------+

Trim in a Pyspark Dataframe

I have a Pyspark dataframe(Original Dataframe) having below data(all columns have string datatype). In my use case i am not sure of what all columns are there in this input dataframe. User just pass me the name of dataframe and ask me to trim all the columns of this dataframe. Data in a typical dataframe looks like as below:
id Value Value1
1 "Text " "Avb"
2 1504 " Test"
3 1 2
Is there anyway i can do it without being dependent on what all columns are present in this dataframe and get all the column trimmed in this dataframe. Data after trimming aall the columns of dataframe should look like.
id Value Value1
1 "Text" "Avb"
2 1504 "Test"
3 1 2
Can someone help me out? How can i achieve it using Pyspark dataframe? Any help will be appreciated.

input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1|Text | Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Code:
import pyspark.sql.functions as func
for col in df.columns:
df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1| Text| Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+

Using trim() function in #osbon123's answer.
from pyspark.sql.functions import trim
for c_name in df.columns:
df = df.withColumn(c_name, trim(col(c_name)))

You should avoid using withColumn because it creates a new DataFrame which is time-consuming for very large dataframes. I created the following function based on this solution, but now it works with any dataframe even when it has string and non-string columns.
from pyspark.sql import functions as F
def trim_string_columns(of_data: DataFrame) -> DataFrame:
data_trimmed = of_data.select([
(F.trim(c.name).alias(c.name) if isinstance(c.dataType, StringType) else c.name) for c in of_data.schema
])
return data_trimmed

This is the cleanest (and most computationally efficient) way I've seen it done to trim all spaces in all columns. If you want underscores to replace spaces, simply replace "" with "_".
# Standardize Column names no spaces to underscore
new_column_name_list = list(map(lambda x: x.replace(" ", ""), df.columns))
df = df.toDF(*new_column_name_list)

You can use dtypes function in DataFrame API to get the list of Cloumn Names along with their Datatypes and then for all string columns use "trim" function to trim the values.
Regards,
Neeraj

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('Foo',41,'US',3),
('Foo',39,'UK',1),
('Bar',57,'CA',2),
('Bar',72,'CA',2),
('Baz',22,'US',6),
('Baz',36,'US',6)])
What I would like to do is remove duplicate rows based on the values of the first,third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
and either row 5 or row 6 will be removed
But how do I only remove duplicate rows based on columns 1, 3 and 4 only? i.e. remove either one one of these:
('Baz',22,'US',6)
('Baz',36,'US',6)
In Python, this could be done by specifying columns with .drop_duplicates(). How can I achieve the same in Spark/Pyspark?

Pyspark does include a dropDuplicates() method, which was introduced in 1.4. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html
>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
| 10| 80|Alice|
+---+------+-----+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
+---+------+-----+

From your question, it is unclear as-to which columns you want to use to determine duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then, you can use the reduceByKey or reduce operations to eliminate duplicates.
Here is some code to get you started:
def get_key(x):
return "{0}{1}{2}".format(x[0],x[2],x[3])
m = data.map(lambda x: (get_key(x),x))
Now, you have a key-value RDD that is keyed by columns 1,3 and 4.
The next step would be either a reduceByKey or groupByKey and filter.
This would eliminate duplicates.
r = m.reduceByKey(lambda x,y: (x))

I know you already accepted the other answer, but if you want to do this as a
DataFrame, just use groupBy and agg. Assuming you had a DF already created (with columns named "col1", "col2", etc) you could do:
myDF.groupBy($"col1", $"col3", $"col4").agg($"col1", max($"col2"), $"col3", $"col4")
Note that in this case, I chose the Max of col2, but you could do avg, min, etc.

Agree with David. To add on, it may not be the case that we want to groupBy all columns other than the column(s) in aggregate function i.e, if we want to remove duplicates purely based on a subset of columns and retain all columns in the original dataframe. So the better way to do this could be using dropDuplicates Dataframe api available in Spark 1.4.0
For reference, see: https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrame

I used inbuilt function dropDuplicates(). Scala code given below
val data = sc.parallelize(List(("Foo",41,"US",3),
("Foo",39,"UK",1),
("Bar",57,"CA",2),
("Bar",72,"CA",2),
("Baz",22,"US",6),
("Baz",36,"US",6))).toDF("x","y","z","count")
data.dropDuplicates(Array("x","count")).show()
Output :
+---+---+---+-----+
| x| y| z|count|
+---+---+---+-----+
|Baz| 22| US| 6|
|Foo| 39| UK| 1|
|Foo| 41| US| 3|
|Bar| 57| CA| 2|
+---+---+---+-----+

The below programme will help you drop duplicates on whole , or if you want to drop duplicates based on certain columns , you can even do that:
import org.apache.spark.sql.SparkSession
object DropDuplicates {
def main(args: Array[String]) {
val spark =
SparkSession.builder()
.appName("DataFrame-DropDuplicates")
.master("local[4]")
.getOrCreate()
import spark.implicits._
// create an RDD of tuples with some data
val custs = Seq(
(1, "Widget Co", 120000.00, 0.00, "AZ"),
(2, "Acme Widgets", 410500.00, 500.00, "CA"),
(3, "Widgetry", 410500.00, 200.00, "CA"),
(4, "Widgets R Us", 410500.00, 0.0, "CA"),
(3, "Widgetry", 410500.00, 200.00, "CA"),
(5, "Ye Olde Widgete", 500.00, 0.0, "MA"),
(6, "Widget Co", 12000.00, 10.00, "AZ")
)
val customerRows = spark.sparkContext.parallelize(custs, 4)
// convert RDD of tuples to DataFrame by supplying column names
val customerDF = customerRows.toDF("id", "name", "sales", "discount", "state")
println("*** Here's the whole DataFrame with duplicates")
customerDF.printSchema()
customerDF.show()
// drop fully identical rows
val withoutDuplicates = customerDF.dropDuplicates()
println("*** Now without duplicates")
withoutDuplicates.show()
val withoutPartials = customerDF.dropDuplicates(Seq("name", "state"))
println("*** Now without partial duplicates too")
withoutPartials.show()
}
}

This is my Df contain 4 is repeated twice so here will remove repeated values.
scala> df.show
+-----+
|value|
+-----+
| 1|
| 4|
| 3|
| 5|
| 4|
| 18|
+-----+
scala> val newdf=df.dropDuplicates
scala> newdf.show
+-----+
|value|
+-----+
| 1|
| 3|
| 5|
| 4|
| 18|
+-----+

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Pyspark DataFrame: Split column with multiple values into rows - apache-spark

Related

Replace all strings with escape characters across all columns with NULLs in Pyspark

Pyspark replace string in every column name

How to filter using concatenation of two columns in pyspark

Trim in a Pyspark Dataframe

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Categories

Resources