PySpark, save unique letters in strings of a column - apache-spark

I'm using PySpark, and I want a simple way of doing the following without overcomplicating it.
Suppose I have a table that looks like this:
+---+-------+
| ID|Letters|
+---+-------+
|  1|a,b,c,d|
|  2|b,d,b  |
|  3|c,y,u  |
+---+-------+
I want to get the unique letters from the column "Letters" of this dataframe, which would be:
List = [a, b, c, d, y, u]
I tried using the in operator, but I don't really know how to iterate through each row, and I don't want to make a mess because the original plan is to run this on a big dataset.

You can try something like this:
import pyspark.sql.functions as F

data1 = [
    [1, "a,b,c,d"],
    [2, "b,d,b"],
    [3, "c,y,u"],
]
df = spark.createDataFrame(data1).toDF("ID", "Letters")

dfWithDistinctValues = df.select(
    F.array_distinct(
        F.flatten(F.collect_set(F.array_distinct(F.split(df.Letters, ","))))
    ).alias("unique_letters")
)

defaultValues = [
    data[0] for data in dfWithDistinctValues.select("unique_letters").collect()
]
print(defaultValues)
What is happening here:
First I am splitting the string by "," with F.split and dropping duplicates at row level with F.array_distinct.
Then I am using collect_set to get all the distinct arrays into one row, which at this stage is an array of arrays and looks like this:
[[b, d], [a, b, c, d], [c, y, u]]
Then I am using flatten to get all the values as separate strings:
[b, d, a, b, c, d, c, y, u]
There are still some duplicates, which are removed by array_distinct, so at the end the output looks like this:
[b, d, a, c, y, u]
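If you want to see these intermediate stages for yourself, here is a minimal sketch (using the same df as above; the names split_df and nested_df are just illustrative) that materializes each step separately:
split_df = df.select(F.array_distinct(F.split(df.Letters, ",")).alias("letters_arr"))
split_df.show(truncate=False)   # per-row arrays with row-level duplicates dropped

nested_df = split_df.select(F.collect_set("letters_arr").alias("nested"))
nested_df.show(truncate=False)  # one row containing an array of arrays

nested_df.select(
    F.array_distinct(F.flatten("nested")).alias("unique_letters")
).show(truncate=False)          # flattened and deduplicated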
If you also need counts, you can use explode as Koedit mentioned; you can change part of his code to something like this:
from pyspark.sql.functions import explode, array_distinct

# Unique letters with counts (this uses Koedit's df, where "Letters" is already an array column)
uniqueLettersDf = (
    df.select(explode(array_distinct("Letters")).alias("Letter"))
    .groupBy("Letter")
    .count()
)
uniqueLettersDf.show()
Now you will get something like this:
+------+-----+
|Letter|count|
+------+-----+
|     d|    2|
|     c|    2|
|     b|    2|
|     a|    1|
|     y|    1|
|     u|    1|
+------+-----+
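As a side note, the same counts can be computed straight from the original comma-separated string column (a sketch using the string-typed df from the beginning of this answer), splitting and exploding in one go:
from pyspark.sql import functions as F

(
    df.select(F.explode(F.array_distinct(F.split("Letters", ","))).alias("Letter"))
    .groupBy("Letter")
    .count()
    .show()
)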

Depending on how large your dataset and your arrays are (if they are very large, this might not be the route you want to take), you can use the explode function to easily get what you want:
from pyspark.sql.functions import explode
df = spark.createDataFrame(
    [
        (1, ["a", "b", "c", "d"]),
        (2, ["b", "d", "b"]),
        (3, ["c", "y", "u"])
    ],
    ["ID", "Letters"]
)
# Creating a dataframe with 1 column, "letters", with distinct values per row
uniqueLettersDf = df.select(explode("Letters").alias("letters")).distinct()
# Using list comprehension and the .collect() method to turn our dataframe into a Python list
output = [row['letters'] for row in uniqueLettersDf.collect()]
output
['d', 'c', 'b', 'a', 'y', 'u']
EDIT: To make it a bit safer, we can use array_distinct before explode: this limits the number of rows that get created by removing the duplicates before exploding.
The code would be identical, except for these lines:
from pyspark.sql.functions import explode, array_distinct
...
uniqueLettersDf = df.select(explode(array_distinct("Letters")).alias("letters")).distinct()
...

Related

Pyspark replace string in every column name

I am converting Pandas commands into Spark ones, and I want to convert this line into Apache Spark code:
This line replaces every double space with a single one:
df.columns = df.columns.str.replace('  ', ' ')
Is it possible to replace a string in all column names using Spark?
I came across this, but it is not quite right:
df = df.withColumnRenamed('--', '-')
To be clear, I want to go from this:
//+---+-------------+-----+
//|id |address__test|state|
//+---+-------------+-----+
to this:
//+---+------------+-----+
//|id |address_test|state|
//+---+------------+-----+
You can apply the replace method on all columns by iterating over them and then selecting, like so:
df = spark.createDataFrame([(1, 2, 3)], "id: int, address__test: int, state: int")
df.show()
+---+-------------+-----+
| id|address__test|state|
+---+-------------+-----+
|  1|            2|    3|
+---+-------------+-----+
from pyspark.sql.functions import col
new_cols = [col(c).alias(c.replace("__", "_")) for c in df.columns]
df.select(*new_cols).show()
+---+------------+-----+
| id|address_test|state|
+---+------------+-----+
|  1|           2|    3|
+---+------------+-----+
On a side note: calling withColumnRenamed makes Spark create a Projection for each distinct call, while a select creates just a single Projection; hence, for a large number of columns, select will be much faster.
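Another single-projection option (just a sketch, applying the same renaming rule to the df above) is to rename every column in one pass with toDF:
# Rename all columns at once; this is one select-like projection under the hood
df.toDF(*[c.replace("__", "_") for c in df.columns]).show()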
Here's a suggestion.
We get all the target columns:
columns_to_edit = [col for col in df.columns if "__" in col]
Then we use a for loop to edit them all one by one:
for column in columns_to_edit:
    new_column = column.replace("__", "_")
    df = df.withColumnRenamed(column, new_column)
Would this solve your issue?

How to trim a list of selective fields in pyspark dataframe

I have a Spark dataframe and a list of selected fields that need to be trimmed. In production this list of fields will vary for each data set, so I am trying to write a generic piece of code that will do it for me. Here is what I have done so far:
df = sqlContext.createDataFrame([('abcd ','123 ','x ')], ['s', 'd', 'n'])
df.show()
+--------+-------+---+
|       s|      d|  n|
+--------+-------+---+
|   abcd |   123 | x |
+--------+-------+---+
All 3 of my attributes have trailing spaces; however, I only want to trim the spaces from column "s" and column "d".
>>> col_list=['s','d']
>>> df.select(*map(lambda x: trim(col(x)).alias(x),col_list)).show()
+----+---+
|   s|  d|
+----+---+
|abcd|123|
+----+---+
The above operation does trim the spaces for me if I pass the list to this lambda.
How do I keep the remaining columns? I have tried these:
>>> df.select('*',*map(lambda x: trim(col(x)).alias(x),col_list)).show()
+--------+-------+---+----+---+
|       s|      d|  n|   s|  d|
+--------+-------+---+----+---+
|   abcd |   123 | x |abcd|123|
+--------+-------+---+----+---+
>>> df.select(*map(lambda x: trim(col(x)),col_list),'*').show()
File "<stdin>", line 1
SyntaxError: only named arguments may follow *expression
How do I select the other attributes from this DataFrame without hardcoding them?
You could do something like this:
import pyspark.sql.functions as F
# create a list of all columns which aren't in col_list and concat it with your map
df.select(*([item for item in df.columns if item not in col_list]
            + [F.trim(F.col(x)).alias(x) for x in col_list])).show()
But for readability purposes I would recommend withColumn:
for c in col_list:
    df = df.withColumn(c, F.trim(F.col(c)))
df.show()
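If you want a single select that also preserves the original column order, a small sketch (same df and col_list as above; the name trimmed is just illustrative) is to trim conditionally inside one list comprehension:
import pyspark.sql.functions as F

# Trim only the columns in col_list; pass everything else through unchanged
trimmed = df.select(
    [F.trim(F.col(c)).alias(c) if c in col_list else F.col(c) for c in df.columns]
)
trimmed.show()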

Drop rows containing specific value in PySpark dataframe

I have a pyspark dataframe like:
A B C
1 NA 9
4 2 5
6 4 2
5 1 NA
I want to delete the rows which contain the value "NA", in this case the first and the last row. How can I implement this using Python and Spark?
Update based on comment:
I am looking for a solution that removes rows that have the string "NA" in any of the many columns.
Just use a dataframe filter expression:
l = [('1', 'NA', '9'),
     ('4', '2', '5'),
     ('6', '4', '2'),
     ('5', 'NA', '1')]
df = spark.createDataFrame(l, ['A', 'B', 'C'])
# The following command requires that the checked columns are strings!
df = df.filter((df.A != 'NA') & (df.B != 'NA') & (df.C != 'NA'))
df.show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  4|  2|  5|
|  6|  4|  2|
+---+---+---+
@bluephantom: In case you have hundreds of columns, just generate a string expression via list comprehension:
# In my example all columns need to be checked
listOfRelevantStringColumns = df.columns
expr = ' and '.join('(%s != "NA")' % col_name for col_name in listOfRelevantStringColumns)
df.filter(expr).show()
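If you prefer to stay with Column objects instead of a SQL string, a sketch under the same assumption that the checked columns are strings (the name predicate is just illustrative) is to fold the conditions with functools.reduce:
from functools import reduce
from pyspark.sql import functions as F

# AND together one inequality per column, starting from a literal True
predicate = reduce(
    lambda acc, c: acc & (F.col(c) != "NA"),
    listOfRelevantStringColumns,
    F.lit(True),
)
df.filter(predicate).show()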
In case you want to remove such rows:
df = df.filter((df.A != 'NA') & (df.B != 'NA'))
But sometimes we need to replace the value with the mean (for a numeric column) or the most frequent value (for a categorical one). For that you need to add a column with the same name, which replaces the original column, i.e. "A":
from pyspark.sql.functions import mean, when, lit
# mean() is an aggregate, so compute the value first and then substitute it
mean_a = df.filter(df.A != "NA").select(mean("A")).first()[0]
df = df.withColumn("A", when(df.A == "NA", lit(mean_a)).otherwise(df.A))
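For the categorical case mentioned above, a rough sketch of the most-frequent-value replacement could look like this (computing the mode with a groupBy/count first; the name mode_a is just illustrative):
from pyspark.sql.functions import when, lit

# Most frequent non-"NA" value of column A
mode_a = (
    df.filter(df.A != "NA")
    .groupBy("A").count()
    .orderBy("count", ascending=False)
    .first()["A"]
)
df = df.withColumn("A", when(df.A == "NA", lit(mode_a)).otherwise(df.A))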
In Scala I did this differently, but arrived at this using PySpark; it is not my favourite answer, but that is due to my more limited PySpark knowledge. Things seem easier in Scala. Unlike with an array, there is no global match against all columns that can stop as soon as one is found, and this approach is dynamic in terms of the number of columns.
It assumes the data does not contain "~~"; I could have split to an array but decided not to here. It uses None instead of "NA".
from pyspark.sql import functions as f

data = [(1, None, 4, None),
        (2, 'c', 3, 'd'),
        (None, None, None, None),
        (3, None, None, 'z')]
df = spark.createDataFrame(data, ['k', 'v1', 'v2', 'v3'])

columns = df.columns
columns_Count = len(df.columns)

# colCompare is a String; concat_ws skips nulls, so rows with nulls produce fewer '~~'-separated parts
df2 = df.select(df['*'], f.concat_ws('~~', *columns).alias('colCompare'))
df3 = df2.filter(f.size(f.split(f.col("colCompare"), r"~~")) == columns_Count).drop("colCompare")
df3.show()
returns:
+---+---+---+---+
|  k| v1| v2| v3|
+---+---+---+---+
|  2|  c|  3|  d|
+---+---+---+---+
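Since this version uses None rather than the string "NA", note that the same result can also be had with the built-in null handling; a minimal sketch:
# Drop any row that has a null in at least one column
df.na.drop(how="any").show()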

Get index of item in array that is a column in a Spark dataframe

I am able to filter a Spark dataframe (in PySpark) based on whether a particular value exists within an array field by doing the following:
from pyspark.sql.functions import array_contains
spark_df.filter(array_contains(spark_df.array_column_name, "value that I want")).show()
Is there a way to get the index of where in the array the item was found? It seems like that should exist, but I am not finding it. Thank you.
In Spark 2.4+, there's the array_position function:
df = spark.createDataFrame([(["c", "b", "a"],), ([],)], ['data'])
df.show()
#+---------+
#| data|
#+---------+
#|[c, b, a]|
#| []|
#+---------+
from pyspark.sql.functions import array_position
df.select(df.data, array_position(df.data, "a").alias('a_pos')).show()
#+---------+-----+
#|     data|a_pos|
#+---------+-----+
#|[c, b, a]|    3|
#|       []|    0|
#+---------+-----+
Notes from the docs:
Locates the position of only the first occurrence of the given value in the given array.
The position is not zero-based but a 1-based index; it returns 0 if the given value could not be found in the array.
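If you need a zero-based index, or a null when the value is absent, a small sketch on top of array_position (the name pos is just illustrative, same df as above):
from pyspark.sql.functions import array_position, when, col

pos = array_position(col("data"), "a")
# Subtract 1 for a zero-based index; rows where the value is absent become null
df.select("data", when(pos > 0, pos - 1).alias("a_pos_zero_based")).show()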
I am using Spark version 2.3, so I tried this using a udf:
from pyspark.sql.functions import udf, lit

df = spark.createDataFrame([(["c", "b", "a", "e", "f"],)], ['arraydata'])
df.show()
+---------------+
|      arraydata|
+---------------+
|[c, b, a, e, f]|
+---------------+
user_func = udf(lambda x, y: [i for i, e in enumerate(x) if e == y])
checking index position for item 'b':
newdf = df.withColumn('item_position',user_func(df.arraydata,lit('b')))
>>> newdf.show();
+---------------+-------------+
|      arraydata|item_position|
+---------------+-------------+
|[c, b, a, e, f]|          [1]|
+---------------+-------------+
checking index position for item 'e':
newdf = df.withColumn('item_position',user_func(df.arraydata,lit('e')))
>>> newdf.show();
+---------------+-------------+
|      arraydata|item_position|
+---------------+-------------+
|[c, b, a, e, f]|          [3]|
+---------------+-------------+
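One caveat with the udf above: without a declared return type it returns the result as a string. A sketch with an explicit ArrayType keeps proper integers instead:
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import ArrayType, IntegerType

# Same lambda as above, but with a declared array-of-int return type
user_func = udf(
    lambda arr, value: [i for i, e in enumerate(arr) if e == value],
    ArrayType(IntegerType()),
)
df.withColumn('item_position', user_func(df.arraydata, lit('b'))).show()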

Creating a column based upon a list and column in Pyspark

I have a pyspark DataFrame, say df1, with multiple columns.
I also have a list, say, l = ['a','b','c','d'] and these values are the subset of the values present in one of the columns in the DataFrame.
Now, I would like to do something like this:
df2 = df1.withColumn('new_column', expr("case when col_1 in l then 'yes' else 'no' end"))
But this is throwing the following error:
failure: "(" expected but identifier l found.
Any idea how to resolve this error or any better way of doing it?
You can achieve that with the isin function of the Column object:
df1 = sqlContext.createDataFrame([('a', 1), ('b', 2), ('c', 3)], ('col1', 'col2'))
l = ['a', 'b']
from pyspark.sql.functions import *
df2 = df1.withColumn('new_column', when(col('col1').isin(l), 'yes').otherwise('no'))
df2.show()
+----+----+----------+
|col1|col2|new_column|
+----+----+----------+
|   a|   1|       yes|
|   b|   2|       yes|
|   c|   3|        no|
+----+----+----------+
Note: For Spark < 1.5, use inSet instead of isin.
Reference: pyspark.sql.Column documentation
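If you want to stick with the expr/CASE WHEN style from the question, a sketch is to render the Python list into the SQL string yourself (using col1 from the example above; the name in_list is just illustrative), since l is not visible inside the SQL expression:
from pyspark.sql.functions import expr

# Render the Python list as a SQL IN (...) literal
in_list = ", ".join("'{}'".format(v) for v in l)
df2 = df1.withColumn(
    "new_column",
    expr("case when col1 in ({}) then 'yes' else 'no' end".format(in_list))
)
df2.show()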
