Split Text in Dataframe and Check if Contains Substring - apache-spark

So I want to check if my text contains the word 'baby' and not any other word that contains 'baby'. For example, "maybaby" would not be a match. I already have a piece of code that works, but I wanted to see if there is a better way to write it so that I don't have to go through the data twice. Here is what I have thus far:
import pyspark.sql.functions as F
rows = sc.parallelize([['14-banana'], ['12-cheese'], ['13-olives'], ['11-almonds'], ['23-maybaby'], ['54-baby']])
rows_df = rows.toDF(["ID"])
split = F.split(rows_df.ID, '-')
rows_df = rows_df.withColumn('fruit', split)
+----------+-------------+
| ID| fruit|
+----------+-------------+
| 14-banana| [14, banana]|
| 12-cheese| [12, cheese]|
| 13-olives| [13, olives]|
|11-almonds|[11, almonds]|
|23-maybaby|[23, maybaby]|
| 54-baby| [54, baby]|
+----------+-------------+
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def func(col):
    for item in col:
        if item == "baby":
            return "yes"
    return "no"

func_udf = udf(func, StringType())
df_hierachy_concept = rows_df.withColumn('new', func_udf(rows_df['fruit']))
+----------+-------------+---+
| ID| fruit|new|
+----------+-------------+---+
| 14-banana| [14, banana]| no|
| 12-cheese| [12, cheese]| no|
| 13-olives| [13, olives]| no|
|11-almonds|[11, almonds]| no|
|23-maybaby|[23, maybaby]| no|
| 54-baby| [54, baby]|yes|
+----------+-------------+---+
Ultimately, I just want the "ID" and "new" columns only.

I'll show a few ways to resolve this; there are probably plenty of other ways to reach the same result.
See the examples below:
from pyspark.shell import sc
from pyspark.sql.functions import split, udf, when

rows = sc.parallelize(
    [
        ['14-banana'], ['12-cheese'], ['13-olives'],
        ['11-almonds'], ['23-maybaby'], ['54-baby']
    ]
)
# Resolves with auxiliary column named "fruit"
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn('fruit', split(rows_df.ID, '-')[1])
rows_df = rows_df.withColumn('new', when(rows_df.fruit == 'baby', 'yes').otherwise('no'))
rows_df = rows_df.drop('fruit')
rows_df.show()
# Resolves directly without creating an auxiliary column
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn(
    'new',
    when(split(rows_df.ID, '-')[1] == 'baby', 'yes').otherwise('no')
)
rows_df.show()
# Resolves it without forcing the `split()[1]` call, avoiding an index-out-of-range error
rows_df = rows.toDF(["ID"])
is_new_udf = udf(lambda col: 'yes' if any(value == 'baby' for value in col) else 'no')
rows_df = rows_df.withColumn('new', is_new_udf(split(rows_df.ID, '-')))
rows_df.show()
All outputs are the same:
+----------+---+
| ID|new|
+----------+---+
| 14-banana| no|
| 12-cheese| no|
| 13-olives| no|
|11-almonds| no|
|23-maybaby| no|
| 54-baby|yes|
+----------+---+

I'd use pyspark.sql.functions.regexp_extract for this. Make the column new equal to "yes" if you're able to extract the word "baby" with a word boundary on both sides, and "no" otherwise.
from pyspark.sql.functions import regexp_extract, when

rows_df.withColumn(
    'new',
    when(
        regexp_extract("ID", r"(?<=(\b|\-))baby(?=(\b|$))", 0) == "baby",
        "yes"
    ).otherwise("no")
).show()
#+----------+-------------+---+
#| ID| fruit|new|
#+----------+-------------+---+
#| 14-banana| [14, banana]| no|
#| 12-cheese| [12, cheese]| no|
#| 13-olives| [13, olives]| no|
#|11-almonds|[11, almonds]| no|
#|23-maybaby|[23, maybaby]| no|
#| 54-baby| [54, baby]|yes|
#+----------+-------------+---+
The last argument to regexp_extract is the index of the match to extract. We pick the first index (index 0). If the pattern doesn't match, an empty string is returned. Finally use when() to check if the extracted string equals the desired value.
The regex pattern means:
(?<=(\b|\-)): Positive look-behind for either a word boundary (\b) or a literal hyphen (-).
baby: The literal word "baby"
(?=(\b|$)): Positive look-ahead for either a word boundary or the end of the line ($).
This method also doesn't require you to first split the string, because it's unclear if that part is needed for your purposes.
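If all you need is the yes/no flag and the token is always hyphen-delimited, a similar check can also be written with Column.rlike. This is just a sketch under that assumption, not part of the original answer:

from pyspark.sql.functions import col, when

# Sketch: flag "baby" only when it appears as a whole hyphen-delimited token.
rows_df.withColumn(
    'new',
    when(col("ID").rlike(r"(^|-)baby(-|$)"), "yes").otherwise("no")
).select("ID", "new").show()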

Related

PySpark Compare Empty Map Literal

I want to drop rows in a PySpark DataFrame where a certain column contains an empty map. How do I do this? I can't seem to declare a typed empty MapType against which to compare my column. I have seen that in Scala, you can use typedLit, but there seems to be no such equivalent in PySpark. I have also tried using lit(...) and casting to a struct<string,int> but I have found no acceptable argument for lit() (tried using None which returns null and {} which is an error).
I'm sure this is trivial but I haven't seen any docs on this!
Here is a solution using the PySpark size built-in function:
from pyspark.sql.functions import col, size

df = spark.createDataFrame(
    [(1, {1: 'A'}),
     (2, {2: 'B'}),
     (3, {3: 'C'}),
     (4, {}),
     (5, None)]
).toDF("id", "map")
df.printSchema()
# root
# |-- id: long (nullable = true)
# |-- map: map (nullable = true)
# | |-- key: long
# | |-- value: string (valueContainsNull = true)
df.withColumn("is_empty", size(col("map")) <= 0).show()
# +---+--------+--------+
# | id| map|is_empty|
# +---+--------+--------+
# | 1|[1 -> A]| false|
# | 2|[2 -> B]| false|
# | 3|[3 -> C]| false|
# | 4| []| true|
# | 5| null| true|
# +---+--------+--------+
Note that the condition is size <= 0 because, for a NULL map, the function returns -1 when the spark.sql.legacy.sizeOfNull setting is true; otherwise it returns NULL. The Spark documentation for size has more details.
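If you prefer not to rely on the legacy sizeOfNull behaviour, an explicit NULL check gives the same result; a minimal sketch on the same df:

from pyspark.sql.functions import col, size

# Sketch: treat both NULL maps and empty maps as "empty",
# independent of the spark.sql.legacy.sizeOfNull setting.
df.withColumn("is_empty", col("map").isNull() | (size(col("map")) == 0)).show()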
Generic solution: comparing Map column and literal Map
For a more generic solution we can use the built-in function size in combination with a UDF that appends the string key + value of each item into a sorted list (thank you @jxc for pointing out the problem with the previous version). The hypothesis here is that two maps are equal when:
they have the same size
the string representation of key + value is identical between the items of the maps
The literal map is created from an arbitrary python dictionary combining keys and values via map_from_arrays:
from pyspark.sql.functions import udf, lit, size, when, map_from_arrays, array

df = spark.createDataFrame([
    [1, {}],
    [2, {1: 'A', 2: 'B', 3: 'C'}],
    [3, {1: 'A', 2: 'B'}]
]).toDF("key", "map")

dict = {1: 'A', 2: 'B'}
map_keys_ = array([lit(k) for k in dict.keys()])
map_values_ = array([lit(v) for v in dict.values()])
tmp_map = map_from_arrays(map_keys_, map_values_)

to_strlist_udf = udf(lambda d: sorted([str(k) + str(d[k]) for k in d.keys()]))

def map_equals(m1, m2):
    return when(
        (size(m1) == size(m2)) &
        (to_strlist_udf(m1) == to_strlist_udf(m2)), True
    ).otherwise(False)

df = df.withColumn("equals", map_equals(df["map"], tmp_map))
df.show(10, False)
# +---+------------------------+------+
# |key|map |equals|
# +---+------------------------+------+
# |1 |[] |false |
# |2 |[1 -> A, 2 -> B, 3 -> C]|false |
# |3 |[1 -> A, 2 -> B] |true |
# +---+------------------------+------+
Note: As you can see the pyspark == operator works pretty well for array comparison as well.

Does pyspark hash guarantee unique result for different input? [duplicate]

I am working with spark 2.2.0 and pyspark2.
I have created a DataFrame df and now trying to add a new column "rowhash" that is the sha2 hash of specific columns in the DataFrame.
For example, say that df has the columns: (column1, column2, ..., column10)
I require sha2((column2||column3||column4||...... column8), 256) in a new column "rowhash".
For now, I have tried the below methods:
1) Used the hash() function, but since it gives an integer output it is not of much use.
2) Tried using the sha2() function, but it is failing.
Say columnarray has the array of columns I need.
def concat(columnarray):
    concat_str = ''
    for val in columnarray:
        concat_str = concat_str + '||' + str(val)
    concat_str = concat_str[2:]
    return concat_str
and then
df1 = df1.withColumn("row_sha2", sha2(concat(columnarray),256))
This is failing with "cannot resolve" error.
Thanks gaw for your answer. Since I have to hash only specific columns, I created a list of those column names (in hash_col) and changed your function as :
def sha_concat(row, columnarray):
    row_dict = row.asDict()  # transform row to a dict
    concat_str = ''
    for v in columnarray:
        concat_str = concat_str + '||' + str(row_dict.get(v))
    concat_str = concat_str[2:]
    # preserve concatenated value for testing (this can be removed later)
    row_dict["sha_values"] = concat_str
    row_dict["sha_hash"] = hashlib.sha256(concat_str).hexdigest()
    return Row(**row_dict)
Then passed it as:
df1.rdd.map(lambda row: sha_concat(row,hash_col)).toDF().show(truncate=False)
It is now, however, failing with the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 8: ordinal not in range(128)
I can see the value \ufffd in one of the columns, so I am unsure whether there is a way to handle this?
You can use pyspark.sql.functions.concat_ws() to concatenate your columns and pyspark.sql.functions.sha2() to get the SHA256 hash.
Using the data from @gaw:
from pyspark.sql.functions import sha2, concat_ws
df = spark.createDataFrame(
    [(1, "2", 5, 1), (3, "4", 7, 8)],
    ("col1", "col2", "col3", "col4")
)
df.withColumn("row_sha2", sha2(concat_ws("||", *df.columns), 256)).show(truncate=False)
#+----+----+----+----+----------------------------------------------------------------+
#|col1|col2|col3|col4|row_sha2 |
#+----+----+----+----+----------------------------------------------------------------+
#|1 |2 |5 |1 |1b0ae4beb8ce031cf585e9bb79df7d32c3b93c8c73c27d8f2c2ddc2de9c8edcd|
#|3 |4 |7 |8 |57f057bdc4178b69b1b6ab9d78eabee47133790cba8cf503ac1658fa7a496db1|
#+----+----+----+----+----------------------------------------------------------------+
You can pass in either 0 or 256 as the second argument to sha2(), as per the docs:
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). The numBits indicates the desired bit length of the result, which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 256).
The function concat_ws takes in a separator, and a list of columns to join. I am passing in || as the separator and df.columns as the list of columns.
I am using all of the columns here, but you can specify whatever subset of columns you'd like; in your case that would be columnarray. (You need to use the * to unpack the list.)
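As a hedged sketch, hashing only a subset of columns (the list below is hypothetical, standing in for your columnarray):

from pyspark.sql.functions import sha2, concat_ws

# Sketch: hash only the columns named in columnarray (hypothetical subset).
columnarray = ["col2", "col3", "col4"]
df.withColumn("row_sha2", sha2(concat_ws("||", *columnarray), 256)).show(truncate=False)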
If you want to hash the combination of values across the different columns of your dataset, you can apply a self-designed function via map to the RDD of your DataFrame.
import hashlib
from pyspark.sql import Row

test_df = spark.createDataFrame([
    (1, "2", 5, 1), (3, "4", 7, 8),
], ("col1", "col2", "col3", "col4"))

def sha_concat(row):
    row_dict = row.asDict()           # transform row to a dict
    columnarray = row_dict.keys()     # get the column names
    concat_str = ''
    for v in row_dict.values():
        concat_str = concat_str + '||' + str(v)  # concatenate values
    concat_str = concat_str[2:]
    row_dict["sha_values"] = concat_str  # preserve concatenated value for testing (this can be removed later)
    row_dict["sha_hash"] = hashlib.sha256(concat_str).hexdigest()  # calculate sha256 (Python 2; Python 3 expects bytes)
    return Row(**row_dict)

test_df.rdd.map(sha_concat).toDF().show(truncate=False)
The Results would look like:
+----+----+----+----+----------------------------------------------------------------+----------+
|col1|col2|col3|col4|sha_hash |sha_values|
+----+----+----+----+----------------------------------------------------------------+----------+
|1 |2 |5 |1 |1b0ae4beb8ce031cf585e9bb79df7d32c3b93c8c73c27d8f2c2ddc2de9c8edcd|1||2||5||1|
|3 |4 |7 |8 |cb8f8c5d9fd7165cf3c0f019e0fb10fa0e8f147960c715b7f6a60e149d3923a5|8||4||7||3|
+----+----+----+----+----------------------------------------------------------------+----------+
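Note that row_dict.values() does not guarantee a stable order (which is why sha_values for the second row reads 8||4||7||3 above). As a hedged sketch, here is a variant that sorts the column names and encodes the concatenated string before hashing; this is a hypothetical tweak, not part of the original answer:

import hashlib
from pyspark.sql import Row

def sha_concat_sorted(row):
    # Sort the column names for a deterministic concatenation order and
    # encode to UTF-8 so hashlib is not fed a raw (unicode) string.
    row_dict = row.asDict()
    concat_str = '||'.join(str(row_dict[k]) for k in sorted(row_dict.keys()))
    row_dict["sha_values"] = concat_str
    row_dict["sha_hash"] = hashlib.sha256(concat_str.encode("utf-8")).hexdigest()
    return Row(**row_dict)

test_df.rdd.map(sha_concat_sorted).toDF().show(truncate=False)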
New in version 2.0 is the hash function.
from pyspark.sql.functions import hash
(
    spark
    .createDataFrame([(1, 'Abe'), (2, 'Ben'), (3, 'Cas')], ('id', 'name'))
    .withColumn('hashed_name', hash('name'))
).show()
which results in:
+---+----+-----------+
| id|name|hashed_name|
+---+----+-----------+
| 1| Abe| 1567000248|
| 2| Ben| 1604243918|
| 3| Cas| -586163893|
+---+----+-----------+
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html#hash
If you want to control what the IDs should look like, you can use the code below.
import pyspark.sql.functions as F
from pyspark.sql import Window

SRIDAbbrev = "SOD"  # could be any abbreviation that identifies the table or object in the table name
max_ID = 0          # controls where your numbering starts from; the lpad below pads it to 8 digits
if max_ID is None:
    max_ID = 0      # helps identify where you start numbering from

# `dataframe` is your existing DataFrame with a 'name' column
dataframe_new = dataframe.orderBy(
    'name'
).withColumn(
    'hashed_name',
    F.concat(
        F.lit(SRIDAbbrev),
        F.lpad(
            (
                F.dense_rank().over(
                    Window.orderBy('name')
                )
                + F.lit(max_ID)
            ),
            8,
            "0"
        )
    )
)
which results in:
+---+----+-----------+
| id|name|hashed_name|
+---+----+-----------+
| 1| Abe| SOD0000001|
| 2| Ben| SOD0000002|
| 3| Cas| SOD0000003|
| 3| Cas| SOD0000003|
+---+----+-----------+
Let me know if this helps :)

pyspark: zip arrays in a dataframe

I have the following PySpark DataFrame:
+------+----------------+
| id| data |
+------+----------------+
| 1| [10, 11, 12]|
| 2| [20, 21, 22]|
| 3| [30, 31, 32]|
+------+----------------+
At the end, I want to have the following DataFrame
+--------+----------------------------------+
| id | data |
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+
In order to do this, I first extract the data arrays as follows:
tmp_array = df_test.select("data").rdd.flatMap(lambda x: x).collect()
a0 = tmp_array[0]
a1 = tmp_array[1]
a2 = tmp_array[2]
samples = zip(a0, a1, a2)
samples1 = sc.parallelize(samples)
In this way, I have in samples1 an RDD with the content
[[10,20,30],[11,21,31],[12,22,32]]
Question 1: Is that a good way to do it?
Question 2: How to include that RDD back into the dataframe?
Here is a way to get your desired output without serializing to rdd or using a udf. You will need two constants:
The number of rows in your DataFrame (df.count())
The length of data (given)
Use pyspark.sql.functions.collect_list() and pyspark.sql.functions.array() in a double list comprehension to pick out the elements of "data" in the order you want using pyspark.sql.Column.getItem():
import pyspark.sql.functions as f

dataLength = 3
numRows = df.count()

df.select(
    f.collect_list("id").alias("id"),
    f.array(
        [
            f.array(
                [f.collect_list("data").getItem(j).getItem(i)
                 for j in range(numRows)]
            )
            for i in range(dataLength)
        ]
    ).alias("data")
).show(truncate=False)
#+---------+------------------------------------------------------------------------------+
#|id |data |
#+---------+------------------------------------------------------------------------------+
#|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
#+---------+------------------------------------------------------------------------------+
You can simply use a udf for the zip function, but before that you will have to use the collect_list function:
from pyspark.sql import functions as f
from pyspark.sql import types as t

def zipUdf(array):
    return list(zip(*array))  # wrap in list() so the result is materialized on Python 3

zipping = f.udf(zipUdf, t.ArrayType(t.ArrayType(t.IntegerType())))

df.select(
    f.collect_list(df.id).alias('id'),
    zipping(f.collect_list(df.data)).alias('data')
).show(truncate=False)
which would give you
+---------+------------------------------------------------------------------------------+
|id |data |
+---------+------------------------------------------------------------------------------+
|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
+---------+------------------------------------------------------------------------------+

Removing NULL , NAN, empty space from PySpark DataFrame

I have a dataframe in PySpark which contains empty space, Null, and Nan.
I want to remove rows which have any of those. I tried the below commands, but nothing seems to work.
myDF.na.drop().show()
myDF.na.drop(how='any').show()
Below is the dataframe:
+---+----------+----------+-----+-----+
|age| category| date|empId| name|
+---+----------+----------+-----+-----+
| 25|electronic|17-01-2018| 101| abc|
| 24| sports|16-01-2018| 102| def|
| 23|electronic|17-01-2018| 103| hhh|
| 23|electronic|16-01-2018| 104| yyy|
| 29| men|12-01-2018| 105| ajay|
| 31| kids|17-01-2018| 106|vijay|
| | Men| nan| 107|Sumit|
+---+----------+----------+-----+-----+
What am I missing? What is the best way to tackle NULL, Nan or empty spaces so that there is no problem in the actual calculation?
NaN (not a number) has a different meaning than NULL, and an empty string is just a normal value (one that can be converted to NULL automatically by the csv reader), so na.drop won't match these.
You can convert all of them to NULL and then drop; when() without an otherwise() returns NULL whenever the condition is false, so blanks and NaNs become ordinary NULLs that na.drop() removes:
from pyspark.sql.functions import col, isnan, when, trim
df = spark.createDataFrame([
    ("", 1, 2.0), ("foo", None, 3.0), ("bar", 1, float("NaN")),
    ("good", 42, 42.0)])

def to_null(c):
    return when(~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == "")), col(c))

df.select([to_null(c).alias(c) for c in df.columns]).na.drop().show()
# +----+---+----+
# | _1| _2| _3|
# +----+---+----+
# |good| 42|42.0|
# +----+---+----+
Maybe in your case it is not important, but this code (a modified version of Alper t. Turker's answer) can handle different data types accordingly. The data types can vary according to your DataFrame, of course. (Tested on Spark version 2.4.)
from pyspark.sql.functions import col, isnan, when, trim

# Find out the dataType and act accordingly
def to_null_bool(c, dt):
    if dt == "double":
        return ~(c.isNull() | isnan(c))
    elif dt == "string":
        return ~c.isNull() & (trim(c) != "")
    else:
        return ~c.isNull()

# Only keep values that pass the check; everything else becomes NULL
def to_null(c, dt):
    c = col(c)
    return when(to_null_bool(c, dt), c)

df.select([to_null(c, dt[1]).alias(c) for c, dt in zip(df.columns, df.dtypes)]).na.drop(how="any").show()

How to detect null column in pyspark

I have a dataframe defined with some null values. Some columns are entirely null.
>> df.show()
+---+---+---+----+
| A| B| C| D|
+---+---+---+----+
|1.0|4.0|7.0|null|
|2.0|5.0|7.0|null|
|3.0|6.0|5.0|null|
+---+---+---+----+
In my case, I want to return a list of columns name that are filled with null values. My idea was to detect the constant columns (as the whole column contains the same null value).
this is how I did it:
nullCoulumns = [c for c, const in df.select([(min(c) == max(c)).alias(c) for c in df.columns]).first().asDict().items() if const]
but this does not consider null columns as constant; it works only with values.
How should I do it then?
Extend the condition to
from pyspark.sql.functions import min, max
((min(c).isNull() & max(c).isNull()) | (min(c) == max(c))).alias(c)
or use eqNullSafe (PySpark 2.3):
(min(c).eqNullSafe(max(c))).alias(c)
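Applied to the example DataFrame, a minimal sketch (assuming PySpark 2.3+ for eqNullSafe):

from pyspark.sql.functions import min, max

# Sketch: reuse the original list comprehension with the null-safe condition.
nullColumns = [c for c, const in df.select(
    [min(c).eqNullSafe(max(c)).alias(c) for c in df.columns]
).first().asDict().items() if const]
# Expected on the sample data: ['D'] (all-NULL columns, plus any constant columns)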
One way would be to do it implicitly: select each column, count its NULL values, and then compare this with the total number of rows. With your data, this would be:
spark.version
# u'2.2.0'

from pyspark.sql.functions import col

nullColumns = []
numRows = df.count()
for k in df.columns:
    nullRows = df.where(col(k).isNull()).count()
    if nullRows == numRows:  # i.e. if ALL values are NULL
        nullColumns.append(k)

nullColumns
# ['D']
But there is a simpler way: it turns out that the function countDistinct, when applied to a column with all NULL values, returns zero (0):
from pyspark.sql.functions import countDistinct
df.agg(countDistinct(df.D).alias('distinct')).collect()
# [Row(distinct=0)]
So the for loop now can be:
nullColumns = []
for k in df.columns:
    if df.agg(countDistinct(df[k])).collect()[0][0] == 0:
        nullColumns.append(k)

nullColumns
# ['D']
UPDATE (after comments): It seems possible to avoid collect in the second solution; since df.agg returns a dataframe with only one row, replacing collect with take(1) will safely do the job:
nullColumns = []
for k in df.columns:
    if df.agg(countDistinct(df[k])).take(1)[0][0] == 0:
        nullColumns.append(k)

nullColumns
# ['D']
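Both loops above launch one Spark job per column. As a rough sketch (not from the original answer), the same check can be done in a single aggregation:

from pyspark.sql.functions import countDistinct

# Sketch: count distinct (non-NULL) values for every column in one job.
counts = df.agg(*[countDistinct(df[c]).alias(c) for c in df.columns]).first().asDict()
nullColumns = [c for c, n in counts.items() if n == 0]
# nullColumns -> ['D']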
How about this? In order to guarantee that a column is all nulls, two properties must be satisfied:
(1) The min value is equal to the max value
(2) The min or max is null
Or, equivalently
(1) The min AND max are both equal to None
Note that if property (2) is not satisfied, the case where column values are [null, 1, null, 1] would be incorrectly reported since the min and max will be 1.
import pyspark.sql.functions as F

def get_null_column_names(df):
    column_names = []
    for col_name in df.columns:
        min_ = df.select(F.min(col_name)).first()[0]
        max_ = df.select(F.max(col_name)).first()[0]
        if min_ is None and max_ is None:
            column_names.append(col_name)
    return column_names
Here's an example in practice:
>>> rows = [(None, 18, None, None),
...         (1, None, None, None),
...         (1, 9, 4.0, None),
...         (None, 0, 0., None)]
>>> schema = "a: int, b: int, c: float, d:int"
>>> df = spark.createDataFrame(data=rows, schema=schema)
>>> df.show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
|null| 18|null|null|
| 1|null|null|null|
| 1| 9| 4.0|null|
|null| 0| 0.0|null|
+----+----+----+----+
>>> get_null_column_names(df)
['d']
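get_null_column_names launches two jobs per column; as a hedged sketch, the min/max checks can be folded into a single select over the same df:

from pyspark.sql import functions as F

# Sketch: collect min and max for every column in one pass.
agg_row = df.select(
    [F.min(c).alias("min_" + c) for c in df.columns] +
    [F.max(c).alias("max_" + c) for c in df.columns]
).first()
null_columns = [c for c in df.columns
                if agg_row["min_" + c] is None and agg_row["max_" + c] is None]
# null_columns -> ['d']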
