I have a dataframe df with a string column sld that contains words concatenated together with no space/delimiter. One of the libraries that can be used to split such strings is wordninja:
E.g. wordninja.split('culturetosuccess') outputs ['culture','to','success']
Using pandas_udf, I have:
@pandas_udf(ArrayType(StringType()))
def split_word(x):
    splitted = wordninja.split(x)
    return splitted
However, it throws an error when I apply it on the column sld:
df1=df.withColumn('test', split_word(col('sld')))
TypeError: expected string or bytes-like object
What I tried:
I noticed that there is a similar problem with the well-known split() function, but the workaround there is to use the Series .str accessor, as mentioned here. That doesn't work with wordninja.split.
Any work around this issue?
Edit: I think in a nutshell the issue is:
the pandas_udf input is a pd.Series, while wordninja.split expects a string.
My df looks like this:
+-------------+
|sld |
+-------------+
|"hellofriend"|
|"restinpeace"|
|"this" |
|"that" |
+-------------+
I want something like this:
+-------------+---------------------+
| sld | test |
+-------------+---------------------+
|"hellofriend"|["hello","friend"] |
|"restinpeace"|["rest","in","peace"]|
|"this" |["this"] |
|"that" |["that"] |
+-------------+---------------------+
Just use .apply to perform the computation on each element of the pandas Series, something like this:
@pandas_udf(ArrayType(StringType()))
def split_word(x: pd.Series) -> pd.Series:
    splitted = x.apply(lambda s: wordninja.split(s))
    return splitted
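For completeness, here is a minimal end-to-end sketch of that approach (assuming Spark 3.x with type-hinted pandas UDFs, and that wordninja and pyarrow are installed; df1 is just an illustrative name):
import pandas as pd
import wordninja
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, StringType

@pandas_udf(ArrayType(StringType()))
def split_word(x: pd.Series) -> pd.Series:
    # wordninja.split works on a single string, so apply it element-wise
    return x.apply(wordninja.split)

df1 = df.withColumn('test', split_word(F.col('sld')))
df1.show(truncate=False)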
One way is to use udf.
import wordninja
from pyspark.sql import functions as F
df = spark.createDataFrame([("hellofriend",), ("restinpeace",), ("this",), ("that",)], ['sld'])
from pyspark.sql.types import ArrayType, StringType

@F.udf(ArrayType(StringType()))
def split_word(x):
    return wordninja.split(x)
df.withColumn('col2', split_word('sld')).show()
# +-----------+-----------------+
# | sld| col2|
# +-----------+-----------------+
# |hellofriend| [hello, friend]|
# |restinpeace|[rest, in, peace]|
# | this| [this]|
# | that| [that]|
# +-----------+-----------------+
Related
I have a dataframe with one column like this:
Locations
Germany:city_Berlin
France:town_Montpellier
Italy:village_Amalfi
I would like to get rid of the substrings: 'city_', 'town_', 'village_', etc.
So the output should be:
Locations
Germany:Berlin
France:Montpellier
Italy:Amalfi
I can get rid of one of them this way:
F.regexp_replace('Locations', 'city_', '')
Is there a similar way to pass several substrings to remove from the original column?
Ideally I'm looking for a one line solution, without having to create separate functions or convoluted things.
I wouldn't map. It looks to me like you want to remove the substring immediately to the right of : when it ends with _. If so, use regex. Code below:
df.withColumn('new_Locations', regexp_replace('Locations', r'(?<=:)[a-z_]+', '')).show(truncate=False)
+---+-----------------------+------------------+
|id |Locations |new_Locations |
+---+-----------------------+------------------+
|1 |Germany:city_Berlin |Germany:Berlin |
|2 |France:town_Montpellier|France:Montpellier|
|4 |Italy:village_Amalfi |Italy:Amalfi |
+---+-----------------------+------------------+
F.regexp_replace('Locations', r'(?<=:).*_', '')
.* matches any sequence of characters, but it is located between (?<=:) and _.
_ is the symbol that must follow the characters matched by .*.
(?<=:) is the syntax for a "positive lookbehind". It is not part of the match, but it ensures that right before the .*_ there must be a : symbol.
Another option - list of strings to remove
strings = ['city', 'big_city', 'town', 'big_town', 'village']
F.regexp_replace('Locations', fr"(?<=:)({'|'.join(strings)})_", '')
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('Germany:big_city_Berlin',),
('France:big_town_Montpellier',),
('Italy:village_Amalfi',)],
['Locations'])
df = df.withColumn('loc1', F.regexp_replace('Locations', r'(?<=:).*_', ''))
strings = ['city', 'big_city', 'town', 'big_town', 'village']
df = df.withColumn('loc2', F.regexp_replace('Locations', fr"(?<=:)({'|'.join(strings)})_", ''))
df.show(truncate=0)
# +---------------------------+------------------+------------------+
# |Locations |loc1 |loc2 |
# +---------------------------+------------------+------------------+
# |Germany:big_city_Berlin |Germany:Berlin |Germany:Berlin |
# |France:big_town_Montpellier|France:Montpellier|France:Montpellier|
# |Italy:village_Amalfi |Italy:Amalfi |Italy:Amalfi |
# +---------------------------+------------------+------------------+
Not 100% sure about your case, but here is another solution for your problem (Spark 3.x).
from pyspark.sql.functions import expr
df.withColumn("loc_map", expr("str_to_map(Locations, ',', ':')")) \
.withColumn("Locations", expr("transform_values(loc_map, (k, v) -> element_at(split(v, '_'), size(split(v, '_'))))")) \
.drop("loc_map") \
.show(10, False)
First convert Locations into a map using str_to_map
Then iterate through the values of the map and transform each value using transform_values.
Inside transform_values, we use element_at, split and size to identify and return the last item of the array produced by splitting the value on _.
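Note that this replaces Locations with a map column. If you need the original Country:City string layout back, one possible follow-up (a sketch assuming Spark 3.x, not part of the original answer) is to rebuild the string from the map entries:
from pyspark.sql.functions import expr

df.withColumn("loc_map", expr("str_to_map(Locations, ',', ':')")) \
  .withColumn("loc_map", expr("transform_values(loc_map, (k, v) -> element_at(split(v, '_'), size(split(v, '_'))))")) \
  .withColumn("Locations", expr("concat_ws(',', transform(map_entries(loc_map), e -> concat_ws(':', e.key, e.value)))")) \
  .drop("loc_map") \
  .show(10, False)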
Eventually I went with this solution:
df = df.withColumn('Locations', F.regexp_replace('Locations', '(city_)|(town_)|(village_)', ''))
I wasn't aware that I could include several search strings in the regexp_replace function.
I have a PySpark data frame with a string column(URL) and all records look in the following way
ID URL
1 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189
I basically want to extract the number after conversations/ from the URL column into another column, using regex.
I tried the following code but it doesn't give me any results.
df1 = df.withColumn('CONV_ID', split(convo_influ_new['URL'], '(?<=conversations/).*').getItem(0))
Expected:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 2938419189
Result:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 https://app.drift.com/inboxes/136636/conversations/2938419189
Not sure what's happening here. I tried the regex in different online regex tester tools and it highlights the part I want, but it never works in PySpark. I tried different PySpark functions like f.split, regexp_extract, and regexp_replace, but none of them work.
If your URLs always have that form, you can actually just use substring_index to get the last path element:
import pyspark.sql.functions as F
df1 = df.withColumn("CONV_ID", F.substring_index("URL", "/", -1))
df1.show(truncate=False)
#+---+-------------------------------------------------------------+----------+
#|ID |URL |CONV_ID |
#+---+-------------------------------------------------------------+----------+
#|1 |https://app.xyz.com/inboxes/136636/conversations/2686735685 |2686735685|
#|2 |https://app.xyz.com/inboxes/136636/conversations/2938415796 |2938415796|
#|3 |https://app.drift.com/inboxes/136636/conversations/2938419189|2938419189|
#+---+-------------------------------------------------------------+----------+
You can use regexp_extract instead:
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.regexp_extract('URL', 'conversations/(.*)', 1)
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
Or if you want to use split, you don't need to specify .*. You just need to specify the pattern used for splitting.
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.split('URL', '(?<=conversations/)')[1] # just using 'conversations/' should also be enough
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
I'm having trouble splitting a dataframe's column into more columns in PySpark:
I have a list of lists and I want to transform it into a dataframe, with each value in its own column.
What I have tried:
I created a dataframe from this list:
[['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]
Using this code:
from pyspark.sql import Row
R = Row('col1', 'col2')
# use enumerate to add the ID column
df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])
The result I got is:
+----+--------------------+
|col1| col2|
+----+--------------------+
| 0|[COL-4560, COL-96...|
| 1|[DLL-7760, NAT-98...|
+----+--------------------+
I want to separate the values by comma into columns, so I tried:
from pyspark.sql import functions as F
df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2'))
# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]
df_result = df2.select('col1', *[df2['col2'][i] for i in range(nb_columns)])
df_result.show()
But I get an error on this line df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2')):
AnalysisException: cannot resolve 'split(`col2`, ', ', -1)' due to data type mismatch: argument 1 requires string type, however, '`col2`' is of array<string> type.;;
My ideal final output would be like this:
+----------+----------+----------+----------+----------+
| SKU | REC_01 | REC_02 | REC_03 | REC_04 |
+----------+----------+----------+----------+----------+
| COL-4560 | COL-9655 | NWG-0610 | D81-3754 | null |
| DLL-7760 | NAT-9885 | PED-0550 | MAR-0004 | LLL-5554 |
+----------+----------+----------+----------+----------+
Some rows may have four values, but some may have more or less; I don't know the exact number of columns the final dataframe will have.
Does anyone have any idea of what is happening? Thank you very much in advance.
The col2 column of df_from_list is already of array type, so there is no need to split it (split works on StringType, but here we have ArrayType).
Here are the steps that will work for you.
recs_list=[['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]
from pyspark.sql import Row
R = Row('col1', 'col2')
# use enumerate to add the ID column
df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])
from pyspark.sql import functions as F
df2 = df_from_list
# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]
cols=['SKU','REC_01','REC_02','REC_03','REC_04']
df_result = df2.select(*[df2['col2'][i] for i in range(nb_columns)]).toDF(*cols)
df_result.show()
#+--------+--------+--------+--------+--------+
#| SKU| REC_01| REC_02| REC_03| REC_04|
#+--------+--------+--------+--------+--------+
#|COL-4560|COL-9655|NWG-0610|D81-3754| null|
#|DLL-7760|NAT-9885|PED-0550|MAR-0004|LLL-5554|
#+--------+--------+--------+--------+--------+
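If the number of array elements is not known in advance, the column names can also be generated instead of hard-coding the cols list (a small sketch assuming the same SKU/REC_xx naming convention):
cols = ['SKU'] + [f'REC_{i:02d}' for i in range(1, nb_columns)]
df_result = df2.select(*[df2['col2'][i] for i in range(nb_columns)]).toDF(*cols)
df_result.show()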
I'm trying to write a function as a Pandas UDF that would check whether any element of a string array starts with a particular value. The result I'm looking for would be something like this:
df.filter(list_contains(val, df.stringArray_column)).show()
The function list_contains would return True on every row where any element of df.stringArray starts with val.
Just an example:
df = spark.read.csv(path)
display(df.filter(list_contains('50', df.stringArray_column)))
This code above would display every row of df where an element of the stringArray column starts with 50.
I have written a function in Python, but it is very slow:
def list_contains(val):
    # Perform what ListContains generated
    def list_contains_udf(column_list):
        for element in column_list:
            if element.startswith(val):
                return True
        return False
    return udf(list_contains_udf, BooleanType())
Thank you for your help.
EDIT: Here is some sample data and also an example of the output I'm looking for:
df.LIST: ["408000","641100"]
["633400","641100"]
["633400","791100"]
["633400","408100"]
["633400","641100"]
["408110","641230"]
["633400","647200"]
display(df.select('LIST').filter(list_contains('408')(df.LIST)))
output: LIST
["408000","641100"]
["633400","408100"]
["408110","641230"]
Updated Answer:
It's possible to do this without a UDF if we assume the arrays are of the same length. Let's try the following.
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.functions import col
spark = SparkSession.builder.appName('prefix_finder').getOrCreate()
# sample data creation
my_df = spark.createDataFrame(
    [('scooby', ['cartoon', 'kidfriendly']),
     ('batman', ['dark', 'cars']),
     ('meshuggah', ['insane', 'heavy']),
     ('guthrie', ['god', 'guitar'])],
    schema=('character', 'tags'))
The dataframe my_df would look like the following:
+---------+----------------------+
|character|tags |
+---------+----------------------+
|scooby |[cartoon, kidfriendly]|
|batman |[dark, cars] |
|meshuggah|[insane, heavy]       |
|guthrie |[god, guitar] |
+---------+----------------------+
If we're searching for the prefix car, only the 1st and 2nd row should be returned, because car is a prefix of cartoon and cars.
Here are the native Spark operations to achieve that.
num_items_in_arr = 2 # this was the assumption
prefix = 'car'
my_df2 = my_df.select(col('character'), col('tags'), *(col('tags').getItem(i).alias(f'tag{i}') for i in range(num_items_in_arr)))
The dataframe my_df2 would look like:
+---------+----------------------+-------+-----------+
|character|tags |tag0 |tag1 |
+---------+----------------------+-------+-----------+
|scooby |[cartoon, kidfriendly]|cartoon|kidfriendly|
|batman |[dark, cars] |dark |cars |
|meshuggah|[insane, heavy] |insane |heavy |
|guthrie |[god, guitar] |god |guitar |
+---------+----------------------+-------+-----------+
Let's create a column concat_tags on my_df2, which we're going to use for regex match.
prefix_len = len(prefix)
cols_of_interest = [f'tag{i}' for i in range(num_items_in_arr)]
for idx, col_name in enumerate(cols_of_interest):
    my_df2 = my_df2.withColumn(col_name, f.substring(col_name, 1, prefix_len))
    if idx == 0:
        my_df2 = my_df2.withColumn(col_name, f.concat(f.lit("("), col_name, f.lit(".*")))
    elif idx == len(cols_of_interest) - 1:
        my_df2 = my_df2.withColumn(col_name, f.concat(col_name, f.lit(".*)")))
    else:
        my_df2 = my_df2.withColumn(col_name, f.concat(col_name, f.lit(".*")))
my_df3 = my_df2.withColumn('concat_tags', f.concat_ws('|', *cols_of_interest)).drop(*cols_of_interest)
my_df3 is like the following:
+---------+----------------------+-------------+
|character|tags |concat_tags |
+---------+----------------------+-------------+
|scooby |[cartoon, kidfriendly]|(car.*|kid.*)|
|batman |[dark, cars] |(dar.*|car.*)|
|meshuggah|[insane, heavy] |(ins.*|hea.*)|
|guthrie |[god, guitar] |(god.*|gui.*)|
+---------+----------------------+-------------+
Now we need to apply regex matching on the column concat_tags.
my_df4 = my_df3.withColumn('matched', f.expr(f"regexp_extract('{prefix}', concat_tags, 0)"))
The result looks like this:
+---------+----------------------+-------------+-------+
|character|tags |concat_tags |matched|
+---------+----------------------+-------------+-------+
|scooby |[cartoon, kidfriendly]|(car.*|kid.*)|car |
|batman |[dark, cars] |(dar.*|car.*)|car |
|meshuggah|[insane, heavy] |(ins.*|hea.*)| |
|guthrie |[god, guitar] |(god.*|gui.*)| |
+---------+----------------------+-------------+-------+
A little bit of cleanup.
my_df5 = my_df4.filter(my_df4.matched != "").drop('concat_tags', 'matched')
And here we are, with the final dataframe:
+---------+----------------------+
|character|tags |
+---------+----------------------+
|scooby |[cartoon, kidfriendly]|
|batman |[dark, cars] |
+---------+----------------------+
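If the arrays are not guaranteed to be the same length, a pandas UDF along the lines the question asked for is still an option. Here is a minimal sketch assuming Spark 3.x type hints (contains_prefix is just an illustrative name):
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BooleanType

def list_contains(val):
    @pandas_udf(BooleanType())
    def contains_prefix(column_list: pd.Series) -> pd.Series:
        # True if any element of the array starts with val
        return column_list.apply(lambda arr: any(str(e).startswith(val) for e in arr))
    return contains_prefix

df.filter(list_contains('408')(df.LIST)).show()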
I have a DataFrame(df) in pyspark, by reading from a hive table:
df=spark.sql('select * from <table_name>')
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
When I tried the following, I got an error:
df_dict = dict(zip(df['name'],df['url']))
"TypeError: zip argument #1 must support iteration."
type(df.name) is 'pyspark.sql.column.Column'.
How do I create a dictionary like the following, which can be iterated over later on?
{'person1':'google','msn','yahoo'}
{'person2':'fb.com','airbnb','wired.com'}
{'person3':'fb.com','google.com'}
Appreciate your thoughts and help.
I think you can try row.asDict(); this code runs directly on the executors, and you don't have to collect the data on the driver.
Something like:
df.rdd.map(lambda row: row.asDict())
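If you eventually need a single Python dict keyed by Name (which does require bringing the results back to the driver), a possible follow-up sketch (name_to_urls is just an illustrative variable name):
name_to_urls = df.rdd.map(lambda row: (row['Name'], row['URL visited'])).collectAsMap()
# e.g. {'person1': ['google', 'msn', 'yahoo'], 'person2': ['fb.com', 'airbnb', 'wired.com'], ...}
for name, urls in name_to_urls.items():
    print(name, urls)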
How about using the pyspark Row.asDict() method? This is part of the dataframe API (which I understand is the "recommended" API at the time of writing) and would not require you to use the RDD API at all.
df_list_of_dict = [row.asDict() for row in df.collect()]
type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)
df_list_of_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
If you wanted your results in a python dictionary, you could use collect()1 to bring the data into local memory and then massage the output as desired.
First collect the data:
df_dict = df.collect()
#[Row(Name=u'person1', URL visited=[u'google', u'msn', u'yahoo']),
# Row(Name=u'person2', URL visited=[u'fb.com', u'airbnb', u'wired.com']),
# Row(Name=u'person3', URL visited=[u'fb.com', u'google.com'])]
This returns a list of pyspark.sql.Row objects. You can easily convert this to a list of dicts:
df_dict = [{r['Name']: r['URL visited']} for r in df_dict]
#[{u'person1': [u'google', u'msn', u'yahoo']},
# {u'person2': [u'fb.com', u'airbnb', u'wired.com']},
# {u'person3': [u'fb.com', u'google.com']}]
1 Be advised that for large data sets, this operation can be slow and can potentially fail with an out-of-memory error. Consider first whether this is really what you want to do, as you will lose the parallelization benefits of Spark by bringing the data into local memory.
Given:
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
This should work:
df_dict = df \
.rdd \
.map(lambda row: {row[0]: row[1]}) \
.collect()
df_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
This way you just collect after processing.
Please, let me know if that works for you :)