How to replace accented characters in PySpark?

I have a string column in a dataframe with accented values, like
'México', 'Albânia', 'Japão'
How can I replace the accented letters to get this:
'Mexico', 'Albania', 'Japao'
I tried many solutions available on Stack Overflow, like this:
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
But it disappointingly returns:
strip_accents('México')
>>> 'M?xico'

You can use translate:
df = spark.createDataFrame(
    [
        ('1', 'Japão'),
        ('2', 'Irã'),
        ('3', 'São Paulo'),
        ('5', 'Canadá'),
        ('6', 'Tókio'),
        ('7', 'México'),
        ('8', 'Albânia')
    ],
    ["id", "Local"]
)
df.show(truncate=False)
+---+---------+
|id |Local |
+---+---------+
|1 |Japão |
|2 |Irã |
|3 |São Paulo|
|5 |Canadá |
|6 |Tókio |
|7 |México |
|8 |Albânia |
+---+---------+
from pyspark.sql import functions as F

df\
    .withColumn('Loc_norm', F.translate('Local',
        'ãäöüẞáäčďéěíĺľňóôŕšťúůýžÄÖÜẞÁÄČĎÉĚÍĹĽŇÓÔŔŠŤÚŮÝŽ',
        'aaousaacdeeillnoorstuuyzAOUSAACDEEILLNOORSTUUYZ'))\
    .show(truncate=False)
+---+---------+---------+
|id |Local |Loc_norm |
+---+---------+---------+
|1 |Japão |Japao |
|2 |Irã |Ira |
|3 |São Paulo|Sao Paulo|
|5 |Canadá |Canada |
|6 |Tókio |Tokio |
|7 |México |Mexico |
|8 |Albânia |Albânia |
+---+---------+---------+
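Note in the output above that 'Albânia' comes through unchanged, because 'â' is not in the matching string passed to translate. If that matters for your data, you can simply extend both lists. A minimal sketch (the appended characters are only examples; the two strings must stay position-aligned):
from pyspark.sql import functions as F

# translate maps matching[i] -> replace[i], so both strings must have the same length.
# Here 'â', 'ê', 'õ' and their upper-case forms are appended as examples.
matching = 'ãäöüẞáäčďéěíĺľňóôŕšťúůýžÄÖÜẞÁÄČĎÉĚÍĹĽŇÓÔŔŠŤÚŮÝŽ' + 'âêõÂÊÕ'
replace = 'aaousaacdeeillnoorstuuyzAOUSAACDEEILLNOORSTUUYZ' + 'aeoAEO'

df.withColumn('Loc_norm', F.translate('Local', matching, replace)).show(truncate=False)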

In PySpark, you can create a pandas_udf, which is vectorized, so it's preferred over a regular udf.
This seems to be the best way to do it in pandas, so we can use it to create a pandas_udf for a PySpark application.
from pyspark.sql import functions as F
import pandas as pd
@F.pandas_udf('string')
def strip_accents(s: pd.Series) -> pd.Series:
    return s.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
Test:
df = df.withColumn('country2', strip_accents('country'))
df.show()
# +-------+--------+
# |country|country2|
# +-------+--------+
# | México| Mexico|
# |Albânia| Albania|
# | Japão| Japao|
# +-------+--------+

Related

Pyspark - generate distinct values when adding a new column to a dataframe [duplicate]

I would like to add a column with a generated id to my data frame. I have tried:
uuidUdf = udf(lambda x: str(uuid.uuid4()), StringType())
df = df.withColumn("id", uuidUdf())
however, when I do this, nothing is written to my output directory. When I remove these lines, everything works fine so there must be some error but I don't see anything in the console.
I have tried using monotonically_increasing_id() instead of generating a UUID but in my testing, this produces many duplicates. I need a unique identifier (does not have to be a UUID specifically).
How can I do this?
Please try this:
import uuid
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

uuidUdf = udf(lambda: str(uuid.uuid4()), StringType())
Df1 = Df.withColumn("id", uuidUdf())
Note: you should assign the result to a new DataFrame after adding the column (Df1 = Df.withColumn(...)), since withColumn does not modify Df in place.
A simple way:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
df = spark.range(10)
df.withColumn("uuid", f.expr("uuid()")).show(truncate=False)
From pyspark's functions.py:
note: The user-defined functions are considered deterministic by default. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query. If your function is not deterministic, call asNondeterministic on the user defined function. E.g.:
from pyspark.sql.types import IntegerType
import random
random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic()
So for a UUID this would be:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import uuid
random_udf = udf(lambda: str(uuid.uuid4()), StringType()).asNondeterministic()
and the usage:
df = df.withColumn('id', random_udf())
Please use the lit function if you want the same id for all the records.
The UUID is generated only once in Python, and lit adds that literal value to every record.
>>> df.show(truncate=False)
+---+
|x |
+---+
|0 |
|1 |
|2 |
|3 |
|4 |
|5 |
|6 |
|7 |
|8 |
|9 |
+---+
>>> import uuid
>>> id = str(uuid.uuid4())
>>> df = df.withColumn("id", lit(id))
>>> df.show(truncate=False)
+---+------------------------------------+
|x |id |
+---+------------------------------------+
|0 |923b69d6-4bee-423d-a892-79162df5684d|
|1 |923b69d6-4bee-423d-a892-79162df5684d|
|2 |923b69d6-4bee-423d-a892-79162df5684d|
|3 |923b69d6-4bee-423d-a892-79162df5684d|
|4 |923b69d6-4bee-423d-a892-79162df5684d|
|5 |923b69d6-4bee-423d-a892-79162df5684d|
|6 |923b69d6-4bee-423d-a892-79162df5684d|
|7 |923b69d6-4bee-423d-a892-79162df5684d|
|8 |923b69d6-4bee-423d-a892-79162df5684d|
|9 |923b69d6-4bee-423d-a892-79162df5684d|
+---+------------------------------------+
Using a udf won't achieve this, as it gets called for every row and we end up with a new UUID for each call.
>>> uuidUdf = udf(lambda: str(uuid.uuid4()), StringType())
>>> df1 = df.withColumn("id", uuidUdf())
>>> df1.show(truncate=False)
+---+------------------------------------+
|x |id |
+---+------------------------------------+
|0 |6d051ec6-b91a-4c42-b37c-707a293f1dc8|
|1 |cd3c75b1-8a06-461b-82ae-51f4354296bd|
|2 |3996a022-de99-4403-9346-74e66210f9ef|
|3 |ad57a9c4-5c67-4545-bef6-77d89cff70d5|
|4 |5c9a82a1-323e-4ce0-9082-e36c5a6f61db|
|5 |7a64ee81-4c84-43d0-ab7d-0a79ed694950|
|6 |a0fb26e7-cf1a-445d-bd26-10dc453ddc1e|
|7 |435a7e6a-da22-4add-8953-b5c56b01c790|
|8 |fd3c5fd8-c9d5-4725-b32a-f3ce9386b9b8|
|9 |2291cc67-47cf-4921-80ec-b4180c73533c|
+---+------------------------------------+
I'm using pyspark==3.2.1; you can add your uuid version easily like the following:
import uuid
from pyspark.sql import functions as f
df.withColumn("uuid", f.lit(str(uuid.uuid4())))

How to filter or delete the row in spark dataframe by a specific number?

I want to manipulate a Spark dataframe. For example, there is a dataframe with two columns.
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
|1 |Bob |
|2 |Bob |
|3 |Alice |
|4 |Alice |
|5 |Alice |
............
There are two distinct names in the value column, and there are more rows with Alice than with Bob. I want to delete some of the rows containing Alice so that the number of Alice rows equals the number of Bob rows. The rows should be deleted randomly, but I found no API supporting such a manipulation. What should I do to trim the rows down to a specific number?
Perhaps you can use a Spark window function with row_number and subsequent filtering, something like this:
>>> df.show(truncate=False)
+---+-----+
|key|value|
+---+-----+
|1 |Bob |
|2 |Bob |
|3 |Alice|
|4 |Alice|
|5 |Alice|
+---+-----+
>>> from pyspark.sql import Window
>>> from pyspark.sql.functions import *
>>> window = Window.orderBy("value").partitionBy("value")
>>> df2 = df.withColumn("seq",row_number().over(window))
>>> df2.show(truncate=False)
+---+-----+---+
|key|value|seq|
+---+-----+---+
|1 |Bob |1 |
|2 |Bob |2 |
|3 |Alice|1 |
|4 |Alice|2 |
|5 |Alice|3 |
+---+-----+---+
>>> N = 2
>>> df3 = df2.where("seq <= %d" % N).drop("seq")
>>> df3.show(truncate=False)
+---+-----+
|key|value|
+---+-----+
|1 |Bob |
|2 |Bob |
|3 |Alice|
|4 |Alice|
+---+-----+
>>>
Here's your pseudocode:
Count "Bob"
[repartition the data]/[groupby] (partitionBy/groupBy)
[use iteration to cut off data at "Bob"'s count] (mapPartitions/mapGroups)
You must remember that technically Spark does not guarantee ordering on a dataset, so adding new data can randomly change the order of the data. So you could consider this random and just cut at the count when you're done. This should be faster than creating a window. If you really felt compelled, you could create your own random probability function to return a fraction of each partition.
You can also use a window for this, partitionBy("value").orderBy("value"), and use row_number & where to filter the partitions down to "Bob"'s count (a sketch of that variant follows).
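A minimal sketch of that window variant, assuming the same df with columns key and value as above; ordering the window by F.rand() is my own addition, so which Alice rows survive is explicitly random:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Step 1: count "Bob"
bob_count = df.where(F.col("value") == "Bob").count()

# Steps 2-3: number the rows inside each value group in a random order,
# then keep at most bob_count rows per group.
w = Window.partitionBy("value").orderBy(F.rand())
balanced = (
    df.withColumn("seq", F.row_number().over(w))
      .where(F.col("seq") <= bob_count)
      .drop("seq")
)
balanced.show(truncate=False)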

How to get the info in table header (schema)?

env: spark2.4.5
source: id-name.json
{"1": "a", "2": "b", "3":, "c"..., "n": "z"}
I load the .json file into a Spark Dataset in JSON format, and it is stored like:
+---+---+---+---+---+
| 1 | 2 | 3 |...| n |
+---+---+---+---+---+
| a | b | c |...| z |
+---+---+---+---+---+
And I want it to be generated like such result:
+------------+------+
| id | name |
+------------+------+
| 1 | a |
| 2 | b |
| 3 | c |
| . | . |
| . | . |
| . | . |
| n | z |
+------------+------+
My solution using spark-sql:
select stack(n, '1', `1`, '2', `2`... ,'n', `n`) as ('id', 'name') from table_name;
It doesn't meet my needs because I don't want to hard-code all the ids in the SQL.
Maybe using 'show columns from table_name' with 'stack()' can help?
I would be very grateful if you could give me some suggestions.
Create the required values for stack dynamically & use them wherever required. Please check the code below, which generates the same values dynamically.
scala> val js = Seq("""{"1": "a", "2": "b","3":"c","4":"d","5":"e"}""").toDS
js: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val df = spark.read.json(js)
df: org.apache.spark.sql.DataFrame = [1: string, 2: string ... 3 more fields]
scala> val stack = s"""stack(${df.columns.max},${df.columns.flatMap(c => Seq(s"'${c}'",s"`${c}`")).mkString(",")}) as (id,name)"""
exprC: String = stack(5,'1',`1`,'2',`2`,'3',`3`,'4',`4`,'5',`5`) as (id,name)
scala> df.select(expr(stack)).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
scala> spark.sql(s"""select ${stack} from table """).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
scala>
Updated Code to fetch data from json file
scala> "hdfs dfs -cat /tmp/sample.json".!
{"1": "a", "2": "b","3":"c","4":"d","5":"e"}
res4: Int = 0
scala> val df = spark.read.json("/tmp/sample.json")
df: org.apache.spark.sql.DataFrame = [1: string, 2: string ... 3 more fields]
scala> val stack = s"""stack(${df.columns.max},${df.columns.flatMap(c => Seq(s"'${c}'",s"`${c}`")).mkString(",")}) as (id,name)"""
stack: String = stack(5,'1',`1`,'2',`2`,'3',`3`,'4',`4`,'5',`5`) as (id,name)
scala> df.select(expr(stack)).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
scala> df.createTempView("table")
scala> spark.sql(s"""select ${stack} from table """).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
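If you'd rather stay in PySpark, the same idea carries over directly. A minimal sketch, assuming an active spark session and that the JSON has been written to an illustrative path:
from pyspark.sql import functions as F

df = spark.read.json("/tmp/sample.json")  # illustrative path, same one-record layout as above

# Build "stack(n, '1', `1`, '2', `2`, ...) as (id, name)" from the actual column names.
pairs = ",".join("'{0}',`{0}`".format(c) for c in df.columns)
stack_expr = "stack({0},{1}) as (id, name)".format(len(df.columns), pairs)

df.select(F.expr(stack_expr)).show(truncate=False)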

Sorting DataFrame within rows and getting the ranking

I have the following PySpark DataFrame :
+----+----------+----------+----------+
| id| a| b| c|
+----+----------+----------+----------+
|2346|2017-05-26| null|2016-12-18|
|5678|2013-05-07|2018-05-12| null|
+----+----------+----------+----------+
My ideal output is :
+----+---+---+---+
|id |a |b |c |
+----+---+---+---+
|2346|2 |0 |1 |
|5678|1 |2 |0 |
+----+---+---+---+
I.e. the more recent the date within the row, the higher the score.
I have looked at similar posts suggesting the use of a window function. The problem is that I need to order my values within the row, not within a column.
You can put the values in each row into an array and use pyspark.sql.functions.sort_array() to sort it.
import pyspark.sql.functions as f
cols = ["a", "b", "c"]
df = df.select("*", f.sort_array(f.array([f.col(c) for c in cols])).alias("sorted"))
df.show(truncate=False)
#+----+----------+----------+----------+------------------------------+
#|id |a |b |c |sorted |
#+----+----------+----------+----------+------------------------------+
#|2346|2017-05-26|null |2016-12-18|[null, 2016-12-18, 2017-05-26]|
#|5678|2013-05-07|2018-05-12|null |[null, 2013-05-07, 2018-05-12]|
#+----+----------+----------+----------+------------------------------+
Now you can use a combination of pyspark.sql.functions.coalesce() and pyspark.sql.functions.when() to loop over each of the columns in cols and find the corresponding index in the sorted array.
df = df.select(
    "id",
    *[
        f.coalesce(
            *[
                f.when(
                    f.col("sorted").getItem(i) == f.col(c),
                    f.lit(i)
                )
                for i in range(len(cols))
            ]
        ).alias(c)
        for c in cols
    ]
)
df.show(truncate=False)
#+----+---+----+----+
#|id |a |b |c |
#+----+---+----+----+
#|2346|2 |null|1 |
#|5678|1 |2 |null|
#+----+---+----+----+
Finally fill the null values with 0:
df = df.na.fill(0)
df.show(truncate=False)
#+----+---+---+---+
#|id |a |b |c |
#+----+---+---+---+
#|2346|2 |0 |1 |
#|5678|1 |2 |0 |
#+----+---+---+---+

Split Contents of String column in PySpark Dataframe

I have a pyspark data frame which has a column containing strings. I want to split this column into words.
Code:
>>> sentenceData = sqlContext.read.load('file://sample1.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
>>> sentenceData.show(truncate=False)
+---+---------------------------+
|key|desc |
+---+---------------------------+
|1 |Virat is good batsman |
|2 |sachin was good |
|3 |but modi sucks big big time|
|4 |I love the formulas |
+---+---------------------------+
Expected Output
---------------
>>> sentenceData.show(truncate=False)
+---+-------------------------------------+
|key|desc |
+---+-------------------------------------+
|1 |[Virat,is,good,batsman] |
|2 |[sachin,was,good] |
|3 |.... |
|4 |... |
+---+-------------------------------------+
How can I achieve this?
Use the split function:
from pyspark.sql.functions import split

df.withColumn("desc", split("desc", r"\s+"))
