Split Contents of String column in PySpark Dataframe - apache-spark

I have a pyspark data frame whih has a column containing strings. I want to split this column into words
Code:
>>> sentenceData = sqlContext.read.load('file://sample1.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
>>> sentenceData.show(truncate=False)
+---+---------------------------+
|key|desc |
+---+---------------------------+
|1 |Virat is good batsman |
|2 |sachin was good |
|3 |but modi sucks big big time|
|4 |I love the formulas |
+---+---------------------------+
Expected Output
---------------
>>> sentenceData.show(truncate=False)
+---+-------------------------------------+
|key|desc |
+---+-------------------------------------+
|1 |[Virat,is,good,batsman] |
|2 |[sachin,was,good] |
|3 |.... |
|4 |... |
+---+-------------------------------------+
How can I achieve this?

Use split function:
from pyspark.sql.functions import split
df.withColumn("desc", split("desc", "\s+"))

Related

Filter rows with minimum and maximum count

This is what the dataframe looks like:
+---+-----------------------------------------+-----+
|eco|eco_name |count|
+---+-----------------------------------------+-----+
|B63|Sicilian, Richter-Rauzer Attack |5 |
|D86|Grunfeld, Exchange |3 |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd |5 |
|A44|Old Benoni Defense |3 |
|C46|Three Knights |1 |
|C08|French, Tarrasch, Open, 4.ed ed |13 |
|E59|Nimzo-Indian, 4.e3, Main line |2 |
|A20|English |2 |
|B20|Sicilian |4 |
|B37|Sicilian, Accelerated Fianchetto |2 |
|A33|English, Symmetrical |8 |
|C77|Ruy Lopez |8 |
|B43|Sicilian, Kan, 5.Nc3 |10 |
|A04|Reti Opening |6 |
|A59|Benko Gambit |1 |
|A54|Old Indian, Ukrainian Variation, 4.Nf3 |3 |
|D30|Queen's Gambit Declined |19 |
|C01|French, Exchange |3 |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1 |
|E74|King's Indian, Averbakh, 6...c5 |2 |
+---+-----------------------------------------+-----+
Schema:
root
|-- eco: string (nullable = true)
|-- eco_name: string (nullable = true)
|-- count: long (nullable = false)
I want to filter it so that only two rows with minimum and maximum counts remain.
The output dataframe should look something like:
+---+-----------------------------------------+--------------------+
|eco|eco_name |number_of_occurences|
+---+-----------------------------------------+--------------------+
|D30|Queen's Gambit Declined |19 |
|C46|Three Knights |1 |
+---+-----------------------------------------+--------------------+
I'm a beginner, I'm really sorry if this is a stupid question.
No need to apologize since this is the place to learn! One of the solutions is to use the Window and rank to find the min/max row:
df = spark.createDataFrame(
[('a', 1), ('b', 1), ('c', 2), ('d', 3)],
schema=['col1', 'col2']
)
df.show(10, False)
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |1 |
|c |2 |
|d |3 |
+----+----+
Just use filtering to find the min/max count row after the ranking:
df\
.withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
.withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
.filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
.show(100, False)
+----+----+-------+-------+
|col1|col2|min_row|max_row|
+----+----+-------+-------+
|d |3 |4 |1 |
|a |1 |1 |3 |
|b |1 |1 |3 |
+----+----+-------+-------+
Please note that if the min/max row count are the same, they will be both filtered out.
You can use row_number function twice to order records by count, ascending and descending.
SELECT eco, eco_name, count
FROM (SELECT *,
row_number() over (order by count asc) as rna,
row_number() over (order by count desc) as rnd
FROM df)
WHERE rna = 1 or rnd = 1;
Note there's a tie for count = 1. If you care about it add a secondary sort to control which record is selected or maybe use rank instead to select all.

How to replace accented characters in PySpark?

I have a string column in a dataframe with values with accents, like
'México', 'Albânia', 'Japão'
How to replace letters with accents to get this:
'Mexico', 'Albania', 'Japao'
I tried many solutions available in Stack OverFlow, like this:
def strip_accents(s):
return ''.join(c for c in unicodedata.normalize('NFD', s)
if unicodedata.category(c) != 'Mn')
But disappointed returns
strip_accents('México')
>>> 'M?xico'
You can use translate:
df = spark.createDataFrame(
[
('1','Japão'),
('2','Irã'),
('3','São Paulo'),
('5','Canadá'),
('6','Tókio'),
('7','México'),
('8','Albânia')
],
["id", "Local"]
)
df.show(truncate = False)
+---+---------+
|id |Local |
+---+---------+
|1 |Japão |
|2 |Irã |
|3 |São Paulo|
|5 |Canadá |
|6 |Tókio |
|7 |México |
|8 |Albânia |
+---+---------+
from pyspark.sql import functions as F
df\
.withColumn('Loc_norm', F.translate('Local',
'ãäöüẞáäčďéěíĺľňóôŕšťúůýžÄÖÜẞÁÄČĎÉĚÍĹĽŇÓÔŔŠŤÚŮÝŽ',
'aaousaacdeeillnoorstuuyzAOUSAACDEEILLNOORSTUUYZ'))\
.show(truncate=False)
+---+---------+---------+
|id |Local |Loc_norm |
+---+---------+---------+
|1 |Japão |Japao |
|2 |Irã |Ira |
|3 |São Paulo|Sao Paulo|
|5 |Canadá |Canada |
|6 |Tókio |Tokio |
|7 |México |Mexico |
|8 |Albânia |Albânia |
+---+---------+---------+
In PySpark, you can create a pandas_udf which is vectorized, so it's preferred to a regular udf.
This seems to be the best way to do it in pandas. So, we can use it to create a pandas_udf for PySpark application.
from pyspark.sql import functions as F
import pandas as pd
#F.pandas_udf('string')
def strip_accents(s: pd.Series) -> pd.Series:
return s.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
Test:
df = df.withColumn('country2', strip_accents('country'))
df.show()
# +-------+--------+
# |country|country2|
# +-------+--------+
# | México| Mexico|
# |Albânia| Albania|
# | Japão| Japao|
# +-------+--------+

How to filter or delete the row in spark dataframe by a specific number?

I want to make a manipulate to a spark dataframe. For example, there is a dataframe with two columns.
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
|1 |Bob |
|2 |Bob |
|3 |Alice |
|4 |Alice |
|5 |Alice |
............
There are two kinds of name in the column value and the number of Alice is more than Bob, what I want to modify is to delete some row containing Alice to make the number of row with Alice same of the row with Bob. The row should be deleted ramdomly but I found no API supporting such manipulation. What should I do to delete the row to a specific number?
Perhaps you can use spark window function with row_count and subsequent filtering, something like this:
>>> df.show(truncate=False)
+---+-----+
|key|value|
+---+-----+
|1 |Bob |
|2 |Bob |
|3 |Alice|
|4 |Alice|
|5 |Alice|
+---+-----+
>>> from pyspark.sql import Window
>>> from pyspark.sql.functions import *
>>> window = Window.orderBy("value").partitionBy("value")
>>> df2 = df.withColumn("seq",row_number().over(window))
>>> df2.show(truncate=False)
+---+-----+---+
|key|value|seq|
+---+-----+---+
|1 |Bob |1 |
|2 |Bob |2 |
|3 |Alice|1 |
|4 |Alice|2 |
|5 |Alice|3 |
+---+-----+---+
>>> N = 2
>>> df3 = df2.where("seq <= %d" % N).drop("seq")
>>> df3.show(truncate=False)
+---+-----+
|key|value|
+---+-----+
|1 |Bob |
|2 |Bob |
|3 |Alice|
|4 |Alice|
+---+-----+
>>>
Here's your sudo code:
Count "BOB"
[repartition the data]/[groupby] (partionBy/GroupBy)
[use iteration to cut off data at "BOB's" Count] (mapParitions/mapGroups)
You must remember that technically spark does not guarantee ordering on a dataset, so adding new data can randomly change the order of the data. So you could consider this random and just cut the count when your done. This should be faster than creating a window. If you really felt compelled you could create your own random probability function to return a fraction of each partition.
You can also use a window with this, paritionBy("value").orderBy("value") and use row_count & where to filter the partitions to "Bob's" Count.

Difference between describe() and summary() in Apache Spark

What is the difference between summary() and describe() ?
It seems that they both serve the same purpose. I didn't manage to find any differences (if any).
If we are passing any args then these functions works for different purposes:
.describe() function takes cols:String*(columns in df) as optional args.
.summary() function takes statistics:String*(count,mean,stddev..etc) as optional args.
Example:
scala> val df_des=Seq((1,"a"),(2,"b"),(3,"c")).toDF("id","name")
scala> df_des.describe().show(false) //without args
//Result:
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count |3 |3 |
//|mean |2.0|null|
//|stddev |1.0|null|
//|min |1 |a |
//|max |3 |c |
//+-------+---+----+
scala> df_des.summary().show(false) //without args
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count |3 |3 |
//|mean |2.0|null|
//|stddev |1.0|null|
//|min |1 |a |
//|25% |1 |null|
//|50% |2 |null|
//|75% |3 |null|
//|max |3 |c |
//+-------+---+----+
scala> df_des.describe("id").show(false) //descibe on id column only
//+-------+---+
//|summary|id |
//+-------+---+
//|count |3 |
//|mean |2.0|
//|stddev |1.0|
//|min |1 |
//|max |3 |
//+-------+---+
scala> df_des.summary("count").show(false) //get count summary only
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count |3 |3 |
//+-------+---+----+
The first operation to perform after importing data is to get some sense of what it looks like. For numerical columns, knowing the descriptive summary statistics can help a lot in understanding the distribution of your data. The function describe returns a DataFrame containing information such as number of non-null entries (count), mean, standard deviation, and minimum and maximum value for each numerical column.
https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
Hope it helps.
Both have same functionality but the api syntax is just different. Hope this helps

Calculating sum,count of multiple top K values spark

I have an input dataframe of the format
+---------------------------------+
|name| values |score |row_number|
+---------------------------------+
|A |1000 |0 |1 |
|B |947 |0 |2 |
|C |923 |1 |3 |
|D |900 |2 |4 |
|E |850 |3 |5 |
|F |800 |1 |6 |
+---------------------------------+
I need to get sum(values) when score > 0 and row_number < K (i,e) SUM of all values when score > 0 for the top k values in the dataframe.
I am able to achieve this by running the following query for top 100 values
val top_100_data = df.select(
count(when(col("score") > 0 and col("row_number")<=100, col("values"))).alias("count_100"),
sum(when(col("score") > 0 and col("row_number")<=100, col("values"))).alias("sum_filtered_100"),
sum(when(col("row_number") <=100, col(values))).alias("total_sum_100")
)
However, I need to fetch data for top 100,200,300......2500. meaning I would need to run this query 25 times and finally union 25 dataframes.
I'm new to spark and still figuring lots of things out. What would be the best approach to solve this problem?
Thanks!!
You can create an Array of limits as
val topFilters = Array(100, 200, 300) // you can add more
Then you can loop through the topFilters array and create the dataframe you require. I suggest you to use join rather than union as join will give you separate columns and unions will give you separate rows. You can do the following
Given your dataframe as
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A |1000 |0 |1 |
|B |947 |0 |2 |
|C |923 |1 |3 |
|D |900 |2 |200 |
|E |850 |3 |150 |
|F |800 |1 |250 |
+----+------+-----+----------+
You can do by using the topFilters array defined above as
import sqlContext.implicits._
import org.apache.spark.sql.functions._
var finalDF : DataFrame = Seq("1").toDF("rowNum")
for(k <- topFilters) {
val top_100_data = df.select(lit("1").as("rowNum"), sum(when(col("score") > 0 && col("row_number") < k, col("values"))).alias(s"total_sum_$k"))
finalDF = finalDF.join(top_100_data, Seq("rowNum"))
}
finalDF.show(false)
Which should give you final dataframe as
+------+-------------+-------------+-------------+
|rowNum|total_sum_100|total_sum_200|total_sum_300|
+------+-------------+-------------+-------------+
|1 |923 |1773 |3473 |
+------+-------------+-------------+-------------+
You can do the same for your 25 limits that you have.
If you intend to use union, then the idea is similar to above.
I hope the answer is helpful
Updated
If you require union then you can apply following logic with the same limit array defined above
var finalDF : DataFrame = Seq((0, 0, 0, 0)).toDF("limit", "count", "sum_filtered", "total_sum")
for(k <- topFilters) {
val top_100_data = df.select(lit(k).as("limit"), count(when(col("score") > 0 and col("row_number")<=k, col("values"))).alias("count"),
sum(when(col("score") > 0 and col("row_number")<=k, col("values"))).alias("sum_filtered"),
sum(when(col("row_number") <=k, col("values"))).alias("total_sum"))
finalDF = finalDF.union(top_100_data)
}
finalDF.filter(col("limit") =!= 0).show(false)
which should give you
+-----+-----+------------+---------+
|limit|count|sum_filtered|total_sum|
+-----+-----+------------+---------+
|100 |1 |923 |2870 |
|200 |3 |2673 |4620 |
|300 |4 |3473 |5420 |
+-----+-----+------------+---------+

Resources