Aggregate sparse vector in PySpark - apache-spark

I have a Hive table that contains text data and some metadata associated with each document. It looks like this:
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import CountVectorizer
df = sc.parallelize([
    ("1", "doc_1", "fruit is good for you"),
    ("2", "doc_2", "you should eat fruit and veggies"),
    ("2", "doc_3", "kids eat fruit but not veggies")
]).toDF(["month", "doc_id", "text"])
+-----+------+--------------------+
|month|doc_id| text|
+-----+------+--------------------+
| 1| doc_1|fruit is good for...|
| 2| doc_2|you should eat fr...|
| 2| doc_3|kids eat fruit bu...|
+-----+------+--------------------+
I want to count words by month.
So far I've taken a CountVectorizer approach:
tokenizer = Tokenizer().setInputCol("text").setOutputCol("words")
tokenized = tokenizer.transform(df)
cvModel = CountVectorizer().setInputCol("words").setOutputCol("features").fit(tokenized)
counted = cvModel.transform(tokenized)
+-----+------+--------------------+--------------------+--------------------+
|month|doc_id| text| words| features|
+-----+------+--------------------+--------------------+--------------------+
| 1| doc_1|fruit is good for...|[fruit, is, good,...|(12,[0,3,4,7,8],[...|
| 2| doc_2|you should eat fr...|[you, should, eat...|(12,[0,1,2,3,9,11...|
| 2| doc_3|kids eat fruit bu...|[kids, eat, fruit...|(12,[0,1,2,5,6,10...|
+-----+------+--------------------+--------------------+--------------------+
Now I want to group by month and return something that looks like:
month word count
1 fruit 1
1 is 1
...
2 fruit 2
2 kids 1
2 eat 2
...
How could I do that?

There is no built-in mechanism for Vector* aggregation, but you don't need one here. Once you have the tokenized data you can just explode and aggregate:
from pyspark.sql.functions import explode
(counted
    .select("month", explode("words").alias("word"))
    .groupBy("month", "word")
    .count())
If you prefer to limit the results to the vocabulary, just add a filter:
from pyspark.sql.functions import col
(counted
    .select("month", explode("words").alias("word"))
    .where(col("word").isin(cvModel.vocabulary))
    .groupBy("month", "word")
    .count())
* Since Spark 2.4 we have access to Summarizer but it won't be useful here.
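For illustration, a sketch of what Summarizer gives you (assuming Spark 2.4+): it aggregates the feature vectors themselves, e.g. an element-wise mean per month, but the result is indexed by vocabulary position rather than by word, which is why it doesn't help with the word counts above.
from pyspark.ml.stat import Summarizer
# element-wise mean of the count vectors per month (positions, not words)
counted.groupBy("month").agg(Summarizer.mean(counted["features"]).alias("mean_features")).show(truncate=False)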

Related

Join Pyspark Dataframes on substring match

I have two dataframes df1 and df2 somewhat like this:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("someAppname").getOrCreate()
df1 = spark.createDataFrame(pd.DataFrame({"entity_nm": ["Joe B", "Donald", "Barack Obama"]}))
df2 = spark.createDataFrame(pd.DataFrame({"aliases": ["Joe Biden; Biden Joe", "Donald Trump; Donald J. Trump", "Barack Obama", "Joe Burrow"], "id": [1, 2, 3, 4]}))
I want to join df2 to df1 based on a string-contains match. It works when I do it like this:
df_joined = df1.join(df2, df2.aliases.contains(df1.entity_nm), how="left")
That join gives me my desired result:
+------------+--------------------+---+
| entity_nm| aliases| id|
+------------+--------------------+---+
| Joe B|Joe Biden; Biden Joe| 1|
| Joe B|Joe Burrow | 4|
| Donald|Donald Trump; Don...| 2|
|Barack Obama| Barack Obama| 3|
+------------+--------------------+---+
The problem: I tried to do this with a list of 60k entity names in df1 and around 6 million aliases in df2, and this approach takes forever until at some point my Spark session crashes due to memory errors. I'm pretty sure my approach is very naive and far from optimized.
I've read this blog post, which suggests using a UDF, but I don't have any Scala knowledge and I struggle to understand and recreate it in PySpark.
Any suggestions or help on how to optimize my approach? I need to do tasks like this a lot, so any help would be greatly appreciated.
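One mitigation worth sketching (my suggestion, untested at this scale): explicitly broadcast the small entity table and use an inner join, so the non-equi condition runs as a broadcast nested-loop join over the 60k names on each executor instead of one of Spark's slower fallback plans for outer non-equi joins. Note the semantic change: entities with no matching alias are dropped.
from pyspark.sql.functions import broadcast
# broadcast the 60k-row entity table; the 6M-row aliases table is only streamed
df_joined = df2.join(broadcast(df1), df2.aliases.contains(df1.entity_nm), how="inner")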

how can we force dataframe repartitioning to be balanced in spark?

I created a synthetic dataset and I am trying to experiment with repartitioning based on one column. The objective is to end up with a balanced (equal-size) set of partitions, but I cannot achieve this. Is there a way it could be done, preferably without resorting to RDDs or saving the dataframe?
Example code:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as f
import pandas as pd
import random
spark = SparkSession.builder.appName('learn').getOrCreate()
nr = 500
data = {'id': [random.randint(0, 5) for _ in range(nr)], 'id2': [random.randint(0, 5) for _ in range(nr)]}
data = pd.DataFrame(data)
df = spark.createDataFrame(data)
# df.show()
df = df.repartition(3, 'id')
# see the different partitions
for ipart in range(3):
    print(f'partition {ipart}')
    def fpart(partition_idx, iterator, target_partition_idx=ipart):
        if partition_idx == target_partition_idx:
            return iterator
        else:
            return iter(())
    res = df.rdd.mapPartitionsWithIndex(fpart)
    res = res.toDF(schema=df.schema)
    # res.show(n=5, truncate=False)
    print(f"number of rows {res.count()}, unique ids {res.select('id').drop_duplicates().toPandas()['id'].tolist()}")
It produces:
partition 0
number of rows 79, unique ids [3]
partition 1
number of rows 82, unique ids [0]
partition 2
number of rows 339, unique ids [5, 1, 2, 4]
so the partitions are clearly not balanced.
I saw in How to guarantee repartitioning in Spark Dataframe that this is explainable because assigning to partitions is based on the hash of column id modulo 3 (the number of partitions):
df.select('id', f.expr("hash(id)"), f.expr("pmod(hash(id), 3)")).drop_duplicates().show()
that produces
+---+-----------+-----------------+
| id| hash(id)|pmod(hash(id), 3)|
+---+-----------+-----------------+
| 3| 519220707| 0|
| 0|-1670924195| 1|
| 1|-1712319331| 2|
| 5| 1607884268| 2|
| 4| 1344313940| 2|
| 2| -797927272| 2|
+---+-----------+-----------------+
but I find this strange. The point of specifying the column in the repartition function is to somehow split the values of id across different partitions. If the column id had more unique values than the 6 in this example it would work better, but still.
Is there a way to achieve this?
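One option to sketch (my suggestion, not a guarantee of equal sizes): repartitionByRange assigns contiguous ranges of id to partitions instead of hashing, which for a handful of distinct, roughly uniform keys tends to spread them more evenly.
df_range = df.repartitionByRange(3, 'id')
# inspect the resulting partition sizes
df_range.withColumn('partition_id', f.spark_partition_id()).groupBy('partition_id').count().show()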

How to capture frequency of words after group by with pyspark

I have tabular data with keys and values, and the keys are not unique.
For example:
+-----+------+
| key | value|
+-----+------+
| 1 | the |
| 2 | i |
| 1 | me |
| 1 | me |
| 2 | book |
| 1 |table |
+-----+------+
Now assume this table is distributed across the different nodes in a Spark cluster.
How do I use pyspark to calculate the frequencies of the words with respect to the different keys? For instance, in the above example I wish to output:
+-----+------+-------------+
| key | value| frequencies |
+-----+------+-------------+
| 1 | the | 1/4 |
| 2 | i | 1/2 |
| 1 | me | 2/4 |
| 2 | book | 1/2 |
| 1 |table | 1/4 |
+-----+------+-------------+
Not sure if you can combine multi-level operations with DFs, but doing it in 2 steps and leaving concat to you, this works:
# Running in Databricks, not all stuff required
# You may want to convert to upper or lower case for better results.
from pyspark.sql import Row
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
data = [("1", "the"), ("2", "I"), ("1", "me"),
        ("1", "me"), ("2", "book"), ("1", "table")]
rdd = sc.parallelize(data)
someschema = rdd.map(lambda x: Row(c1=x[0], c2=x[1]))
df = spark.createDataFrame(someschema)
df1 = df.groupBy("c1", "c2") \
        .count()
df2 = df1.groupBy('c1') \
         .sum('count')
df3 = df1.join(df2, 'c1')
df3.show()
returns:
+---+-----+-----+----------+
| c1| c2|count|sum(count)|
+---+-----+-----+----------+
| 1|table| 1| 4|
| 1| the| 1| 4|
| 1| me| 2| 4|
| 2| I| 1| 2|
| 2| book| 1| 2|
+---+-----+-----+----------+
You can reformat the last 2 cols, but I am curious whether we can do it all in 1 go. In normal SQL we would use inline views and combine them, I suspect.
This works across the cluster as standard, which is what Spark is generally all about; the groupBy takes it all into account.
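For completeness, a sketch of the reformatting left to the reader (my addition, using the column names produced above), building the "count/total" string from the two aggregated columns:
df4 = df3.withColumn("frequencies", F.concat_ws("/", df3["count"], df3["sum(count)"])) \
         .select("c1", "c2", "frequencies")
df4.show()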
Minor edit:
As it is rather hot outside, I looked into this in a little more depth. This is a good overview: http://stevendavistechnotes.blogspot.com/2018/06/apache-spark-bi-level-aggregation.html. After reading this and experimenting, I could not get it any more elegant; reducing to 5 rows of output all in 1 go appears not to be possible.
Another viable option is window functions.
First, compute the number of occurrences per (key, value) pair and per key. Then just add another column with the fraction (the fractions come out reduced):
from pyspark.sql import Row
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import *
from fractions import Fraction
from pyspark.sql.functions import udf
@udf(StringType())
def getFraction(value_occurrence, key_occurrence):
    # Fraction(numerator, denominator) reduces automatically, e.g. 2/4 -> 1/2
    return str(Fraction(value_occurrence, key_occurrence))
schema = StructType([StructField("key", IntegerType(), True),
                     StructField("value", StringType(), True)])
data = [(1, "the"), (2, "I"), (1, "me"),
        (1, "me"), (2, "book"), (1, "table")]
spark = SparkSession.builder.appName('myPython').getOrCreate()
input_df = spark.createDataFrame(data, schema)
(input_df.withColumn("key_occurrence",
                     F.count(F.lit(1)).over(Window.partitionBy(F.col("key"))))
    .withColumn("value_occurrence", F.count(F.lit(1)).over(Window.partitionBy(F.col("value"), F.col("key"))))
    .withColumn("frequency", getFraction(F.col("value_occurrence"), F.col("key_occurrence")))
    .dropDuplicates().show())

split my dataframe depending on the number of nodes pyspark

I'm trying to split my dataframe depending on the number of nodes (of my cluster).
My dataframe looks like this:
If I had node = 2 and dataframe.count = 7:
So, to apply an iterative approach, the result of the split would be:
My question is: how can I do this?
You can do that (have a look at the code below) with one of the RDD partition functions, but I don't recommend it unless you are fully aware of what you are doing and why you are doing it. In general (or better, for most use cases) it is better to let Spark handle the data distribution.
import pyspark.sql.functions as F
import itertools
import math
# creating a random dataframe
l = [(x, x + 2) for x in range(1009)]
columns = ['one', 'two']
df = spark.createDataFrame(l, columns)
# coalesce to one partition so that we can assign a partition key
df = df.coalesce(1)
#number of nodes (==partitions)
pCount = 5
#creating a list of partition keys
#basically it repeats range(5) several times until we have enough keys for each row
partitionKey = list(itertools.chain.from_iterable(itertools.repeat(x, math.ceil(df.count()/pCount)) for x in range(pCount)))
#now we can distribute the data to the partitions
df = df.rdd.partitionBy(pCount, partitionFunc = lambda x: partitionKey.pop()).toDF()
#This shows us the number of records within each partition
df.withColumn("partition_id", F.spark_partition_id()).groupBy("partition_id").count().show()
Output:
+------------+-----+
|partition_id|count|
+------------+-----+
| 1| 202|
| 3| 202|
| 4| 202|
| 2| 202|
| 0| 201|
+------------+-----+
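A simpler alternative worth sketching (my addition, in the spirit of letting Spark handle the distribution): a plain repartition(n) without a partitioning column uses round-robin distribution, which already yields near-equal partition sizes.
df_rr = spark.createDataFrame(l, columns).repartition(pCount)
# partition sizes should differ by at most one row
df_rr.withColumn("partition_id", F.spark_partition_id()).groupBy("partition_id").count().show()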

spark program to find the city with maximum population [duplicate]

This question already has answers here:
Find maximum row per group in Spark DataFrame
(2 answers)
Closed 1 year ago.
The input files contain rows like the ones below (state,city,population):
west bengal,kolkata,150000
karnataka,bangalore,200000
karnataka,mangalore,80000
west bengal,bongaon,50000
delhi,new delhi,100000
delhi,gurgaon,200000
I have to write a Spark (Apache Spark) program in both Python and Scala to find, for each state, the city with the maximum population. The output will look like this:
west bengal,kolkata,150000
karnataka,bangalore,200000
delhi,new delhi,100000
So I need a three column output for each state. It's easy for me to get the output like this:
west bengal,150000
karnataka,200000
delhi,100000
But getting the city with the maximum population is proving difficult.
In vanilla pyspark, map your data to a pair RDD where the state is the key, and the value is the tuple (city, population). Then reduceByKey to keep the largest city. Beware, in the case of cities with the same population it will keep the first one it encounters.
rdd.map(lambda reg: (reg[0], [reg[1], reg[2]])) \
   .reduceByKey(lambda v1, v2: v1 if v1[1] >= v2[1] else v2)
The results with your data look like this:
[('delhi', ['gurgaon', 200000]),
('west bengal', ['kolkata', 150000]),
('karnataka', ['bangalore', 200000])]
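For completeness, a sketch of building that pair RDD from the raw input lines (assuming an existing SparkContext sc and a hypothetical file name; note the int cast so populations compare numerically rather than as strings):
rdd = (sc.textFile("cities.txt")
         .map(lambda line: line.split(","))
         .map(lambda reg: (reg[0], [reg[1], int(reg[2])])))
largest = rdd.reduceByKey(lambda v1, v2: v1 if v1[1] >= v2[1] else v2)
print(largest.collect())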
This should do the trick:
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize([
['west bengal','kolkata',150000],
['karnataka','bangalore',200000],
['karnataka','mangalore',80000],
['west bengal','bongaon',50000],
['delhi','new delhi',100000],
['delhi','gurgaon',200000],
])
>>> df = rdd.toDF(['state','city','population'])
>>> df.show()
+-----------+---------+----------+
| state| city|population|
+-----------+---------+----------+
|west bengal| kolkata| 150000|
| karnataka|bangalore| 200000|
| karnataka|mangalore| 80000|
|west bengal| bongaon| 50000|
| delhi|new delhi| 100000|
| delhi| gurgaon| 200000|
+-----------+---------+----------+
>>> df.groupBy('city').max('population').show()
+---------+---------------+
| city|max(population)|
+---------+---------------+
|bangalore| 200000|
| kolkata| 150000|
| gurgaon| 200000|
|mangalore| 80000|
|new delhi| 100000|
| bongaon| 50000|
+---------+---------------+
>>> df.groupBy('state').max('population').show()
+-----------+---------------+
| state|max(population)|
+-----------+---------------+
| delhi| 200000|
|west bengal| 150000|
| karnataka| 200000|
+-----------+---------------+
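The max aggregations above drop the city column, though. A sketch of keeping the whole row per state (my addition, along the lines of the linked duplicate) uses a window ranked by population, with ties broken arbitrarily:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.window import Window
>>> w = Window.partitionBy('state').orderBy(F.col('population').desc())
>>> df.withColumn('rn', F.row_number().over(w)).where(F.col('rn') == 1).drop('rn').show()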
