I have been given a task to create a word count program in Python Spark. I am supposed to count the number of words starting with each letter of the alphabet.
Here is the code I have written, but I can't seem to get the result. Could anyone help me with troubleshooting?
in.txt content:
people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.
import re
import sys

from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

# Split each line into words, map each word to (first letter, 1), then sum per letter
inRDD = sc.textFile("in.txt")
words = inRDD.flatMap(lambda l: re.split(" ", l))
LetterCount = words.map(lambda s: (s[0], 1))
result = LetterCount.reduceByKey(lambda n1, n2: n1 + n2)
Your code is OK. You just need to add a collect() at the end to actually trigger the computation and bring the result back to the driver:
result.collect()
[('s', 1),
('l', 2),
('a', 10),
('n', 1),
('t', 8),
('c', 1),
('p', 1),
('b', 2),
('w', 1),
('o', 2)]
And you can replace
import re
words = inRDD.flatMap(lambda l: re.split(" ",l))
with
words = inRDD.flatMap(str.split)
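One caveat, as a hedged note: with re.split(" ", l), consecutive spaces or empty lines produce empty strings, and s[0] then raises an IndexError. str.split with no arguments already drops those empty tokens, but if you keep the regex version an explicit filter also works:

# Keep only non-empty tokens before taking the first character
words = inRDD.flatMap(lambda l: re.split(" ", l)).filter(lambda s: len(s) > 0)
LetterCount = words.map(lambda s: (s[0], 1))
result = LetterCount.reduceByKey(lambda n1, n2: n1 + n2)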
Word count program for Apache PySpark using Spark SQL functions (the easiest way):
import pyspark.sql.functions as f

wordsDF = spark.read.text("path/log.txt")
df = wordsDF.withColumn('wordCount', f.size(f.split(f.col('value'), ' ')))
df.createOrReplaceTempView("wc")
spark.sql("SELECT SUM(wordCount) AS Total FROM wc").show()
+-----+
|Total|
+-----+
| 147|
+-----+
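Note that this counts the total number of words. If you want the per-starting-letter counts the question asks for, a minimal sketch with the same DataFrame API (reusing the path and spark session assumed above) could look like this:

import pyspark.sql.functions as f

wordsDF = spark.read.text("path/log.txt")
letterCounts = (wordsDF
                .select(f.explode(f.split(f.col('value'), ' ')).alias('word'))
                .filter(f.col('word') != '')                       # drop empty tokens
                .select(f.lower(f.substring('word', 1, 1)).alias('letter'))
                .groupBy('letter')
                .count())
letterCounts.show()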
In the program below, how can you find the second most repeated character in a string? For example, for "abcdaabdefaggcbd" the expected output is d (because 'd' occurred 3 times whereas 'a' occurred 4 times).
How can I get that output? Please help me.
Given below is my code:
s="abcdaabdefaggcbd"
d={}
for i in s:
d[i] = d.get(i,0)+1
print(d,"ddddd")
max2 = 0
for k,v in d.items():
if(v>max2 and v<max(d.values())):
max2=v
if max2 in d.values():
print k,"kkk"
The magnificent Python Counter and its most_common() method are very handy here.
import collections
my_string = "abcdaabdefaggcbd"
result = collections.Counter(my_string).most_common()
print(result[1])
Output
('b', 3)
In case you need to capture all of the second-place values (if more than one entry shares that count), you can use the following:
import collections

my_string = "abcdaabdefaggcbd"
result = collections.Counter(my_string).most_common()
second_value = result[1][1]
seconds = []
for item in result:
    if item[1] == second_value:
        seconds.append(item)
print(seconds)
Output
[('b', 3), ('d', 3)]
I also wanted to add an example of solving the problem using a methodology more similar to the one that you showed in your question:
my_string="abcdaabdefaggcbd"
result={}
for character in my_string:
if character in result:
result[character] = result.get(character) + 1
else:
result[character] = 1
sorted_data = sorted([(value,key) for (key,value) in result.items()])
second_value = sorted_data[-2][0]
result = []
for item in sorted_data:
if item[0] == second_value:
result.append(item)
print(result)
Output
[(3, 'b'), (3, 'd')]
P.S. Please forgive me for taking the liberty of changing the variable names, but I think this way my answer will be more readable for a broader audience.
Sort the dict's items on their values (descending) and get the second item:
>>> from collections import Counter
>>> c = Counter("abcdaabdefaggcbd")
>>> vals = sorted(c.items(), key=lambda item:item[1], reverse=True)
>>> vals
[('a', 4), ('b', 3), ('d', 3), ('c', 2), ('g', 2), ('e', 1), ('f', 1)]
>>> print(vals[1])
('b', 3)
>>>
EDIT:
or just use Counter.most_common():
>>> from collections import Counter
>>> c = Counter("abcdaabdefaggcbd")
>>> print(c.most_common()[1])
('b', 3)
Both b and d are the second most repetitive. I would think that both should be displayed. This is how I would do it:
Code:
s="abcdaabdefaggcbd"
d={}
for i in s:
ctr=s.count(i)
d[i]=ctr
fir = max(d.values())
sec = 0
for j in d.values():
if(j>sec and j<fir):
sec = j
for k,v in d.items():
if v == sec:
print(k,v)
Output:
b 3
d 3
In order to find the second most repetitive character in a string, you can very well use collections.Counter().
Here's an example:
import collections
s='abcdaabdefaggcbd'
count=collections.Counter(s)
print(count.most_common(2)[1])
Output: ('b', 3)
You can do a lot with Counter(). Here's a link for a further read:
More about Counter()
I hope this answers your question. Cheers!
Spark 2.4.0 introduces a handy new function, exceptAll, which allows subtracting two DataFrames while keeping duplicates.
Example
val df1 = Seq(
  ("a", 1L),
  ("a", 1L),
  ("a", 1L),
  ("b", 2L)
).toDF("id", "value")

val df2 = Seq(
  ("a", 1L),
  ("b", 2L)
).toDF("id", "value")

df1.exceptAll(df2).collect()
// will return
Seq(("a", 1L), ("a", 1L))
However, I can only use Spark 2.3.0.
What is the best way to implement this using only functions from Spark 2.3.0?
One option is to use row_number to generate a sequential number column per key and use it in a left join to find the rows that are missing from df2. A PySpark solution is shown here.
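For completeness, the question's test data can be recreated in PySpark before running the snippet below (a small sketch; column names as in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same data as the Scala example in the question
df1 = spark.createDataFrame([("a", 1), ("a", 1), ("a", 1), ("b", 2)], ["id", "value"])
df2 = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "value"])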
from pyspark.sql.functions import row_number
from pyspark.sql import Window

w1 = Window.partitionBy(df1.id).orderBy(df1.value)
w2 = Window.partitionBy(df2.id).orderBy(df2.value)

# Number the duplicate rows within each id
df1 = df1.withColumn("rnum", row_number().over(w1))
df2 = df2.withColumn("rnum", row_number().over(w2))

res_like_exceptAll = (df1.join(df2, (df1.id == df2.id) & (df1.value == df2.value) & (df1.rnum == df2.rnum), 'left')
                         .filter(df2.id.isNull())  # identifies the rows missing from df2
                         .select(df1.id, df1.value))

res_like_exceptAll.show()
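This should return the two extra ("a", 1) rows, matching the exceptAll result from the question. One hedged caveat: if a single id has several distinct values that are each duplicated, it may be safer to partition the windows by both id and value so the row numbers line up per (id, value) pair.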
In an ordered dataset, I want to aggregate data until a condition is met, but grouped by a certain key.
To set some context for my question, I simplify my problem to the statement below:
In Spark I need to aggregate strings, grouped by key, until a user stops "shouting" (the 2nd character of a string is not uppercase).
Dataset example:
ID, text, timestamps
1, "OMG I like bananas", 123
1, "Bananas are the best", 234
1, "MAN I love banana", 1235
2, "ORLY? I'm more into grapes", 123565
2, "BUT I like apples too", 999
2, "unless you count veggies", 9999
2, "THEN don't forget tomatoes", 999999
The expected result would be:
1, "OMG I like bananas Bananas are the best"
2, "ORLY? I'm more into grapes BUT I like apples too unless you count veggies"
Via groupBy and agg I can't seem to set a condition like "stop when an uppercase char is found".
This only works in Spark 2.1 or above
What you want to do is possible, but it may be very expensive.
First, let's create some test data. As general advice, when you ask something on Stack Overflow please provide something similar to this so people have somewhere to start.
import spark.sqlContext.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = List(
  (1, "OMG I like bananas", 1),
  (1, "Bananas are the best", 2),
  (1, "MAN I love banana", 3),
  (2, "ORLY? I'm more into grapes", 1),
  (2, "BUT I like apples too", 2),
  (2, "unless you count veggies", 3),
  (2, "THEN don't forget tomatoes", 4)
).toDF("ID", "text", "timestamps")
In order to get a column with the collected texts in order, we need to add a new column using a window function.
Using the spark shell:
scala> val df2 = df.withColumn("coll", collect_list("text").over(Window.partitionBy("id").orderBy("timestamps")))
df2: org.apache.spark.sql.DataFrame = [ID: int, text: string ... 2 more fields]
scala> val x = df2.groupBy("ID").agg(max($"coll").as("texts"))
x: org.apache.spark.sql.DataFrame = [ID: int, texts: array<string>]
scala> x.collect.foreach(println)
[1,WrappedArray(OMG I like bananas, Bananas are the best, MAN I love banana)]
[2,WrappedArray(ORLY? I'm more into grapes, BUT I like apples too, unless you count veggies, THEN don't forget tomatoes)]
To get the actual text we may need a UDF. Here's mine (I'm far from an expert in Scala, so bear with me)
import scala.collection.mutable

val aggText: Seq[String] => String = (list: Seq[String]) => {
  // Walk the ordered texts, keeping each one until (and including) the first non-"shouting" string
  def tex(arr: Seq[String], accum: Seq[String]): Seq[String] = arr match {
    case Seq() => accum
    case Seq(single) => accum :+ single
    case Seq(str, xs @ _*) =>
      if (str.length >= 2 && !(str.charAt(0).isUpper && str.charAt(1).isUpper))
        tex(Nil, accum :+ str)
      else
        tex(xs, accum :+ str)
  }
  val res = tex(list, Seq())
  res.mkString(" ")
}

val textUDF = udf(aggText(_: mutable.WrappedArray[String]))
So, we have a dataframe with the collected texts in the proper order, and a Scala function (wrapped as a UDF). Let's piece it together:
scala> val x = df2.groupBy("ID").agg(max($"coll").as("texts"))
x: org.apache.spark.sql.DataFrame = [ID: int, texts: array<string>]
scala> val y = x.select($"ID", textUDF($"texts"))
y: org.apache.spark.sql.DataFrame = [ID: int, UDF(texts): string]
scala> y.collect.foreach(println)
[1,OMG I like bananas Bananas are the best]
[2,ORLY? I'm more into grapes BUT I like apples too unless you count veggies]
scala>
I think this is the result you want.
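If you are working from PySpark rather than Scala, a rough sketch of the same idea (collect the texts in order, then trim the list in a Python UDF) might look like this; it assumes, as in the Scala shell session above, that max over an array column is supported in your Spark version:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "OMG I like bananas", 1), (1, "Bananas are the best", 2), (1, "MAN I love banana", 3),
     (2, "ORLY? I'm more into grapes", 1), (2, "BUT I like apples too", 2),
     (2, "unless you count veggies", 3), (2, "THEN don't forget tomatoes", 4)],
    ["ID", "text", "timestamps"])

def agg_text(texts):
    # Keep strings until (and including) the first one that is not "shouting"
    out = []
    for t in texts:
        out.append(t)
        if not (len(t) >= 2 and t[0].isupper() and t[1].isupper()):
            break
    return " ".join(out)

agg_text_udf = F.udf(agg_text, StringType())

w = Window.partitionBy("ID").orderBy("timestamps")
result = (df.withColumn("coll", F.collect_list("text").over(w))
            .groupBy("ID")
            .agg(F.max("coll").alias("texts"))
            .select("ID", agg_text_udf("texts").alias("text")))
result.show(truncate=False)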
I have a large collection of data in PySpark. The format is key-value pairs, and I need to do a reduceByKey operation, ignoring all data whose key isn't in an RDD of 'interesting' keys that I also have.
I found a solution on SO that uses the subtractByKey operation to achieve this. It works, but it crashes due to low memory on my cluster. I have not been able to change this by tweaking the settings, so I'm hoping there's a more efficient solution.
Here's my solution that works on smaller datasets:
# The keys I'm interested in
edges = sc.parallelize([("a", "b"), ("b", "c"), ("a", "d")])

# Data containing both interesting and uninteresting stuff
data1 = sc.parallelize([(("a", "b"), [42]), (("a", "c"), [60]), (("a", "d"), [13, 37])])
data2 = sc.parallelize([(("a", "b"), [43]), (("b", "c"), [23, 24]), (("a", "c"), [13, 37])])
all_data = [data1, data2]

mask = edges.map(lambda t: (tuple(t), None))

rdds = []
for datum in all_data:
    combined = datum.reduceByKey(lambda a, b: a + b)
    unwanted = combined.subtractByKey(mask)
    wanted = combined.subtractByKey(unwanted)
    rdds.append(wanted)

edge_alltimes = sc.union(rdds).reduceByKey(lambda a, b: a + b)
edge_alltimes.collect()
As desired, this outputs [(('a', 'd'), [13, 37]), (('a', 'b'), [42, 43]), (('b', 'c'), [23, 24])]
(i.e. data for the 'interesting' key tuples have been combined and the rest has been dropped).
The reason I have the data in several RDDs is to mimic behavior on my cluster where I can't load all the data simultaneously due to its size.
Any help would be great.
Here is an example with join. A small drawback is that you need an RDD of pairs before the join and you need to strip the extra data after the join.
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val goodKeys = sc.parallelize(Seq(1, 2))
    val allData = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val goodPairs = goodKeys.map(v => (v, 0))
    val goodData = allData.join(goodPairs).mapValues(p => p._1)
    goodData.collect().foreach(println)
  }
}
Output:
(1,a)
(2,b)
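Since the question is in PySpark, a rough sketch of the same join idea applied to the question's edge data (reusing the variable names from the question) could look like this:

# The keys of interest, turned into a pair RDD so they can be joined on
edges = sc.parallelize([("a", "b"), ("b", "c"), ("a", "d")])
mask = edges.map(lambda t: (tuple(t), None))

data1 = sc.parallelize([(("a", "b"), [42]), (("a", "c"), [60]), (("a", "d"), [13, 37])])
data2 = sc.parallelize([(("a", "b"), [43]), (("b", "c"), [23, 24]), (("a", "c"), [13, 37])])

# join keeps only keys present in both RDDs; strip the dummy None afterwards
edge_alltimes = sc.union([data1, data2]) \
                  .reduceByKey(lambda a, b: a + b) \
                  .join(mask) \
                  .mapValues(lambda v: v[0])
print(edge_alltimes.collect())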
I have an RDD that is populated as
id txt
1 A B C
2 A B C
1 A B C
The result of my word count (PySpark) should be keyed by the combination of the id and each word in txt. Example:
[(u'1_A',2), (u'1_B',2), (u'1_C',2),(u'2_A',1),(u'2_B',1),(u'2_C',1)]
I tried using a user-defined function to combine the id with the words split from txt. However, it complains that the append function is unavailable in this context.
Appreciate any code samples that will set me in the right direction.
Here is an alternative solution using a PySpark DataFrame. Mainly, the code uses explode and split to split the txt column, then groupby and count to count the number of (id, word) pairs.
import pyspark.sql.functions as func

rdd = spark.sparkContext.parallelize([(1, 'A B C'), (2, 'A B C'), (1, 'A B C')])
df = rdd.toDF(['id', 'txt'])

df_agg = df.select('id', func.explode(func.split('txt', ' '))).\
    groupby(['id', 'col']).\
    count().\
    sort(['id', 'col'], ascending=True)

df_agg.rdd.map(lambda x: (str(x['id']) + '_' + x['col'], x['count'])).collect()
Output
[('1_A', 2), ('1_B', 2), ('1_C', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]
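If you prefer to stay in the DataFrame API for the final key formatting as well, a small sketch that concatenates the id and word with concat_ws (reusing df_agg from above; note it returns Row objects rather than plain tuples):

# Build the 'id_word' key inside the DataFrame instead of dropping to the RDD
keyed = df_agg.select(func.concat_ws('_', func.col('id').cast('string'), func.col('col')).alias('key'), 'count')
keyed.collect()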
The snippet below should work:
rdd = sc.parallelize([(1, 'A B C'), (2, 'A B C'), (1, 'A B C')])
result = rdd \
    .map(lambda x: (x[0], x[1].split(' '))) \
    .flatMap(lambda x: ['%s_%s' % (x[0], y) for y in x[1]]) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y)
result.collect()
Output
[('1_C', 2), ('1_B', 2), ('1_A', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]