Implementation of Apriori with Spark - apache-spark

I intend to implement the Apriori algorithm according to the YAFIM article with PySpark. Its processing workflow consists of two phases:
Phase 1: the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support.
Phase 1 diagram: http://bayanbox.ir/download/2410687383784490939/phase1.png
Phase 2: in this phase, we iteratively use the frequent k-itemsets to generate the frequent (k+1)-itemsets.
Phase 2 diagram: http://bayanbox.ir/view/6380775039559120652/phase2.png
To implement the first phase I have written the following code:
from operator import add

# load the transactions and cache them, since they are scanned more than once
transactions = sc.textFile("/FileStore/tables/wo7gkiza1500361138466/T10I4D100K.dat").cache()
# absolute support threshold: 5% of the number of transactions
minSupport = 0.05 * transactions.count()
# split each line into items and count each item's occurrences
items = transactions.flatMap(lambda line: line.split(" "))
itemCount = items.map(lambda item: (item, 1)).reduceByKey(add)
# keep only the items above the threshold (the frequent 1-itemsets)
l1 = itemCount.filter(lambda (i, c): c > minSupport)
l1.take(5)
output: [(u'', 100000), (u'494', 5102), (u'829', 6810), (u'368', 7828), (u'766', 6265)]
My problem is that I have no idea how to implement the second phase, especially how to generate the candidate itemsets.
For example, suppose we have the following RDD (frequent 3-itemsets):
([1, 4, 5], 7), ([1, 4, 6], 6), ...
We want to generate the candidate 4-itemsets: whenever two frequent 3-itemsets share the same first two items, they are joined into a candidate with four items, as follows:
[1, 4, 5, 6], ...
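One possible way to approach phase 2 (a minimal sketch under my own assumptions, not the YAFIM implementation itself): represent each frequent k-itemset as a sorted tuple, key it by its (k-1)-item prefix, self-join on that prefix to build the (k+1)-candidates, then broadcast the candidates and scan the transactions once to count them. The helper names candidate_gen and count_candidates are hypothetical; sc is the SparkContext from the code above.

from operator import add

def candidate_gen(freq_k):
    # freq_k: RDD of sorted tuples of k frequent items (tuples, so they can serve as join keys)
    prefixed = freq_k.map(lambda its: (its[:-1], its[-1]))       # ((k-1)-prefix, last item)
    return (prefixed.join(prefixed)                              # self-join on the shared prefix
            .filter(lambda kv: kv[1][0] < kv[1][1])              # keep each unordered pair once
            .map(lambda kv: kv[0] + (kv[1][0], kv[1][1]))        # prefix + the two differing tails
            .distinct())

def count_candidates(transactions, candidates, minSupport):
    # broadcast the candidate set and scan the raw transactions a single time
    cand = sc.broadcast(set(candidates.collect()))
    def emit(basket):
        items = set(basket)
        return [(c, 1) for c in cand.value if items.issuperset(c)]
    return (transactions.map(lambda line: line.split(" "))
            .flatMap(emit)
            .reduceByKey(add)
            .filter(lambda ic: ic[1] > minSupport))

A full Apriori implementation would additionally prune candidates whose k-item subsets are not all frequent before counting them; the sketch above skips that step for brevity.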

Related

RDD and PipelinedRDD type

I am a bit new to PySpark (more into Spark-Scala). Recently I came across the following observation: when I create an RDD using the parallelize() method, the return type is RDD, but when I create an RDD using the range() method, it is of type PipelinedRDD. For example:
>>> listRDD =sc.parallelize([1,2,3,4,5,6,7])
>>> print(listRDD.collect())
[1, 2, 3, 4, 5, 6, 7]
>>> print(type(listRDD))
<class 'pyspark.rdd.RDD'>
>>> rangeRDD =sc.range(1,8)
>>> print(rangeRDD.collect())
[1, 2, 3, 4, 5, 6, 7]
>>> print(type(rangeRDD))
<class 'pyspark.rdd.PipelinedRDD'>
I checked how both RDDs are constructed and found:
1) Internally, both use the parallelize method only.
>>> rangeRDD.toDebugString()
b'(8) PythonRDD[25] at collect at <stdin>:1 []\n | ParallelCollectionRDD[24] at parallelize at PythonRDD.scala:195 []'
>>> listRDD.toDebugString()
b'(8) PythonRDD[26] at RDD at PythonRDD.scala:53 []\n | ParallelCollectionRDD[21] at parallelize at PythonRDD.scala:195 []'
2) PipelinedRDD is a subclass of the RDD class; that much I understand.
But is there any general rule for when the result will be of type PipelinedRDD and when it will be of plain RDD type?
Thanks all in advance.
sc.range does indeed call the parallelize method internally - it is defined here. You can see that sc.range calls sc.parallelize with an xrange as input, and sc.parallelize has a separate code branch for xrange input: it calls itself with an empty list as the argument and then applies mapPartitionsWithIndex, which is the final output of the sc.parallelize call and, in turn, of the sc.range call. So a regular object is first created much as it is with a plain sc.parallelize call (though that object is an empty list), and the final output is the result of applying a mapping function on top of it.
It seems that the main reason for this behavior is to avoid materializing the data, which would happen otherwise (if the input doesn't have len implemented, it is iterated over and converted into a list right away).
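A quick way to see the general rule (a small illustration, assuming a live SparkContext sc): any Python-side transformation applied on top of a plain RDD comes back as a PipelinedRDD, which is why sc.range, being parallelize plus mapPartitionsWithIndex, returns one.
>>> plain = sc.parallelize([1, 2, 3])
>>> type(plain)
<class 'pyspark.rdd.RDD'>
>>> type(plain.map(lambda x: x + 1))
<class 'pyspark.rdd.PipelinedRDD'>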

Dynamic Time Analysis using PySpark

Suppose we have a dataset with the following structure:
df = sc.parallelize([['a','2015-11-27', 1], ['a','2015-12-27',0], ['a','2016-01-29',0], ['b','2014-09-01', 1], ['b','2015-05-01', 1] ]).toDF(("user", "date", "category"))
What I want to analyze is the users' attributes with regard to their lifetime in months. For example, I want to sum up the column "category" for each month of a user's lifetime. For user 'a', this would look like:
output = sc.parallelize([['a',0, 1], ['a',1,0], ['a',2,0]]).toDF(("user", "user_lifetime_in_months", "sum(category)"))
What is the most efficient way in Spark to do that? E.g., window functions?
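One plausible approach (a sketch only, not necessarily the most efficient; it assumes the date strings are well-formed yyyy-MM-dd values that Spark can cast to dates): compute each user's first date with a window, derive the lifetime month offset with months_between, and aggregate the category column per user and offset.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user")
result = (df
          .withColumn("first_date", F.min("date").over(w))                  # earliest date per user
          .withColumn("user_lifetime_in_months",
                      F.floor(F.months_between(F.col("date"),
                                               F.col("first_date"))))       # whole months since first date
          .groupBy("user", "user_lifetime_in_months")
          .agg(F.sum("category").alias("sum(category)")))
result.show()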

How to find distinct values of multiple columns in Spark

I have an RDD and I want to find distinct values for multiple columns.
Example:
Row(col1=a, col2=b, col3=1), Row(col1=b, col2=2, col3=10), Row(col1=a1, col2=4, col3=10)
I would like to have a map:
col1=[a,b,a1]
col2=[b,2,4]
col3=[1,10]
Can dataframe help compute it faster/simpler?
Update:
My solution with RDD was:
def to_uniq_vals(row):
    return [(k, v) for k, v in row.items()]

rdd.flatMap(to_uniq_vals).distinct().collect()
Thanks
I hope I understand your question correctly;
You can try the following:
import org.apache.spark.sql.{functions => F}
val df = Seq(("a", 1, 1), ("b", 2, 10), ("a1", 4, 10)).toDF()
df.select(F.collect_set("_1"), F.collect_set("_2"), F.collect_set("_3")).show
Results:
+---------------+---------------+---------------+
|collect_set(_1)|collect_set(_2)|collect_set(_3)|
+---------------+---------------+---------------+
| [a1, b, a]| [1, 2, 4]| [1, 10]|
+---------------+---------------+---------------+
The code above should be more efficient than the proposed column-by-column select distinct, for several reasons:
Fewer worker-to-host round trips.
De-duping is done locally on each worker before the inter-worker de-duping.
Hope it helps!
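For reference, the same collect_set idea in PySpark (a sketch, assuming the question's rows have been converted to a DataFrame df with columns col1, col2, col3):

from pyspark.sql import functions as F

df.select(F.collect_set("col1").alias("col1"),
          F.collect_set("col2").alias("col2"),
          F.collect_set("col3").alias("col3")).show()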
You can use dropDuplicates and then select the same columns. It might not be the most efficient way, but it is still a decent one:
df.dropDuplicates("col1","col2", .... "colN").select("col1","col2", .... "colN").toJSON
** Works well using Scala

How Spark HashingTF works

I am new to Spark 2.
I tried Spark tfidf example
from pyspark.ml.feature import Tokenizer, HashingTF

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)
for each in featurizedData.collect():
    print(each)
It outputs
Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))
I expected that rawFeatures would contain term frequencies like {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2}, because term frequency is:
tf(w) = (number of times the word appears in a document) / (total number of words in the document)
In our case that is tf(w) = 1/5 = 0.2 for each word, because each word appears once in the document.
If we imagine that the output rawFeatures dictionary contains the word index as key and the number of occurrences in the document as value, why is key 1 equal to 3.0? No word appears three times in the document.
This is confusing for me. What am I missing?
TL;DR: it is just a simple hash collision. HashingTF takes hash(word) % numBuckets to determine the bucket, and with a very low number of buckets, as here, collisions are to be expected. In general you should use a much higher number of buckets or, if collisions are unacceptable, CountVectorizer.
In detail: HashingTF by default uses MurmurHash3. The tokens [u'hi', u'i', u'heard', u'about', u'spark'] will be hashed to [-537608040, -1265344671, 266149357, 146891777, 2101843105]. If you follow the source, you'll see that the implementation is equivalent to:
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes

Seq("hi", "i", "heard", "about", "spark")
  .map(UTF8String.fromString(_))
  .map(utf8 =>
    hashUnsafeBytes(utf8.getBaseObject, utf8.getBaseOffset, utf8.numBytes, 42))

Seq[Int] = List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
When you take non-negative modulo of these values you'll get [24, 1, 13, 1, 1]:
List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
  .map(nonNegativeMod(_, 32))

List[Int] = List(24, 1, 13, 1, 1)
Three words from the list (i, about and spark) hash to the same bucket, each occurs once, hence the result you get.
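If exact, collision-free term counts are what you are after, a minimal PySpark sketch of the CountVectorizer route mentioned above (reusing the wordsData DataFrame from the question) could look like this:

from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="words", outputCol="rawFeatures")
cvModel = cv.fit(wordsData)        # builds an explicit vocabulary instead of hashing
cvModel.transform(wordsData).select("rawFeatures").show(truncate=False)
# cvModel.vocabulary maps each vector index back to the original word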
Related:
What hashing function does Spark use for HashingTF and how do I duplicate it?
How to get word details from TF Vector RDD in Spark ML Lib?

Difference between RDD.foreach() and RDD.map()

I am learning Spark in Python and wondering whether anyone can explain the difference between the action foreach() and the transformation map()?
rdd.map() returns a new RDD, like the built-in map function in Python. However, I want to see what the rdd.foreach() function does and understand the differences. Thanks!
A very simple example would be rdd.foreach(print) which would print the value of each row in the RDD but not modify the RDD in any way.
For example, this produces an RDD with the numbers 1 - 10:
>>> rdd = sc.parallelize(xrange(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The map call computed a new value for each row and returned it, so I get a new RDD. However, if I used foreach here, that would be useless, because foreach doesn't modify the RDD in any way:
>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>
Conversely, calling map on a function that returns None like print isn't very useful:
>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]
The print call returns None, so mapping it just gives you a bunch of None values; you didn't want those values and you didn't want to save them, so returning them is a waste. (Note that the lines 0, 1, 2, etc. are the print calls being executed; they don't show up until you call take, since the RDD is evaluated lazily. The contents of the RDD, however, are just a bunch of None values.)
More simply, call map if you care about the return value of the function. Call foreach if you don't.
Map is a transformation: when you perform a map, you apply a function to each element in the RDD and get back a new RDD on which additional transformations or actions can be called.
Foreach is an action: it takes each element and applies a function, but it does not return a value. It is particularly useful when you have to perform some calculation on an RDD and log the result somewhere else, for example in a database, or call a REST API with each element of the RDD.
For example let's say that you have an RDD with many queries that you wish to log in another system. The queries are stored in an RDD.
queries = <code to load queries or a transformation that was applied on other RDDs>
Then you want to save those queries in another system via a call to another API
import urllib2

def log_search(q):
    response = urllib2.urlopen('http://www.bigdatainc.org/save_query/' + q)

queries.foreach(log_search)
Now you have executed log_search on each element of the RDD. If you had done a map instead, nothing would have happened yet until you called an action.
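To illustrate that last point (a tiny sketch reusing the hypothetical queries RDD and log_search function from above): a map only builds a new RDD lazily, and nothing runs until an action forces evaluation.

lazy = queries.map(log_search)   # no requests are sent yet; this only defines a new RDD
lazy.count()                     # the action triggers evaluation, and the requests fire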
