RDD and PipelinedRDD type - apache-spark

I am a bit new to PySpark (I'm more into Spark with Scala). Recently I came across the following observation: when I create an RDD using the parallelize() method, the return type is RDD, but when I create an RDD using the range() method, it is of type PipelinedRDD. For example:
>>> listRDD =sc.parallelize([1,2,3,4,5,6,7])
>>> print(listRDD.collect())
[1, 2, 3, 4, 5, 6, 7]
>>> print(type(listRDD))
<class 'pyspark.rdd.RDD'>
>>> rangeRDD =sc.range(1,8)
>>> print(rangeRDD.collect())
[1, 2, 3, 4, 5, 6, 7]
>>> print(type(rangeRDD))
<class 'pyspark.rdd.PipelinedRDD'>
I checked how both RDDs are constructed and found:
1) Internally, both use the parallelize method only.
>>> rangeRDD.toDebugString()
b'(8) PythonRDD[25] at collect at <stdin>:1 []\n | ParallelCollectionRDD[24] at parallelize at PythonRDD.scala:195 []'
>>> listRDD.toDebugString()
b'(8) PythonRDD[26] at RDD at PythonRDD.scala:53 []\n | ParallelCollectionRDD[21] at parallelize at PythonRDD.scala:195 []'
2) I understand that PipelinedRDD is a subclass of the RDD class.
But is there any general rule for when the result will be of type PipelinedRDD and when it will be of plain RDD type?
Thanks all in advance.

sc.range does indeed call the parallelize method internally - it is defined here. You can see that sc.range calls sc.parallelize with an xrange as input, and sc.parallelize has a separate code branch for xrange input: it calls itself with an empty list as the argument and then applies mapPartitionsWithIndex here, which is the final output of the sc.parallelize call and, in turn, of the sc.range call. So the base object is first created just as with a regular sc.parallelize call (though it is an empty list), and the final output is the result of applying a mapping function on top of it, which is what gives you a PipelinedRDD.
The main reason for this behavior seems to be to avoid materializing the data, which would happen otherwise (if the input doesn't implement len, it is iterated over and converted into a list right away).
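For illustration, here is a rough, simplified sketch of what that branch does (paraphrased, not the exact PySpark source; names and details vary by version):
# Rough sketch (simplified): how sc.range avoids materializing the numbers.
# sc.range(start, end, step) effectively does sc.parallelize(range(start, end, step)),
# and parallelize, when given a range, builds an empty ParallelCollectionRDD and then
# regenerates each partition's slice of the range lazily via mapPartitionsWithIndex.
def parallelize_range_sketch(sc, c, numSlices):
    size = len(c)
    start0 = c[0]
    step = c[1] - c[0] if size > 1 else 1

    def gen_partition(split, _iterator):
        # Recompute this partition's slice of the range instead of shipping a list
        lo = start0 + int(split * size / numSlices) * step
        hi = start0 + int((split + 1) * size / numSlices) * step
        return iter(range(lo, hi, step))

    # mapPartitionsWithIndex wraps the (empty) base RDD in a PipelinedRDD,
    # which is why type(sc.range(1, 8)) shows pyspark.rdd.PipelinedRDD
    return sc.parallelize([], numSlices).mapPartitionsWithIndex(gen_partition)
By contrast, sc.parallelize([1, 2, 3, 4, 5, 6, 7]) just distributes the list directly, so no mapping step is pipelined on top of it and you get a plain RDD.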

Related

Pandas MultiIndex in which one factor is an Enum

I'm having trouble working with a dataframe whose columns are a multiindex in which one of the iterables is an Enum. Consider the code:
from enum import Enum
import pandas as pd

MyEnum = Enum("MyEnum", "A B")
df = pd.DataFrame(columns=pd.MultiIndex.from_product(iterables=[MyEnum, [1, 2]]))
This raises
TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument.
This can be worked around by instead putting:
df = pd.DataFrame(columns=pd.MultiIndex.from_product(iterables=[
    pd.Series(MyEnum, dtype="category"),
    [1, 2],
]))
but then appending a row with
df.append({(MyEnum.A, 1): "abc", (MyEnum.B, 2): "xyz"}, ignore_index=True)
raises the same TypeError as before.
I've tried various variations on this theme, with no success. (No problems occur if the columns are not a MultiIndex but just an Enum.)
(Note that I can dodge this by using an IntEnum instead of an Enum. But then my columns simply appear as numbers, which is why I wanted to use an Enum in the first place rather than ints.)
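For reference, a minimal sketch of that IntEnum dodge as described above (the column labels then show up as plain integers):
from enum import IntEnum
import pandas as pd

# IntEnum members are orderable because they are ints, so from_product does not complain,
# but the resulting column labels display as 1 and 2 rather than MyEnum.A / MyEnum.B.
MyEnum = IntEnum("MyEnum", "A B")
df = pd.DataFrame(columns=pd.MultiIndex.from_product(iterables=[MyEnum, [1, 2]]))
print(df.columns)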
Many thanks!

How to fill Pandas dataframe column with numpy array in for loop

I have a for loop generating three values: age (data type int64), dayOfMonth (data type numpy.ndarray), and gender (data type str). I would like to store these three values from each iteration in a pandas DataFrame with the columns Age, Day, and Gender. Can you suggest how to do that? I'm using Python 3.x.
I tried this code inside for loop
df = pd.DataFrame(columns=["Age","Day", "Gender"])
for i in range(100):
df.loc[i]=[age,day,gender]
I can't share the sample data, but one example is:
age=38,day=array([[1],
[3],
[5],
...,
[25],
[26],
[30]], dtype=int64) and Gender='M'
But I'm getting the error message ValueError: setting an array element with a sequence.
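For what it's worth, a minimal sketch of one possible way around this (an assumption, not part of the original post): the error appears to come from trying to put a 2-D ndarray into a single cell, so collecting the rows first and storing the day values as a plain Python list sidesteps it:
import numpy as np
import pandas as pd

rows = []
for i in range(100):
    # placeholder values standing in for whatever the loop actually computes
    age, day, gender = 38, np.array([[1], [3], [5], [25], [26], [30]], dtype=np.int64), "M"
    rows.append({"Age": age, "Day": day.ravel().tolist(), "Gender": gender})

df = pd.DataFrame(rows, columns=["Age", "Day", "Gender"])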

Apache Spark - How to use groupBy groupByKey to form a (Key, List) pair

I have an org.apache.spark.sql.DataFrame = [id: bigint, name: string] in hand,
and the sample data in it looks like:
(1, "City1")
(2, "City3")
(1, "CityX")
(4, "CityZ")
(2, "CityN")
I am trying to form an output like
(1, ("City1", "CityX"))
(2, ("City3", "CityN"))
(4, ("CityZ"))
I tried the following variants
df.groupByKey.mapValues(_.toList).show(20, false)
df.groupBy("id").show(20, false)
df.rdd.groupByKey.mapValues(_.toList).show(20, false)
df.rdd.groupBy("id").show(20, false)
All of them complain either that groupBy/groupByKey is ambiguous or that the method is not found. Any help is appreciated.
I tried the solution posted in Spark Group By Key to (Key,List) Pair; however, it doesn't work for me and fails with the following error:
<console>:88: error: overloaded method value groupByKey with alternatives:
[K](func: org.apache.spark.api.java.function.MapFunction[org.apache.spark.sql.Row,K], encoder: org.apache.spark.sql.Encoder[K])org.apache.spark.sql.KeyValueGroupedDataset[K,org.apache.spark.sql.Row] <and>
[K](func: org.apache.spark.sql.Row => K)(implicit evidence$3: org.apache.spark.sql.Encoder[K])org.apache.spark.sql.KeyValueGroupedDataset[K,org.apache.spark.sql.Row]
cannot be applied to ()
Thanks.
Edit:
I did try the following:
val result = df.groupBy("id").agg(collect_list("name"))
which gives
org.apache.spark.sql.DataFrame = [id: bigint, collect_list(node): array<string>]
I am not sure how to work with this collect_list result. I am trying to dump it to a file by doing
result.rdd.coalesce(1).saveAsTextFile("test")
and I see the following
(1, WrappedArray(City1, CityX))
(2, WrappedArray(City3, CityN))
(4, WrappedArray(CityZ))
How do I dump it in the following format instead?
(1, (City1, CityX))
(2, (City3, CityN))
(4, (CityZ))
If you have an RDD of pairs, then you can use combineByKey(). To do this you have to pass three methods as arguments:
Method 1 takes a String, for example 'City1', as input, adds that String to an empty list, and returns the list.
Method 2 takes a String, for example 'CityX', and one of the lists created by the previous method, appends the String to that list, and returns the list.
Method 3 takes two lists as input and returns a new list with all the values from both argument lists.
combineByKey will then return an RDD of (key, list of values) pairs.
However, in your case you are starting off with a DataFrame, which I do not have much experience with. I imagine that you will need to convert it to an RDD of pairs in order to use combineByKey(), as in the sketch below.
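For illustration, a minimal PySpark sketch of this combineByKey pattern (the question itself is in Scala, so treat this as a sketch of the same idea rather than a drop-in answer; the sample pairs are assumed):
# Assumed sample pairs mirroring the question's data
pairs = sc.parallelize([(1, "City1"), (2, "City3"), (1, "CityX"), (4, "CityZ"), (2, "CityN")])

grouped = pairs.combineByKey(
    lambda city: [city],                   # method 1: start a new list from the first value
    lambda cities, city: cities + [city],  # method 2: append a value to an existing list
    lambda a, b: a + b)                    # method 3: merge two lists from different partitions

# grouped.collect() -> [(1, ['City1', 'CityX']), (2, ['City3', 'CityN']), (4, ['CityZ'])]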

Implementation of apriori with spark

I intend to implement the Apriori algorithm according to the YAFIM article with PySpark. Its processing workflow consists of two phases:
Phase 1: the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy the minimum support. (Diagram: http://bayanbox.ir/download/2410687383784490939/phase1.png)
Phase 2: in this phase, we iteratively use the frequent k-itemsets to generate the frequent (k+1)-itemsets. (Diagram: http://bayanbox.ir/view/6380775039559120652/phase2.png)
To implement the first phase I have written the following code:
from operator import add
transactions = sc.textFile("/FileStore/tables/wo7gkiza1500361138466/T10I4D100K.dat").cache()
minSupport = 0.05 * transactions.count()
items = transactions.flatMap(lambda line:line.split(" "))
itemCount = items.map(lambda item:(item, 1)).reduceByKey(add)
l1 = itemCount.filter(lambda (i,c): c > minSupport)
l1.take(5)
output: [(u'', 100000), (u'494', 5102), (u'829', 6810), (u'368', 7828), (u'766', 6265)]
My problem is that I have no idea how to implement the second phase, especially how to generate the candidate itemsets.
For example, suppose we have the following RDD (frequent 3-itemsets):
([1, 4, 5], 7), ([1, 4, 6], 6), ...
We want to generate the candidate 4-itemsets: whenever two of the 3-itemsets share the same first two items, their items are combined into a 4-itemset, as follows:
[1, 4, 5, 6], ...
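For what it's worth, a minimal PySpark sketch of that prefix-join idea (an assumption about one way to do it, not the article's exact method; the input RDD is assumed to hold sorted itemset tuples with their counts):
# Assumed input: an RDD of (sorted itemset tuple, support count) pairs, e.g. frequent 3-itemsets
freq3 = sc.parallelize([((1, 4, 5), 7), ((1, 4, 6), 6)])

# Key each itemset by its first k-1 items, then self-join itemsets that share the prefix
prefixed = freq3.map(lambda kv: (kv[0][:-1], kv[0][-1]))          # ((1, 4), 5), ((1, 4), 6)
candidates = (prefixed.join(prefixed)
              .filter(lambda kv: kv[1][0] < kv[1][1])             # keep each unordered pair once
              .map(lambda kv: kv[0] + (kv[1][0], kv[1][1])))      # (1, 4) + (5, 6) -> (1, 4, 5, 6)

# candidates.collect() -> [(1, 4, 5, 6)]; the candidates still need pruning and a support count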

Difference between RDD.foreach() and RDD.map()

I am learning Spark in Python and wondering whether anyone can explain the difference between the action foreach() and the transformation map()?
rdd.map() returns a new RDD, like the built-in map function in Python. However, I want to see an example of the rdd.foreach() function and understand the differences. Thanks!
A very simple example would be rdd.foreach(print) which would print the value of each row in the RDD but not modify the RDD in any way.
For example, this produces an RDD with the numbers 1 - 10:
>>> rdd = sc.parallelize(xrange(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The map call computes a new value for each row and returns it, so I get a new RDD. However, using foreach here would be useless, because foreach doesn't modify the RDD in any way:
>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>
Conversely, calling map with a function that returns None, such as print, isn't very useful:
>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]
The print call returns None, so mapping it just gives you a bunch of None values; you didn't want those values and you didn't want to save them, so returning them is a waste. (Note that the lines with 0, 1, 2, etc. are the print calls being executed; they don't show up until you call take, since the RDD is evaluated lazily. The contents of the RDD itself are just a bunch of None values.)
More simply, call map if you care about the return value of the function. Call foreach if you don't.
Map is a transformation: when you perform a map, you apply a function to each element of the RDD and get back a new RDD on which additional transformations or actions can be called.
Foreach is an action: it applies a function to each element but does not return a value. This is particularly useful when you have to perform some computation on an RDD and log the result somewhere else, for example in a database, or call a REST API with each element of the RDD.
For example let's say that you have an RDD with many queries that you wish to log in another system. The queries are stored in an RDD.
queries = <code to load queries or a transformation that was applied on other RDDs>
Then you want to save those queries in another system via a call to another API
import urllib2

def log_search(q):
    response = urllib2.urlopen('http://www.bigdatainc.org/save_query/' + q)

queries.foreach(log_search)
Now log_search has been executed on each element of the RDD. If you had used map instead, nothing would have happened yet until you called an action, as in the sketch below.
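A minimal sketch of that last point, reusing the names from the snippet above:
# With map, nothing runs until an action forces evaluation
logged = queries.map(log_search)   # lazy: no HTTP requests have been sent yet
logged.count()                     # only this action actually triggers log_search on each element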
