RDD foreach method provides no results - apache-spark

I am trying to understand how the foreach method works. In my Jupyter notebook, I tried:
def f(x): print(x)
a = sc.parallelize([1, 2, 3, 4, 5])
b = a.foreach(f)
print(type(b))
<class 'NoneType'>
I can execute that without any problem, but I get no output except from the print(type(b)) part. foreach doesn't return anything, just None. I don't know what foreach is supposed to do or how to use it. Can you explain to me what it is used for?

foreach is an action, and actions do not return anything; so you cannot use it the way you do, i.e. assigning the result to another variable like b = a.foreach(f) (see the discussion of actions in Learning Spark, pp. 41-42).
Adapting the simple example from the docs, run in a PySpark terminal:
>>> def f(x): print(x)
>>> a = sc.parallelize([1, 2, 3, 4, 5])
>>> a.foreach(f)
5
4
3
1
2
(NOTE: not sure about Jupyter, but the above code will not produce any print results in a Databricks notebook.)
You may also find the answers in this thread helpful.
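If the goal is simply to see the elements on the driver (e.g. in a notebook cell), a minimal sketch, assuming the same sc and RDD as above, is to bring the data back with collect() rather than printing inside foreach, whose print() calls run on the executors:
a = sc.parallelize([1, 2, 3, 4, 5])
# collect() returns the RDD's elements to the driver, so the prints show up in the notebook
for x in a.collect():
    print(x)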

I just use the following method and it works perfectly in a Jupyter notebook with PySpark:
for row in RDD.toLocalIterator():
    print(row)
It actually converts your RDD into a generator object, and with that generator object you can easily iterate over each element. Or, you can first create the generator object and then use it in your loop, like below:
genobj = data.toLocalIterator()
for row in genobj:
    print(row)

Related

Spark 3 with Pandas Vectorised UDFs

I'm looking at using Pandas UDFs in PySpark (v3). For a number of reasons, I understand that iterating and UDFs in general are bad, and I understand that the simple examples I show here could be done in PySpark using SQL functions - all of that is beside the point!
I've been following this guide: https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html
I have a simple example working from the docs:
import pandas as pd
from typing import Iterator, Tuple
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame(([1, 2, 3], [4, 5, 6], [8, 9, 0]), columns=["x", "y", "z"])
df = spark.createDataFrame(pdf)
@pandas_udf('long')
def test1(x: pd.Series, y: pd.Series) -> pd.Series:
    return x + y
df.select(test1(col("x"), col("y"))).show()
And this works well for performing basic arithmetic - if I want to add, multiply, etc., this is straightforward (but it is also straightforward in PySpark without pandas UDFs).
I want to do a comparison between the values for example:
@pandas_udf('long')
def test2(x: pd.Series, y: pd.Series) -> pd.Series:
    return x if x > y else y
df.select(test2(col("x"), col("y"))).show()
This errors with ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I understand that it is evaluating the whole Series rather than the row value.
There is also an iterator example. Again, this works fine for the basic arithmetic example they provide, but if I try to apply logic:
#pandas_udf("long")
def test3(batch_iter: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
for x, y in batch_iter:
yield x if x > y else y
df.select(test3(col("x"), col("y"))).show()
I get the same ValueError as before.
So my question is how should I perform row by row comparisons like this? Is it possible in a vectorised function? And if not then what are the use cases for them?
I figured this out. So simple after you write it down and publish the problem to the world.
All that needs to happen is to return an array and then convert to a Pandas Series:
@pandas_udf('long')
def test4(x: pd.Series, y: pd.Series) -> pd.Series:
    return pd.Series([a if a > b else b for a, b in zip(x, y)])

df.select(test4(col("x"), col("y"))).show()
I've spent the last two days looking for this answer, thank you simon_dmorias!
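As a side note, the same element-wise maximum can also be expressed without a Python-level loop, for example with numpy.maximum. This is just a sketch (the function name test4_vectorised is hypothetical, and it assumes the df and imports from the question above):
import numpy as np

@pandas_udf('long')
def test4_vectorised(x: pd.Series, y: pd.Series) -> pd.Series:
    # numpy.maximum compares the two Series element-wise, so no explicit loop is needed
    return pd.Series(np.maximum(x, y))

df.select(test4_vectorised(col("x"), col("y"))).show()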
I needed a slightly modified example here. I'm breaking out the single pandas_udf into multiple components for easier management. Here is an example of what I'm using for others to reference:
import pyspark.sql.functions as sf  # needed for sf.col below

xdf = pd.DataFrame(([1, 2, 3, 'Fixed'], [4, 5, 6, 'Variable'], [8, 9, 0, 'Adjustable']), columns=["x", "y", "z", "Description"])
df = spark.createDataFrame(xdf)

def fnRate(x):
    return pd.Series(['Fixed' if 'Fixed' in str(v) else 'Variable' if 'Variable' in str(v) else 'Other' for v in zip(x)])

@pandas_udf('string')
def fnRateRecommended(Description: pd.Series) -> pd.Series:
    varProduct = fnRate(Description)
    return varProduct

# call the function
df.withColumn("Recommendation", fnRateRecommended(sf.col("Description"))).show()

pass list of list into numba function in nopython mode, if element in list_of_list[0] does not work

See the following minimal code:
import numba

list_of_list = [[1, 2], [34, 100]]

@numba.njit()
def test(list_of_list):
    if 1 in list_of_list[0]:
        return 'haha'

test(list_of_list)
This won't work, and it seems that list_of_list[0] no longer behaves like a list at compile time. However, the following code works:
list_of_list = [[1, 2], [34, 100]][0]  # this is a list NOW!

@numba.njit()
def test(list_of_list):
    if 1 in list_of_list:
        return 'haha'

test(list_of_list)
This time, what I pass in is actually a list, NOT a list of lists, and it works. It seems a membership test on a list works in numba, but not on an element of a list of lists.
In my use case, passing a list of lists or array-like 2D data into a numba function is common. Sometimes I only need one element of the list, which is determined dynamically in the program.
To make it work, I did work out a solution: flatten list_of_list into one long list, then use a linear index to extract the single element of the original list_of_list.
My question: are there other, alternative solutions?
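To illustrate the flattening workaround mentioned above, here is a minimal sketch (hypothetical names, not the asker's actual code): store all values in one flat array plus per-row offsets, then index linearly inside the jitted function.
import numba
import numpy as np

list_of_list = [[1, 2], [34, 100]]
flat = np.array([v for row in list_of_list for v in row])   # [1, 2, 34, 100]
offsets = np.array([0, 2, 4])   # row i occupies flat[offsets[i]:offsets[i + 1]]

@numba.njit()
def row_contains(flat, offsets, row, value):
    for i in range(offsets[row], offsets[row + 1]):
        if flat[i] == value:
            return True
    return False

print(row_contains(flat, offsets, 0, 1))   # True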
The in operator works on sets. Returning a string can also cause some problems.
Working example
import numba as nb
import numpy as np
array_2D = np.array([[1, 2], [34, 100]])
@nb.njit()
def test(array_2D):
    if 1 in set(array_2D[0]):
        # Strings also sometimes cause problems
        # return 'haha'
        return 1
    else:
        return -1
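For reference, calling the working example above (a hypothetical usage line, not in the original answer):
print(test(array_2D))   # prints 1, because 1 is in the first row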
You can return a string with my revised version. It passed the test and worked successfully.
from numba import njit
import numpy as np
@njit
def test():
    if 1 in set(np_list_of_list[0]):
        return 'haha'

if __name__ == '__main__':
    list_of_list = [[1, 2], [34, 100]]
    np_list_of_list = np.array(list_of_list)
    print(test())

Trying to understand the following generator in python

I am trying to understand the difference between the following two code snippets. The second one just prints the generator, but the first snippet expands it and iterates over the generator. Why does this happen?
Is it because the square brackets expand any iterable object?
# Code snippet 1
li = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for col in range(0, 3):
    print([row[col] for row in li])
Output:
[1, 4, 7]
[2, 5, 8]
[3, 6, 9]
# Code snippet 2
li = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for col in range(0, 3):
    print(row[col] for row in li)
Output:
<generator object <genexpr> at 0x7f1e0aef55c8>
<generator object <genexpr> at 0x7f1e0aef55c8>
<generator object <genexpr> at 0x7f1e0aef55c8>
Why is the output of the above two snippets different?
The print function outputs the return value of the __str__ method of the objects passed to it. For lists, the __str__ method returns a nicely formatted string of comma-delimited item values enclosed in square brackets, but for generator objects, __str__ simply returns generic object information, so as to avoid altering the state of the generator.
By putting the expression in square brackets you are using a list comprehension, which explicitly builds a list by iterating through the expression's output. Since the items have already been produced, the __str__ method of the list has no problem returning their values.
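A small sketch of the difference (not from the original answer):
li = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
gen = (row[0] for row in li)
print(gen)                      # <generator object <genexpr> at 0x...> -- generic object info only
print([row[0] for row in li])   # [1, 4, 7] -- the list comprehension materialises the items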

Should I convert dict.keys() into list(dict.keys()) for iteration in Python3?, 2to3 suggests converting that

I am doing a migration from Python 2 to Python 3 with 2to3.
(Python 2.7.12 and Python 3.5.2, to be exact.)
During the migration, 2to3 suggests I use a type cast like the one below:
a = {1: 1, 2: 2, 3: 3}
for i in a.keys(): ----> for i in list(a.keys()):
print(i)
After that, I tried to check what difference there is:
$ python3
>>> a = {1: 1, 2: 2, 3: 3}
>>> a.keys()
dict_keys([1, 2, 3])
>>> for i in a.keys(): print(i)
1
2
3
It apparently returns a different type, dict_keys, rather than a list, but dict_keys still seems to work in a loop just like a list, without the type cast, in the simple code above.
I wonder whether there would be some side effect if I use it without the type cast.
If there is none, the cast looks like an unnecessary operation.
Why does 2to3 suggest it?
Generally it doesn't matter for iterating, but it does matter if you try to take an index, because keys() isn't a list in Python 3, so you can't index into it. For plain iteration, leaving the cast out is generally safe; just be aware of the cost of the list() call. Generally, if it was OK in Python 2 it will be OK in Python 3.
Here is a concrete example, in Python 3:
>>> a = {1: 1, 3: 2, 2: 3}
>>> a
{1: 1, 2: 3, 3: 2}
>>> a.keys()
dict_keys([1, 2, 3])
>>> type(a.keys())
<class 'dict_keys'>
>>>
whereas in Python 2:
>>> a = {1: 1, 2: 3, 3: 2}
>>> type(a.keys())
<type 'list'>
>>>
As Grady said, for iteration everything works well, but if you are designing an application that receives a list of keys in Python 2 and you port it to Python 3 and apply list functions to it, it will definitely throw an error.
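A quick sketch of that failure mode in Python 3 (using the dict from the question):
a = {1: 1, 2: 2, 3: 3}
try:
    first = a.keys()[0]      # worked in Python 2, where keys() returned a list
except TypeError as e:
    print(e)                 # e.g. 'dict_keys' object is not subscriptable
first = list(a.keys())[0]    # the cast suggested by 2to3 restores indexing
print(first)                 # 1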
I'm assuming this only became true recently, but now keys() is a list type, so I think type casting it to list does nothing, correct me if I'm wrong.
Edit:
Nevermind, apparently the program I was using was on an old version of python. I tried on an updated version of python, and it returns type dict_keys.

Difference between map and list iterators in python3

I ran into unexpected behaviour when working with map and list iterators in python3. In this MWE I first generate a map of maps. Then, I want the first element of each map in one list, and the remaining parts in the original map:
# s will be a map of maps
s = [[1, 2, 3], [4, 5, 6]]
s = map(lambda l: map(lambda t: t, l), s)
# uncomment to obtain desired output
# s = list(s) # s is now a list of maps
s1 = map(next,s)
print(list(s1))
print(list(map(list,s)))
Running the MWE as is in Python 3.4.2 yields the expected output for s1, namely
[1, 4],
but the empty list [] for s. Uncommenting the marked line yields the correct output: s1 as above, and the expected output for s as well, namely
[[2, 3], [5, 6]].
The docs say that map expects an iterable. Until today, I had seen no difference between map and list iterators. Could someone explain this behaviour?
PS: Curiously enough, if I uncomment the first print statement, the initial state of s is printed. So it could also be that this behaviour has something to do with a kind of lazy(?) evaluation of maps?
A map() is an iterator; you can only iterate over it once. You could get individual elements with next() for example, but once you run out of items you cannot get any more values.
I've given your objects a few easier-to-remember names:
>>> s = [[1, 2, 3], [4, 5, 6]]
>>> map_of_maps = map(lambda l: map(lambda t: t, l), s)
>>> first_elements = map(next, map_of_maps)
Iterating over first_elements here will in turn iterate over map_of_maps. You can only do so once, so once we run out of elements any further iteration will fail:
>>> next(first_elements)
1
>>> next(first_elements)
4
>>> next(first_elements)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
list() does exactly the same thing; it takes an iterable argument, and will iterate over that object to create a new list object from the results. But if you give it a map() that is already exhausted, there is nothing to copy into the new list anymore. As such, you get an empty result:
>>> list(first_elements)
[]
You need to recreate the map() from scratch:
>>> map_of_maps = map(lambda l: map(lambda t: t, l), s)
>>> first_elements = map(next, map_of_maps)
>>> list(first_elements)
[1, 4]
>>> list(first_elements)
[]
Note that a second list() call on the map() object resulted in an empty list object, once again.
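For completeness, here is the fix mentioned in the question (a sketch): materialise the outer map into a list first, so the outer level can be iterated more than once while the inner maps keep their state.
s = [[1, 2, 3], [4, 5, 6]]
s = list(map(lambda l: map(lambda t: t, l), s))   # a list of map objects now
s1 = map(next, s)           # consumes one element from each inner map
print(list(s1))             # [1, 4]
print(list(map(list, s)))   # [[2, 3], [5, 6]] -- the inner maps kept their position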
