Print the content of ResultIterable object

How can I print the content of a pyspark.resultiterable.ResultIterable object that holds a list of rows and columns?
Is there a built-in function for that?
I would like something like dataframe.show().

I was facing the same issue and solved it eventually, so let me share my way of doing it...
Let us assume we have two RDDs.
rdd1 = sc.parallelize([(1,'A'),(2,'B'),(3,'C')])
rdd2 = sc.parallelize([(1,'a'),(2,'b'),(3,'c')])
Let us cogroup those RDDs in order to get ResultIterable.
cogrouped = rdd1.cogroup(rdd2)
for t in cogrouped.collect():
    print(t)
Output:
(1, (<pyspark.resultiterable.ResultIterable object at 0x107c49450>, <pyspark.resultiterable.ResultIterable object at 0x107c95690>))
(2, (<pyspark.resultiterable.ResultIterable object at 0x107c95710>, <pyspark.resultiterable.ResultIterable object at 0x107c95790>))
(3, (<pyspark.resultiterable.ResultIterable object at 0x107c957d0>, <pyspark.resultiterable.ResultIterable object at 0x107c95810>))
Now we want to see what is inside of those ResultIterables.
We can do it like this:
def iterate(iterable):
    # iterable is the tuple of ResultIterables produced by cogroup
    r = []
    for v1_iterable in iterable:
        for v2 in v1_iterable:
            r.append(v2)
    return tuple(r)
x = cogrouped.mapValues(iterate)
for e in x.collect():
    print(e)
Or like this:
def iterate2(iterable):
    # same as iterate, but calling __iter__() explicitly
    r = []
    for x in iterable.__iter__():
        for y in x.__iter__():
            r.append(y)
    return tuple(r)
y = cogrouped.mapValues(iterate2)
for e in y.collect():
    print(e)
In both cases we will get the same result:
(1, ('A', 'a'))
(2, ('B', 'b'))
(3, ('C', 'c'))
Hopefully, this will help somebody in future.
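The nested loops above flatten both groups into a single tuple. If you would rather keep the two sides separate, each ResultIterable can simply be passed to list(). The sketch below uses plain Python iterators as stand-ins for ResultIterable objects so that it runs without Spark; the helper name materialize is hypothetical.

```python
def materialize(groups):
    """Turn a tuple of iterables (e.g. cogroup values) into a tuple of lists."""
    return tuple(list(group) for group in groups)

# Plain iterators stand in for pyspark ResultIterable objects here (assumption).
fake_cogrouped = [
    (1, (iter(['A']), iter(['a']))),
    (2, (iter(['B']), iter(['b']))),
]

printable = [(key, materialize(groups)) for key, groups in fake_cogrouped]
for row in printable:
    print(row)
```

With a real RDD, the same helper could presumably be applied as cogrouped.mapValues(materialize).collect().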

Related

How to merge multiple tuples or lists in to dictionary using loops?

Here is my code to merge all the tuples into a dictionary:
x = (1,2,3)
y = ('car',"truck","plane")
z=("merc","scania","boeing")
products={}
for i in x,y,z:
    products[x[i]]= {y[i]:z[i]}
output:
error:
6 for i in x,y,z:
----> 7 products[x[i]]= {y[i]:z[i]}
8
9 print(products)
TypeError: tuple indices must be integers or slices, not a tuple
If I instead use a fixed index inside the loop, as in the code below,
for i in x,y,z:
    products[x[0]]= {y[0]:z[0]}
print(products)
out:
{1: {'car': 'merc'}}
Here I get what I need, but only for the one index I specified. How do I create the complete dictionary from multiple lists/tuples?
Is it also possible to use the zip and map functions?

Use zip to iterate over your separate iterables/tuples in parallel
list(zip(x, y, z)) # [(1, 'car', 'merc'), (2, 'truck', 'scania'), (3, 'plane', 'boeing')]
x = (1, 2, 3)
y = ("car", "truck", "plane")
z = ("merc", "scania", "boeing")
products = {i: {k: v} for i, k, v in zip(x, y, z)}
print(products) # {1: {'car': 'merc'}, 2: {'truck': 'scania'}, 3: {'plane': 'boeing'}}

You should use integers as indices.
x = (1,2,3)
y = ('car',"truck","plane")
z=("merc","scania","boeing")
products={}
for i in range(len(x)):
    products[x[i]]= {y[i]:z[i]}
This should solve your problem.
To add to the above answer, here is a solution using map:
x = (1,2,3)
y = ('car',"truck","plane")
z=("merc","scania","boeing")
products=dict(map(lambda x,y,z:(x,{y:z}),x,y,z))
print(products)
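To see why the loop in the question raised the TypeError, it may help to print what the loop variable actually holds. The sketch below reuses the question's data; the names loop_values, index and key are only illustrative.

```python
x = (1, 2, 3)
y = ("car", "truck", "plane")
z = ("merc", "scania", "boeing")

# `for i in x, y, z` loops over the tuple (x, y, z), so i is a whole tuple each time.
loop_values = [i for i in (x, y, z)]
print(loop_values)

# enumerate yields integer positions, which are valid tuple indices.
products = {}
for index, key in enumerate(x):
    products[key] = {y[index]: z[index]}
print(products)
```

Using a whole tuple as an index is what triggers "tuple indices must be integers or slices, not tuple".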

How to find the second most repetitive character in string using python

How can I find the second most repeated character in a string? For example: "abcdaabdefaggcbd"
Output : d (because 'd' occurred 3 times where 'a' occurred 4 times)
How can I get this output? Please help me.
Given below is my code:
s="abcdaabdefaggcbd"
d={}
for i in s:
    d[i] = d.get(i,0)+1
print(d,"ddddd")
max2 = 0
for k,v in d.items():
    if(v>max2 and v<max(d.values())):
        max2=v
if max2 in d.values():
    print(k,"kkk")

The magnificent Python Counter and its most_common() method are very handy here.
import collections
my_string = "abcdaabdefaggcbd"
result = collections.Counter(my_string).most_common()
print(result[1])
Output
('b', 3)
In case you need to capture all the second values (if you have more than one entry) you can use the following:
import collections
my_string = "abcdaabdefaggcbd"
result = collections.Counter(my_string).most_common()
second_value = result[1][1]
seconds = []
for item in result:
    if item[1] == second_value:
        seconds.append(item)
print(seconds)
Output
[('b', 3), ('d', 3)]
I also wanted to add an example of solving the problem using a methodology more similar to the one that you showed in your question:
my_string="abcdaabdefaggcbd"
result={}
for character in my_string:
    if character in result:
        result[character] = result.get(character) + 1
    else:
        result[character] = 1
sorted_data = sorted([(value,key) for (key,value) in result.items()])
second_value = sorted_data[-2][0]
result = []
for item in sorted_data:
    if item[0] == second_value:
        result.append(item)
print(result)
Output
[(3, 'b'), (3, 'd')]
P.S. Please forgive me for taking the liberty of changing the variable names; I think it makes the answer more readable for a broader audience.

Sort the dict's items on their values (descending) and get the second item:
>>> from collections import Counter
>>> c = Counter("abcdaabdefaggcbd")
>>> vals = sorted(c.items(), key=lambda item:item[1], reverse=True)
>>> vals
[('a', 4), ('b', 3), ('d', 3), ('c', 2), ('g', 2), ('e', 1), ('f', 1)]
>>> print(vals[1])
('b', 3)
>>>
EDIT:
or just use Counter.most_common():
>>> from collections import Counter
>>> c = Counter("abcdaabdefaggcbd")
>>> print(c.most_common()[1])

Both b and d are second most repetitive. I would think that both should be displayed. This is how I would do it:
Code:
s="abcdaabdefaggcbd"
d={}
for i in s:
    ctr=s.count(i)
    d[i]=ctr
fir = max(d.values())
sec = 0
for j in d.values():
    if(j>sec and j<fir):
        sec = j
for k,v in d.items():
    if v == sec:
        print(k,v)
Output:
b 3
d 3

To find the second most repeated character in a string, you can use collections.Counter().
Here's an example:
import collections
s='abcdaabdefaggcbd'
count=collections.Counter(s)
print(count.most_common(2)[1])
Output: ('b', 3)
You can do a lot with Counter(). Here's a link for a further read:
More about Counter()
I hope this answers your question. Cheers!
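The answers that index most_common()[1] assume a single most frequent character; if two characters tie for first place, index 1 is still a first-place entry. A minimal sketch that ranks the distinct counts instead is shown here; the function name second_most_common is hypothetical.

```python
from collections import Counter

def second_most_common(text):
    """Return all (char, count) pairs sharing the second-highest distinct count."""
    counts = Counter(text)
    distinct = sorted(set(counts.values()), reverse=True)
    if len(distinct) < 2:
        return []  # no second-highest count exists
    second = distinct[1]
    return [(ch, n) for ch, n in counts.items() if n == second]

print(second_most_common("abcdaabdefaggcbd"))
print(second_most_common("aabbc"))
```

For the question's string this returns both 'b' and 'd' with a count of 3, and for "aabbc" it returns only 'c', since 'a' and 'b' tie for first.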

When utilizing a for loop what does each argument specify exactly?

I'm new to learning Python and have a clarifying question regarding for loops.
For instance:
dictionary_a = {"A": "Apple", "B": "Ball", "C": "Cat"}
dictionary_b = {"A": "Ant", "B": "Basket", "C": "Carrot"}
temp = ""
for k_a, v_a in dictionary_a.items():
    temp = dictionary_b[k_a]
    dictionary_b[k_a] = v_a
    dictionary_a[k_a] = temp
How exactly is k_a run through the interpreter? I understand v_a in dictionary_a.items() as simply iterating through the sequence in whatever collection.
But when for loops have the syntax for x, y in z I don't quite understand what values x takes with each iteration.
Hope I'm making some sense. Appreciate any help.

Iterating over dict.items() yields one 2-tuple (key, value) per entry. When you provide two variables in the for loop, the elements of each tuple are assigned to them in order: the key to the first variable and the value to the second.
Here is another example to help you understand the mechanics:
coordinates = [(1, 2, 3), (4, 5, 6)]
for x, y, z in coordinates:
    print(x)
Edit: unpacking can be more elaborate. For example, if you only want the first and last items of a long list, you can proceed as follows:
long_list = 'This is a very long list to process'.split()
first_item, *_, last_item = long_list

In Python you can unpack an iterable into multiple variables in a single assignment.
Let's use this example:
>>> a, b = [1, 2]
>>> a
1
>>> b
2
The above behavior is what is happening when you loop over a dictionary with the dict.items() method.
Here is an example of what is happening:
>>> a = {"abc":123, "def":456}
>>> a.items()
dict_items([('abc', 123), ('def', 456)])
>>> for i in a.items():
...     i
...
('abc', 123)
('def', 456)
>>>
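Putting the two ideas together, the two-variable for loop can be read as shorthand for taking each tuple and unpacking it by hand. A small sketch using the question's dictionary_a follows; the list names unpacked and manual are only illustrative.

```python
dictionary_a = {"A": "Apple", "B": "Ball", "C": "Cat"}

# Form 1: unpack directly in the for statement.
unpacked = []
for k_a, v_a in dictionary_a.items():
    unpacked.append((k_a, v_a))

# Form 2: take the whole tuple, then unpack it by hand.
manual = []
for pair in dictionary_a.items():
    k_a, v_a = pair
    manual.append((k_a, v_a))

print(unpacked == manual)
print(unpacked[0])
```

Both loops see the same values, so on the first iteration k_a is "A" and v_a is "Apple".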

Join RDD and get min value

I have multiple RDDs and want to find the words they have in common by joining them, then take the minimum count for each word. I join them with the code below:
from pyspark import SparkContext
sc = SparkContext("local", "Join app")
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y).map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda (x,y,z) : (x,y) if y<=z else (x,z))
final = joined.collect()
print "Join RDD -> %s" % (final)
But this throws the error below:
TypeError: int() argument must be a string or a number, not 'tuple'
So I am passing a tuple instead of a number, but I am not sure what is causing it. Any help is appreciated.

x.join(other, numPartitions=None): Return an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
Therefore you have a tuple as second element:
In [2]: x.join(y).collect()
Out[2]: [('spark', (1, 2)), ('hadoop', (4, 5))]
Solution :
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y)
final = joined.map(lambda x: (x[0], min(x[1])))
final.collect()
>>> [('spark', 1), ('hadoop', 4)]
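The shape of the joined data can also be checked without a Spark session. The sketch below is a plain-Python stand-in, with dictionaries in place of the two RDDs (an assumption made purely for illustration), showing why int() fails on the joined value while min() works.

```python
# Plain-Python stand-in for the two RDDs (assumption: no Spark needed to see the shapes).
x = {"spark": 1, "hadoop": 4}
y = {"spark": 2, "hadoop": 5}

# Emulate join: keep keys present on both sides, pair up their values.
joined = [(key, (x[key], y[key])) for key in x if key in y]
print(joined)

# int() on the (v1, v2) tuple is what raised the TypeError in the question.
try:
    int(joined[0][1])
    int_failed = False
except TypeError:
    int_failed = True
print(int_failed)

# min() accepts the tuple directly.
final = [(key, min(values)) for key, values in joined]
print(final)
```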

Only first index of tuples list being used

I'm currently returning a list of tuples using the zip function.
The returned data is: [(13, 3), (12, 3), (11, 3), (10, 3), (9, 3), (8, 3), (6, 3), (5, 3), (4, 3)]
I'm looping through the data to use it, but currently only the first index is being printed.
CheckPath = self.CheckQueenPathDown(QueenRowColumn,TheComparisonQueen) #This is where the list of tuples is being used
print(CheckPath) # this shows all the correct data when i print it.
for TheQueenMoves in QueenMoves:
    for a,b in list(self.pieces.items()):
        for CheckThePath in CheckPath:
            if TheComparisonQueen == TheQueenMoves and TheComparisonQueen[0] >= 0 and TheComparisonQueen[1] <= 7 and \
                    TheComparisonQueen[1] >= 0 and TheComparisonQueen[0] <= 7 and CheckThePath != b: # this is the line I'm trying to use it in.
                self.placepiece(piece, row = MoveRow, column = MoveColumn)
                print(CheckThePath)
This is the code I am getting the data from:
Example data:
QueenRowColumn: (3,3)
TheComparisonQueen: (7,3)
def CheckQueenPathDown(self, QueenRowColumn, TheComparisonQueen):
    row = []
    column = []
    CurrentLocation = QueenRowColumn
    #MoveLocation = TheComparisonQueen
    a = QueenRowColumn[0]
    b = QueenRowColumn[1]
    for i in range (-10,0):
        row.append(CurrentLocation[1] - i)
        column.append(a)
    Down = zip(row,column)
    #Down.remove(TheComparisonQueen)
    return Down
I'm trying to use all the values of the returned data by looping through it, but only the first index appears when I print it, and I don't understand what the problem is. Any idea how to fix this?

zip doesn't make a list on Python 3. If you need a list, call list on the result.
On Python 3, zip returns an iterator, which is exhausted after iterating over it once. If you try to reuse it, you get no elements out of it.
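The exhaustion is easy to reproduce in isolation. Here is a minimal sketch on Python 3, using a few made-up row and column values rather than the question's full data:

```python
rows = [13, 12, 11]
columns = [3, 3, 3]

pairs = zip(rows, columns)      # an iterator on Python 3
first_pass = list(pairs)        # consumes the iterator
second_pass = list(pairs)       # nothing left to yield
print(first_pass)
print(second_pass)

reusable = list(zip(rows, columns))  # a real list can be looped over repeatedly
print(reusable == list(reusable))
```

In the question's nested loops, the inner loop over the zip object runs to completion on the first outer iteration and then finds it empty on every later one.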

Try:
def CheckQueenPathDown(self, QueenRowColumn, TheComparisonQueen):
    row = []
    column = []
    CurrentLocation = QueenRowColumn
    #MoveLocation = TheComparisonQueen
    a = QueenRowColumn[0]
    b = QueenRowColumn[1]
    for i in range (-10,0):
        row.append(CurrentLocation[1] - i)
        column.append(a)
    Down = zip(row,column)
    #Down.remove(TheComparisonQueen)
    return list(Down)
Returning list(Down) materializes the pairs into a list, so you can iterate over the result repeatedly in the nested loop.
