zipWithIndex fails in PySpark - apache-spark

I have an RDD like this
>>> termCounts.collect()
[(2, 'good'), (2, 'big'), (1, 'love'), (1, 'sucks'), (1, 'sachin'), (1, 'formulas'), (1, 'batsman'), (1, 'time'), (1, 'virat'), (1, 'modi')]
When I zip this to create a dictionary, it gives me what looks like random output:
>>> vocabulary = termCounts.map(lambda x: x[1]).zipWithIndex().collectAsMap()
>>> vocabulary
{'formulas': 5, 'good': 0, 'love': 2, 'modi': 9, 'big': 1, 'batsman': 6, 'sucks': 3, 'time': 7, 'virat': 8, 'sachin': 4}
Is this the expected output? I wanted to create a dictionary with each word as key and their respective count as value

You need to write it like this to map each word to its count:
vocabulary = termCounts.map(lambda x: (x[1], x[0])).collectAsMap()
By the way, the code you have written maps each word to its index in the list, not to its count.
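Given the termCounts shown above, this should produce a word-to-count dictionary along these lines:
>>> vocabulary
{'good': 2, 'big': 2, 'love': 1, 'sucks': 1, 'sachin': 1, 'formulas': 1, 'batsman': 1, 'time': 1, 'virat': 1, 'modi': 1}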

Related

List of lists: get the elements whose values are greater than 3

I have two lists, each of size 250000. I want to iterate through the lists and return the values that are greater than 3.
For example:
import itertools
from array import array
import numpy as np
input = (np.array([list([8,1]), list([2,3,4]), list([5,3])],dtype=object), np.array([1,0,0,0,1,1,1]))
X = input[0]
y = input[1]
res = [u for s in X for u in zip(y, s)]
res
I don't get the expected output.
Actual res: [(1, 8), (0, 1), (1, 2), (0, 3), (0, 4), (1, 5), (0, 3)]
Expected output 1: [(8, 1), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (3, 1)]
Expected output 2: [(8, 1), (4, 1), (5, 1)] ---> for greater than 3
I looked at references on Stack Overflow and tried itertools as well.
Using NumPy to store lists of non-uniform lengths creates a whole lot of issues, like the ones you are seeing. If it were an array of integers, you could simply do
X[X > 3]
but since it is an array of lists, you have to jump through all sorts of hoops to get what you want, and basically lose all the advantages of using NumPy in the first place. You could just as well use lists of lists and skip NumPy altogether.
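For what it's worth, here is a minimal plain-Python sketch of what the expected outputs describe (flatten X, pair each value with the corresponding element of y, then filter), assuming the X and y defined above:
from itertools import chain

# pair each flattened value of X with the matching element of y
pairs = list(zip(chain.from_iterable(X), y))
# [(8, 1), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (3, 1)]

# keep only the pairs whose value is greater than 3
greater = [(u, f) for u, f in pairs if u > 3]
# [(8, 1), (4, 1), (5, 1)]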
As an alternative I would recommend using Pandas or something else more suitable than NumPy:
import pandas as pd
df = pd.DataFrame({
    'group': [0, 0, 1, 1, 1, 2, 2],
    'data': [8, 1, 2, 3, 4, 5, 4],
    'flag': [1, 0, 0, 0, 1, 1, 1],
})
df[df['data'] > 3]
#    group  data  flag
# 0      0     8     1
# 4      1     4     1
# 5      2     5     1
# 6      2     4     1
Use filter
For example:
input = [1, 3, 2, 5, 6, 7, 8, 22]
# result contains the even numbers of the list
result = list(filter(lambda x: x % 2 == 0, input))
This should give you result = [2, 6, 8, 22]. Note that in Python 3, filter returns an iterator, so wrap it in list() to get a list back.
Not sure I quite understand exactly what you're trying to do, but filter is probably a good way.

NumPy: unable to access columns

I'm working on an ML project for which I'm using NumPy arrays instead of pandas for faster computation.
When I bootstrap, I want to subset columns from a NumPy ndarray.
My numpy array looks like this:
np_arr =
[(187., 14.45, 20.22, 94.49)
 (284., 10.44, 15.46, 66.62)
 (415., 11.13, 22.44, 71.49)]
I want to index columns 1 and 3, which I have stored in a list as ix = [1, 3].
However, when I try np_arr[:, ix] I get an error saying too many indices for array.
I also noticed that np_arr.shape prints as (3,), whereas I would expect (3, 4).
Could you please tell me how to fix this?
Thanks!
Edit:
I'm creating my numpy object from my pandas dataframe like this:
def _to_numpy(self, data):
    v = data.reset_index()
    np_res = np.rec.fromrecords(v, names=v.columns.tolist())
    return np_res
The reason for your issue is that your np_arr is a 1-D array, not a 2-D one. In general, when dealing with 2-D NumPy arrays, we create them like this:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
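With a genuine 2-D array like this, the column indexing from the question works as expected; for example:
ix = [1, 3]
a[:, ix]
# array([[ 2,  4],
#        [ 6,  8],
#        [10, 12]])
a.shape
# (3, 4)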
You have created a record array (also called a structured array). The result is a 1d array with named columns (fields).
To illustrate:
In [426]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['A','B','C'])
In [427]: df
Out[427]:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
In [428]: arr = df.to_records()
In [429]: arr
Out[429]:
rec.array([(0, 0, 1, 2), (1, 3, 4, 5), (2, 6, 7, 8), (3, 9, 10, 11)],
dtype=[('index', '<i8'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [430]: arr['A']
Out[430]: array([0, 3, 6, 9])
In [431]: arr.shape
Out[431]: (4,)
to_records has an index parameter, to_records(index=False), to eliminate the index field.
Or with your method:
In [432]: arr = np.rec.fromrecords(df, names=df.columns.tolist())
In [433]: arr
Out[433]:
rec.array([(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)],
dtype=[('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [434]: arr['A'] # arr.A also works
Out[434]: array([0, 3, 6, 9])
In [435]: arr.shape
Out[435]: (4,)
And multifield access:
In [436]: arr[['A','C']]
Out[436]:
rec.array([(0, 2), (3, 5), (6, 8), (9, 11)],
dtype={'names':['A','C'], 'formats':['<i8','<i8'], 'offsets':[0,16], 'itemsize':24})
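If you need that multifield selection as a plain 2-D array, numpy.lib.recfunctions (NumPy 1.16+) provides a converter; a sketch, using the arr above:
from numpy.lib.recfunctions import structured_to_unstructured
structured_to_unstructured(arr[['A','C']])
# array([[ 0,  2],
#        [ 3,  5],
#        [ 6,  8],
#        [ 9, 11]])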
Note that the str display of this array
In [437]: print(arr)
[(0, 1, 2) (3, 4, 5) (6, 7, 8) (9, 10, 11)]
shows a list of tuples, just as your np_arr. Each tuple is a 'record'. The repr display shows the dtype as well.
You can't have it both ways: either access columns by name, or make a regular numpy array and access columns by number. The named/record access makes the most sense when the columns are a mix of dtypes - string, int, float. If they are all float and you want to do calculations across columns, it's better to use the numeric dtype.
In [438]: arr = df.to_numpy()
In [439]: arr
Out[439]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
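Now positional column indexing behaves as originally intended; e.g. picking two columns by number:
ix = [0, 2]
arr[:, ix]
# array([[ 0,  2],
#        [ 3,  5],
#        [ 6,  8],
#        [ 9, 11]])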

how to cluster a nested list by a specific variable and do some statistics for other variables

I have a nested list that is already sorted by its first element. I want to regroup the nested list based on the unique identifier ('a', 'b', 'c') and do some statistical calculations on the other corresponding variables. Below is what I have and what I want to get. Any hints? Thanks.
What I already have:
sortedlist=[['a',5,7],['a',3,5],['a',6,7],['b',3,7],['b',5,5],['b',6,9],['b',5,5],['c',1,9],['c',4,5]]
What I want to get first:
target_list1 = [[['a','a','a'],[5,3,6],[7,5,7]], [['b','b','b','b'],[3,5,6,5],[7,5,9,5]], [['c','c'],[1,4],[9,5]]]
What I want to get next:
target_list2 = [['a', median([5,3,6]), median([7,5,7])], ['b', median([3,5,6,5]), median([7,5,9,5])], ['c', median([1,4]), median([9,5])]]
First, you can use the itertools.groupby function. This will group your list of lists by a key function, in this case the first element of each list:
groups = itertools.groupby(sortedlist, operator.itemgetter(0))
This gives you tuples of keys and groups (which are iterators themselves). The expression [list(g[1]) for g in groups] will evaluate to:
[[['a', 5, 7], ['a', 3, 5], ['a', 6, 7]],
[['b', 3, 7], ['b', 5, 5], ['b', 6, 9], ['b', 5, 5]],
[['c', 1, 9], ['c', 4, 5]]]
Now you want to transpose that list of lists, which can be done with the zip() function. (Note that groupby returns a one-shot iterator, so if you already consumed groups above, re-create it before this step.)
target_list1 = [list(zip(*g[1])) for g in groups]
This evaluates to:
[[('a', 'a', 'a'), (5, 3, 6), (7, 5, 7)],
[('b', 'b', 'b', 'b'), (3, 5, 6, 5), (7, 5, 9, 5)],
[('c', 'c'), (1, 4), (9, 5)]]
Next you want to apply your function to the result. As this is tagged with python3, we can use the statistics module:
>>> import statistics
>>> [ [l[0][0]] + [statistics.median(x) for x in l[1:]] for l in target_list1]
Bringing this all together:
import itertools
import operator
import statistics
target_list1 = [list(zip(*g[1]))
                for g in itertools.groupby(sortedlist, operator.itemgetter(0))]
target_list2 = [[l[0][0]] + [statistics.median(x) for x in l[1:]]
                for l in target_list1]
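Assuming the sortedlist from the question, this should yield:
# target_list2:
# [['a', 5, 7], ['b', 5.0, 6.0], ['c', 2.5, 7.0]]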

Combine two lists by appending the element of one list to the element of the other list in Python

I have two lists of the same length, like [[-3, -2, 1], [2, 3, 5], [1, 2, 3], ..., [7, 8, 9]] and [-1, 1, 1, ..., 1]. I would like to combine them as [(-3, -2, 1, -1), (2, 3, 5, 1), (1, 2, 3, 1), ..., (7, 8, 9, 1)] in Python.
Any comments are appreciated.
>>> a = [(-3, -2, 1),(2,3,5),(1,2,3), (7,8,9)]
>>> b = [-1,1,1, 1]
>>> [i+(j,) for i, j in zip(a, b)]
[(-3, -2, 1, -1), (2, 3, 5, 1), (1, 2, 3, 1), (7, 8, 9, 1)]
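Note that a here holds tuples; if your inner sequences are lists, as in the question, i + (j,) will raise a TypeError. A spelling that works for both (Python 3.5+ star-unpacking):
>>> [(*i, j) for i, j in zip(a, b)]
[(-3, -2, 1, -1), (2, 3, 5, 1), (1, 2, 3, 1), (7, 8, 9, 1)]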

Multiply a dictionary value by other in a nested dictionary

Sorry for the long question.
I have a dictionary like this in Python 3.2:
d = {'Paul ': [5, Counter({'i': 1, 'in': 1, 't': 1, 'fa': 1, 'cr': 1})],
'Lisa ': [4, Counter({'f': 3, 'bo': 1})],
'Tracy ': [6, Counter({'sl': 3, 'fi': 1, 'an': 1, 'piz': 1})],
'Maria': [2, Counter({'t': 2})]}
I need to multiply each of the Counter values by the first value in the list and append those values to the key:
d2 = {'Paul': [5, {'i': (2, 10), 'in': (4, 20), 't': (3, 15), 'fa': (2, 10), 'cr': (2, 10)}],
'Lisa': [4, {'f': (3, 12), 'bo': (8, 32)}],
'Tracy': [6, {'sl': (3, 18), 'fi': (1, 6), 'an': (5, 30), 'piz': (2, 12)}]}
So, in the nested dictionary, each key should map to a pair of numbers: the first is the value originally assigned to that key, and the second is that value multiplied by the first number in the outer list. As seen above, for instance, the key 'i' nested under 'Paul' is paired with 10 in the output (the counter value of 'i' multiplied by 5, the first value for 'Paul'). And so on...
I have tried transforming it into tuples:
mult = {}
for key, lst in d.items():
    mult[key] = [lst[0], {k: (lst[0], float(v)/lst[0]) for k, v in lst[1].items()}]
But I got the wrong outcome, for instance:
d2 = {'Paul ': [5, {'i': (5, 10), 'in': (5, 20), 't': (5, 15), 'cr': (5, 10)}]...
in which the original value attributed to each key is replaced by the outer list's first value. I have tried different things to get around this, to no avail. Does anyone have a tip on how to solve this? I am learning Python slowly, and this would be of great help. Thank you!
{k: (lst[0], float(v)/lst[0]) for k, v in lst[1].items()}
The first value here is lst[0], which is the first value from the list in the dictionary, but not the original value in the counter; that would be v. (Note also that float(v)/lst[0] divides rather than multiplies.)
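Putting that together, a minimal corrected sketch following the rule the question states (pair each counter value v with v multiplied by the list's first number):
mult = {}
for key, lst in d.items():
    n = lst[0]  # the first value in the list, e.g. 5 for 'Paul'
    mult[key] = [n, {k: (v, v * n) for k, v in lst[1].items()}]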
