Multiply a dictionary value by another in a nested dictionary - python-3.x

Sorry for the long question.
I have a dictionary like this in Python 3.2:
d = {'Paul ': [5, Counter({'i': 1, 'in': 1, 't': 1, 'fa': 1, 'cr': 1})],
'Lisa ': [4, Counter({'f': 3, 'bo': 1})],
'Tracy ': [6, Counter({'sl': 3, 'fi': 1, 'an': 1, 'piz': 1})],
'Maria': [2, Counter({'t': 2})]}
I need to multiply each of the Counter values by the first value in the list for that key and store both numbers under the nested key:
d2 = {'Paul': [5, {'i': (2, 10), 'in': (4, 20), 't': (3, 15), 'fa': (2, 10), 'cr': (2, 10)}],
'Lisa': [4, {'f': (3, 12), 'bo': (8, 32)}],
'Tracy': [6, {'sl': (3, 18), 'fi': (1, 6), 'an': (5, 30), 'piz': (2, 12)}]}
That way, each key in the nested dictionary would hold a pair of numbers: the first is the value originally assigned to that key, and the second is that value multiplied by the first number in the outer list. As seen above, for instance, the key i nested in Paul has a value 1 in the original file and this same value followed by 10 in the output (10 is the value of the key 'i': 2 multiplied by 5, the first value for the key 'Paul' : 5, Counter({'i':2...). And so on...
I have tried transforming it into tuples:
for key, lst in d.items():
    mult[key] = [lst[0], {k: (lst[0], float(v)/lst[0]) for k, v in lst[1].items()}]
But I got the wrong outcome, for instance:
d2 = {'Paul ': [5, {'i': (5, 10), 'in': (5, 20), 't': (5, 15), 'cr: (5, 10}]...
in which the original value of each key is replaced by the first value of the outer list. I tried several ways to get around this, but to no avail. Does anyone have a tip on how to solve this? I am learning Python very slowly, and this would be of great help. Thank you!

{k: (lst[0], float(v)/lst[0]) for k, v in lst[1].items()}
The first element of each tuple here is lst[0], which is the first item of the outer list, not the original value in the Counter; that would be v. The second element also divides v by lst[0], whereas the question asks for a multiplication.
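A minimal sketch of the corrected loop, assuming the goal is a pair (original count, count times the list's first value). The keys are written without trailing spaces and only two entries of d are used for brevity; the counts are the ones listed in d, so 'i': 1 under a factor of 5 gives (1, 5).

```python
from collections import Counter

# Two entries of the question's data, with the counts as listed in d.
d = {
    'Paul': [5, Counter({'i': 1, 'in': 1, 't': 1, 'fa': 1, 'cr': 1})],
    'Lisa': [4, Counter({'f': 3, 'bo': 1})],
}

mult = {}
for key, (factor, counts) in d.items():
    # Keep the original count v first, then the product v * factor.
    mult[key] = [factor, {k: (v, v * factor) for k, v in counts.items()}]

print(mult['Lisa'])  # [4, {'f': (3, 12), 'bo': (1, 4)}]
```

Unpacking the list as (factor, counts) in the for statement avoids the repeated lst[0] and lst[1] lookups that made the original mix-up easy to miss.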

Related

Find integer consecutive run starting indices in a list

Example:
nums = [1,2,3,5,10,9,8,9,10,11,7,8,7]
I am trying to find the first index of numbers in consecutive runs of -1 or 1 direction where the runs are >= 3.
So the desired output from the above nums would be:
[0,4,6,7]
I have tried
grplist = [list(group) for group in more_itertools.consecutive_groups(A)]
output: [[1, 2, 3], [5], [10], [9], [8, 9, 10, 11], [7, 8], [7]]
It returns nested lists, but it only seems to go in the +1 direction, and it does not return the starting index.
listindx = [list(j) for i, j in groupby(enumerate(A), key=itemgetter(1))]
output: [[(0, 1)], [(1, 2)], [(2, 3)], [(3, 5)], [(4, 10)], [(5, 9)], [(6, 8)], [(7, 9)], [(8, 10)], [(9, 11)], [(10, 7)], [(11, 8)], [(12, 7)]]
This does not check for consecutive runs but it does return indices.
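One possible reading of the desired output [0,4,6,7] is "every index at which a window of three items steps uniformly by +1 or by -1". Under that assumption, a plain sliding-window sketch reproduces it; the function name run_starts and the parameter min_len are made up for illustration and assume min_len is at least 2.

```python
def run_starts(nums, min_len=3):
    """Return each index i where nums[i:i + min_len] steps by +1 or by -1 throughout."""
    starts = []
    for i in range(len(nums) - min_len + 1):
        window = nums[i:i + min_len]
        # Collect the distinct differences between neighbours in the window.
        steps = {b - a for a, b in zip(window, window[1:])}
        if steps == {1} or steps == {-1}:
            starts.append(i)
    return starts

nums = [1, 2, 3, 5, 10, 9, 8, 9, 10, 11, 7, 8, 7]
print(run_starts(nums))  # [0, 4, 6, 7]
```

Note that index 7 appears because 9, 10, 11 is itself a valid window inside the longer run 8, 9, 10, 11; if only the first index of each maximal run is wanted, the expected output would differ from the one in the question.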

Numpy unable to access columns

I'm working on a ML project for which I'm using numpy arrays instead of pandas for faster computation.
When I intend to bootstrap, I wish to subset the columns from a numpy ndarray.
My numpy array looks like this:
np_arr =
[(187., 14.45 , 20.22, 94.49)
(284., 10.44 , 15.46, 66.62)
(415., 11.13 , 22.44, 71.49)]
And I want to index columns 1,3.
I have my columns stored in a list as ix = [1,3]
However, when I try np_arr[:,ix], I get an error saying "too many indices for array".
I also realised that when I print np_arr.shape I only get (3,), whereas I probably want (3,4).
Could you please tell me how to fix my issue.
Thanks!
Edit:
I'm creating my numpy object from my pandas dataframe like this:
def _to_numpy(self, data):
    v = data.reset_index()
    np_res = np.rec.fromrecords(v, names=v.columns.tolist())
    return(np_res)
The reason for your issue is that np_arr is a 1-D array. Please share your code snippet so the exact cause can be identified. In general, a 2-D numpy array is created like this:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
You have created a record array (also called a structured array). The result is a 1d array with named columns (fields).
To illustrate:
In [426]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['A','B','C'])
In [427]: df
Out[427]:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
In [428]: arr = df.to_records()
In [429]: arr
Out[429]:
rec.array([(0, 0, 1, 2), (1, 3, 4, 5), (2, 6, 7, 8), (3, 9, 10, 11)],
dtype=[('index', '<i8'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [430]: arr['A']
Out[430]: array([0, 3, 6, 9])
In [431]: arr.shape
Out[431]: (4,)
to_records accepts index=False to omit the index field.
Or with your method:
In [432]: arr = np.rec.fromrecords(df, names=df.columns.tolist())
In [433]: arr
Out[433]:
rec.array([(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)],
dtype=[('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [434]: arr['A'] # arr.A also works
Out[434]: array([0, 3, 6, 9])
In [435]: arr.shape
Out[435]: (4,)
And multifield access:
In [436]: arr[['A','C']]
Out[436]:
rec.array([(0, 2), (3, 5), (6, 8), (9, 11)],
dtype={'names':['A','C'], 'formats':['<i8','<i8'], 'offsets':[0,16], 'itemsize':24})
Note that the str display of this array
In [437]: print(arr)
[(0, 1, 2) (3, 4, 5) (6, 7, 8) (9, 10, 11)]
shows a list of tuples, just as your np_arr. Each tuple is a 'record'. The repr display shows the dtype as well.
You can't have it both ways: either access columns by name, or make a regular numpy array and access columns by number. The named/record access makes most sense when columns are a mix of dtypes (string, int, float). If they are all float, and you want to do calculations across columns, it's better to use a numeric dtype.
In [438]: arr = df.to_numpy()
In [439]: arr
Out[439]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
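If the record array already exists and going back to the DataFrame is inconvenient, one way to get a regular 2-D array is to stack the fields by name. This is only a sketch using the three rows from the question; the field names id, f1, f2 and f3 are invented for illustration, and it assumes every field is numeric.

```python
import numpy as np

# Rebuild a record array like the asker's; the field names are hypothetical.
rows = [(187., 14.45, 20.22, 94.49),
        (284., 10.44, 15.46, 66.62),
        (415., 11.13, 22.44, 71.49)]
rec = np.rec.fromrecords(rows, names=['id', 'f1', 'f2', 'f3'])
print(rec.shape)  # (3,)

# Stack each named field as one column of an ordinary 2-D float array.
plain = np.column_stack([rec[name] for name in rec.dtype.names])
print(plain.shape)  # (3, 4)

# Positional column indexing now works.
ix = [1, 3]
subset = plain[:, ix]
print(subset.shape)  # (3, 2)
```

The stacked array is a copy, so later changes to rec are not reflected in plain.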

how to cluster a nested list by a specific variable and do some statistics for other variables

I have a nested list that is already sorted by its first element. I want to regroup it by the unique identifier ('a','b','c') and compute some statistics on the other corresponding variables. Below is what I have and what I want to get. Any hints? Thanks.
What I already have:
sortedlist=[['a',5,7],['a',3,5],['a',6,7],['b',3,7],['b',5,5],['b',6,9],['b',5,5],['c',1,9],['c',4,5]]
What I want to get first:
target_list1 = [[['a','a','a'],[5,3,6],[7,5,7]],[['b','b','b','b'],[3,5,6,5],[7,5,9,5]],[['c','c'],[1,4],[9,5]]]
What I want to get next:
target_list2 = [['a',median[5,3,6],median[7,5,7]],['b',median[3,5,6,5],median[7,5,9,5]],['c',median[1,4],median[9,5]]]
First, you can use the itertools.groupby method. This will group your list of lists by a key function, in this case the first element of each list:
groups = itertools.groupby(sortedlist, operator.itemgetter(0))
This gives you tuples of keys and groups (which are iterators themselves). The expression [list(g[1]) for g in groups] will evaluate to:
[[['a', 5, 7], ['a', 3, 5], ['a', 6, 7]],
[['b', 3, 7], ['b', 5, 5], ['b', 6, 9], ['b', 5, 5]],
[['c', 1, 9], ['c', 4, 5]]]
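Now you want to transpose each group, and this can be done with the zip() function. Note that groups is a one-shot iterator, so call itertools.groupby again if you have already consumed it with the expression above: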
target_list1 = [list(zip(*g[1])) for g in groups]
This evaluates to:
[[('a', 'a', 'a'), (5, 3, 6), (7, 5, 7)],
[('b', 'b', 'b', 'b'), (3, 5, 6, 5), (7, 5, 9, 5)],
[('c', 'c'), (1, 4), (9, 5)]]
Next you want to apply your function to the result. As this is tagged with python3, we can use the statistics module:
>>> import statistics
>>> [ [l[0][0]] + [statistics.median(x) for x in l[1:]] for l in target_list1]
Bringing this all together:
import itertools
import operator
import statistics
target_list1 = [list(zip(*g[1]))
for g in itertools.groupby(sortedlist, operator.itemgetter(0))]
target_list2 = [ [l[0][0]] + [ statistics.median(x) for x in l[1:]]
for l in target_list1]
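For reference, here is a self-contained sketch that runs the same groupby, zip and median steps on the question's data in a single pass and shows the result. The helper name group_medians is made up for illustration, and the input is assumed to be sorted by its first element, as stated in the question.

```python
import itertools
import operator
import statistics

def group_medians(rows):
    """Group rows by their first element and take the median of every other column."""
    result = []
    for key, group in itertools.groupby(rows, key=operator.itemgetter(0)):
        columns = list(zip(*group))  # transpose this group's rows into columns
        result.append([key] + [statistics.median(col) for col in columns[1:]])
    return result

sortedlist = [['a', 5, 7], ['a', 3, 5], ['a', 6, 7],
              ['b', 3, 7], ['b', 5, 5], ['b', 6, 9], ['b', 5, 5],
              ['c', 1, 9], ['c', 4, 5]]

target_list2 = group_medians(sortedlist)
print(target_list2)  # [['a', 5, 7], ['b', 5.0, 6.0], ['c', 2.5, 7.0]]
```

Groups with an even number of rows yield floats because statistics.median averages the two middle values.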

Combine two lists by appending the element of one list to the element of the other list in Python

I have two lists of the same length, like [[-3, -2, 1],[2,3,5],[1,2,3]...[7,8,9]] and [-1,1,1,...1]. I would like to combine them as [(-3,-2,1,-1), (2,3,5,1), (1,2,3,1)...(7,8,9,1)] in Python.
I would appreciate any comments.
>>> a = [(-3, -2, 1),(2,3,5),(1,2,3), (7,8,9)]
>>> b = [-1,1,1, 1]
>>> [i+(j,) for i, j in zip(a, b)]
[(-3, -2, 1, -1), (2, 3, 5, 1), (1, 2, 3, 1), (7, 8, 9, 1)]
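The snippet above starts from tuples, while the question shows a list of lists. A small variation, sketched here with the elided middle entries left out, converts each inner list to a tuple before appending the extra element.

```python
a = [[-3, -2, 1], [2, 3, 5], [1, 2, 3], [7, 8, 9]]
b = [-1, 1, 1, 1]

# tuple(row) turns each inner list into a tuple so that a 1-tuple can be concatenated.
combined = [tuple(row) + (extra,) for row, extra in zip(a, b)]
print(combined)  # [(-3, -2, 1, -1), (2, 3, 5, 1), (1, 2, 3, 1), (7, 8, 9, 1)]
```

zip stops at the shorter input, so a length mismatch between a and b would silently drop the trailing items.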

zipWithIndex fails in PySpark

I have an RDD like this
>>> termCounts.collect()
[(2, 'good'), (2, 'big'), (1, 'love'), (1, 'sucks'), (1, 'sachin'), (1, 'formulas'), (1, 'batsman'), (1, 'time'), (1, 'virat'), (1, 'modi')]
When I zip this to create a dictionary, it gives me what looks like random output:
>>> vocabulary = termCounts.map(lambda x: x[1]).zipWithIndex().collectAsMap()
>>> vocabulary
{'formulas': 5, 'good': 0, 'love': 2, 'modi': 9, 'big': 1, 'batsman': 6, 'sucks': 3, 'time': 7, 'virat': 8, 'sachin': 4}
Is this the expected output? I wanted to create a dictionary with each word as key and their respective count as value
To map each word to its number of occurrences, swap the elements of each pair:
vocabulary = termCounts.map(lambda x: (x[1], x[0])).collectAsMap()
By the way, the code you have written maps each word to its index (position) in the RDD, not to its count.
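The difference between the two mappings can be seen without Spark. The following is a plain-Python analogue, not PySpark code: enumerate stands in for zipWithIndex, a dict comprehension stands in for collectAsMap, and only the first four pairs from the question are used.

```python
# (count, word) pairs, as returned by termCounts.collect() in the question.
term_counts = [(2, 'good'), (2, 'big'), (1, 'love'), (1, 'sucks')]

# Analogue of map(lambda x: x[1]).zipWithIndex(): word -> position in the sequence.
word_to_index = {word: i for i, (_, word) in enumerate(term_counts)}
print(word_to_index)  # {'good': 0, 'big': 1, 'love': 2, 'sucks': 3}

# Analogue of map(lambda x: (x[1], x[0])): word -> count.
vocabulary = {word: count for count, word in term_counts}
print(vocabulary)  # {'good': 2, 'big': 2, 'love': 1, 'sucks': 1}
```

The indices in the question's output only looked random because the dictionary was printed in a different order from the RDD.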
