Numpy unable to access columns - python-3.x

I'm working on a ML project for which I'm using numpy arrays instead of pandas for faster computation.
When I intend to bootstrap, I wish to subset the columns from a numpy ndarray.
My numpy array looks like this:
np_arr =
[(187., 14.45 , 20.22, 94.49)
(284., 10.44 , 15.46, 66.62)
(415., 11.13 , 22.44, 71.49)]
And I want to index columns 1,3.
I have my columns stored in a list as ix = [1,3]
However, when I try np_arr[:,ix] I get an error saying "too many indices for array".
I also realised that when I print np_arr.shape I only get (3,), whereas I probably want (3,4).
Could you please tell me how to fix my issue.
Thanks!
Edit:
I'm creating my numpy object from my pandas dataframe like this:
def _to_numpy(self, data):
    v = data.reset_index()
    np_res = np.rec.fromrecords(v, names=v.columns.tolist())
    return np_res

The reason for your issue is that the np_arr you have is a 1-D array. Please share your code snippet as well so that the exact cause can be looked into. In general, when dealing with 2-D numpy arrays, we create them like this, and indexing such as a[:, ix] then works:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

You have created a record array (also called a structured array). The result is a 1d array with named columns (fields).
To illustrate:
In [426]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['A','B','C'])
In [427]: df
Out[427]:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
In [428]: arr = df.to_records()
In [429]: arr
Out[429]:
rec.array([(0, 0, 1, 2), (1, 3, 4, 5), (2, 6, 7, 8), (3, 9, 10, 11)],
dtype=[('index', '<i8'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [430]: arr['A']
Out[430]: array([0, 3, 6, 9])
In [431]: arr.shape
Out[431]: (4,)
to_records takes an index parameter; passing index=False leaves out the index field.
Or with your method:
In [432]: arr = np.rec.fromrecords(df, names=df.columns.tolist())
In [433]: arr
Out[433]:
rec.array([(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)],
dtype=[('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [434]: arr['A'] # arr.A also works
Out[434]: array([0, 3, 6, 9])
In [435]: arr.shape
Out[435]: (4,)
And multifield access:
In [436]: arr[['A','C']]
Out[436]:
rec.array([(0, 2), (3, 5), (6, 8), (9, 11)],
dtype={'names':['A','C'], 'formats':['<i8','<i8'], 'offsets':[0,16], 'itemsize':24})
Note that the str display of this array
In [437]: print(arr)
[(0, 1, 2) (3, 4, 5) (6, 7, 8) (9, 10, 11)]
shows a list of tuples, just as your np_arr. Each tuple is a 'record'. The repr display shows the dtype as well.
You can't have it both ways: either access columns by name, or make a regular numpy array and access columns by number. The named/record access makes most sense when columns are a mix of dtypes (string, int, float). If they are all float, and you want to do calculations across columns, it's better to use the numeric dtype.
In [438]: arr = df.to_numpy()
In [439]: arr
Out[439]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
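For the array in the question, the same idea can be sketched without pandas. This is only an illustration: the field names 'a' to 'd' are made up (the question does not show them), and going through tolist() copies the data, which is fine for small arrays but may not be what you want for large ones.

```python
import numpy as np

# Record array similar to the one in the question; field names are assumed.
rec = np.rec.fromrecords(
    [(187., 14.45, 20.22, 94.49),
     (284., 10.44, 15.46, 66.62),
     (415., 11.13, 22.44, 71.49)],
    names=['a', 'b', 'c', 'd'],
)
print(rec.shape)    # 1-D: one entry per record

# Rebuild a plain 2-D float array from the records (this makes a copy).
arr2d = np.array(rec.tolist())
print(arr2d.shape)  # 2-D: rows x columns

# Positional column indexing now works.
ix = [1, 3]
subset = arr2d[:, ix]
print(subset)
```

If you would rather keep the record array, the equivalent selection is by name, for example rec[['b', 'd']].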

Related

Find integer consecutive run starting indices in a list

Example:
nums = [1,2,3,5,10,9,8,9,10,11,7,8,7]
I am trying to find the first index of numbers in consecutive runs of -1 or 1 direction where the runs are >= 3.
So the desired output from the above nums would be:
[0,4,6,7]
I have tried
grplist = [list(group) for group in more_itertools.consecutive_groups(A)]
output: [[1, 2, 3], [5], [10], [9], [8, 9, 10, 11], [7, 8], [7]]
It returns nested lists, but it only seems to handle the +1 direction, and it does not return the starting index.
listindx = [list(j) for i, j in groupby(enumerate(A), key=itemgetter(1))]
output: [[(0, 1)], [(1, 2)], [(2, 3)], [(3, 5)], [(4, 10)], [(5, 9)], [(6, 8)], [(7, 9)], [(8, 10)], [(9, 11)], [(10, 7)], [(11, 8)], [(12, 7)]]
This does not check for consecutive runs but it does return indices.
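One possible reading of the desired output [0,4,6,7] is "every index that starts a window of three numbers stepping by +1 each time, or by -1 each time". Under that assumption (the question does not spell it out), a minimal sketch with a sliding window could look like this. The function name run_starts and the min_len parameter are made up for illustration.

```python
def run_starts(nums, min_len=3):
    """Return indices i where nums[i:i+min_len] steps by +1 or by -1 throughout."""
    starts = []
    for i in range(len(nums) - min_len + 1):
        window = nums[i:i + min_len]
        # Set of differences between neighbours inside the window.
        steps = {b - a for a, b in zip(window, window[1:])}
        if steps == {1} or steps == {-1}:
            starts.append(i)
    return starts


nums = [1, 2, 3, 5, 10, 9, 8, 9, 10, 11, 7, 8, 7]
result = run_starts(nums)
print(result)  # [0, 4, 6, 7]
```

Note that index 7 is reported because 9, 10, 11 is itself a valid window, even though it lies inside the longer run that starts at index 6.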

List of list to get element whose values greater than 3

I have 2 lists, each of size 250000. I want to iterate through the lists and return the values that are greater than 3.
For example:
import itertools
from array import array
import numpy as np
input = (np.array([list([8,1]), list([2,3,4]), list([5,3])],dtype=object), np.array([1,0,0,0,1,1,1]))
X = input[0]
y = input[1]
res = [ u for s in X for u in zip(y,s) ]
res
I don't get the expected output.
Actual res : [(1, 8), (0, 1), (1, 2), (0, 3), (0, 4), (1, 5), (0, 3)]
Expected output 1 : [(8,1), (1,0), (2, 0), (3, 0), (4, 1), (5, 1), (3, 1)]
Expected output 2 : [(8,1), (4, 1), (5, 1)] ---> for greater than 3
I took references from stackoverflow. Tried itertools as well.
Using NumPy to store lists of non-uniform lengths creates a whole lot of issues, like the ones you are seeing. If it were an array of integers, you could simply do
X[X > 3]
but since it is an array of lists, you have to jump through all sorts of hoops to get what you want, and basically lose all the advantages of using NumPy in the first place. You could just as well use lists of lists and skip NumPy altogether.
As an alternative I would recommend using Pandas or something else more suitable than NumPy:
import pandas as pd
df = pd.DataFrame({
'group': [0, 0, 1, 1, 1, 2, 2],
'data': [8, 1, 2, 3, 4, 5, 4],
'flag': [1, 0, 0, 0, 1, 1, 1],
})
df[df['data'] > 3]
# group data flag
# 0 0 8 1
# 4 1 4 1
# 5 2 5 1
# 6 2 4 1
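If you would rather stay with plain lists, here is a minimal sketch that reproduces both expected outputs from the question. It assumes that y holds exactly one flag per value once the sublists of X are flattened; the names pairs and filtered are just for illustration.

```python
from itertools import chain

X = [[8, 1], [2, 3, 4], [5, 3]]
y = [1, 0, 0, 0, 1, 1, 1]

# Flatten the sublists, then pair each value with its flag.
pairs = list(zip(chain.from_iterable(X), y))
print(pairs)     # [(8, 1), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (3, 1)]

# Keep only the pairs whose value is greater than 3.
filtered = [(value, flag) for value, flag in pairs if value > 3]
print(filtered)  # [(8, 1), (4, 1), (5, 1)]
```

The original attempt zipped y against each sublist separately, so y restarted from its first element for every sublist. Flattening first keeps the two sequences aligned.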
Use filter
For example:
input = [1, 3, 2, 5, 6, 7, 8, 22]
# result contains even numbers of the list
result = list(filter(lambda x: x % 2 == 0, input))
This should give you result = [2, 6, 8, 22]
I'm not sure I understand exactly what you're trying to do, but filter is probably a good way.

print list content without duplicate [duplicate]

This question already has answers here:
How can I print the possible combinations of a list in Python?
(3 answers)
Closed 3 years ago.
I have a list containing some contents like this:
numbers = [5,7,9,3,8]
I want to print consecutive elements starting with the first element and see output like this:
5,7
5,9
5,3
5,8
7,9
7,3
7,8
9,3
9,8
3,8
So, in the end, it should print every pair of elements exactly once.
I have tried this
for e in numbers:
    print(numbers[:])
but it gave me
[5, 7, 9, 3, 8]
[5, 7, 9, 3, 8]
[5, 7, 9, 3, 8]
[5, 7, 9, 3, 8]
[5, 7, 9, 3, 8]
How can I solve this?
Thank you
You can use combinations from itertools.
>>> from itertools import combinations
>>> numbers = [5, 7, 9, 3, 8]
>>> list(combinations(numbers, 2))
[(5, 7), (5, 9), (5, 3), (5, 8), (7, 9), (7, 3), (7, 8), (9, 3), (9, 8), (3, 8)]
You can use itertools.combinations
from itertools import combinations
numbers = [5,7,9,3,8]
combos = combinations(numbers, 2)
for combo in combos:
    print(combo)
(5, 7)
(5, 9)
(5, 3)
(5, 8)
(7, 9)
(7, 3)
(7, 8)
(9, 3)
(9, 8)
(3, 8)
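If you want the output in exactly the "5,7" style shown in the question rather than as tuples, a small variation (a sketch; the name lines is just for illustration) is to unpack each pair and format it yourself:

```python
from itertools import combinations

numbers = [5, 7, 9, 3, 8]

# Build one "a,b" string per pair, in the order combinations yields them.
lines = [f"{a},{b}" for a, b in combinations(numbers, 2)]
print("\n".join(lines))
```

This prints ten lines, starting with 5,7 and ending with 3,8.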

Tuple list comprehension

I have a list of tuples containing numbers
list_numbers = [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]
How do I use list comprehension to get a list of the sum of each item in the tuple?
expected_result = [7, 9, 11, 13, 15]
You can just loop through the list and call the sum() function on each tuple.
sums = [sum(t) for t in list_numbers]
> [7, 9, 11, 13, 15]

zipWithIndex fails in PySpark

I have an RDD like this
>>> termCounts.collect()
[(2, 'good'), (2, 'big'), (1, 'love'), (1, 'sucks'), (1, 'sachin'), (1, 'formulas'), (1, 'batsman'), (1, 'time'), (1, 'virat'), (1, 'modi')]
When I zip this to create a dictionary, it gives me some random output
>>> vocabulary = termCounts.map(lambda x: x[1]).zipWithIndex().collectAsMap()
>>> vocabulary
{'formulas': 5, 'good': 0, 'love': 2, 'modi': 9, 'big': 1, 'batsman': 6, 'sucks': 3, 'time': 7, 'virat': 8, 'sachin': 4}
Is this the expected output? I wanted to create a dictionary with each word as key and their respective count as value
To get each word with its occurrence count, write it like this:
vocabulary = termCounts.map(lambda x: (x[1], x[0])).collectAsMap()
BTW, the code you have written maps each word to its position (index) in the list, not to its count.
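To see the difference without Spark, here is a plain-Python analogy (not PySpark code). enumerate plays roughly the role that zipWithIndex plays on the RDD, and the dict comprehensions stand in for collectAsMap. The names term_counts, by_index and by_count are made up for illustration.

```python
term_counts = [(2, 'good'), (2, 'big'), (1, 'love'), (1, 'sucks'), (1, 'sachin'),
               (1, 'formulas'), (1, 'batsman'), (1, 'time'), (1, 'virat'), (1, 'modi')]

# What the original code computes: word -> position in the list.
by_index = {word: i for i, (count, word) in enumerate(term_counts)}

# What was wanted: word -> count, obtained by swapping each pair.
by_count = {word: count for count, word in term_counts}

print(by_index['sachin'])  # 4, its position
print(by_count['good'])    # 2, its count
```

The values in by_index are the same numbers as in the vocabulary output shown in the question; they only look random because they are printed in a different key order.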
