List of arrays without brackets - python-3.x

Given a list of arrays in this format:
[array([[63371.29484043],
[65000. ],
[51114.1118643 ],
[39000. ],
[61549.2893635 ],
[58204.43242583]]), array([[28750. ],
[19166.90102574],
[19667.19108884],
[17250. ]]), array([[32188.01786071],
[33625. ],
[23988.53674308],
[29354.92883394],
[31657.26571235],
[20175. ]])]
I would like to print it as a list without square brackets in it e.g.
a = [18758.98675732, 23418.72996313 ... 20134.77503711]
I can apply .tolist but not strip to get rid of the inner brackets.
How do I do this?
Thanks!

Try to understand the nature of the object before worrying too much about display details. Display follows from the list's structure. Pay special attention to len (for a list) and shape (for an array).
In [119]: alist=[np.array([[63371.29484043],
...: [65000. ],
...: [51114.1118643 ],
...: [39000. ],
...: [61549.2893635 ],
...: [58204.43242583]]), np.array([[28750. ],
...: [19166.90102574],
...: [19667.19108884],
...: [17250. ]]), np.array([[32188.01786071],
...: [33625. ],
...: [23988.53674308],
...: [29354.92883394],
...: [31657.26571235],
...: [20175. ]])]
You have a list of arrays that differ in shape:
In [120]: len(alist)
Out[120]: 3
In [121]: [x.shape for x in alist]
Out[121]: [(6, 1), (4, 1), (6, 1)]
You could flatten each array, producing arrays of shape (6,), (4,) and (6,):
In [122]: [x.ravel() for x in alist]
Out[122]:
[array([63371.29484043, 65000. , 51114.1118643 , 39000. ,
61549.2893635 , 58204.43242583]),
array([28750. , 19166.90102574, 19667.19108884, 17250. ]),
array([32188.01786071, 33625. , 23988.53674308, 29354.92883394,
31657.26571235, 20175. ])]
hstack can join them into one array. Use .tolist() if you want a list as the final result:
In [123]: np.hstack(_)
Out[123]:
array([63371.29484043, 65000. , 51114.1118643 , 39000. ,
61549.2893635 , 58204.43242583, 28750. , 19166.90102574,
19667.19108884, 17250. , 32188.01786071, 33625. ,
23988.53674308, 29354.92883394, 31657.26571235, 20175. ])
Since the arrays differ only in the first dimension, we could also use:
In [127]: np.vstack(alist).shape
Out[127]: (16, 1)
In [128]: np.vstack(alist).ravel()
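Putting the steps together, here is a minimal sketch that goes from the list of column arrays to a flat Python list and prints it without any brackets. The names flat and text are only illustrative, and np.concatenate is used here in place of hstack (for 1-D inputs the two give the same result).

```python
import numpy as np

# The list of (n, 1) arrays from the question.
alist = [
    np.array([[63371.29484043], [65000.0], [51114.1118643],
              [39000.0], [61549.2893635], [58204.43242583]]),
    np.array([[28750.0], [19166.90102574], [19667.19108884], [17250.0]]),
    np.array([[32188.01786071], [33625.0], [23988.53674308],
              [29354.92883394], [31657.26571235], [20175.0]]),
]

# Flatten each array to 1-D, join them, and convert to plain Python floats.
flat = np.concatenate([x.ravel() for x in alist]).tolist()

# Join the values into one string to print them with no brackets at all.
text = ", ".join(str(v) for v in flat)
print(text)
```

Printing flat directly would still show the outer pair of brackets, since that is how Python displays any list.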

You can use map to get the first number of each sublist and flatten the array, then turn it into a list with list:
list(map(lambda l : l[0], a))
Output:
[18758.98675732, 23418.72996313, 23625.0, 14175.0, 21015.48300191, 20134.77503711]
You can also use numpy's multidimensional array indexing to extract only the first number of each sublist:
list(a[:, 0])
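As a small sketch of the indexing approach, assume a is a single (6, 1) array holding the values from the output above (the array itself is not shown in this answer, so it is reconstructed here for illustration only):

```python
import numpy as np

# Hypothetical (6, 1) array standing in for `a`; values taken from the output above.
a = np.array([[18758.98675732], [23418.72996313], [23625.0],
              [14175.0], [21015.48300191], [20134.77503711]])

# Select column 0 of every row: shape goes from (6, 1) to (6,).
first_col = a[:, 0]

# .tolist() converts to built-in Python floats.
as_list = first_col.tolist()
print(as_list)
```

Note that list(a[:, 0]) also works, but its elements stay NumPy scalars, whereas .tolist() yields plain Python floats.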

NumPy has a built-in function for flattening, ravel:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html
list(a[0].ravel())


Understanding L2-norm output for 3D tensor - TensorFlow2

For Python 3.8 and TensorFlow 2.5, I have a 3-D tensor of shape (3, 3, 3) where the goal is to compute the L2-norm for each of the three (3, 3) square matrices. The code that I came up with is:
a = tf.random.normal(shape = (3, 3, 3))
a.shape
# TensorShape([3, 3, 3])
a.numpy()
'''
array([[[-0.30071023, 0.9958398 , -0.77897555],
[-1.4251901 , 0.8463568 , -0.6138699 ],
[ 0.23176959, -2.1303613 , 0.01905925]],
[[-1.0487134 , -0.36724553, -1.0881581 ],
[-0.12025198, 0.20973174, -2.1444907 ],
[ 1.4264063 , -1.5857363 , 0.31582597]],
[[ 0.8316077 , -0.7645084 , 1.5271858 ],
[-0.95836663, -1.868056 , -0.04956183],
[-0.16384012, -0.18928945, 1.04647 ]]], dtype=float32)
'''
I am using axis = 2 since the 3rd axis should contain three 3x3 square matrices. The output I get is:
tf.math.reduce_euclidean_norm(input_tensor = a, axis = 2).numpy()
'''
array([[1.299587 , 1.7675754, 2.1430166],
[1.5552354, 2.158075 , 2.15614 ],
[1.8995634, 2.1001325, 1.0759989]], dtype=float32)
'''
How are these values computed? The formula for computing L2-norm is this. What am I missing?
Also, I was expecting three L2-norm values, one for each of the three (3, 3) matrices. The code I have to achieve this is:
tf.math.reduce_euclidean_norm(a[0]).numpy()
# 3.0668826
tf.math.reduce_euclidean_norm(a[1]).numpy()
# 3.4241767
tf.math.reduce_euclidean_norm(a[2]).numpy()
# 3.0293021
Is there a better way to get this without explicitly referring to each index of tensor 'a'?
Thanks!
The formula you linked for computing the L2 norm looks correct. What you have is basically this:
np.sqrt(np.sum((a[0]**2)))
# 3.0668826
np.sqrt(np.sum((a[1]**2)))
# 3.4241767
np.sqrt(np.sum((a[2]**2)))
# 3.0293021
This can be vectorized by the following:
np.sqrt(np.sum(a**2, axis=(1,2)))
Output:
array([3.0668826, 3.4241767, 3.0293021], dtype=float32)
Which is effectively the same as using np.linalg.norm (or tf.math.reduce_euclidean_norm if you want to use TensorFlow):
np.linalg.norm(a, ord=None, axis=(1,2))
Output:
array([3.0668826, 3.4241767, 3.0293021], dtype=float32)
With the default ord=None and a 2-tuple axis, np.linalg.norm computes the Frobenius norm, which is the L2 norm of each flattened matrix. The axis keyword specifies which dimensions are reduced, which should be clear from the first code snippet.
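To see where the (3, 3) output in the question comes from, here is a NumPy-only sketch using the tensor values printed in the question. It compares reducing over axis=2 alone with reducing over both matrix axes; the variable names row_norms and matrix_norms are illustrative.

```python
import numpy as np

# Tensor values copied from the question (what a.numpy() printed).
a = np.array([
    [[-0.30071023, 0.9958398, -0.77897555],
     [-1.4251901, 0.8463568, -0.6138699],
     [0.23176959, -2.1303613, 0.01905925]],
    [[-1.0487134, -0.36724553, -1.0881581],
     [-0.12025198, 0.20973174, -2.1444907],
     [1.4264063, -1.5857363, 0.31582597]],
    [[0.8316077, -0.7645084, 1.5271858],
     [-0.95836663, -1.868056, -0.04956183],
     [-0.16384012, -0.18928945, 1.04647]],
], dtype=np.float32)

# axis=2 reduces only the last axis: one norm per length-3 row, shape (3, 3).
row_norms = np.sqrt(np.sum(a**2, axis=2))

# axis=(1, 2) reduces both matrix axes: one norm per 3x3 matrix, shape (3,).
matrix_norms = np.sqrt(np.sum(a**2, axis=(1, 2)))

print(row_norms)
print(matrix_norms)
```

So axis=2 produced nine values because each one is the norm of a single row vector, not of a whole matrix. In TensorFlow, passing axis=[1, 2] to tf.math.reduce_euclidean_norm should give the per-matrix result in the same way; that call is not exercised in this sketch.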

Map coordinate ids to coordinate values

I have a list of coordinate ids (nodes of a graph).
edge_list =
[(0, 1),
(2, 3),
(4, 3)]
And the coordinates of these nodes are stored in an n-dimensional NumPy array
position =
array([[[ -3.17113447, -16.9386692 , 16.73578644],
[ 8.19985676, 4.89544773, 21.26950455]],
[[ -8.96962166, -2.78070927, 54.1053009 ],
[ -0.1561521 , -3.05777478, 41.8996582 ]],
[[-13.20408821, -4.88086224, 46.99597549],
[ -0.1561521 , -3.05777478, 41.8996582 ]]], dtype=float32)
The above data is not easy to access and has duplicates. I want to transform it to the following format
df =
node x y z
0 -3.17113447 -16.9386692 16.73578644
1 8.19985676 4.89544773 21.26950455
2 -8.96962166 -2.78070927 54.1053009
3 -0.1561521 -3.05777478 41.8996582
4 -13.20408821 -4.88086224 46.99597549
To obtain the above dataframe, I first tried to convert the coordinates in position to a dictionary
for i in range(len(edge_list)):
map[f'edge{i}'] = position[0]
{'edge0': array([[ -3.17113447, -16.9386692 , 16.73578644],
[ 8.19985676, 4.89544773, 21.26950455]], dtype=float32),
'edge1': array([[ -3.17113447, -16.9386692 , 16.73578644],
[ 8.19985676, 4.89544773, 21.26950455]], dtype=float32),
'edge3': array([[ -3.17113447, -16.9386692 , 16.73578644],
[ 8.19985676, 4.89544773, 21.26950455]], dtype=float32)}
I'm not really sure how to proceed from here.
Any suggestions will be really helpful
You can map your edges to a single, unique number in the following way. If you have N nodes, think of an edge as an element on a (N by N) array. And mapping position (i,j) to a single number is a very classic trick. In your case, the unique index is index = i*N+j.
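A minimal sketch of both ideas, using the data from the question: first the node table is deduplicated with np.unique (this step is not part of the answer above, it is one possible way to get one row per node), then each edge is encoded as i*N+j and decoded again with divmod. It assumes node ids run from 0 to N-1, as they do here; all variable names are illustrative.

```python
import numpy as np

edge_list = [(0, 1), (2, 3), (4, 3)]
position = np.array([
    [[-3.17113447, -16.9386692, 16.73578644],
     [8.19985676, 4.89544773, 21.26950455]],
    [[-8.96962166, -2.78070927, 54.1053009],
     [-0.1561521, -3.05777478, 41.8996582]],
    [[-13.20408821, -4.88086224, 46.99597549],
     [-0.1561521, -3.05777478, 41.8996582]],
], dtype=np.float32)

# One entry per edge endpoint: node ids and their (x, y, z) rows line up.
node_ids = np.asarray(edge_list).ravel()   # [0 1 2 3 4 3]
coords = position.reshape(-1, 3)           # shape (6, 3)

# Keep the first occurrence of each node id to drop duplicate coordinates.
unique_ids, first_idx = np.unique(node_ids, return_index=True)
node_xyz = coords[first_idx]               # shape (5, 3), row k belongs to node unique_ids[k]

# Encode each edge (i, j) as a single integer i*N + j, and decode it back.
n_nodes = len(unique_ids)
edge_index = [i * n_nodes + j for i, j in edge_list]
decoded = [divmod(k, n_nodes) for k in edge_index]

print(unique_ids)
print(node_xyz)
print(edge_index, decoded)
```

If pandas is available, pandas.DataFrame(node_xyz, index=unique_ids, columns=["x", "y", "z"]) would turn the result into the table shown in the question.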

PySpark: how to aggregate over column arrays with variable width?

I am attempting to aggregate and create an array of means thus (this is a Minimal Working Example):
n = len(allele_freq_total.select("alleleFrequencies").first()[0])
allele_freq_by_site = allele_freq_total.groupBy("contigName", "start", "end", "referenceAllele").agg(
array(*[mean(col("alleleFrequencies")[i]) for i in range(n)]).alias("mean_alleleFrequencies")
)
using a solution that I got from
Aggregate over column arrays in DataFrame in PySpark?
but the problem is that n is variable, how do I alter
array(*[mean(col("alleleFrequencies")[i]) for i in range(n)])
so that it takes variable length into consideration?
With arrays of unequal size in the different groups (for you, a group is ("contigName", "start", "end", "referenceAllele"), which I'll simply rename to group), you could consider exploding the array column (the alleleFrequencies), with introduction of the position the values had within the arrays. That will give you an additional column you can use in grouping to compute the average you had in mind. At this point you might actually have enough for further computations (see df3.show() below).
If you really must have it back as an array, that's harder and I don't have a direct solution. The order must be tracked, and I believe that's easy with a map (a dictionary, if you like). To do so, I use the aggregation function collect_list on two columns. While collect_list isn't deterministic (you don't know the order in which values will be returned in the list, because rows are shuffled), the aggregation over both arrays will preserve their order, as the rows get shuffled in their entirety (see df4.show(), below). From there, you can create a mapping of the position to the average with map_from_arrays.
Example:
>>> from pyspark.sql.functions import mean, col, posexplode, collect_list, map_from_arrays
>>>
>>> df = spark.createDataFrame([
... ("A", [0, 1, 2]),
... ("A", [0, 3, 6]),
... ("B", [1, 2, 4, 5]),
... ("B", [1, 2, 6, 1])],
... schema=("group", "values"))
>>> df2 = df.select(df.group, posexplode(df.values)) # adds the "pos" and "col" columns
>>> df3 = (df2
... .groupBy("group", "pos")
... .agg(mean(col("col")).alias("avg_of_positions"))
... )
>>> df4 = (df3
... .groupBy("group")
... .agg(
... collect_list("pos").alias("pos"),
... collect_list("avg_of_positions").alias("avgs")
... )
... )
>>> df5 = df4.select(
... "group",
... map_from_arrays(col("pos"), col("avgs")).alias("positional_averages")
... )
>>> df5.show(truncate=False)
+-----+----------------------------------------+
|group|positional_averages |
+-----+----------------------------------------+
|B |[0 -> 1.0, 1 -> 2.0, 3 -> 3.0, 2 -> 5.0]|
|A |[0 -> 0.0, 1 -> 2.0, 2 -> 4.0] |
+-----+----------------------------------------+
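For readers without a Spark session at hand, the following plain-Python sketch emulates what the pipeline above computes on the same example data. It is not Spark code: the dictionaries stand in for the posexplode, groupBy and mean steps, and the final sort by position is one assumed way of turning the map back into an ordered list.

```python
from collections import defaultdict

rows = [
    ("A", [0, 1, 2]),
    ("A", [0, 3, 6]),
    ("B", [1, 2, 4, 5]),
    ("B", [1, 2, 6, 1]),
]

# Emulates posexplode: one record per array element, keyed by (group, pos).
by_group_pos = defaultdict(list)
for group, values in rows:
    for pos, value in enumerate(values):
        by_group_pos[(group, pos)].append(value)

# Emulates groupBy("group", "pos") followed by mean.
avg_by_group_pos = {key: sum(v) / len(v) for key, v in by_group_pos.items()}

# Collect a pos -> average map per group, then order it by position.
positional_averages = defaultdict(dict)
for (group, pos), avg in avg_by_group_pos.items():
    positional_averages[group][pos] = avg
as_arrays = {g: [m[p] for p in sorted(m)] for g, m in positional_averages.items()}

print(as_arrays)
```

Groups with arrays of different widths are handled naturally, because each position is averaged only over the rows that actually have it.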

Using reduceByKey method in Pyspark to update a dictionary

I have the following rdd data.
[(13, 'Munich#en'), (13, 'Munchen#de'), (14, 'Vienna#en'), (14, 'Wien#de'),(15, 'Paris#en')]
I want to combine the above RDD using the reduceByKey method to produce the following output, i.e. to join the entries into a dictionary keyed by each entry's language.
[
(13, {'en':'Munich','de':'Munchen'}),
(14, {'en':'Vienna', 'de': 'Wien'}),
(15, {'en':'Paris', 'de':''})
]
The examples for reduceByKey were all numerical operations such as addition, so I am not very sure how to go about updating a dictionary in each reduce step.
This is my code:
rd0 = sc.parallelize(
[(13, 'munich#en'),(13, 'munchen#de'), (14, 'Vienna#en'),(14,'Wien#de'),(15,'Paris#en')]
)
def updateDict(x,xDict):
xDict[x[:-3]]=x[-2:]
rd0.map(lambda x: (x[0],(x[1],{'en':'','de':''}))).reduceByKey(updateDict).collect()
I am getting the following error message but not sure what I am doing wrong.
return f(*args, **kwargs)
File "<ipython-input-209-16cfa907be76>", line 2, in ff
TypeError: 'tuple' object does not support item assignment
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
There are some problems with your code - for instance, your updateDict does not return a value. Here is a different approach:
First, map the values into dictionaries. One way is to split on "#", reverse, and pass the result into the dict constructor.
rd1 = rd0.mapValues(lambda x: dict([reversed(x.split("#"))]))
print(rd1.collect())
#[(13, {'en': 'munich'}),
# (13, {'de': 'munchen'}),
# (14, {'en': 'Vienna'}),
# (14, {'de': 'Wien'}),
# (15, {'en': 'Paris'})]
Now you can call reduceByKey and merge the two dictionaries. Finally add in the missing keys with a dictionary comprehension over the required keys, defaulting to empty string if the key is missing.
def merge_two_dicts(x, y):
# from https://stackoverflow.com/a/26853961/5858851
# works for python 2 and 3
z = x.copy() # start with x's keys and values
z.update(y) # modifies z with y's keys and values & returns None
return z
rd2 = rd1.reduceByKey(merge_two_dicts)\
.mapValues(lambda x: {k: x.get(k, '') for k in ['en', 'de']})
print(rd2.collect())
#[(14, {'de': 'Wien', 'en': 'Vienna'}),
# (13, {'de': 'munchen', 'en': 'munich'}),
# (15, {'de': '', 'en': 'Paris'})]
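The merge logic can be checked locally without Spark. The sketch below emulates reduceByKey with a dictionary of lists and functools.reduce; it only illustrates the property that matters, namely that the reducing function takes two values of the same type and returns the merged value. The helper to_lang_dict is a hypothetical name for the mapValues step.

```python
from collections import defaultdict
from functools import reduce

pairs = [(13, 'munich#en'), (13, 'munchen#de'), (14, 'Vienna#en'),
         (14, 'Wien#de'), (15, 'Paris#en')]

def to_lang_dict(value):
    # 'munich#en' -> {'en': 'munich'}
    name, lang = value.split('#')
    return {lang: name}

def merge_two_dicts(x, y):
    # Returns a new dict; neither input is modified.
    z = x.copy()
    z.update(y)
    return z

# Group the mapped values by key, as reduceByKey would see them.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(to_lang_dict(value))

# Reduce each group pairwise, then fill in missing languages with ''.
merged = {key: reduce(merge_two_dicts, dicts) for key, dicts in grouped.items()}
result = {key: {lang: d.get(lang, '') for lang in ('en', 'de')}
          for key, d in merged.items()}

print(result)
```

The original updateDict failed for two reasons this sketch avoids: it returned None, and it tried to assign into one of the tuples passed in by reduceByKey rather than into a dictionary.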

Keras 'Error when checking input' when trying to predict multiple values

I have a net with a length 4 input vector, length 2 output vector. I am trying to predict multiple inputs simultaneously. If I just want to predict one, I would do the following and it works:
inputs = numpy.array( [ [1,2,3,4] ] )  # `in` is a reserved word in Python, so another name is used
self.model.predict(inputs)
# prediction = [ [1,2] ]
However, when I try to pass in multiple inputs I get ValueError: Error when checking input: expected dense_1_input to have shape (4,) but got array with shape (1,)
inputs = numpy.array( [
[1,2,3,4],
[1,2,3,4]
]
)
#OR
inputs = numpy.array( [
[ [1,2,3,4] ],
[ [1,2,3,4] ]
]
)
self.model.predict(inputs)
#ERR
What am I doing wrong?
Edit:
Code =
model = Sequential()
model.add(Dense(24, input_dim=4, activation='relu'))
model.add(Dense(24, activation='relu'))
model.add(Dense(4, activation='linear'))
model.compile(loss='mse',
optimizer=Adam(lr=self.learning_rate))
print(batch_arr[:,3][0])
predictions = self.model.predict(batch_arr[:,3][0])
print(predictions)
print(batch_arr[:,3])
predictions = model.predict(batch_arr[:,3])
Output =
[[-0.00441936 -0.20398824 -0.08134908 0.09739554]]
[[ 0.01860509 -0.01136071]]
[array([[-0.00441936, -0.20398824, -0.08134908, 0.09739554]])
array([[-0.00517939, 0.38975933, -0.11951023, -0.9718224 ]])
array([[0.00272119, 0.0025476 , 0.002645 , 0.03973542]])
array([[-0.00421809, -0.01006362, -0.07795483, -0.16971247]])
array([[-0.00904593, 0.19332681, -0.10655871, -0.64757587]])
array([[ 0.00654432, 0.00347247, -0.15332555, -0.47302148]])
array([[-0.01921821, -0.17354519, -0.20207744, -0.58569029]])
array([[ 0.00661377, 0.20038962, -0.16278598, -0.80983334]])
array([[-0.00348096, 0.18171964, -0.07072813, -0.38913168]])
array([[-0.01268919, -0.00548544, -0.08286095, -0.27108632]])
array([[ 0.01077598, -0.19254374, -0.004982 , 0.33175341]])
array([[-4.37101750e-04, -5.68196965e-01, -1.99532537e-01,
1.10581883e-01]])
array([[ 0.00657382, -0.19263146, -0.00402872, 0.33368607]])
array([[ 0.00677398, 0.19760551, -0.00076944, -0.25153403]])
array([[ 0.00261579, 0.19642629, -0.13894668, -0.71894379]])
array([[-0.0221003 , 0.37477368, -0.03765055, -0.63564477]])
array([[-0.0110009 , 0.37599703, -0.0574645 , -0.66318148]])
array([[ 0.00277214, 0.19763152, 0.00343971, -0.25211181]])
array([[-9.31810654e-05, -2.06245307e-01, -8.09019674e-02,
1.47356796e-01]])
array([[ 0.00709025, -0.37636771, -0.19725323, -0.11396513]])
array([[ 0.00015344, -0.01233088, -0.07851076, -0.11956039]])
array([[ 0.01077811, -0.18439307, -0.19043179, -0.34107231]])
array([[-0.01460483, 0.18019651, -0.05036345, -0.35505252]])
array([[-0.0127989 , 0.19071515, -0.08828268, -0.58871071]])
array([[ 0.01072609, 0.00249456, -0.00580012, 0.0409061 ]])
array([[ 0.01062156, 0.00782762, -0.17898265, -0.57245695]])
array([[-0.01180104, -0.37085843, -0.1973209 , -0.23782701]])
array([[-0.00849912, -0.00780031, -0.07940117, -0.21980343]])
array([[ 0.00672477, 0.00246062, -0.00160252, 0.04165408]])
array([[-0.02268911, -0.36534914, -0.21379125, -0.36284594]])
array([[-0.00865513, -0.20170279, -0.08379724, 0.0468145 ]])
array([[-0.0256848 , 0.17922475, -0.03098346, -0.33335449]])]
#ERR
Edit: When I print out the shape of batch_arr[:,3] I get (32,), not (32,4) as I expected. Thus I'm guessing the numpy array does not know the shape of its inner arrays. Is there an easy way to fix that? It might be the root of the problem.
The issue was the way that I had created my numpy array. I created it with indices of variable size, and thus it didn't know it was shaped (32,4), only that it was (32,). Reformulating the logic to ensure that the array is always a set width from the beginning allowed the array to be a (32,4), which allowed the prediction to work.
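The shape problem can be reproduced with NumPy alone. In this sketch, column is a hypothetical stand-in for batch_arr[:,3]: an object array whose 32 elements are each a (1, 4) array. Stacking the elements is one possible way to get a regular (32, 4) float array, as an alternative to restructuring how the batch is built.

```python
import numpy as np

rng = np.random.default_rng(0)

# 32 individual states, each of shape (1, 4), as in the printed output.
states = [rng.normal(size=(1, 4)) for _ in range(32)]

# An object array only knows it holds 32 Python objects, not their inner shape.
column = np.empty(len(states), dtype=object)
for k, state in enumerate(states):
    column[k] = state
print(column.shape)   # (32,)

# Stack the inner arrays row-wise into one regular float array.
batch = np.vstack(list(column))
print(batch.shape)    # (32, 4)
```

An array like batch has the (n_samples, 4) layout that the error message says the first Dense layer expects.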
