Getting a 2D array from 2 1D arrays (Python) - python-3.x

I have two arrays, x and y, from which I can create a scatter plot without any problem.
However, I want to create a 1024 x 1024 array that has zeros everywhere except where there is a point in my scatter plot.
I presume I need to use a for loop but I am a bit confused about how to go about it. As you can tell I am very much a beginner.
This is the scatter plot - I need to get an array that has 1s everywhere there is a dot and 0s everywhere else:
As requested, here is the code that I have currently. Originally I thought that I would have to loop through each column or row and then for each element decide whether it needed to be 0 or 1. But then I discovered I can just make each element 1 by indexing so have done that.
bpixdata = BPIXTAB[1].data
x = bpixdata['PIX1']
y = bpixdata['PIX2']
value = bpixdata['VALUE']
bpixarray = np.zeros([1024,1024])
bpixarray[y,x] = 1
# plt.figure('x & y scatter plot')
# plt.xlim([0,1024])
# plt.ylim([0,1024])
# plt.scatter(x,y,s=1)
# plt.figure('1024 x 1024 array')
# plt.xlim([0,1024])
# plt.ylim([0,1024])
# plt.imshow(bpixarray)
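A minimal sketch of the indexing approach described above, using made-up coordinates in place of the PIX1 and PIX2 columns (the values here are assumptions, not the real table data). It shows that rows are indexed by y and columns by x, and that no loop is needed:

```python
import numpy as np

# Made-up pixel coordinates standing in for bpixdata['PIX1'] and bpixdata['PIX2']
x = np.array([0, 5, 5, 1023])
y = np.array([0, 2, 2, 1023])

# Start from all zeros, then set every (row=y, col=x) position to 1 in one step
bpixarray = np.zeros((1024, 1024), dtype=np.uint8)
bpixarray[y, x] = 1  # repeated points simply stay 1

print(bpixarray.sum())         # number of distinct flagged pixels
print(np.argwhere(bpixarray))  # their (row, col) positions
```

This only works if x and y hold integers in the range 0 to 1023. If the coordinates in the table turn out to be 1-based, subtracting 1 before indexing would be needed. When comparing with the scatter plot, plt.imshow(bpixarray, origin='lower') puts row 0 at the bottom, which matches the scatter plot's orientation.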

Related

How to add signal dots on time-series plot for Python Pandas dataframe?

I have a dataframe with time-series data on two variables Reported_Cases and Horizontal_Threshold, here is my graph code:
def time_series_graph_horizontal_threshold(df, x_var, y_var):
    plt.figure(figsize=(10, 6))
    plt.grid(True)
    df.plot(x='Year_Week', y=['Reported_Cases', 'Horizontal_Threshold'])
    plt.show()
Which generates this graph
How can I add positive signals on the graph such that when the Reported_Cases is higher than the Horizontal_Threshold, it will show green signal dots across the graph? We can assume I have another column named Positive_Signal which is binary (0, 1=above).
First draw your plot, saving the result (the axes object):
ax = df.plot(x='Year_Week', y=['Reported_Cases', 'Horizontal_Threshold'], grid=True, rot=45)
There is no need to call plt.grid(True) separately. You may also want to add
parameters for the figure size.
Then impose green points on it:
ht = df.iloc[0].Horizontal_Threshold
dotY = 280 # Y coordinate of green points
hDist = 2 # Number of weeks between green points
for idx, rc in df.Reported_Cases.items():
    if idx % hDist == 0 and rc > ht:
        ax.plot(idx, dotY, '.g')
In writing the above code I assumed that your DataFrame's index consists of consecutive integers.
You may want to set other values for dotY and hDist. hDist depends
on the number of source rows and on the desired "density" of these points.
For my test data containing 40 rows (weeks) I got:
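As a variation on the loop above, the dots can also be selected with the Positive_Signal column mentioned in the question and drawn in a single call. This is a sketch on made-up data (the numbers, the week labels and the choice to place the dots on the Reported_Cases values are all assumptions), and it also relies on the index being consecutive integers:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Made-up data in the shape the question describes
df = pd.DataFrame({
    "Year_Week": [f"2020-{w:02d}" for w in range(1, 11)],
    "Reported_Cases": [5, 12, 30, 18, 9, 25, 40, 8, 22, 3],
    "Horizontal_Threshold": [20] * 10,
})
# Binary column: 1 where the cases exceed the threshold
df["Positive_Signal"] = (df["Reported_Cases"] > df["Horizontal_Threshold"]).astype(int)

ax = df.plot(x="Year_Week", y=["Reported_Cases", "Horizontal_Threshold"], grid=True, rot=45)

# With a string x column, pandas plots against positions 0..n-1,
# so the integer index of the selected rows gives the x coordinates
signals = df[df["Positive_Signal"] == 1]
ax.plot(signals.index, signals["Reported_Cases"], "go", linestyle="none", label="Positive signal")
ax.legend()
```

Calling plt.show() (after importing matplotlib.pyplot as plt, with an interactive backend) would display the figure.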

What's a potentially better algorithm to solve this python nested for loop than the one I'm using?

I have a nested loop that has to loop through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, each of which has an X,Y location in 2D space. A window of length 10 moves through all 1M rows one by one until all the calculations are done.
Explaining what the code is supposed to do:
Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different circles of investigation in our 2D plane (X-Y plane).
For each 10 points/rows, for every single diameter in r_test, we compare the distance between every point and the remaining 9 points, and if the value is less than R we add 2 to H. Then we calculate H/(N**2) and store it in c_10 at the index corresponding to that diameter of investigation.
For these first 10 points, once the loop has gone through all the diameters in r_test, we read the slope of the fitted line and save it to S_wind[ii]. So the first 9 data points will have no value calculated for them and are given np.inf so they can be distinguished later.
Then the window moves one point down the rows and repeats this process until S_wind is completed.
What's a potentially better algorithm to solve this than the one I'm using, in Python 3.x?
Many thanks in advance!
import numpy as np
import pandas as pd
####generating input data frame
df = pd.DataFrame(data = np.random.randint(2000, 6000, (1000000, 2)))
df.columns= ['X','Y']
####====creating upper and lower bound for the diameter of the investigation circles
x_range =max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range,y_range)/20
d = 2
N = 10 #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2= 800
r_test = np.arange(r1, r2, 5)
S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range (10,len(df['X'])): #### maybe the code run slower because of using len() function instead of a number
    c_10 = np.zeros(len(r_test)) +np.inf
    H = 0
    C = 0
    N = 10 ##### maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range (ii-10,ii):
            for j in range(ii-10,ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2+ (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 1
        c_10[ind] = (H/(N**2))
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops are comparing ten 2D points against each other in pairs. That's just dying for using a 10x10x2 numpy array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format that I'd find most useful, then converted into something less immediately useful for numeric operations!
Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or equivalently, move the H = 0 line to in between the for ind and the for i loops in your original code).
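To make the broadcasting step and the cumsum remark concrete, here is a small sketch on a single made-up window of 10 points (the random seed and the variable names per_radius, loop_counts and running are assumptions for illustration). It checks that the broadcast count per radius equals what explicit loops give when the counter is reset for each radius, and shows that the running total is just the cumulative sum:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.integers(2000, 6000, size=(10, 2)).astype(float)  # one window of (X, Y) rows
r_test = np.arange(80, 800, 5)

# Pairwise squared distances via broadcasting: (10, 1, 2) - (1, 10, 2) -> (10, 10, 2)
diff = points[:, None, :] - points[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# For each radius, count ordered pairs (i, j), including i == j, closer than that radius
per_radius = (sq_dist[None, :, :] < (r_test ** 2)[:, None, None]).sum(axis=(1, 2))

# The same count with explicit loops, resetting the counter for each radius
loop_counts = np.zeros(len(r_test), dtype=int)
for k, r in enumerate(r_test):
    for i in range(10):
        for j in range(10):
            dist = np.sqrt((points[i, 0] - points[j, 0]) ** 2 + (points[i, 1] - points[j, 1]) ** 2)
            if r - dist > 0:
                loop_counts[k] += 1

# Never resetting the counter between radii corresponds to a cumulative sum
running = per_radius.cumsum()

print(np.array_equal(per_radius, loop_counts))
```

Since every point is within any positive radius of itself, each entry of per_radius is at least 10, so the logarithm taken before the line fit is always defined.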

How to get the set difference of two 2d numpy arrays, or equivalent of np.setdiff1d in a 2d array?

In the question "Get intersecting rows across two 2D numpy arrays", the intersecting rows were obtained using the function np.intersect1d. So I changed the function to use np.setdiff1d to get the set difference, but it doesn't work properly. The following is the code.
def set_diff2d(A, B):
    nrows, ncols = A.shape
    dtype={'names':['f{}'.format(i) for i in range(ncols)],
           'formats':ncols * [A.dtype]}
    C = np.setdiff1d(A.view(dtype), B.view(dtype))
    return C.view(A.dtype).reshape(-1, ncols)
The following data is used for checking the issue:
min_dis=400
Xt = np.arange(50, 3950, min_dis)
Yt = np.arange(50, 3950, min_dis)
Xt, Yt = np.meshgrid(Xt, Yt)
Xt[::2] += min_dis // 2  # integer division keeps the in-place add valid for the integer array
# This is the super set
turbs_possible_locs = np.vstack([Xt.flatten(), Yt.flatten()]).T
# This is the subset
subset = turbs_possible_locs[np.random.choice(turbs_possible_locs.shape[0],50, replace=False)]
diffs = set_diff2d(turbs_possible_locs, subset)
diffs is supposed to have a shape of 50x2, but it is not.
Ok, so to fix your issue try the following tweak:
def set_diff2d(A, B):
    nrows, ncols = A.shape
    dtype={'names':['f{}'.format(i) for i in range(ncols)], 'formats':ncols * [A.dtype]}
    C = np.setdiff1d(A.copy().view(dtype), B.copy().view(dtype))
    return C
The problem was that A, after .view(...) was applied, was broken in half: it had 2 tuple columns instead of 1, whereas B had 1. Applying the dtype is meant to collapse the 2 columns into a single tuple per row, which is why the intersection could be done in 1d in the first place.
Quoting the documentation:
"a.view(some_dtype) or a.view(dtype=some_dtype) constructs a view of the array’s memory with a different data-type. This can cause a reinterpretation of the bytes of memory."
Source: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.view.html
I think the "reinterpretation" is exactly what happened - hence for the sake of simplicity I would just .copy() the array.
NB however I wouldn't square it - it's always A which gets 'broken' - whether it's an assignment, or inline B is always fine...
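A likely explanation for why only A is affected is memory layout: np.vstack([...]).T produces an array that is not C-contiguous, while the fancy-indexed subset is a fresh C-contiguous copy, and .copy() also returns a C-ordered array. The sketch below makes that step explicit with np.ascontiguousarray and converts the result back to a 2D array. The function name set_diff_rows and the small test grid are assumptions for illustration, not part of the original code:

```python
import numpy as np

def set_diff_rows(a, b):
    """Rows of a that do not appear in b (sorted, duplicates removed)."""
    # Force C order so that each row occupies one contiguous run of bytes
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b, dtype=a.dtype)
    ncols = a.shape[1]
    # One structured element per row, so the 1d set routines compare whole rows
    row_dtype = np.dtype([(f"f{i}", a.dtype) for i in range(ncols)])
    a_rows = a.view(row_dtype).ravel()
    b_rows = b.view(row_dtype).ravel()
    diff = np.setdiff1d(a_rows, b_rows)
    # Back from structured rows to a plain (n, ncols) array
    return diff.view(a.dtype).reshape(-1, ncols)

# Small grid built the same way as in the question (transposed vstack)
x_grid, y_grid = np.meshgrid(np.arange(3), np.arange(2))
superset = np.vstack([x_grid.ravel(), y_grid.ravel()]).T
subset = superset[[1, 4]]

remaining = set_diff_rows(superset, subset)
print(superset.flags.c_contiguous)  # layout of the transposed array
print(remaining)
```

On the question's data this should give the 50x2 result that was expected, since the superset has 100 rows and the subset 50 distinct ones.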

Change the mask for few numpy arrays

I have a few masked arrays as input. For computation I use slices: top left, top right, bottom left and bottom right:
dy0 = dy__[:-1, :-1]
dy1 = dy__[:-1, 1:]
dy2 = dy__[1:, 1:]
dy3 = dy__[1:, :-1]
The same is done with dx and g values.
To compute sums or differences correctly I need to make the mask the same for all of them. For now I convert the masks of the 4 arrays to int, sum them, and check whether the sum is more than one. So if more than one of the four elements is masked, I mask the output.
import functools
mask_sum = functools.reduce(lambda x1, x2: x1.astype('int') + x2.astype('int'), list_of_masks)
mask = mask_sum > 1 # mask output if more than 1 input is masked
But when I assign masks like dy0.mask = new_mask, they don't change.
Also, when I replace 0 elements in one array with 1 using numpy.where(), the mask disappears, so I can assign a new one. But for the arrays that stay the same, the mask still doesn't change. (I checked the numpy.ma documentation, and it should.)
The problem is that some functions have too many arrays whose mask might need to be changed to the new one, so it would be better to find a reliable way to set it in one operation for several arrays.
Is there any way to do this, or to find out why it doesn't work as it should?
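One possible approach, offered as a sketch rather than a confirmed fix for the behaviour described: instead of assigning to .mask on the slices, build the combined mask once and create new masked arrays from the underlying data. The 4x4 test array and the name common_mask are assumptions for illustration:

```python
import numpy as np

# Made-up 4x4 masked array with three masked elements
mask0 = np.zeros((4, 4), dtype=bool)
mask0[1, 1] = mask0[1, 2] = mask0[3, 0] = True
dy__ = np.ma.array(np.arange(16.0).reshape(4, 4), mask=mask0)

# The four corner slices from the question
corners = [dy__[:-1, :-1], dy__[:-1, 1:], dy__[1:, 1:], dy__[1:, :-1]]

# Count how many of the four corners are masked at each position
masked_count = sum(np.ma.getmaskarray(c).astype(int) for c in corners)
common_mask = masked_count > 1  # mask output if more than 1 input is masked

# Rebuild each slice from its raw data with its own copy of the common mask
dy0, dy1, dy2, dy3 = (np.ma.array(c.data, mask=common_mask.copy()) for c in corners)

print(common_mask)
print((dy0 + dy2).mask)
```

Because every rebuilt array carries the same mask, sums and differences between them are masked in exactly the positions of common_mask. The same generator expression can be reused for the dx and g slices.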

Define CVXPY variables for graph coloring problem

I’m trying to solve the minimum graph coloring problem as a MIP (mixed-integer program) using cvxpy. I’m following the outline of a solution described at this URL:
https://manas.tech/blog/2010/09/16/modelling-graph-coloring-with-integer-linear-programming.html
I’m not sure if I’m understanding how the cvxpy variables are created correctly, and how I’m defining my constraints. I have sample input data below along with the code creating the variables, constraints, and objective function, solving the problem and the solution returned.
I think the correct answer for this input should be:
‘2 1\n0 1 0 0’
That is, the minimum number of colors required is 2, and all nodes are the same color except node 1.
I’m creating the w variable to count the number of colors used:
w = cvxpy.Variable(j, boolean=True)
What I think I am doing is creating a binary variable of length equal to the number of nodes. The idea being that the maximum number of colors you could use would be equal to the number of nodes. So maximum colors:
w=[1,1,1,1]
I’m picturing w as binary variable like a list where the values can be 0 or 1 indicating if that color is used by any of the nodes.
Then to create the objective function:
obj=cvxpy.sum(w,axis=0)
I think I’m summing the entries in the row which are 1, so for example:
w=[1,1,0,0]
obj=2
I also create a variable x to indicate the color of a given node:
x = cvxpy.Variable((j,int(first_line[0])), boolean=True)
I’m picturing this as a 2 dimensional array with binary values, where the column indicates the node and the row indicates the color.
So for example if node 0 had color 0, node 1 had color 1, node 2 had color 2, and node 3 had color 2, I would imagine x to look like:
[[1,0,0,0],[0,1,0,0],[0,0,1,1],[0,0,0,0]]
Can someone please tell me if I’m understanding and creating my selection variables correctly? Also, do I understand my objective function, and have I created it correctly? That is, does the way I’ve described my objective function match the way I’ve created it? Any input on the other constraints I’ve defined or on my code would be greatly appreciated. I’m learning linear programming and I’m trying to understand cvxpy syntax.
Sample Data:
input_data = '4 3\n0 1\n1 2\n1 3\n'
# parse the input
lines = input_data.split('\n')
first_line = lines[0].split()
node_count = int(first_line[0])
edge_count = int(first_line[1])
edges = []
for i in range(1, edge_count + 1):
    line = lines[i]
    parts = line.split()
    edges.append((int(parts[0]), int(parts[1])))
edges
# Output:
[(0, 1), (1, 2), (1, 3)]
# solution using cvxpy solver
import numpy as np
import cvxpy
from collections import namedtuple
# selection variables
# binary variable if at least one node is color j
j=int(first_line[0])
# w=1 if at least one node has color j
w = cvxpy.Variable(j, boolean=True)
# x=1 if node i is color j
x = cvxpy.Variable((j,int(first_line[0])), boolean=True)
# Objective function
# minimize number of colors needed
obj=cvxpy.sum(w,axis=0)
# constraints
# 1 color per node
node_color=cvxpy.sum(x,axis=1)==1
# for adjacent nodes at most 1 node has color
diff_col = []
for edge in edges:
    for k in range(node_count):
        diff_col += [
            # x[edge[0],k]+x[edge[1],k]<=1
            x[k,edge[0]]+x[k,edge[1]]<=1
        ]
# w is upper bound for color of node x<=w
upper_bound = []
for i in range(j):
    for k in range(j):
        upper_bound += [
            x[k,i]<=w[i]
        ]
# constraints
constraints=[node_color]+diff_col+upper_bound
# solving problem
# constraints must be passed to cvxpy as a list
graph_problem = cvxpy.Problem(cvxpy.Minimize(obj), constraints)
# Solving the problem
graph_problem.solve(solver=cvxpy.GLPK_MI)
value2=int(graph_problem.solve(solver=cvxpy.GLPK_MI))
# taken2=[int(i) for i in selection.value.tolist()]
# taken2=[int(i) for i in w.value.tolist()]
taken2=[int(i) for i in w.value.tolist()]
# prepare the solution in the specified output format
output_data2 = str(value2) + ' ' + str(0) + '\n'
output_data2 += ' '.join(map(str, taken2))
output_data2
'1 0\n0 0 0 1'
Your solution is almost correct. The main problem here is the definition of variable x. According to the blog post
x_{ij} variables that will be true if and only if node i is assigned color j
which indicates that the size of x is (nb of nodes, nb of colors).
In your code you need to change x to:
x = cvxpy.Variable((node_count, j), boolean=True)
and then, consequently, the second and third constraints:
# for adjacent nodes at most 1 node has color
diff_col = []
for edge in edges:
    for k in range(j):
        diff_col += [
            x[edge[0],k]+x[edge[1],k]<=1
        ]
# w is upper bound for color of node x<=w
upper_bound = []
for i in range(node_count):
    for k in range(j):
        upper_bound += [
            x[i,k]<=w[k]
        ]
Then the output is as expected i.e. 2 colors are used: one color for node 1 and another color for nodes 0, 2, 3 (because they are not adjacent).
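For a graph this small, the expected result can be cross-checked without a solver. The sketch below is a plain brute-force search, not the integer program from the blog post, and the helper names is_proper and min_coloring are made up for illustration. It tries 1, 2, ... colors and returns the first assignment in which no edge joins two nodes of the same color:

```python
from itertools import product

def is_proper(coloring, edges):
    """True if no edge connects two nodes of the same color."""
    return all(coloring[u] != coloring[v] for u, v in edges)

def min_coloring(node_count, edges):
    """Smallest number of colors and one matching assignment (brute force)."""
    for k in range(1, node_count + 1):
        # Try every assignment of k colors to the nodes, in lexicographic order
        for coloring in product(range(k), repeat=node_count):
            if is_proper(coloring, edges):
                return k, list(coloring)
    raise ValueError("no proper coloring found")

edges = [(0, 1), (1, 2), (1, 3)]
colors_used, assignment = min_coloring(4, edges)
print(colors_used, assignment)  # 2 [0, 1, 0, 0]
```

The search grows exponentially with the number of nodes, so it is only useful as a sanity check on tiny inputs like this one. Here it agrees with the answer the question expected: 2 colors, with node 1 colored differently from nodes 0, 2 and 3.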
