How to select a good sample size of nodes from a graph - python-3.x

I have a network that has a node attribute labeled as 0 or 1. I want to find how the distance between nodes with the same attribute differs from the distance between nodes with different attributes. As it is computationally difficult to find the distance between all combinations of nodes, I want to select a sample of node pairs. How should I select that sample? I am working in Python with networkx.

You've not given many details, so I'll invent some data and make assumptions in the hope it's useful.
Start by importing packages and sampling a dataset:
import random
import networkx as nx
# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)
# set labels to either 0 or 1
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0
Next, calculate the shortest paths between random pairs of nodes:
results = []
# I had to use 100,000 pairs to get the CI small enough below
for _ in range(100000):
    a, b = random.sample(list(G.nodes), 2)
    try:
        n = nx.algorithms.shortest_path_length(G, a, b)
    except nx.NetworkXNoPath:
        # no path between nodes found
        n = -1
    results.append((a, b, n))
Finally, here is some code to summarise the results and print them out:
from collections import Counter
from scipy import stats

# counters for pairs where both labels are 0, both are 1, or the labels differ
c_0 = Counter()
c_1 = Counter()
c_d = Counter()

# accumulate distances into the above counters
node_data = {i: a['label'] for i, a in G.nodes.data()}
cc = { (0,0): c_0, (0,1): c_d, (1,0): c_d, (1,1): c_1 }
for a, b, n in results:
    cc[node_data[a], node_data[b]][n] += 1

# code to display the results nicely
def show(c, title):
    s = sum(c.values())
    print(f'{title}, n={s}')
    for k, n in sorted(c.items()):
        # calculate some sort of CI over monte carlo error
        lo, hi = stats.beta.ppf([0.025, 0.975], 1 + n, 1 + s - n)
        print(f'{k:5}: {n:5} = {n/s:6.2%} [{lo:6.2%}, {hi:6.2%}]')

show(c_0, 'both 0')
show(c_1, 'both 1')
show(c_d, 'different')
The above prints out:
both 0, n=63930
-1: 60806 = 95.11% [94.94%, 95.28%]
1: 107 = 0.17% [ 0.14%, 0.20%]
2: 753 = 1.18% [ 1.10%, 1.26%]
3: 1137 = 1.78% [ 1.68%, 1.88%]
4: 584 = 0.91% [ 0.84%, 0.99%]
5: 334 = 0.52% [ 0.47%, 0.58%]
6: 154 = 0.24% [ 0.21%, 0.28%]
7: 50 = 0.08% [ 0.06%, 0.10%]
8: 3 = 0.00% [ 0.00%, 0.01%]
9: 2 = 0.00% [ 0.00%, 0.01%]
both 1, n=3978
-1: 3837 = 96.46% [95.83%, 96.99%]
1: 6 = 0.15% [ 0.07%, 0.33%]
2: 34 = 0.85% [ 0.61%, 1.19%]
3: 34 = 0.85% [ 0.61%, 1.19%]
4: 31 = 0.78% [ 0.55%, 1.10%]
5: 30 = 0.75% [ 0.53%, 1.07%]
6: 6 = 0.15% [ 0.07%, 0.33%]
To save space I've cut off the section where the labels differ. The proportions in the square brackets are the 95% CI of the Monte Carlo error. Using more iterations above allows you to reduce this error, at the obvious cost of more CPU time.
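If you want to choose the number of sampled pairs up front rather than by trial and error, a rough rule of thumb is the normal-approximation sample-size formula for a proportion. The sketch below is my own addition, not part of the answer above; the target proportion p and the desired half-width are assumptions you would set yourself:
from scipy import stats

# rough sample-size estimate for a proportion via the normal approximation
p = 0.05            # rough guess at the proportion you want to estimate
half_width = 0.002  # desired half-width of the 95% CI
z = stats.norm.ppf(0.975)
n_pairs = (z ** 2) * p * (1 - p) / half_width ** 2
print(round(n_pairs))  # ~45,600 pairs for these numbers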

This is more or less an extension of my discussion with Sam Mason; I only want to give you some timing numbers, because, as discussed, retrieving all distances may be feasible and may even be faster. Based on the code in Sam Mason's answer, I tested both variants: for 1000 nodes, retrieving all distances is much faster than sampling 100,000 pairs. The main advantage is that all "retrieved distances" are used.
import random
import networkx as nx
import time

# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)

# set labels to either 0 or 1
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0

def timing(f):
    def wrap(*args, **kwargs):
        time1 = time.time()
        ret = f(*args, **kwargs)
        time2 = time.time()
        print('{:s} function took {:.3f} ms'.format(f.__name__, (time2 - time1) * 1000.0))
        return ret
    return wrap

@timing
def get_sample_distance():
    results = []
    # I had to use 100,000 pairs to get the CI small enough below
    for _ in range(100000):
        a, b = random.sample(list(G.nodes), 2)
        try:
            n = nx.algorithms.shortest_path_length(G, a, b)
        except nx.NetworkXNoPath:
            # no path between nodes found
            n = -1
        results.append((a, b, n))

@timing
def get_all_distances():
    all_distances = nx.shortest_path_length(G)

get_sample_distance()
# get_sample_distance function took 2338.038 ms
get_all_distances()
# get_all_distances function took 304.247 ms
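To make use of every computed distance, the all-pairs result can feed the same counters built in the first answer. This is a minimal sketch of my own (it assumes G, node_data and the cc counter mapping defined earlier are in scope); nx.shortest_path_length(G), called with only the graph, yields (source, {target: length}) pairs lazily:
# accumulate every computed distance into the label counters from the first answer
for a, lengths in nx.shortest_path_length(G):
    for b, n in lengths.items():
        if a == b:
            continue  # skip the zero-length path from a node to itself
        cc[node_data[a], node_data[b]][n] += 1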

Related

How to concatenate gathered data using mpi4py library in python

I build up a list of data using mpi4py and try to save the data sequentially at the source (root == 0) node.
As suggested by Alan22, I've modified the code and it works, but the script does not concatenate properly, so I get the output file as shown in the attached figure: 01.
Can anybody help me fix the error? In addition, whatever I've written in the Python script [shown below] probably isn't the best way to solve the problem.
Is there any way to solve this type of problem efficiently? Any help is highly appreciated.
The python script is given as follows:
import numpy as np
from scipy import signal
from mpi4py import MPI
import random
import cmath, math
import matplotlib.pyplot as plt
import time

# File storing path
save_results_to = 'File storing path'
count_day = 1
count_hour = 1

arr_x = [0, 8.49, 0.0, -8.49, -12.0, -8.49, -0.0, 8.49, 12.0]
arr_y = [0, 8.49, 12.0, 8.49, 0.0, -8.49, -12.0, -8.49, -0.0]
M = len(arr_x)
N = len(arr_y)
np.random.seed(12345)
total_rows = 50000
raw_data = np.reshape(np.random.rand(total_rows * N), (total_rows, N))

# Function of CSD:: Using For Loop
fs = 500  # Sampling frequency
def csdMat(data):
    dat, cols = data.shape  # For 2D data
    total_csd = []
    for i in range(cols):
        col_csd = []
        for j in range(cols):
            freq, Pxy = signal.csd(data[:, i], data[:, j], fs=fs, window='hann', nperseg=100, noverlap=70, nfft=5000)
            col_csd.append(Pxy)
        total_csd.append(col_csd)
    pxy = np.array(total_csd)
    return freq, pxy

# Finding cross spectral density (CSD)
t0 = time.time()
freq, csd = csdMat(raw_data)
print('The shape of the csd data', csd.shape)
print('Time required {} seconds to execute CSD--For loop'.format(time.time() - t0))

kf = 1 * 2 * np.pi / 10
resolution = 50  # This is important:: the HIGHER the Resolution, the higher the execution time!!!
grid_size = N * resolution
kx = np.linspace(-kf, kf, )  # space vector (no third argument, so numpy's default num=50 is used)
ky = np.linspace(-kf, kf, grid_size)  # space vector

def DFT2D(data):
    P = len(kx)
    Q = len(ky)
    dft2d = np.zeros((P, Q), dtype=complex)
    for k in range(P):
        for l in range(Q):
            sum_log = []
            mat2d = np.zeros((M, N))
            sum_matrix = 0.0
            for m in range(M):
                for n in range(N):
                    e = cmath.exp(-1j*((((dx[m]-dx[n])*kx[l])/1) + (((dy[m]-dy[n])*ky[k])/1)))
                    sum_matrix += data[m, n] * e
            dft2d[k, l] = sum_matrix
    return dft2d

dx = arr_x[:]; dy = arr_y[:]

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

data = []
start_freq = 100
end_freq = 109
freq_range = np.arange(start_freq, end_freq)
no_of_freq = len(freq_range)

for fr_count in range(start_freq, end_freq):
    if fr_count % size == rank:
        dft = np.zeros((grid_size, grid_size))
        spec_csd = csd[:, :, fr_count]
        dft = DFT2D(spec_csd)  # Call the DFT2D function
        spec = np.array(np.real(dft))  # Spectrum or 2D_DFT of data[real part]
        print('Shape of spec', spec.shape)
        data.append(spec)
        #data = np.append(data,spec)
        np.seterr(invalid='ignore')

data = comm.gather(data, root=0)
# comm.Allreduce(MPI.IN_PLACE,data,op=MPI.MAX)
print("Rank: ", rank, ". Spectrum shape is:\n", spec.shape)

if rank == 0:
    output_data = np.concatenate(data, axis=0)
    #output_data = np.c_(data, axis = 0)
    dft_tot = np.array((output_data), dtype='object')
    res = np.zeros((grid_size, grid_size))
    for k in range(size):
        for i in range(no_of_freq):
            jj = np.around(freq[freq_range[i]], decimals=2)
            #print('The shape of data after indexing', data1.shape)
            #data_final=data1.reshape(data1.shape[0]*data1.shape[1], data1.shape[2])
            res[i * size + k] = dft_tot[k][i]  #np.array(data[k])
            data = np.array(res)
            #print('The shape of the dft at root node', data.shape)
            np.savetxt(save_results_to + f'Day_{count_day}_hour_{count_hour}_f_{jj}_hz.txt', data.view(float))
I use the following bash script (my_file.sh) to run the code, and submit the job with the command sbatch my_file.sh:
#! /bin/bash -l
#SBATCH -J testmvapich2
#SBATCH -N 1 ## Maximum 04 nodes
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=1 # cpu-cores per task
#SBATCH --mem-per-cpu=3000MB
#SBATCH --time=00:20:00
#SBATCH -p para
#SBATCH --output="stdout.txt"
#SBATCH --error="stderr.txt"
#SBATCH -A camk
##SBATCH --mail-type=ALL
##SBATCH --chdir=/work/cluster_computer/my_name/data_work/MMC331/
eval "$(conda shell.bash hook)"
conda activate myenv
#conda activate fast-mpi4py
cd $SLURM_SUBMIT_DIR
#module purge
#module add mpi/mvapich2-2.2-x86_64
mpirun python3 mpi_test.py
You can try this after data = comm.gather(data, root=0):
if rank == 0:
    print('Type of data:', type(data))
    dft_tot = np.array((data))  #, dtype='object')
    print('shape of DATA array:', dft_tot.shape)
    #print('Type of dft array:', type(dft_tot))
    res = np.zeros((450, 450))
    for k in range(size):
        # for i in range(len(data[rank])):
        for i in range(no_of_freq):
            jj = np.around(freq[freq_range[k]], decimals=2)
            #data1 = np.array(dft_tot[k])
            res[i * size + k] = data[k]
            data = np.array(res)  #.reshape(data1.shape[0]*data1.shape[1], data1.shape[2])
            print('The shape of the dft at root node', data.shape)
            np.savetxt(save_results_to + f'Day_{count_day}_hour_{count_hour}_f_{jj}_hz.txt', data.view(float))
Here is a link that may help: mpi4py on HPC: comm.gather
As mentioned in the comments, there are two typos in the code:
1. The indices for arrays kx and ky have been swapped in the line where variable e is calculated in the function DFT2D(data) (see the corrected line sketched after this list).
2. The code is being run on 10 MPI processes for frequencies fr_count from start_freq = 100 to end_freq = 109. For this, the loop and arange must be written as for fr_count in range(start_freq, end_freq + 1) and freq_range = np.arange(start_freq, end_freq + 1), as these are not end-point inclusive.
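For reference, here is how the corrected exponent line inside DFT2D might look after fixing typo 1. This is my reading of the fix described above, not code taken from either post:
# kx is indexed by k (outer loop) and ky by l (inner loop), matching dft2d[k, l]
e = cmath.exp(-1j * (((dx[m] - dx[n]) * kx[k]) + ((dy[m] - dy[n]) * ky[l])))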
The data = comm.gather(data, root=0) and subsequent output_data = np.concatenate(data, axis=0) operations are performing as they should and as such, the question detracts from the actual issue in the code.
A major issue is that in line res[i * size + k] = dft_tot[k][i] arrays of disparate sizes are being assigned to each other.
Shape of res: 450 x 450
Shape of dft_tot: 10 x 50 x 450
The value of i*size + k ranges from 0 to 110. I think the user expects dft_tot to have the shape 450 x 450, probably due to the indexing confusion mentioned in typo#2 above. Properly done concatenation would yield dft_tot with shape 500 x 450 (since there are 10 arrays of size 50 x 450).
Currently the gather operation returns a list of lists, each containing a NumPy array of size 50 x 450. Technically, it should return a list of NumPy arrays each of size 50 x 450. Adding the line data = data[0] (since data has only one element anyway in each process) before performing data = comm.gather(data, root=0) will achieve this result.
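In code, that suggestion would look something like the following, placed in every rank just before the gather (a sketch of the change described above, not code from the post):
data = data[0]                    # each rank holds a single 50 x 450 array inside a one-element list
data = comm.gather(data, root=0)  # gather now returns a list of arrays on rank 0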
But this whole process seems redundant.
There are 10 frequencies considered here, and for each frequency there is a data set of size 50 x 450. There are 10 MPI processes, each handling one of the 10 frequencies, and finally 10 files are written, one per frequency. This makes the whole gather operation redundant, as each MPI process can directly write the file corresponding to its frequency.
If instead the dft_tot file was being written as is by rank = 0, then the gather operation would make sense. But splitting the array into the constituent frequencies defeats the point.
This achieves the same result without the gather operation:
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

start_freq = 100
end_freq = 109
freq_range = np.arange(start_freq, end_freq + 1)
no_of_freq = len(freq_range)

for fr_count in range(start_freq, end_freq + 1):
    if fr_count % size == rank:
        dft = np.zeros((grid_size, grid_size))
        spec_csd = csd[:, :, fr_count]
        dft = DFT2D(spec_csd)  # Call the DFT2D function
        spec = np.array(np.real(dft))  # Spectrum or 2D_DFT of data[real part]
        print('Shape of spec', spec.shape)
        jj = np.around(freq[freq_range[rank]], decimals=2)
        np.savetxt(f'Day_{count_day}_hour_{count_hour}_f_{jj}_hz.txt', spec.view(float))

Too many copies? Poor comparison? Urn Probability Problem

full code: https://gist.github.com/QuantVI/79a1c164f3017c6a7a2d860e55cf5d5b
TLDR: sum(a3) gives a number like 770, when it should be more like 270 - as in 270 of 1000 trials where the results of drawing 4 contained (at least) 2 blue and 1 green ball.
I've rewritten both my way of creating the sample output and my way of comparing the results twice already. Python has the syntax `all(x in a for x in b)`, which I used initially, then changed to something more deliberate to see if there was a difference. I still have 750+ `True` evaluations of each trial. This is why I reassessed how I was selecting without replacement.
I've tested the draw function on its own with different Hats and was sure it worked.
The expected probability when drawing 4 balls, without replacement, from a hat containing (blue=3, red=2, green=6), and having the outcome contain (blue=2, green=1), or ['blue','blue','green'], is around 27.2%. In my 1000 trials, I get higher than 700, repeatedly.
Is the error in Hat.draw() or is it in experiment()?
Note: Certain things are commented out because I am debugging. Thus use sum(a3), as experiment is set up to return things other than the probability right now.
import copy
import random
# Consider using the modules imported above.

class Hat:
    def __init__(self, **kwargs):
        self.d = kwargs
        self.contents = [
            key for key, val in kwargs.items() for num in range(val)
        ]

    def draw(self, num: int) -> list:
        if num >= len(self.contents):
            return self.contents
        else:
            indices = random.sample(range(len(self.contents)), num)
            chosen = [self.contents[idx] for idx in indices]
            #new_contents = [ v for i, v in enumerate(self.contents) if i not in indices]
            new_contents = [pair[1] for pair in enumerate(self.contents)
                            if pair[0] not in indices]
            self.contents = new_contents
            return chosen

    def __repr__(self): return str(self.contents)

def experiment(hat, expected_balls, num_balls_drawn, num_experiments):
    trials = []
    for n in range(num_experiments):
        copyn = copy.deepcopy(hat)
        result = copyn.draw(num_balls_drawn)
        trials.append(result)
    #trials = [ copy.deepcopy(hat).draw(num_balls_drawn) for n in range(num_experiments) ]
    expected_contents = [key for key, val in expected_balls.items() for num in range(val)]
    temp_eval = [[o for o in expected_contents if o in trial] for trial in trials]
    temp_compare = [ evaled == expected_contents for evaled in temp_eval]
    return expected_contents, temp_eval, temp_compare, trials
    #evaluations = [ all(x in trial for x in expected_contents) for trial in trials ]
    #if evaluations: prob = sum(evaluations)/len(evaluations)
    #else: prob = 0
    #return prob, expected_contents

#hat3 = Hat(red=5, orange=4, black=1, blue=0, pink=2, striped=9)
#hat4 = Hat(red=1, orange=2, black=3, blue=2)
hat1 = Hat(blue=3, red=2, green=6)
a1, a2, a3, a4 = experiment(hat=hat1, expected_balls={"blue": 2, "green": 1}, num_balls_drawn=4, num_experiments=1000)
#actual = probability
#expected = 0.272
#self.assertAlmostEqual(actual, expected, delta = 0.01, msg = 'Expected experiment method to return a different probability.')
hat2 = Hat(yellow=5, red=1, green=3, blue=9, test=1)
b1, b2, b3, b4 = experiment(hat=hat2, expected_balls={"yellow": 2, "blue": 3, "test": 1}, num_balls_drawn=20, num_experiments=100)
#actual = probability
#expected = 1.0
#self.assertAlmostEqual(actual, expected, delta = 0.01, msg = 'Expected experiment method to return a different probability.')
The issue is temp_eval = [[o for o in expected_contents if o in trial] for trial in trials]. It will always add both 'blue' entries to the list even if only one blue exists in the results of a trial.
However, I couldn't fix the error in a straightforward way. Instead, my first fix produced a much lower answer, something less than 0.1, when around 0.27 (270 of 1000 trials) is what I need.
The roundabout solution was to convert lists like ['red', 'green', 'blue', 'green'] into dictionaries by calling dict on collections.Counter of that list, then do a key-wise comparison of the values, such as [y[key] <= x.get(key, 0) for key in y.keys()]. In this comparison y is the expected_balls variable, and x is the dict built from the Counter of the trial. If x doesn't have one of the keys, we get 0, and zero will be less than the value of any key in expected_balls.
From here we use functools.reduce to turn the output into a single True or False value. Then we map that functionality (compare all keys and get one T/F value) across all trials.
import copy
import collections
from functools import reduce

def experiment(hat, expected_balls, num_balls_drawn, num_experiments):
    trials = [ copy.deepcopy(hat).draw(num_balls_drawn)
               for n in range(num_experiments) ]
    trials_kvpairs = [dict(collections.Counter(trial)) for trial in trials]

    def contains(contained: dict, container: dict):
        each = [container.get(key, 0) >= contained[key]
                for key in contained.keys()]
        return reduce(lambda item0, item1: item0 and item1, each)

    trials_success = list(map(lambda t: contains(expected_balls, t), trials_kvpairs))
    # expected_contents = [pair[0] for pair in expected_balls.items() for num in range(pair[1])]
    # temp_eval = [[o for o in trial if o in expected_contents] for trial in trials]
    # temp_compare = [ evaled == expected_contents for evaled in temp_eval]
    # if temp_compare: prob = sum(temp_compare)/len(trials)
    # else: prob = 0
    return 'prob', trials_kvpairs, trials_success
When run using experiment(hat=hat1, expected_balls={"blue":2,"green":1}, num_balls_drawn=4, num_experiments=1000), the sum of the third part of the output was 276.
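As a quick cross-check (my own addition, not part of the original answer), the exact combinatorial probability for this hat can be computed directly; it comes out to roughly 0.26, in the same ballpark as the 0.272 target and the 276/1000 simulated above:
from math import comb

# exact probability of drawing >=2 blue and >=1 green when taking 4 balls
# from a hat with blue=3, red=2, green=6
blue, red, green, draw = 3, 2, 6, 4
total = comb(blue + red + green, draw)
favourable = sum(
    comb(blue, b) * comb(green, g) * comb(red, draw - b - g)
    for b in range(2, blue + 1)
    for g in range(1, green + 1)
    if 0 <= draw - b - g <= red
)
print(favourable / total)  # ~0.2636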

How many ways to get the change?

While practicing the following dynamic programming question on HackerRank, I got a 'timeout' error. The code ran successfully on some test examples but got 'timeout' errors on others. I'm wondering how I can further improve the code.
The question is
Given an amount and the denominations of coins available, determine how many ways change can be made for amount. There is a limitless supply of each coin type.
Example:
n = 3
c = [8, 3, 1, 2]
There are 3 ways to make change for n=3 : {1, 1, 1}, {1, 2}, and {3}.
My current code is
import math
import os
import random
import re
import sys
from functools import lru_cache

#
# Complete the 'getWays' function below.
#
# The function is expected to return a LONG_INTEGER.
# The function accepts following parameters:
#  1. INTEGER n
#  2. LONG_INTEGER_ARRAY c
#

def getWays(n, c):
    # Write your code here
    #c = sorted(c)
    #lru_cache
    def get_ways_recursive(n, cur_idx):
        cur_denom = c[cur_idx]
        n_ways = 0
        if n == 0:
            return 1
        if cur_idx == 0:
            return 1 if n % cur_denom == 0 else 0
        for k in range(n // cur_denom + 1):
            n_ways += get_ways_recursive(n - k * cur_denom,
                                         cur_idx - 1)
        return n_ways
    return get_ways_recursive(n, len(c) - 1)

if __name__ == '__main__':
    fptr = open(os.environ['OUTPUT_PATH'], 'w')
    first_multiple_input = input().rstrip().split()
    n = int(first_multiple_input[0])
    m = int(first_multiple_input[1])
    c = list(map(int, input().rstrip().split()))
    # Print the number of ways of making change for 'n' units using coins having the values given by 'c'
    ways = getWays(n, c)
    fptr.write(str(ways) + '\n')
    fptr.close()
It timed out on the following test example
166 23 # 23 is the number of coins below.
5 37 8 39 33 17 22 32 13 7 10 35 40 2 43 49 46 19 41 1 12 11 28
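The recursive solution recomputes the same (amount, coin index) states many times. One common way to avoid the timeout is the standard bottom-up dynamic programming table over amounts; the sketch below is my own illustration of that technique (not the poster's code) and runs in O(n * len(c)) time:
# bottom-up coin-change count: ways[a] = number of ways to make amount a
# using the coins considered so far; O(n * len(c)) time, O(n) space
def get_ways_dp(n, coins):
    ways = [0] * (n + 1)
    ways[0] = 1  # one way to make 0: use no coins
    for coin in coins:
        for amount in range(coin, n + 1):
            ways[amount] += ways[amount - coin]
    return ways[n]

# e.g. get_ways_dp(3, [8, 3, 1, 2]) == 3, matching the worked example above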

Dijkstra's algorithm in graph (Python)

I need some help with graphs and Dijkstra's algorithm in Python 3. I tested this code (see below) on one site and it says the code runs too long. Can anybody tell me how to solve that, or show an example of code for this algorithm? I don't know how to speed up this code. I have read many sites but haven't found normal examples...
P.S. I have now edited the code in a few places and tried to optimize it, but it is still too slow.
from collections import deque

class node:
    def __init__(self, name, neighbors, distance, visited):
        self.neighbors = neighbors
        self.distance = distance
        self.visited = visited
        self.name = name

    def addNeighbor(self, neighbor_name, dist):  # adding a new neighbor and the length to it
        if neighbor_name not in self.neighbors:
            self.neighbors.append(neighbor_name)
            self.distance.append(dist)

class graph:
    def __init__(self):
        self.graphStructure = {}  # dictionary in the format: node_name, [neighbors], [length to every neighbor], visited_status

    def addNode(self, index):  # adding a new node to the graph structure
        if self.graphStructure.get(index) is None:
            self.graphStructure[index] = node(index, [], [], False)

    def addConnection(self, node0_name, node1_name, length):  # adding a connection between 2 nodes
        n0 = self.graphStructure.get(node0_name)
        n0.addNeighbor(node1_name, length)
        n1 = self.graphStructure.get(node1_name)
        n1.addNeighbor(node0_name, length)

    def returnGraph(self):  # printing graph nodes and connections
        print('')
        for i in range(len(self.graphStructure)):
            nodeInfo = self.graphStructure.get(i + 1)
            print('name =', nodeInfo.name, ' neighbors =', nodeInfo.neighbors, ' length to neighbors =', nodeInfo.distance)

    def bfs(self, index):  # bfs-style search (also used for Dijkstra's algorithm)
        distanceToNodes = [float('inf')] * len(self.graphStructure)
        distanceToNodes[index - 1] = 0
        currentNode = self.graphStructure.get(index)
        queue = deque()
        for i in range(len(currentNode.neighbors)):
            n = currentNode.neighbors[i]
            distanceToNodes[n - 1] = currentNode.distance[i]
            queue.append(n)
        while len(queue) > 0:  # processing the queue and visiting all nodes
            u = queue.popleft()
            node_u = self.graphStructure.get(u)
            node_u.visited = True
            for v in range(len(node_u.neighbors)):
                node_v = self.graphStructure.get(node_u.neighbors[v])
                # update minimal length to node
                distanceToNodes[node_u.neighbors[v] - 1] = min(distanceToNodes[node_u.neighbors[v] - 1], distanceToNodes[u - 1] + node_u.distance[v])
                if not node_v.visited:
                    queue.append(node_u.neighbors[v])
        return distanceToNodes

def readInputToGraph(graph):  # reading input data and writing it into the graph
    node0, node1, length = map(int, input().split())
    graph.addNode(node0)
    graph.addNode(node1)
    graph.addConnection(node0, node1, length)

def main():
    newGraph = graph()
    countOfNodes, countOfPairs = map(int, input().split())
    if countOfPairs == 0:
        print('0')
        exit()
    for _ in range(countOfPairs):  # reading input data for n (countOfPairs) rows
        readInputToGraph(newGraph)
    # newGraph.returnGraph()  # printing information
    print(sum(newGraph.bfs(1)))  # starting bfs from the start position

main()
The input graph structure may look like this:
15 17
3 7 2
7 5 1
7 11 5
11 5 1
11 1 2
1 12 1
1 13 3
12 10 1
12 4 3
12 15 1
12 13 4
1 2 1
2 8 2
8 14 1
14 6 3
6 9 1
13 9 2
I'm only learning Python, so I think I could be doing something wrong.
The correctness of Dijkstra's algorithm relies on retrieving the node with the shortest distance from the source in each iteration. Using your code as an example, the operation u = queue.popleft() MUST return the node that has the shortest distance from the source out of all nodes that are currently in the queue.
Looking at the documentation for collections.deque, I don't think the implementation guarantees that popleft() always returns the node with the lowest key. It simply returns the left most item in what is effectively a double linked list.
The run time of Dijkstra's algorithm (once you implement it correctly) depends almost entirely on the underlying data structure used to implement the queue. I would suggest that you first revisit the correctness of your implementation, and once you can confirm that it is actually correct, then start experimenting with different data structures for the queue.
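To make the data-structure suggestion concrete, here is a minimal sketch of Dijkstra's algorithm driven by heapq as the priority queue. It works on a plain adjacency dict rather than the graph class above, so treat it as an illustration rather than a drop-in replacement:
import heapq

def dijkstra(adj, source):
    # adj: {node: [(neighbor, edge_length), ...]}
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale entry; a shorter path to u was already found
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist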

Grouping algorithm based on distance

I have a set of 200 points with x and y coordinates. I need to make batches of 20 such that each point in the batch is at least "n" centimeters away from the other 19, i.e. no two points in one batch are within "n" cm of each other. A point should only belong to one batch. How do I solve this?
I've used trees to draw out branches such that a new node is added only if it is "n" cm away from all other nodes in the branch. This works but is extremely slow.
Input.csv columns: Point Name | X coordinate | Y coordinate
Output: Lists of batches
From what I can guess from your question, this should do it:
import math
from random import randrange

test_data = {(randrange(0, 1000), randrange(0, 1000)) for _ in range(200)}
inf = float("inf")

def distance(p1, p2):
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])

def batch(data: set, min_distance, max_distance=inf, count=20, remove=True):
    result = [next(iter(data))]
    candidates = {t for t in data if max_distance > distance(result[0], t) > min_distance}
    while len(result) < count and len(candidates) > 0:
        result.append(candidates.pop())
        candidates = {t for t in data if all(max_distance > distance(p, t) > min_distance for p in result)}
    if len(result) < count:
        raise ValueError("Not enough values in data that have great enough distances between each other")
    if remove:
        data.difference_update(result)
    return result

print(len(test_data))
i = 1
while test_data:
    print(i, batch(test_data, 10))
    i += 1
