Is the Benchmarking of my Algorithms right? - python-3.x

I wrote Quicksort and Mergesort along with a benchmark for them, to see how fast they are.
Here is my code:
#------------------------------Creating a Random List-----------------------------#
def create_random_list (length):
    import random
    random_list = list(range(0,length))
    random.shuffle(random_list)
    return random_list
# Initialize default list length to 0
random_list = create_random_list(0)
# Testing random list function
print ("\n" + "That is a randomized list: " + "\n")
print (random_list)
print ("\n")
#-------------------------------------Quicksort-----------------------------------#
"""
Recursive Divide and Conquer Algorithm
+ Very efficient for large data set
- Performance Depends largely on Pivot Selection
Time Complexity
--> Worst-Case -----> O (n^2)
--> Best-Case -----> Ω (n log (n))
--> Average Case ---> O (n log (n))
Space Complexity
--> O(log(n))
"""
# Writing the Quick Sort Algorithm for sorting the list - Recursive Method
def qsort (random_list):
    less = []
    equal = []
    greater = []
    if len(random_list)>1:
        # Initialize starting Point
        pivot = random_list[0]
        for x in random_list:
            if x < pivot:
                less.append(x)
            elif x == pivot:
                equal.append(x)
            elif x > pivot:
                greater.append(x)
        return qsort(less) + equal + qsort(greater)
    else:
        return random_list
"""
Built-in Python Quick Sort:
def qsort(L):
    if len(L) <= 1: return L
    return qsort([lt for lt in L[1:] if lt < L[0]]) + L[0:1] + \
           qsort([ge for ge in L[1:] if ge >= L[0]])
"""
# Calling Quicksort
sorted_list_qsort = qsort(random_list)
# Testing Quicksort
print ("That is a sorted list with Quicksort: " + "\n")
print (sorted_list_qsort)
print ("\n")
#-------------------------------------FINISHED-------------------------------------#
#-------------------------------------Mergesort------------------------------------#
"""
Recursive Divide and Conquer Algorithm
+ Guaranteed O(n log(n)) running time, independent of the input order
- Needs O(n) auxiliary space for merging
Time Complexity
--> Worst-Case -----> O (n log(n))
--> Best-Case -----> Ω (n log(n))
--> Average Case ---> O (n log(n))
Space Complexity
--> O (n)
"""
# Create a merge algorithm
def merge(a,b): # Let a and b be two arrays
    c = [] # Final sorted output array
    a_idx, b_idx = 0,0 # Indices to walk through a and b
    while a_idx < len(a) and b_idx < len(b):
        if a[a_idx] < b[b_idx]:
            c.append(a[a_idx])
            a_idx+=1
        else:
            c.append(b[b_idx])
            b_idx+=1
    if a_idx == len(a): c.extend(b[b_idx:])
    else: c.extend(a[a_idx:])
    return c
# Create final Mergesort algorithm
def merge_sort(a):
    # A list of zero or one elements is sorted by definition
    if len(a)<=1:
        return a
    # Split the list in half and call Mergesort recursively on each half
    left, right = merge_sort(a[:len(a)//2]), merge_sort(a[len(a)//2:])
    # Merge the now-sorted sublists with the merge function which sorts two lists
    return merge(left,right)
# Calling Mergesort
sorted_list_mgsort = merge_sort(random_list)
# Testing Mergesort
print ("That is a sorted list with Mergesort: " + "\n")
print (sorted_list_mgsort)
print ("\n")
#-------------------------------------FINISHED-------------------------------------#
#------------------------------Algorithm Benchmarking------------------------------#
# Creating an array for iterations
n = [100,1000,10000,100000]
# Creating a dictionary for times of algorithms
times = {"Quicksort":[], "Mergesort": []}
# Import time for analyzing the running time of the algorithms
from time import time
# Create a for loop which loop through the arrays of length n and analyse their times
for size in range(len(n)):
    random_list = create_random_list(n[size])
    t0 = time()
    qsort(random_list)
    t1 = time()
    times["Quicksort"].append(t1-t0)
    random_list = create_random_list(n[size-1])
    t0 = time()
    merge_sort(random_list)
    t1 = time()
    times["Mergesort"].append(t1-t0)
# Create a table which shows the benchmarking of the algorithms
print ("n\tMerge\tQuick")
print ("_"*25)
for i,j in enumerate(n):
    print ("{}\t{:.5f}\t{:.5f}\t".format(j, times["Mergesort"][i], times["Quicksort"][i]))
#----------------------------------End of Benchmarking---------------------------------#
The code is well documented and runs perfectly with Python 3.8. You may copy it into a code editor for better readability.
--> My question, as the title states:
Is my benchmarking right? I'm doubting it a little bit, because the running times of my algorithms seem a little odd. Can someone confirm my runtimes?
--> Here is the output of this code:
That is a randomized list:
[]
That is a sorted list with Quicksort:
[]
That is a sorted list with Mergesort:
[]
n Merge Quick
_________________________
100 0.98026 0.00021
1000 0.00042 0.00262
10000 0.00555 0.03164
100000 0.07919 0.44718
--> If someone has another/better code snippet on how to print the table - feel free to share it with me.

The error is in n[size-1]: when size is 0 (the first iteration), this translates to n[-1], which corresponds to your largest size. So in the first iteration you are comparing qsort on 100 elements with merge_sort on 100,000 elements, which obviously favours the first a lot. It doesn't help that you call this variable size, as it really isn't the size, but the index into the n list, which contains the sizes.
So remove the -1, or even better: iterate directly over n. And I would also make sure both sorting algorithms get to sort the same list:
for size in n:
    random_list1 = create_random_list(size)
    random_list2 = random_list1[:]
    t0 = time()
    qsort(random_list1)
    t1 = time()
    times["Quicksort"].append(t1-t0)
    t0 = time()
    merge_sort(random_list2)
    t1 = time()
    times["Mergesort"].append(t1-t0)
Finally, consider using timeit which is designed for measuring performance.
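As a rough sketch of what that could look like, assuming the functions above are defined in the same script (so they can be imported from __main__); the repeat count of 5 and taking the minimum are just illustrative choices:
from timeit import repeat

sizes = [100, 1000, 10000, 100000]
print("n\tMerge\tQuick")
print("_" * 25)
for size in sizes:
    # The setup string builds the shuffled input outside the timed code;
    # qsort and merge_sort both return new lists, so the input stays unsorted.
    setup = ("from __main__ import create_random_list, qsort, merge_sort; "
             "data = create_random_list({})".format(size))
    t_quick = min(repeat("qsort(data)", setup=setup, number=1, repeat=5))
    t_merge = min(repeat("merge_sort(data)", setup=setup, number=1, repeat=5))
    print("{}\t{:.5f}\t{:.5f}".format(size, t_merge, t_quick))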

Related

Speeding up this PySCIPOpt routine for a Mixed Integer Program

I'm writing code to find median orders of tournaments: given a tournament T with n vertices, a median order is an ordering of the vertices of T that maximizes the number of edges pointing in the 'increasing' direction with respect to the ordering.
In particular, if the vertex set of T is {0,...,n-1}, the following integer problem (maximizing over the set of permutations) yields an optimal answer to the problem, where Q is a boolean n-by-n matrix.
I've implemented a linearization of this problem, noting that Q is a permutation. The following Python code works, but my computer can't handle graphs as small as 10 vertices, where I would expect fast answers. Is there any relatively easy way to speed up this computation?
import numpy as np
from pyscipopt import Model,quicksum
from networkx.algorithms.tournament import random_tournament as rt
import math
# Some utilities to define the adjacency matrix of an oriented graph
def adjacency_matrix(t,order):
    n = len(order)
    adj_t = np.zeros((n,n))
    for e in t.edges:
        adj_t[order.index(e[0]),order.index(e[1])] = 1
    return adj_t
# Random tournaments to instantiate the problem
def random_tournament(n):
    r_t = rt(n)
    adj_t = adjacency_matrix(r_t,list(range(n)))
    return r_t, adj_t
###############################################################
############# PySCIPOpt optimization Routine ##################
###############################################################
n = 5 # some arbitrary size parameter
t,adj_t = random_tournament(n)
model = Model()
p,w,r = {},{},{}
# Defining model variables and weights
for k in range(n):
    for l in range(n):
        p[k,l] = model.addVar(vtype='B')
        for i in range(n):
            for j in range(i,n):
                r[i,k,l,j] = model.addVar(vtype='C')
                w[i,k,l,j] = adj_t[k][l]
for i in range(n):
    # Forcing p to be a permutation
    model.addCons(quicksum(p[s,i] for s in range(n))==1)
    model.addCons(quicksum(p[i,s] for s in range(n))==1)
    for k in range(n):
        for j in range(i,n):
            for l in range(n):
                # Setting r[i,k,l,j] = min(p[i,k],p[l,j])
                model.addCons(r[i,k,l,j] <= p[k,i])
                model.addCons(r[i,k,l,j] <= p[l,j])
# Set the objective function
model.setObjective(quicksum(r[i,k,l,j]*w[i,k,l,j] for i in range(n) for j in range(i,n) for k in range(n) for l in range(n)), "maximize")
model.data = p,r
# Timestamps so the elapsed time can be reported in the status message below
from time import time
end_setup = time()
model.optimize()
end_optimization = time()
sol = model.getBestSol()
# Print the solution in a readable format
Q = np.array([math.floor(model.getVal(model.data[0][key])) for key in model.data[0].keys()]).reshape([n,n])
print(f'\nOptimization ended with status {model.getStatus()} in {"{:.2f}".format(end_optimization-end_setup)}s, with {model.getObjVal()} increasing edges and optimal solution:')
print('\n',Q)
order = [int(x) for x in list(np.matmul(Q.T,np.array(range(n))))]
new_adj_t = adjacency_matrix(t,order)
print(f'\nwhich induces the ordering:\n\n {order}')
print(f'\nand induces the following adjacency matrix: \n\n {new_adj_t}')
Right now I've run it for n=5, taking between 5 and 20 seconds, and have run it successfully for the small values 6 and 7 with not much change in the time needed.
For n=10, on the other hand, it has been running for around an hour with no solution yet. I suppose the linearization having O(n**4) variables hurts, but I don't understand why it blows up so fast. Is this normal? What would a better implementation look like, in case there is one?

Why is heapq.heapify so fast?

I have tried to reimplement the heapify method in order to use _siftup and _siftdown for updating or deleting nodes in the heap while maintaining a time complexity of O(log(n)).
I put some effort into optimizing my code, but it proved to be worse than heapq.heapify (in terms of the total time taken). So I decided to look into the source code and compare the copied code with the module's version.
# heap invariant.
def _siftdown(heap, startpos, pos):
    newitem = heap[pos]
    # Follow the path to the root, moving parents down until finding a place
    # newitem fits.
    while pos > startpos:
        parentpos = (pos - 1) >> 1
        parent = heap[parentpos]
        if newitem < parent:
            heap[pos] = parent
            pos = parentpos
            continue
        break
    heap[pos] = newitem

def _siftup(heap, pos):
    endpos = len(heap)
    startpos = pos
    newitem = heap[pos]
    # Bubble up the smaller child until hitting a leaf.
    childpos = 2*pos + 1    # leftmost child position
    while childpos < endpos:
        # Set childpos to index of smaller child.
        rightpos = childpos + 1
        if rightpos < endpos and not heap[childpos] < heap[rightpos]:
            childpos = rightpos
        # Move the smaller child up.
        heap[pos] = heap[childpos]
        pos = childpos
        childpos = 2*pos + 1
    # The leaf at pos is empty now. Put newitem there, and bubble it up
    # to its final resting place (by sifting its parents down).
    heap[pos] = newitem
    _siftdown(heap, startpos, pos)

def heapify(x):
    """Transform list into a heap, in-place, in O(len(x)) time."""
    n = len(x)
    # Transform bottom-up. The largest index there's any point to looking at
    # is the largest with a child index in-range, so must have 2*i + 1 < n,
    # or i < (n-1)/2. If n is even = 2*j, this is (2*j-1)/2 = j-1/2 so
    # j-1 is the largest, which is n//2 - 1. If n is odd = 2*j+1, this is
    # (2*j+1-1)/2 = j so j-1 is the largest, and that's again n//2-1.
    for i in reversed(range(n//2)):
        _siftup(x, i)

a = list(reversed(range(1000000)))
b = a.copy()
import heapq
import time

cp1 = time.time()
heapq.heapify(a)
cp2 = time.time()
heapify(b)
cp3 = time.time()

print(a == b)
print(cp3 - cp2, cp2 - cp1)
I found that cp3 - cp2 >= cp2 - cp1 always holds, and the times are not the same: my heapify took more time than heapq.heapify even though both are the same code.
In some cases my heapify took 3 seconds while heapq.heapify took 0.1 seconds.
The heapq.heapify version executes faster than the identical heapify; they only differ through the import.
Please let me know the reason. I am sorry if I made some silly mistake.
The heapify from the heapq module is actually a built-in function:
>>> import heapq
>>> heapq
<module 'heapq' from 'python3.9/heapq.py'>
>>> heapq.heapify
<built-in function heapify>
help(heapq.heapify) says:
Help on built-in function heapify in module _heapq...
So it's actually importing the built-in module _heapq and thus running C code, not Python.
If you scroll the heapq.py code further, you'll see this:
# If available, use C implementation
try:
from _heapq import *
except ImportError:
pass
This will overwrite functions like heapify with their C implementations. For instance, _heapq.heapify is here.
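One quick way to confirm that the C version is what ends up being called (a small check of my own, assuming a standard CPython build where the _heapq accelerator module is available):
import heapq
import _heapq  # the C accelerator module

# After the `from _heapq import *` inside heapq.py, both names point to the
# same C function, so the pure-Python definition in heapq.py is shadowed.
print(heapq.heapify is _heapq.heapify)  # True on CPython builds that ship _heapq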

What is the complexity for the following python code?

import math

# The for loop goes up to k/2 (i.e. if k/2 = 3.5, then i will go from 1 to 3: 1, 2, and 3)
def pow(x,k):
    y = 1
    m = math.trunc(k/2)
    if k == 0:
        return 1
    for i in range(1,m+1):
        y = y * x
    if(k%2 == 0):
        return y * y
    else:
        return y*y*x
I think it uses O(1) memory and O(log k) steps. Is it correct?
The memory complexity is correct, thus O(1).
The runtime complexity should be O(k). Each line is done in constant time except the for loop. You iterate in your for loop at most k/2 times, which means the number of iterations is linear in k. Note that O(k/2) = O(k).
You can also see that a complexity of log(k) does not work: if k = 16, your loop iterates 8 times, whereas the log (base 2) of 16 would only give 4.
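For contrast, here is a minimal sketch of exponentiation by squaring, which does achieve O(log k) multiplications; fast_pow is a hypothetical name for illustration, not the function from the question:
def fast_pow(x, k):
    """Compute x**k for a non-negative integer k using O(log k) multiplications."""
    result = 1
    base = x
    while k > 0:
        if k % 2 == 1:      # the current lowest bit of k is set
            result *= base
        base *= base        # square the base for the next bit
        k //= 2             # drop the lowest bit
    return result

print(fast_pow(3, 10))  # 59049, same as 3**10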

Grouping algorithm based on distance

I have a set of 200 points with x and y coordinates. I need to make batches of 20 such that each point in the batch is at least "n" centimeters away from the other 19, i.e. no two points in one batch are within "n" cm of each other. A point should only belong to one batch. How do I solve this?
I've used trees to draw out branches such that a new node is added only if it is "n" cm away from all other nodes in the branch. This works but is extremely slow.
Input.csv: |Point Name||X coordinate||Y coordinate|
Output: Lists of batches
From what I can guess from your question, this should do it:
import math
from random import randrange

test_data = {(randrange(0, 1000), randrange(0, 1000)) for _ in range(200)}
inf = float("inf")

def distance(p1, p2):
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])

def batch(data: set, min_distance, max_distance=inf, count=20, remove=True):
    result = [next(iter(data))]
    candidates = {t for t in data if max_distance > distance(result[0], t) > min_distance}
    while len(result) < count and len(candidates) > 0:
        result.append(candidates.pop())
        candidates = {t for t in data if all(max_distance > distance(p, t) > min_distance for p in result)}
    if len(result) < count:
        raise ValueError("Not enough values in data that have great enough distances between each other")
    if remove:
        data.difference_update(result)
    return result

print(len(test_data))
i = 1
while test_data:
    print(i, batch(test_data, 10))
    i += 1

more elegant way to build and append lists?

I know this is a lame question, but I'm a Python newbie. I came up with the code below and it works, but I was wondering if there are more efficient ways to do this.
The goal here is to calculate 4/1 - 4/3 + 4/5 - 4/7 + 4/9 - 4/11 ... for n terms.
n = 5000
x = 1.0
list_1 = [] # make a list for the denominators
list_2 = [] # make a list of fractions using list_1 as denominators
list_3 = [] # make a list with the odd elements changed to negative
for i in range(n):
    list_1.append(float(x))
    x = x + 2
for i in range(len(list_1)):
    list_2.append(4/list_1[i])
for count, i in enumerate(list_2):
    if count % 2 == 0:
        list_3.append(i)
    else:
        list_3.append(i * -1)
sum(list_3)
this would be a one-liner for your task:
s = sum((-1)**i * 4 / (2*i+1) for i in range(n))
it would be more efficient because no list is created (.append is never called); it just sums over a generator of all your elements.
if you really need the list of your elements (and not just the sum) you could construct it in a similar way:
lst = list((-1)**i * 4 / (2*i+1) for i in range(n))
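As a side note of mine (not something the answer above claims): if only the final sum is needed, math.fsum over the same generator reduces floating-point rounding error when adding many alternating terms:
import math

n = 5000
s = math.fsum((-1)**i * 4 / (2*i + 1) for i in range(n))
print(s)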
You can try to do something like this, using a function:
def calc4DivX(n):
    signal = 1
    list_values = []
    for k in range(n):
        if signal == 1:
            list_values.append(4/(2*k + 1))   # denominators 1, 3, 5, ...
        else:
            list_values.append(-4/(2*k + 1))
        signal = signal*-1  # signal keeps alternating for every iteration
    return sum(list_values)

# Call the function
value = 5000  # number of terms, e.g. the n from the question
print(calc4DivX(value))
