Python 3 multithreading output to CSV file (blank)

I am new to Python; I got this multithreading working from a tutorial I ran across. I am unsure if it is good practice or not.
What I want to achieve:
ping the list of hostnames and return up or down
write the results to a CSV file
What this file currently does:
pings the list of hostnames and returns up or down
the CSV file it creates is empty and doesn't appear to have any results written to it
I have done some testing and found that, for the pings, multithreading is approximately 16 times faster than serial code for me.
I am doing a massive number of pings (approximately 9000) and want the results returned as soon as possible.
Can you please let me know where I have gone wrong with the CSV part?
import threading
from queue import Queue
import time
import subprocess as sp
import csv

# lock to serialize console output
lock = threading.Lock()

def do_work(item):
    #time.sleep(1) # pretend to do some lengthy work.
    # Make sure the whole print completes or threads can mix up output in one line.
    status, result = sp.getstatusoutput("ping -n 3 " + str(item))
    if status == 0:
        result = 'Up'
    else:
        result = 'Down'
    with lock:
        output.writerow({'hostname': item, 'status': result})
        array.append({'hostname': item, 'status': result})
        print(threading.current_thread().name, item, result)

# The worker thread pulls an item from the queue and processes it
def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()

# Create the queue and thread pool.
q = Queue()
for i in range(100):
    t = threading.Thread(target=worker)
    t.daemon = True  # thread dies when main thread (only non-daemon thread) exits.
    t.start()

array = []
# stuff work items on the queue (in this case, hostnames).
start = time.perf_counter()
headers = ['status', 'hostname']
output = csv.DictWriter(open('host-export.csv', 'w'), delimiter=',', lineterminator='\n', fieldnames=headers)
output.writeheader()
txt = open("hosts.txt", 'r', encoding="utf8")
for line in txt:
    q.put(line)

q.join()  # block until all tasks are done
# "Work" took .1 seconds per task.
# 20 tasks serially would be 2 seconds.
# With 4 threads should be about .5 seconds (contrived because non-CPU intensive "work")
print(array)
print('time:', time.perf_counter() - start)
I also added bulk writing for the CSV, thinking maybe I just couldn't access the csv object in the function, but that also didn't work, as below.
headers = ['status','hostname']
output = csv.DictWriter(open('host-export.csv','w'), delimiter=',', lineterminator='\n', fieldnames=headers)
output.writeheader()
output.writerows(array)

I figured out what I had done wrong: I didn't close the file connection, so nothing was written to the file.
Here is the code I am using now to write my CSV file.
fieldnames = ['ip', 'dns', 'pings']  # headings
test_file = open('test2-p.csv', 'w', newline='')  # open file
csvwriter = csv.DictWriter(test_file, delimiter=',', fieldnames=fieldnames)  # set csv writing settings
csvwriter.writeheader()  # write csv headings
for row in rows:  # write to csv file
    csvwriter.writerow(row)
test_file.close()
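For what it's worth, a with statement avoids this class of bug entirely, since the file is flushed and closed automatically when the block exits. A minimal sketch along those lines (the fieldnames and the example rows list are just placeholders):

import csv

fieldnames = ['ip', 'dns', 'pings']
rows = [{'ip': '10.0.0.1', 'dns': 'host1.example.com', 'pings': 'Up'}]  # example data

# the file is closed automatically when the with-block exits,
# even if an exception is raised while writing
with open('test2-p.csv', 'w', newline='') as test_file:
    csvwriter = csv.DictWriter(test_file, delimiter=',', fieldnames=fieldnames)
    csvwriter.writeheader()
    csvwriter.writerows(rows)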

Related

Using Locust to run load test on multiple URLs loaded from CSV concurrently

Referring to my locustfile.py below:
from locust import HttpLocust, TaskSet, between, task
import csv

class UserBehavior(TaskSet):
    @task(1)
    def index(l):
        with open('topURL.csv') as csvfile:
            readCSV = csv.reader(csvfile, delimiter=',')
            for row in readCSV:
                l.client.get("%s" % (row[0]))

class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    wait_time = between(5.0, 9.0)
When I execute this script, Locust runs without any error. However, it loops through each row and load tests only the latest URL; as it reads the next URL, the previous URL is no longer being load tested. What I want instead is for Locust to load test more and more URLs concurrently as it reads the CSV row by row.
Edit
I managed to achieve partial concurrency by setting wait_time = between(0.0, 0.0)
Try filling an array with your CSV data at setup and choosing randomly from it. For example:
def fill_array():
    with open('topURL.csv') as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        for row in readCSV:
            urls.append(row[0])
then
@task(1)
def index(l):
    l.client.get("%s" % (random.choice(urls)))
More info on setups here:
https://docs.locust.io/en/stable/writing-a-locustfile.html#setups-teardowns-on-start-and-on-stop
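Putting the pieces together, a minimal locustfile sketch based on the snippets above (the file name, wait times, and urls list come from the question; reading at module load stands in for an on_start/setup hook):

from locust import HttpLocust, TaskSet, between, task
import csv
import random

urls = []

def fill_array():
    # read every target URL from the CSV once, before the test starts
    with open('topURL.csv') as csvfile:
        for row in csv.reader(csvfile, delimiter=','):
            urls.append(row[0])

fill_array()

class UserBehavior(TaskSet):
    @task(1)
    def index(l):
        # each simulated user requests a random URL on every task run,
        # so many different URLs end up being hit concurrently
        l.client.get(random.choice(urls))

class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    wait_time = between(5.0, 9.0)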
You could try something like:
global USER_CREDENTIALS
USER_CREDENTIALS = list(readCSV)
Once done, you will be able to refer to each line for each virtual user/iteration.
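For example, a minimal sketch of that idea (file name taken from the question; reading happens once at import time):

import csv

with open('topURL.csv') as csvfile:
    USER_CREDENTIALS = list(csv.reader(csvfile))

# each virtual user / iteration can then index into USER_CREDENTIALS
# instead of re-reading the file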
References:
Python import csv to list
How to Run Locust with Different Users

Get picture files with threads and a queue from a particular website

I'm trying to create a simple program in Python 3 with threads and a queue to download images concurrently from URL links, using 4 or more threads to download 4 images at the same time, saving them to the downloads folder on the PC while avoiding duplicates by sharing information between the threads.
I suppose I could use something like URL1 = "Link1"?
Here are some examples of links.
"https://unab-dw2018.s3.amazonaws.com/ldp2019/1.jpeg"
"https://unab-dw2018.s3.amazonaws.com/ldp2019/2.jpeg"
But I don't understand how to use threads with a queue, and I'm lost on how to do this.
I have tried searching for a page that explains how to use threads with a queue for concurrent downloads, but I have only found material on threads alone.
Here is code that works partially.
What I need is for the program to ask how many threads you want and then download images until it reaches image 20, but with the code below, if I input 5, it will only download 5 images and stop. The behaviour I want is: if I put 5, it downloads 5 images first, then the following 5, and so on until 20; if it's 4 images, then 4, 4, 4, 4, 4; if it's 6, then it goes 6, 6, 6 and then downloads the remaining 2.
Somehow I must implement a queue in the code, but I only learned threads a few days ago and I'm lost on how to mix threads and a queue together.
import threading
import urllib.request
import queue  # i need to use this somehow

def worker(cont):
    print("The worker is ON", cont)
    image_download = "URL" + str(cont) + ".jpeg"
    download = urllib.request.urlopen(image_download)
    file_save = open("Image " + str(cont) + ".jpeg", "wb")
    file_save.write(download.read())
    file_save.close()
    return cont + 1

threads = []
q_threads = int(input("Choose input amount of threads between 4 and 20"))
for i in range(0, q_threads):
    h = threading.Thread(target=worker, args=(i + 1,))
    threads.append(h)
for i in range(0, q_threads):
    threads[i].start()
I adapted the following from some code I used to perform multi-threaded PSO.
import threading
import queue

class picture_getter(threading.Thread):
    def __init__(self, url, picture_queue):
        self.url = url
        self.picture_queue = picture_queue
        super(picture_getter, self).__init__()

    def run(self):
        print("Starting download on " + str(self.url))
        self._get_picture()

    def _get_picture(self):
        # --- get your picture --- #
        self.picture_queue.put(picture)

if __name__ == "__main__":
    picture_queue = queue.Queue(maxsize=0)
    picture_threads = []
    picture_urls = ["string.com", "string2.com"]

    # create and start the threads
    for url in picture_urls:
        picture_threads.append(picture_getter(url, picture_queue))
        picture_threads[-1].start()

    # wait for threads to finish
    for picture_thread in picture_threads:
        picture_thread.join()

    # get the results
    picture_list = []
    while not picture_queue.empty():
        picture_list.append(picture_queue.get())
Just so you know, people on Stack Overflow like to see what you have tried first before providing a solution. However, I have this code lying around anyway. Welcome aboard, fellow newbie!
One thing I will add is that this does not avoid duplication by sharing information between threads; it avoids duplication because each thread is told what to download. If your filenames are numbered, as they appear to be in your question, this shouldn't be a problem, since you can easily build a list of them.
Updated code to address the edits to Treyon's original post:
import threading
import urllib.request
import queue
import time

class picture_getter(threading.Thread):
    def __init__(self, url, file_name, picture_queue):
        self.url = url
        self.file_name = file_name
        self.picture_queue = picture_queue
        super(picture_getter, self).__init__()

    def run(self):
        print("Starting download on " + str(self.url))
        self._get_picture()

    def _get_picture(self):
        print("{}: Simulating delay".format(self.file_name))
        time.sleep(1)
        # download and save image
        download = urllib.request.urlopen(self.url)
        file_save = open("Image " + self.file_name, "wb")
        file_save.write(download.read())
        file_save.close()
        self.picture_queue.put("Image " + self.file_name)

def remainder_or_max_threads(num_pictures, num_threads, iterations):
    # remaining pictures
    remainder = num_pictures - (num_threads * iterations)
    # if there are equal or more pictures remaining than max threads
    # return max threads, otherwise remaining number of pictures
    if remainder >= num_threads:
        return num_threads
    else:
        return remainder

if __name__ == "__main__":
    # store the response from the threads
    picture_queue = queue.Queue(maxsize=0)
    picture_threads = []
    num_pictures = 20
    url_prefix = "https://unab-dw2018.s3.amazonaws.com/ldp2019/"
    picture_names = ["{}.jpeg".format(i + 1) for i in range(num_pictures)]
    max_threads = int(input("Choose input amount of threads between 4 and 20: "))
    iterations = 0

    # during the majority of runtime iterations * max_threads is
    # the number of pictures that have been downloaded;
    # when it exceeds num_pictures all pictures have been downloaded
    while iterations * max_threads < num_pictures:
        # this returns max_threads if there are max_threads or more pictures left to download,
        # else it will return the number of remaining pictures
        threads = remainder_or_max_threads(num_pictures, max_threads, iterations)

        # loop through the next section of pictures, create and start their threads
        for name, i in zip(picture_names[iterations * max_threads:], range(threads)):
            picture_threads.append(picture_getter(url_prefix + name, name, picture_queue))
            picture_threads[i + iterations * max_threads].start()

        # wait for threads to finish
        for picture_thread in picture_threads:
            picture_thread.join()

        # increment the iterations
        iterations += 1

    # get the results
    picture_list = []
    while not picture_queue.empty():
        picture_list.append(picture_queue.get())

    print("Successfully downloaded")
    print(picture_list)
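For comparison, here is a minimal sketch of the thread-plus-queue pattern the question asks about: the queue is pre-loaded with picture numbers, and however many worker threads you start simply drain it until it is empty (the URL pattern and the 20-picture count are taken from the question):

import queue
import threading
import urllib.request

URL_PREFIX = "https://unab-dw2018.s3.amazonaws.com/ldp2019/"
NUM_PICTURES = 20

def worker(q):
    # each worker keeps pulling picture numbers until the queue is drained,
    # so no two threads ever download the same image
    while True:
        try:
            n = q.get_nowait()
        except queue.Empty:
            return
        data = urllib.request.urlopen(URL_PREFIX + "{}.jpeg".format(n)).read()
        with open("Image {}.jpeg".format(n), "wb") as f:
            f.write(data)
        q.task_done()

q = queue.Queue()
for n in range(1, NUM_PICTURES + 1):
    q.put(n)

num_threads = int(input("Choose amount of threads between 4 and 20: "))
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(num_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()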

Processing huge CSV file using Python and multithreading

I have a function that yields lines from a huge CSV file lazily:
def get_next_line():
    with open(sample_csv, 'r') as f:
        for line in f:
            yield line

def do_long_operation(row):
    print('Do some operation that takes a long time')
I need to use threads such that, for each record I get from the function above, I can call do_long_operation.
Most places on the Internet have examples like this, and I am not sure if I am on the right path.
import threading

thread_list = []
for i in range(8):
    t = threading.Thread(target=do_long_operation, args=(get_next_row from get_next_line))
    thread_list.append(t)

for thread in thread_list:
    thread.start()

for thread in thread_list:
    thread.join()
My questions are:
How do I start only a finite number of threads, say 8?
How do I make sure that each of the threads will get a row from get_next_line?
You could use a thread pool from multiprocessing and map your tasks to a pool of workers:
from multiprocessing.pool import ThreadPool as Pool
# from multiprocessing import Pool
from random import randint
from time import sleep

def process_line(l):
    print(l, "started")
    sleep(randint(0, 3))
    print(l, "done")

def get_next_line():
    with open("sample.csv", 'r') as f:
        for line in f:
            yield line

f = get_next_line()
t = Pool(processes=8)
for i in f:
    t.map(process_line, (i,))
t.close()
t.join()
This will create eight workers and submit your lines to them, one by one. As soon as a process is "free", it will be allocated a new task.
There is a commented out import statement, too. If you comment out the ThreadPool and import Pool from multiprocessing instead, you will get subprocesses instead of threads, which may be more efficient in your case.
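For reference, a lazier variant of the same idea uses imap, which pulls rows from the generator as workers become free instead of mapping one line at a time (process_line here is just a placeholder):

from multiprocessing.pool import ThreadPool

def process_line(line):
    # placeholder work; substitute the real long-running operation
    return len(line)

def get_next_line():
    with open("sample.csv", 'r') as f:
        for line in f:
            yield line

with ThreadPool(processes=8) as pool:
    # imap consumes the generator lazily and keeps all eight workers busy;
    # results come back in order as they are completed
    for result in pool.imap(process_line, get_next_line(), chunksize=16):
        print(result)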
Using a Pool/ThreadPool from multiprocessing to map tasks to a pool of workers and a Queue to control how many tasks are held in memory (so we don't read too far ahead into the huge CSV file if worker processes are slow):
from multiprocessing.pool import ThreadPool as Pool
# from multiprocessing import Pool
from random import randint
import time, os
from multiprocessing import Queue

def process_line(l):
    print("{} started".format(l))
    time.sleep(randint(0, 3))
    print("{} done".format(l))

def get_next_line():
    with open(sample_csv, 'r') as f:
        for line in f:
            yield line

# use for testing
# def get_next_line():
#     for i in range(100):
#         print('yielding {}'.format(i))
#         yield i

def worker_main(queue):
    print("{} working".format(os.getpid()))
    while True:
        # Get item from queue, block until one is available
        item = queue.get(True)
        if item is None:
            # Shutdown this worker and requeue the item so other workers can shutdown as well
            queue.put(None)
            break
        else:
            # Process item
            process_line(item)
    print("{} done working".format(os.getpid()))

f = get_next_line()

# Use a multiprocessing queue with maxsize
q = Queue(maxsize=5)

# Start workers to process queue items
t = Pool(processes=8, initializer=worker_main, initargs=(q,))

# Enqueue items. This blocks if the queue is full.
for l in f:
    q.put(l)

# Enqueue the shutdown message (i.e. None)
q.put(None)

# We need to first close the pool before joining
t.close()
t.join()
Hannu's answer is not the best method. I ran that code on a CSV file with 100 million rows, and it took forever to finish the operation.
However, prior to reading his answer, I had written the following code:
import csv
from multiprocessing import Pool
import time
import datetime

def process_row(row):
    row_to_be_printed = str(row) + str("hola!")
    print(row_to_be_printed)

def call_processing_rows_pickably(row):
    process_row(row)

class process_csv():
    def __init__(self, file_name):
        self.file_name = file_name

    def get_row_count(self):
        with open(self.file_name) as f:
            for i, l in enumerate(f):
                pass
        self.row_count = i

    def select_chunk_size(self):
        if self.row_count > 10000000:
            self.chunk_size = 100000
            return
        if self.row_count > 5000000:
            self.chunk_size = 50000
            return
        self.chunk_size = 10000
        return

    def process_rows(self):
        list_de_rows = []
        count = 0
        with open(self.file_name, 'r') as file:  # text mode for csv.reader in Python 3
            reader = csv.reader(file)
            for row in reader:
                count += 1
                print(count)
                list_de_rows.append(row)
                if len(list_de_rows) == self.chunk_size:
                    p.map(call_processing_rows_pickably, list_de_rows)
                    del list_de_rows[:]
        if list_de_rows:
            # process the final partial chunk as well
            p.map(call_processing_rows_pickably, list_de_rows)

    def start_process(self):
        self.get_row_count()
        self.select_chunk_size()
        self.process_rows()

initial = datetime.datetime.now()
p = Pool(4)
ob = process_csv("100M_primes.csv")
ob.start_process()
final = datetime.datetime.now()
print(final - initial)
This took 22 minutes. Obviously, I need more improvements. For example, the fread function in R takes at most 10 minutes for this task.
The difference is that I create a chunk of 100k rows first, and then pass it to a function which is mapped by the pool (here, 4 workers).
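The same chunking idea in a compact, self-contained form (illustrative only; the file name, chunk size, and worker count are arbitrary):

import csv
from itertools import islice
from multiprocessing import Pool

def process_row(row):
    # placeholder per-row work
    return len(row)

def chunks(reader, size):
    # yield successive lists of `size` rows from the csv reader
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk

if __name__ == "__main__":
    with open("100M_primes.csv", newline="") as f:
        with Pool(4) as pool:
            for chunk in chunks(csv.reader(f), 100000):
                pool.map(process_row, chunk)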

Threads not picking up more work from Queue

I'm pretty much brand new to Python, and I have been working on a script that parses the CSV files in any given directory. After I implemented a queue and threads, I've been stuck on this issue of the threads not picking up new work, even though there are still items in the queue. For example, if I specify the max number of threads as 3 and there are 6 items in the queue, the threads pick up 3 files, process them, and then hang indefinitely. I may just be conceptually misunderstanding the multithreading process.
ETA:
Some of the code has been removed for security reasons.
q = Queue.Queue()
threads = []

for file in os.listdir(os.chdir(arguments.path)):
    if (file.endswith('.csv')):
        q.put(file)

for i in range(max_threads):
    worker = threading.Thread(target=process, name='worker-{}'.format(thread_count))
    worker.setDaemon(True)
    worker.start()
    threads.append(worker)
    thread_count += 1

q.join()

def process():
    with open(q.get()) as csvfile:
        #do stuff
        q.task_done()
You forgot to loop over the queue in your threads...
def process():
    while True:  # <---------------- keep getting stuff from the queue
        with open(q.get()) as csvfile:
            #do stuff
            q.task_done()
That said, you are maybe re-inventing the wheel; try using a thread pool:
from concurrent.futures import ThreadPoolExecutor

l = []  # a list should do it ...
for file in os.listdir(arguments.path):
    if (file.endswith('.csv')):
        l.append(file)

def process(file):
    return "this is the file i got %s" % file

with ThreadPoolExecutor(max_workers=4) as e:
    results = list(e.map(process, l))
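If you would rather handle each file's result as soon as it finishes instead of waiting for the whole map, a small variation on the same executor idea (parse_csv is a stand-in for the real per-file work):

from concurrent.futures import ThreadPoolExecutor, as_completed
import csv
import os

def parse_csv(path):
    # stand-in worker: count the rows in one CSV file
    with open(path, newline='') as f:
        return path, sum(1 for _ in csv.reader(f))

csv_files = [f for f in os.listdir('.') if f.endswith('.csv')]

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(parse_csv, f) for f in csv_files]
    for future in as_completed(futures):
        path, rows = future.result()
        print("{}: {} rows".format(path, rows))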

Python Multithreading missing data

I am working on a Python script to check if a URL is working. The script writes the URL and response code to a log file.
To speed up the check, I am using threading and a queue.
The script works well if the number of URLs to check is small, but when I increase the number of URLs to hundreds, some URLs are simply missing from the log file.
Is there anything I need to fix?
My script is:
#!/usr/bin/env python
import Queue
import threading
import urllib2, urllib, sys, cx_Oracle, os
import time
from urllib2 import HTTPError, URLError

queue = Queue.Queue()
##print_queue = Queue.Queue()

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        infourl = urllib.addinfourl(fp, headers, req.get_full_url())
        infourl.status = code
        infourl.code = code
        return infourl
    http_error_300 = http_error_302
    http_error_301 = http_error_302
    http_error_303 = http_error_302
    http_error_307 = http_error_302

class ThreadUrl(threading.Thread):
    #Threaded Url Grab
##    def __init__(self, queue, print_queue):
    def __init__(self, queue, error_log):
        threading.Thread.__init__(self)
        self.queue = queue
##        self.print_queue = print_queue
        self.error_log = error_log

    def do_something_with_exception(self, idx, url, error_log):
        exc_type, exc_value = sys.exc_info()[:2]
##        self.print_queue.put([idx, url, exc_type.__name__])
        with open(error_log, 'a') as err_log_f:
            err_log_f.write("{0},{1},{2}\n".format(idx, url, exc_type.__name__))

    def openUrl(self, pair):
        try:
            idx = pair[1]
            url = 'http://' + pair[2]
            opener = urllib2.build_opener(NoRedirectHandler())
            urllib2.install_opener(opener)
            request = urllib2.Request(url)
            request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0.1')
            #open urls of hosts
            resp = urllib2.urlopen(request, timeout=10)
##            self.print_queue.put([idx, url, resp.code])
            with open(self.error_log, 'a') as err_log_f:
                err_log_f.write("{0},{1},{2}\n".format(idx, url, resp.code))
        except:
            self.do_something_with_exception(idx, url, self.error_log)

    def run(self):
        while True:
            #grabs host from queue
            pair = self.queue.get()
            self.openUrl(pair)
            #signals to queue job is done
            self.queue.task_done()

def readUrlFromDB(queue, connect_string, column_name, table_name):
    try:
        connection = cx_Oracle.Connection(connect_string)
        cursor = cx_Oracle.Cursor(connection)
        query = 'select ' + column_name + ' from ' + table_name
        cursor.execute(query)
        #Count lines in the file
        rows = cursor.fetchall()
        total = cursor.rowcount
        #Loop through returned urls
        for row in rows:
            #print row[1],row[2]
            ## url = 'http://'+row[2]
            queue.put(row)
        cursor.close()
        connection.close()
        return total
    except cx_Oracle.DatabaseError, e:
        print e[0].context
        raise

def main():
    start = time.time()
    error_log = "D:\\chkWebsite_Error_Log.txt"
    #Check if error_log file exists
    #If exists then deletes it
    if os.path.isfile(error_log):
        os.remove(error_log)
    #spawn a pool of threads, and pass them queue instance
    for i in range(10):
        t = ThreadUrl(queue, error_log)
        t.setDaemon(True)
        t.start()
    connect_string, column_name, table_name = "user/pass#db", "*", "T_URL_TEST"
    tn = readUrlFromDB(queue, connect_string, column_name, table_name)
    #wait on the queue until everything has been processed
    queue.join()
##    print_queue.join()
    print "Total retrived: {0}".format(tn)
    print "Elapsed Time: %s" % (time.time() - start)

main()
Python's threading module isn't truly parallel because of the global interpreter lock (http://wiki.python.org/moin/GlobalInterpreterLock); as such, you should really use multiprocessing (http://docs.python.org/library/multiprocessing.html) if you want to take advantage of multiple cores.
Also, you seem to be accessing a file simultaneously from multiple threads:
with open( self.error_log, 'a') as err_log_f:
    err_log_f.write("{0},{1},{2}\n".format(idx,url,resp.code))
This is really bad as far as I know: if two threads try to write to the same file at the same time, or almost at the same time (and keep in mind that, because of the GIL, they are not truly running in parallel), the behaviour tends to be undefined; imagine one thread writing while another has just closed the file...
Anyway, you would need a third queue to handle writing to the file.
At first glance this looks like a race condition, since many threads are trying to write to the log file at the same time. See this question for some pointers on how to lock a file for writing (so only one thread can access it at a time).
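A minimal sketch (Python 3 names) of the single-writer idea both answers point at: worker threads put their result lines on a queue, and one dedicated thread owns the log file, so writes can never interleave (the file name and the example line are placeholders):

import queue
import threading

result_queue = queue.Queue()
LOG_PATH = "chkWebsite_Error_Log.txt"  # placeholder path

def writer():
    # the only thread that touches the log file, so lines never interleave
    with open(LOG_PATH, "a") as log:
        while True:
            line = result_queue.get()
            if line is None:          # sentinel: time to stop
                break
            log.write(line + "\n")
            log.flush()

writer_thread = threading.Thread(target=writer)
writer_thread.start()

# URL-checking threads would call this instead of opening the file themselves
result_queue.put("1,http://example.com,200")

result_queue.put(None)   # shut the writer down once all checks are queued
writer_thread.join()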
