Speed up the data extraction process from dicom files - python-3.x

I am trying to extract images from DICOM files.
My folder structure is shown below:
> BATCH 4 BATCH 6 BATCH 8 Batch 29 Batch 30-35 Batch 36 Batch 37-38_1
> BATCH 5 BATCH 7 BATCH 9 Batch 29_1 Batch 30-35_1 Batch 37-38
Each batch contains thousands of DICOM images.
My broad approach is as follows:
I store all the batches in a single list, folder_list, and then iterate through every batch.
single_files contains every DICOM file in a batch, and I then iterate through each file in that batch.
After checking a few conditions on each file, I extract the image (pixel_array) and move it to the desired location.
The issue is that it is really slow, since the nested loops touch every single file in every batch. Is there a way to speed it up?
Complete code:
import glob
import os
import shutil

import cv2
import pydicom
from pydicom import pixel_data_handlers

counter = 0
Source_folder_path = '/Path/*/'
destination_dir = '/Volumes/My Book/Extracted_Dataset'
folder_list = glob.glob(Source_folder_path)

for folder_dir in folder_list:
    single_files = glob.glob(os.path.join(folder_dir, '*'))
    final_destination = os.path.join(destination_dir, folder_dir.split('/')[-2])
    for i in single_files:
        print(i)
        dcm = pydicom.dcmread(i)
        name = dcm.PatientID
        dest = os.path.join(destination_dir, os.path.join(folder_dir, name))
        if dcm.PhotometricInterpretation == 'RGB':
            if dcm.Modality == "OP":
                if not os.path.isdir(dest):
                    os.mkdir(dest)
                img = dcm.pixel_array
                name = dcm.PatientID + '_' + str(counter) + '.png'
                counter += 1
                if dcm.LossyImageCompression:
                    if dcm.LossyImageCompression == '00':
                        img = pixel_data_handlers.util.convert_color_space(img, current='RGB', desired='YBR_FULL')
                image_to_write = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
                cv2.imwrite(os.path.join(folder_dir, name), image_to_write)
                if not os.path.isdir(final_destination):
                    os.makedirs(final_destination)
                    shutil.move(os.path.join(folder_dir, name), final_destination)
                else:
                    shutil.move(os.path.join(folder_dir, name), final_destination)
Here is the modified version as per the suggestion.
My CPU utilisation and I/O utilisation are shown in the screenshots below.
Can it be sped up any more?
import glob
import os
import shutil
from multiprocessing import Pool

import cv2
import pydicom
from pydicom import pixel_data_handlers


def ProcessOne(f):
    """Process every DICOM file in a single batch folder."""
    counter = 0
    destination_dir = '/Volumes/My Book/Extracted_Dataset'
    folder_dir = f
    single_files = glob.glob(os.path.join(folder_dir, '*'))
    final_destination = os.path.join(destination_dir, folder_dir.split('/')[-2])
    for i in single_files:
        print(i)
        dcm = pydicom.dcmread(i)
        name = dcm.PatientID
        dest = os.path.join(destination_dir, os.path.join(folder_dir, name))
        if dcm.PhotometricInterpretation == 'RGB':
            if dcm.Modality == "OP":
                if not os.path.isdir(dest):
                    os.mkdir(dest)
                img = dcm.pixel_array
                name = dcm.PatientID + '_' + str(counter) + '.png'
                counter += 1
                if dcm.LossyImageCompression:
                    if dcm.LossyImageCompression == '00':
                        img = pixel_data_handlers.util.convert_color_space(img, current='RGB', desired='YBR_FULL')  # noqa
                image_to_write = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
                cv2.imwrite(os.path.join(folder_dir, name), image_to_write)
                if not os.path.isdir(final_destination):
                    os.makedirs(final_destination)
                    shutil.move(os.path.join(folder_dir, name), final_destination)  # noqa
                else:
                    shutil.move(os.path.join(folder_dir, name), final_destination)  # noqa


if __name__ == '__main__':
    # Create a pool of processes to check files
    p = Pool()
    # Create a list of batch folders to process
    Source_folder_path = '/Path/*/'  # noqa
    folder_list = glob.glob(Source_folder_path)
    print(f'Batches to process: {len(folder_list)}')
    # Map the list of batch folders onto the Pool
    p.map(ProcessOne, folder_list)
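One further reduction that may be worth trying (a sketch, not a drop-in replacement): each PNG is currently written into the source batch folder and then moved to the external drive, which roughly doubles the I/O per image. If the intermediate copy is not needed, the PNG could be written straight into final_destination, for example:

# Sketch only, assuming the same pydicom/cv2 pipeline as above; dcm is an
# already-read dataset and final_destination is the per-batch output folder.
import os

import cv2
from pydicom import pixel_data_handlers


def write_png(dcm, counter, final_destination):
    """Write one dataset's pixel data as a PNG directly into final_destination."""
    img = dcm.pixel_array
    if getattr(dcm, 'LossyImageCompression', None) == '00':
        img = pixel_data_handlers.util.convert_color_space(img, current='RGB', desired='YBR_FULL')
    os.makedirs(final_destination, exist_ok=True)
    out_name = '%s_%d.png' % (dcm.PatientID, counter)
    cv2.imwrite(os.path.join(final_destination, out_name), cv2.cvtColor(img, cv2.COLOR_RGB2BGR))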

Related

How do I run a machine learning training in the background?

I have a function for a Support Vector Classifier which runs on a scheduler on Google Cloud Platform. The function fetches the new data, adds it to the original data, trains the model on the combined data and saves it to Google Cloud Storage. All of this takes about 5 minutes to complete. I do not want to wait for the final output; instead I want to run it in the background and end the request without waiting.
Below is my function with comments:
import pickle

import gcsfs
import pandas as pd
from google.cloud import storage
from googletrans import Translator  # assuming googletrans for the Translator used below
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC


def train_model():
    users, tasks, tags, task_tags, task_user, boards = connect_postgres()  # loading the data from a postgres function
    storage_client = storage.Client()
    bucket = storage_client.get_bucket('my-bucket')
    blob = bucket.blob('original_data.pkl')
    pickle_in0 = blob.download_as_string()
    data = pickle.loads(pickle_in0)
    tasks = tasks.rename(columns={'id': 'task_id', 'name': 'task_name'})
    # Joining tasks and task_user_assigns tables
    tasks = tasks[tasks.task_name.isnull() == False]
    task_user = task_user[['id', 'task_id', 'user_id']].rename(columns={'id': 'task_user_id'})
    task_data = tasks.merge(task_user, on='task_id', how='left')
    # Joining users with the task_data
    users = users[['id', 'email']].rename(columns={'id': 'user_id'})
    users_tasks = task_data.merge(users, on='user_id', how='left')
    users_tasks = users_tasks[users_tasks.user_id.isnull() == False].reset_index(drop=True)
    # Joining boards table to user_tasks
    boards = boards[['id', 'name']].rename(columns={'id': 'board_id', 'name': 'board_name'})
    users_board = users_tasks.merge(boards, on='board_id', how='left').reset_index(drop=True)
    # Data cleaning
    translator = Translator()  # This is to translate if the tasks are not in English
    users_board["task_trans"] = users_board["task_name"].map(lambda x: translator.translate(x, dest="en").text)
    users_board['task_trans'] = users_board['task_trans'].apply(lambda x: remove_emoji(x))  # This calls a function to remove emoticons from text
    users_board['task_trans'] = users_board['task_trans'].apply(lambda x: remove_punct(x))  # This calls a function to remove punctuation from text
    users_board = users_board[['task_id', 'email', 'board_id', 'user_id', 'task_trans']]
    data1 = pd.concat([data, users_board], axis=0)
    df1 = data1.copy()
    X = df1.task_trans  # all the observations
    y = df1.user_id  # all the labels
    print(y.nunique())
    # FROM HERE ON, THE TRAINING SCRIPT BEGINS
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(X)
    tf_transformer = TfidfTransformer().fit(X_train_counts)
    X_train_transformed = tf_transformer.transform(X_train_counts)
    print('model 1 done')
    labels = LabelEncoder()
    y_train_labels_fit = labels.fit(y)
    y_train_lables_trf = labels.transform(y)
    linear_svc = LinearSVC()
    clf = linear_svc.fit(X_train_transformed, y_train_lables_trf)
    print('model 2 done')
    calibrated_svc = CalibratedClassifierCV(base_estimator=linear_svc, cv="prefit")
    calibrated_svc.fit(X_train_transformed, y_train_lables_trf)
    print('model 3 done')
    # SAVING THE MODELS ON GOOGLE CLOUD STORAGE
    # storage_client = storage.Client()
    fs = gcsfs.GCSFileSystem(project='my-project')
    filename = '~path/svc.sav'
    pickle.dump(calibrated_svc, fs.open(filename, 'wb'))
    filename = '~path/count_vectorizer.sav'
    pickle.dump(count_vect, fs.open(filename, 'wb'))
    filename = '~path/tfidf_vectorizer.sav'
    pickle.dump(tf_transformer, fs.open(filename, 'wb'))
    blob = bucket.blob('original_data.pkl')
    pickle_out = pickle.dumps(df1)
    blob.upload_from_string(pickle_out)
    return "success"
Now, I tried to do the following:
p = subprocess.Popen([sys.executable, '-c', train_model()], stdout=subprocess.PIPE, stderr=subprocess.STDOUT); print('finished')
This also took the same amount of time. Is there a way I can solve this?
Also, if I want to print the python logs for this process on client-side, is that possible?
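Note that the Popen attempt above still blocks for the full duration because train_model() is evaluated in the parent process before the subprocess is even created; only its return value is passed to Popen. A minimal sketch of one alternative, assuming train_model is importable or defined in the serving code, is to hand the function itself to a background process and return immediately:

# Sketch only: launch train_model in a separate process so the caller
# returns right away. train_model is assumed to be defined as above.
from multiprocessing import Process


def train_model_async():
    p = Process(target=train_model)  # pass the function itself, do not call it
    p.start()                        # returns immediately; training continues in the child
    return "training started"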

How can I compare 2 sets of images with OpenCV

I am using OpenCV to compare 2 images.
After a couple of days, I was able to modify it to compare an image to a list of images.
How can I compare a list of images with another list?
Ex: we have 2 folders Images1 and Images2. Images1 = te1.jpg, te2.jpg, te3.jpg; Images2 = te1.jpg, te2.jpg, te3.jpg.
I want to compare te1.jpg from Images1 with te1.jpg from Images2, te2.jpg from Images1 with te2.jpg from Images2 and te3.jpg from Images1 with te3.jpg from Images2.
Can I add both folders and make it loop through them, in order to get the corresponding image in Images2 for every image in Images1?
Here is my code so far:
import cv2
import numpy as np
import glob

original = cv2.imread("te.jpg")

# Load all the images
all_images_to_compare = []
titles = []
for f in glob.iglob("images2/*"):
    image = cv2.imread(f)
    titles.append(f)
    all_images_to_compare.append(image)

for image_to_compare, title in zip(all_images_to_compare, titles):
    # 1) Check if the 2 images are equal
    if original.shape == image_to_compare.shape:
        print("The images have the same size and channels")
        difference = cv2.subtract(original, image_to_compare)
        b, g, r = cv2.split(difference)
        #image1 = original.shape
        #image2 = duplicate.shape
        cv2.imshow("difference", difference)
        #cv2.imshow("b", b)
        #cv2.imshow("g", g)
        #cv2.imshow("r", r)
        #print(image1)
        #print(image2)
        print(cv2.countNonZero(b))
        if cv2.countNonZero(b) == 0 and cv2.countNonZero(g) == 0 and cv2.countNonZero(r) == 0:
            print("Similarity: 100% (equal size and channels)")
    # 2) Check for similarities between the 2 images
    sift = cv2.xfeatures2d.SIFT_create()
    kp_1, desc_1 = sift.detectAndCompute(original, None)
    kp_2, desc_2 = sift.detectAndCompute(image_to_compare, None)
    #print("Keypoints 1ST Image: " + str(len(kp_1)))
    #print("Keypoints 2ND Image: " + str(len(kp_2)))
    index_params = dict(algorithm=0, trees=5)
    search_params = dict()
    flann = cv2.FlannBasedMatcher(index_params, search_params)
    matches = flann.knnMatch(desc_1, desc_2, k=2)
    good_points = []
    ratio = 0.9  # less than 1
    for m, n in matches:
        if m.distance < ratio*n.distance:
            good_points.append(m)
    # Define how similar they are
    number_keypoints = 0
    if len(kp_1) <= len(kp_2):
        number_keypoints = len(kp_1)
    else:
        number_keypoints = len(kp_2)
    print("Keypoints 1ST Image: " + str(len(kp_1)))
    print("Keypoints 2ND Image: " + str(len(kp_2)))
    print("Title:" + title)
    percentage_similarity = len(good_points) / number_keypoints * 100
    print("Similarity: " + str(int(percentage_similarity)) + "%\n")
I think you just need a nested for loop?
For the folders "Images1" and "Images2" I would do it this way:
import os
import cv2

# load all image names into a list
ls_imgs1_names = os.listdir(Images1)
ls_imgs2_names = os.listdir(Images2)

# construct image paths and save them in lists
ls_imgs1_path = [os.path.join(Images1, img) for img in ls_imgs1_names]
ls_imgs2_path = [os.path.join(Images2, img) for img in ls_imgs2_names]

# list comprehensions to load the images into lists
ls_imgs1 = [cv2.imread(img) for img in ls_imgs1_path]
ls_imgs2 = [cv2.imread(img) for img in ls_imgs2_path]

for original in ls_imgs1:
    for image_to_compare in ls_imgs2:
        # compare original to image_to_compare
        # here just insert your code where you compare two images
        pass
Depending on your memory, or rather the number of images in your folders, I would either load all images directly into lists as I did above, or load the images inside the for loops, so that you loop over ls_imgs1_path and ls_imgs2_path, as sketched below.
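A minimal sketch of that lazy-loading variant, assuming the same folder layout (only the paths are collected up front and each image is read inside the loops):

import os
import cv2

# Hypothetical folder names; adjust to your actual paths.
ls_imgs1_path = [os.path.join("Images1", f) for f in sorted(os.listdir("Images1"))]
ls_imgs2_path = [os.path.join("Images2", f) for f in sorted(os.listdir("Images2"))]

for path1 in ls_imgs1_path:
    original = cv2.imread(path1)  # read one image at a time to keep memory low
    for path2 in ls_imgs2_path:
        image_to_compare = cv2.imread(path2)
        # insert the comparison code from the question here,
        # using `original` and `image_to_compare`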

Python 3 Multiprocessing and openCV problem with dictionary sharing between processor

I would like to use multiprocessing to compute the SIFT extraction and SIFT matching for object detection.
For now, I have a problem with the return value of the function: the data is never inserted into the dictionary.
I'm using the Manager class, and the images are opened inside the function, but it does not work.
Finally, my idea is:
Compute the keypoints for every reference image, then use these keypoints as parameters of a second function that compares and matches them against the keypoints and descriptors of the test image.
My code is:
# %% Import Section
import cv2
import numpy as np
from matplotlib import pyplot as plt
import os
from datetime import datetime
from multiprocessing import Process, cpu_count, Manager, Lock
import argparse

# %% path section
tests_path = 'TestImages/'
references_path = 'ReferenceImages2/'
result_path = 'ResultParametrizer/'

# %% Number of processors
cpus = cpu_count()

# %% parameter section
eps = 1e-7
useTwo = False  # using the m and n keypoint better with False
# good point parameters
distanca_coefficient = 0.75
# gms parameters
gms_thresholdFactor = 3
gms_withRotation = True
gms_withScale = True
# flann parameters
flann_trees = 5
flann_checks = 50

# %% Locker
lock = Lock()

# %% function definition
def keypointToDictionaries(keypoint):
    x, y = keypoint.pt
    pt = float(x), float(y)
    angle = float(keypoint.angle) if keypoint.angle is not None else None
    size = float(keypoint.size) if keypoint.size is not None else None
    response = float(keypoint.response) if keypoint.response is not None else None
    class_id = int(keypoint.class_id) if keypoint.class_id is not None else None
    octave = int(keypoint.octave) if keypoint.octave is not None else None
    return {
        'point': pt,
        'angle': angle,
        'size': size,
        'response': response,
        'class_id': class_id,
        'octave': octave
    }


def dictionariesToKeypoint(dictionary):
    kp = cv2.KeyPoint()
    kp.pt = dictionary['pt']
    kp.angle = dictionary['angle']
    kp.size = dictionary['size']
    kp.response = dictionary['response']
    kp.octave = dictionary['octave']
    kp.class_id = dictionary['class_id']
    return kp


def rootSIFT(dictionary, image_name, image_path, eps=eps):
    # SIFT init
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.xfeatures2d.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)
    descriptors /= (descriptors.sum(axis=1, keepdims=True) + eps)
    descriptors = np.sqrt(descriptors)
    print('Finished computing, PID: ', os.getpid())
    lock.acquire()
    dictionary[image_name]['keypoints'] = keypoints
    dictionary[image_name]['descriptors'] = descriptors
    lock.release()


def featureMatching(reference_image, reference_descriptors, reference_keypoints, test_image, test_descriptors,
                    test_keypoints, flann_trees=flann_trees, flann_checks=flann_checks):
    # FLANN parameters
    FLANN_INDEX_KDTREE = 1
    index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=flann_trees)
    search_params = dict(checks=flann_checks)  # or pass empty dictionary
    flann = cv2.FlannBasedMatcher(index_params, search_params)
    flann_matches = flann.knnMatch(reference_descriptors, test_descriptors, k=2)
    matches_copy = []
    for i, (m, n) in enumerate(flann_matches):
        if m.distance < distanca_coefficient * n.distance:
            matches_copy.append(m)
    gsm_matches = cv2.xfeatures2d.matchGMS(reference_image.shape, test_image.shape, keypoints1=reference_keypoints,
                                           keypoints2=test_keypoints, matches1to2=matches_copy,
                                           withRotation=gms_withRotation, withScale=gms_withScale,
                                           thresholdFactor=gms_thresholdFactor)


# %% Starting reference list file creation
reference_init = datetime.now()
print('Start reference file list creation')
reference_image_process_list = []
manager = Manager()
reference_image_dictionary = manager.dict()
reference_image_list = manager.list()
for root, directories, files in os.walk(references_path):
    for file in files:
        if file.endswith('.DS_Store'):
            continue
        reference_image_path = os.path.join(root, file)
        reference_name = file.split('.')[0]
        image = cv2.imread(reference_image_path, cv2.IMREAD_GRAYSCALE)
        reference_image_dictionary[reference_name] = {
            'image': image,
            'keypoints': None,
            'descriptors': None
        }
        proc = Process(target=rootSIFT, args=(reference_image_list, reference_name, reference_image_path))
        reference_image_process_list.append(proc)
        proc.start()
for proc in reference_image_process_list:
    proc.join()
reference_end = datetime.now()
reference_time = reference_end - reference_init
print('End reference file list creation, time required: ', reference_time)
I faced pretty much the same error. In my case the code seems to hang at detectAndCompute, not when creating the dictionary. For some reason, SIFT feature extraction is not multiprocessing-safe (to my understanding this is the case on Macs, but I am not totally sure).
I found this in a GitHub thread. Many people say it works, but I couldn't get it to work. (Edit: I tried this later and it works fine.)
Instead I used multithreading, which is pretty much the same code and works perfectly. Of course you need to take the trade-offs between multithreading and multiprocessing into account.
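For illustration, a minimal sketch of that multithreading variant, assuming rootSIFT and references_path are defined as in the question; with threads a plain dict is enough, so no Manager is needed:

# Sketch only: run rootSIFT in worker threads instead of separate processes.
import os
from concurrent.futures import ThreadPoolExecutor

import cv2

reference_image_dictionary = {}  # plain dict is fine within one process
futures = []
with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
    for root, directories, files in os.walk(references_path):
        for file in files:
            if file.endswith('.DS_Store'):
                continue
            reference_name = file.split('.')[0]
            reference_image_path = os.path.join(root, file)
            reference_image_dictionary[reference_name] = {
                'image': cv2.imread(reference_image_path, cv2.IMREAD_GRAYSCALE),
                'keypoints': None,
                'descriptors': None,
            }
            futures.append(executor.submit(rootSIFT, reference_image_dictionary,
                                           reference_name, reference_image_path))
for future in futures:
    future.result()  # surfaces any exception raised inside a worker thread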

Multiprocessing queue of files in folders

The question is how to properly process files with Python 3.7 multiprocessing when crawling directories recursively.
My code is as following:
from multiprocessing import Pool


def f(directoryout, directoryoutfailed, datafile, filelist_failed, imagefile, rootpath, extension, debug):
    # [...] some processing
    pass


if __name__ == '__main__':
    import csv
    import os

    debug = 0
    timeout = 20
    if debug == 0:
        folder = '/home/debian/Desktop/environments/dedpul/files/fp/'
        datafile = 'fpdata.csv'  # file with results
        directoryout = 'fp_out'  # out directory for debugging
        directoryoutfailed = 'fp_out_failed'  # out directory for wrongly processed files in debugging mode
        filelist = 'filelist.csv'  # list of processed files
        filelist_failed = 'filelist_failed.csv'  # list of wrongly processed files

    counter = 0
    pool = Pool(processes=4)
    for root, subFolders, files in os.walk(folder):
        for imagefile in files:
            rootpath = root + '/'
            fullpath = root + '/' + imagefile
            extension = os.path.splitext(imagefile)[1]
            imagefilesplit = imagefile.split('.')[0]
            counter += 1
            print('\033[93m ## ', counter, ' ## \033[0m', rootpath)

            fileexist = 0
            with open(filelist) as csv_file:
                csv_reader = csv.reader(csv_file, delimiter=',')
                for row in csv_reader:
                    if row[0] == fullpath:
                        fileexist = 1
            if fileexist == 1:
                print(' File was processed, skipping...')
                continue
            with open(filelist, mode='a') as csv_file:
                writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                writer.writerow([fullpath])

            # print(directoryout,directoryoutfailed,datafile,filelist_failed,imagefile,rootpath,extension,debug)
            res = pool.apply(f, (directoryout, directoryoutfailed, datafile, filelist_failed, imagefile, rootpath, extension, debug))

    pool.close()
    pool.join()
1st, when I'm using pool.apply_async it uses all cores, but it doesn't process the function f() correctly. With pool.apply() it works, but only on a single core.
2nd, as you can see, I'm recursively crawling the list of files in the folders in a loop. If a file has already been processed, the loop should continue. Should I do that in the __main__ block, or should it be moved into f()? If so, how do I exchange information about what is currently being processed, which takes a few seconds per file?
3rd, the function f() is independent: it processes an image file, appends the results to fpdata.csv (or appends the name of a badly processed file to filelist_failed.csv) and finishes without returning anything, so no real output is needed. I just need to start this function via multiprocessing.
What am I doing wrong? Should I use the
with Pool(processes=4) as pool:
statement?
Before asking this question I browsed tons of answers, but apparently this kind of file processing is extremely hard to find, in the Python manual as well.
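For reference, a minimal sketch of the asynchronous submission pattern being discussed, assuming the same f() from the question; job_args is a hypothetical list of argument tuples built inside the os.walk loop. pool.apply blocks until each call finishes, which is why it runs on one core, while apply_async queues the call and returns immediately; exceptions raised inside f() are silent with apply_async unless the results are collected, which can make it look as if f() is not processed correctly. Note that the with-statement form of Pool terminates the pool on exit, so results should be collected inside the block.

# Sketch only: submit work asynchronously instead of the blocking pool.apply.
from multiprocessing import Pool


def log_error(exc):
    # apply_async swallows worker exceptions unless they are collected;
    # an error_callback makes them visible
    print('worker failed:', exc)


if __name__ == '__main__':
    with Pool(processes=4) as pool:
        async_results = [pool.apply_async(f, args, error_callback=log_error)
                         for args in job_args]
        for r in async_results:
            r.get()  # wait for all workers and re-raise any remaining errors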

Why can't I split files when generating some TFrecord files?

I'm doing some work predicting protein structures. As you may know, one protein molecule can have several strands, so I need to split the list of atoms into different TFRecords by strand name.
The problem is that this code ends up generating several TFRecord files with nothing written in them, all blank.
Alternatively, is there a method to split the strands while training my model? Then I could ignore this problem and put the strand name in the TFRecords as a feature.
'''
with all modules imported and no errors raised
'''
def generate_TFrecord(intPosition, endPosition, path):
    CrtS = x  # x is the name of the current strand
    path = path + CrtS
    writer = tf.io.TFRecordWriter('%s.tfrecord' % path)
    for i in range(intPosition, endPosition):
        if identifyCoreCarbon(i):
            vectros = getVectors(i)
            features = {}
            '''
            feeding this dict
            '''
            tf_features = tf.train.Features(feature=features)
            tf_example = tf.train.Example(features=tf_features)
            tf_serialized = tf_example.SerializeToString()
            writer.write(tf_serialized)
            '''
            if checkStrand(i) == False:
                writer.write(tf_serialized)
                intPosition = i
            '''
    writer.close()

'''
strand_index is a list of the start points of all the strands
'''
for loop in strand_index:
    generate_TFrecord(loop, endPosition, path)

'''
________division___________
The code below works, but only generates a single tfrecord containing all the atom information.

writer = tf.io.TFRecordWriter('%s.tfrecord' % path)
for i in range(0, endPosition):
    if identifyCoreCarbon(i):
        vectros = getVectors(i)
        features = {}
        # feeding features
        tf_features = tf.train.Features(feature=features)
        tf_example = tf.train.Example(features=tf_features)
        tf_serialized = tf_example.SerializeToString()
        writer.write(tf_serialized)
writer.close()
'''
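One possible direction, sketched under the assumption that identifyCoreCarbon, getVectors, endPosition and path behave as in the question, and that a hypothetical helper getStrandName(i) can report which strand atom i belongs to: keep one open TFRecordWriter per strand in a dictionary and route each serialized example to the writer for its strand, closing them all at the end.

# Sketch only. identifyCoreCarbon, getVectors, endPosition and path come from
# the surrounding code; getStrandName is a hypothetical helper.
import tensorflow as tf

writers = {}  # one TFRecordWriter per strand name
for i in range(0, endPosition):
    if not identifyCoreCarbon(i):
        continue
    vectors = getVectors(i)
    features = {}  # fill the feature dict as in the question
    strand = getStrandName(i)  # hypothetical: returns the strand of atom i
    if strand not in writers:
        writers[strand] = tf.io.TFRecordWriter('%s%s.tfrecord' % (path, strand))
    tf_example = tf.train.Example(features=tf.train.Features(feature=features))
    writers[strand].write(tf_example.SerializeToString())
for w in writers.values():
    w.close()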
