How do I run machine learning training in the background? - python-3.x

I have a function that trains a Support Vector Classifier and runs on a scheduler on Google Cloud Platform. The function fetches the new data, adds it to the original data, trains the model on the combined data, and saves it to Google Cloud Storage. All of this takes 5 minutes to complete. I don't want to wait for the final output; I want to run the training in the background and end the request without waiting.
Below is my function with comments:
def train_model():
    users, tasks, tags, task_tags, task_user, boards = connect_postgres()  # loading the data from a postgres function
    storage_client = storage.Client()
    bucket = storage_client.get_bucket('my-bucket')
    blob = bucket.blob('original_data.pkl')
    pickle_in0 = blob.download_as_string()
    data = pickle.loads(pickle_in0)

    tasks = tasks.rename(columns={'id': 'task_id', 'name': 'task_name'})
    # Joining tasks and task_user_assigns tables
    tasks = tasks[tasks.task_name.isnull() == False]
    task_user = task_user[['id', 'task_id', 'user_id']].rename(columns={'id': 'task_user_id'})
    task_data = tasks.merge(task_user, on='task_id', how='left')
    # Joining users with the task_data
    users = users[['id', 'email']].rename(columns={'id': 'user_id'})
    users_tasks = task_data.merge(users, on='user_id', how='left')
    users_tasks = users_tasks[users_tasks.user_id.isnull() == False].reset_index(drop=True)
    # Joining boards table to user_tasks
    boards = boards[['id', 'name']].rename(columns={'id': 'board_id', 'name': 'board_name'})
    users_board = users_tasks.merge(boards, on='board_id', how='left').reset_index(drop=True)

    # Data cleaning
    translator = Translator()  # translate tasks that are not in English
    users_board["task_trans"] = users_board["task_name"].map(lambda x: translator.translate(x, dest="en").text)
    users_board['task_trans'] = users_board['task_trans'].apply(lambda x: remove_emoji(x))  # helper that removes emoticons from text
    users_board['task_trans'] = users_board['task_trans'].apply(lambda x: remove_punct(x))  # helper that removes punctuation from text
    users_board = users_board[['task_id', 'email', 'board_id', 'user_id', 'task_trans']]

    data1 = pd.concat([data, users_board], axis=0)
    df1 = data1.copy()
    X = df1.task_trans  # all the observations
    y = df1.user_id     # all the labels
    print(y.nunique())

    # FROM HERE ON, THE TRAINING SCRIPT BEGINS
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(X)
    tf_transformer = TfidfTransformer().fit(X_train_counts)
    X_train_transformed = tf_transformer.transform(X_train_counts)
    print('model 1 done')

    labels = LabelEncoder()
    y_train_labels_fit = labels.fit(y)
    y_train_lables_trf = labels.transform(y)

    linear_svc = LinearSVC()
    clf = linear_svc.fit(X_train_transformed, y_train_lables_trf)
    print('model 2 done')

    calibrated_svc = CalibratedClassifierCV(base_estimator=linear_svc, cv="prefit")
    calibrated_svc.fit(X_train_transformed, y_train_lables_trf)
    print('model 3 done')

    # SAVING THE MODELS ON GOOGLE CLOUD STORAGE
    # storage_client = storage.Client()
    fs = gcsfs.GCSFileSystem(project='my-project')
    filename = '~path/svc.sav'
    pickle.dump(calibrated_svc, fs.open(filename, 'wb'))
    filename = '~path/count_vectorizer.sav'
    pickle.dump(count_vect, fs.open(filename, 'wb'))
    filename = '~path/tfidf_vectorizer.sav'
    pickle.dump(tf_transformer, fs.open(filename, 'wb'))

    blob = bucket.blob('original_data.pkl')
    pickle_out = pickle.dumps(df1)
    blob.upload_from_string(pickle_out)

    return "success"
Now, I tried to do the following:
p = subprocess.Popen([sys.executable, '-c', train_model()], stdout=subprocess.PIPE, stderr=subprocess.STDOUT); print('finished')
This also took the same amount of time to return. Is there a way I can solve this?
Also, is it possible to print the Python logs for this background process on the client side?
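A note on the attempt above: subprocess.Popen([sys.executable, '-c', train_model()], ...) evaluates train_model() in the current process before the child is even spawned, which is why the wait is unchanged. A minimal sketch of one way to hand the work off instead (assuming train_model lives in an importable module; trainer is an illustrative name, not from the original code):

import multiprocessing

from trainer import train_model  # illustrative module name


def start_training():
    # Spawn a worker process and return immediately; the child keeps training.
    worker = multiprocessing.Process(target=train_model)
    worker.start()
    return "training started"

Be aware that on request-scoped platforms such as Cloud Run, background work started after the response may be throttled or killed; a queue-based handoff (e.g. Cloud Tasks or Pub/Sub triggering a separate service) is usually the more robust route, and it also gives you a natural place to collect the logs mentioned above.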

Related

How to update the weights of a pickled file?

I am training a Calibrated Classifier on Google Cloud Scheduler every day, which takes about 5 minutes to run. My Python script receives the latest data (from that day), concatenates it to the original data, trains the model, and saves the pickled files to Cloud Storage. The issue I am facing now is that if the job takes more than 5 minutes (which it will at some point), it gives an upstream request timeout error.
I imagine that is because the model is taking longer and longer to train, and I can think of one solution where I train the model only on the new data and update the weights of the original model in the pickled file. However, I am not sure if that's possible.
Below is my function that runs on the scheduler:
def train_model():
    users, tasks, tags, task_tags, task_user, boards = connect_postgres()  # loading the data from a postgres function
    storage_client = storage.Client()
    bucket = storage_client.get_bucket('my-bucket')
    blob = bucket.blob('original_data.pkl')
    pickle_in0 = blob.download_as_string()
    data = pickle.loads(pickle_in0)

    tasks = tasks.rename(columns={'id': 'task_id', 'name': 'task_name'})
    # Joining tasks and task_user_assigns tables
    tasks = tasks[tasks.task_name.isnull() == False]
    task_user = task_user[['id', 'task_id', 'user_id']].rename(columns={'id': 'task_user_id'})
    task_data = tasks.merge(task_user, on='task_id', how='left')
    # Joining users with the task_data
    users = users[['id', 'email']].rename(columns={'id': 'user_id'})
    users_tasks = task_data.merge(users, on='user_id', how='left')
    users_tasks = users_tasks[users_tasks.user_id.isnull() == False].reset_index(drop=True)
    # Joining boards table to user_tasks
    boards = boards[['id', 'name']].rename(columns={'id': 'board_id', 'name': 'board_name'})
    users_board = users_tasks.merge(boards, on='board_id', how='left').reset_index(drop=True)

    # Data cleaning
    translator = Translator()  # translate tasks that are not in English
    users_board["task_trans"] = users_board["task_name"].map(lambda x: translator.translate(x, dest="en").text)
    users_board['task_trans'] = users_board['task_trans'].apply(lambda x: remove_emoji(x))  # helper that removes emoticons from text
    users_board['task_trans'] = users_board['task_trans'].apply(lambda x: remove_punct(x))  # helper that removes punctuation from text
    users_board = users_board[['task_id', 'email', 'board_id', 'user_id', 'task_trans']]

    data1 = pd.concat([data, users_board], axis=0)
    df1 = data1.copy()
    X = df1.task_trans  # all the observations
    y = df1.user_id     # all the labels
    print(y.nunique())

    # FROM HERE ON, THE TRAINING SCRIPT BEGINS
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(X)
    tf_transformer = TfidfTransformer().fit(X_train_counts)
    X_train_transformed = tf_transformer.transform(X_train_counts)
    print('model 1 done')

    labels = LabelEncoder()
    y_train_labels_fit = labels.fit(y)
    y_train_lables_trf = labels.transform(y)

    linear_svc = LinearSVC()
    clf = linear_svc.fit(X_train_transformed, y_train_lables_trf)
    print('model 2 done')

    calibrated_svc = CalibratedClassifierCV(base_estimator=linear_svc, cv="prefit")
    calibrated_svc.fit(X_train_transformed, y_train_lables_trf)
    print('model 3 done')

    # SAVING THE MODELS ON GOOGLE CLOUD STORAGE
    # storage_client = storage.Client()
    fs = gcsfs.GCSFileSystem(project='my-project')
    filename = '~path/svc.sav'
    pickle.dump(calibrated_svc, fs.open(filename, 'wb'))
    filename = '~path/count_vectorizer.sav'
    pickle.dump(count_vect, fs.open(filename, 'wb'))
    filename = '~path/tfidf_vectorizer.sav'
    pickle.dump(tf_transformer, fs.open(filename, 'wb'))

    blob = bucket.blob('data.pkl')
    pickle_out = pickle.dumps(df1)
    blob.upload_from_string(pickle_out)

    return "success"
Any idea how to achieve that? Or any other strategy I can follow to solve this problem?
I couldn't find a way to update the weights inside a pickled file and eventually settled for increasing the timeout parameter on Cloud Run to more than the training time, which fixed the issue for the time being.
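For reference, the "update the weights" idea maps onto estimators that support incremental learning via partial_fit. LinearSVC does not offer it, but SGDClassifier with hinge loss is a close linear-SVM relative that does. A rough sketch under that substitution (paths, variable names, and the fixed label list are assumptions, not part of the original code):

import pickle

from sklearn.linear_model import SGDClassifier

# First run: fit on the historical data and persist the estimator.
clf = SGDClassifier(loss='hinge')                      # hinge loss ~ linear SVM
clf.partial_fit(X_train_transformed, y_train_lables_trf,
                classes=all_label_ids)                 # every label must be known up front
with fs.open('~path/sgd_svc.sav', 'wb') as f:          # illustrative path
    pickle.dump(clf, f)

# Daily run: load the pickled estimator and update it on the new rows only.
with fs.open('~path/sgd_svc.sav', 'rb') as f:
    clf = pickle.load(f)
clf.partial_fit(X_new_transformed, y_new_labels_trf)

Note that the text features would also have to stay consistent between runs, e.g. by switching to a fixed-dimension HashingVectorizer, otherwise the updated weights would no longer line up with the feature columns.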

Speed up the data extraction process from dicom files

I am trying to extract images from DICOM files.
My folder structure is as below -
> BATCH 4 BATCH 6 BATCH 8 Batch 29 Batch 30-35 Batch 36 Batch 37-38_1
> BATCH 5 BATCH 7 BATCH 9 Batch 29_1 Batch 30-35_1 Batch 37-38
Each batch contains thousands of DICOM images.
My broad approach is below -
I am storing all the batches in a single list folder_list and then iterating through all of the batches.
single_files contains every DICOM file in each batch, and I then iterate through each file in a batch.
After checking a few conditions on each file, I extract the image (pixel_array) and move it to the desired location.
The issue is that it is really slow and the nested loops give it O(n^2)-like behaviour; is there a way to speed it up?
Complete code-
import glob
import os
import shutil

import cv2
import pydicom
from pydicom import pixel_data_handlers

counter = 0
Source_folder_path = '/Path/*/'
destination_dir = '/Volumes/My Book/Extracted_Dataset'
folder_list = glob.glob(Source_folder_path)
for folder_dir in folder_list:
    single_files = glob.glob(os.path.join(folder_dir, '*'))
    final_destination = os.path.join(destination_dir, folder_dir.split('/')[-2])
    for i in single_files:
        print(i)
        dcm = pydicom.dcmread(i)
        name = dcm.PatientID
        dest = os.path.join(destination_dir, os.path.join(folder_dir, name))
        if dcm.PhotometricInterpretation == 'RGB':
            if dcm.Modality == "OP":
                if not os.path.isdir(dest):
                    os.mkdir(dest)
                img = dcm.pixel_array
                name = dcm.PatientID + '_' + str(counter) + '.png'
                counter += 1
                if dcm.LossyImageCompression:
                    if dcm.LossyImageCompression == '00':
                        img = pixel_data_handlers.util.convert_color_space(img, current='RGB', desired='YBR_FULL')
                image_to_write = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
                cv2.imwrite(os.path.join(folder_dir, name), image_to_write)
                if not os.path.isdir(final_destination):
                    os.makedirs(final_destination)
                    shutil.move(os.path.join(folder_dir, name), final_destination)
                else:
                    shutil.move(os.path.join(folder_dir, name), final_destination)
Modified version as per the suggestion is below. (The CPU and I/O utilisation screenshots are omitted here.) Can it be sped up more?
import glob
import os
import shutil
from multiprocessing import Pool

import cv2
import pydicom
from pydicom import pixel_data_handlers


def ProcessOne(f):
    """Process every DICOM file in one batch folder (runs in a worker process)."""
    counter = 0
    destination_dir = '/Volumes/My Book/Extracted_Dataset'
    folder_dir = f
    single_files = glob.glob(os.path.join(folder_dir, '*'))
    final_destination = os.path.join(destination_dir, folder_dir.split('/')[-2])
    for i in single_files:
        print(i)
        dcm = pydicom.dcmread(i)
        name = dcm.PatientID
        dest = os.path.join(destination_dir, os.path.join(folder_dir, name))
        if dcm.PhotometricInterpretation == 'RGB':
            if dcm.Modality == "OP":
                if not os.path.isdir(dest):
                    os.mkdir(dest)
                img = dcm.pixel_array
                name = dcm.PatientID + '_' + str(counter) + '.png'
                counter += 1
                if dcm.LossyImageCompression:
                    if dcm.LossyImageCompression == '00':
                        img = pixel_data_handlers.util.convert_color_space(img, current='RGB', desired='YBR_FULL')  # noqa
                image_to_write = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
                cv2.imwrite(os.path.join(folder_dir, name), image_to_write)
                if not os.path.isdir(final_destination):
                    os.makedirs(final_destination)
                    shutil.move(os.path.join(folder_dir, name), final_destination)  # noqa
                else:
                    shutil.move(os.path.join(folder_dir, name), final_destination)  # noqa


if __name__ == '__main__':
    # Create a pool of worker processes
    p = Pool()
    # Create the list of batch folders to process
    Source_folder_path = '/Path/*/'  # noqa
    folder_list = glob.glob(Source_folder_path)
    print(f'Batches to process: {len(folder_list)}')
    # Map the list of batch folders onto the pool, one folder per task
    p.map(ProcessOne, folder_list)
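If the CPU is still not saturated (batch sizes vary a lot, so some workers finish early), one further option is to parallelise over individual files instead of whole batches. This is only a rough sketch; process_one_file is a hypothetical per-file refactor of ProcessOne:

import glob
import os
from itertools import chain
from multiprocessing import Pool


def process_one_file(path):
    # Same per-file logic as inside ProcessOne: read, check tags, convert, write, move.
    ...


if __name__ == '__main__':
    folder_list = glob.glob('/Path/*/')
    # Flatten all batch folders into one list of files.
    all_files = list(chain.from_iterable(glob.glob(os.path.join(d, '*')) for d in folder_list))
    with Pool() as pool:
        # A flat task list lets the pool balance work across uneven batches;
        # chunksize keeps scheduling overhead low with thousands of small tasks.
        pool.map(process_one_file, all_files, chunksize=32)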

Write file name based on return

I'm creating a boto3 script that scrapes and uploads our entire account's public IPs and NAT gateway IPs to our S3 bucket. I'm stuck on writing files for both return values. I would ideally like to write two separate files while still using the same filename variable you see in main(). Right now I can only get this to work with one of the returns (either nat_ips or public_ips).
import boto3
from datetime import datetime
from csv import writer


def get_ips():
    # Uses STS to assume the role needed.
    boto_sts = boto3.client('sts')
    sts_response = boto_sts.assume_role(
        RoleArn='arn:aws:iam::1234:role/foo',
        RoleSessionName='Foo'
    )
    # Save the details from the assumed role into vars
    sts_credentials = sts_response["Credentials"]
    session_id = sts_credentials["AccessKeyId"]
    session_key = sts_credentials["SecretAccessKey"]
    session_token = sts_credentials["SessionToken"]
    # List and store all the regions
    ec2_client = boto3.client('ec2', aws_access_key_id=session_id, aws_secret_access_key=session_key, aws_session_token=session_token, region_name='us-west-1')
    all_regions = [region['RegionName'] for region in ec2_client.describe_regions()['Regions']]
    nat_ips = []
    public_ips = []
    for region in all_regions:
        max_results = 1000
        next_token = ''
        ec2_client = boto3.client('ec2', aws_access_key_id=session_id, aws_secret_access_key=session_key, aws_session_token=session_token, region_name=region)
        session = boto3.Session(aws_access_key_id=session_id, aws_secret_access_key=session_key, aws_session_token=session_token, region_name=region)
        while next_token or next_token == '':
            response = ec2_client.describe_nat_gateways(MaxResults=max_results, NextToken=next_token)
            filters = [{'Name': 'tag:Name', 'Values': ['*sgw-eip']}]
            get_ips = ec2_client.describe_addresses(Filters=filters)
            for gateway in response["NatGateways"]:
                for address in gateway["NatGatewayAddresses"]:
                    nat_ips.append(address["PublicIp"] + '/32')
            for eip_dict in get_ips['Addresses']:
                public_ip_string = eip_dict['Tags'][0]['Value'] + ' : ' + eip_dict['PublicIp']
                public_ips.append(public_ip_string)
            next_token = response.get("NextToken", None)
    return nat_ips, public_ips


def _s3_upload(filename):
    s3 = boto3.resource('s3')
    bucket = 'foo-bar'
    object_name = 'foo/'
    s3.meta.client.upload_file(Filename=filename, Bucket=bucket, Key=object_name + filename)
    print(f'Uploading {filename} to {bucket}')


def write_list_to_file(filename, data):
    lines_string = '\n'.join(str(x) for x in data)
    with open(filename, 'w') as output:
        output.writelines(lines_string)
    print(f'Writing file to {filename}')


if __name__ == "__main__":
    date = datetime.now().strftime('%Y%m%d')
    # Stuck here since I want to make it one variable
    filename_nat_ips = f'natgateway_ips{date}.csv'
    filename_sga_ips = f'sga_ips{date}.csv'
    public_ips = get_ips()
    nat_ips = get_ips()
    print(filename)
    write_list_to_file(filename, nat_ips)
    _s3_upload(filename)
I see that you are already returning a tuple of nat_ips and public_ips from your get_ips() function, so in your main you can collect them together as well.
You might try something like this:
if __name__ == "__main__":
    date = datetime.now().strftime('%Y%m%d')
    # Stuck here since I want to make it one variable
    filename_nat_ips = f'natgateway_ips{date}.csv'
    filename_sga_ips = f'sga_ips{date}.csv'
    nat_ips, public_ips = get_ips()
    write_list_to_file(filename_nat_ips, nat_ips)
    write_list_to_file(filename_sga_ips, public_ips)
    _s3_upload(filename_nat_ips)
    _s3_upload(filename_sga_ips)
I was doing it right the first time and was then trying to make it more complicated.
if __name__ == "__main__":
    date = datetime.now().strftime('%Y%m%d')
    filename_nat_ips = f'natgateway_ips{date}.csv'
    filename_sga_ips = f'sga_ips{date}.csv'
    nat_ips, public_ips = get_ips()
    print(filename_nat_ips)
    print(filename_sga_ips)
    write_list_to_file(filename_nat_ips, nat_ips)
    write_list_to_file(filename_sga_ips, public_ips)
    _s3_upload(filename_nat_ips)
    _s3_upload(filename_sga_ips)
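If the goal is to avoid juggling two filename variables, one possible variant (a sketch using the same helpers as above) is to pair each filename with its data in a dict and loop:

if __name__ == "__main__":
    date = datetime.now().strftime('%Y%m%d')
    nat_ips, public_ips = get_ips()
    # Pair each output file with the list that should go into it.
    outputs = {
        f'natgateway_ips{date}.csv': nat_ips,
        f'sga_ips{date}.csv': public_ips,
    }
    for filename, data in outputs.items():
        write_list_to_file(filename, data)
        _s3_upload(filename)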

Python 3 multiprocessing and OpenCV problem with dictionary sharing between processes

I would like to use multiprocessing to compute the SIFT extraction and SIFT matching for object detection.
Right now my problem is that the function's results never get inserted into the shared dictionary.
I'm using the Manager class, and the images are opened inside the function, but it does not work.
Finally, my idea is:
compute the keypoints for every reference image, then use them as parameters of a second function that compares and matches them against the keypoints and descriptors of the test image.
My code is:
# %% Import Section
import cv2
import numpy as np
from matplotlib import pyplot as plt
import os
from datetime import datetime
from multiprocessing import Process, cpu_count, Manager, Lock
import argparse

# %% path section
tests_path = 'TestImages/'
references_path = 'ReferenceImages2/'
result_path = 'ResultParametrizer/'

# %% Number of processors
cpus = cpu_count()

# %% parameter section
eps = 1e-7
useTwo = False  # using the m and n keypoint better with False
# good point parameters
distanca_coefficient = 0.75
# gms parameters
gms_thresholdFactor = 3
gms_withRotation = True
gms_withScale = True
# flann parameters
flann_trees = 5
flann_checks = 50

# %% Locker
lock = Lock()

# %% function definition
def keypointToDictionaries(keypoint):
    x, y = keypoint.pt
    pt = float(x), float(y)
    angle = float(keypoint.angle) if keypoint.angle is not None else None
    size = float(keypoint.size) if keypoint.size is not None else None
    response = float(keypoint.response) if keypoint.response is not None else None
    class_id = int(keypoint.class_id) if keypoint.class_id is not None else None
    octave = int(keypoint.octave) if keypoint.octave is not None else None
    return {
        'point': pt,
        'angle': angle,
        'size': size,
        'response': response,
        'class_id': class_id,
        'octave': octave
    }


def dictionariesToKeypoint(dictionary):
    kp = cv2.KeyPoint()
    kp.pt = dictionary['pt']
    kp.angle = dictionary['angle']
    kp.size = dictionary['size']
    kp.response = dictionary['response']
    kp.octave = dictionary['octave']
    kp.class_id = dictionary['class_id']
    return kp


def rootSIFT(dictionary, image_name, image_path, eps=eps):
    # SIFT init
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.xfeatures2d.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)
    descriptors /= (descriptors.sum(axis=1, keepdims=True) + eps)
    descriptors = np.sqrt(descriptors)
    print('Finished computing, PID: ', os.getpid())
    lock.acquire()
    dictionary[image_name]['keypoints'] = keypoints
    dictionary[image_name]['descriptors'] = descriptors
    lock.release()


def featureMatching(reference_image, reference_descriptors, reference_keypoints, test_image, test_descriptors,
                    test_keypoints, flann_trees=flann_trees, flann_checks=flann_checks):
    # FLANN parameters
    FLANN_INDEX_KDTREE = 1
    index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=flann_trees)
    search_params = dict(checks=flann_checks)  # or pass an empty dictionary
    flann = cv2.FlannBasedMatcher(index_params, search_params)
    flann_matches = flann.knnMatch(reference_descriptors, test_descriptors, k=2)
    matches_copy = []
    for i, (m, n) in enumerate(flann_matches):
        if m.distance < distanca_coefficient * n.distance:
            matches_copy.append(m)
    gsm_matches = cv2.xfeatures2d.matchGMS(reference_image.shape, test_image.shape, keypoints1=reference_keypoints,
                                           keypoints2=test_keypoints, matches1to2=matches_copy,
                                           withRotation=gms_withRotation, withScale=gms_withScale,
                                           thresholdFactor=gms_thresholdFactor)


# %% Starting reference list file creation
reference_init = datetime.now()
print('Start reference file list creation')
reference_image_process_list = []
manager = Manager()
reference_image_dictionary = manager.dict()
reference_image_list = manager.list()

for root, directories, files in os.walk(references_path):
    for file in files:
        if file.endswith('.DS_Store'):
            continue
        reference_image_path = os.path.join(root, file)
        reference_name = file.split('.')[0]
        image = cv2.imread(reference_image_path, cv2.IMREAD_GRAYSCALE)
        reference_image_dictionary[reference_name] = {
            'image': image,
            'keypoints': None,
            'descriptors': None
        }
        proc = Process(target=rootSIFT, args=(reference_image_list, reference_name, reference_image_path))
        reference_image_process_list.append(proc)
        proc.start()

for proc in reference_image_process_list:
    proc.join()

reference_end = datetime.now()
reference_time = reference_end - reference_init
print('End reference file list creation, time required: ', reference_time)
I faced pretty much the same error. It seems that in my case the code hangs at detectAndCompute, not when creating the dictionary. For some reason, SIFT feature extraction is not multiprocessing-safe (to my understanding this is the case on Macs, but I am not totally sure).
I found this in a GitHub thread. Many people say it works, but I couldn't get it to work. (Edit: I tried this later and it works fine.)
Instead I used multithreading, which is pretty much the same code and works perfectly. Of course, you need to take the multithreading vs. multiprocessing trade-offs into account.
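For completeness, a minimal sketch of that multithreading variant (same rootSIFT as in the question; a plain dict and the existing lock suffice because threads share memory, and reference_items is an illustrative list of (name, path) pairs, not from the original code):

from concurrent.futures import ThreadPoolExecutor

# Pre-populate the results dict, mirroring reference_image_dictionary above.
results = {name: {'image': None, 'keypoints': None, 'descriptors': None}
           for name, _ in reference_items}

with ThreadPoolExecutor(max_workers=cpus) as executor:
    for reference_name, reference_image_path in reference_items:
        executor.submit(rootSIFT, results, reference_name, reference_image_path)
# leaving the with-block waits for all threads to finish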

reading textfile returning empty variable in tensorflow

I have a text file which has 110 rows and 1024 columns of float values. I am trying to load the text file and it doesn't read anything.
filename = '300_faults.txt'
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader()
_, a = reader.read(filename_queue)
# x = np.loadtxt('300_faults.txt')   # working
# a = tf.constant(x, tf.float32)     # working
model = tf.initialize_all_variables()

with tf.Session() as session:
    session.run(model)
    print(session.run(tf.shape(a)))
Printing the shape of the variable returns [].
Firstly - tf.shape(a) == [] doesn't mean that the variable is empty. All scalars and strings have shape [].
https://www.tensorflow.org/programmers_guide/dims_types
Maybe you can check the rank instead - it would be 0 for scalars and strings.
Other than that, it looks like string_input_producer is a queue, and it needs additional wiring to make it work.
Please try this:
filename = '300_faults.txt'
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader()
_, a = reader.read(filename_queue)
# x = np.loadtxt('300_faults.txt')   # working
# a = tf.constant(x, tf.float32)     # working
model = tf.initialize_all_variables()

with tf.Session() as session:
    session.run(model)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    print(session.run(tf.shape(a)))
    print(session.run(a))
    coord.request_stop()
    coord.join(threads)
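For reference, the queue-based input pipeline above is from TF 1.x and is deprecated in current TensorFlow in favour of tf.data. A rough equivalent on TF 2.x (assuming whitespace-separated floats on each line) might look like this:

import tensorflow as tf  # 2.x


def parse_line(line):
    # Split one text line and convert every field to float32.
    return tf.strings.to_number(tf.strings.split(line), out_type=tf.float32)


dataset = tf.data.TextLineDataset('300_faults.txt').map(parse_line)
for row in dataset.take(1):
    print(row.shape)  # expected: (1024,)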
