How to reduce the number of images in a data set in Python - image-scaling

I have an image data set of over 3,000 images. How do I write code, after my current code, to separate 200 images from each individual data set? There are train = T1, T2 and test = T1, T2 sets.
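If the images live on disk in split/subset folders, one way to thin the data set is to copy a random sample of 200 files from each folder into a smaller copy of the tree. The layout and paths below are assumptions (the current code isn't shown in the question), so adjust them to match your structure:
import os
import random
import shutil

# Assumed layout: dataset/{train,test}/{T1,T2}/... (placeholder paths).
# Copies a random sample of 200 images per subset into a reduced data set.
src_root = 'dataset'
dst_root = 'dataset_small'
per_set = 200

for split in ('train', 'test'):
    for subset in ('T1', 'T2'):
        src_dir = os.path.join(src_root, split, subset)
        dst_dir = os.path.join(dst_root, split, subset)
        os.makedirs(dst_dir, exist_ok=True)
        files = sorted(os.listdir(src_dir))
        for name in random.sample(files, min(per_set, len(files))):
            shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))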

Related

How can I visualise the data that is transformed and fed into a DataBunch in fastai, and plot the distribution of the data after it is created?

I have been working on an object detection problem. I loaded my DataBunch object from the image list in fastai, but I am not able to fully understand what kind of patches or images are genuinely present in my input data. So I need a visualisation of the data: in what form is it present, how many positive classes are there, how many hard negatives does the data contain, and what are their distributions?
This is how I created my data bunch in fastai:
# here train_images, valid_images is a list of <object_detection_fastai.helper.wsi_loader.SlideContainer object>
batch_size = 64
do_flip = True
flip_vert = True
max_rotate = 90
max_zoom = 1.1
max_lighting = 0.2
max_warp = 0.2
p_affine = 0.75
p_lighting = 0.75
tfms = get_transforms(do_flip=do_flip,
                      flip_vert=flip_vert,
                      max_rotate=max_rotate,
                      max_zoom=max_zoom,
                      max_lighting=max_lighting,
                      max_warp=max_warp,
                      p_affine=p_affine,
                      p_lighting=p_lighting)
train, valid, test = ObjectItemListSlide(train_images), ObjectItemListSlide(valid_images), ObjectItemListSlide(test_images)
# print("type", (test_set[12]))
item_list = ItemLists(".", train, test)
lls = item_list.label_from_func(lambda x: x.y, label_cls=SlideObjectCategoryList)
lls = lls.transform(tfms, tfm_y=True, size=patch_size)
data = lls.databunch(bs=batch_size, collate_fn=bb_pad_collate, num_workers=0).normalize()
Now I need to find out which patches or images are present and what their distribution is: how many positives, how many negatives, and how much background, because after the transform function they are shuffled and randomised into the DataBunch. The DataBunch therefore becomes a black box over my input data, and I need insight into the actual distribution of the data inside it.
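There is no built-in fastai report for this, but a rough way to get those counts is to walk the label lists on the DataBunch before any batching and tally the class ids. The sketch below assumes fastai v1 style bounding-box labels whose .data gives (boxes, class_ids) and that data.classes maps ids to names; with the slide-based wrappers the labels may be generated per sampled patch, so treat the counts as indicative rather than exact:
from collections import Counter

# Rough sketch (assumptions: each label's .data is (boxes, class_ids) as in
# fastai v1 ImageBBox labels, and data.classes maps class ids to names).
def label_distribution(label_list):
    counts = Counter()
    for y in label_list:
        boxes, class_ids = y.data
        if len(class_ids) == 0:
            counts['background_only'] += 1   # patch with no annotated object
        for c in class_ids:
            counts[data.classes[int(c)]] += 1
    return counts

print('train:', label_distribution(data.train_ds.y))
print('valid:', label_distribution(data.valid_ds.y))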

Geospatial fixed-radius cluster hunting in Python

I want to take an input of millions of lat/long points (each with a numerical attribute) and then find all fixed-radius geospatial clusters where the sum of the attribute within the circle is above a defined threshold.
I started by using sklearn's BallTree to sum the attribute within any defined circle, with the intention of then expanding this to run across a grid or lattice of circles. The run time for one circle is around 0.01 s, so this is fine for small lattices, but it won't scale if I want to run 200 m radius circles across the whole of the UK.
import numpy as np
import pandas
from sklearn.neighbors import BallTree

# example data (use 2m rows from postcode centroid file)
df = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000000)
# this will be our grid of points (or lattice); use points from the same file for the example
df2 = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000)

# reorder lat/long columns for BallTree input
columnTitles = ["Y", "X"]
df = df.reindex(columns=columnTitles)
df2 = df2.reindex(columns=columnTitles)

# assign new columns to the existing dataframes. 'attribute' will hold the data we want to sum over (set to 1 for now)
df['attribute'] = 1
df2['aggregation'] = 0

RADIANT_TO_KM_CONSTANT = 6367

class BallTreeIndex:
    def __init__(self, lat_longs):
        self.lat_longs = np.radians(lat_longs)
        self.ball_tree_index = BallTree(self.lat_longs, metric='haversine')

    def query_radius(self, query, radius):
        radius_km = radius / 1000
        radius_radiant = radius_km / RADIANT_TO_KM_CONSTANT
        query = np.radians(np.array([query]))
        indices = self.ball_tree_index.query_radius(query, r=radius_radiant)
        return indices[0]

# index the base data
a = BallTreeIndex(df.iloc[:, 0:2])

# begin to loop over the lattice to test performance
for i in range(0, 100):
    b = df2.iloc[i, 0:2]
    output = a.query_radius(b, 200)
    accumulation = sum(df.iloc[output, 2])
    df2.iloc[i, 2] = accumulation
It feels as if the above code is really inefficient, as I don't need to run the calculation across all circles on my lattice (most will be well below my threshold, or will contain no data points at all).
Instead of this for loop, is there a better way of scaling this algorithm to give me the most dense circles?
I'm new to python, so any help would be massively appreciated!!
First, don't try to do this on a sphere! GB is small and we have a well-defined geographic projection that will work, so use the oseast1m and osnorth1m columns as X and Y. They are in metres, so there is no need to convert (roughly) to degrees and use haversine. That should help.
Next, add a spatial index to speed up lookups.
If you need more speed, there are various tricks, like loading a 2R strip across the country into memory, running your circles across that strip, then moving down a grid step and updating the strip (checking Y values against a fixed value is quick, especially if you store the data sorted on Y then X). If you need still more speed, look at any of the papers that Stan Openshaw (and sometimes I) wrote about parallelising the GAM. There are examples of implementing GAM in Python (e.g. this paper, this paper) that may also point to better ways.
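As a rough sketch of that first suggestion (projected coordinates plus a spatial index), the same 200 m query can be done with a k-d tree built on the oseast1m/osnorth1m columns. The column names are the ones mentioned above (check how they appear in your copy of the centroid file), and the threshold value is a placeholder:
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Rough sketch: easting/northing are already in metres, so plain Euclidean
# distance works and no haversine conversion is needed.
df = pd.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv',
                 usecols=['oseast1m', 'osnorth1m']).dropna()
df['attribute'] = 1

# Build the index once over all points.
points = df[['oseast1m', 'osnorth1m']].to_numpy()
tree = cKDTree(points)

# Query every lattice point in one vectorised call; here the lattice is just a
# subsample of the data itself, as in the question.
lattice = points[:2000]
neighbours = tree.query_ball_point(lattice, r=200)   # 200 m radius
attr = df['attribute'].to_numpy()
sums = np.array([attr[idx].sum() for idx in neighbours])

threshold = 50                                       # placeholder threshold
dense_circles = lattice[sums > threshold]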

How to bulk test a SageMaker object detection model with a .mat dataset or an S3 folder of images?

I have trained the following Sagemaker model: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/object_detection_pascalvoc_coco
I've tried both the JSON and RecordIO version. In both, the algorithm is tested on ONE sample image. However, I have a dataset of 2000 pictures, which I would like to test. I have saved the 2000 jpg pictures in a folder within an S3 bucket and I also have two .mat files (pics + ground truth). How can I apply this model to all 2000 pictures at once and then save the results, rather than doing it one picture at a time?
I am using the code below to load a single picture from my S3 bucket:
object = bucket.Object('pictures/pic1.jpg')
object.download_file('pic1.jpg')
img = mpimg.imread('pic1.jpg')
img_name = 'pic1.jpg'
imgplot = plt.imshow(img)
plt.show(imgplot)

with open(img_name, 'rb') as image:
    f = image.read()
    b = bytearray(f)
    ne = open('n.txt', 'wb')
    ne.write(b)

import json

object_detector.content_type = 'image/jpeg'
results = object_detector.predict(b)
detections = json.loads(results)
print(detections['prediction'])
I'm not sure if I understood your question correctly. However, if you want to feed multiple images to the model at once, you can create a multi-dimensional array of images (byte arrays) to feed the model.
The code would look something like this.
import numpy as np
...
# predict_images_list is a Python list of byte arrays
predict_images = np.stack(predict_images_list)
with graph.as_default():
    # results is a list of the typical results you'd get
    results = object_detector.predict(predict_images)
But I'm not sure it's a good idea to feed 2000 images at once; it's better to batch them 20-30 images at a time and predict.
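If you mainly want to automate the whole bucket rather than send everything in a single request, a minimal sketch (untested, reusing the bucket and object_detector objects and the pictures/ prefix from the question) is to loop over the keys and collect each prediction:
import json

# Minimal sketch: iterate over every JPEG under the 'pictures/' prefix from the
# question, send each one to the endpoint, and keep the predictions keyed by name.
# 'bucket' and 'object_detector' are assumed to be the same objects as above.
object_detector.content_type = 'image/jpeg'
all_detections = {}
for obj in bucket.objects.filter(Prefix='pictures/'):
    if not obj.key.endswith('.jpg'):
        continue
    body = obj.get()['Body'].read()                      # raw JPEG bytes
    result = object_detector.predict(bytearray(body))
    all_detections[obj.key] = json.loads(result)['prediction']

# Save everything so it can be compared against the .mat ground truth later.
with open('detections.json', 'w') as f:
    json.dump(all_detections, f)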

VTK - How to use vtkNetCDFCFReader to read an array or variable array at a specific time frame

I'm trying to load an array at a specific time frame (for example, if it has 50 frames or time units, get the array corresponding to the 2nd time frame) from NetCDF files (.nc). I'm currently using vtkNetCDFCFReader and getting the data array "vwnd" from the 1st time frame like this:
vtkSmartPointer<vtkNetCDFCFReader> reader = vtkSmartPointer<vtkNetCDFCFReader>::New();
reader->SetFileName(path.c_str());
reader->UpdateMetaData();
vtkSmartPointer<vtkStructuredGridGeometryFilter> geometryFilter = vtkSmartPointer<vtkStructuredGridGeometryFilter>::New();
geometryFilter->SetInputConnection(reader->GetOutputPort());
geometryFilter->Update();
vtkSmartPointer<vtkPolyData> ncPolydata = vtkSmartPointer<vtkPolyData>::New();
ncPolydata = geometryFilter->GetOutput();
vtkSmartPointer<vtkDataArray> dataArray = ncPolydata->GetCellData()->GetArray("vwnd");
The variable arrays are: lat, lon, time, vwnd (vwnd has dimensions (lat, lon)). I'm also interested in getting the arrays for lat and lon. Any help would be appreciated.
Thanks in advance.
As the dimensions of lat/lon are different from vwnd's, you will need two vtkNetCDFCFReaders to read in data with different dimensions. Just remember to set the dimensions after creating each reader.
For example in C++:
vtkNetCDFCFReader* reader = vtkNetCDFCFReader::New();
reader->SetFileName(fileName.c_str());
reader->UpdateMetaData();
//here you specify the dimensions of the variables the reader should load
reader->SetDimensions(dim);
reader->SetVariableArrayStatus("lat", 1);
reader->SetVariableArrayStatus("lon", 1);
reader->Update();
If you do this correctly, you can read any of the arrays and store them in a vtkDataArray.
If you want to read in the vwnd data at the second time step, just skip the first lat*lon values.
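Alternatively, if your VTK build supports it, the time step can be requested through the pipeline's time support instead of skipping values by hand. A rough sketch via the Python bindings (the C++ calls are analogous; verify the exact methods against your VTK version, as they are assumptions here):
import vtk

# Rough sketch (assumes a reasonably recent VTK with vtkAlgorithm.UpdateTimeStep).
reader = vtk.vtkNetCDFCFReader()
reader.SetFileName("data.nc")          # placeholder file name
reader.UpdateMetaData()
reader.SetVariableArrayStatus("vwnd", 1)

# The available time values are published on the reader's output information.
info = reader.GetOutputInformation(0)
times = info.Get(vtk.vtkStreamingDemandDrivenPipeline.TIME_STEPS())

# Ask the pipeline for the 2nd time frame only, then pull the array out.
reader.UpdateTimeStep(times[1])
output = reader.GetOutput()
vwnd = (output.GetPointData().GetArray("vwnd")
        or output.GetCellData().GetArray("vwnd"))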

How to SVM-train my edge images using Java code

I have a set of images on which I performed edge detection using OpenCV 3.1. The edges are stored in OpenCV Mat objects. Can someone help me with Java SVM train and test code for this set of images?
Following the discussion in the comments, I am providing an example project which I built for Android Studio a while back.
It was used to classify images based on the Lab color space.
//1.a Assign the parameters for SVM training here
double nu = 0.999D;
double gamma = 0.4D;
double epsilon = 0.01D;
double coef0 = 0;
//kernel types are Linear(0), Poly(1), RBF(2), Sigmoid(3), Chi2(4), Intersection(5)
//For Poly(1) set degree and gamma
double degree = 2;
int kernel_type = 4; //Chi2 kernel used here
//1.b Create an SVM object
SVM B_channel_svm = SVM.create();
B_channel_svm.setType(104); //104 corresponds to SVM.NU_SVR
B_channel_svm.setNu(nu);
B_channel_svm.setCoef0(coef0);
B_channel_svm.setKernel(kernel_type);
B_channel_svm.setDegree(degree);
B_channel_svm.setGamma(gamma);
B_channel_svm.setTermCriteria(new TermCriteria(2, 10, epsilon)); //2 = TermCriteria.EPS
// Repeat Step 1.b for the number of SVMs.
//2. Train the SVM
// Note: training_data - If your image has n rows and m columns, you have to make a matrix of size (n*m, o), where o is the number of labels.
// Note: Label_data is same as above, n rows and m columns, make a matrix of size (n*m, o) where o is the number of labels.
// Note: Very Important - Train the SVM for the entire data as training input and the specific column of the Label_data as the Label. Here, I train the data using B, G and R channels and hence, the name B_channel_SVM. I make 3 different SVM objects separately but you can do this by creating only one object also.
B_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(0));
G_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(1));
R_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(2));
// Now after training we "predict" the outcome for a sample from the trained SVM. But first, lets prepare the Test data.
// As above for the training data, make a matrix of (n*m, o) and use the columns to predict. So, since I created 3 different SVMs, I will input three separate matrices for the three SVMs of size (n*m, 1).
//3. Predict the testing data outcome using the trained SVM.
B_channel_svm.predict(scene_ml_input, predicted_final_B, StatModel.RAW_OUTPUT);
G_channel_svm.predict(scene_ml_input, predicted_final_G, StatModel.RAW_OUTPUT);
R_channel_svm.predict(scene_ml_input, predicted_final_R, StatModel.RAW_OUTPUT);
//4. Here, predicted_final_ are matrices which gives you the final value as in Label(0,1,2... etc) for the input data (edge profile in your case)
Now, I hope you have an idea of how SVM works. You basically need to do these steps:
Step 1: Identify labels - In your case Gestures from edge profile.
Step 2: Assign values to the labels - For example, if you are trying to classify haptic gestures - Open Hand = 1, Closed Hand/Fist = 2, Thumbs up = 3 and so on.
Step 3: Prepare the training data (edge profiles) and Labels (1,2,3) etc. according to the process above.
Step 4: Prepare data for prediction using the transformation calculated using SVM.
Very important for SVM on OpenCV: normalize your data and make sure all your matrices are of the same type (CvType).
Hope it helps. Feel free to ask questions if you have any doubts and post what you have tried. I can solve the problem for you if you send me some images but then you won't learn anything right? ;)
