Convert Float32 array to image in CoreML - PyTorch

I converted a DeepLab v3 model from PyTorch to CoreML and the output appears as a MultiArray, so I need to convert the output to an image. I've tried a lot of different advice (e.g. from the CoreML Survival Guide) but still haven't been successful. Could anyone kindly help me out here? Thanks a lot.
FYI, this is the model spec from Python:
The spec description is:
input {
  name: "input.1"
  type {
    imageType {
      width: 513
      height: 513
      colorSpace: RGB
    }
  }
}
output {
  name: "1436"
  type {
    multiArrayType {
      dataType: FLOAT32
    }
  }
}
In Xcode, the output I print out is "Float32 1 × 14 × 513 × 513 array", which I assume means that 1 is the number of channels, 14 is the number of labels, and 513 × 513 is width × height. How can I convert this array into an Int32 513 × 513 matrix or into an image?
Thanks for your help!
Edited: I added my model output's structure compared with Apple's DeeplabV3's (visualized through Netron) below for your reference. Any guidance is appreciated!
1/ My model output's architecture
2/ Apple's DeeplabV3

Images only have 1, 3 or 4 channels (grayscale, RGB, or RGBA). You have 14 channels, so you need to decide how those 14 channels map to colors.
Here is some demo code that shows a possible approach (also using DeepLab v3): https://github.com/hollance/SemanticSegmentationMetalDemo
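If it helps, here is a rough Python-side sketch (assuming you can run the .mlmodel from Python via coremltools on macOS; the file names and colors are placeholders) that collapses the 14 class channels with an argmax and paints each label:
import numpy as np
from PIL import Image
import coremltools as ct

model = ct.models.MLModel("DeepLabV3.mlmodel")       # placeholder path
img = Image.open("input.jpg").resize((513, 513))     # placeholder image

out = model.predict({"input.1": img})["1436"]        # shape (1, 14, 513, 513)
labels = np.argmax(out[0], axis=0).astype(np.int32)  # (513, 513) label map

# Paint each of the 14 labels with an arbitrary color just to visualize it.
palette = (np.random.RandomState(0).rand(14, 3) * 255).astype(np.uint8)
Image.fromarray(palette[labels]).save("segmentation.png")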

Related

RuntimeError: size mismatch, m1: [4 x 3136], m2: [64 x 5] at c:\a\w\1\s\tmp_conda_3.7_1

I used Python 3, and when I insert a random crop transform of size 224 it gives a mismatch error.
Here is my code.
What did I do wrong?
Your code makes variations on ResNet: you changed the number of channels, the number of bottlenecks at each "level", and you removed a "level" entirely. As a result, the feature map at the end of layer3 does not reduce to 64 values: its spatial dimensions are larger than what nn.AvgPool2d(8) anticipates. The error message actually tells you this: the output of layer3 has shape 64x56x56, and after average pooling with kernel and stride 8 you get a 64x7x7 = 3136-dimensional feature vector instead of the 64 you were expecting.
What can you do?
As opposed to "standard" ResNet, you removed the stride from conv1 and you do not have a max pool after conv1. Moreover, you removed layer4, which also has a stride. Therefore, you can add pooling to your net to reduce the spatial dimensions of layer3.
Alternatively, you can replace nn.AvgPool2d(8) with nn.AdaptiveAvgPool2d([1, 1]), an average pool that outputs a single value per channel regardless of the spatial dimensions of the input feature map.
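For example, a minimal sketch of the second option (the sizes here are taken from the error message, not from your actual code):
import torch
import torch.nn as nn

features = torch.randn(4, 64, 56, 56)          # e.g. what layer3 actually outputs
pool = nn.AdaptiveAvgPool2d((1, 1))            # always collapses H x W down to 1 x 1
fc = nn.Linear(64, 5)

x = pool(features).view(features.size(0), -1)  # (4, 64) regardless of input H x W
out = fc(x)                                    # (4, 5), no size mismatch
print(out.shape)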

Read PNG image with cv2.imread from OpenCV 3 in Python 3.7.1: no pixels at all, just a black window

I have two PNG images: one was output to PNG by the Python library Pillow (converted from an SVG font image), and the other is the same image opened and re-saved as PNG by Windows 10's Paint program.
Strangely, when I read these images with OpenCV 3's cv2.imread, one is not OK and shows only a black window, while the other is OK.
How can I read both PNGs correctly?
CODE:
import os
import cv2

image_file_path = r""
if not os.path.exists(image_file_path):
    print('NOT EXIST! = ' + image_file_path)
image = cv2.imread(image_file_path, cv2.IMREAD_ANYDEPTH)
cv2.namedWindow('image', cv2.WINDOW_NORMAL)
cv2.imshow("image", image)
cv2.waitKey()
IMAGES:
OK:
NOT OK:
The first image is in 4-channel RGBA format with a completely pointless, fully opaque, alpha channel which you can ignore.
The second image is in 2-channel Grey+Alpha format where all the pixels are pure solid black and the shapes are defined only in the alpha channel.
So, basically you want to:
discard the last channel of the first image, which you can do by using cv2.IMREAD_COLOR
discard all except the last channel of the second image, which you can do like this:
im = cv2.imread('2.png',cv2.IMREAD_UNCHANGED)[:,:,-1]
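A small follow-up sketch (the helper name and the "all colour pixels are black" heuristic are mine, not part of the answer above) that picks whichever channels actually contain the shapes, so both files can go through one code path:
import cv2

def load_visible(path):
    # Read unchanged so any alpha channel is preserved.
    img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
    if img is not None and img.ndim == 3 and img.shape[2] in (2, 4):
        colour, alpha = img[:, :, :-1], img[:, :, -1]
        # If every colour pixel is black, the shapes live in the alpha channel.
        return alpha if colour.max() == 0 else colour
    return img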
I obtained the information above by using ImageMagick which is included in most Linux distros and is available on macOS and Windows.
The command I used in Terminal is:
magick identify -verbose 2.png
Sample Output
Image: 2.png
Format: PNG (Portable Network Graphics)
Mime type: image/png
Class: DirectClass
Geometry: 1040x1533+0+0
Units: Undefined
Colorspace: Gray
Type: Bilevel
Base type: Undefined
Endianess: Undefined
Depth: 8-bit
Channel depth:
Gray: 1-bit <--- Note 1
Alpha: 8-bit <--- Note 1
Channel statistics:
Pixels: 1594320
Gray:
min: 0 (0) <--- Note 2
max: 0 (0) <--- Note 2
mean: 0 (0)
standard deviation: 0 (0)
kurtosis: -3
skewness: 0
entropy: 4.82164e-05
Alpha:
min: 0 (0) <--- Note 3
max: 255 (1) <--- Note 3
mean: 50.3212 (0.197338)
standard deviation: 101.351 (0.397456)
kurtosis: 0.316613
skewness: 1.52096
entropy: 0.0954769
...
...
I have annotated the output above with arrows and notes on the right.
Note 1: This tells me the image is greyscale + alpha
Note 2: This tells me all the greyscale pixels are black, since the max is zero and the min is zero
Note 3: This tells me that there are some fully transparent pixels, and some fully opaque pixels
Paint is transforming the images somehow, making their format incompatible with the 'typical' imread routine. I'm not sure what's happening; it might be related to Paint already removing the alpha channel, which OpenCV also wants to remove (according to their docs; I didn't look at the code). Luckily you can circumvent it:
I_not_ok = cv2.imread(ImagePath, cv2.IMREAD_UNCHANGED)
I_ok = I_not_ok[:,:,3]
cv2.namedWindow('Image_ok', cv2.WINDOW_NORMAL)
cv2.imshow('Image_ok', I_ok)
cv2.waitKey(0)

Permute over the output of a CNN to return the smallest loss and rearrange the output

Let's say I am detecting dogs in images.
The output of my CNN is
Dense(24,activation="relu")
which means I want to detect up to 6 dogs (each dog is represented by xmin, ymin, xmax, ymax = 4 values, 4 * 6 = 24).
Now let's say I have two dogs in a picture and their bounding-box positions are
dog1 = { xmin: 50, ymin:50, xmax:150,ymax:150}
dog2 = { xmin: 300,ymin:300,xmax:400,ymax:400}
The label for this picture would then look something like
( 50, 50, 150, 150, 300, 300, 400 ,400 , 0 ,0, 0 ... 16 zeros )
Now what if my CNN outputs something like
( 290, 285, 350 , 350, 60 , 40 , 120 ,110 ... 0 ... )
As you can see, the first bounding box the CNN outputs is closer to the second dog's bounding box, and vice versa.
How should I deal with this?
I can create my own MSE function and return the smallest value, e.g.
def custom_mse(y_true, y_pred):
    tmp = 10000000000
    a = list(itertools.permutations(y_pred))
    for i in range(0, len(a)):
        t = K.mean(K.square(a[i] - y_true), axis=-1)
        if t < tmp:
            tmp = t
    return tmp
But this would only produce the "best" loss; the weights would still be updated against the wrong ordering.
How can I modify the output of the CNN (permute or rearrange its elements) so this works?
I hope I explained it clearly.
Thanks for the help.
Your issue lies in which object you calculate the loss on.
TensorFlow, Keras, and almost any other library use their own objects in order to compute derivatives and define the calculation graph.
Therefore, if you need to do anything with the graph, you must do it using a defined op, or define your own op using the given methods and objects. TensorFlow also allows you to wrap regular Python functions and use them as ops on tensors.
As for your problem, create a 2D output array of dims [4, num_of_objects] and use TensorFlow operations to reorder the second dimension before calculating the loss. See the full lists here. Split it along the second dimension, iterate over the combinations, use tf.min to find the minimum loss, and optimize only that minimum loss. I have experimented with this approach; it works, also with bounding boxes.
EDIT: I noticed that you run your experiments and calculations with Keras: use the TensorFlow backend and work only on tensors. Do NOT retrieve data from the graph into numpy/list objects; use only tensors.
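As a rough sketch of that idea (a Keras loss working purely on tensors; the reshape into 6 boxes of 4 values and the brute-force enumeration of all orderings are assumptions for illustration, not your actual setup):
import itertools
import tensorflow as tf
from tensorflow.keras import backend as K

NUM_BOXES = 6  # assumption: 24 outputs = 6 boxes of 4 coordinates

def permutation_mse(y_true, y_pred):
    true_boxes = K.reshape(y_true, (-1, NUM_BOXES, 4))
    pred_boxes = K.reshape(y_pred, (-1, NUM_BOXES, 4))
    losses = []
    # The Python loop only builds graph ops; the data itself stays in tensors.
    for perm in itertools.permutations(range(NUM_BOXES)):
        reordered = tf.gather(pred_boxes, list(perm), axis=1)
        losses.append(K.mean(K.square(reordered - true_boxes), axis=[1, 2]))
    # 6! = 720 candidate orderings per sample; keep only the smallest loss.
    return K.min(K.stack(losses, axis=0), axis=0)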
Good luck!

CNN: Softmax layer for pixel-wise classification

I want to understand in more detail how a softmax layer can look in a CNN for semantic segmentation / pixel-wise classification of an image. The CNN outputs an image of class labels, where each pixel of the original image gets a label.
After passing a test image through the network, the next-to-last layer outputs N channels of the resolution of the original image. My question is, how the softmax layer transforms these N channels to the final image of labels.
Assumed we have C classes (# possible labels). My suggestion is that for each pixel, its N neurons of the previous layer are connected to C neurons in the softmax layer, where each of the C neurons represents one class. Using the softmax activation function, the sum of the C outputs (for this pixel) is equal to 1 (which facilitates training of the network). Last, each pixel is classified as the class with the highest probability (given by softmax values).
This would mean, that the softmax layer consists of C * #pixels neurons. Is my suggestion correct? I didn't find an explanation for this and hope that you can help me.
Thanks for helping!
The answer is: the softmax layer does not transform these N channels into the final image of labels.
Assuming you have an output with N channels, your question is how to convert it to a 3-channel image for the final output.
The answer is: you don't. Each of those N channels represents a class. The way to go is to create a dummy array with the same height and width and 3 channels.
Now you first have to abstractly encode each class with a color, like streets as green, cars as red, etc.
Assume that for height = 5 and width = 5, channel 7 has the max value. Now,
-> if channel 7 represents car, then you need to put a red pixel on the dummy array where height = 5 and width = 5.
-> if channel 7 represents street, then you need to put a green pixel on the dummy array where height = 5 and width = 5.
So you are trying to work out which of the N classes each pixel belongs to, and based on that class you redraw the pixel in a unique color on the dummy array.
This dummy array is called the mask.
For example, assume this is an input.
We are trying to locate the tumor area of a brain using pixel-wise classification. Here the number of classes is 2, tumor present and not present. So the softmax layer outputs a 2-channel object where channel 1 says tumor present and channel 2 says otherwise.
So whenever, for height = X and width = Y, channel 1 has the higher value, we make a white pixel at dummy[X][Y]. When channel 2 has the higher value, we make a black pixel.
After that we get a mask like this,
which doesn't make that much sense on its own. But when we overlay the two images, we get this.
So basically you will try to create the mask image (the second one) from your N-channel output, and overlaying them will give you the final output.
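Here is a minimal numpy sketch of that last step (the sizes and colors are arbitrary placeholders, not tied to any particular network):
import numpy as np

# Toy (N, H, W) score map: 2 classes, e.g. tumor / background.
scores = np.random.rand(2, 64, 64)
colors = np.array([[0, 0, 0], [255, 255, 255]], dtype=np.uint8)  # one color per class

labels = scores.argmax(axis=0)   # (H, W): winning class index per pixel
mask = colors[labels]            # (H, W, 3): color mask ready to overlay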

Converting an image to rows of grayscale pixel values

I'd like to use the node indico API. I need to convert the image to grayscale and then to arrays containing arrays/rows of pixel values. Where do I start?
These tools take a specific format for images, a list of lists, each
sub-list containing a 'row' of values corresponding to n pixels in the
image.
e.g. [[float, float, float ... *n ], [float, float, float ... *n ], ... *n]
Since pixels tend to be represented by RGBA values, you can use the
following formula to convert to grayscale.
Y = (0.2126 * R + 0.7152 * G + 0.0722 * B) * A
We're working on automatically scaling images, but for the moment it's
up to you to provide a square image
It looks like node's image manipulation tools are sadly a little lacking, but there is a good solution.
get-pixels allows reading in images either from URL or from local path and will convert it into an ndarray that should work excellently for the API.
The API will accept RGB images in the format that get-pixels produces, but if you're still interested in converting the images to grayscale (which can be helpful for other applications), it's actually a little strange.
In a standard RGB image, each color basically has a luminance score, which is how bright the color appears. Based on the luminance, the conversion to grayscale for each pixel happens as follows:
Grayscale = 0.2126*R + 0.7152*G + 0.0722*B
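The arithmetic is the same in any language; as a quick illustration of that formula (Python/numpy here purely to show the math, assuming an H x W x 3 array with values 0-255):
import numpy as np

rgb = np.random.randint(0, 256, (4, 4, 3)).astype(np.float64)  # toy RGB image
gray = 0.2126 * rgb[..., 0] + 0.7152 * rgb[..., 1] + 0.0722 * rgb[..., 2]
rows = gray.tolist()  # list of lists: one sub-list of floats per row of pixels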
Soon the API will also support the direct use of URLs, stay tuned on that front.
I maintain the sharp Node.js module, which may be able to get you a little closer to what you need.
The following example will convert the input to greyscale and generate a Buffer of integer values, one byte per pixel.
You'll need to add logic to divide by 255 to convert to float, then split into an array of arrays to keep the Indico API happy.
sharp(input)
  .resize(width, height)
  .grayscale()
  .raw()
  .toBuffer(function(err, data) {
    // data is a Buffer containing uint8 values (0-255)
    // with each byte representing one pixel
  });
