How to remove an image and its label from a dataset - python-3.x

I wrote a function to import a dataset of images together with the CSV file containing the corresponding labels, and to extract every image's label. Now I have to remove some images that do not fit specific criteria. Is there a way to remove the corresponding labels as well?
This is the function that is used to load the images and the labels
def imp_img():
    dirname = '/s/desk/img/'
    x = np.zeros((1000, 100, 100), dtype=np.float32)
    for i in range(x.shape[0]):
        img = Image.open(dirname + 'img_%02d.png' % (i))
        img = np.array(img)
        x[i] = img
    path = '/s/desk/labels_classificatio.csv'
    labels = pd.read_csv(path, usecols=["category"], sep=";")
    y = np.array(labels)
    return x, y
This is how they are imported
x, y = imp_img()
x = x/255.0
y = y.reshape(y.shape[0], 1)
x.shape, y.shape
and now I made a for loop to remove the images that are too dark:
c = []
for i in x:
    if np.sum(i) >= 100:
        c.append(i)
c = np.asarray(c)
The problem now is that I have fewer images than I have labels. Is there a way to remove the corresponding label as well?

You're looking for enumerate. It lets you loop over an iterable while maintaining a count. Instead of for i in x we'll do for i, img in enumerate(x), which lets us maintain a loop counter i. This way you can subset the labels corresponding to the images that meet your criteria.
Code:
c = []
c_labels = []
for i, img in enumerate(x):
    if np.sum(img) >= 100:
        c.append(img)
        c_labels.append(y[i])
c = np.asarray(c)
c_labels = np.asarray(c_labels)
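If you prefer to avoid the Python loop entirely, a boolean mask does the same filtering in one step. Here is a minimal sketch, assuming x and y are the NumPy arrays returned by imp_img (shapes (N, 100, 100) and (N, 1)):
import numpy as np

# Boolean mask: True for images whose pixel sum meets the brightness criterion
mask = x.sum(axis=(1, 2)) >= 100

# Apply the same mask to images and labels so they stay aligned
c = x[mask]
c_labels = y[mask]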

Related

How to fix multiprocess issue in code given below?

My main function aligns images with a reference image. The function works smoothly on a single core. I tried to multi-process the problem to reduce the time, but it takes the same time as a single core. I think the same image is allotted to all the cores. How do I split the different images in a folder across different cores to speed up the process? Also tell me if there is any error in my code.
MAX_FEATURES = 50000
GOOD_MATCH_PERCENT = 1.00

def alignImages(refimage, input_path, output_path):
    im1 = cv2.imread(input_path)
    im1Gray = cv2.cvtColor(im1, cv2.COLOR_BGR2GRAY)
    im2Gray = cv2.cvtColor(reference_image, cv2.COLOR_BGR2GRAY)
    # Detect ORB features and compute descriptors.
    # Matcher algo
    # Remove not so good matches
    numGoodMatches = int(len(matches) * GOOD_MATCH_PERCENT)
    matches = matches[:numGoodMatches]
    # Draw top matches
    imMatches = cv2.drawMatches(im1, keypoints1, reference_image,
                                keypoints2, matches, None)
    # Extract location of good matches
    points1 = np.zeros((len(matches), 2), dtype=np.float32)
    points2 = np.zeros((len(matches), 2), dtype=np.float32)
    for i, match in enumerate(matches):
        points1[i, :] = keypoints1[match.queryIdx].pt
        points2[i, :] = keypoints2[match.trainIdx].pt
    # Find homography
    h, mask = cv2.findHomography(points1, points2, cv2.RANSAC)
    # Use homography
    height, width, channels = reference_image.shape
    imReg = cv2.warpPerspective(im1, h, (width, height))
    f, e = os.path.splitext(output_path)
    cv2.imwrite(f + '.TIF', imReg)

if __name__ == "__main__":
    start = time.time()
    from multiprocessing import Pool
    path = r'Path of folder to align images'
    dirs = os.listdir(path)
    array1 = []
    array2 = []
    array3 = []
    for i in dirs:
        input_path = path+'\\'+i
        reference_image = cv2.imread('Reference image path')
        array1.append(reference_image)
        array2.append(input_path)
        out = path +'\\'+'new\\'+i
        array3.append(out)
    z = list(zip(array1, array2, array3))
    p = Pool(12)
    p.starmap(alignImages, z, chunksize=28)
    end = time.time()
    print(end-start)
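For reference, a minimal sketch of how Pool.starmap hands a different argument tuple to each call, so each worker processes a different file; the folder name, reference path, and the align_one placeholder are illustrative assumptions, not the asker's actual setup:
import os
from multiprocessing import Pool

def align_one(reference_path, input_path, output_path):
    # Placeholder: each call receives its own (reference, input, output) triple,
    # so every worker processes a different image from the folder.
    print(f"aligning {input_path} -> {output_path}")

if __name__ == "__main__":
    folder = "images_to_align"          # assumed folder name
    reference = "reference.png"         # assumed reference image path
    names = os.listdir(folder)
    tasks = [(reference, os.path.join(folder, n), os.path.join(folder, "new", n))
             for n in names]
    # With the default chunksize, tasks are spread across the workers;
    # a large fixed chunksize can push all tasks onto a single worker.
    with Pool(4) as pool:
        pool.starmap(align_one, tasks)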

What is the best way to extract text contained within a table in a pdf using python?

I'm constructing a program to extract text from a pdf, put it in a structured format, and send it off to a database. I have roughly 1,400 individual pdfs that all follow a similar format, but nuances in the verbiage and plan designs that the documents summarize make it tricky.
I've played around with a couple of different PDF readers in Python, including tabula-py and pdfminer, but none of them quite gets to what I'd like to do. Tabula reads in all of the text very well; however, it pulls everything exactly as it is laid out horizontally, ignoring the fact that some of the text is wrapped in a box. For example, if you open up the sample SBC I have attached where it reads "What is the overall deductible?", Tabula will read in "What is the overall $500/Individual or...", missing the fact that the word "deductible" is really part of the first sentence. (Note the files I'm working with are PDFs, but I've attached a JPEG because I couldn't figure out how to attach a PDF.)
import tabula
df = tabula.read_pdf(filepath, pandas_options={'header': None})
print(df.iloc[0][0])
print(df)
In the end, I'd really like to be able to parse out the text within each box so that I can better identify what values belong to deductible, out-of-pocket limits, copays/coinsurance, etc. I thought possibly some sort of OCR would allow me to recognize which parts of the PDF are contained in the blue rectangles and then pull the strings from there, but I really don't know where to start with that. (Sample SBC attached.)
@jpnadas In this case the code you copied from my answer in this post isn't really suitable, because it addresses the case when a table doesn't have a surrounding grid. That algorithm looks for repeating blocks of text and tries to find a pattern that resembles a table heuristically.
But in this particular case the table does have a grid, and by taking advantage of that we can achieve a much more accurate result.
The strategy is the following:
1. Increase image gamma to make the grid darker
2. Get rid of colour and apply Otsu thresholding
3. Find long vertical and horizontal lines in the image and create a mask from them using the erode and dilate functions
4. Find the cell blocks in the mask using the findContours function
5. Find table objects
   5.1 The rest can be done as in the post about finding a table without the grid: find the table structure heuristically
   5.2 An alternative approach could be to use the hierarchy returned by the findContours function. This approach is even more accurate and allows finding multiple tables in a single image.
6. Having the cell coordinates, it's easy to extract a certain cell image from the original image:
cell_image = image[cell_y:cell_y + cell_h, cell_x:cell_x + cell_w]
7. Apply OCR to each cell_image.
BUT! I consider the OpenCV approach a last resort, for when you're not able to read the PDF's contents directly: for instance, when a PDF contains a raster image inside.
If it's a vector-based PDF and its contents are readable, it makes more sense to find the table inside the contents and just read the text from it instead of doing heavy 'OCR lifting'.
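To illustrate the OCR step on a single cell, here is a minimal sketch using pytesseract (which needs the Tesseract binary installed); the ocr_cell helper and its arguments are assumptions layered on top of the (x, y, w, h) boxes produced by the code below:
import cv2
import pytesseract  # requires the Tesseract binary to be installed

def ocr_cell(image, cell):
    # cell is an (x, y, w, h) tuple as produced by cv2.boundingRect
    x, y, w, h = cell
    cell_image = image[y:y + h, x:x + w]
    # Grayscale generally helps Tesseract on table cells
    gray = cv2.cvtColor(cell_image, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray).strip()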
Here's the code for reference for more accurate table recognition:
import os
import imutils
import numpy as np
import argparse
import cv2

def gamma_correction(image, gamma=1.0):
    look_up_table = np.empty((1, 256), np.uint8)
    for i in range(256):
        look_up_table[0, i] = np.clip(pow(i / 255.0, gamma) * 255.0, 0, 255)
    result = cv2.LUT(image, look_up_table)
    return result

def pre_process_image(image):
    # Applying gamma to make the table lines darker
    gamma = gamma_correction(image, 2)
    # Getting rid of color
    gray = cv2.cvtColor(gamma, cv2.COLOR_BGR2GRAY)
    # Then apply Otsu threshold to reveal important areas
    ret, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Inverting the thresholded image
    return ~thresh

def get_horizontal_lines_mask(image, horizontal_size=100):
    horizontal = image.copy()
    horizontal_structure = cv2.getStructuringElement(cv2.MORPH_RECT, (horizontal_size, 1))
    horizontal = cv2.erode(horizontal, horizontal_structure, anchor=(-1, -1), iterations=1)
    horizontal = cv2.dilate(horizontal, horizontal_structure, anchor=(-1, -1), iterations=1)
    return horizontal

def get_vertical_lines_mask(image, vertical_size=100):
    vertical = image.copy()
    vertical_structure = cv2.getStructuringElement(cv2.MORPH_RECT, (1, vertical_size))
    vertical = cv2.erode(vertical, vertical_structure, anchor=(-1, -1), iterations=1)
    vertical = cv2.dilate(vertical, vertical_structure, anchor=(-1, -1), iterations=1)
    return vertical

def make_lines_mask(preprocessed, min_horizontal_line_size=100, min_vertical_line_size=100):
    hor = get_horizontal_lines_mask(preprocessed, min_horizontal_line_size)
    ver = get_vertical_lines_mask(preprocessed, min_vertical_line_size)
    mask = np.zeros((preprocessed.shape[0], preprocessed.shape[1], 1), dtype=np.uint8)
    mask = cv2.bitwise_or(mask, hor)
    mask = cv2.bitwise_or(mask, ver)
    return ~mask

def find_cell_boxes(mask):
    # Looking for the text spots contours
    # OpenCV 3
    # img, contours, hierarchy = cv2.findContours(pre, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    # OpenCV 4
    contours = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    contours = imutils.grab_contours(contours)
    contours = sorted(contours, key=cv2.contourArea, reverse=True)
    image_width = mask.shape[1]
    # Getting the texts bounding boxes based on the text size assumptions
    boxes = []
    for contour in contours:
        box = cv2.boundingRect(contour)
        w = box[2]
        # Excluding the page box shape but adding smaller boxes
        if w < 0.95 * image_width:
            boxes.append(box)
    return boxes

def find_table_in_boxes(boxes, cell_threshold=10, min_columns=2):
    rows = {}
    cols = {}
    # Clustering the bounding boxes by their positions
    for box in boxes:
        (x, y, w, h) = box
        col_key = x // cell_threshold
        row_key = y // cell_threshold
        cols[row_key] = [box] if col_key not in cols else cols[col_key] + [box]
        rows[row_key] = [box] if row_key not in rows else rows[row_key] + [box]
    # Filtering out the clusters having less than 2 cols
    table_cells = list(filter(lambda r: len(r) >= min_columns, rows.values()))
    # Sorting the row cells by x coord
    table_cells = [list(sorted(tb)) for tb in table_cells]
    # Sorting rows by the y coord
    table_cells = list(sorted(table_cells, key=lambda r: r[0][1]))
    return table_cells

def build_vertical_lines(table_cells):
    if table_cells is None or len(table_cells) <= 0:
        return [], []
    max_last_col_width_row = max(table_cells, key=lambda b: b[-1][2])
    max_x = max_last_col_width_row[-1][0] + max_last_col_width_row[-1][2]
    max_last_row_height_box = max(table_cells[-1], key=lambda b: b[3])
    max_y = max_last_row_height_box[1] + max_last_row_height_box[3]
    hor_lines = []
    ver_lines = []
    for box in table_cells:
        x = box[0][0]
        y = box[0][1]
        hor_lines.append((x, y, max_x, y))
    for box in table_cells[0]:
        x = box[0]
        y = box[1]
        ver_lines.append((x, y, x, max_y))
    (x, y, w, h) = table_cells[0][-1]
    ver_lines.append((max_x, y, max_x, max_y))
    (x, y, w, h) = table_cells[0][0]
    hor_lines.append((x, max_y, max_x, max_y))
    return hor_lines, ver_lines

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True, help="path to images directory")
    args = vars(ap.parse_args())
    in_file = args["image"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")
    img = cv2.imread(in_file)
    pre_processed = pre_process_image(img)
    # Visualizing pre-processed image
    cv2.imwrite(filename_base + ".pre.png", pre_processed)
    lines_mask = make_lines_mask(pre_processed, min_horizontal_line_size=1800, min_vertical_line_size=500)
    # Visualizing table lines mask
    cv2.imwrite(filename_base + ".mask.png", lines_mask)
    cell_boxes = find_cell_boxes(lines_mask)
    cells = find_table_in_boxes(cell_boxes)
    # Apply OCR to each cell rect here;
    # the cells array contains cell coordinates in (x, y, w, h) tuples
    hor_lines, ver_lines = build_vertical_lines(cells)
    # Visualize the table lines
    vis = img.copy()
    for line in hor_lines:
        [x1, y1, x2, y2] = line
        cv2.line(vis, (x1, y1), (x2, y2), (0, 0, 255), 1)
    for line in ver_lines:
        [x1, y1, x2, y2] = line
        cv2.line(vis, (x1, y1), (x2, y2), (0, 0, 255), 1)
    cv2.imwrite(filename_base + ".result.png", vis)
Some parameters are hard-coded:
page size threshold - 0.95
min horizontal line size - 1800 px
min vertical line size - 500 px
You can provide them as configurable parameters or make them relative to image size.
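For example, here is a small sketch of deriving those thresholds from the image size instead of hard-coding them; the scale factors are illustrative assumptions, not tuned values:
def relative_line_sizes(image, hor_fraction=0.75, ver_fraction=0.25):
    # Derive the minimum line lengths from the image dimensions
    # instead of fixed pixel counts (1800 px / 500 px).
    height, width = image.shape[:2]
    min_horizontal = int(width * hor_fraction)
    min_vertical = int(height * ver_fraction)
    return min_horizontal, min_vertical

# Usage with the functions above:
# min_h, min_v = relative_line_sizes(img)
# lines_mask = make_lines_mask(pre_processed, min_horizontal_line_size=min_h, min_vertical_line_size=min_v)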
Results: (result images omitted)
I think that the best way to do what you need is to find and isolate the cells in the file and then apply OCR to each individual cell.
There are a number of solutions on SO for that; I got the code from this answer and played around a little with the parameters to get the output below (not perfect yet, but you can tweak it a little bit yourself).
import os
import cv2
import imutils

# This only works if there's only one table on a page
# Important parameters:
#  - morph_size
#  - min_text_height_limit
#  - max_text_height_limit
#  - cell_threshold
#  - min_columns

def pre_process_image(img, save_in_file, morph_size=(23, 23)):
    # get rid of the color
    pre = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu threshold
    pre = cv2.threshold(pre, 250, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    # dilate the text to make it a solid spot
    cpy = pre.copy()
    struct = cv2.getStructuringElement(cv2.MORPH_RECT, morph_size)
    cpy = cv2.dilate(~cpy, struct, anchor=(-1, -1), iterations=1)
    pre = ~cpy
    if save_in_file is not None:
        cv2.imwrite(save_in_file, pre)
    return pre

def find_text_boxes(pre, min_text_height_limit=20, max_text_height_limit=120):
    # Looking for the text spots contours
    contours, _ = cv2.findContours(pre, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    # Getting the texts bounding boxes based on the text size assumptions
    boxes = []
    for contour in contours:
        box = cv2.boundingRect(contour)
        h = box[3]
        if min_text_height_limit < h < max_text_height_limit:
            boxes.append(box)
    return boxes

def find_table_in_boxes(boxes, cell_threshold=100, min_columns=3):
    rows = {}
    cols = {}
    # Clustering the bounding boxes by their positions
    for box in boxes:
        (x, y, w, h) = box
        col_key = x // cell_threshold
        row_key = y // cell_threshold
        cols[row_key] = [box] if col_key not in cols else cols[col_key] + [box]
        rows[row_key] = [box] if row_key not in rows else rows[row_key] + [box]
    # Filtering out the clusters having less than 2 cols
    table_cells = list(filter(lambda r: len(r) >= min_columns, rows.values()))
    # Sorting the row cells by x coord
    table_cells = [list(sorted(tb)) for tb in table_cells]
    # Sorting rows by the y coord
    table_cells = list(sorted(table_cells, key=lambda r: r[0][1]))
    return table_cells

def build_lines(table_cells):
    if table_cells is None or len(table_cells) <= 0:
        return [], []
    max_last_col_width_row = max(table_cells, key=lambda b: b[-1][2])
    max_x = max_last_col_width_row[-1][0] + max_last_col_width_row[-1][2]
    max_last_row_height_box = max(table_cells[-1], key=lambda b: b[3])
    max_y = max_last_row_height_box[1] + max_last_row_height_box[3]
    hor_lines = []
    ver_lines = []
    for box in table_cells:
        x = box[0][0]
        y = box[0][1]
        hor_lines.append((x, y, max_x, y))
    for box in table_cells[0]:
        x = box[0]
        y = box[1]
        ver_lines.append((x, y, x, max_y))
    (x, y, w, h) = table_cells[0][-1]
    ver_lines.append((max_x, y, max_x, max_y))
    (x, y, w, h) = table_cells[0][0]
    hor_lines.append((x, max_y, max_x, max_y))
    return hor_lines, ver_lines

if __name__ == "__main__":
    in_file = os.path.join(".", "test.jpg")
    pre_file = os.path.join(".", "pre.png")
    out_file = os.path.join(".", "out.png")
    img = cv2.imread(os.path.join(in_file))
    pre_processed = pre_process_image(img, pre_file)
    text_boxes = find_text_boxes(pre_processed)
    cells = find_table_in_boxes(text_boxes)
    hor_lines, ver_lines = build_lines(cells)
    # Visualize the result
    vis = img.copy()
    # for box in text_boxes:
    #     (x, y, w, h) = box
    #     cv2.rectangle(vis, (x, y), (x + w - 2, y + h - 2), (0, 255, 0), 1)
    for line in hor_lines:
        [x1, y1, x2, y2] = line
        cv2.line(vis, (x1, y1), (x2, y2), (0, 0, 255), 1)
    for line in ver_lines:
        [x1, y1, x2, y2] = line
        cv2.line(vis, (x1, y1), (x2, y2), (0, 0, 255), 1)
    cv2.imwrite(out_file, vis)

How to perform Earth Mover's Distance instead of DoG for center-surround difference at multiple scales in images in Python 3.7

I am working on an image processing project where I have to perform a center-surround difference calculation with Earth Mover's Distance (EMD) at multiple scales, but the problem is that I can't figure out how the center-surround difference works and how I could use EMD for it.
I found the Python function for EMD, but it works with two source image histograms, whereas in my problem I have only one source.
I am generating multiple scales of the image using skimage's pyramid_gaussian function, following the solution provided at this link:
https://gist.github.com/duhaime/211365edaddf7ff89c0a36d9f3f7956c
I tried:
def get_img(path, norm_size=True, norm_exposure=False):
    img = imread(path, flatten=True).astype(int)
    if norm_size:
        img = resize(img, (height, width), anti_aliasing=True, preserve_range=True)
    if norm_exposure:
        img = normalize_exposure(img)
    return img

def get_histogram(img):
    h, w = img.shape
    hist = [0.0] * 256
    for i in range(h):
        for j in range(w):
            hist[img[i, j]] += 1
    return np.array(hist) / (h * w)

def normalize_exposure(img):
    img = img.astype(int)
    hist = get_histogram(img)
    # get the sum of vals accumulated by each position in hist
    cdf = np.array([sum(hist[:i+1]) for i in range(len(hist))])
    # determine the normalization values for each unit of the cdf
    sk = np.uint8(255 * cdf)
    # normalize each position in the output image
    height, width = img.shape
    normalized = np.zeros_like(img)
    for i in range(0, height):
        for j in range(0, width):
            normalized[i, j] = sk[img[i, j]]
    return normalized.astype(int)

def earth_movers_distance(path_a, path_b):
    img_a = get_img(path_a, norm_exposure=True)
    img_b = get_img(path_b, norm_exposure=True)
    hist_a = get_histogram(img_a)
    hist_b = get_histogram(img_b)
    return wasserstein_distance(hist_a, hist_b)

if __name__ == '__main__':
    image = cv2.imread("images/test3.jpg")
    pyramidlist = []
    dst = []
    for (i, resized) in enumerate(pyramid_gaussian(image, downscale=1.4)):
        if resized.shape[0] < 30 or resized.shape[1] < 30:
            break
        cv2.imshow(f"Layer {i+1}", resized)
        cv2.waitKey(0)
        pyramidlist.append(resized[i])
    print(pyramidlist)
    print(len(pyramidlist))
    cv2.destroyAllWindows()
but I don't know how to use EMD after generating the pyramids, or how to calculate the center-surround difference.
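One possible sketch of the missing step, under the assumption that "center" and "surround" are taken as a fine and a coarser pyramid level resized to the same shape and compared by the EMD between their intensity histograms; this is an illustrative interpretation, not a confirmed recipe:
import numpy as np
from scipy.stats import wasserstein_distance
from skimage.color import rgb2gray
from skimage.transform import pyramid_gaussian, resize

def gray_histogram(img, bins=256):
    # Normalized intensity histogram; img is a float image in [0, 1]
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def center_surround_emd(image, center_level=1, surround_level=3, downscale=1.4):
    gray = rgb2gray(image)
    levels = list(pyramid_gaussian(gray, downscale=downscale))
    center = levels[center_level]
    # Upsample the coarser "surround" level to the center level's shape
    surround = resize(levels[surround_level], center.shape, anti_aliasing=True)
    bins = np.arange(256)
    # EMD between the two intensity histograms
    # (the histograms act as weights over the bin positions)
    return wasserstein_distance(bins, bins,
                                gray_histogram(center), gray_histogram(surround))

# Example usage (assumed file path, as in the question):
# img = cv2.imread("images/test3.jpg")
# print(center_surround_emd(img))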

Region growing implementation in python, without seeding

I am trying to implement the region growing segmentation algorithm in python, but I am not allowed to use seed points. My idea so far is this:
Start from the very first pixel, check that its neighbors are within the image boundaries (within width and height), then verify that the neighbors are within the threshold (I obtained this by using the Euclidean distance between the current pixel and the neighbor). To make sure that I don't visit the same pixel again, I created a matrix with the same dimensions as the image (width, height) and set all its elements to 0 in the beginning. I start from currentArea number 1 and increment it as the algorithm goes. In the end, after obtaining all areas with their respective average colors, I will use a function that writes the new image based on the tuples I have obtained: (averagePixelColor, areaNumber).
This is my code:
from PIL import Image
from scipy.spatial import distance
import statistics
import numpy as np
import sys

sys.setrecursionlimit(10**9)

# SCRIPT: color palette reduction applier script
# SCRIPT: this is the second method. region growing segmentation

# list of red values of pixels
rList = []
# list of green values of pixels
gList = []
# list of blue values of pixels
bList = []
# this matrix will be initially 0, then every region growth
# will be updated to its number
overlayMatrix = []
# starting area number
currentArea = 1

def isValidPixel(x, y, width, height):
    if x < 0 or x >= width:
        return False
    if y < 0 or y >= height:
        return False
    return True

def get_average_color():
    global rList
    global gList
    global bList
    # set values to none
    r = None
    g = None
    b = None
    # get average value for each channel
    if rList != []:
        r = sum(rList)/len(rList)
        g = sum(gList)/len(gList)
        b = sum(bList)/len(bList)
        # make values integers to be used as pixel
        r = int(r)
        g = int(g)
        b = int(b)
    # return values
    return (r, g, b)

def add_pixel_to_lists(pixel):
    global rList
    global gList
    global bList
    rList.append(pixel[0])
    gList.append(pixel[1])
    bList.append(pixel[2])

def region_growing(x, y, pixel, img, currentArea):
    global overlayMatrix
    global rList
    global gList
    global bList
    # get width, height
    width, height = img.size
    # set this pixel to be visited on current area
    overlayMatrix[x][y] = currentArea
    # set a list for all possible neighbours
    neighbouringPixels = [(x-1, y), (x-1, y-1), (x-1, y+1), (x, y-1), (x, y+1), (x+1, y), (x+1, y-1), (x+1, y+1)]
    # filter to get only valid neighbours
    validPixels = [x for x in neighbouringPixels if isValidPixel(x[0], x[1], width, height) == True]
    # filter pixels to be not visited
    notVisitedPixels = [x for x in validPixels if overlayMatrix[x[0]][x[1]] == 0]
    # set a threshold value
    threshold = 5
    # filter to get only pixels in threshold
    thresholdPixels = []
    thresholdPixels = [x for x in notVisitedPixels if distance.euclidean(img.getpixel(x), pixel) < threshold]
    # set the list for pixels to make recursive calls
    toVisitRecursive = []
    # for every pixel that is a valid neighbour, add it to the toVisit list in recursive call
    # and add its rgb values to the lists so an average can be computed
    for pixel in thresholdPixels:
        toVisitRecursive.append(pixel)
        add_pixel_to_lists(img.getpixel(pixel))
    # compute average
    averagePixel = get_average_color()
    # if I still have neighbours that haven't been visited
    # and are within the threshold, get the first from the list,
    # remove it so we don't do the same again, then apply
    # the algorithm for it
    if toVisitRecursive != []:
        pixel = toVisitRecursive[0]
        toVisitRecursive.remove(pixel)
        region_growing(pixel[0], pixel[1], averagePixel, img, currentArea)
    # finally, return
    return (averagePixel, currentArea+1)

def write_image():
    # this will write to the image based on the list of tuples:
    # average color to the area with number X
    pass

def palette_reduction_mt2(input_image):
    # open original image
    img = Image.open(input_image)
    tuple_list = []
    # get width, height
    width, height = img.size
    # make overlay matrix of 0, initial
    global overlayMatrix
    overlayMatrix = np.zeros((width, height), dtype=np.int32)
    # create new image
    newimg = Image.new("RGB", (width, height), "white")
    currentArea = 1
    # iterate through image pixels
    for y in range(0, height-1):
        for x in range(0, width-1):
            global rList
            global gList
            global bList
            # get pixel from edge image
            p = img.getpixel((x, y))
            # apply region growing
            average, currentArea = region_growing(x, y, p, img, currentArea)
            tuple_list.append((average, currentArea-1))
            # reset color lists to compute new average
            rList = []
            gList = []
            bList = []
    print(tuple_list)
    # Save image
    newimg.save("images/reduced_mt1.jpg")
    # return the name of the image
    return "images/reduced_mt1.jpg"
For some reason, when I print the tuples list, I only get a single value: [(0,0,0), 1]
Can anyone please point me in the right direction?
Thank you!
It seems like you are not actually iterating over the toVisitRecursive list.
Maybe you should add a loop like:
for pixel in toVisitRecursive:
    # region_growing(...)
or change the line if toVisitRecursive != []: into a while loop.
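For illustration, a minimal sketch of the stack-based (while loop) variant hinted at above, written against PIL like the question's code; the grow_region name, its signature, and the fixed threshold of 5 are assumptions:
from scipy.spatial import distance

def grow_region(img, overlay, start, area, threshold=5):
    # Iterative region growing: a stack replaces the recursion, so no
    # recursion-limit tweaking is needed and every queued pixel gets visited.
    width, height = img.size
    seed_color = img.getpixel(start)
    stack = [start]
    members = []
    while stack:
        x, y = stack.pop()
        if overlay[x][y] != 0:
            continue                      # already claimed by some region
        overlay[x][y] = area
        members.append((x, y))
        for nx, ny in ((x-1, y), (x+1, y), (x, y-1), (x, y+1),
                       (x-1, y-1), (x-1, y+1), (x+1, y-1), (x+1, y+1)):
            if 0 <= nx < width and 0 <= ny < height and overlay[nx][ny] == 0:
                if distance.euclidean(img.getpixel((nx, ny)), seed_color) < threshold:
                    stack.append((nx, ny))
    # average color of the region members
    if members:
        r = sum(img.getpixel(p)[0] for p in members) // len(members)
        g = sum(img.getpixel(p)[1] for p in members) // len(members)
        b = sum(img.getpixel(p)[2] for p in members) // len(members)
        return (r, g, b), members
    return seed_color, members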

select with bokeh not really working

I am using bokeh 0.12.2. I have a Select widget with words. When I choose a word it should circle the dot data. It seems to work, then stops. I am trying with 2 words, word1 and word2. lastidx is full of indices. xc and yc are the locations of the circles. Here is the code. This works with one word, but not really if I change the value in the select:
for j in range(0, 2):
    for i in range(0, len(lastidx[j])):
        xc.append(tsne_kmeans[lastidx[j][i], 0])
        yc.append(tsne_kmeans[lastidx[j][i], 1])

source = ColumnDataSource(data=dict(x=xc, y=yc, s=mstwrd))

def callback(source=source):
    dat = source.get('data')
    x, y, s = dat['x'], dat['y'], dat['s']
    val = cb_obj.get('value')
    if val == 'word1':
        for i in range(0, 75):
            x[i] = x[i]
            y[i] = y[i]
    elif val == 'word2':
        for i in range(76, 173):
            x[i-76] = x[i]
            y[i-76] = y[i]
    source.trigger('change')

slct = Select(title="Word:", value="word1", options=mstwrd, callback=CustomJS.from_py_func(callback))

# create the circle around the data where the word exists
r = plot_kmeans.circle('x', 'y', source=source)
glyph = r.glyph
glyph.size = 15
glyph.fill_alpha = 0.0
glyph.line_color = "black"
glyph.line_dash = [4, 2]
glyph.line_width = 1
x and y are loaded with all the data here and I just pick the data for the word I select. It seems to work and then it does not.
Is it possible to do that as a stand alone chart?
Thank you
I figured it out: the code here is just to check that this works. It will be improved, of course. And maybe this is what was described here at the end:
https://github.com/bokeh/bokeh/issues/2618
for i in range(0, len(lastidx[0])):
    xc.append(tsne_kmeans[lastidx[0][i], 0])
    yc.append(tsne_kmeans[lastidx[0][i], 1])

addto = len(lastidx[1]) - len(lastidx[0])
# here I pad out the data which has the least,
# so when you go from one option to the other it
# removes all the previous data circles
for i in range(0, addto):
    xc.append(-16)  # just send them somewhere
    yc.append(16)

for i in range(0, len(lastidx[1])):
    xf.append(tsne_kmeans[lastidx[1][i], 0])
    yf.append(tsne_kmeans[lastidx[1][i], 1])

x = xc
y = yc

source = ColumnDataSource(data=dict(x=x, y=y, xc=xc, yc=yc, xf=xf, yf=yf))
val = "word1"

def callback(source=source):
    dat = source.get('data')
    x, y, xc, yc, xf, yf = dat['x'], dat['y'], dat['xc'], dat['yc'], dat['xf'], dat['yf']
    # if slct.options['value'] == 'growth':
    val = cb_obj.get('value')
    if val == 'word1':
        for i in range(0, len(xc)):
            x[i] = xc[i]
            y[i] = yc[i]
    elif val == 'word2':
        for i in range(0, len(xf)):
            x[i] = xf[i]
            y[i] = yf[i]
    source.trigger('change')

slct = Select(title="Most Used Word:", value=val, options=mstwrd, callback=CustomJS.from_py_func(callback))

# create the circle around the data where the word exists
r = plot_kmeans.circle('x', 'y', source=source)
I will check if I can pass a matrix. Don't forget to keep the data the same size; if not, you will have multiple options circled at the same time.
Thank you
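For completeness, here is a small sketch of the same column-swapping idea written for newer Bokeh versions (where CustomJS.from_py_func is no longer available), assuming all columns have been padded to the same length as described above; names other than the column keys are assumptions:
from bokeh.models import ColumnDataSource, CustomJS, Select

# xc/yc and xf/yf are the per-word coordinate lists, padded to equal length
source = ColumnDataSource(data=dict(x=list(xc), y=list(yc),
                                    xc=xc, yc=yc, xf=xf, yf=yf))

callback = CustomJS(args=dict(source=source), code="""
    const data = source.data;
    const use_first = cb_obj.value == 'word1';
    const xs = use_first ? data['xc'] : data['xf'];
    const ys = use_first ? data['yc'] : data['yf'];
    for (let i = 0; i < data['x'].length; i++) {
        data['x'][i] = xs[i];
        data['y'][i] = ys[i];
    }
    source.change.emit();
""")

select = Select(title="Word:", value="word1", options=["word1", "word2"])
select.js_on_change("value", callback)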
