How to find unicode planes for emojis in Python - python-3.x

I have a pandas DataFrame containing emojis, and I want to categorize them according to their Unicode planes.
emoji | unicode
---------------
😂 | 1F602
😊 | 1F60A
Expected Output
emoji | unicode | Plane
-----------------------
😂 | 1F602 | 1
😊 | 1F60A | 1
⛹ | 26F9 | 0
Here Plane 0 refers to the Basic Multilingual Plane (BMP) and Plane 1 refers to the Supplementary Multilingual Plane (SMP).
[NB: please use Safari on Mac, Firefox on Linux, Chrome on Windows to see this question with proper emoji symbols]

Both 😂 and 😊 belong to Plane 1, the Supplementary Multilingual Plane (SMP).
The following snippet shows one way to get the Unicode plane number: it is ord(ch) >> 16 (a bitwise right shift by 16 bits).
for ch in '✌⛹☹☺☻😂😊':
    print(ch, '\t{:04x}\t'.format(ord(ch)), ord(ch) >> 16)
✌ 270c 0
⛹ 26f9 0
☹ 2639 0
☺ 263a 0
☻ 263b 0
😂 1f602 1
😊 1f60a 1

Please always give a minimal reproducible example to help others help you.
According to your link on Unicode Planes,
There are 17 planes, identified by the numbers 0 to 16, which corresponds with the possible values 00–10 (in base 16) of the first two positions in six position hexadecimal format (U+hhhhhh).
Based on that explanation, let's write a function to get that information.
# in the comments, we can use char = '😀'
def unicode_to_plane(char: str) -> int:
    unicode_codepoint = ord(char)  # 128512
    hex_repr = hex(unicode_codepoint)  # '0x1f600'
    hex_digits = hex_repr[2:]  # '1f600'
    plane = 0  # Assume plane is 0 until proven otherwise
    if len(hex_digits) > 4:  # The plane is 0 if hex representation is four hex digits or less
        hex_plane = hex_digits[:-4]  # '1' (take away the last four characters)
        plane = int(hex_plane, 16)  # 1 (convert hex characters to integer)
    return plane  # 1
Please note that, according to the wiki on Emoji,
Most, but not all, emoji are included in the Supplementary Multilingual Plane (SMP) of Unicode.
and the SMP is Plane 1.
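Since the table in the question stores code points as hexadecimal strings, the plane can also be computed straight from that column. The following is only a minimal sketch; the helper names plane_from_hex and plane_from_char are made up for illustration, and it assumes each cell holds a single code point.

```python
def plane_from_hex(code_point_hex: str) -> int:
    """Plane of a code point given as a hex string such as '1F602'."""
    return int(code_point_hex, 16) >> 16


def plane_from_char(ch: str) -> int:
    """Plane of a single character (one code point)."""
    return ord(ch) >> 16


# Rows mirroring the table in the question: (emoji, hex code point).
rows = [("\U0001F602", "1F602"), ("\U0001F60A", "1F60A"), ("\u26F9", "26F9")]

for emoji, code in rows:
    # Both helpers agree because the hex string is just ord(emoji) in base 16.
    print(emoji, code, plane_from_hex(code), plane_from_char(emoji))
```

With a DataFrame, the same helper could be applied column-wise, for example df["Plane"] = df["unicode"].apply(plane_from_hex), assuming the column is named unicode as in the question.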

Related

How do I decode the text of a PDF file using Python

I have been trying to decode a pdf file using python and the data is as below:
BT
/F2 8.8 Tf
1 0 0 1 36.85 738.3 Tm
0 g
0 G
[(A)31(c)-44(c)-44(o)-79(u)11(n)-79(t)5( )] TJ
ET
How do I make sense of this???
What type of data is [(A)31(c)-44(c)-44(o)-79(u)11(n)-79(t)5( )]???
BT /F2 8.8 Tf 1 0 0 1 36.85 738.3 Tm 0 g 0 G [(A)31(c)-44(c)-44(o)-79(u)11(n)-79(t)5( )] TJ ET
is ordinary plain ASCII text; the binary stream has already been decoded into everyday text.
Your question is
Q) How do I make sense of this??? [(A)31(c)-44(c)-44(o)-79(u)11(n)-79(t)5( )]
A) Always look at the context
BT = B(egin) T(ext)
/F2 = use F(ont) 2 for encoding (whatever that is)
8.8 = units of height (if unmodified, those could be 8.8 full unscaled DTP points,
but beware: point size does not necessarily correspond to any measurement
of the size of the letters on the printed page)
... Tm = mainly the T(ext) m(atrix), e.g. placement
[ = start a string group
(A) = literal "A"
31 = kern the next character (a positive value usually shifts it leftwards) by 31 units, where a unit is 1/1000 of a text-space unit, so it scales with the font size
(c) = the next glyph literal that needs to be etched on screen or paper
-44 = push the two (c) glyphs apart by 44 units
(c)
...
] TJ ET = close group, T(ext e)J(ect), E(nd) T(ext)
So there we have it: somewhere on the page (first word, last word, or anywhere in between, but most likely top left) there is one continuous, selectable plain-text string that audibly sounds like a word in a human language, "Account", with an extra space literal (which is actually unnecessary in a PDF; it will print that and any other "word" well enough without one).
Why did I say "sounds like" and not "looks like"? Because those "literal" characters are not necessarily the ones presented; they are the encoded names of glyphs.
Hear :-) is how they could look using /F2 if it were set to a different glyph font, such as emojis or other dingbats: A is BC, c is a checkbox, u is underground, t is train. Audibly, though, all that ink is just an "Account" of which graphics to use.

Understanding the torch.nn.functional.grid_sample op by a concrete example

I am debugging a neural network which has a torch.nn.functional.grid_sample operator inside. Using the PyCharm IDE, I can watch the values during debugging. My grid is a 1*15*2 tensor, here are the values in the first batch.
My input is a 1*128*16*16 tensor, here are the values in the first channel of the first batch:
My output is a 1*128*1*15 tensor, here are the values in the first channel of the first batch.
align_corners = False, mode = 'bilinear', padding_mode = 'zeros'.
For grid coordinates (-1,-1), I can understand that the value (-4.74179) is sampled from 4 values at the top-left corner, 3 of them being padded zeros and 1 of them being the value -18.96716 (-18.96716/4 = -4.74179).
But for other grid coordinates, I am confused. Take the value 84.65594, for example; its corresponding grid coordinate is (-0.45302, 0.53659). I first convert the coordinates from (-1,1) to (0,15) by adding 1, dividing by 2, and multiplying by 15 (see the official implementation). The converted coordinate is then (4.10235, 11.524425), from which I see that the four values to be sampled are:
(x)44.20010---0.10235---------(y)26.68777
| | |
| | |
0.524425---(a,b)--------------------
| | |
| | |
(w)102.18765---------------------(z)30.03996
Here is my calculation by hand. Let:
a = 0.10235
b = 0.524425
x = 44.20010
y = 26.68777
z = 30.03996
w = 102.18765
The interpolated value should then be:
output = a*b*z + (1 - a)*(1 - b)*x + (1 - a)*b*w + (1 - b)*a*y
= 0.10235*0.524425*30.03996 + (1 - 0.10235)*(1 - 0.524425)*44.20010
+ (1 - 0.10235)*0.524425*102.18765 + (1 - 0.524425)*0.10235*26.68777
= 69.8852865171
which isn't 84.65594. I can't figure out how the value 84.65594 in the output is calculated. Please help!
Answering my own question: it turns out that the inconsistency is due to the 'align_corners' flag. My calculation is actually valid for the case where 'align_corners' is true, while in the program this flag is set to false. For how to calculate sample coordinates, please see this
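To make the difference concrete, the two coordinate conversions can be sketched in plain Python. The function name unnormalize is made up here; the formulas are the usual mapping from the normalized range [-1, 1] to pixel coordinates for a dimension of the given size, with and without align_corners.

```python
def unnormalize(coord, size, align_corners):
    """Map a normalized grid coordinate in [-1, 1] to a pixel coordinate."""
    if align_corners:
        # -1 and 1 refer to the centers of the first and last pixels.
        return (coord + 1) / 2 * (size - 1)
    # -1 and 1 refer to the outer edges of the first and last pixels.
    return ((coord + 1) * size - 1) / 2


grid_x, grid_y = -0.45302, 0.53659
size = 16  # the input is 16 x 16

# The hand calculation in the question (align_corners=True): about (4.10235, 11.524425)
print(unnormalize(grid_x, size, True), unnormalize(grid_y, size, True))

# What the program actually uses (align_corners=False): about (3.87584, 11.79272)
print(unnormalize(grid_x, size, False), unnormalize(grid_y, size, False))
```

With align_corners=False the coordinate -1 maps to -0.5, halfway between a padded pixel and the first real pixel, which is consistent with the corner value being a quarter of -18.96716.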

Check if a point is inside an arbitrary hexahedron

I am working on a 3D finite element code, where I face the following problem:
If I take an arbitrary point (say x), how do I figure out which element it is in?
This can be simplified to: How do I check if an arbitrary point (x) lies inside or outside of a (hexahedral) element?
What I already found:
Limited to cubes: How to determine a point is inside or outside a cube?
Limited to rectangular shapes: https://math.stackexchange.com/questions/1472049/check-if-a-point-is-inside-a-rectangular-shaped-area-3d
Contrary to the two approaches above, my problem does not assume right angles nor parallel faces.
Problem sketch:
Notation: (again: though the sketch shows a regular shape, our hexahedron is assumed to be of general shape)
8-node hexahedron topology, nodes: 0,..,7
axis: r,s,t
t
|
4--------|-------------7
/| | /|
/ | | / |
/ | | / |
/ | | / |
/ | | / |
/ | | / |
5----------------------6 |
| | | | |
| | o------|---------s
| | / | |
| 0------/--------|------3
| / / | /
| / / | /
| / / | /
| / / | /
| / r | /
|/ |/
1----------------------2
Data that we have available:
coordinates of the nodes (vectors P0 to P7)
coordinates of the point we want to check (let's say Px)
Additionally, we assume the nodes are ordered as sketched above.
My approach/solution so far:
calculate the surface (outward) normal vectors
Use cross products, eg. for the r_pos_normal_vec (pointing out of the plane)
r_pos_normvec = (P2-P1) x (P5-P1)
and for the r_neg_normal_vec
r_neg_normvec = (P4-P0) x (P3-P0)
similarly for the s and t directions
check two opposite corner nodes (I chose node 0 and node 6)
  For node 0:
    calculate the vector from P0 to Px:
      P0x = Px - P0
    calculate the inner product of P0x with the normals of the surfaces adjacent to node 0:
      <P0x, r_neg_normal_vec>
      <P0x, s_neg_normal_vec>
      <P0x, t_neg_normal_vec>
  For node 6:
    same scheme as for node 0, but with P6 instead of P0 and the positive counterparts of the normal vectors
If and only if all 6 inner products (3 from node 0 and 3 from node 6) are negative, the point is inside the hexahedron.
Question:
I implemented the functionality described above in my code and ran some tests.
It seems to work, from the math side I am quite confident.
Please discuss my approach, I am happy for any hints/clues/recommendations/bug fixes ...
Is there some way to make this faster?
Alternative solutions?
Note:
To speed up the algorithm a box check can be done first:
Construct a rectangular box around the hexahedron:
Get the min and max values of the node coordinates in each direction.
If the point to check (x) is outside this (larger) box, it cannot be inside the hexahedron.
For any convex polyhedron, establish the implicit equations of the faces (f.i. plane by three points), of the form ax+by+cz+d=0.
When you plug the coordinates of a known point inside the volume (such as the center) in the expression ax+by+cz+d, you will get a set of signs. An arbitrary point is inside if it yields the same signs.
Update:
For maximum performance, you can also consider using an axis-aligned bounding box for quick rejection. This only makes sense if many of the points are outside. Make sure to use short-circuit evaluation so that early rejection can happen.
Note that a rejection test such as X<Xmin is nothing but the above sign test against the plane of equation X-Xmin=0.
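A rough sketch of the sign test with a bounding-box pre-check follows, in plain Python. It assumes a convex hexahedron with the node ordering from the question's sketch; the face table FACES and the function names are made up for illustration. Each face is approximated by a plane through its center whose normal is the cross product of the face diagonals, and the normal is oriented using the centroid as the known interior point.

```python
# Node indices of the six faces, following the sketch in the question.
FACES = [(1, 2, 6, 5), (0, 3, 7, 4),   # r positive, r negative
         (3, 2, 6, 7), (0, 1, 5, 4),   # s positive, s negative
         (4, 5, 6, 7), (0, 1, 2, 3)]   # t positive, t negative


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def mean(points):
    return tuple(sum(p[i] for p in points) / len(points) for i in range(3))


def point_in_hexahedron(point, nodes, tol=1e-12):
    # Quick rejection: outside the axis-aligned bounding box means outside.
    for i in range(3):
        coords = [n[i] for n in nodes]
        if point[i] < min(coords) or point[i] > max(coords):
            return False
    centroid = mean(nodes)  # known interior point of a convex hexahedron
    for face in FACES:
        a, b, c, d = (nodes[i] for i in face)
        normal = cross(sub(c, a), sub(d, b))  # cross product of the diagonals
        center = mean((a, b, c, d))
        # Flip the normal if needed so that it points away from the centroid.
        if dot(normal, sub(center, centroid)) < 0:
            normal = tuple(-x for x in normal)
        if dot(normal, sub(point, center)) > tol:
            return False  # point lies on the outer side of this face
    return True
```

Because the orientation is fixed with the centroid, the result does not depend on the order of the two vectors in each cross product. For strongly warped faces the single plane per face is only an approximation.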
I personally prefer your method; however, there is also a way to approach the problem if the hexahedron is restricted to a parallelepiped. You can transform the coordinates of P from the frame $(0; e_1; e_2; e_3)$ to the frame $(P_0; P_0P_1, P_0P_3, P_0P_4)$. Call the new coordinates $(a,b,c)$; then the point is inside the parallelepiped if $a, b, c \in [0,1]$.
Because you mentioned that you want to be able to handle arbitrary hexahedrons, I think that your process might be improved if you adjust your s, r, and t normals to account for having slightly warped faces. I would do this by making the following change to r normals (and similar for s and t):
r_pos_normvec = (P6-P1) x (P5-P2)
r_neg_normvec = (P4-P3) x (P7-P0)
This would be important for a case where you shifted node 6 towards node 7 (say 0.9xP6) and had a point at 0.95xP6. Without the warping correction, I believe you would erroneously determine the point as inside the hexahedron.
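Here is a Python example:
import numpy as np

def point_is_in_hexa(point, centers, normals):
    # centers: one point on each of the 6 faces; normals: the 6 outward face normals
    vect = []
    prod = []
    for i in range(6):
        vect.append(point - centers[i])
    vect = np.array(vect)
    for i in range(6):
        prod.append(np.dot(vect[i], normals[i]))
    prod = np.array(prod)
    # Inside (or on the boundary) if the point is behind every outward-facing plane
    if all(prod <= 0):
        is_in_hexa = 1
    else:
        is_in_hexa = -1
    return is_in_hexa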
https://github.com/fgomez03/hexacheck

Greedy Algorithms and Time Complexity #2

We have a bomb that is ticking and may explode. This bomb has n switches, that can be moved up or down. Certain combinations of these switches trigger the bomb, but only one combination disables it.
Our task is to move the switches from the current position to a position that disables the bomb, without exploding it in the meantime. The switches are big and awkward, so we can move only one switch at a time.
We have, let's say, n = 4 switches currently in position ^vvv. We need to get them to the position ^v^^. Forbidden positions are vvv^, ^vv^, ^v^v, and ^^^v.
a.) I had to draw this by hand and find the shortest sequence of switch movements that solves the task. The result I got was 4, and I found two such sequences, if I am right.
b.) This is where it gets hard: write code that answers the above questions (the length of the shortest sequence and how many such sequences there are). The code should be generalized so that it works with another number of switches and other starting, target, and forbidden combinations; there may be more or fewer target and forbidden combinations. The only thing we know for sure is that the switches have only two positions. It should also handle the possibility that the desired position is unreachable; in this case, the program should of course say so.
c.) The next question is the time complexity of this code, but for now I think I will just stop here.
I used '0' and '1' instead, because it is easier for me to imagine this.
So my approach was something like a greedy algorithm (I think): from the starting position, consider all the possible (allowed) positions, ignore the forbidden ones, then pick the one that differs the least from the target position.
The key part of the code I am yet to write and that's the part I need help with.
all_combinations = ['0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001', '1010', '1011', '1100', '1101', '1110', '1111']
def distance(position1, position2):
    # Number of switches that differ between the two positions
    count = 0
    for i in range(len(position1)):
        if position1[i] != position2[i]:
            count += 1
    return count
def allowed_positions(current, all_combinations):
    # Positions reachable by moving exactly one switch
    allowed = set()
    for combination in all_combinations:
        if distance(current, combination) == 1:
            allowed.add(combination)
    return allowed
def best_name(current, all_combinations, target):
    # Unfinished: collects (distance to target, option) for each reachable option
    candidates = []
    for option in allowed_positions(current, all_combinations):
        candidates.append((distance(option, target), option))
The task at hand is finding a shortest path in a graph. For this there is one typical approach and that is a breadth-first search algorithm (https://en.wikipedia.org/wiki/Breadth-first_search).
There is no real need to go into the details of how this is done because it can be read elsewhere in more detail and far better explained than I can do this in a StackOverflow answer.
But what might need to be explained is how the switch-combinations you have at hand are represented by a graph.
Imagine you have just two switches. Then you have exactly this graph:
^^---^v
| |
| |
v^---vv
If your starting position is ^^ and your ending (defusing) position is vv while the position ^v is an exploding position, then your graph is reduced to this:
^^ ^v
|
|
v^---vv
In this small example the shortest path is obvious and simple.
The graph at hand is easily sketched out in 2D, each dimension (x and y) representing one of the switches. If you have more switches, then you just add one dimension for each switch. For three switches this would look like this:
^^^--------^^v
|\ |\
| \ | \
| \ | \
| \ | \
| ^v^--- | --^vv
| | | |
| | | |
v^^--------v^v |
\ | \ |
\ | \ |
\ | \ |
\| \|
vv^--------vvv
If the positions ^^v, v^^, and vv^ are forbidden, then this graph is reduced to this:
^^^ ^^v
\
\
\
\
^v^--------^vv
|
|
v^^ v^v |
\ |
\ |
\ |
\|
vv^ vvv
Which already shows the clear way and the breadth-first search will easily find it. It gets interesting only for many dimensions/switches, though.
Drawing this for more dimensions/switches gets confusing of course (look up tesseracts for 4D). But it isn't necessary to have a visual image. Once you have written the algorithm for creating the graph in 2D and 3D in a general way it easily scales to n dimensions/switches without adding any complexity.
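The graph does not need to be built explicitly: flipping one bit of a position yields a neighbor. The following is a sketch of a breadth-first search along those lines, not the answerer's code; the names parse and shortest_switch_paths are invented, and it assumes ^ is encoded as 1 and v as 0. It returns the length of the shortest sequence and how many shortest sequences exist, or None if no target is reachable.

```python
from collections import deque


def parse(position):
    """Turn a string such as '^vvv' into an integer bit pattern (^ = 1, v = 0)."""
    return int(position.replace("^", "1").replace("v", "0"), 2)


def shortest_switch_paths(start, targets, forbidden, n):
    """Return (moves, number_of_shortest_sequences), or None if unreachable."""
    targets, forbidden = set(targets), set(forbidden)
    if start in forbidden:
        return None
    dist = {start: 0}   # fewest moves needed to reach each position
    ways = {start: 1}   # number of shortest sequences reaching each position
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for i in range(n):
            nxt = state ^ (1 << i)  # move switch i
            if nxt in forbidden:
                continue
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                ways[nxt] = ways[state]
                queue.append(nxt)
            elif dist[nxt] == dist[state] + 1:
                ways[nxt] += ways[state]  # another equally short route
    reachable = [t for t in targets if t in dist]
    if not reachable:
        return None
    best = min(dist[t] for t in reachable)
    return best, sum(ways[t] for t in reachable if dist[t] == best)


result = shortest_switch_paths(
    parse("^vvv"), [parse("^v^^")],
    [parse(p) for p in ("vvv^", "^vv^", "^v^v", "^^^v")], 4)
print(result)  # (4, 2)
```

Every one of the 2^n positions is dequeued at most once and has n neighbors, so this sketch runs in O(n * 2^n) time.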
start = 8        # 0b1000 = ^vvv
target = 11      # 0b1011 = ^v^^
forbidden = {1: -1, 9: -1, 10: -1, 14: -1}  # also used to record each visited code's predecessor
dimensions = 4

def distance(start, target, forbidden, dimensions):
    stack1 = []
    stack1.append(start)
    forbidden[start] = -1
    while len(stack1) > 0:
        top = stack1.pop()
        for i in range(dimensions):
            testVal = top ^ (1 << i)  # flip switch i
            if testVal == target:
                forbidden[testVal] = top
                result = [testVal]
                # Walk back through the recorded predecessors to rebuild the path
                while testVal != start:
                    testVal = forbidden[testVal]
                    result.insert(0, testVal)
                return result
            if testVal not in forbidden:
                forbidden[testVal] = top
                stack1.append(testVal)
    return [-1]

print(distance(start, target, forbidden, dimensions))
Here is my code for the example in your question. Instead of using bit strings, I went ahead and used base 10 numbers to represent the codes. Forbidden codes are stored in a hashmap, which is used later to trace the path back once the target is found. I use a stack to keep track of which code to try. Each time the while loop runs, the last code added is popped and its unvisited neighbors are added to the stack. Importantly, to prevent cycles, codes that are on the stack or have been seen before are added to the hashmap of forbidden nodes. When the target code is found for the first time, the function returns early and the path is traced through the hashmap.
Because it uses a stack, this solution is a depth-first search and returns the first time the target is found. That means it does not guarantee the shortest path from start to target, but it does guarantee a working path if one is available; replacing the stack with a first-in, first-out queue would turn it into a breadth-first search, which does find a shortest path. Since all codes may be traversed, there are 2^dimensions nodes, and each node has n neighbors, the time complexity of this algorithm is O(n * 2^n).

How does NumPy's advanced indexing work for boolean masks?

I am very new to Python, and I am struggling with understanding how the following code works.
I have an image read into a numpy array. Similar to this:
# Read the image
image = mpimg.imread('images/rgb-road.png')
After this I set a 3-value boundary list, rgb_boundary = [123, 231, 122]. The values are irrelevant.
Then comes the confusing part.
boundary = (image[:,:,0] < rgb_boundary[0]) \
| (image[:,:,1] < rgb_boundary[1]) \
| (image[:,:,2] < rgb_boundary[2])
image[boundary] = [0,0,0]
It is the combination of my poor knowledge of Python syntax and working with images that is causing the issue.
I would be extremely happy if somebody could explain what is happening in the above piece of code. Especially in the line where we have the image[boundary] assignment. My image values are changed, but I don't understand how this is working in Python.
In addition, if there is a resource where I can read about how/why this is working, please feel free to refer me to it.
Thanks!
Let's start with the first piece that executes:
image[:,:,0] < rgb_boundary[0]
Now, rgb_boundary[0] is simply 123, so what this does is find the locations where the first color (red) in the image is less than 123 intensity (so about half-bright, since 255 is max-bright).
The result of the above expression will be a 2D boolean array, which is True wherever the red color byte is less than 123, and False elsewhere.
We can then understand this code:
boundary = (image[:,:,0] < rgb_boundary[0]) \
| (image[:,:,1] < rgb_boundary[1]) \
| (image[:,:,2] < rgb_boundary[2])
It is creating a 2D mask called boundary which will be True for any pixel where red is less than 123, or green is less than 231, or blue is less than 122. It will be False wherever none of those conditions are met.
Finally:
image[boundary] = [0,0,0]
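This sets the selected pixels to black. In NumPy, a boolean mask like boundary can be used as an index (boolean indexing, a form of "fancy" or advanced indexing). On the left-hand side of an assignment, image[boundary] selects every pixel where the boundary condition is True and writes to it in place, and the 3-element list [0,0,0] (black) is broadcast across all selected pixels. Note that reading image[boundary] on its own returns a copy of the selected pixels, not a view.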
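The steps above can be tried on a tiny 2x2 "image". This is only an illustrative sketch: the pixel values are invented, and an 8-bit integer array is assumed instead of a real file loaded with mpimg.imread.

```python
import numpy as np

# A made-up 2x2 RGB image (rows x columns x channels).
image = np.array([[[200, 240, 130], [100, 240, 130]],
                  [[200, 100, 130], [200, 240, 50]]], dtype=np.uint8)
rgb_boundary = [123, 231, 122]

# 2D boolean mask: True where any channel is below its boundary value.
boundary = ((image[:, :, 0] < rgb_boundary[0])
            | (image[:, :, 1] < rgb_boundary[1])
            | (image[:, :, 2] < rgb_boundary[2]))
print(boundary)  # only the top-left pixel is False

# Reading with the mask gives a copy with one row per selected pixel.
selected = image[boundary]
print(selected.shape)  # (3, 3)

# Assigning with the mask modifies the image in place; [0, 0, 0] is
# broadcast to every selected pixel.
image[boundary] = [0, 0, 0]
print(image[0, 0], image[0, 1])
```

Changing selected afterwards would leave image untouched, which is the practical difference between the copy returned by reading and the in-place assignment.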
