How do I decode the text of a PDF file using Python?

I have been trying to decode a pdf file using python and the data is as below:
BT
/F2 8.8 Tf
1 0 0 1 36.85 738.3 Tm
0 g
0 G
[(A)31(c)-44(c)-44(o)-79(u)11(n)-79(t)5( )] TJ
ET
How do I make sense of this?
What type of data is [(A)31(c)-44(c)-44(o)-79(u)11(n)-79(t)5( )]?

BT /F2 8.8 Tf 1 0 0 1 36.85 738.3 Tm 0 g 0 G [(A)31(c)-44(c)-44(o)-79(u)11(n)-79(t)5( )] TJ ET
This is ordinary plain ASCII text: an everyday page content stream that has already been decoded from its binary form into text.
Your question is
Q) How do I make sense of this??? [(A)31(c)-44(c)-44(o)-79(u)11(n)-79(t)5( )]
A) Always look at the context
BT = B(egin) T(ext)
/F2 = use the F(ont) resource named F2, with whatever encoding that font has
8.8 = font size in units of height (if unmodified, those could be 8.8 full unscaled DTP points,
but beware, point size does not necessarily correspond to any measurement
of the size of the letters on the printed page.)
1 0 0 1 36.85 738.3 Tm = T(ext) m(atrix), mainly placement: here no scaling or rotation, origin at x = 36.85, y = 738.3
0 g and 0 G = set the fill and stroke colors to black (gray level 0)
[ = start an array of strings and numbers
(A) = literal "A"
31 = kern: shift the next glyph (a positive number usually means leftwards) by 31 units, where a unit is 1/1000 of a text space unit, so at font size 8.8 this is 31/1000 × 8.8 ≈ 0.27 points
(c) = the next glyph literal that needs to be etched on screen or paper
-44 = push the two (c) glyphs apart by 44 units
(c)
...
] TJ ET = close the array, show the text with those adjustments (TJ), E(nd) T(ext)
So there we have it: somewhere on the page (first or last word, or anywhere in between), most likely top left, there is one continuous selectable plain text string that audibly sounds like a word in a human language = "Account", with an extra space literal (which is actually unnecessary in a PDF; it will print that and any other "word" well enough without one).
I said "sounds like" and not "looks like" because those "literal" characters are not necessarily the ones presented: they are the encoded names of glyphs.
Hear :-) is how they could look using /F2 if it were set to a different glyph font, such as emojis or other dingbats: A is BC, c is a checkbox, u is underground, t is train. Audibly, though, all that ink is just an Account of which graphics to use.
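As a rough illustration of the breakdown above, here is a minimal Python sketch that pulls the literal strings and the kerning numbers out of a TJ array. The function names are made up for this example. It assumes the font's encoding maps the string bytes straight to ASCII (as it appears to here), and it does not handle hex strings, nested parentheses, or unescaping, so treat it as a sketch and not a general PDF text extractor.

```python
import re

# Matches one PDF literal string such as (A), allowing backslash escapes like \( and \).
LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")

def text_from_tj(array):
    """Join the literal strings of a TJ array, dropping the kerning numbers."""
    return "".join(match[1:-1] for match in LITERAL.findall(array))

def shifts_in_text_units(array, font_size):
    """Horizontal shift caused by each number: -number / 1000 * font size."""
    # Blank out the strings first so digits inside them are not mistaken for kerning.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", LITERAL.sub(" ", array))
    return [-float(n) / 1000 * font_size for n in numbers]

tj = "[(A)31(c)-44(c)-44(o)-79(u)11(n)-79(t)5( )]"
print(repr(text_from_tj(tj)))            # 'Account '
print(shifts_in_text_units(tj, 8.8)[0])  # about -0.2728, a small shift to the left
```

If /F2 used a custom encoding, the joined bytes would still be "Account " but the glyphs drawn could be anything, which is the point made above.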

Related

Transformed colors when painting semi-transparent in p5.js

A transformation seems to be applied when painting colors in p5.js with an alpha value lower than 255:
for (const color of [[1,2,3,255],[1,2,3,4],[10,11,12,13],[10,20,30,40],[50,100,200,40],[50,100,200,0],[50,100,200,1]]) {
  clear();
  background(color);
  loadPixels();
  print(pixels.slice(0, 4).join(','));
}
Input / Expected Output | Actual Output (Firefox)
------------------------------------------------
1,2,3,255 | 1,2,3,255 ✅
1,2,3,4 | 0,0,0,4
10,11,12,13 | 0,0,0,13
10,20,30,40 | 6,19,25,40
50,100,200,40 | 51,102,204,40
50,100,200,0 | 0,0,0,0
50,100,200,1 | 0,0,255,1
The alpha value is preserved, but the RGB information is lost, especially on low alpha values.
This makes visualizations impossible where, for example, 2D shapes are first drawn and then the visibility in certain areas is animated by changing the alpha values.
Can these transformations be turned off or are they predictable in any way?
Update: The behavior is not specific to p5.js:
const ctx = new OffscreenCanvas(1, 1).getContext('2d');
for (const [r,g,b,a] of [[1,2,3,255],[1,2,3,4],[10,11,12,13],[10,20,30,40],[50,100,200,40],[50,100,200,0],[50,100,200,1]]) {
  ctx.clearRect(0, 0, 1, 1);
  ctx.fillStyle = `rgba(${r},${g},${b},${a/255})`;
  ctx.fillRect(0, 0, 1, 1);
  console.log(ctx.getImageData(0, 0, 1, 1).data.join(','));
}

I could be way off here, but it looks like the background method internally calls blendMode if _isErasing is true. By default, this applies a linear interpolation of colours.
See https://github.com/processing/p5.js/blob/9cd186349cdb55c5faf28befff9c0d4a390e02ed/src/core/p5.Renderer2D.js#L45
See https://p5js.org/reference/#/p5/blendMode
BLEND - linear interpolation of colours: C = A*factor + B. This is the
default blending mode.
So, if you set the blend mode to REPLACE I think it should work.
REPLACE - the pixels entirely replace the others and don't utilize
alpha (transparency) values.
i.e.
blendMode(REPLACE);
for (const color of [[1,2,3,255],[1,2,3,4],[10,11,12,13],[10,20,30,40],[50,100,200,40],[50,100,200,0],[50,100,200,1]]) {
  clear();
  background(color);
  loadPixels();
  print(pixels.slice(0, 4).join(','));
}

Internally, the HTML Canvas stores colors in a different way that cannot preserve RGB values when fully transparent. When writing and reading pixel data, conversions take place that are lossy due to the representation by 8-bit numbers.
Take for example this row from the test above:
Input / Expected Output | Actual Output
--------------------------------------
10,20,30,40 | 6,19,25,40
IN (conventional alpha)
       | R  | G  | B  | A
-----------------------------------
values | 10 | 20 | 30 | 40 (= 15.6%)
Interpretation: When painting, add 15.6% of (10,20,30) to the 15.6% darkened (r,g,b) background.
Canvas-internal (premultiplied alpha)
               | R          | G          | B          | A
-------------------------------------------------------------------
calculation    | 10 * 0.156 | 20 * 0.156 | 30 * 0.156 | 40 (= 15.6%)
values         | 1.56       | 3.12       | 4.7        | 40
values (8-bit) | 1          | 3          | 4          | 40
Interpretation: When painting, add (1,3,4) to the 15.6% darkened (r,g,b) background.
Premultiplied alpha allows faster painting and supports additive colors, that is, adding color values without darkening the background.
OUT (conventional alpha)
               | R         | G         | B         | A
------------------------------------------------------
calculation    | 1 / 0.156 | 3 / 0.156 | 4 / 0.156 | 40
values         | 6.41      | 19.23     | 25.64     | 40
values (8-bit) | 6         | 19        | 25        | 40
So the results are predictable, but due to the different internal representation, the transformation cannot be turned off.
The HTML specification explicitly mentions this in section 4.12.5.1.15 Pixel manipulation:
Due to the lossy nature of converting between color spaces and converting to and from premultiplied alpha color values, pixels that have just been set using putImageData(), and are not completely opaque, might be returned to an equivalent getImageData() as different values.
see also 4.12.5.7 Premultiplied alpha and the 2D rendering context
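The round trip described in this answer can be modelled with a few lines of Python. This is only a sketch under a simplifying assumption: both conversions truncate to integers and use alpha / 255 exactly. Real browsers round differently in places, so this model reproduces the 10,20,30,40 row and the low-alpha rows from the question, but not, for example, the 50,100,200,40 row. The function names are invented for this illustration.

```python
def premultiply(rgba):
    """Conventional RGBA -> premultiplied RGBA, truncated to 8-bit integers."""
    r, g, b, a = rgba
    return tuple(c * a // 255 for c in (r, g, b)) + (a,)

def unpremultiply(rgba):
    """Premultiplied RGBA -> conventional RGBA, truncated to 8-bit integers."""
    r, g, b, a = rgba
    if a == 0:
        # Fully transparent: the color information is gone.
        return (0, 0, 0, 0)
    return tuple(min(255, c * 255 // a) for c in (r, g, b)) + (a,)

def round_trip(rgba):
    """What a write followed by a read returns under this simplified model."""
    return unpremultiply(premultiply(rgba))

print(premultiply((10, 20, 30, 40)))  # (1, 3, 4, 40)
print(round_trip((10, 20, 30, 40)))   # (6, 19, 25, 40)
print(round_trip((1, 2, 3, 4)))       # (0, 0, 0, 4)
```

The lower the alpha, the fewer distinct premultiplied values exist per channel, which is why the loss is most visible at small alpha values.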

How to find unicode planes for emojis in Python

I have pandas dataframe containing emojis and I want to categorize them according to their Unicode Planes.
emoji | unicode
---------------
😂 | 1F602
😊 | 1F60A
Expected Output
emoji | unicode | Plane
-----------------------
😂 | 1F602 | 1
😊 | 1F60A | 1
⛹ | 26F9 | 0
Here Plane 0 refers to the Basic Multilingual Plane (BMP) and Plane 1 refers to the Supplementary Multilingual Plane (SMP).
[NB: please use Safari on Mac, Firefox on Linux, Chrome on Windows to see this question with proper emoji symbols]

Both 😂 and 😊 belong to Plane 1, the Supplementary Multilingual Plane (SMP).
The following snippet illustrates one way to get the Unicode plane number: it is ord(ch) >> 16 (see bitwise right shift).
for ch in '✌⛹☹☺☻😂😊':
    print(ch, '\t{:04x}\t'.format(ord(ch)), ord(ch) >> 16)
✌ 270c 0
⛹ 26f9 0
☹ 2639 0
☺ 263a 0
☻ 263b 0
😂 1f602 1
😊 1f60a 1

Please always give a minimal reproducible example to help others help you.
According to your link on Unicode Planes,
There are 17 planes, identified by the numbers 0 to 16, which corresponds with the possible values 00–10 (in base 16) of the first two positions in six position hexadecimal format (U+hhhhhh).
Based on that explanation, let's write a function to get that information.
# The values in the comments assume char = '😀'
def unicode_to_plane(char: str) -> int:
    unicode_codepoint = ord(char)  # 128512
    hex_repr = hex(unicode_codepoint)  # '0x1f600'
    hex_digits = hex_repr[2:]  # '1f600'
    plane = 0  # Assume plane is 0 until proven otherwise
    if len(hex_digits) > 4:  # The plane is 0 if hex representation is four hex digits or less
        hex_plane = hex_digits[:-4]  # '1' (take away the last four characters)
        plane = int(hex_plane, 16)  # 1 (convert hex characters to integer)
    return plane  # 1
Please note that, according to the wiki on Emoji,
Most, but not all, emoji are included in the Supplementary Multilingual Plane (SMP) of Unicode.
and the SMP is Plane 1.
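Since the dataframe in the question already stores the code points as hex strings, the plane can also be computed from that column directly. A minimal sketch, where plane_from_hex is a name invented for this example:

```python
def plane_from_hex(code_point_hex):
    """Return the Unicode plane for a code point given as a hex string, e.g. '1F602'."""
    # Each plane spans 0x10000 code points, so the plane is everything above the low 16 bits.
    return int(code_point_hex, 16) >> 16

# The rows from the question's expected output.
rows = [("😂", "1F602"), ("😊", "1F60A"), ("⛹", "26F9")]
for emoji, code in rows:
    print(emoji, code, plane_from_hex(code))
```

With pandas, and assuming the columns are named as in the question's table, something like df["Plane"] = df["unicode"].map(plane_from_hex) should fill the new column.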

Printing in Color in HTBasic/Rocky Mountain BASIC in Windows XP

Can someone help me understand this "Rocky Mountain BASIC" or "HTBasic" code?
I have to find out why the print functionality doesn't work anymore.
First, this line
PRINTER IS 26
I understand that the printer that we are going to use is "26" but what does 26 mean?
REPEAT
  IF LWC$(Imp$)="o" THEN
    PRINTER IS 26
    FOR I=0 TO VAL(Mesu$(0,5))
      FOR L=0 TO 6
        PRINT Mesu$(I,L)
      NEXT L
    NEXT I
  ELSE
    FOR L=0 TO 6
      PRINT TABXY(2,9+L);Mesu$(0,L)
    NEXT L
    FOR C=1 TO VAL(Mesu$(0,5))
      PRINT TABXY(20-36*(C>3)+(C-1)*12,8+8*(C>3)),"voie "&VAL$(C-1)
      FOR L=1 TO 7
        PRINT TABXY(20-36*(C>3)+(C-1)*12,L+8+8*(C>3)),Mesu$(C,L-1)
      NEXT L
    NEXT C
  END IF
  INPUT "SORTIE sur l'IMPRIMANTE O/N ?",Imp$
UNTIL LWC$(Imp$)="n"

“26” is one of the codes that specifies an output port for the PRINT statement. For example,
PRINTER IS CRT
PRINTER IS PRT
The letter codes correspond to number codes; PRINTER IS CRT is the same as PRINTER IS 1, for example, and PRT is the same as 701.
The codes that are likely to work for printing in this BASIC dialect, including 26, are:
26 701 9 15 19 23 24 25
I pulled this from an ancient document, Using HP BASIC For Instrument Control: A Self-Study Course, which you may find useful. (I suspect you meant HPBasic, not HTBasic, in your subject line?)
TABXY is a variant of the PRINT statement, for printing to specific locations on a CRT screen; the docs I’m seeing say that the XY is ignored if not printing to a CRT, but I wouldn’t be surprised if TABXY also worked on some plotters. The first two numbers would be the X and Y coordinates to begin displaying the text, with TABXY(1, 1) indicating the upper left corner, and the lower right corner depending on how many columns and rows the CRT has.
You may find the HP9000 series BASIC Language Reference, Volume 1 and BASIC Language Reference, Volume 2 useful.
LWC$ is just a lowercase function, to ensure that whether the user inputs “O”, “N”, “o”, or “n” at the INPUT line, the program will respond correctly.
VAL converts a string to the number that that string represents. The string “3” would become the number 3, for example.
The variable Mesu$ is likely a two-dimensional array, with x from 0 to, judging from line 4, a variable amount contained in Mesu$(0, 5) and y from 0 to 6, judging from line 5.
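To see where the screen branch puts each channel ("voie"), the TABXY arithmetic from the question can be replayed in Python. This is a sketch that assumes a true comparison such as C>3 evaluates to 1 in this BASIC dialect; if it evaluated to -1 instead, the columns would land elsewhere. The function name is made up for the example.

```python
def tabxy_position(c, l=0):
    """Screen position used for channel c and line l in the question's PRINT TABXY calls."""
    after_third = int(c > 3)  # assumption: BASIC's (C>3) is 1 when true, 0 when false
    x = 20 - 36 * after_third + (c - 1) * 12
    y = l + 8 + 8 * after_third
    return x, y

for c in range(1, 7):
    # l=0 is the "voie" heading; l=1..7 are the Mesu$(C,L-1) lines under it.
    print(c, tabxy_position(c), tabxy_position(c, 7))
```

Under that assumption, channels 1 to 3 are laid out in columns 20, 32 and 44 starting at row 8, and channels 4 to 6 reuse the same columns starting at row 16.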

I guess the relevant line is the one with PLOTTER IS 26, where we say that we want colors.
MAT Menu$=("")
DISP "envoi à l' imprimante .."
Menu$(1)="PLOTTER"
Menu$(2)="IMPRIMANTE COULEUR"
!Select(0,1,Tp,26,12,1)
IF Tp=1 THEN
PLOTTER IS 705,"HPGL"
ELSE
PLOTTER IS 26,"HPGL;PCL5;COLOR,1600",0,260,0,185
END IF

how to display numbers in scientific notation in plot legends in Matlab?

I have two float variables, let's say phi = 1.34e8 and beta = -2.7e-6. How do I display both values in a plot label in LaTeX scientific notation over two lines? I want the plot label to look like this (in LaTeX font):
\phi = 1.34 x 10^8
\beta = -2.7 x 10^-6
And what if I have other variables for the error, e.g. phi_err = 7.1e7, and I want the legend to look like:
\phi = (1.34 +/- 0.71) x 10^8
Edit:
My current Matlab code:
txt1 = texlabel(['n2=',num2str(n2)]);
txt2 = texlabel(['beta=',num2str(beta)]);
figure(1)
plot(...)
text(0.7,0.8,{txt1,txt2},'Units','normalized')
And the plot text looks like the upper part of the attached figure. How do I display the text in scientific notation with the multiplication sign and the base 10 instead of e? Also, if I want to add the error (let's say I set beta=[-2.7e-6, 1.2e-6] in Matlab, where beta(1) is the value and beta(2) is the error), how should I modify the above code so that the result looks like the lower part of the attached figure? For the example I give, how do I extract the 2.7 and 1.2 before the e? And what if they are of different orders of magnitude, e.g. the error is 1.2e-7, which means that in the displayed text I have to change it from 1.2e-7 to 0.12e-6 and combine the error with the beta value?
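The arithmetic being asked about (split off the power of ten, then rescale the error to the value's exponent) can be sketched in Python; this is not Matlab code, only an illustration of the steps, and the function names are invented here. It takes the exponent from floor(log10(|value|)) and divides both numbers by 10 to that power.

```python
import math

def split_sci(value):
    """Return (mantissa, exponent) such that value == mantissa * 10**exponent."""
    if value == 0:
        return 0.0, 0
    exponent = math.floor(math.log10(abs(value)))
    return value / 10.0 ** exponent, exponent

def format_with_error(name, value, error, digits=2):
    """Build a LaTeX-style label, expressing the error in the value's power of ten."""
    _, exponent = split_sci(value)
    scale = 10.0 ** exponent
    mantissa_text = f"{value / scale:.{digits}f}"
    error_text = f"{error / scale:.{digits}f}"
    return ("\\" + name + " = (" + mantissa_text + " \\pm " + error_text
            + ") \\times 10^{" + str(exponent) + "}")

print(format_with_error("phi", 1.34e8, 7.1e7))     # \phi = (1.34 \pm 0.71) \times 10^{8}
print(format_with_error("beta", -2.7e-6, 1.2e-7))  # \beta = (-2.70 \pm 0.12) \times 10^{-6}
```

The same three steps (log10, floor, divide) exist in Matlab, so the idea should carry over, but the exact Matlab formatting calls are left open here.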

Postscript: truncate string with ellipsis?

Is there a way in PostScript to show a string so that it is truncated with "..." and does not exceed a certain width?
I've been looking at some old report generation code and would like to add this feature. In the existing reports, values that are too long visually overwrite other data.
The reason I'm trying to do this at the PS level is that in the existing code I don't see anything that could calculate any kind of accurate width metric.
I've yet to write any PostScript, so maybe this is trivial. (?)
Per comment below: Yes, localization will be an issue. So I guess a user-defined "ellipsis" string makes sense.
Here is some example output that shows how strings are currently printed:
% Change font style and/or size
/Times-Roman-ISOLatin1 findfont 12 scalefont setfont
219 234 moveto (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA_) show
Can this be modified to ellipsize things?

Well, you can do something like this (replace the (_) string in front of concatstrings with your ellipsis):
/concatstrings % (a) (b) -> (ab)
{ exch dup length
2 index length add string
dup dup 4 2 roll copy length
4 -1 roll putinterval
} bind def
/ellipsis_show {
1 dict begin
/width_t exch def
{dup stringwidth pop width_t lt {exit} if dup length 1 sub 0 exch getinterval} loop
(_) concatstrings
show
end
}def
% Change font style and/or size
/Times-Roman-ISOLatin1 findfont 12 scalefont setfont
0 0 moveto (foobar barfoo foofoo barbar) 100.0 ellipsis_show
concatstrings copied from: http://en.wikibooks.org/wiki/PostScript_FAQ#How_to_concatenate_strings.3F

The simple answer is 'no'. The longer answer is that, since PostScript is a programming language, you can do this, but it will require some knowledge of PostScript and some work; it certainly is not trivial.
You can redefine the various operators which draw text on the output. There are quite a few: show, ashow, cshow, kshow, xshow, yshow, xyshow, widthshow, awidthshow, and glyphshow. You could define modified versions of these which determine (using stringwidth and the parameters used by the various operators) the width of the final printed text. Probably you would want to calculate this glyph by glyph and terminate with your ellipsis when the value exceeds some threshold. (NB: not all fonts will contain an ellipsis glyph, and its encoded position may vary.)
However, given that you are working with existing code, there is most probably already a function defined to draw text and it probably only uses a subset of the possible operators. You would probably be better advised to modify that.
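The measure-and-trim logic described in these answers is independent of PostScript, so here it is as a small Python sketch. The measure callable stands in for stringwidth and is purely hypothetical; the fixed 6 units per character below is only there to make the example runnable. Unlike the PostScript snippet earlier, this version reserves room for the ellipsis and leaves strings that already fit untouched.

```python
def ellipsize(text, max_width, measure, ellipsis="..."):
    """Trim text so that text plus ellipsis fits in max_width, as reported by measure."""
    if measure(text) <= max_width:
        return text  # already fits, no ellipsis needed
    budget = max_width - measure(ellipsis)
    kept = text
    # Drop one character at a time from the end until the remainder fits.
    while kept and measure(kept) > budget:
        kept = kept[:-1]
    return kept + ellipsis

def fake_width(s):
    """Hypothetical stand-in for stringwidth: every character is 6 units wide."""
    return 6 * len(s)

print(ellipsize("foobar barfoo", 30, fake_width))  # fo...
print(ellipsize("abc", 30, fake_width))            # abc
```

If max_width is smaller than the ellipsis itself, this sketch still returns the bare ellipsis, so a real implementation would need to decide what to do in that case.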
