I'm using Beautiful Soup to parse email invoices and I'm running into a consistent problem involving special characters.
The text I am trying to parse is shown in the image.
But what I get from beautiful soup after finding the element and calling elem.text is this:
'Hi Mike, It=E2=80=\r\n=99s probably not a big drama if you are having problems separating product=\r\ns from classes. It is not uncommon to receive an order for pole classes and=\r\n a bottle of Dry Hands.\r\nAlso, remember that we will have just straight up product orders that your =\r\nsystem will not be able to place into a class list, hence having the extra =\r\nsheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.'
As you can see the apostrophe is now represented by "=E2=80=99", double quotes are "=E2=80=9C" and "=E2=80=9D" and there are seemingly random newlines in the text, for example "product=\r\ns".
The newlines don't seem to appear in the image.
Apparently "E2 80 99" is the UTF-8 byte sequence (in hex) for the apostrophe ’, but I don't understand why I can still see it in this form after calling email.decode('utf-8') before sending it to Beautiful Soup.
This is the element
<td border:="" class='3D"td"' left="" middle="" padding:="" solid="" style='3D"color:' text-align:="" v="ertical-align:">Hi Mike, It=E2=80=
=99s probably not a big drama if you are having problems separating product=
s from classes. It is not uncommon to receive an order for pole classes and=
a bottle of Dry Hands.
Also, remember that we will have just straight up product orders that your =
system will not be able to place into a class list, hence having the extra =
sheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.</td>
I can post my code if required but I figure I must be making a simple mistake.
I checked out the answer to this question
Decode Hex String in Python 3
but I think that expects the entire string to be hex rather than just having random hex parts, and I'm honestly not even sure how to search for "decode partial hex strings".
My final questions are
Q1 How do I convert
'Hi Mike, It=E2=80=\r\n=99s probably not a big drama if you are having problems separating product=\r\ns from classes. It is not uncommon to receive an order for pole classes and=\r\n a bottle of Dry Hands.\r\nAlso, remember that we will have just straight up product orders that your =\r\nsystem will not be able to place into a class list, hence having the extra =\r\nsheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.'
into
'Hi Mike, It's probably not a big drama if you are having problems separating products from classes. It is not uncommon to receive an order for pole classes and a bottle of Dry Hands.Also, remember that we will have just straight up product orders that your system will not be able to place into a class list, hence having the extra sheet for any "erroneous" orders will be handy.'
using python 3, without manually fixing each string and writing a replace method for each possible character.
Q2 Why does this "=\r\n" appear everywhere in my string but not in the rendered html?
@JosefZ's comment led me to the answer.
Q1 has an answer.
>>> import quopri
>>> print(quopri.decodestring(mystring).decode('utf-8'))
Hi Mike, It’s probably not a big drama if you are having problems separating products from classes. It is not uncommon to receive an order for pole classes and a bottle of Dry Hands.
Also, remember that we will have just straight up product orders that your system will not be able to place into a class list, hence having the extra sheet for any “erroneous” orders will be handy.
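Q2 Thanks to @snakecharmerb I now know that the seemingly random line endings are soft line breaks: quoted-printable encoding limits each encoded line to 76 characters and marks every break it inserts with a trailing "=", so the breaks disappear once the text is decoded.
@snakecharmerb wrote a much better answer than this one, for someone with the same problem as me, here: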
https://stackoverflow.com/a/55295640/992644
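For anyone landing here with the same problem, here is a minimal sketch of decoding at the message level, before the HTML ever reaches Beautiful Soup. The raw message below is a made-up stand-in for the real invoice email and the variable names are hypothetical. It assumes a single-part message that declares Content-Transfer-Encoding: quoted-printable and a utf-8 charset.

```python
import email

# Hypothetical raw message standing in for the real invoice email (assumption).
raw = (
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<td class=3D"td">It=E2=80=\r\n'
    b"=99s an =E2=80=9Cerroneous=E2=80=9D order for product=\r\n"
    b"s.</td>\r\n"
)

msg = email.message_from_bytes(raw)

# decode=True undoes the quoted-printable transfer encoding and returns bytes;
# the charset declared in the headers then turns those bytes into text.
charset = msg.get_content_charset() or "utf-8"
html = msg.get_payload(decode=True).decode(charset)

print(html)
```

The resulting html string has no "=E2=80=99" sequences, no "=3D" and no soft line breaks, so it can be handed to Beautiful Soup as usual. A multipart message would need its parts walked first, which this sketch leaves out.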
I have different commands my program is reading in (i.e., print, count, min, max, etc.). These words can also include a number at the end of them (i.e., print3, count1, min2, max6, etc.). I'm trying to figure out a way to extract the command and the number so that I can use both in my code.
I'm struggling to figure out a way to find the last element in the string in order to extract it, in Smalltalk.
You didn't say which Smalltalk dialect you use, so I will explain what I would do in Pharo, which is the one I'm familiar with.
As someone who has been playing with Pharo for a few months at most, I can tell you the sheer number of classes and methods available can feel overwhelming at first, but the environment actually makes it easy to find things. For example, when you know the exact input and output you want, but don't know whether a method already exists somewhere, or what it is called, the Finder allows you to search by giving an example. You can open it from the world menu, as shown below:
By default it seeks selectors (method names) matching your input terms:
But this default is not what we need right now, so you must change the option in the upper right box to "Examples", and type in the search field an example of the input, followed by the output you want, separated by a ".". The input example I used was the string 'max6', followed by the desired result, the number 6. Pharo then gives me a list of methods that match:
To find a method that returns the text part, you can make a new search, changing the example output from the number 6 to the string 'max':
Fortunately there are several built-in methods matching the description of your problem.
There are more elegant ways, I suppose, but you can make use of the fact that String>>#asNumber only parses the part it can recognize. So you can do
'print31' reversed asNumber asString reversed asNumber
to give you 31. That only works if there actually is a number at the end, and it loses trailing zeros: 'print10' reverses to '01tnirp', which parses as 1.
This is one of those cases where we can presume the input data has a specific form, i.e., the only digits appear at the end of the string, and you want all of those digits. In that case it's not too hard to do, really, just:
numText := 'Kalahari78' select: [ :each | each isDigit ].
num := numText asInteger. "78"
To get the rest of the string without the digits, you can just use this:
'Kalahari78' withoutTrailingDigits. "Kalahari"
As some of the Pharo "OGs" pointed out, you can take a look at the String class (just type CMD-Return, type in String, hit Return) and you will find an amazing number of methods for all kinds of things. Usually you can get some ideas from those. But then there are times when you really just need an answer!
Hi,
I am relatively new to Python, and I was wondering why the code below doesn't pass all of the sample tests for the Codewars kata "Jaden Casing Strings", which is as follows:
Jaden Casing Strings:
Jaden Smith, the son of Will Smith, is the star of films such as The Karate Kid (2010) and After Earth (2013). Jaden is also known for some of his philosophy that he delivers via Twitter. When writing on Twitter, he is known for almost always capitalizing every word. For simplicity, you'll have to capitalize each word, check out how contractions are expected to be in the example below.
Your task is to convert strings to how they would be written by Jaden Smith. The strings are actual quotes from Jaden Smith, but they are not capitalized in the same way he originally typed them.
Example:
Not Jaden-Cased: "How can mirrors be real if our eyes aren't real"
Jaden-Cased: "How Can Mirrors Be Real If Our Eyes Aren't Real"
Link to Jaden's former Twitter account @officialjaden via archive.org
My code:
def to_jaden_case(string):
    for word in string:
        if "'" in word:
            word.capitalize()
        else:
            word.title()
    return string
I am also new to Python. I tried the method below, which seems to work:
def to_jaden_case(string):
    return ' '.join(i.capitalize() for i in string.split())
I tried using .title() in different ways but couldn't get a solution with it; splitting the string and capitalizing every word did work.
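As far as I can tell, the version in the question returns its input unchanged for three reasons: iterating over a string yields single characters rather than words, string methods such as capitalize() return a new string instead of modifying the original, and those return values are never stored. A minimal sketch of the split-and-capitalize approach, with a comparison against title() to show why title() alone trips over contractions (the quote variable is just the example from the kata):

```python
def to_jaden_case(string):
    # capitalize() upper-cases the first character of a word and lower-cases
    # the rest, so "aren't" becomes "Aren't". Splitting on a single space
    # keeps the original spacing when the words are joined again.
    return " ".join(word.capitalize() for word in string.split(" "))


quote = "How can mirrors be real if our eyes aren't real"

print(to_jaden_case(quote))  # How Can Mirrors Be Real If Our Eyes Aren't Real
print(quote.title())         # How Can Mirrors Be Real If Our Eyes Aren'T Real
```

title() treats the apostrophe as a word boundary, which is where the stray capital T comes from.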
This is not a language-specific question.
I have a string in ALL CAPS. This string comes in from a separate source and for some reason is always in all caps.
I've been given the task of making the string a little more reader-friendly so I decided to just slap a sentence case converter method on it using simple regex.
The thing is, there are a lot of acronyms used in this string and I would like to keep them unaffected. Things like country codes(US, CA, JP, FR, etc...), or airport codes(LAX, LGA) and sometimes many others.
Now I'm guessing I would first need a list of the acronyms in a database or something, of all the possible airport codes, country codes and a list of commonly used acronyms like ETA, COD, etc...
Once I have this database created, how can I apply it to the string in question?? How can I prevent the word "us" being changed to US and vice-versa?? What I basically wanna know is, how do I take what's in the DB and apply all the necessary changes to the string?
Remember, I get the original string in ALL CAPS so there's no way to differentiate.
Any ideas would be greatly appreciated!!
Thanks!!!
Something close to this can be done with ActiveSupport::Inflector, which provides the titleize method (which does the work for String.titleize).
First, define your own inflections in an initializer.
# config/initializers/inflections.rb
ActiveSupport::Inflector.inflections do |inflect|
  inflect.acronym 'US'
end
Restart your app to pick up the change. Now titleize knows how to handle "US". Fire up a Rails console to check it out:
> "us".titleize
=> "US"
Next, check out the source code for titleize. Once you understand it, reopen the Inflector class in an initializer and define your own method that doesn't capitalize the first letter of each word. Call it something nifty, like decapitalize.
module ActiveSupport::Inflector
  def decapitalize(word)
    humanize(underscore(word)) # you may enhance this a bit
  end
end
class String
  def decapitalize
    ActiveSupport::Inflector.decapitalize(self)
  end
end
Caveats and Limitations
You may need to tweak the code, but I think it's close.
Here are some sentences this solution won't handle very well:
> "US STATES VISITED BY US".titleize
=> "US States Visited By US"
> "COLUMBIA (CO) EXPORTS ARE PROCESSED BY ACME BUILDING CO.".decapitalize
=> "Columbia (CO) exports are processed by acme building CO."
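Since the question is not tied to a language, the same acronym-list idea can also be sketched without Rails. The following is a rough Python illustration, not a port of the Inflector code: the ACRONYMS set and the sentence_case function are hypothetical names, and the set stands in for whatever would be loaded from the database.

```python
import re

# Hypothetical acronym list; in practice this would come from the database.
ACRONYMS = {"US", "CA", "JP", "FR", "LAX", "LGA", "ETA", "COD"}


def sentence_case(text, acronyms=ACRONYMS):
    def fix(match):
        word = match.group(0)
        # Known acronyms stay in capitals, every other word is lower-cased.
        return word if word in acronyms else word.lower()

    lowered = re.sub(r"[A-Z]+", fix, text)
    # Re-capitalize the first letter of the text and of each new sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )


print(sentence_case("FLIGHT TO LAX ARRIVES IN THE US. ETA IS NOON."))
# Flight to LAX arrives in the US. ETA is noon.
print(sentence_case("US STATES VISITED BY US"))
# US states visited by US
```

It has the same limitation as above: with all-caps input there is no way to tell the pronoun "us" from the country code, so the second example still comes out wrong.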
Does anyone know how to generate the possible misspellings of a word?
Example : unemployment
- uemployment
- onemploymnet
-- etc.
If you just want to generate a list of possible misspellings, you might try a tool like this one. Otherwise, in SAS you might be able to use a function like COMPGED to compute a measure of the similarity between the string someone entered, and the one you wanted them to type. If the two are "close enough" by your standard, replace their text with the one you wanted.
Here is an example that computes the Generalized Edit Distance between "unemployment" and a variety of plausible misspellings.
data misspell;
input misspell $16.;
length misspell string $16.;
retain string "unemployment";
GED=compged(misspell, string,'iL');
datalines;
nemployment
uemployment
unmployment
uneployment
unemloyment
unempoyment
unemplyment
unemploment
unemployent
unemploymnt
unemploymet
unemploymen
unemploymenyt
unemploymenty
unemploymenht
unemploymenth
unemploymengt
unemploymentg
unemploymenft
unemploymentf
blahblah
;
proc print data=misspell label;
label GED='Generalized Edit Distance';
var misspell string GED;
run;
Essentially you are trying to develop a list of text strings based on some rule of thumb, such as one letter being missing from the word, a letter being moved to the wrong spot, one letter being mistyped, etc. The problem is that these rules have to be explicitly defined before you can write the code, in SAS or any other language (this is what Chris was referring to). If your requirement is reduced to this one-wrong-letter scenario then this might be manageable; otherwise, the commenters are correct and you can easily create massive lists of incorrect spellings (after all, all combinations except "unemployment" constitute a misspelling of that word).
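To make the one-wrong-letter scenario concrete, here is a rough sketch in Python rather than SAS. The function name and the four rules it applies (drop a letter, swap two adjacent letters, replace a letter, insert a letter) are my own assumptions about what "one mistake" means, not something taken from the question.

```python
import string


def single_edit_misspellings(word, alphabet=string.ascii_lowercase):
    # Every way of cutting the word into a left part and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    insertions = {a + c + b for a, b in splits for c in alphabet}

    variants = deletions | transpositions | substitutions | insertions
    variants.discard(word)  # the correct spelling is not a misspelling
    return variants


variants = single_edit_misspellings("unemployment")
print("uemployment" in variants)   # True: one letter dropped
print("onemploymnet" in variants)  # False: that one needs two edits
```

Even with a single edit the list runs to several hundred strings for a 12-letter word, which illustrates how quickly the list grows once more than one mistake is allowed. The output could be written to a text file and read into SAS.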
Having said that, there are many ways in SAS to accomplish this text manipulation (rx functions, some combination of other text-string functions, macros); however, there are probably better ways to accomplish this. I would suggest an external Perl process to generate a text file that can be read into SAS, but other programmers might have better alternatives.
If you are looking for a general spell checker, SAS does have proc spell.
It will take some tweaking to get it working for your situation; it's very old and clunky. It doesn't work well in this case, but you may have better results if you try and use another dictionary? A Google search will show other examples.
filename name temp lrecl=256;
options caps;
data _null_;
file name;
informat name $256.;
input name &;
put name;
cards;
uemployment
onemploymnet
;
proc spell in=name
dictionary=SASHELP.BASE.NAMES
suggest;
run;
options nocaps;
I have a LaTeX document where I'd like the numbering of floats (tables and figures) to be in one numeric sequence from 1 to x rather than two sequences according to their type. I'm not using lists of figures or tables either and do not need to.
My documentclass is report and typically my floats have captions like this:
\caption{Breakdown of visualisations created.}
\label{tab:Visualisation_By_Types}
A quick way to do it is to put \addtocounter{table}{1} after each figure, and \addtocounter{figure}{1} after each table.
It's not pretty, and on a longer document you'd probably want to either include that in your style sheet or template, or go with cristobalito's solution of linking the counters.
The differences between the figure and table environments are very minor -- little more than them using different counters, and being maintained in separate sequences.
That is, there's nothing stopping you putting your {tabular} environments in a {figure}, or your graphics in a {table}, which would mean that they'd end up in the same sequence. The problem with this case (as Joseph Wright notes) is that you'd have to adjust the \caption, so that doesn't work perfectly.
Try the following, in the preamble:
\makeatletter
\newcounter{unisequence}
\def\ucaption{%
  \ifx\@captype\@undefined
    \@latex@error{\noexpand\ucaption outside float}\@ehd
    \expandafter\@gobble
  \else
    \refstepcounter{unisequence}% <-- the only change from default \caption
    \expandafter\@firstofone
  \fi
  {\@dblarg{\@caption\@captype}}%
}
\def\thetable{\@arabic\c@unisequence}
\def\thefigure{\@arabic\c@unisequence}
\makeatother
Then use \ucaption in your tables and figures, instead of \caption (change the name ad lib). If you want to use this same sequence in other environments (say, listings?), then define \the<foo> the same way.
My earlier attempt at this is in fact completely broken, as the OP spotted: the getting-the-lof-wrong is, instead of being trivial and only fiddly to fix, absolutely fundamental (ho, hum).
(For the aficionados, it comes about because \advance commands are processed in TeX's gut, but the content of the .lof, .lot, and .aux files is fixed in TeX's mouth, at expansion time, thus what was written to the files was whatever random value \@tempcnta had at the point \caption was called, ignoring the \advance calculations, which were then dutifully written to the file, and then ignored. Doh: how long have I known this but never internalised it!?)
Dutiful retention of earlier attempt (on the grounds that it may be instructively wrong):
No problem: try putting the following in the preamble:
\makeatletter
\def\tableandfigurenum{\@tempcnta=0
  \advance\@tempcnta\c@figure
  \advance\@tempcnta\c@table
  \@arabic\@tempcnta}
\let\thetable\tableandfigurenum
\let\thefigure\tableandfigurenum
\makeatother
...and then use the {table} and {figure} environments as normal. The captions will have the correct 'Table/Figure' text, but they'll share a single numbering sequence.
Note that this example gets the numbers wrong in the listoffigures/listoftables, but (a) you say you don't care about that, (b) it's fixable, though probably mildly fiddly, and (c) life is hard!
I can't remember the syntax, but you're essentially looking for counters. Have a look here, under the custom floats section. Assign the counters for both tables and figures to the same thing and it should work.
I'd just use one type of float (let's say 'figure'), then use the caption package to remove the automatically added "Figure" text from the caption and deal with it by hand.