Want to extract the full review text with Selenium but only get a truncated preview - python-3.x

This is the script I'm using to extract the reviews for one book:
URL:
www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird
from selenium import webdriver
import time
driver = webdriver.Chrome()
time.sleep(3)
driver.get('https://www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird')
time.sleep(5)
reviews = driver.find_elements_by_css_selector("div.reviewText")
for r in reviews:
    spanText = r.find_element_by_css_selector("span.readable:nth-child(2)").text
    print("Span text:", spanText)
My problem is that I can't extract the whole review text from div.reviewText > span. That span contains two nested spans: the first holds only a truncated preview (you have to click the ...more link to see the full text), and the second holds the full text. I want to get the text from the second span. Can someone help me, please?
HTML (or visit the site using the link given above):
<div class="reviewText stacked">
<span id="reviewTextContainer35272288" class="readable">
<span id="freeTextContainer13558188749606170457">If I could give this no stars, I would. This is possibly one of my least favorite books in the world, one that I would happily take off of shelves and stow in dark corners where no one would ever have to read it again.
<br>
<br>I think that To Kill A Mockingbird has such a prominent place in (American) culture because it is a naive, idealistic piece of writing in which naivete and idealism are ultimately rewarded. It's a saccharine, rose-tinted eulogy for the nineteen thirties from an orator who comes not
</span>
<span id="freeText13558188749606170457" style="display:none">If I could give this no stars, I would. This is possibly one of my least favorite books in the world, one that I would happily take off of shelves and stow in dark corners where no one would ever have to read it again.
<br>
<br>I think that To Kill A Mockingbird has such a prominent place in (American) culture because it is a naive, idealistic piece of writing in which naivete and idealism are ultimately rewarded. It's a saccharine, rose-tinted eulogy for the nineteen thirties from an orator who comes not to bury, but to praise. Written in the late fifties, TKAM is free of the social changes and conventions that people at the time were (and are, to some extent) still grating at. The primary dividing line in TKAM is not one of race, but is rather one of good people versus bad people -- something that, of course, Atticus and the children can discern effortlessly.
<br>
<br>The characters are one dimensional. Calpurnia is the Negro who knows her place and loves the children; Atticus is a good father, wise and patient; Tom Robinson is the innocent wronged; Boo is the kind eccentric; Jem is the little boy who grows up; Scout is the precocious, knowledgable child. They have no identity outside of these roles. The children have no guile, no shrewdness--there is none of the delightfully subversive slyness that real children have, the sneakiness that will ultimately allow them to grow up. Jem and Scout will be children forever, existing in a world of black and white in which lacking knowledge allows people to see the truth in all of its simple, nuanceless glory.
<br>
<br>I think that's why people find it soothing: TKAM privileges, celebrates, even, the child's point of view. Other YA classics--Huckleberry Finn; Catcher in the Rye; A Wrinkle in Time; The Day No Pigs Would Die; Are You There, God? It's Me, Margaret; Bridge to Terabithia--feature protagonists who are, if not actively fighting to become adults, at least fighting to find themselves as people. There is an active struggle throughout each of those books to make sense of the world, to define the world as something larger than oneself, as something that the protagonist can somehow be a part of. To Kill A Mockingbird has no struggle to become part of the world--in it, the children *are* the world, and everything else is just only relevant in as much as it affects them. There's no struggle to make sense of things, because to them, it already makes sense; there's no struggle to be a part of something, because they're already a part of everything. There's no sense of maturation--their world changes, but it leaves them, in many ways, unchanged, and because of that, it fails as a story for me. The whole point of a coming of age story--which is what TKAM is generally billed as--is that the characters come of age, or at least mature in some fashion, and it just doesn't happen.
<br>
<br>All thematic issues aside, I think that the writing is very, er, uneven, shall we say? Overwhelmingly episodic, not terribly consistent, and largely as dimensionless as the characters.
<br>
</span>
<a data-text-id="13558188749606170457" href="#" onclick="swapContent($(this));; return false;">...more</a>
</span>
</div>

Use get_attribute() to extract the hidden content. You also don't need the sleep calls.
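from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird')
# Select the second (hidden) span inside each review container.
# Note: Selenium 4 replaced this call with find_elements(By.CSS_SELECTOR, ...).
reviews = driver.find_elements_by_css_selector("span.readable span:nth-child(2)")
for r in reviews:
    # textContent returns the text even when the element is not displayed
    spanText = r.get_attribute('textContent')
    print("Span text:", spanText)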

The second span is hidden (style="display:none"), so you cannot get its content with the text property, which only returns visible text.
Try
spanText = r.find_elements_by_css_selector("span.readable > span")[-1].get_attribute('textContent')
to get the content of the hidden element.


What functionality does the "g?" command provide in vim

As a beginner programmer I've been practicing vim on vimgolf recently and saw that the command "g?" was used effectively to switch many lines of 'Ivm' to become 'Vim'. I understand that this shifts each alphabetical letter 13 times to the right but do not understand how this would prove useful except in unique circumstances like these.
The g? command (type :help g? for brief documentation) implements the rot13 algorithm, which rotates each letter 13 places forward or backward in the alphabet.
I'm not sure how commonly it's used today, but on Usenet it used to be a common way to encode spoilers. For example, if I'm writing a post that gives away the ending of something that not everyone has seen, I might use rot13 to weakly encrypt part of the article. It's enough to make it impossible to read accidentally (unless you've had a lot of practice), but easy to read if you're using a newsreader that has a built-in rot13 function -- as most of them do.
For example:
Pretend this is a spoiler. Filter with rot13 to read it.
would become
Cergraq guvf vf n fcbvyre. Svygre jvgu ebg13 gb ernq vg.
If I don't want to read the spoiler, I can ignore it. If I do want to read it, I can decrypt it easily enough.
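Outside Vim, the same transformation can be reproduced with a few lines of Python. This is only a sketch to show what g? does to the text; it relies on the standard library's rot_13 codec, and the helper name rot13 is made up for this example.

```python
import codecs


def rot13(text):
    """Rotate every ASCII letter 13 places; other characters are left alone."""
    return codecs.encode(text, "rot_13")


plain = "Pretend this is a spoiler. Filter with rot13 to read it."
hidden = rot13(plain)
print(hidden)         # the encoded spoiler
print(rot13(hidden))  # applying rot13 a second time restores the original
```

Because the alphabet has 26 letters, rotating by 13 twice is a full turn, which is why one command both encodes and decodes.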
I have been using Vim for 4 years and learned about that command very early on but, even though I knew perfectly well what ROT13 was, I never found a use for g?.
Until a couple of weeks ago when I needed to add a bunch of <li> with unique IDs to a <ul> in an HTML prototype…
The starting point:
<ul>
<li id="lorem">foo</li>
<li id="ipsum">foo</li>
</ul>
After duplicating the two <li>:
<ul>
<li id="lorem">foo</li>
<li id="ipsum">foo</li>
<li id="lorem">foo</li>
<li id="ipsum">foo</li>
</ul>
After g?i" on the two new <li>'s ids:
<ul>
<li id="lorem">foo</li>
<li id="ipsum">foo</li>
<li id="yberz">foo</li>
<li id="vcfhz">foo</li>
</ul>
There! I found a practical use for g? in actual "programming"! Celebration!!!
It can prove useful when you want to quickly hide some text that you typed in a visible Vim buffer from onlookers.
For example, a password or token that you put in your code (but only do this temporarily, when you must).
Perhaps you want to invite a teammate to look at some of your code, or you work in a place where people walk behind your back all the time; you can just rot13 the string and it is useless to them (at least at a glance).
It probably works best against non-technical passersby or for short periods of exposure.
Keep in mind that it does not rotate digits, and for security purposes it would be even better if it could take a rotation size.
It can also be useful when you solve a CTF that has a rot13 challenge...
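For illustration, here is a rough Python sketch of what a rotation with a configurable size could look like. This is not something Vim's g? offers; the function name rotate and its shift parameter are invented for the example. Like rot13, it leaves digits and punctuation untouched.

```python
import string


def rotate(text, shift=13):
    """Caesar-style rotation of ASCII letters by `shift` places (hypothetical helper)."""
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    k = shift % 26
    # Map each letter to the letter k places further along, wrapping around.
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)


print(rotate("token42"))              # letters rotate by 13, the digits stay as they are
print(rotate(rotate("Vim", 5), 21))   # 5 + 21 = 26, so this round-trips
```

Any shift other than 13 needs a different shift to undo it (26 minus the original), so the one-command encode/decode convenience of rot13 is lost.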

What text processing tool is recommended for parsing screenplays?

I have some plain-text kinda-structured screenplays, formatted like the example at the end of this post. I would like to parse each into some format where:
It will be easy to pull up just stage directions that deal with a specific place.
It will be easy to pull up just dialogue belonging to a particular character.
The most obvious approach I can think of is using sed or perl or php to put div tags around each block, with classes representing character, location, and whether it's stage directions or dialogue. Then, open it up as a web-page and use jQuery to pull out whatever I'm interested in. But this sounds like a roundabout way to do it and maybe it only seems like a good idea because these are tools I'm accustomed to. But I'm sure this is a recurring problem that's been solved before, so can anybody recommend a more efficient workflow that can be used on a Linux box? Thanks.
Here is some sample input:
SOMEWHERE CORPORATION - OPTIONAL COMMENT
A guy named BOB is sitting at his computer.
BOB
Mmmm. Stackoverflow. I like.
Footsteps are heard approaching.
ALICE
Where's that report you said you'd have for me?
Closeup of clock ticking.
BOB (looking up)
Huh? What?
ALICE
Some more dialogue.
Some more stage directions.
Here is what sample output might look like:
<div class='scene somewhere_corporation'>
<div class='comment'>OPTIONAL COMMENT</div>
<div class='direction'>A guy named BOB is sitting at his computer.</div>
<div class='dialogue bob'>Mmmm. Stackoverflow. I like.</div>
<div class='direction'>Footsteps are heard approaching.</div>
<div class='dialogue alice'>Where's that report you said you'd have for me?</div>
<div class='direction'>Closeup of clock ticking.</div>
<div class='comment bob'>looking up</div>
<div class='dialogue bob'>Huh? What?</div>
<div class='dialogue alice'>Some more dialogue.</div>
<div class='direction'>Some more stage directions.</div>
</div>
I'm using DOM as an example, but again, only because that's something I understand. I'm open to whatever is considered a best practice for this type of text-processing task if, as I suspect, roll-your-own regexps and jQuery is not the best practice. Thanks.
You could use Celtx to import plain text scripts and export them to HTML (and RDF/XML for the metadata) (see this related thread and this blog post, which describes the file structure).
Other screenplay editors like Trelby might offer this feature, too.
There is also Fountain, a plain text markup language for screenwriting. They offer libraries which you might be able to use for your purpose (I did not check whether they offer anything for importing and converting):
Fountain is free and open-source, with libraries that make it easy to add support in your apps.
Even if those projects can’t be used for your cause, you could at least reuse their format for your output.
If your input is not too noisy, i.e. if you can trust some regularities like the indentation, which is larger for dialogue than for comments, I would use a simple context-free grammar. There are good implementations in all languages, and you'll find a lot of information on SO.
If your input varies a lot, then take the machine learning route, but you'll need to have a big number of inputs with human-validated output for training, which might be a hassle.
In any case, I would never, ever use regular expressions for problems like that.
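For input as regular as the sample in the question, a small line-by-line state machine may already be enough. The Python sketch below is not a full grammar and uses no regular expressions. It rests on assumptions that are not guaranteed by the question: a scene heading is an all-uppercase line containing " - ", a character cue is an all-uppercase name with an optional parenthetical, and exactly one line of dialogue follows each cue. The names parse_screenplay and SAMPLE are made up for this example.

```python
SAMPLE = """\
SOMEWHERE CORPORATION - OPTIONAL COMMENT
A guy named BOB is sitting at his computer.
BOB
Mmmm. Stackoverflow. I like.
Footsteps are heard approaching.
ALICE
Where's that report you said you'd have for me?
Closeup of clock ticking.
BOB (looking up)
Huh? What?
ALICE
Some more dialogue.
Some more stage directions.
"""


def parse_screenplay(text):
    """Return a list of (kind, who_or_where, content) tuples."""
    events = []
    speaker = None  # set after a character cue, cleared after one dialogue line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if speaker is not None:
            events.append(("dialogue", speaker, line))
            speaker = None
            continue
        name, _, aside = line.partition("(")
        name = name.strip()
        if name.isupper() and " - " not in name:
            # Character cue, e.g. "BOB" or "BOB (looking up)"
            speaker = name.lower()
            if aside:
                events.append(("comment", speaker, aside.rstrip(")").strip()))
        elif line.isupper():
            # Scene heading, e.g. "PLACE - OPTIONAL COMMENT"
            place, _, comment = line.partition(" - ")
            events.append(("scene", place.lower(), comment))
        else:
            events.append(("direction", None, line))
    return events


events = parse_screenplay(SAMPLE)
bob_lines = [text for kind, who, text in events if kind == "dialogue" and who == "bob"]
directions = [text for kind, _, text in events if kind == "direction"]
print(bob_lines)
print(directions)
```

Once the lines are in a structured list like this, filtering by character or by scene is a plain list comprehension, and writing the result out as HTML, JSON or anything else is a separate, simple step.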

Cannot locate a text_field with dynamic id

<div id="temp_1333021214801">
<input type="text"/>
</div>
$browser.text_field(:xpath,".//*[@id='temp_1333018770709']/input").set("apple")
I am getting the error "unable to locate element", because the ID changes dynamically.
Please help me to set the text in the text field.
It looks like your dynamic id always starts with temp_ followed by digits, so given the information above this should do it:
browser.div(:id, /temp_\d+/).text_field.set 'something'
The limitations of my solution: it assumes the id will always be temp_ followed by a run of digits, which seems to be the case in your sample above. It also assumes there is no other div matching /temp_\d+/ in the DOM of that page, which most likely is not an issue.
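To see why a pattern survives the id change while a hard-coded id does not, here is a quick check of the same regular expression in Python (not Watir). The helper name is_dynamic_temp_id is made up; the two ids are the ones from the question.

```python
import re

# Same pattern as in the Watir locator: "temp_" followed by one or more digits.
DYNAMIC_ID = re.compile(r"temp_\d+")


def is_dynamic_temp_id(element_id):
    """True if the id contains the temp_<digits> pattern (unanchored, like a Ruby regex match)."""
    return DYNAMIC_ID.search(element_id) is not None


for element_id in ("temp_1333021214801", "temp_1333018770709", "header"):
    print(element_id, is_dynamic_temp_id(element_id))
```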
If you have dynamic IDs I can suggest the following:
Code to object counts. For example
$browser.text_field(:index => 2)
gives the third text_field on the page.
Code to what is around the thing you're trying to find.
$browser.div(:name => 'mydiv').text_field(:index=>2)
gives the third text field in the div called 'mydiv'.
HOWEVER
If your front-end is less than testable in this way, I highly suggest you put time into thinking over your commitment to automated testing in the first place. Any minor change to the software is going to have you working until 9pm pulling your hair out and rocking back and forth as you update all your scripts, so unless code maintenance is your weekend hobby, think about semi-automation or exploratory testing or manual scripts. Talk to development (whoever that might be. It might be you!) or the higher-ups (unless that's you too) to see if it can be made more testable. Also, don't use xpaths unless you take some deviant pleasure in it.
Hope that was helpful, I can't do anything specific without the source HTML.

Is HTML Email Obfuscation safe enough to stop bots?

I know that most javascript email obfuscation solutions stop bots dead in their tracks - but sometimes it's hard to use/insert javascript in places.
To that end I was wondering if anyone knew if the bots were smart enough to translate HTML entities in HEX and DEC into valid email strings?
For example, let's say I have a function that randomly converts each character of the string into one of three forms. Is this enough?
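function hide_email($email)
{
    $s = '';
    // Encode each character as a decimal entity, a hex entity, or leave it as-is
    foreach (str_split($email) as $l)
    {
        switch (rand(1, 3))
        {
            case 1: $s .= '&#' . ord($l) . ';'; break;
            case 2: $s .= '&#x' . dechex(ord($l)) . ';'; break;
            case 3: $s .= $l;
        }
    }
    return $s;
}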
which turns first.last@email.com into a random mix of plain characters and decimal and hex entities that a browser still renders as:
first.last@email.com
I would assume that the bot creators have already added a regex pattern for something like this...
I would not think this particularly safe. Were I writing code to interpret HTML, decoding entities to their corresponding characters would be among the first bits of code to go in.
As a further defense, I would suggest judicious use of tags (such as the <span> tag), perhaps even nested. That takes more effort to decode and still does not require Javascript.
I wouldn't be shocked if a bot used a client that did an HtmlDecode before returning the results.
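To illustrate how little effort the decoding takes, here is a Python sketch. It is a port of the hide_email function from the question (the rng parameter is an addition for this example), and the single call to the standard library's html.unescape stands in for whatever entity decoding a bot's HTML client would do.

```python
import html
import random


def hide_email(email, rng=random):
    """Randomly encode each character as a decimal entity, a hex entity, or itself."""
    parts = []
    for ch in email:
        choice = rng.randint(1, 3)
        if choice == 1:
            parts.append(f"&#{ord(ch)};")
        elif choice == 2:
            parts.append(f"&#x{ord(ch):x};")
        else:
            parts.append(ch)
    return "".join(parts)


obfuscated = hide_email("first.last@email.com")
recovered = html.unescape(obfuscated)  # one call undoes all of the obfuscation
print(obfuscated)
print(recovered)
```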
I read an interesting article a while ago about a guy who posted a web page with nine different methods of obfuscation and waited a year to see how much spam each e-mail address got.
Here's a link to the article: Nine Ways to Obfuscate E-mail Addresses Compared. Some of the pictures in the sidebar may not be safe for work, if your work frowns on girls in bikinis.

Managing Unregistered User Posts by Screening

I am considering allowing users to post to my site without having them register or provide any identifying information. If each post is sent to a db queue and I then manually screen these posts, what sort of issues might I run into? How might I handle those issues?
Screening every post would be tedious and tiresome, and it leaves the admin exposed to a lot of annoying spam. My suggestion would be to automate as much of the screening as possible. Besides, requiring identifying information does nothing to prevent spam (a bot will just generate it).
A lot of projects implement a recognition system: the user first has to make 1-2 posts that are approved; after that he is identified by IP and (maybe) a cookie as a trusted poster, so his posts appear automatically (and can still be marked as spam later).
Some heuristics on the content of the post (such as the number of links it contains) could also be used to automatically discard potential spam posts.
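A link-count heuristic of this kind could be sketched in Python as follows. The threshold of two links and the function name looks_like_spam are arbitrary choices for the example, not recommended values.

```python
import re

# Counts anything that starts like an http(s) URL; deliberately crude.
LINK = re.compile(r"https?://", re.IGNORECASE)


def looks_like_spam(post, max_links=2):
    """Flag a post that contains more links than max_links."""
    return len(LINK.findall(post)) > max_links


print(looks_like_spam("Nice article, thanks!"))
print(looks_like_spam("Cheap pills http://a.example http://b.example HTTP://c.example"))
```

Flagged posts could be discarded outright or simply moved to the back of the manual review queue, depending on how much you trust the heuristic.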
The most obvious issue is that you'll get overwhelmed by the number of submissions to screen, if your site is sufficiently popular.
I would make sure to add some admin tools, so you can automatically kill all posts from a particular IP address, or that match a particular regex. That should help get rid of obvious spam faster, but again, you'd have to be behind the wheel for all of that.
Tedium seems to be the greatest concern – screening posts manually is effective against spam (I'm assuming this is what you want to weed out) but very boring.
It could be best fixed with a cup of coffee and nice music to listen to while weeding?
I've found that asking for the answer to a simple question sent to the browser as an image (like "2 + 3 - 4 =", a variant of a 'captcha' but not so annoying), with a wee bit of Javascript, does quite well.
Send your form with the image and answer field, and a hidden field with a "challenge" (some randomly generated string). When the user submits the form, hash the challenge and the answer, and send the result back to the server. The server can check for a valid answer before adding it to the database for review.
It seems like a lot of work up front, but it will save hours of review time. Using jQuery:
<script type="text/javascript">
// Hash function to mask the answer.
// hex_md5() is not part of jQuery; it must come from a separate MD5 script.
function answerMask()
{
    var a = $('#a').val();
    var c = $('#c').val();
    var h = hex_md5(hex_md5(a) + c);
    $('#a').val(h);
}
</script>
<form onsubmit="answerMask()" action="/cgi-bin/comment.py" method="POST">
<table>
<tr><td>Comment</td><td><input type="text" name="comment" /></td></tr>
<tr><td># put image here #</td><td><input id="a" type="text" name="a" size="30" /></td></tr>
<tr><td><input id="c" type="hidden" value="ddd8c315d759a74c75421055a16f6c52" name="c" /></td><td><input type="submit" value=" Go " /></td></tr>
</table>
</form>
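On the server, the check could look roughly like the Python sketch below. It assumes that hex_md5 returns the lowercase hexadecimal MD5 digest of its input string, and that the server remembers which answer belongs to which challenge; the function names mask and is_valid are invented for the example.

```python
import hashlib


def mask(answer, challenge):
    """Recompute md5(md5(answer) + challenge) the same way the browser script does."""
    inner = hashlib.md5(answer.encode("utf-8")).hexdigest()
    return hashlib.md5((inner + challenge).encode("utf-8")).hexdigest()


def is_valid(submitted_hash, expected_answer, challenge):
    """Compare the hash posted in field 'a' with the one the server expects."""
    return submitted_hash == mask(expected_answer, challenge)


challenge = "ddd8c315d759a74c75421055a16f6c52"
from_browser = mask("1", challenge)  # what the form would post for "2 + 3 - 4 ="
print(is_valid(from_browser, "1", challenge))
print(is_valid(mask("7", challenge), "1", challenge))
```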
Edit update...
I saw this technique on a web site, I'm not sure which one, so this idea isn't mine but you might find it useful.
Provide a form with a challenge field and a comment field. Prefix the challenge with "Pick the third word from: glark snerm hork morf" so the words, and which one to pick, are easy to generate on the server and easy to validate when the form contents come back.
The point is to make the user do something, apply a few brain cells, and more work than it's worth for a script kiddie.
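Generating and validating such a word challenge on the server could be sketched in Python like this. The function names, the ORDINALS list and the rng parameter are all made up for the example, and it only handles word lists of up to four words.

```python
import random

ORDINALS = ["first", "second", "third", "fourth"]


def make_challenge(words, rng=random):
    """Build the prompt and return it together with the expected answer."""
    index = rng.randrange(len(words))
    prompt = f"Pick the {ORDINALS[index]} word from: {' '.join(words)}"
    return prompt, words[index]


def check_answer(answer, expected):
    """Accept the answer regardless of surrounding whitespace or letter case."""
    return answer.strip().lower() == expected.lower()


prompt, expected = make_challenge(["glark", "snerm", "hork", "morf"])
print(prompt)
print(check_answer(f"  {expected.upper()} ", expected))
```

The expected answer has to stay on the server (for example in the session), otherwise a script could read it straight from the form.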
The issues that I see on my blog are posts that attempt to look legit but aren't, and the sheer volume.
