I have some plain-text kinda-structured screenplays, formatted like the example at the end of this post. I would like to parse each into some format where:
It will be easy to pull up just stage directions that deal with a specific place.
It will be easy to pull up just dialogue belonging to a particular character.
The most obvious approach I can think of is using sed or perl or php to put div tags around each block, with classes representing character, location, and whether it's stage directions or dialogue. Then, open it up as a web-page and use jQuery to pull out whatever I'm interested in. But this sounds like a roundabout way to do it and maybe it only seems like a good idea because these are tools I'm accustomed to. But I'm sure this is a recurring problem that's been solved before, so can anybody recommend a more efficient workflow that can be used on a Linux box? Thanks.
Here is some sample input:
SOMEWHERE CORPORATION - OPTIONAL COMMENT
A guy named BOB is sitting at his computer.
BOB
Mmmm. Stackoverflow. I like.
Footsteps are heard approaching.
ALICE
Where's that report you said you'd have for me?
Closeup of clock ticking.
BOB (looking up)
Huh? What?
ALICE
Some more dialogue.
Some more stage directions.
Here is what sample output might look like:
<div class='scene somewhere_corporation'>
<div class='comment'>OPTIONAL COMMENT</div>
<div class='direction'>A guy named BOB is sitting at his computer.</div>
<div class='dialogue bob'>Mmmm. Stackoverflow. I like.</div>
<div class='direction'>Footsteps are heard approaching.</div>
<div class='dialogue alice'>Where's that report you said you'd have for me?</div>
<div class='direction'>Closeup of clock ticking.</div>
<div class='comment bob'>looking up</div>
<div class='dialogue bob'>Huh? What?</div>
<div class='dialogue alice'>Some more dialogue.</div>
<div class='direction'>Some more stage directions.</div>
</div>
I'm using DOM as an example, but again, only because that's something I understand. I'm open to whatever is considered a best practice for this type of text-processing task if, as I suspect, roll-your-own regexps and jQuery is not the best practice. Thanks.
You could use Celtx to import plain text scripts and export them to HTML (and RDF/XML for the metadata) (see this related thread and this blog post, which describes the file structure).
Other screenplay editors like Trelby might offer this feature, too.
There is also Fountain, a plain-text markup language for screenwriting. It comes with libraries that you might be able to use for your purpose (I did not check whether any of them handle importing and converting):
Fountain is free and open-source, with libraries that make it easy to add support in your apps.
Even if those projects can't be used for your purpose, you could at least reuse their format for your output.
If your input is not too noisy, i.e. if you can rely on regularities such as dialogue being indented more deeply than comments, I would use a simple context-free grammar. There are good implementations in most languages, and you'll find a lot of information on SO.
If your input varies a lot, then take the machine-learning route, but you'll need a large number of inputs with human-validated output for training, which might be a hassle.
In any case, I would never, ever use regular expressions for problems like that.
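For input as regular as the sample in the question, a hand-written line classifier may already be enough (no regular expressions, and far less than a real grammar). The following is only a sketch under assumptions of my own, not a screenplay standard: an upper-case line containing " - " is a scene heading with a trailing comment, any other upper-case line is a character cue (optionally with a parenthetical), the single line after a cue is that character's dialogue, and everything else is a stage direction. The names Block, is_upper_line and parse are made up for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    kind: str                        # "scene", "comment", "direction" or "dialogue"
    text: str
    scene: Optional[str] = None      # place the block belongs to
    character: Optional[str] = None  # speaker, for dialogue and cue comments


def is_upper_line(line: str) -> bool:
    """True for lines like 'BOB' or 'BOB (looking up)' (assumed cue/heading style)."""
    head = line.split("(", 1)[0].strip()
    return any(c.isalpha() for c in head) and head == head.upper()


def parse(script: str) -> List[Block]:
    blocks: List[Block] = []
    scene: Optional[str] = None
    speaker: Optional[str] = None    # set by a cue, consumed by the next line
    for line in (ln.strip() for ln in script.splitlines()):
        if not line:
            continue
        if speaker is not None:
            # Assumption: exactly one line of dialogue follows a cue.
            blocks.append(Block("dialogue", line, scene, speaker))
            speaker = None
        elif is_upper_line(line) and " - " in line:
            place, comment = line.split(" - ", 1)
            scene = place.strip().lower()
            blocks.append(Block("scene", place.strip(), scene))
            blocks.append(Block("comment", comment.strip(), scene))
        elif is_upper_line(line):
            name, _, aside = line.partition("(")
            speaker = name.strip().lower()
            if aside:
                blocks.append(Block("comment", aside.rstrip(") "), scene, speaker))
        else:
            blocks.append(Block("direction", line, scene))
    return blocks


SAMPLE = """\
SOMEWHERE CORPORATION - OPTIONAL COMMENT
A guy named BOB is sitting at his computer.
BOB
Mmmm. Stackoverflow. I like.
Footsteps are heard approaching.
ALICE
Where's that report you said you'd have for me?
Closeup of clock ticking.
BOB (looking up)
Huh? What?
ALICE
Some more dialogue.
Some more stage directions.
"""

blocks = parse(SAMPLE)
bob_lines = [b.text for b in blocks if b.kind == "dialogue" and b.character == "bob"]
print(bob_lines)
```

Filtering by place works the same way, e.g. keeping blocks whose kind is "direction" and whose scene is "somewhere corporation". The obvious weak spot is a direction written entirely in capitals, which this sketch would mistake for a cue; if the real files keep their indentation, that is a more reliable signal.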
The script I'm using to extract the reviews for one of the books is:
URL:
www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird
from selenium import webdriver
import time
driver = webdriver.Chrome()
time.sleep(3)
driver.get('https://www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird')
time.sleep(5)
reviews = driver.find_elements_by_css_selector("div.reviewText")
for r in reviews:
    spanText = r.find_element_by_css_selector("span.readable:nth-child(2)").text
    print("Span text:", spanText)
The problem is that I can't extract the whole review text from div.reviewText > span. That span contains two nested spans: the first holds only a truncated preview (you have to click the "...more" link to see the rest), and the second holds the full text. I want to get the text from the second span. Can someone help me, please?
HTML (or you can visit the site via the link given above):
<div class="reviewText stacked">
<span id="reviewTextContainer35272288" class="readable">
<span id="freeTextContainer13558188749606170457">If I could give this no stars, I would. This is possibly one of my least favorite books in the world, one that I would happily take off of shelves and stow in dark corners where no one would ever have to read it again.
<br>
<br>I think that To Kill A Mockingbird has such a prominent place in (American) culture because it is a naive, idealistic piece of writing in which naivete and idealism are ultimately rewarded. It's a saccharine, rose-tinted eulogy for the nineteen thirties from an orator who comes not
</span>
<span id="freeText13558188749606170457" style="display:none">If I could give this no stars, I would. This is possibly one of my least favorite books in the world, one that I would happily take off of shelves and stow in dark corners where no one would ever have to read it again.
<br>
<br>I think that To Kill A Mockingbird has such a prominent place in (American) culture because it is a naive, idealistic piece of writing in which naivete and idealism are ultimately rewarded. It's a saccharine, rose-tinted eulogy for the nineteen thirties from an orator who comes not to bury, but to praise. Written in the late fifties, TKAM is free of the social changes and conventions that people at the time were (and are, to some extent) still grating at. The primary dividing line in TKAM is not one of race, but is rather one of good people versus bad people -- something that, of course, Atticus and the children can discern effortlessly.
<br>
<br>The characters are one dimensional. Calpurnia is the Negro who knows her place and loves the children; Atticus is a good father, wise and patient; Tom Robinson is the innocent wronged; Boo is the kind eccentric; Jem is the little boy who grows up; Scout is the precocious, knowledgable child. They have no identity outside of these roles. The children have no guile, no shrewdness--there is none of the delightfully subversive slyness that real children have, the sneakiness that will ultimately allow them to grow up. Jem and Scout will be children forever, existing in a world of black and white in which lacking knowledge allows people to see the truth in all of its simple, nuanceless glory.
<br>
<br>I think that's why people find it soothing: TKAM privileges, celebrates, even, the child's point of view. Other YA classics--Huckleberry Finn; Catcher in the Rye; A Wrinkle in Time; The Day No Pigs Would Die; Are You There, God? It's Me, Margaret; Bridge to Terabithia--feature protagonists who are, if not actively fighting to become adults, at least fighting to find themselves as people. There is an active struggle throughout each of those books to make sense of the world, to define the world as something larger than oneself, as something that the protagonist can somehow be a part of. To Kill A Mockingbird has no struggle to become part of the world--in it, the children *are* the world, and everything else is just only relevant in as much as it affects them. There's no struggle to make sense of things, because to them, it already makes sense; there's no struggle to be a part of something, because they're already a part of everything. There's no sense of maturation--their world changes, but it leaves them, in many ways, unchanged, and because of that, it fails as a story for me. The whole point of a coming of age story--which is what TKAM is generally billed as--is that the characters come of age, or at least mature in some fashion, and it just doesn't happen.
<br>
<br>All thematic issues aside, I think that the writing is very, er, uneven, shall we say? Overwhelmingly episodic, not terribly consistent, and largely as dimensionless as the characters.
<br>
</span>
<a data-text-id="13558188749606170457" href="#" onclick="swapContent($(this));; return false;">...more</a>
</span>
</div>
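Use get_attribute('textContent') to extract the hidden content. You also don't need the sleep calls.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird')
# The second nested span holds the full review text but is hidden (display:none).
# In Selenium 4 the equivalent call is driver.find_elements(By.CSS_SELECTOR, ...).
reviews = driver.find_elements_by_css_selector("span.readable span:nth-child(2)")
for r in reviews:
    # .text is empty for hidden elements; textContent is returned regardless of visibility.
    spanText = r.get_attribute('textContent')
    print("Span text:", spanText)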
The second span is hidden, so you cannot get its content with the text property.
Try
spanText = r.find_elements_by_css_selector("span.readable > span")[-1].get_attribute('textContent')
to get the content of the hidden element.
I'm developing a Liferay portlet supposed to offer a way of categorizing its content. I created multiple vocabularies (e.g. frogs, apes, birds).
In the view of THIS portlet I want to offer the categories of the "frogs"-vocabulary only. I know I could write some code to read the categories contained in the vocabulary to offer them in a combo-box.
But isn't there a way of convincing the built-in liferay-ui:asset-categories-selector tag to show one vocabulary only? Or maybe there's some other tag? (I'm stuck here.)
Here's my current code that lists all vocabularies:
<liferay-ui:asset-categories-selector
className="<%= JournalArticle.class.getName() %>"
/>
Unfortunately, this taglib's documentation is quite sparse. You might need to look into the implementation to find out what the attributes actually expect, but curCategoryIds might be a good place to start experimenting, if this use case is supported at all.
Alternatively it might be worth creating another tag (based on this one, in a new taglib) - if you do this, you might want to file an issue or feature request and contribute it back into the liferay-ui taglib.
As a beginner programmer, I've been practicing Vim on VimGolf recently and saw the command g? used effectively to turn many lines of 'Ivz' into 'Vim'. I understand that it shifts each letter 13 places along the alphabet, but I don't see how this would be useful except in unusual circumstances like these.
The g? command (type :help g? for brief documentation) implements the rot13 algorithm, which rotates each letter 13 places forward or backward in the alphabet.
I'm not sure how commonly it's used today, but on Usenet it used to be a common way to encode spoilers. For example, if I'm writing a post that gives away the ending of something that not everyone has seen, I might use rot13 to weakly encrypt part of the article. It's enough to make it impossible to read accidentally (unless you've had a lot of practice), but easy to read if you're using a newsreader that has a built-in rot13 function -- as most of them do.
For example:
Pretend this is a spoiler. Filter with rot13 to read it.
would become
Cergraq guvf vf n fcbvyre. Svygre jvgu ebg13 gb ernq vg.
If I don't want to read the spoiler, I can ignore it. If I do want to read it, I can decrypt it easily enough.
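Outside of Vim, the same transformation can be reproduced with Python's standard library, which ships a rot_13 text codec. This is just a sketch to show that the operation is its own inverse; the variable names are mine.

```python
import codecs

plain = "Pretend this is a spoiler. Filter with rot13 to read it."

# Letters move 13 places; digits, spaces and punctuation are left alone.
hidden = codecs.encode(plain, "rot_13")
print(hidden)

# Applying rot13 a second time restores the original text.
restored = codecs.encode(hidden, "rot_13")
print(restored == plain)
```

Because the alphabet has 26 letters, rotating by 13 twice is a full turn, which is why g? both "encrypts" and "decrypts".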
I have been using Vim for 4 years and learned about that command very early on but, even though I knew perfectly well what ROT13 was, I never found a use for g?.
Until a couple of weeks ago when I needed to add a bunch of <li> with unique IDs to a <ul> in an HTML prototype…
The starting point:
<ul>
<li id="lorem">foo</li>
<li id="ipsum">foo</li>
</ul>
After duplicating the two <li>:
<ul>
<li id="lorem">foo</li>
<li id="ipsum">foo</li>
<li id="lorem">foo</li>
<li id="ipsum">foo</li>
</ul>
After g?i" on the two new <li>'s ids:
<ul>
<li id="lorem">foo</li>
<li id="ipsum">foo</li>
<li id="yberz">foo</li>
<li id="vcfhz">foo</li>
</ul>
There! I found a practical use for g? in actual "programming"! Celebration!!!
It can prove useful when you want to quickly hide some text in a visible Vim buffer from onlookers.
For example, a password or token that you put in your code (but only do this temporarily, when you must).
Perhaps you want to invite a teammate to look at some of your code, or you work in a place where people walk behind you all the time; you can just rot13 the string and it is useless to them (at least at a glance).
It probably works best against non-technical passersby or for short periods of exposure.
Keep in mind that it does not rotate digits; for this purpose it would be even better if it could take a rotation size.
It can also become useful when you solve a CTF that has a rot13 challenge...
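To illustrate the two limitations mentioned above (digits are untouched and the rotation size is fixed), here is a small Python sketch of a rotation with a configurable shift that can optionally rotate digits as well. The function rotate and its parameters are hypothetical, nothing like this is built into Vim, and it is still only obfuscation, not encryption.

```python
import string


def rotate(text: str, shift: int = 13, digits: bool = False) -> str:
    """Rotate ASCII letters by `shift` places; optionally rotate digits too."""
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    s = shift % 26
    src = lower + upper
    dst = lower[s:] + lower[:s] + upper[s:] + upper[:s]
    if digits:
        d = shift % 10
        src += string.digits
        dst += string.digits[d:] + string.digits[:d]
    # Characters outside the table (spaces, punctuation, ...) pass through unchanged.
    return text.translate(str.maketrans(src, dst))


obscured = rotate("hunter2", 5, digits=True)
print(obscured)
print(rotate(obscured, -5, digits=True))  # a negative shift undoes it
```

With the default shift of 13 and digits=False this behaves like g?.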
I'm relatively new to Expression Engine, and as I'm learning it I am seeing some stuff missing that WordPress has had for a while. A big one for me is shortcodes, since I will use these to allow CMS users to place more complex content in place with their other content.
I'm not seeing any real equivalent to this in EE, apart from a forthcoming plugin that's in private beta.
As an initial test I'm attempting to fake shortcodes by using delimited strings (e.g. #foo#) in the content field, then using a regex to pull those out and pass them to a function that can retrieve the content out of EE's database.
This brings me to a second question, which is that in looking at EE's API docs, there doesn't appear to be a simple means of retrieving the channel entries programmatically (thinking of something akin to WP's built-in get_posts function).
So my questions are:
a) Can this be done?
b) If so, is my method of approaching it reasonable? Or is there something stupidly obvious I'm missing in my approach?
To reiterate, my main objective here is to have some means of allowing people managing content to drop a code in place in their content that will be replaced with channel content.
Thanks for any advice or help you can give me.
Here's a simple example of the functionality you're looking for.
1) Start by installing Low Replace.
2) Create two Global Variables called gv_hello and gv_goodbye with the values "Hello" and "Goodbye" respectively.
3) Put this text into the body of an entry:
[say_hello]
Nice to see you.
[say_goodbye]
4) Put this into your template, wrapping the Low Replace tag around your body field.
{exp:low_replace
find="[say_hello]|[say_goodbye]"
replace="{gv_hello}|{gv_goodbye}"
multiple="yes"
}
{body}
{/exp:low_replace}
5) It should output this into your browser:
Hello
Nice to see you.
Goodbye
Obviously, this is a really simple example. You can put full blown HTML into your global variable. For example, we've used that to render a complex, interactive graphic that isn't editable but can be easily dropped into a page by any editor.
Unfortunately, due to parse order issues, EE tags won't work inside Global Variables. If you need EE tags in your short code output, you'll need to use Low Variables addon instead of Global Variables.
Continued from the comment:
Do you have examples of the kind of shortcodes you want to support? I ask because I doubt that controlling the page layout from a text field or WYSIWYG field is the way to go.
If you want editors to be able to adjust the layout or show/hide extra parts of the page, giving them access to some extra fields in the channel is (IMO) much more manageable and future-proof. For instance, some select fields, a relationship (or Playa) field, or a Matrix field would let them choose which parts to include or exclude on a page, or which entry from another channel to pull content from.
As I said in the comment: I totally understand if you want to replace some #foo# tags with images or data from another field (see the other answers: nsm-transplant, low_replace). But giving an editor access to shortcodes and then picking them out is like writing a template engine to generate EE template code for the EE template engine.
Using some custom fields to let editors pick and choose parts to embed is, I think, much more manageable.
That being said, you could write a plugin that parses the shortcodes out of a textarea's content, and then write a fair amount of code to fetch data from the other modules you want to support. For channel entries you could build on the Channel Data library by objectiveHTML: https://github.com/objectivehtml/Channel-Data
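The parsing step of such a plugin is small. EE add-ons are written in PHP, so the Python below only illustrates the logic: find each #code# in the content and hand it to a handler that returns the replacement. The handlers dictionary and the function expand_shortcodes are hypothetical names for this sketch; a real handler would query channel data instead of returning a fixed string.

```python
import re
from typing import Callable, Dict

# A shortcode is a run of letters, digits or underscores between two '#'.
SHORTCODE = re.compile(r"#([A-Za-z0-9_]+)#")


def expand_shortcodes(content: str, handlers: Dict[str, Callable[[], str]]) -> str:
    def replace(match):
        handler = handlers.get(match.group(1))
        # Unknown codes are left untouched rather than silently dropped.
        return handler() if handler else match.group(0)

    return SHORTCODE.sub(replace, content)


# Hypothetical handler standing in for a lookup in EE's database.
handlers = {"foo": lambda: "<p>Content pulled from a channel entry</p>"}

result = expand_shortcodes("Intro text #foo# and #unknown# stay.", handlers)
print(result)
```

Leaving unknown codes in place is a deliberate choice here: an editor's typo stays visible on the page instead of disappearing.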
I hear you, I too miss shortcodes from WP -- though the reason they work so easily there is the ubiquity of the_content(). With the great flexibility of EE comes fewer blanket solutions.
I'd suggest looking at NSM Transplant. It should fit the bill for you.
There is also a plugin called Shortcode, which you can find on Devot-ee.
A quote from the page:
Shortcode aims to allow for more dynamic use of content by authors and
editors, allowing for injection of reusable bits of content or even
whole pieces of functionality into any field in EE
<div id="temp_1333021214801">
<input type="text"/>
</div>
$browser.text_field(:xpath,".//*[@id='temp_1333018770709']/input").set("apple")
I am getting error "unable to locate element", because the ID changes dynamically.
Please help me to set the text in the text field.
It seems your dynamic id always starts with temp_ followed by digits, so, given the information above, this should do it:
browser.div(:id, /temp_\d+/).text_field.set 'something'
The caveats with my solution: it assumes the id will always be temp_ followed by a run of digits, which seems to be the case in your sample above. It also assumes there is no other div matching div(:id, /temp_\d+/) in the DOM of that page, which most likely won't be an issue.
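If you want to sanity-check the pattern outside the browser first, the same expression can be tried against a few ids in Python. This is only a sketch: the list of ids is made up apart from the two values that appear in the question.

```python
import re

ID_PATTERN = re.compile(r"temp_\d+")

ids = ["temp_1333021214801", "temp_1333018770709", "temp_", "header"]

# search() finds the pattern anywhere in the string, like an unanchored regexp.
matching = [i for i in ids if ID_PATTERN.search(i)]
print(matching)

# Without anchors the pattern also matches inside longer ids.
print(bool(ID_PATTERN.search("my_temp_12x")))
print(bool(ID_PATTERN.fullmatch("my_temp_12x")))
```

If the page might contain ids that merely include temp_ plus digits somewhere in the middle, anchoring the pattern (for example /\Atemp_\d+\z/ in Ruby) narrows the match to ids of exactly that shape.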
If you have dynamic IDs I can suggest the following:
Code to object counts. For example
$browser.text_field(:index => 2)
gives the third text_field on the page.
Code to what is around the thing you're trying to find.
$browser.div(:name => 'mydiv').text_field(:index=>2)
gives the third text field in the div called 'mydiv'.
HOWEVER
If your front-end is less-than-testable in this way I highly suggest you put time into thinking over your commitment to automated testing in the first place. Any minor change to the software is going to have you working until 9pm pulling your hair out and rocking back and forth as you update all your scripts, so unless code maintenance is your weekend hobby think about semi-automation or exploratory testing or manual scripts. Talk to development (whomever that might be. It might be you!) or the higher-ups (unless that's you too) to see if it can be made more testable. Also don't use xpaths unless you take some deviant pleasure in it.
Hope that was helpful, I can't do anything specific without the source HTML.