Limit readable text with a 'read more' link?

I am having an issue with how to limit readable text with a 'read more' link below it. I am making my own CMS as a school project, so I would like to add that feature to it.
I have made it so I am able to create an article, but the text just keeps going if I write, say, 1000 words in one article. I only have a vague idea of how to limit the text to 50-100 words.
Has anyone made this before who could help me out with how they did it?

The idea is to have two methods, GetPreview(id, n) and GetArticle(id), where the first returns only the first n words of an article together with a link to a page that calls the second method.
Assuming that $articleText is the content of your article and a word is considered to be delimited by spaces, this is roughly how part of the GetPreview function would look:
$words = explode(" ", $articleText);
if (count($words) <= $n) {
    return $articleText; // short article: return it unchanged
}
$preview = array_splice($words, 0, $n); // $n is the number of words to keep
return implode(" ", $preview) . "... <br />" .
    "<a href=\"ViewArticle.php?id=" . $id . "\">Read More</a>";
where ViewArticle.php should be another page that loads the entire article.
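For context, a complete GetPreview might be wired up roughly like the sketch below; the getArticleText() helper is hypothetical and stands in for however your CMS loads an article body by id:

function GetPreview($id, $n) {
    $articleText = getArticleText($id); // hypothetical: fetch the article body from your database
    $words = explode(" ", $articleText);
    if (count($words) <= $n) {
        return $articleText; // already short enough
    }
    $preview = array_splice($words, 0, $n);
    return implode(" ", $preview) . "... <br />" .
        "<a href=\"ViewArticle.php?id=" . urlencode($id) . "\">Read More</a>";
}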
However, this example will only work for plain-text articles. If your articles contain HTML, the method to extract the preview becomes more complex, because you don't want it to break the content in the middle of an HTML tag.
Another approach, which is actually safer and is used in popular CMSes like WordPress, is to decorate your article with a specific HTML comment like:
<!-- more -->
When you display the entire article you don't have to do anything special, because the marker will not be displayed by the browser (remember, it's a comment).
However, in your GetPreview method you search for this string (a short sketch follows the list of advantages below), and if you find it:
- you take the text before it
- you append a 'Read More' link that redirects to the full article
There are multiple advantages to this method:
- you can safely use HTML in your articles
- the author has full control over how the preview looks
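A minimal sketch of that search-and-split step, assuming plain $articleText again; the function name, the urlencode() call and the ViewArticle.php wiring mirror the earlier example and are only illustrative:

function GetPreviewByMarker($id, $articleText) {
    $marker = "<!-- more -->";
    $pos = strpos($articleText, $marker);
    if ($pos === false) {
        return $articleText; // no marker: fall back to showing (or word-trimming) the whole article
    }
    // Everything before the marker is the preview the author chose.
    return substr($articleText, 0, $pos) . "<br />" .
        "<a href=\"ViewArticle.php?id=" . urlencode($id) . "\">Read More</a>";
}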

Related

NodeJS Jade (Pug) inline link in dynamic text

I have this NodeJS application that uses Jade as its template language. On one particular page, a text block is retrieved from the server, which reads the text from the database.
The problem is that the returned text might contain line breaks and links, and an operator might change this text at any time. How do I make these elements display correctly?
Most answers suggest using a new line:
p
  | this is the start of the para
  a(href='http://example.com') a link
  | and this is the rest of the paragraph
But I cannot do this, since I cannot know when the a element appears. I've solved how to render the newlines correctly with this trick:
p
  each l in line.description.split(/\n/)
    = l
    br
But I cannot seem to solve how to get links to render correctly. Does anyone know?
Edit:
I am open to any kind of format for links in the database, whatever would solve the issue. For example, say the database contains the following text:
Hello!
We would like you to visit [a("http://www.google.com")Google]
Then we would like that to output text that looks like this:
Hello!
We would like you to visit Google
Looks like what you're looking for is unescaped string interpolation. The link does not work in the output because Pug automatically escapes it. Wrap the content you want to insert with !{} and it should stop breaking links. (Disclaimer: make sure you don't leave user input unescaped; this is only a viable option if you know for sure the content of your DB does not contain unwanted HTML/JS code.)
See this CodePen for illustration.
With this approach, you would need to use standard HTML tags (<a>) in your DB text. If you don't want that, you could have a look at Pug filters such as markdown-it (you will still need to un-escape the compilation output of that filter).
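Applied to the loop from the question (and assuming the DB stores standard <a> tags, as described above), a minimal sketch might look like this:

p
  each l in line.description.split(/\n/)
    //- !{} interpolates without escaping, so stored <a> tags render as real links
    | !{l}
    br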

Getting thumbnails in OpenSearchServer search results

I need an alternative to Google Custom Search for a website I look after. It has to be something that will crawl a website, index it, allow fiddling with priorities, and then allow search queries via REST or something similar, returning XML or JSON etc. It needs to run on a Windows Server instance.
So, I'm up and running with http://www.opensearchserver.com/ and it seems to do the trick, but can't, for the life of me, work out how to get thumbnail images in the results? I've searched the documentation and read everything I could, but can't find out how to do this (or how to get my head around it).
I'm crawling standard web pages and they all have thumbnail meta data, which I'm assuming should be able to be parsed somehow for results and included in the JSON results?
Any pointers at all would be very helpful, thanks!
I figured this out; in case anyone else is struggling, here's how I did it. The answer is in the documentation, it's just not that obvious.
Read: http://www.opensearchserver.com/documentation/faq/crawling/how_to_extract_specific_information_from_web_pages.md - it contains the method
Assume you have set up a 'web crawler' index and that your pages carry a thumbnail meta tag like this:
<meta name="thumbnail" content="http://my_cdn.com/news/images/29637.jpg">
Go into Schema / Fields. Add a new field called 'thumbnail' with index: no, store: yes, vector: no, analyser: Text, copy of: blank. Save that.
Now go to Schema / Parser list and edit the HTML parser. Go to 'field mapping' and add a new regex mapping for the thumbnail in the HTML: we map from 'htmlSource' to 'thumbnail' with a matching regex.
My imperfect regex (that works though) is:
htmlSource -> linked in: thumbnail -> captured by:
(?s)<meta name="thumbnail" content="(.*?)">
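If you want to sanity-check that pattern before wiring it into the parser, a quick standalone PHP test (the sample HTML below is just the meta tag from above wrapped in a made-up page) might look like:

<?php
// Hypothetical sample page; only the meta tag matters for this check.
$html = '<html><head><meta name="thumbnail" content="http://my_cdn.com/news/images/29637.jpg"></head><body>...</body></html>';
if (preg_match('/(?s)<meta name="thumbnail" content="(.*?)">/', $html, $m)) {
    print $m[1]; // prints the captured thumbnail URL
}
?>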
Now SAVE this and go to Crawl / Manual crawl, enter a URL that has a thumbnail, and check whether the field now appears in the list below once the page is read. If not, check your regex, and check that you actually saved the HTML parser changes.
To get the thumbnail in your results, simply add the field name to the JSON you send with the query:
"returnedFields": [ "
"url",
"thumbnail"
],

How can I prevent certain elements from being displayed in the Google search excerpt?

Currently Google displays elements in the result excerpts that belong to the functional part of the site. Is there a way to exclude these elements from being crawled/displayed in Google?
Like eEdit, eDelete, etc. in the example above.
To exclude the pages from Google's index, block them using the robots.txt file, or, if it is just specific links, use the rel="nofollow" attribute on them.
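For the robots.txt route, a minimal rule would look like the following; the /edit/ path is just a placeholder for whatever functional URLs you want kept out of the index:

User-agent: *
Disallow: /edit/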
Hope this helps.
Update on my particular situation here: I just found out that the frontend code has been generated in a way where the title and the meta description were identical.
Google is smart enough to figure that if the copy is already displayed in the title of the search result, there is no reason to add it to the excerpt as well; instead it looks for content, believed to be valuable, from the actual page.
Lessons learned:
- there's no way to hide elements from Google but keep them visible to your users
- if you'd like to have control over the content displayed in Google searches, avoid using the same copy in your title and meta description

Calculate length of string using yahoo pipes

I am using Yahoo Pipes to fetch articles from various sources, including Google; however, articles from Google include the title and its source in the description. Is there a way in Yahoo Pipes to remove the title and source and leave the rest of the article intact? I tried to use the Sub-string module, however it requires the length of the string, which varies for each article. I guess if there were a way to calculate the length of the title and source and pass it to the Sub-string module, this might work.
Any help would be great.
Regards
Take a look at http://pipes.yahoo.com/pipes/pipe.info?_id=8KZMRx473hGtVMYsP27D0g, which can be used as a subpipe (i.e. within a loop) to calculate the length of a string. It should be relatively straightforward to add a second text input module and modify the Pipe to cater for your second text string.

How to extract the first few sentences from a body of text on a web page

We are building some sort of Digg-like site and want to automatically fetch a limited amount of text (2-3 sentences). It could be the last 3 sentences of the article, if that would be easier. At the moment we fetch web page content without a problem, but we want a universal script that gets just a few sentences. We want to avoid making custom scripts for each web site from which we want to get content.
I was thinking of locating the text block by its full stops: find full stops close together and then take the words around them. That is a rough idea. Does someone have another idea for how to extract just part of the text?
We don't want to scrape the full content.
Thank you.
You could look for large portions of the document that have less markup and less vertical whitespace. Download the page's source and strip out any markup using strip_tags(). Then you can search for, say, five consecutive sentences using regular expressions.
Here's an example script. It uses a class not included (an abstraction of curl_multi functions), but that class isn't really relevant for your question.
<?php
// MultipleRequester is a small wrapper around the curl_multi functions
// (not included here; any other way of fetching the page body would do).
require_once("./../MultipleRequester.php");

$requester = new MultipleRequester();
$requester->addGetRequest('test', 'http://www.businessweek.com/news/2011-08-24/gold-tumbles-most-since-march-2008-as-demand-for-haven-wanes.html');
$requester->execute();
$content = $requester->getContent('test');

// Strip all markup, then look for 2-5 consecutive sentence-like chunks:
// each starts with a capital letter and ends with a period.
$plainText = strip_tags($content);
$search = preg_match('/(\h{0,2}\v{0,2}\h{0,2}[A-Z]{1}[A-z0-9 ,\'")(.$]{10,1000}\.){2,5}/', $plainText, $matches);

if ($search) {
    print trim($matches[0]); // the extracted excerpt
} else {
    print "Could not extract anything.";
}
print "\n\n";
?>
This prints:
The dollar rose against a basket of six major currencies amid speculation about whether Federal Reserve Chairman Ben S. Bernanke will say this week that the central bank is willing to provide more stimulus to the economy. Central bankers meet this week in Jackson Hole, Wyoming, to address the U.S. recovery.
You may still have trouble with sites that mark up their content a lot. You might want to make the regular expression more lenient, particularly towards whitespace.
The regexp is a little messy, but you can tune it or write your own.
