Is there a mark-up language to describe meanings? - text

I am curious whether there is a mark-up language for describing the meaning of a text.
Here is an example of what I mean and how it could look:
<text>Stack Overflow is a programming Q & A site that’s free. Free to ask questions,
free to answer questions, free to read, free to index, built with plain old HTML, no
fake rot13 text on the home page, no scammy google-cloaking tactics, no salespeople, no
JavaScript windows dropping down in front of the answer asking for $12.95 to go away.
You can register if you want to collect karma and win valuable flair that will appear
next to your name, but otherwise, it’s just free. And fast. Very, very fast.</text>
And now I want to add meta-information to it so that I can give the text a meaning:
<mark from="0" to="14" object="Stack Overflow">Stack Overflow is an online community for coders.
The website is: www.stackoverflow.com</mark>
<mark from="20" to="31" object="programming" source="en.wikipedia.org/wiki/programming">
Computer programming (often shortened to programming or coding) is the process of
designing, writing, testing, debugging, and maintaining the source code of computer
programs</mark>
I hope such a language already exists and I simply didn't find it because of my poor search skills.
EDIT: I don't mean languages that work like HTML.
For me, this is standard HTML markup:
<p>My really <span class="important">interesting</span> paragraph</p>
I want to allow overlapping annotations, describe one part multiple times, and leave the original text untouched, as in my example above:
<text>Stack Overflow is a programming Q & A site that’s free. Free to ask questions,
free to answer questions, free to read, free to index, built with plain old HTML, no
fake rot13 text on the home page, no scammy google-cloaking tactics, no salespeople, no
JavaScript windows dropping down in front of the answer asking for $12.95 to go away.
You can register if you want to collect karma and win valuable flair that will appear
next to your name, but otherwise, it’s just free. And fast. Very, very fast.</text>
Now I want to mark up "Stack Overflow" in the first line and describe it. Next I want to describe "programming" and say what it is, then "Q & A", and after that comes the tricky part: I want to describe what a "programming Q & A site" is.
Here is something I just made up:
<mark type="description" line="1" from="0" to="15" subject="Stack Overflow" language="English" source="http://en.wikipedia.org/wiki/Stack_overflow">
In software, a stack overflow occurs when too much memory is used on the call stack.
</mark>
<mark type="description" line="1" from="25" to="40" subject="programming" language="English" source="http://en.wikipedia.org/wiki/programming">
Computer programming (often shortened to programming or coding) is the process of designing, writing, testing, debugging, and maintaining the source code of computer programs.
</mark>
<mark type="alias" line="1" from="42" to="47" subject="Q & A" language="English">
Question and Answer
</mark>
<mark type="description" line="1" from="25" to="52" subject="programming Q & A site" language="English" author="xMRW">
A website that offers people answers on questions related to the subject programming.
</mark>
<mark type="description" line="1" from="25" to="52" subject="programming Q & A site" language="German" author="xMRW">
Das gleiche in Deutsch.
</mark>

Yes, it's called semantic markup. For an example, you can read about RDF.
Start with this article on Wikipedia about the Semantic Web and this introduction to Semantic Markup; they are a good starting point with many other links and references.
You may also be interested in:
Wiki: it's not really semantic markup, but it is closer to what you wrote in your example.
HTML 5 microdata: to simply embed semantic metadata in your HTML markup.
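What the question describes (annotations that point into the text by offset, may overlap, and never touch the original) is usually called standoff annotation. Here is a minimal Python sketch of that idea, not any standard format; the `Mark` class, the `mark` and `marks_at` helpers, and the annotation bodies are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset
    kind: str   # e.g. "description" or "alias"
    body: str   # the annotation itself


def mark(text: str, phrase: str, kind: str, body: str) -> Mark:
    """Build a Mark covering the first occurrence of `phrase` in `text`."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in text")
    return Mark(start, start + len(phrase), kind, body)


def marks_at(marks: list[Mark], offset: int) -> list[Mark]:
    """Return every mark whose span covers the given character offset."""
    return [m for m in marks if m.start <= offset < m.end]


# The original text is stored once and never modified.
TEXT = "Stack Overflow is a programming Q & A site that's free."

# Marks live beside the text, so they can overlap freely.
MARKS = [
    mark(TEXT, "Stack Overflow", "description", "An online community for coders."),
    mark(TEXT, "programming", "description", "The process of writing computer programs."),
    mark(TEXT, "Q & A", "alias", "Question and Answer"),
    mark(TEXT, "programming Q & A site", "description",
         "A website that answers questions about programming."),
]
```

Calling `marks_at(MARKS, TEXT.find("Q & A"))` returns both the alias and the enclosing "programming Q & A site" description, which is the overlapping case that inline tags handle badly.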

Related

Searching across multiple languages -- how to?

TLDR: I wanna build multi-language search on my website ala Pinterest, how do I do that?
I am starting a website, where people can publish content that gets metadata typed by the user. People can then interact with the content by looking at it, liking it, commenting on it, sharing it to social media. Also content discovery is mostly done through search.
I do not wish to create geographic boundaries on my website. I would like people who speak any language to find content that is relevant to them in any language. This requirement makes sense because the content is highly visual, ala Pinterest. So even if I don't understand that the word "car" is written in French in the description, it's fine because I'll mostly be interested in seeing the car.
Pinterest is really, really good at searching across languages. For example, on uk.pinterest.com I typed "coupe carrée", which is French for "bob haircut", and all the results were visually relevant, even though the pin metadata is in English and the original web site is all in English.
How is that possible? How was Pinterest able to match my French search query to content whose text is all in English? Is there a translation step somewhere, such as coupe carrée > bob haircut > content containing "bob haircut"?
I looked at their engineering blog and all I found is tech to detect the original country and language of a website. Nothing about managing language in search.
Please let me know if this is the wrong place to ask a how-it-works question.
Thanks in advance for any help/pointers you will be able to share!
The general strategy in this case is to index your content in every language you want to be searchable.
This requires a language translation API at index time, plus a language identification model. Here's a Solr example.
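To make the index-time idea concrete, here is a toy Python sketch. It is not how Pinterest or Solr does it: the `TOY_TRANSLATIONS` table is a hypothetical stand-in for a real translation API, the `MultilingualIndex` class is invented for illustration, and language identification is skipped by assuming all source text is English.

```python
from collections import defaultdict

# Hypothetical stand-in for a translation API that would be called at index time.
TOY_TRANSLATIONS = {
    ("bob haircut", "fr"): "coupe carrée",
    ("red car", "fr"): "voiture rouge",
}


def translate(text, target_lang):
    """Return the toy translation, or the text unchanged if none is known."""
    return TOY_TRANSLATIONS.get((text, target_lang), text)


class MultilingualIndex:
    """Inverted index that stores each document under every target language."""

    def __init__(self, languages):
        self.languages = list(languages)
        self.postings = defaultdict(set)  # token -> set of document ids

    def add(self, doc_id, text):
        # Index the original text plus one translated variant per language.
        variants = [text] + [translate(text, lang) for lang in self.languages]
        for variant in variants:
            for token in variant.lower().split():
                self.postings[token].add(doc_id)

    def search(self, query):
        # A document matches when it contains every token of the query.
        tokens = query.lower().split()
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result
```

With `index = MultilingualIndex(["fr"])` and `index.add(1, "bob haircut")`, a query for "coupe carrée" finds document 1 without any translation at query time. The cost is paid once per document and per language when indexing.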

Is the h1 tag the thing google is looking for?

I've heard from a friend that I should use more h1 tags, cause that will be the first thing Google will look for. Is he right?
EDIT
Sorry for asking on stack overflow. I'll try web masters.
It's true that the contents of h1 tags are important to search engines, but this only works if the h1 tag really is the main heading of the page. It's considered good practice to have only one h1 per page. Using many h1 tags to "increase the importance of some search terms" usually has the opposite effect (i.e. it is bad for SEO).
See also recommendations from Google in the following PDF (on p.20): https://static.googleusercontent.com/media/www.google.com/en//webmasters/docs/search-engine-optimization-starter-guide.pdf
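If you want to check your own pages against the one-h1 guideline, a rough Python sketch using only the standard library could look like this. The `H1Collector` class and `h1_texts` function are names I made up; this is just a quick audit helper, not anything Google publishes.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collect the text content of every <h1> element in a page."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._depth = 0  # greater than zero while inside an <h1>

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._depth += 1
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <em>) still belongs to the h1.
        if self._depth:
            self.headings[-1] += data


def h1_texts(html):
    """Return the stripped text of each <h1> in document order."""
    collector = H1Collector()
    collector.feed(html)
    collector.close()
    return [heading.strip() for heading in collector.headings]
```

A page where `len(h1_texts(page))` is greater than 1 is worth a second look.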

Mysterious additional description about my site on Google search [closed]

When I use Google search service, the result is somehow odd. Google adds this description below my website link,
Enjoy the gaming fun by playing lots of games & win the exciting online casino uk, so keep playing. Many blackjack online games comes with exciting offers, the ...
But I never added this text to any part of the WordPress site or... I don't know. Where does this text come from?
This is my website link.
Probably this:
<!-- /all in one seo pack pro -->
<meta name="generator" content="Powered by Visual Composer - drag and drop page builder for WordPress."/>
<div id="caa7rt"><!--googleoff: snippet--> Enjoy the gaming fun by playing lots of games & win the exciting online casino uk, so keep playing.
Many blackjack online games comes with exciting offers, the players want to grab the opportunity.
Get the details of digital encryption system and other security features adopted by Vegas slots via pokernews live reporting.
<!--googleon: snippet--> <script type="text/javascript">document.getElementById("caa7rt").innerHTML = '';</script> </div><style type="text/css">.broken_link, a.broken_link {
text-decoration: line-through;
}</style>

What text processing tool is recommended for parsing screenplays?

I have some plain-text kinda-structured screenplays, formatted like the example at the end of this post. I would like to parse each into some format where:
It will be easy to pull up just stage directions that deal with a specific place.
It will be easy to pull up just dialogue belonging to a particular character.
The most obvious approach I can think of is using sed or perl or php to put div tags around each block, with classes representing character, location, and whether it's stage directions or dialogue. Then, open it up as a web-page and use jQuery to pull out whatever I'm interested in. But this sounds like a roundabout way to do it and maybe it only seems like a good idea because these are tools I'm accustomed to. But I'm sure this is a recurring problem that's been solved before, so can anybody recommend a more efficient workflow that can be used on a Linux box? Thanks.
Here is some sample input:
SOMEWHERE CORPORATION - OPTIONAL COMMENT
A guy named BOB is sitting at his computer.
BOB
Mmmm. Stackoverflow. I like.
Footsteps are heard approaching.
ALICE
Where's that report you said you'd have for me?
Closeup of clock ticking.
BOB (looking up)
Huh? What?
ALICE
Some more dialogue.
Some more stage directions.
Here is what sample output might look like:
<div class='scene somewhere_corporation'>
<div class='comment'>OPTIONAL COMMENT</div>
<div class='direction'>A guy named BOB is sitting at his computer.</div>
<div class='dialogue bob'>Mmmm. Stackoverflow. I like.</div>
<div class='direction'>Footsteps are heard approaching.</div>
<div class='dialogue alice'>Where's that report you said you'd have for me?</div>
<div class='direction'>Closeup of clock ticking.</div>
<div class='comment bob'>looking up</div>
<div class='dialogue bob'>Huh? What?</div>
<div class='dialogue alice'>Some more dialogue.</div>
<div class='direction'>Some more stage directions.</div>
</div>
I'm using DOM as an example, but again, only because that's something I understand. I'm open to whatever is considered a best practice for this type of text-processing task if, as I suspect, roll-your-own regexps and jQuery is not the best practice. Thanks.
You could use Celtx to import plain text scripts and export them to HTML (and RDF/XML for the metadata) (see this related thread and this blog post, which describes the file structure).
Other screenplay editors like Trelby might offer this feature, too.
There is also Fountain, a plain text markup language for screenwriting. They offer libraries that you might be able to use for your purpose (I did not check whether they offer anything for importing and converting):
Fountain is free and open-source, with libraries that make it easy to add support in your apps.
Even if those projects can’t be used for your purpose, you could at least reuse their format for your output.
If your input is not too noisy, i.e. if you can trust some regularities such as the indentation, which is larger for dialogue than for comments, I would use a simple Context Free Grammar. There are good implementations in all languages and you'll find a lot of information on SO.
If your input varies a lot, then take the machine learning route, but you'll need to have a big number of inputs with human-validated output for training, which might be a hassle.
In any case, I would never, ever use regular expressions for problems like that.
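To show how far the indentation regularity gets you, here is a small hand-written line classifier in Python. It is not a full context-free grammar and uses no regular expressions. The layout is assumed, since the sample in the question lost its indentation: scene headings and directions start at column 0, dialogue is indented, and character cues are indented by at least `CUE_INDENT` spaces. All names here are hypothetical.

```python
CUE_INDENT = 8  # assumed: character cues are indented at least this far


def parse_screenplay(text):
    """Classify each non-empty line using only its indentation and case."""
    blocks = []
    scene = None
    speaker = None

    def emit(kind, body, who=None):
        blocks.append({"kind": kind, "scene": scene, "speaker": who, "text": body})

    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        line = raw.strip()
        if indent >= CUE_INDENT:
            # Character cue, optionally followed by a parenthetical note.
            name, _, note = line.partition("(")
            speaker = name.strip().lower()
            if note:
                emit("comment", note.rstrip(") "), speaker)
        elif indent > 0:
            # Indented text belongs to the most recent cue.
            emit("dialogue", line, speaker)
        elif line.isupper():
            # An all-caps line at column 0 opens a new scene.
            heading, _, comment = line.partition(" - ")
            scene = heading.strip().lower()
            if comment:
                emit("comment", comment.strip())
        else:
            emit("direction", line)
    return blocks


SAMPLE = "\n".join([
    "SOMEWHERE CORPORATION - OPTIONAL COMMENT",
    "A guy named BOB is sitting at his computer.",
    "        BOB",
    "    Mmmm. Stackoverflow. I like.",
    "Footsteps are heard approaching.",
    "        ALICE",
    "    Where's that report you said you'd have for me?",
    "Closeup of clock ticking.",
    "        BOB (looking up)",
    "    Huh? What?",
])
```

Pulling up one character's lines is then a plain filter, e.g. `[b["text"] for b in parse_screenplay(SAMPLE) if b["kind"] == "dialogue" and b["speaker"] == "bob"]`. If the real scripts break these layout assumptions, a proper grammar is the sturdier choice.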

How to get a description of a URL

I have a list of URLs and am trying to collect their "descriptions." By description I mean what comes up, for example, if you Googled the link. For example, Googling http://stackoverflow.com shows the description as
A language-independent collaboratively
edited question and answer site for
programmers. Questions and answers
displayed by user votes and tags.
This is the data I'm trying to accumulate for the URLs I have.
I tried parsing the URL's meta-descriptions, however most of them are lacking a meta-description (yet Google and other search engines manage to get a description somehow).
Any ideas? Should I just "google" each link and scrape the data? I have a feeling Google wouldn't like this...
Thanks guys.
Different search engines have different algorithms to get the description out of the page if/when it lacks the description meta tag. Some ignore the tag even if it's there.
If you want the description Google has, the most accurate way to get it would be to scrape it. Otherwise, you could write your own or look around on the web for code that does it.
These are called snippets.
Google use proprietary (and possibly patented) methods to garner this information, so there is no simple answer.
As you suggest, they will use meta-description information if it is there. (How to set the meta-information to help Google.)
They will also honour requests from the page authors to NOT include snippets. (How to prevent Google from displaying snippets) You should probably respect this too (as well as robots.txt, of course.)
You may have some luck with existing auto-summary packages, such as OTS.
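On the robots.txt point: if you do end up fetching pages yourself, Python's standard library can check the rules for you. A minimal sketch, assuming you have already downloaded the robots.txt body as a string; the `allowed` helper and the "my-bot" user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ROBOTS = "User-agent: *\nDisallow: /private/\n"
```

For example, `allowed(ROBOTS, "my-bot", "http://example.com/private/page")` is False, while the same call for `/public/page` is True.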
You may want to check AboutUs.org (e.g. http://www.aboutus.org/StackOverflow.com).
But, there's little chance that the site will have an aboutus page and not have a meta description.
Some info that might explain how google does this:
Webmasters/Site owners Help
Adding a URL to google
I am not familiar with Google APIs, but perhaps there is an official way to get such information.
Interesting. Some sources are better than others.
For "audiotuts.com" google has a worse description than AboutUs.com.
Google
Nov 18th in General by Joel Falconer ·
1. Recently, an AUDIOTUTS reader asked me about creative process. While this
is a topic that can’t be made into a
...
AboutUs.com:
AUDIOTUTS is a blog/tutorial site for
musicians, producers and audio
junkies! It is the sister site of the
popular PSDTUTS, VECTORTUTS and
NETTUTS.
I hate problems like these... they should be trivial but they aren't!
If you can assume English content, you can first look for Meta Description, and if that doesn't work, you can look for the first two or three sentence-like word sequences.
A product I worked on looked for the first P or DIV that contained more than one sequence of > n "words" delimited by periods. It would use the two or three sentence-like sequences, up to x total words, as a summary paragraph. It wasn't 100% accurate, but good enough for the average case. The number of words was adjusted a few times to eliminate things like navigation elements.
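The heuristic described above can be sketched roughly like this in Python. This is my reading of the description, not the original product's code: the names and the default thresholds (`min_words`, `max_sentences`, `max_words`) are assumptions you would tune, and nested or unclosed tags are handled only crudely.

```python
from html.parser import HTMLParser


class _Extractor(HTMLParser):
    """Collect the meta description and the text of each <p> or <div>."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.blocks = []   # text of each block, in document order
        self._stack = []   # indices of the currently open blocks

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = (attrs.get("content") or "").strip() or None
        elif tag in ("p", "div"):
            self.blocks.append("")
            self._stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in ("p", "div") and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text goes to the innermost open block.
        if self._stack:
            self.blocks[self._stack[-1]] += data


def sentences(text, min_words):
    """Period-delimited chunks that contain more than `min_words` words."""
    parts = [part.strip() for part in text.split(".")]
    return [part for part in parts if len(part.split()) > min_words]


def describe(html, min_words=3, max_sentences=3, max_words=40):
    """Meta description if present, else the first sentence-like block."""
    extractor = _Extractor()
    extractor.feed(html)
    extractor.close()
    if extractor.meta_description:
        return extractor.meta_description
    for block in extractor.blocks:
        found = sentences(block, min_words)
        if len(found) > 1:  # more than one sentence-like sequence
            words = (". ".join(found[:max_sentences]) + ".").split()
            return " ".join(words[:max_words])
    return None
```

Requiring more than one sentence-like sequence is what skips navigation bars such as "Home | About | Contact", which contain words but no period-delimited sentences.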

Resources