Do not remove extra lines while processing the crawled text from webpages

I searched the parse-html plugin but could not find where to change the code so that it does not remove the extra lines from an HTML page.
While crawling, Nutch removes all the extra lines from the crawled text. I want to keep the text together with whatever newlines are present on the website. For example, on crawling this page https://www.modernfamilydental.net/,
the expected output is :\n\n\n\nSan Francisco, CA Dentist\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWould you like to switch to the accessible version of this site?\nGo to accessible site\n\nClose modal window\n\n\n\n\n\nDon\'t need the accessible version of this site?\nHide the accessibility button\n\nClose modal window\n\n\n\n\n\n\nAccessibility View\n\n\nClose toolbar\n\n\n\n\nJavascript must be enabled for the correct page display\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModern Family Dental Hao Tran, DMD\nDentist located in Laurel Heights, San Francisco, CA\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n(415) 752-5244\n\n\n \n\n\n\n\n\n\n\n\n\nMenu\n\n\n\n\nHome\n\n\nServices\n \nLatest Equipment\n\n\nInsurance\n\n\nTeeth Whitening\n\n\nCrowns & Bridges\n\n\nSmile Makeovers\n\n\nResin Composite Bonding\n\n\nVeneers\n\n\nImplant Retained Dentures\n\n\nNight Guards\n\n\nMetal-Free Restoration\n\n\nInvisalign\n\n\nDental Examination
but the output from Nutch is:
San Francisco, CA Dentist\nWould you like to switch to the accessible version of this site?\nGo to accessible site\nClose modal window\nDon\'t need the accessible version of this site?\nHide the accessibility button\n\nClose modal window\nAccessibility View\n\n\nClose toolbar\n\n\n\n\nJavascript must be enabled for the correct page display\nModern Family Dental Hao Tran, DMD\nDentist located in Laurel Heights, San Francisco, CA\n(415) 752-5244\nMenu\nHome\nServices\nLatest Equipment\nInsurance\nTeeth Whitening\nCrowns & Bridges\nSmile Makeovers\n\n\nResin Composite Bonding\nVeneers\nImplant Retained Dentures\nNight Guards\nMetal-Free Restoration\nInvisalign\nDental Examination
May I know which plugin code I should change, or should I change the code of parse_text?

As I already answered in the comment section:
If you do not want to read from the /content folder of the segments, you can do the following. I'm assuming you are using the parse-html or parse-tika plugin to parse HTML content.
If you are using either of them, the Nutch plugin uses the DOMContentUtils API to extract the parsed text from HTML.
// This method extracts text from the Node object and appends it to StringBuffer sb
public boolean getText(StringBuffer sb, Node node,
    boolean abortOnNestedAnchors) {
  if (getTextHelper(sb, node, abortOnNestedAnchors, 0)) {
    return true;
  }
  return false;
}
In the getTextHelper method you can comment out the line text = text.replaceAll("\\s+", " "); so that it no longer replaces each run of whitespace characters [ \t\r\n\f] with a single space.
private boolean getTextHelper(StringBuffer sb, Node node,
    boolean abortOnNestedAnchors, int anchorDepth) {
  boolean abort = false;
  NodeWalker walker = new NodeWalker(node);
  while (walker.hasNext()) {
    Node currentNode = walker.nextNode();
    String nodeName = currentNode.getNodeName();
    short nodeType = currentNode.getNodeType();
    Node previousSibling = currentNode.getPreviousSibling();
    if (previousSibling != null
        && blockNodes.contains(previousSibling.getNodeName().toLowerCase())) {
      appendParagraphSeparator(sb);
    } else if (blockNodes.contains(nodeName.toLowerCase())) {
      appendParagraphSeparator(sb);
    }
    if ("script".equalsIgnoreCase(nodeName)) {
      walker.skipChildren();
    }
    if ("style".equalsIgnoreCase(nodeName)) {
      walker.skipChildren();
    }
    if (abortOnNestedAnchors && "a".equalsIgnoreCase(nodeName)) {
      anchorDepth++;
      if (anchorDepth > 1) {
        abort = true;
        break;
      }
    }
    if (nodeType == Node.COMMENT_NODE) {
      walker.skipChildren();
    }
    if (nodeType == Node.TEXT_NODE) {
      // cleanup and trim the value
      String text = currentNode.getNodeValue();
      text = text.replaceAll("\\s+", " "); // <-- the line to comment out
      text = text.trim();
      if (text.length() > 0) {
        appendSpace(sb);
        sb.append(text);
      } else {
        appendParagraphSeparator(sb);
      }
    }
  }
  return abort;
}
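If dropping the replaceAll call entirely keeps more whitespace than you want (tabs and runs of spaces as well as newlines), a narrower pattern may be closer to the goal. The following is only a sketch, not Nutch code: the class and method names are hypothetical, and it assumes you would call something like collapseKeepingNewlines in place of the original replaceAll line. Note that the text.trim() call that follows in getTextHelper still strips leading and trailing whitespace (including newlines) from each text node, so it would probably need adjusting too.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch or DOMContentUtils.
final class WhitespaceKeepNewlines {
    // Runs of horizontal whitespace (space, tab, form feed, carriage return), but never '\n'.
    private static final Pattern HORIZONTAL_WS = Pattern.compile("[ \\t\\f\\r]+");

    /** What the original line does: every whitespace run becomes one space. */
    static String collapseAll(String text) {
        return text.replaceAll("\\s+", " ");
    }

    /** Collapses horizontal whitespace to one space and leaves every '\n' in place. */
    static String collapseKeepingNewlines(String text) {
        return HORIZONTAL_WS.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String raw = "San Francisco,\t CA   Dentist\n\n\nMenu";
        System.out.println(collapseAll(raw));
        System.out.println(collapseKeepingNewlines(raw));
    }
}
```

With this variant the three newlines between "Dentist" and "Menu" in the sample string survive, while the tab and the repeated spaces are still reduced to single spaces.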

What would be the reason that I can't make the ElementIDs of these objects in Revit match ones in a Revit file?

I am creating a plugin that makes use of the code available from BCFier to select elements from an external server version of the file and highlight them in a Revit view, except the elements are clearly not found in Revit as all elements appear and none are highlighted. The specific pieces of code I am using are:
private void SelectElements(Viewpoint v)
{
    var elementsToSelect = new List<ElementId>();
    var elementsToHide = new List<ElementId>();
    var elementsToShow = new List<ElementId>();
    var visibleElems = new FilteredElementCollector(OpenPlugin.doc, OpenPlugin.doc.ActiveView.Id)
        .WhereElementIsNotElementType()
        .WhereElementIsViewIndependent()
        .ToElementIds()
        .Where(e => OpenPlugin.doc.GetElement(e).CanBeHidden(OpenPlugin.doc.ActiveView)); //might affect performance, but it's necessary
    bool canSetVisibility = (v.Components.Visibility != null &&
        v.Components.Visibility.DefaultVisibility &&
        v.Components.Visibility.Exceptions.Any());
    bool canSetSelection = (v.Components.Selection != null && v.Components.Selection.Any());
    //loop elements
    foreach (var e in visibleElems)
    {
        //string guid = ExportUtils.GetExportId(OpenPlugin.doc, e).ToString();
        var guid = IfcGuid.ToIfcGuid(ExportUtils.GetExportId(OpenPlugin.doc, e));
        Trace.WriteLine(guid.ToString());
        if (canSetVisibility)
        {
            if (v.Components.Visibility.DefaultVisibility)
            {
                if (v.Components.Visibility.Exceptions.Any(x => x.IfcGuid == guid))
                    elementsToHide.Add(e);
            }
            else
            {
                if (v.Components.Visibility.Exceptions.Any(x => x.IfcGuid == guid))
                    elementsToShow.Add(e);
            }
        }
        if (canSetSelection)
        {
            if (v.Components.Selection.Any(x => x.IfcGuid == guid))
                elementsToSelect.Add(e);
        }
    }
    try
    {
        OpenPlugin.HandlerSelect.elementsToSelect = elementsToSelect;
        OpenPlugin.HandlerSelect.elementsToHide = elementsToHide;
        OpenPlugin.HandlerSelect.elementsToShow = elementsToShow;
        OpenPlugin.selectEvent.Raise();
    }
    catch (System.Exception ex)
    {
        TaskDialog.Show("Exception", ex.Message);
    }
}
This is the section that should filter the lists, and it does run, since it produces IDs that look like this:
3GB5RcUGnAzQe9amE4i4IN
3GB5RcUGnAzQe9amE4i4Ib
3GB5RcUGnAzQe9amE4i4J6
3GB5RcUGnAzQe9amE4i4JH
3GB5RcUGnAzQe9amE4i4Ji
3GB5RcUGnAzQe9amE4i4J$
3GB5RcUGnAzQe9amE4i4GD
3GB5RcUGnAzQe9amE4i4Gy
3GB5RcUGnAzQe9amE4i4HM
3GB5RcUGnAzQe9amE4i4HX
3GB5RcUGnAzQe9amE4i4Hf
068MKId$X7hf9uMEB2S_no
The trouble is that comparing these to the list of IDs in the IFC file we imported from reveals that they do not appear in the IFC file, and looking in Revit I found that none of the GUIDs in Revit were in the list that appeared either. Almost all the objects also shared the same leading part of their IDs, and I'm not experienced enough to know how likely that is.
So my question is, is it something in this code that is an issue?

The IFC GUID is based on the Revit UniqueId but not identical. Please read about the Element Identifiers in RVT, IFC, NW and Forge to learn how they are connected.
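To see why the strings above look the way they do, it may help to sketch the usual IFC GlobalId compression: the 128 bits of a GUID written as 22 characters over a 64-character alphabet. The class below is a hypothetical illustration in Java, not Revit or BCFier code, and it assumes the GUID bytes are taken in the order they appear in the canonical hex string. It does not cover how Revit derives the export GUID from the UniqueId; that is what the article explains.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper for illustration; not part of the Revit API or BCFier.
final class IfcGuidSketch {
    // The 64-character alphabet used by IFC GlobalId strings.
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$";

    /** Encodes the 128 bits of a GUID as 22 base-64 characters. */
    static String compress(UUID guid) {
        byte[] bytes = ByteBuffer.allocate(16)
            .putLong(guid.getMostSignificantBits())
            .putLong(guid.getLeastSignificantBits())
            .array();
        BigInteger value = new BigInteger(1, bytes);
        char[] out = new char[22];
        // Fill from the right: each character carries 6 bits, the first only 2.
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = ALPHABET.charAt(value.intValue() & 63);
            value = value.shiftRight(6);
        }
        return new String(out);
    }

    /** Decodes a 22-character IFC GlobalId back into a GUID. */
    static UUID expand(String ifcGuid) {
        if (ifcGuid.length() != 22) {
            throw new IllegalArgumentException("expected 22 characters");
        }
        BigInteger value = BigInteger.ZERO;
        for (char c : ifcGuid.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + c);
            }
            value = value.shiftLeft(6).or(BigInteger.valueOf(digit));
        }
        if (value.bitLength() > 128) {
            throw new IllegalArgumentException("first character must be 0-3");
        }
        return new UUID(value.shiftRight(64).longValue(), value.longValue());
    }

    public static void main(String[] args) {
        // Two GUIDs that differ only in their last byte.
        UUID a = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        UUID b = UUID.fromString("123e4567-e89b-12d3-a456-4266141740ff");
        System.out.println(compress(a));
        System.out.println(compress(b));
    }
}
```

Under this encoding, GUIDs that differ only in their low-order bits compress to strings that share a long common prefix and differ only near the end, which would be consistent with most of the IDs listed in the question starting with the same characters.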

How to insert an empty folder with desired name under an item in Sitecore programmatically?

I need to programmatically create empty folders in each section (Content, Layout, Renderings, Media Library, Templates) under the Sitecore node.
Please advise.

A folder in Sitecore is an item, created for example from the template /sitecore/templates/Common/Folder {A87A00B1-E6DB-45AB-8B54-636FEC3B5523}.
So you need code to add an item.
See: How to programmatically populate Sitecore items (Add item and fields)?
https://briancaos.wordpress.com/2011/01/14/create-and-publish-items-in-sitecore/
http://learnsitecore.cmsuniverse.net/en/Developers/Articles/2009/06/ProgramaticallyItems2.aspx
For example, under the Layout folder you can use another template: /sitecore/templates/System/Layout/Renderings/Sublayout Folder
So there are several folder templates, and of course you can create your own, adding the insert options you need or setting a nice icon in the standard values.
Summarized:
You need privileges to create a Sitecore item; you can use the SecurityDisabler or UserSwitcher.
Get the parent item.
Create the item with the template you want.

//Get the master database first
Sitecore.Data.Database masterDB = Sitecore.Configuration.Factory.GetDatabase("master");
//Creating a folder under "Renderings". Change path as per requirement
Sitecore.Data.Items.Item parentNode= masterDB.GetItem("/sitecore/layout/Renderings");
//Always get the folder template from this location
Sitecore.Data.Items.Item folder = masterDB.GetItem("/sitecore/templates/Common/Folder");
//Add the folder at desired location
parentNode.Add("Folder Name", new TemplateItem(folder));

As Jan Bluemink said:
public static Item AddFolder(String name, Item parent = null)
{
    Database myDatabase = Sitecore.Context.Database;
    if (parent == null)
    {
        return null;
    }
    Item kiddo = null;
    try
    {
        Sitecore.Data.Items.TemplateItem folderTemplate = myDatabase.GetTemplate("{EB395152-CC2F-4ECB-8FDD-DE6822517BC8}");
        using (new Sitecore.SecurityModel.SecurityDisabler())
        {
            kiddo = parent.Add(name, folderTemplate);
            // Insert values in fields.
            // Possibly you need to change the language and add a version to update.
            // Let's say a template "article with some id {00101010101010-100110-1010100-12323}" with two fields: single-line text, and multi-line text or rich text editor
            //kiddo.Editing.BeginEdit();
            //try
            //{
            //    kiddo.Fields["Title"].Value = "Title 1";
            //    kiddo.Fields["Description"].Value = "description 1";
            //    kiddo.Editing.EndEdit();
            //}
            //catch
            //{
            //    kiddo.Editing.CancelEdit();
            //}
        }
    }
    catch (Exception)
    {
        // Any failure is swallowed here and reported to the caller as null.
        return null;
    }
    return kiddo;
}
and the call:
Item content = Sitecore.Context.Database.GetItem("/sitecore/content");
Item contentFolder = AddFolder("folder", content);
Item medialib = Sitecore.Context.Database.GetItem("/sitecore/media library/medialib");
Item medialibFolder = AddFolder("folder", medialib);

Remove an ICC profile from a PDF with ABCpdf

We have a small utility program that removes ICC profiles, and I'm trying to optimize it.
The current program uses roughly this method to remove an ICC profile:
foreach (var item in doc.ObjectSoup)
{
    if (item != null && doc.GetInfo(item.ID, "/ColorSpace*[0]*:Name").Equals("ICCBased", StringComparison.InvariantCultureIgnoreCase))
    {
        int profileId = doc.GetInfoInt(item.ID, "/ColorSpace*[1]:Ref"); // note the [1]: why is it there?
        if (profileId != 0)
        {
            doc.GetInfo(profileId, "Decompress");
            string profileData = doc.GetInfo(profileId, "Stream");
            // this outputs the ICC profile raw data, with the profile's name somewhere up top
            Console.WriteLine(string.Format("ICC profile for object ID {0}: {1}", item.ID, profileData));
            doc.SetInfo(profileId, "Stream", string.Empty);
            doc.GetInfo(profileId, "Compress");
        }
    }
}
Now I want to optimize it so that it removes only some profiles (depending on their names), or only RGB profiles, for instance (keeping CMYK ones). So I wanted to use actual objects:
foreach (var item in doc.ObjectSoup)
{
    if (doc.GetInfo(item.ID, "Type") == "jpeg") // only work on PixMaps
    {
        PixMap pm = (PixMap)item;
        if (pm.ColorSpaceType == ColorSpaceType.ICCBased)
        {
            // pm.ColorSpace.IccProfile is always null, so I can't really set it to null or Recolor() it because that would change nothing
            // Also, there should already be an ICC profile (ColorSpaceType = ICCBased) but ColorSpace.IccProfile creates one (which is by design if there is none)
            Console.WriteLine(string.Format("ICC profile for object ID {0}: {1}", item.ID, pm.ColorSpace.IccProfile));
        }
    }
}
Here is a sample program that showcases the problem: https://github.com/tbroust-trepia/abcpdf8-icc-profiles
Am I doing something wrong? Is there something weird going on with the provided images?
I'm using ABCPdf 8.
Thanks for your help.

Getting text layer shadow parameters (ExtendScript CS5, Photoshop scripting)

Is there any way to get the shadow parameters of a text (or any other) layer in Adobe Photoshop CS5 using ExtendScript, for further conversion to a CSS3-like text string?
Thanks!

There is a way.
You have to use the ActionManager:
var ref = new ActionReference();
ref.putEnumerated( charIDToTypeID("Lyr "), charIDToTypeID("Ordn"), charIDToTypeID("Trgt") );
var desc = executeActionGet(ref).getObjectValue(stringIDToTypeID('layerEffects')).getObjectValue(stringIDToTypeID('dropShadow'));
desc.getUnitDoubleValue(stringIDToTypeID('distance'))
Here "dropShadow" is the layer effect you want to read and, for example, "distance" is the parameter that will be returned. Other layer effects and parameters are only known by their event IDs. Look in the (poorly written) documentation if you need other event IDs.
The next ActionManager code checks whether the layer has a drop shadow layer style.
function hasDropShadow() {
    var res = false;
    var ref = new ActionReference();
    ref.putEnumerated( charIDToTypeID("Lyr "), charIDToTypeID("Ordn"), charIDToTypeID("Trgt") );
    var hasFX = executeActionGet(ref).hasKey(stringIDToTypeID('layerEffects'));
    if ( hasFX ){
        ref = new ActionReference();
        ref.putEnumerated( charIDToTypeID("Lyr "), charIDToTypeID("Ordn"), charIDToTypeID("Trgt") );
        res = executeActionGet(ref).getObjectValue(stringIDToTypeID('layerEffects')).hasKey(stringIDToTypeID('dropShadow'));
    }
    return res;
}
This thread explains more: http://forums.adobe.com/thread/714406
If you find a way to SET the shadow without setting other params, let me know...

Probably not the answer you're looking for but there is really no way to access the individual properties of layer styles from extendscript. The only method in the API (as of CS6) that references layer styles is ArtLayer.applyStyle(name). You actually have to create a style in Photoshop and save to the palette by name in order to use this.
The only thing I can think of is to actually parse the .asl files found in adobe/Adobe Photoshop/presets/styles/ using C/C++. These files contain several layer styles saved in a proprietary format. I haven't found any libraries to parse these files but they may exist.

If you have Photoshop CS6.1 (or later), you can check out the implementation of the "Copy CSS to Clipboard" feature to see how to access the drop shadow parameters.
On Windows, the source code for this is in
Adobe Photoshop CS6\Required\CopyCSSToClipboard.jsx
On the Mac, the source code is in:
Adobe Photoshop CS6/Adobe Photoshop CS6.app/Contents/Required/CopyCSSToClipboard.jsx
(if you're looking in the Finder on the Mac, you'll need to control-click on the Photoshop app icon and select "Show Package Contents" to get to the Contents/Required folder).
Look for the routine cssToClip.addDropShadow for an example of how to extract the information. If you want to use routines from CopyCSSToClipboard.jsx in your own code, add the following snippet to your JSX file:
runCopyCSSFromScript = true;
if (typeof cssToClip == "undefined")
    $.evalFile( app.path + "/" + localize("$$$/ScriptingSupport/Required=Required") + "/CopyCSSToClipboard.jsx" );
Also, at the bottom of CopyCSSToClipboard.jsx, there are sample calls to cssToClip.dumpLayerAttr. This is a useful way to explore parameters you may want to access from your scripts that aren't accessible from the Photoshop DOM.
Be forewarned that code in the Required folder is subject to change in future versions.

I was able to make an ActionPrinter method that dumps out a tree of all the data in an action using C# and the photoshop COM wrapper.
The PrintCurrentLayer method will dump all the data in a layer, including all of the Layer Effects data.
static void PrintCurrentLayer(Application ps)
{
    var action = new ActionReference();
    action.PutEnumerated(ps.CharIDToTypeID("Lyr "), ps.CharIDToTypeID("Ordn"), ps.CharIDToTypeID("Trgt"));
    var desc = ps.ExecuteActionGet(action);//.GetObjectValue(ps.StringIDToTypeID("layerEffects"));//.getObjectValue(ps.StringIDToTypeID('dropShadow'));
    ActionPrinter(desc);
}
static void ActionPrinter(ActionDescriptor action)
{
    for (int i = 0; i < action.Count; i++)
    {
        var key = action.GetKey(i);
        if (action.HasKey(key))
        {
            //var charId = action.Application.TypeIDToCharID((int)key);
            //Debug.WriteLine(charId);
            switch (action.GetType(key))
            {
                case PsDescValueType.psIntegerType:
                    Debug.WriteLine("{0}: {1}", (PSConstants)key, action.GetInteger(key));
                    break;
                case PsDescValueType.psStringType:
                    Debug.WriteLine("{0}: \"{1}\"", (PSConstants)key, action.GetString(key));
                    break;
                case PsDescValueType.psBooleanType:
                    Debug.WriteLine("{0}: {1}", (PSConstants)key, action.GetBoolean(key));
                    break;
                case PsDescValueType.psDoubleType:
                    Debug.WriteLine("{0}: {1}", (PSConstants)key, action.GetDouble(key));
                    break;
                case PsDescValueType.psUnitDoubleType:
                    Debug.WriteLine("{0}: {1} {2}", (PSConstants)key, action.GetUnitDoubleValue(key), (PSConstants)action.GetUnitDoubleType(key));
                    break;
                case PsDescValueType.psEnumeratedType:
                    Debug.WriteLine("{0}: {1} {2}", (PSConstants)key, (PSConstants)action.GetEnumerationType(key), (PSConstants)action.GetEnumerationValue(key));
                    break;
                case PsDescValueType.psObjectType:
                    Debug.WriteLine($"{(PSConstants)key}: {(PSConstants)action.GetObjectType(key)} ");
                    Debug.Indent();
                    ActionPrinter(action.GetObjectValue(key));
                    Debug.Unindent();
                    break;
                case PsDescValueType.psListType:
                    var list = action.GetList(key);
                    Debug.WriteLine($"{(PSConstants)key}: List of {list.Count} Items");
                    Debug.Indent();
                    for (int count = 0; count < list.Count; count++)
                    {
                        var type = list.GetType(count);
                        Debug.WriteLine($"{count}: {type} ");
                        Debug.Indent();
                        switch (type)
                        {
                            case PsDescValueType.psObjectType:
                                ActionPrinter(list.GetObjectValue(count));
                                break;
                            case PsDescValueType.psReferenceType:
                                var reference = list.GetReference(count);
                                Debug.WriteLine(" Reference to a {0}", (PSConstants)reference.GetDesiredClass());
                                break;
                            case PsDescValueType.psEnumeratedType:
                                Debug.WriteLine(" {0} {1}", (PSConstants)list.GetEnumerationType(count), (PSConstants)list.GetEnumerationValue(count));
                                break;
                            default:
                                Debug.WriteLine($"UNHANDLED LIST TYPE {type}");
                                break;
                        }
                        Debug.Unindent();
                    }
                    Debug.Unindent();
                    break;
                default:
                    Debug.WriteLine($"{(PSConstants)key} UNHANDLED TYPE {action.GetType(key)}");
                    break;
            }
        }
    }
}

Recognize multiple lines in info.selectionText from Context Menu

My extension adds a context menu whenever a user selects some text on the page.
Then, using info.selectionText, I pass the selected text to a function that runs whenever the user clicks one of the items in my context menu (from http://code.google.com/chrome/extensions/contextMenus.html).
So far, all works OK.
Now, I got this cool request from one of the extension's users: to execute that same function once per line of the selected text.
A user would select, for example, 3 lines of text, and my function would be called 3 times, once per line, with the corresponding line of text.
So far I haven't been able to split info.selectionText in order to recognize each line...
info.selectionText returns a single line of text, and I could not find a way to split it.
Does anyone know if there's a way to do so? Is there any "hidden" character to use for the split?
Thanks in advance... in case you're interested, here's the link to the extension
https://chrome.google.com/webstore/detail/aagminaekdpcfimcbhknlgjmpnnnmooo

OK, since OnClickData's selectionText is only ever going to be plain text, you'll never be able to do it using this approach.
What I would do instead is inject a content script into each page and use something similar to the example below (inspired by this SO post: get selected text's html in div).
You could still use the context menu OnClickData hook as you do now, but when you receive it, instead of reading selectionText, use the event notification to trigger your content script to read the selection using x.Selector.getSelected(). That should give you what you want. The text stays selected after using the context menu, so you should have no problem reading the selected text.
if (!window.x) {
    window.x = {};
}
// https://stackoverflow.com/questions/5669448/get-selected-texts-html-in-div
x.Selector = {};
x.Selector.getSelected = function() {
    var html = "";
    if (typeof window.getSelection != "undefined") {
        var sel = window.getSelection();
        if (sel.rangeCount) {
            var container = document.createElement("div");
            for (var i = 0, len = sel.rangeCount; i < len; ++i) {
                container.appendChild(sel.getRangeAt(i).cloneContents());
            }
            html = container.innerHTML;
        }
    } else if (typeof document.selection != "undefined") {
        if (document.selection.type == "Text") {
            html = document.selection.createRange().htmlText;
        }
    }
    return html;
};
$(document).ready(function() {
    $(document).bind("mouseup", function() {
        var mytext = x.Selector.getSelected();
        alert(mytext);
        console.log(mytext);
    });
});
http://jsfiddle.net/richhollis/vfBGJ/4/
See also: Chrome Extension: how to capture selected text and send to a web service
