StormCrawler XPathFilter - internal representation

When StormCrawler fetches a website, it applies a configured XPathFilter to an HTML representation that is not the original one: tags get inserted, DIVs become H3s, and so on. For example, the following configuration puts HTML code into Elasticsearch that differs from the original:
{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.html": [
          "//HTML"
        ]
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
      "name": "DomainParseFilter",
      "params": {
        "key": "domain",
        "byHost": false
      }
    }
  ]
}
This makes it hard to write XPath expressions based on the original source code of the website. Is there any way to configure StormCrawler so that it applies the XPathFilter expressions to the original website source code?

Which version of StormCrawler are you on? Are you using Tika or Jsoup for the parsing? AFAIK Jsoup does not modify the content, but Tika probably does. I would recommend using the JSoup-based ParserBolt for HTML content and Tika for everything else.
You can use the DebugParseFilter to see what the DOM looks like, as in the sketch below.
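For illustration, a minimal parsefilters.json sketch that enables it (the class name follows the package layout of the filters above; treat it as an assumption and check it against your StormCrawler version):
{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter",
      "name": "DebugParseFilter",
      "params": {}
    }
  ]
}
The filter dumps the DOM it is given to a file, so you can write your XPath expressions against exactly what the XPathFilter will see.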

Related

How to build JSON from remote Swagger documentation (websocket based API)

I am currently grabbing the entire swagger.json contents:
import json
import requests

swagger_link = "https://<SWAGGER_URL>/swagger/v1/swagger.json"
swagger_link_contents = requests.get(swagger_link)

# write the returned swagger to a file
swagger_return = swagger_link_contents.json()
with open("swagger_raw.json", "w") as f:
    f.write(json.dumps(swagger_return, indent=4))
The file is filled with beautiful JSON now. But there are a ton of objects in there, quite a few of which I don't need. Say I wanted to grab the command "UpdateCookieCommand" from my document:
"parameters": [
{
"name": "command",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/UpdateCookieCommand"
},
"x-nullable": false
}
],
but it also has additional mentions later in the document...
UpdateCookieCommand": {
"allOf": [
{
"$ref": "#/definitions/BaseCommand"
},
{
"type": "object",
"required": [
"command",
"id"
],
The latter object is really what I want to take from the document: it tells me what's required for a nominal API command to update a cookie. The hope is that a script could look at every release, absorb the changes in the swagger, and build new commands.
What methodology would you all recommend to accomplish this? I considered shelling out to Linux commands to just awk, sed, and grep it down to what I want, but maybe there's a more elegant way (see the sketch below). Also, I won't know the number of lines or the location of anything, since a new Swagger will be produced each release, and my script will need to run automatically in a CI/CD pipeline.
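One option, sticking with Python: since the document is JSON, parse it and address objects by key instead of by line. A minimal sketch, assuming the layout shown above (in Swagger 2.0 the schemas live under the top-level "definitions" key, so line counts and positions don't matter):

import json

# load the swagger document saved earlier
with open("swagger_raw.json") as f:
    swagger = json.load(f)

# schemas are addressed by name under "definitions",
# which is what the "$ref": "#/definitions/..." pointers refer to
command = swagger["definitions"]["UpdateCookieCommand"]

# e.g. pull the required fields out of the object part of the allOf
for part in command.get("allOf", []):
    if part.get("type") == "object":
        print(part.get("required", []))  # ['command', 'id']

Because the lookup is by key, the script keeps working when a new swagger.json is generated each release, which fits a CI/CD job better than awk/sed/grep.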

How to read JSON files with nested categories in Node.js

I am using the Perspective API (you can check it out at http://perspectiveapi.com/) for my Discord application. I am sending an analyze request and the API returns this:
{
  "attributeScores": {
    "TOXICITY": {
      "spanScores": [
        {
          "begin": 0,
          "end": 22,
          "score": {
            "value": 0.9345592,
            "type": "PROBABILITY"
          }
        }
      ],
      "summaryScore": {
        "value": 0.9345592,
        "type": "PROBABILITY"
      }
    }
  },
  "languages": [
    "en"
  ],
  "detectedLanguages": [
    "en"
  ]
}
I need to get the "value" in "summaryScore" as an integer. I searched on Google, but I only found examples of reading values from flat or singly nested JSON files. How can I do that?
Note: Sorry if I'm asking something really easy or if I've slaughtered the English; it's not my primary language, and I am not very experienced with Node.js.
First you must make sure the object you have received is recognized by Node.js as a JSON object; see this answer for how. Once the object is stored as a JSON object, reading from nested objects or arrays is as easy as doing this:
object.attributeScores.TOXICITY.summaryScore.value
If you look closely at the object and its structure, you can see that the root object (the first {}) contains three values: "attributeScores", "languages" and "detectedLanguages".
The field you are looking for lives inside the "summaryScore" object, which in turn lives inside the "TOXICITY" object, and so on. Thus you need to traverse the object structure until you get to the value you need.
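Putting that together, a minimal Node.js sketch (the body variable is a placeholder for however you receive the raw response):

// parse the raw response string into an object first
const result = JSON.parse(body);

// then walk the nested structure down to the summary score
const score = result.attributeScores.TOXICITY.summaryScore.value;

// the value is a probability between 0 and 1; scale and round
// if you really need an integer
const percent = Math.round(score * 100);
console.log(percent); // 93 for the example above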

Azure CORS Variable Allowed Origins

Whenever a new release pipeline is run in Azure DevOps, the URL is changed. Currently my ARM template has a hard-coded URL, which is annoying to keep updating manually.
"cors": {
"allowedOrigins": [
"[concat('https://',parameters('storage_account_name'),'.z10.web.core.windows.net')]"
}
The only thing that changes is the 10 part of the z10, so essentially I want it to be something like
[concat('https://',parameters('storage_account_name'),'.z', '*', '.web.core.windows.net')]
I don't know if something like that is valid, but essentially the point is that the CORS policy should accept the URL regardless of the z number.
Basically speaking, this is not possible, because the CORS standard (see docs) allows only exact origins, the wildcard *, or null.
For instance, the ARM schema for Azure Storage also follows this pattern, allowing you to supply either a list of exact origins or a wildcard (see ARM docs).
However, if you know your website name, in your ARM template you can retrieve the full host name and use it in your CORS settings:
"[reference(resourceId('Microsoft.Web/sites', parameters('SiteName')), '2018-02-01').defaultHostName]"
The same works for a static website (which I guess is your case) if you know the storage account name:
"[reference(concat('Microsoft.Storage/storageAccounts/', variables('storageAccountName')), '2019-06-01', 'Full').properties.primaryEndpoints.web]"
Advanced reference output manipulation
Answering the comment: if you would like to replace some characters in the output of the reference function, the easiest way is to use the built-in replace function (see docs).
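For instance, a one-line sketch (reusing the illustrative variable name from the snippet further down) that strips the scheme from a storage endpoint:
"[replace(reference(variables('storageNameCdn')).primaryEndpoints.web, 'https://', '')]"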
In case you need a more advanced scenario, here is my solution, which introduces a custom function that removes https:// and the trailing /, so that https://contonso.com/ is transformed to contonso.com:
"functions": [
{
"namespace": "lmc",
"members": {
"replaceUri": {
"parameters": [
{
"name": "uriString",
"type": "string"
}
],
"output": {
"type": "string",
"value": "[replace(replace(parameters('uriString'), 'https://',''), '/','')]"
}
}
}
}
],
# ...(some code)...
"resources": [
# ... (some resource)...:
"properties": {
"hostName": "[lmc.replaceUri(reference(variables('storageNameCdn')).primaryEndpoints.blob)]"
}
]

ElasticSearch: Full-Text Search made easy

I am investigating the possibility of switching from SphinxSearch to ElasticSearch.
What is good about SphinxSearch is that full-text search just works out of the box at a pretty good level. Making it work in ElasticSearch turned out to be not as easy as I expected.
In my project I have a search box with typeahead: I type Clint E and see a dropdown with results including Clint Eastwood in first place; I type robert down and see Robert Downey Jr. in first place. All this I achieved with SphinxSearch out of the box, just by providing my DB credentials and an SQL query to pull the necessary fields.
With ElasticSearch, on the other hand, I can't get satisfying results even after a day of reading about the Fuzzy Like This query, matching, partial matching and more. There is a lot of information, but it does not make the task easier. I feel like I need a PhD in search just to make it work at the simplest level.
So far I have ended up with this configuration:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stem": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "porter_stem"
          ]
        }
      }
    }
  },
  "mappings": {
    "movies": {
      "dynamic": true,
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "stem"
        }
      }
    }
  }
}
The query looks like this:
{
  "query": {
    "query_string": {
      "query": "clint eastw",
      "default_field": "title"
    }
  }
}
But the quality of search in this case is not satisfying at all: back to my example, it cannot find the Clint Eastwood profile until I type his name completely.
Then I tried to use:
{
  "query": {
    "fuzzy_like_this": {
      "fields": [
        "title"
      ],
      "like_text": "clint eastw",
      "max_query_terms": 25,
      "fuzziness": 0.5
    }
  }
}
It helps, but not much: now I can find what I need with the shorter request clint eastwo, and after some manipulation of the parameters even with clint eastw, but it is still not encouraging.
So I wonder, is there a simple recipe for cooking decent full-text search with ElasticSearch? I have spent a day reading but didn't find a solution.
A couple of screenshots to demonstrate what I am talking about:
Elastic: the name is almost complete, but there is no expected result (and no better match either).
One letter later, Elastic found it!
At the same moment, Sphinx is shining :)
Elasticsearch ships with an auto-completion suggester.
You should not put this into regular query functionality: queries work on the token level, not on partial tokens.
Go for the completion suggester; it also has support for fuzzy matching. A sketch follows below.
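For illustration, a minimal sketch of the two pieces (the index layout and names are made up, and the exact syntax depends on your Elasticsearch version): a dedicated completion field in the mapping,

{
  "mappings": {
    "properties": {
      "title_suggest": {
        "type": "completion"
      }
    }
  }
}

and a fuzzy prefix suggest request against it:

{
  "suggest": {
    "movie-suggest": {
      "prefix": "clint eastw",
      "completion": {
        "field": "title_suggest",
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}

Unlike the query_string attempt above, the suggester matches on the partial prefix as you type, and the fuzzy block tolerates a typo in it.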

Drupal 7 Search Autocomplete module never loads "suggestions"

I am attempting to use the Search Autocomplete module, version 7.x-4.0-alpha2.
I have added a form in the "search_autocomplete" configuration section, and it is enabled.
I created a view that returns taxonomy terms in JSON format. Here is an example of the JSON output from that view:
[{
  "value": "aquaculture",
  "fields": {
    "name_i18n": "aquaculture"
  },
  "group": {
    "group_id": "aquaculture",
    "group_name": "aquaculture"
  }
}, {
  "value": "climate change",
  "fields": {
    "name_i18n": "climate change"
  },
  "group": {
    "group_id": "climatechange",
    "group_name": "climate change"
  }
}, {
  "value": "coastal development",
  "fields": {
    "name_i18n": "coastal development"
  },
  "group": {
    "group_id": "coastaldevelopment",
    "group_name": "coastal development"
  }
}, {
  "value": "deforestation",
  "fields": {
    "name_i18n": "deforestation"
  },
  "group": {
    "group_id": "deforestation",
    "group_name": "deforestation"
  }
}, {
  "value": "extinction",
  "fields": {
    "name_i18n": "extinction"
  },
  "group": {
    "group_id": "extinction",
    "group_name": "extinction"
  }
}]
I set the Suggestion Source to the view. I used its autocomplete feature, so I know my "search autocomplete" suggestion source is configured correctly. The ID selector of a form in a different view (not the JSON taxonomy one) is used. The permissions for the module are correct.
Now, when I load the view that has the Search API form, I see a little blue circle icon spinning to the right of the Search API form field. It spins the whole time, and no suggestions are ever populated in the search text box.
I know I have the right form configured, because if I set a different form ID in the "search autocomplete" configuration and reload the view page, the spinning blue circle is gone.
Does anyone have any idea what might be wrong?
UPDATE: I went to my modules page (I wasn't changing anything there, just visiting it) and saw an error at the top of the page regarding the Search Autocomplete module.
UPDATE: I changed the Search Autocomplete configuration to point not at my JSON view but at an outside URL, http://google.com. Of course this is not a valid JSON endpoint, but I wanted to see whether it would at least attempt to fetch its JSON data from google.com. Watching in Firebug showed that it doesn't even attempt to go to google.com for its JSON data. I think something similar is happening with my JSON view: it's just not even going there for the data.
That was probably due to a bug in the alpha version. When you configure the JSON endpoint using the Views UI, you should see a list of items in the "preview" section underneath. The items listed there should be the ones that appear as suggestions in the search.
