Origen: Problems with flow branching using test ids - origen-sdk

I'm having some trouble with conditional flow branching based on the result of a previous test. This flow code is intended to generate a fall-back test point if the first test fails:
bist :cpu, ip: :L2, testmode: :speed, cond: :pmin, id: :cpu_pmin
bist :cpu, ip: :L2, testmode: :speed, cond: :pmax, if_failed: :cpu_pmin
Using Origen to render the flow with this code generates two consecutive tests with no branching:
run(cpu_L2_speed_pmin_95CE6EC);
run(cpu_L2_speed_pmax_95CE6EC);
This does appear to work correctly when I use an id attached to a group, but not to an individual test.
If I replace the second test call with a call to bin instead, I get an error:
bist :cpu, ip: :L2, testmode: :speed, cond: :pmin, id: :cpu_pmin
bin 10, if_failed: :cpu_pmin
produces the error message:
[ERROR] 1.464[0.927] || Test ID cpu_pmin is referenced in flow func in the following lines, but it is never defined:
[ERROR] 1.465[0.000] || /<origen_pathname>/program/func.rb:41

Pretty sure your bist method in the interface is not passing the options along when it is generating the flow entry.
You probably have something like:
flow.test(my_test_suite)
but you need to let the flow generator know about the :id, :if_failed and any other flow control options which are in play. This will probably do it:
flow.test(my_test_suite, options)
If you were to take your interface method out of the equation for a minute and change to this, then you should see it working:
flow.test :cpu, ip: :L2, testmode: :speed, cond: :pmin, id: :cpu_pmin
flow.test :cpu, ip: :L2, testmode: :speed, cond: :pmax, if_failed: :cpu_pmin
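In other words, the interface method just needs to forward its options hash when it creates the flow entry. A rough sketch of the shape (the test-suite setup line is illustrative only, not your actual bist implementation):
# Hypothetical interface method; only the final line is the important part.
def bist(name, options = {})
  suite = test_suites.add(name, options)  # build the test suite however you already do
  # ... configure the suite from :ip, :testmode, :cond, etc. ...
  flow.test(suite, options)               # forward the options so :id / :if_failed reach the flow generator
end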

Related

Stormcrawler not retrieving all text content from web page

I'm attempting to use Stormcrawler to crawl a set of pages on our website, and while it is able to retrieve and index some of the page's text, it's not capturing a large amount of other text on the page.
I've installed Zookeeper, Apache Storm, and Stormcrawler using the Ansible playbooks provided here (thank you a million for those!) on a server running Ubuntu 18.04, along with Elasticsearch and Kibana. For the most part, I'm using the configuration defaults, but have made the following changes:
For the Elastic index mappings, I've enabled _source: true, and turned on indexing and storing for all properties (content, host, title, url)
In the crawler-conf.yaml configuration, I've commented out all textextractor.include.pattern and textextractor.exclude.tags settings, to enforce capturing the whole page
After re-creating fresh ES indices, running mvn clean package, and then starting the crawler topology, stormcrawler begins doing its thing and content starts appearing in Elasticsearch. However, for many pages, the content that's retrieved and indexed is only a subset of all the text on the page, and usually excludes the main page text we are interested in.
For example, the text in the following XML path is not returned/indexed:
<html> <body> <div#maincontentcontainer.container> <div#docs-container> <div> <div.row> <div.col-lg-9.col-md-8.col-sm-12.content-item> <div> <div> <p> (text)
While the text in this path is returned:
<html> <body> <div> <div.container> <div.row> <p> (text)
Are there any additional configuration changes that need to be made beyond commenting out all of the specific tag include and exclude patterns? From my understanding of the documentation, the defaults for those options should already cause the whole page to be indexed.
I would greatly appreciate any help. Thank you for the excellent software.
Below are my configuration files:
crawler-conf.yaml
config:
  topology.workers: 3
  topology.message.timeout.secs: 1000
  topology.max.spout.pending: 100
  topology.debug: false
  fetcher.threads.number: 100
  # override the JVM parameters for the workers
  topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"
  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata
  # metadata to transfer to the outlinks
  # metadata.transfer:
  #  - customMetadataName
  # lists the metadata to persist to storage
  metadata.persist:
    - _redirTo
    - error.cause
    - error.source
    - isSitemap
    - isFeed
  http.agent.name: "My crawler"
  http.agent.version: "1.0"
  http.agent.description: ""
  http.agent.url: ""
  http.agent.email: ""
  # The maximum number of bytes for returned HTTP response bodies.
  http.content.limit: -1
  # FetcherBolt queue dump => comment out to activate
  # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"
  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"
  # revisit a page daily (value in minutes)
  fetchInterval.default: 1440
  # revisit a page with a fetch error after 2 hours (value in minutes)
  fetchInterval.fetch.error: 120
  # never revisit a page with an error (or set a value in minutes)
  fetchInterval.error: -1
  # text extraction for JSoupParserBolt
  # textextractor.include.pattern:
  #  - DIV[id="maincontent"]
  #  - DIV[itemprop="articleBody"]
  #  - ARTICLE
  # textextractor.exclude.tags:
  #  - STYLE
  #  - SCRIPT
  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
    - parse.title=title
    - parse.keywords=keywords
    - parse.description=description
    - domain=domain
  # Metrics consumers:
  topology.metrics.consumer.register:
    - class: "org.apache.storm.metric.LoggingMetricsConsumer"
      parallelism.hint: 1
  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  selenium.addresses: "http://localhost:9515"
es-conf.yaml
config:
  # ES indexer bolt
  es.indexer.addresses: "localhost"
  es.indexer.index.name: "content"
  # es.indexer.pipeline: "_PIPELINE_"
  es.indexer.create: false
  es.indexer.bulkActions: 100
  es.indexer.flushInterval: "2s"
  es.indexer.concurrentRequests: 1
  # ES metricsConsumer
  es.metrics.addresses: "http://localhost:9200"
  es.metrics.index.name: "metrics"
  # ES spout and persistence bolt
  es.status.addresses: "http://localhost:9200"
  es.status.index.name: "status"
  es.status.routing: true
  es.status.routing.fieldname: "key"
  es.status.bulkActions: 500
  es.status.flushInterval: "5s"
  es.status.concurrentRequests: 1
  # spout config #
  # positive or negative filters parsable by the Lucene Query Parser
  # es.status.filterQuery:
  #  - "-(key:stormcrawler.net)"
  #  - "-(key:digitalpebble.com)"
  # time in secs for which the URLs will be considered for fetching after a ack of fail
  spout.ttl.purgatory: 30
  # Min time (in msecs) to allow between 2 successive queries to ES
  spout.min.delay.queries: 2000
  # Delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
  spout.reset.fetchdate.after: 120
  es.status.max.buckets: 50
  es.status.max.urls.per.bucket: 2
  # field to group the URLs into buckets
  es.status.bucket.field: "key"
  # fields to sort the URLs within a bucket
  es.status.bucket.sort.field:
    - "nextFetchDate"
    - "url"
  # field to sort the buckets
  es.status.global.sort.field: "nextFetchDate"
  # CollapsingSpout : limits the deep paging by resetting the start offset for the ES query
  es.status.max.start.offset: 500
  # AggregationSpout : sampling improves the performance on large crawls
  es.status.sample: false
  # max allowed duration of a query in sec
  es.status.query.timeout: -1
  # AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
  # use it as nextFetchDate
  es.status.recentDate.increase: -1
  es.status.recentDate.min.gap: -1
  topology.metrics.consumer.register:
    - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
      parallelism.hint: 1
      #whitelist:
      #  - "fetcher_counter"
      #  - "fetcher_average.bytes_fetched"
      #blacklist:
      #  - "__receive.*"
es-crawler.flux
name: "crawler"
includes:
- resource: true
file: "/crawler-default.yaml"
override: false
- resource: false
file: "crawler-conf.yaml"
override: true
- resource: false
file: "es-conf.yaml"
override: true
spouts:
- id: "spout"
className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
parallelism: 10
- id: "filespout"
className: "com.digitalpebble.stormcrawler.spout.FileSpout"
parallelism: 1
constructorArgs:
- "."
- "seeds.txt"
- true
bolts:
- id: "filter"
className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
parallelism: 3
- id: "partitioner"
className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
parallelism: 3
- id: "fetcher"
className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
parallelism: 3
- id: "sitemap"
className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
parallelism: 3
- id: "parse"
className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
parallelism: 12
- id: "index"
className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
parallelism: 3
- id: "status"
className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
parallelism: 3
- id: "status_metrics"
className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
parallelism: 3
streams:
- from: "spout"
to: "partitioner"
grouping:
type: SHUFFLE
- from: "spout"
to: "status_metrics"
grouping:
type: SHUFFLE
- from: "partitioner"
to: "fetcher"
grouping:
type: FIELDS
args: ["key"]
- from: "fetcher"
to: "sitemap"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "sitemap"
to: "parse"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "parse"
to: "index"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "fetcher"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "sitemap"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "parse"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "index"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "filespout"
to: "filter"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "filter"
to: "status"
grouping:
streamId: "status"
type: CUSTOM
customClass:
className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
constructorArgs:
- "byDomain"
parsefilters.json
{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
          "//*[@name=\"description\"]/@content",
          "//*[@name=\"Description\"]/@content"
        ],
        "parse.title": [
          "//TITLE",
          "//META[@name=\"title\"]/@content"
        ],
        "parse.keywords": "//META[@name=\"keywords\"]/@content"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
      "name": "LinkParseFilter",
      "params": {
        "pattern": "//FRAME/@src"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
      "name": "DomainParseFilter",
      "params": {
        "key": "domain",
        "byHost": false
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata",
      "name": "CommaSeparatedToMultivaluedMetadata",
      "params": {
        "keys": ["parse.keywords"]
      }
    }
  ]
}
Attempting to use Chromedriver
I installed the latest versions of Chromedriver and Google Chrome for Ubuntu.
First I start chromedriver in headless mode at localhost:9515 as the stormcrawler user (via a separate Python shell, as shown below), and then I restart the stormcrawler topology (also as the stormcrawler user), but I end up with a stack of errors related to Chrome. The odd thing, however, is that I can confirm chromedriver is running OK within the Python shell directly, and I can confirm that both the driver and the browser are actively running via ps -ef. This same stack of errors also occurs when I simply start chromedriver from the command line (i.e., chromedriver --headless &).
Starting chromedriver in headless mode (in python3 shell)
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--window-size=1200x600')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-setuid-sandbox')
options.add_argument('--disable-extensions')
options.add_argument('--disable-infobars')
options.add_argument('--remote-debugging-port=9222')
options.add_argument('--user-data-dir=/home/stormcrawler/cache/google/chrome')
options.add_argument('--disable-gpu')
options.add_argument('--profile-directory=Default')
options.binary_location = '/usr/bin/google-chrome'
driver = webdriver.Chrome(chrome_options=options, port=9515, executable_path=r'/usr/bin/chromedriver')
Stack trace from starting stormcrawler topology
Run command: storm jar target/stormcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 60000
9486 [Thread-26-fetcher-executor[3 3]] ERROR o.a.s.util - Async loop died!
java.lang.RuntimeException: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Build info: version: '4.0.0-alpha-6', revision: '5f43a29cfc'
System info: host: 'stormcrawler-dev', ip: '127.0.0.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.15.0-33-generic', java.version: '1.8.0_282'
Driver info: driver.version: RemoteWebDriver
remote stacktrace: #0 0x55d590b21e89 <unknown>
at com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol.configure(RemoteDriverProtocol.java:101) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at com.digitalpebble.stormcrawler.protocol.ProtocolFactory.<init>(ProtocolFactory.java:69) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at com.digitalpebble.stormcrawler.bolt.FetcherBolt.prepare(FetcherBolt.java:818) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at org.apache.storm.daemon.executor$fn__10180$fn__10193.invoke(executor.clj:803) ~[storm-core-1.2.3.jar:1.2.3]
at org.apache.storm.util$async_loop$fn__624.invoke(util.clj:482) [storm-core-1.2.3.jar:1.2.3]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
...
Confirming that chromedriver and chrome are both running and reachable
~/stormcrawler$ ps -ef | grep -i 'driver'
stormcr+ 18862 18857 0 14:28 pts/0 00:00:00 /usr/bin/chromedriver --port=9515
stormcr+ 18868 18862 0 14:28 pts/0 00:00:00 /usr/bin/google-chrome --disable-background-networking --disable-client-side-phishing-detection --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-gpu --disable-hang-monitor --disable-infobars --disable-popup-blocking --disable-prompt-on-repost --disable-setuid-sandbox --disable-sync --enable-automation --enable-blink-features=ShadowDOMV0 --enable-logging --headless --log-level=0 --no-first-run --no-sandbox --no-service-autorun --password-store=basic --profile-directory=Default --remote-debugging-port=9222 --test-type=webdriver --use-mock-keychain --user-data-dir=/home/stormcrawler/cache/google/chrome --window-size=1200x600
stormcr+ 18899 18877 0 14:28 pts/0 00:00:00 /opt/google/chrome/chrome --type=renderer --no-sandbox --disable-dev-shm-usage --enable-automation --enable-logging --log-level=0 --remote-debugging-port=9222 --test-type=webdriver --allow-pre-commit-input --ozone-platform=headless --field-trial-handle=17069524199442920904,10206176048672570859,131072 --disable-gpu-compositing --enable-blink-features=ShadowDOMV0 --lang=en-US --headless --enable-crash-reporter --lang=en-US --num-raster-threads=1 --renderer-client-id=4 --shared-files=v8_context_snapshot_data:100
~/stormcrawler$ sudo netstat -lp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 localhost:9222 0.0.0.0:* LISTEN 18026/google-chrome
tcp 0 0 localhost:9515 0.0.0.0:* LISTEN 18020/chromedriver
IIRC you need to set some additional config to work with ChromeDriver.
Alternatively (I haven't tried it yet), https://hub.docker.com/r/browserless/chrome would be a nice way of handling Chrome in a Docker container.
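To illustrate the kind of config meant above: the idea is to pass the Chrome flags through the WebDriver capabilities in crawler-conf.yaml, so that the Chrome instance chromedriver launches for StormCrawler gets the same --headless/--no-sandbox arguments as your Python-launched one. A hedged sketch (the selenium.capabilities key name is from memory; check the selenium section of crawler-default.yaml for your StormCrawler version):
# assumption: RemoteDriverProtocol builds its capabilities from a selenium.capabilities map
selenium.capabilities:
  "goog:chromeOptions":
    args:
      - "--headless"
      - "--no-sandbox"
      - "--disable-dev-shm-usage"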

queryText being sent to Dialogflow is not the original user query

The user input / queryText being sent to Dialogflow is not the expected, original user query.
(screenshot: simulator query manipulation)
I enabled "Log interactions to Google Cloud" in my Dialogflow project's settings. What I'm seeing is multiple "assistant_action" resources before the actual request that goes to DF. In the example above, this is what I see:
(screenshot: GCP logs)
With the first debug resource showing post data with:
"inputs":[{"rawInputs":[{"inputType":"UNSPECIFIED_INPUT_TYPE","query":"how long has it been on the market"}]
And
resource: {
  type: "assistant_action"
  labels: {
    project_id: "<MY-PROJECT-ID>"
    version_id: ""
    action_id: ""
  }
},
timestamp: "2021-03-05T18:41:44.142202856Z"
severity: "DEBUG"
labels: {
  channel: "production"
  querystream: "GOOGLE_USER"
  source: "AOG_REQUEST_RESPONSE"
}
The subsequent requests are the same but with modified input queries ("how long has it been on the market" -> "how long has something been on the market" -> "how long has us FDA been on the market"), with the last one being the actual user query sent, the channel being preview, and the action_id being "actions.intent.TEXT".
resource: {
  type: "assistant_action"
  labels: {
    project_id: "<MY-PROJECT-ID>"
    version_id: ""
    action_id: "actions.intent.TEXT"
  }
},
timestamp: "2021-03-05T18:41:45.942019959Z"
severity: "DEBUG"
labels: {
  channel: "preview"
  querystream: "GOOGLE_USER"
  source: "AOG_REQUEST_RESPONSE"
}
I should note that I am testing current drafts of an AoG project and have no releases, let alone a production release. I have a denied beta (because of branding issues), which I address with separate AoG/DF projects for PROD. I do not have any intents enabled for slot filling or any required entity parameters. This is just one example, but I have been noticing many occurrences of this issue.
What is happening here? Why is the original user input being manipulated? What are all these interactions we are seeing before the expected request/response cycle?
After contacting someone at Google Cloud, I was informed that this had been raised by others and that the AoG devs were looking into it.
As of a Mar 24 2021 release, I can no longer replicate this Entity Resolution issue.

Grok filter: want to see only the process name

I want to see only the process cmd.exe.
Example:
New Process Name: C:\Windows\System32\cmd.exe Token Elevation Type: %%1938 Creator Process ID: 0x1a0
Grok Filter:
New Process Name: %{GREEDYDATA}\\%{GREEDYDATA:Process}
Output:
{
  "Process": [
    [
      "cmd.exe Token Elevation Type: %%1938 Creator Process ID: 0x1a0"
    ]
  ]
}
How do I get it to capture only cmd.exe and not "Token Elevation Type: %%1938 Creator Process ID: 0x1a0"?
GREEDYDATA usually means "everything". I find it usually isn't useful except at the end of a pattern (as a catch-all).
So, you're asking for everything after the backslash, which is what you're getting.
How about:
New Process Name: %{GREEDYDATA}\\%{NOTSPACE:Process}
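For reference, a minimal Logstash filter sketch using that pattern (the message field name is an assumption; point it at whatever field holds the event text):
filter {
  grok {
    # NOTSPACE stops at the first whitespace, so only "cmd.exe" ends up in Process
    match => { "message" => "New Process Name: %{GREEDYDATA}\\%{NOTSPACE:Process}" }
  }
}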

How to read an Ansible YAML config file in Groovy

I want to write a Job DSL build script for Jenkins in Groovy which automatically creates deploy jobs for our projects. Each project has a general YAML file for Ansible roles and host parameters, which I want to read and use to configure the job.
The problem is that so far I'm using SnakeYAML to read the YAML file, but it returns an ArrayList (rather than the Map I expected), which I cannot use efficiently.
Does anyone know a better solution?
My sample YAML file:
---
- hosts: app.host
  roles:
    - role: app-db
      db_name: myproje_db
      db_port: "3306"
      migrate_module: "my-proje-api"
    - role: java-app
      app_name: "myproje-api"
      app_artifact_name: "my-proje-api"
      app_links:
        - myproje_db
I read the file from the workspace in my main Groovy script:
InputStream configFile = streamFileFromWorkspace('data/config.yml')
and process it in another function of another class:
import org.yaml.snakeyaml.Yaml

public String configFileReader(def out, InputStream configFile) {
    Yaml configFileYml = new Yaml()
    def map = configFileYml.load(configFile)
}
The class of the loaded object turns out to be ArrayList, not a Map.
That's the expected output: the configuration starts with a "-", which represents a list. It's a collection of hosts, and each host has a set of roles.
If you want to iterate over each host, you can do:
Yaml configFileYml = new Yaml()
configFileYml.load(configFile).each { host -> ... }
When this configuration is read, it's equivalent to the following structure (in groovy format):
[ // collection of map (host)
    [ // 1 map for each host
        hosts: "app.host",
        roles: [ // collection of map (role)
            [ // 1 map for each role
                role: 'app-db',
                db_name: 'myproje_db',
                db_port: "3306",
                migrate_module: "my-proje-api"
            ],
            [
                role: 'java-app',
                app_name: "myproje-api",
                app_artifact_name: "my-proje-api",
                app_links: ['myproje_db']
            ]
        ]
    ]
]
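Building on that structure, here is a small sketch of pulling the values out for the Jenkins job (the property names match the sample file above; configFile is the InputStream from streamFileFromWorkspace):
import org.yaml.snakeyaml.Yaml

List hosts = new Yaml().load(configFile)   // top-level "-" entries => a List of Maps
hosts.each { host ->
    println "hosts: ${host.hosts}"
    host.roles.each { role ->
        // each role is a Map whose keys vary per role (db_name, app_name, ...)
        println "  role: ${role.role}, settings: ${role.findAll { k, v -> k != 'role' }}"
    }
}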

I am using hapi-job-queue for certain cron jobs but am having trouble with the schedule

From the documentation of hapi-job-queue I found that it supports Later-style time definitions in the schedule param. So I tried:
server.register([
  {
    register: require('hapi-job-queue'),
    options: {
      connectionUrl: Config.database.url,
      endpoint: '',
      auth: false,
      jobs: [
        {
          name: 'test-job',
          enabled: true,
          schedule: 'at 04:59 pm',
          method: someMethods
        }
      ]
    }
  }
]);
But the code does not seem to work. If I try schedule: 'every 5 seconds'
everything works fine, and I even tried schedule: 'at 5:00 pm', which is a valid Later-style time definition. Am I missing something?
I tried your code and it seems to work correctly. By the way, you can verify that the timing you specified was parsed correctly by checking the 'Jobs' collection on the MongoDB instance you specified in
'Config.database.url'.
Look for the document having 'test-job' as its _id field and check the 'nextRun' property; you should see the correct timing: '2016-08-13T16:59:00.000+0000' (in my case).
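For example, something along these lines in the mongo shell (the collection name is a guess; use whatever collection the plugin created in your database):
// hedged sketch: inspect the schedule parsed by hapi-job-queue
db.getCollection('jobs').findOne({ _id: 'test-job' }, { nextRun: 1 })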
