arangoimport: edge attribute missing or invalid - arangodb

ArangoDB Version: 3.8
Storage Engine: RocksDB
Deployment Mode: Single Server
Deployment Strategy: Manual Start
Operating System: Ubuntu 20.04
Total RAM in your machine: 32 GB
Disks in use: SSD
Used Package: Ubuntu .deb
Affected feature: arangoimport
(base) raphy@pc:~$ arangodb
2021-11-04T09:34:45+01:00 |INFO| Starting arangodb version 0.15.3, build 814f8be component=arangodb
2021-11-04T09:34:45+01:00 |INFO| Using storage engine 'rocksdb' component=arangodb
2021-11-04T09:34:45+01:00 |INFO| Serving as master with ID 'ef664d42' on :8528... component=arangodb
2021-11-04T09:34:45+01:00 |INFO| Waiting for 3 servers to show up. component=arangodb
2021-11-04T09:34:45+01:00 |INFO| Use the following commands to start other servers: component=arangodb
arangodb --starter.data-dir=./db2 --starter.join 127.0.0.1
arangodb --starter.data-dir=./db3 --starter.join 127.0.0.1
2021-11-04T09:34:45+01:00 |INFO| ArangoDB Starter listening on 0.0.0.0:8528 (:8528) component=arangodb
I'm trying to import data in this way:
(base) raphy@pc:~$ arangoimport --server.database "ConceptNet" --collection "rel_type" "./ConceptNet/conceptnet.jsonl"
But I get these errors:
Connected to ArangoDB 'http+tcp://127.0.0.1:8529, version: 3.8.2, database: 'ConceptNet', username: 'root'
----------------------------------------
database: ConceptNet
collection: rel_type
create: no
create database: no
source filename: ./ConceptNet/conceptnet.jsonl
file type: json
threads: 2
on duplicate: error
connect timeout: 5
request timeout: 1200
----------------------------------------
Starting JSON import...
2021-11-04T14:49:48Z [165643] INFO [9ddf3] {general} processed 1945 bytes (3%) of input file
2021-11-04T14:49:48Z [165643] WARNING [e5a29] {general} at position 0: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"pm","_to":"am","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/fr","process":"/s/process/wikiparsec/2"}}
2021-11-04T14:49:48Z [165643] WARNING [e5a29] {general} at position 1: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"red","_to":"amber","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
2021-11-04T14:49:48Z [165643] WARNING [e5a29] {general} at position 2: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"proprium","_to":"apelativum","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
2021-11-04T14:49:48Z [165643] WARNING [e5a29] {general} at position 3: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"s","_to":"beze\t","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
2021-11-04T14:49:48Z [165643] WARNING [e5a29] {general} at position 4: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"euphoria","_to":"bad_trip","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
2021-11-04T14:49:48Z [165643] WARNING [e5a29] {general} at position 5: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"gooder","_to":"badder","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
2021-11-04T14:49:48Z [165643] WARNING [e5a29] {general} at position 6: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"goodest","_to":"baddest","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
2021-11-04T14:49:48Z [165643] WARNING [e5a29] {general} at position 7: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"goodie","_to":"baddie","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2","contributor":"/s/resource/wiktionary/fr"}}
2021-11-04T14:49:48Z [165643] WARNING [e5a29] {general} at position 8: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"windy","_to":"calm","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
2021-11-04T14:49:48Z [165643] WARNING [e5a29] {general} at position 9: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"anger","_to":"calm_down","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/fr","process":"/s/process/wikiparsec/2"}}
2021-11-04T14:49:48Z [165643] WARNING [e5a29] {general} at position 10: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"get_angry","_to":"calm_down","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/fr","process":"/s/process/wikiparsec/2"}}
created: 0
warnings/errors: 11
updated/replaced: 0
ignored: 0
This is the JSONL file I'm trying to import:
conceptnet.jsonl:
{"_from":"pm","_to":"am","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/fr","process":"/s/process/wikiparsec/2"}}
{"_from":"red","_to":"amber","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
{"_from":"proprium","_to":"apelativum","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
{"_from":"s","_to":"beze\t","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
{"_from":"euphoria","_to":"bad_trip","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
{"_from":"gooder","_to":"badder","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
{"_from":"goodest","_to":"baddest","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
{"_from":"goodie","_to":"baddie","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2","contributor":"/s/resource/wiktionary/fr"}}
{"_from":"windy","_to":"calm","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
{"_from":"anger","_to":"calm_down","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/fr","process":"/s/process/wikiparsec/2"}}
{"_from":"get_angry","_to":"calm_down","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/fr","process":"/s/process/wikiparsec/2"}}
I tried to modify the line in the jsonl file as follows:
{"_from":"pm","_to":"am","rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/fr","process":"/s/process/wikiparsec/2"}
But I still get this error:
(base) raphy@pc:~$ arangoimport --server.database "ConceptNet" --collection "rel_type" "./ConceptNet/conceptnet.jsonl"
Please specify a password:
Connected to ArangoDB 'http+tcp://127.0.0.1:8529, version: 3.8.2, database: 'ConceptNet', username: 'root'
----------------------------------------
database: ConceptNet
collection: rel_type
create: no
create database: no
source filename: ./ConceptNet/conceptnet.jsonl
file type: json
threads: 2
on duplicate: error
connect timeout: 5
request timeout: 1200
----------------------------------------
Starting JSON import...
2021-11-04T18:48:55Z [37684] WARNING [e5a29] {general} at position 0: creating document failed with error 'edge attribute missing or invalid', offending document: {"_from":"pm","_to":"am","rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/fr","process":"/s/process/wikiparsec/2"}
What am I doing wrong or missing? How can I solve this?

I found that saving the documents into the JSONL file as follows solves the problem: in an edge collection, _from and _to must be full document handles of the form collection/_key, so the values have to be prefixed with the name of the vertex collection (here conceptnet):
conceptnet.jsonl:
{"_from":"conceptnet/pm","_to":"conceptnet/am","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/fr","process":"/s/process/wikiparsec/2"}}
{"_from":"conceptnet/red","_to":"conceptnet/amber","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
{"_from":"conceptnet/proprium","_to":"conceptnet/apelativum","rel":{"rel_type":"Antonym","language":"en","license":"-sa/4.0","sources":"/s/resource/wiktionary/en","process":"/s/process/wikiparsec/2"}}
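For reference, arangoimport can also prepend the collection name for you, so the file does not have to be rewritten by hand. Assuming the vertex collection is called conceptnet and rel_type already exists as an edge collection, something like this should work:

(base) raphy@pc:~$ arangoimport --server.database "ConceptNet" --collection "rel_type" \
    --from-collection-prefix "conceptnet" --to-collection-prefix "conceptnet" \
    "./ConceptNet/conceptnet.jsonl"

The --from-collection-prefix and --to-collection-prefix options prepend "conceptnet/" to every _from and _to value during the import, producing the same document handles as the manual edit above.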

Related

How to format the file path in an MLTable for Azure Machine Learning uploaded during a pipeline job?

How is the path to a (.csv) file to be expressed in an MLTable file that is created in a local folder but then uploaded as part of a pipeline job?
I'm following the Jupyter notebook automl-forecasting-task-energy-demand-advance from the azureml-examples repo (article and notebook). This example has an MLTable file as shown below, referencing a .csv file with a relative path. Then in the pipeline the MLTable is uploaded to be accessible to the remote compute (a few things are omitted for brevity):
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="./data/training-mltable-folder"
)
compute = AmlCompute(
    name=compute_name, size="STANDARD_D2_V2", min_instances=0, max_instances=4
)
forecasting_job = automl.forecasting(
    compute=compute_name,  # name of the compute target we created above
    # name="dpv2-forecasting-job-02",
    experiment_name=exp_name,
    training_data=my_training_data_input,
    # validation_data=my_validation_data_input,
    target_column_name="demand",
    primary_metric="NormalizedRootMeanSquaredError",
    n_cross_validations="auto",
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)
returned_job = ml_client.jobs.create_or_update(
    forecasting_job
)
ml_client.jobs.stream(returned_job.name)
But running this gives the following error message:
Encountered user error while fetching data from Dataset. Error: UserErrorException:
Message: MLTable yaml schema is invalid:
Error Code: Validation
Validation Error Code: Invalid MLTable
Validation Target: MLTableToDataflow
Error Message: Failed to convert a MLTable to dataflow
uri path is not a valid datastore uri path
| session_id=857bd9a1-097b-4df6-aa1c-8871f89580d8
InnerException None
ErrorResponse
{
  "error": {
    "code": "UserError",
    "message": "MLTable yaml schema is invalid: \nError Code: Validation\nValidation Error Code: Invalid MLTable\nValidation Target: MLTableToDataflow\nError Message: Failed to convert a MLTable to dataflow\nuri path is not a valid datastore uri path\n| session_id=857bd9a1-097b-4df6-aa1c-8871f89580d8"
  }
}
The MLTable file looks like this:
paths:
  - file: ./nyc_energy_training_clean.csv
transformations:
  - read_delimited:
      delimiter: ','
      encoding: 'ascii'
  - convert_column_types:
      - columns: demand
        column_type: float
      - columns: precip
        column_type: float
      - columns: temp
        column_type: float
How am I supposed to run this? Thanks in advance!
For a remote path you can use the format below; here is the documentation for creating data assets.
It's important to note that the path specified in the MLTable file must be a valid path in the cloud, not just a valid path on your local machine.
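As a minimal sketch, the paths entry would then point at an azureml datastore URI instead of the local relative path (the datastore name workspaceblobstore and the folder training-data below are placeholders for wherever the CSV was actually uploaded):

paths:
  - file: azureml://datastores/workspaceblobstore/paths/training-data/nyc_energy_training_clean.csv

The transformations section can stay as it is; only the file path changes.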

Rust-analyzer in VSCode to substitute env variables in Cargo path imports

I have a Cargo workspace and want to use a WORKSPACE_HOME environment variable in the import path of local crates.
For instance: in $WORKSPACE_HOME/services/api/Cargo.toml
[dependencies]
...
addrbook = { path = "${WORKSPACE_HOME}/pkg/addrbook" }
...
I tried adding the environment variable to the VSCode settings.json (under rust-analyzer.cargo.extraEnv) and also tried creating a .cargo/config.toml as described here:
[env]
WORKSPACE_HOME = { value = "", relative = true }
Unfortunately, cargo metadata keeps failing saying that it is unable to resolve the path
[ERROR rust_analyzer::lsp_utils] rust-analyzer failed to load workspace: Failed to read Cargo metadata from Cargo.toml file /Users/nickdecooman/Documents/Workspace/foobar/Cargo.toml, Some(Version { major: 1, minor: 63, patch: 0 }): Failed to run `"cargo" "metadata" "--format-version" "1" "--manifest-path" "/Users/nickdecooman/Documents/Workspace/foobar/Cargo.toml" "--filter-platform" "x86_64-apple-darwin"`: `cargo metadata` exited with an error: error: failed to load manifest for workspace member `/Users/nickdecooman/Documents/Workspace/foobar/services/api`
Caused by:
failed to load manifest for dependency `addrbook`
Caused by:
failed to read `/Users/nickdecooman/Documents/Workspace/foobar/services/api/${WORKSPACE_HOME}/pkg/addrbook/Cargo.toml`
Caused by:
No such file or directory (os error 2)
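As far as I know, Cargo does not expand environment variables in path dependencies, which is why the literal ${WORKSPACE_HOME} ends up in the path the error reports. A sketch of the usual workaround, assuming addrbook lives in the same repository, is to use a path relative to the member's own Cargo.toml:

[dependencies]
# relative to services/api/Cargo.toml, this resolves to $WORKSPACE_HOME/pkg/addrbook
addrbook = { path = "../../pkg/addrbook" }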

Input format for Tensorflow models on GCP AI Platform

I have uploaded a model to GCP AI Platform Models. It's a simple multistep Keras model with 5 features, trained on 168 lagged values. When I try to test the model, I get this strange error message:
"error": "Prediction failed: Error during model execution: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.FAILED_PRECONDITION\n\tdetails = \"Error while reading resource variable dense_7/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/dense_7/bias)\n\t [[{{node model_2/dense_7/BiasAdd/ReadVariableOp}}]]\"\n\tdebug_error_string = \"{\"created\":\"#1618946146.138507164\",\"description\":\"Error received from peer ipv4:127.0.0.1:8081\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1061,\"grpc_message\":\"Error while reading resource variable dense_7/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/dense_7/bias)\\n\\t [[{{node model_2/dense_7/BiasAdd/ReadVariableOp}}]]\",\"grpc_status\":9}\"\n>"
The input has the following format: a list of shape (1, 168, 5). See the example below:
{
  "instances":
    [[[ 3.10978284e-01, 2.94650396e-01, 8.83664149e-01,
        1.60210423e+00, -1.47402699e+00],
      [ 3.10978284e-01, 2.94650396e-01, 5.23466315e-01,
        1.60210423e+00, -1.47402699e+00],
      [ 8.68576328e-01, 7.78699823e-01, 2.83334426e-01,
        1.60210423e+00, -1.47402699e+00]]]
}

Stormcrawler not retrieving all text content from web page

I'm attempting to use Stormcrawler to crawl a set of pages on our website, and while it is able to retrieve and index some of the page's text, it's not capturing a large amount of other text on the page.
I've installed Zookeeper, Apache Storm, and Stormcrawler using the Ansible playbooks provided here (thank you a million for those!) on a server running Ubuntu 18.04, along with Elasticsearch and Kibana. For the most part, I'm using the configuration defaults, but have made the following changes:
For the Elastic index mappings, I've enabled _source: true, and turned on indexing and storing for all properties (content, host, title, url)
In the crawler-conf.yaml configuration, I've commented out all textextractor.include.pattern and textextractor.exclude.tags settings, to enforce capturing the whole page
After re-creating fresh ES indices, running mvn clean package, and then starting the crawler topology, stormcrawler begins doing its thing and content starts appearing in Elasticsearch. However, for many pages, the content that's retrieved and indexed is only a subset of all the text on the page, and usually excludes the main page text we are interested in.
For example, the text in the following XML path is not returned/indexed:
<html> <body> <div#maincontentcontainer.container> <div#docs-container> <div> <div.row> <div.col-lg-9.col-md-8.col-sm-12.content-item> <div> <div> <p> (text)
While the text in this path is returned:
<html> <body> <div> <div.container> <div.row> <p> (text)
Are there any additional configuration changes that need to be made beyond commenting out all specific tag include and exclude patterns? From my understanding of the documentation, the default settings for those options are to enforce the whole page to be indexed.
I would greatly appreciate any help. Thank you for the excellent software.
Below are my configuration files:
crawler-conf.yaml
config:
  topology.workers: 3
  topology.message.timeout.secs: 1000
  topology.max.spout.pending: 100
  topology.debug: false
  fetcher.threads.number: 100
  # override the JVM parameters for the workers
  topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"
  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata
  # metadata to transfer to the outlinks
  # metadata.transfer:
  #  - customMetadataName
  # lists the metadata to persist to storage
  metadata.persist:
    - _redirTo
    - error.cause
    - error.source
    - isSitemap
    - isFeed
  http.agent.name: "My crawler"
  http.agent.version: "1.0"
  http.agent.description: ""
  http.agent.url: ""
  http.agent.email: ""
  # The maximum number of bytes for returned HTTP response bodies.
  http.content.limit: -1
  # FetcherBolt queue dump => comment out to activate
  # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"
  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"
  # revisit a page daily (value in minutes)
  fetchInterval.default: 1440
  # revisit a page with a fetch error after 2 hours (value in minutes)
  fetchInterval.fetch.error: 120
  # never revisit a page with an error (or set a value in minutes)
  fetchInterval.error: -1
  # text extraction for JSoupParserBolt
  # textextractor.include.pattern:
  #  - DIV[id="maincontent"]
  #  - DIV[itemprop="articleBody"]
  #  - ARTICLE
  # textextractor.exclude.tags:
  #  - STYLE
  #  - SCRIPT
  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
    - parse.title=title
    - parse.keywords=keywords
    - parse.description=description
    - domain=domain
  # Metrics consumers:
  topology.metrics.consumer.register:
    - class: "org.apache.storm.metric.LoggingMetricsConsumer"
      parallelism.hint: 1
  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  selenium.addresses: "http://localhost:9515"
es-conf.yaml
config:
  # ES indexer bolt
  es.indexer.addresses: "localhost"
  es.indexer.index.name: "content"
  # es.indexer.pipeline: "_PIPELINE_"
  es.indexer.create: false
  es.indexer.bulkActions: 100
  es.indexer.flushInterval: "2s"
  es.indexer.concurrentRequests: 1
  # ES metricsConsumer
  es.metrics.addresses: "http://localhost:9200"
  es.metrics.index.name: "metrics"
  # ES spout and persistence bolt
  es.status.addresses: "http://localhost:9200"
  es.status.index.name: "status"
  es.status.routing: true
  es.status.routing.fieldname: "key"
  es.status.bulkActions: 500
  es.status.flushInterval: "5s"
  es.status.concurrentRequests: 1
  # spout config #
  # positive or negative filters parsable by the Lucene Query Parser
  # es.status.filterQuery:
  #  - "-(key:stormcrawler.net)"
  #  - "-(key:digitalpebble.com)"
  # time in secs for which the URLs will be considered for fetching after a ack of fail
  spout.ttl.purgatory: 30
  # Min time (in msecs) to allow between 2 successive queries to ES
  spout.min.delay.queries: 2000
  # Delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
  spout.reset.fetchdate.after: 120
  es.status.max.buckets: 50
  es.status.max.urls.per.bucket: 2
  # field to group the URLs into buckets
  es.status.bucket.field: "key"
  # fields to sort the URLs within a bucket
  es.status.bucket.sort.field:
    - "nextFetchDate"
    - "url"
  # field to sort the buckets
  es.status.global.sort.field: "nextFetchDate"
  # CollapsingSpout : limits the deep paging by resetting the start offset for the ES query
  es.status.max.start.offset: 500
  # AggregationSpout : sampling improves the performance on large crawls
  es.status.sample: false
  # max allowed duration of a query in sec
  es.status.query.timeout: -1
  # AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
  # use it as nextFetchDate
  es.status.recentDate.increase: -1
  es.status.recentDate.min.gap: -1
  topology.metrics.consumer.register:
    - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
      parallelism.hint: 1
      #whitelist:
      #  - "fetcher_counter"
      #  - "fetcher_average.bytes_fetched"
      #blacklist:
      #  - "__receive.*"
es-crawler.flux
name: "crawler"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "es-conf.yaml"
    override: true
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10
  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - true
bolts:
  - id: "filter"
    className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
    parallelism: 3
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 3
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 3
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 3
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 12
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 3
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 3
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 3
streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE
  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]
  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "filespout"
    to: "filter"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "filter"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"
parsefilters.json
{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
          "//*[@name=\"description\"]/@content",
          "//*[@name=\"Description\"]/@content"
        ],
        "parse.title": [
          "//TITLE",
          "//META[@name=\"title\"]/@content"
        ],
        "parse.keywords": "//META[@name=\"keywords\"]/@content"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
      "name": "LinkParseFilter",
      "params": {
        "pattern": "//FRAME/@src"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
      "name": "DomainParseFilter",
      "params": {
        "key": "domain",
        "byHost": false
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata",
      "name": "CommaSeparatedToMultivaluedMetadata",
      "params": {
        "keys": ["parse.keywords"]
      }
    }
  ]
}
Attempting to use Chromedriver
I installed the latest versions of Chromedriver and Google Chrome for Ubuntu.
First I start chromedriver in headless mode at localhost:9515 as the stormcrawler user (via a separate Python shell, as shown below), and then I restart the stormcrawler topology (also as the stormcrawler user), but I end up with a stack of errors related to Chrome. The odd thing, however, is that I can confirm chromedriver is running OK within the Python shell directly, and I can confirm that both the driver and the browser are actively running via ps -ef. This same stack of errors also occurs when I attempt to simply start chromedriver from the command line (i.e., chromedriver --headless &).
Starting chromedriver in headless mode (in python3 shell)
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--window-size=1200x600')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-setuid-sandbox')
options.add_argument('--disable-extensions')
options.add_argument('--disable-infobars')
options.add_argument('--remote-debugging-port=9222')
options.add_argument('--user-data-dir=/home/stormcrawler/cache/google/chrome')
options.add_argument('--disable-gpu')
options.add_argument('--profile-directory=Default')
options.binary_location = '/usr/bin/google-chrome'
driver = webdriver.Chrome(chrome_options=options, port=9515, executable_path=r'/usr/bin/chromedriver')
Stack trace from starting stormcrawler topology
Run command: storm jar target/stormcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 60000
9486 [Thread-26-fetcher-executor[3 3]] ERROR o.a.s.util - Async loop died!
java.lang.RuntimeException: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Build info: version: '4.0.0-alpha-6', revision: '5f43a29cfc'
System info: host: 'stormcrawler-dev', ip: '127.0.0.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.15.0-33-generic', java.version: '1.8.0_282'
Driver info: driver.version: RemoteWebDriver
remote stacktrace: #0 0x55d590b21e89 <unknown>
at com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol.configure(RemoteDriverProtocol.java:101) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at com.digitalpebble.stormcrawler.protocol.ProtocolFactory.<init>(ProtocolFactory.java:69) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at com.digitalpebble.stormcrawler.bolt.FetcherBolt.prepare(FetcherBolt.java:818) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at org.apache.storm.daemon.executor$fn__10180$fn__10193.invoke(executor.clj:803) ~[storm-core-1.2.3.jar:1.2.3]
at org.apache.storm.util$async_loop$fn__624.invoke(util.clj:482) [storm-core-1.2.3.jar:1.2.3]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
...
Confirming that chromedriver and chrome are both running and reachable
~/stormcrawler$ ps -ef | grep -i 'driver'
stormcr+ 18862 18857 0 14:28 pts/0 00:00:00 /usr/bin/chromedriver --port=9515
stormcr+ 18868 18862 0 14:28 pts/0 00:00:00 /usr/bin/google-chrome --disable-background-networking --disable-client-side-phishing-detection --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-gpu --disable-hang-monitor --disable-infobars --disable-popup-blocking --disable-prompt-on-repost --disable-setuid-sandbox --disable-sync --enable-automation --enable-blink-features=ShadowDOMV0 --enable-logging --headless --log-level=0 --no-first-run --no-sandbox --no-service-autorun --password-store=basic --profile-directory=Default --remote-debugging-port=9222 --test-type=webdriver --use-mock-keychain --user-data-dir=/home/stormcrawler/cache/google/chrome --window-size=1200x600
stormcr+ 18899 18877 0 14:28 pts/0 00:00:00 /opt/google/chrome/chrome --type=renderer --no-sandbox --disable-dev-shm-usage --enable-automation --enable-logging --log-level=0 --remote-debugging-port=9222 --test-type=webdriver --allow-pre-commit-input --ozone-platform=headless --field-trial-handle=17069524199442920904,10206176048672570859,131072 --disable-gpu-compositing --enable-blink-features=ShadowDOMV0 --lang=en-US --headless --enable-crash-reporter --lang=en-US --num-raster-threads=1 --renderer-client-id=4 --shared-files=v8_context_snapshot_data:100
~/stormcrawler$ sudo netstat -lp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 localhost:9222 0.0.0.0:* LISTEN 18026/google-chrome
tcp 0 0 localhost:9515 0.0.0.0:* LISTEN 18020/chromedriver
IIRC you need to set some additional config to work with ChromeDriver.
Alternatively (I haven't tried it yet), https://hub.docker.com/r/browserless/chrome would be a nice way of handling Chrome in a Docker container.
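If it helps, here is a rough sketch of the kind of extra configuration I mean, added to the config: section of crawler-conf.yaml. I'm assuming selenium.capabilities is still the key RemoteDriverProtocol reads for the ChromeDriver capabilities, so please double-check the key names against crawler-default.yaml for your StormCrawler version:

  # hypothetical sketch -- verify against crawler-default.yaml
  selenium.addresses: "http://localhost:9515"
  selenium.capabilities:
    "goog:chromeOptions":
      args:
        - "--headless"
        - "--no-sandbox"
        - "--disable-dev-shm-usage"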

AWS SAM template error: 'collections.OrderedDict' object has no attribute 'startswith'

I am getting this error while using a SAM template to deploy resources. Below is the script:
- sam package --template-file test.json --s3-bucket $s3_bucket --s3-prefix packages/my_folder/ --output-template-file samtemplate.yml
I still get this error, even after rolling back to a previously working state:
return any([url.startswith(prefix) for prefix in ["s3://", "http://", "https://"]])
File "/usr/local/lib/python3.8/site-packages/samcli/lib/providers/sam_stack_provider.py", line 250, in
return any([url.startswith(prefix) for prefix in ["s3://", "http://", "https://"]])
AttributeError: 'collections.OrderedDict' object has no attribute 'startswith'
After adding some debug messages I got this error:
2021-04-22 06:42:32,820 | Unable to resolve property S3bucketname: OrderedDict([('Fn::Select', ['0', OrderedDict([('Fn::Split', ['/', OrderedDict([('Ref', 'TemplateS3BucketName')])])])])]). Leaving as is.
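For what it's worth, that OrderedDict is just the parsed intrinsic function, i.e. in the template the property looks like the fragment below; sam package seems to expect a plain string (an S3 URL or a local path) at that position rather than an unresolved Fn::Select/Fn::Split expression:

"S3bucketname": {
  "Fn::Select": [
    "0",
    { "Fn::Split": [ "/", { "Ref": "TemplateS3BucketName" } ] }
  ]
}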
