Stormcrawler pages with noindex nofollow are crawled

We are using StormCrawler 1.13 to crawl site pages. In one environment it does not crawl pages that carry a robots meta noindex/nofollow, but when we deploy the same modules in another environment, pages with noindex/nofollow are crawled as well. Below is our crawler-conf.yaml.
# Custom configuration for StormCrawler
# This is used to override the default values from crawler-default.xml and provide additional ones
# for your custom components.
# Use this file with the parameter -conf when launching your extension of ConfigurableTopology.
# This file does not contain all the key values but only the most frequently used ones. See crawler-default.xml for an extensive list.
config:
  topology.workers: 1
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 100
  topology.debug: false

  fetcher.threads.number: 50

  # give 2gb to the workers
  worker.heap.memory.mb: 2048

  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata

  # metadata to transfer to the outlinks
  # used by Fetcher for redirections, sitemapparser, etc...
  # these are also persisted for the parent document (see below)
  # metadata.transfer:
  #  - customMetadataName

  # lists the metadata to persist to storage
  # these are not transfered to the outlinks
  metadata.persist:
    - _redirTo
    - error.cause
    - error.source
    - isSitemap
    - isFeed

  http.agent.name: "Anonymous Coward"
  http.agent.version: "1.0"
  http.agent.description: "built with StormCrawler Archetype ${version}"
  http.agent.url: "http://someorganization.com/"
  http.agent.email: "someone@someorganization.com"

  # The maximum number of bytes for returned HTTP response bodies.
  # The fetched page will be trimmed to 65KB in this case
  # Set -1 to disable the limit.
  http.content.limit: -1

  # FetcherBolt queue dump => comment out to activate
  # if a file exists on the worker machine with the corresponding port number
  # the FetcherBolt will log the content of its internal queues to the logs
  # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"

  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"

  # revisit a page daily (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.default: 1440

  # revisit a page with a fetch error after 2 hours (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.fetch.error: 120

  # never revisit a page with an error (or set a value in minutes)
  fetchInterval.error: -1

  # custom fetch interval to be used when a document has the key/value in its metadata
  # and has been fetched successfully (value in minutes)
  # fetchInterval.FETCH_ERROR.isFeed=true: 30
  # fetchInterval.isFeed=true: 10

  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
    - parse.title=title
    - parse.keywords=keywords
    - parse.description=description
    - domain=domain

  # Metrics consumers:
  topology.metrics.consumer.register:
    - class: "org.apache.storm.metric.LoggingMetricsConsumer"
      parallelism.hint: 1
Please let me know if I need to change something in the configuration above, or in any other StormCrawler settings.
Thank you.

The behaviour of meta noindex is not configurable in 1.13, so any difference between your environments can't be due to a difference in configuration.
How did you generate the topology? Did you use the archetype?
PS: it is good practice to set the http.agent.* configs.
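As a side check (not part of the answer above), it can help to fetch the same URL from both environments and compare the robots meta tag the server actually returns, since some sites vary directives by host or user agent. A minimal sketch in Python; the URL and the user-agent string are placeholders to be replaced with your page and your http.agent.* values:

import re
import requests

# Placeholders -- substitute the page you are testing and the agent string
# configured in http.agent.* for your crawler.
URL = "https://example.com/some-page"
HEADERS = {"User-Agent": "Anonymous Coward/1.0"}

html = requests.get(URL, headers=HEADERS, timeout=30).text

# Print every robots meta directive found in the returned HTML.
for match in re.finditer(r'<meta[^>]+name=["\']robots["\'][^>]*>', html, flags=re.IGNORECASE):
    print(match.group(0))

Running this against both environments makes it easy to see whether the pages themselves differ, independently of StormCrawler.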

Related

Configure log in aks

I'm trying to limit the AKS logs for the various containers. Following this guide https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-agent-config I created my config map:
kind: ConfigMap
apiVersion: v1
data:
  schema-version:
    #string.used by agent to parse config. supported versions are {v1}. Configs with other schema versions will be rejected by the agent.
    v1
  config-version:
    #string.used by customer to keep track of this config file's version in their source control/repository (max allowed 10 chars, other chars will be truncated)
    ver1
  log-data-collection-settings: |-
    # Log data collection settings
    # Any errors related to config map settings can be found in the KubeMonAgentEvents table in the Log Analytics workspace that the cluster is sending data to.

    [log_collection_settings]
       [log_collection_settings.stdout]
          # In the absense of this configmap, default value for enabled is true
          enabled = false
          # exclude_namespaces setting holds good only if enabled is set to true
          # kube-system,gatekeeper-system log collection are disabled by default in the absence of 'log_collection_settings.stdout' setting. If you want to enable kube-system,gatekeeper-system, remove them from the following setting.
          # If you want to continue to disable kube-system,gatekeeper-system log collection keep the namespaces in the following setting and add any other namespace you want to disable log collection to the array.
          # In the absense of this configmap, default value for exclude_namespaces = ["kube-system","gatekeeper-system"]
          # exclude_namespaces = ["kube-system","gatekeeper-system","kube-node-lease","kube-public","default","nsbpo","nscommon","nsregistry","aks-command"]

       [log_collection_settings.stderr]
          # Default value for enabled is true
          enabled = true
          # exclude_namespaces setting holds good only if enabled is set to true
          # kube-system,gatekeeper-system log collection are disabled by default in the absence of 'log_collection_settings.stderr' setting. If you want to enable kube-system,gatekeeper-system, remove them from the following setting.
          # If you want to continue to disable kube-system,gatekeeper-system log collection keep the namespaces in the following setting and add any other namespace you want to disable log collection to the array.
          # In the absense of this cofigmap, default value for exclude_namespaces = ["kube-system","gatekeeper-system"]
          exclude_namespaces = []

       [log_collection_settings.env_var]
          # In the absense of this configmap, default value for enabled is true
          enabled = false

       [log_collection_settings.enrich_container_logs]
          # In the absense of this configmap, default value for enrich_container_logs is false
          enabled = false
          # When this is enabled (enabled = true), every container log entry (both stdout & stderr) will be enriched with container Name & container Image

       [log_collection_settings.collect_all_kube_events]
          # In the absense of this configmap, default value for collect_all_kube_events is false
          # When the setting is set to false, only the kube events with !normal event type will be collected
          enabled = false
          # When this is enabled (enabled = true), all kube events including normal events will be collected

       #[log_collection_settings.schema]
          # In the absence of this configmap, default value for containerlog_schema_version is "v1"
          # Supported values for this setting are "v1","v2"
          # See documentation at https://aka.ms/ContainerLogv2 for benefits of v2 schema over v1 schema before opting for "v2" schema
          # containerlog_schema_version = "v2"

  metric_collection_settings: |-
    # Metrics collection settings for metrics sent to Log Analytics and MDM
    [metric_collection_settings.collect_kube_system_pv_metrics]
       # In the absense of this configmap, default value for collect_kube_system_pv_metrics is false
       # When the setting is set to false, only the persistent volume metrics outside the kube-system namespace will be collected
       enabled = false
       # When this is enabled (enabled = true), persistent volume metrics including those in the kube-system namespace will be collected

  alertable-metrics-configuration-settings: |-
    # Alertable metrics configuration settings for container resource utilization
    [alertable_metrics_configuration_settings.container_resource_utilization_thresholds]
        # The threshold(Type Float) will be rounded off to 2 decimal points
        # Threshold for container cpu, metric will be sent only when cpu utilization exceeds or becomes equal to the following percentage
        container_cpu_threshold_percentage = 95.0
        # Threshold for container memoryRss, metric will be sent only when memory rss exceeds or becomes equal to the following percentage
        container_memory_rss_threshold_percentage = 95.0
        # Threshold for container memoryWorkingSet, metric will be sent only when memory working set exceeds or becomes equal to the following percentage
        container_memory_working_set_threshold_percentage = 95.0

    # Alertable metrics configuration settings for persistent volume utilization
    [alertable_metrics_configuration_settings.pv_utilization_thresholds]
        # Threshold for persistent volume usage bytes, metric will be sent only when persistent volume utilization exceeds or becomes equal to the following percentage
        pv_usage_threshold_percentage = 60.0

    # Alertable metrics configuration settings for completed jobs count
    [alertable_metrics_configuration_settings.job_completion_threshold]
        # Threshold for completed job count , metric will be sent only for those jobs which were completed earlier than the following threshold
        job_completion_threshold_time_minutes = 360

  integrations: |-
    [integrations.azure_network_policy_manager]
        collect_basic_metrics = false
        collect_advanced_metrics = false
    [integrations.azure_subnet_ip_usage]
        enabled = false

  # Doc - https://github.com/microsoft/Docker-Provider/blob/ci_prod/Documentation/AgentSettings/ReadMe.md
  agent-settings: |-
    # prometheus scrape fluent bit settings for high scale
    # buffer size should be greater than or equal to chunk size else we set it to chunk size.
    #[agent_settings.prometheus_fbit_settings]
      # tcp_listener_chunk_size = 10
      # tcp_listener_buffer_size = 10
      # tcp_listener_mem_buf_limit = 200

    # The following settings are "undocumented", we don't recommend uncommenting them unless directed by Microsoft.
    # They increase the maximum stdout/stderr log collection rate but will also cause higher cpu/memory usage.
    ## Ref for more details about Ignore_Older - https://docs.fluentbit.io/manual/v/1.7/pipeline/inputs/tail
    # [agent_settings.fbit_config]
    #   log_flush_interval_secs = "1"        # default value is 15
    #   tail_mem_buf_limit_megabytes = "10"  # default value is 10
    #   tail_buf_chunksize_megabytes = "1"   # default value is 32kb (comment out this line for default)
    #   tail_buf_maxsize_megabytes = "1"     # defautl value is 32kb (comment out this line for default)
    #   tail_ignore_older = "5m"             # default value same as fluent-bit default i.e.0m

metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
Reading the agent logs I find a couple of odd things. The logs say the config map has been applied, but they also show an exclusion for both stderr and stdout; since stdout is disabled, how is that possible? The logs also contain the message config :: No ADX database name set, using default value: containerinsights, which I searched for but could not find any information about.
Also, in the Log Analytics workspace I see that stdout logs are still being collected in the ContainerLog table.
I wonder whether I have misinterpreted the guide or misconfigured something.
I tried to reproduce the same issue in my environment and got the expected results.
I created and deployed the config file:
vi container-azm-ms-agentconfig.yaml
kubectl apply -f container-azm-ms-agentconfig.yaml
We can list the agent pods with the command below:
kubectl get pods -n kube-system
and check the logs of a pod with:
kubectl logs <pod_name> -n kube-system
When I checked the logs I got the same message: config :: No ADX database name set, using default value: containerinsights
This is not an error. We did not create any ADX database, so containerinsights is used as the default value. If needed, we can create an ADX database and then the message will no longer appear.
You can refer to this link.
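As an additional check (not from the answer above), one way to verify whether stdout entries are still reaching the workspace is to query the ContainerLog table directly. Below is a minimal sketch using the azure-monitor-query and azure-identity Python packages; the workspace ID is a placeholder and the column names assume the default ContainerLog (v1) schema:

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Count recent log lines per source (stdout/stderr); if stdout collection is
# really disabled, the stdout count should stop growing.
query = """
ContainerLog
| where TimeGenerated > ago(1h)
| summarize entries = count() by LogEntrySource
"""

response = client.query_workspace(
    workspace_id="<your-log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(hours=1),
)

for table in response.tables:
    for row in table.rows:
        print(list(row))

Keep in mind that already-ingested stdout lines will remain queryable for the workspace retention period; only new ingestion stops once the config map takes effect.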

How to set a `User cap` for a particular domain in Gitlab

Original question:
I want to limit the number of users from a particular domain that can register into my Gitlab instance. I noticed that I could set a "user cap", but it wasn't specific to a domain.
For example:
I want to limit the number of users registered from these domains. 20 users from testdomain1.com and 30 users from testdomain2.com are allowed to sign up. So, if there are already 20 users registered successfully from testdomain1.com, a new user from testdomain1.com will not be allowed to sign up.
What should I do for it?
2021.11.18 Edited:
I added a validation to the User model:
# gitlab/app/models/user.rb
class User < ApplicationRecord
  # ...
  validate :email_domain, :ensure_user_email_count
  # ...

  def email_domain
    email_domain = /@.*?$/.match(email)[0]
    email_domain
  end

  def ensure_user_email_count
    # select count(*) from users where email like '%@test.com';
    if User.where("email LIKE ?", "%#{email_domain}").count >= 30
      errors.add(email_domain, _('already has 30 registered emails.'))
    end
  end
end
This validation can set a user cap of 30 for each domain, but it still can't set a different user cap for a particular domain.
Since the related issue has not received any response yet, I'm trying to implement it myself. It seems I need to extend the UI of the Admin Settings page and add some related tables to the database to set a different user cap for each email domain.
The GitLab user cap seems to be per GitLab instance.
So if both your domains reference the same GitLab instance, only one user cap is possible.
But if each of your domains redirects to its own autonomous GitLab instance (one per domain), then you should be able to set a user cap per domain.
The OP Ann Lin has created issue 345557 to follow that feature request.
The OP reports:
A particular table is needed to store the caps.
But I don’t have enough time now to modify the UI so I found a simple way to do this:
The Allowed domains for sign-ups setting, which is called domain_allowlist in the database, is a text column:
gitlabhq_production=# \d application_settings
...
 domain_allowlist | text | | |
...
gitlabhq_production=# select domain_allowlist from application_settings;
 domain_allowlist
-------------------
 ---              +
 - testdomain1.com+
 - testdomain2.com+
(1 row)
I can modify testdomain1.com to testdomain1.com#30 to store the user cap and use a regex to get the number 30.
I will modify the UI and add the database table later, and I'll create a pull request on GitLab when I'm done.
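To illustrate the workaround described above (GitLab itself is Rails, so treat this as a language-neutral sketch written in Python): the domain#cap entry format is the OP's proposal, not an existing GitLab feature, and parsing it could look like this:

import re

# Allowlist text as stored in application_settings.domain_allowlist,
# with the proposed "#<cap>" suffix per domain (hypothetical format).
domain_allowlist = """---
- testdomain1.com#20
- testdomain2.com#30
"""

caps = {}
for line in domain_allowlist.splitlines():
    match = re.match(r"^- (?P<domain>[^#]+?)(?:#(?P<cap>\d+))?$", line.strip())
    if match:
        # Default to no cap when a domain has no "#<number>" suffix.
        caps[match.group("domain")] = int(match.group("cap")) if match.group("cap") else None

print(caps)  # {'testdomain1.com': 20, 'testdomain2.com': 30}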

How to get Salesforce REST API to paginate?

I'm using the simple_salesforce python wrapper for the Salesforce REST API. We have hundreds of thousands of records, and I'd like to split up the pull of the salesforce data so all records are not pulled at the same time.
I've tried passing a query like:
results = salesforce_connection.query_all("SELECT my_field FROM my_model limit 2000 offset 50000")
to see records 50K through 52K but receive an error that offset can only be used for the first 2000 records. How can I use pagination so I don't need to pull all records at once?
You're looking to use salesforce_connection.query(query=SOQL) and then .query_more(nextRecordsUrl, True).
Since .query() only returns 2000 records, you need to use .query_more to get the next page of results.
From the simple-salesforce docs
SOQL queries are done via:
sf.query("SELECT Id, Email FROM Contact WHERE LastName = 'Jones'")
If, due to an especially large result, Salesforce adds a nextRecordsUrl to your query result, such as "nextRecordsUrl" : "/services/data/v26.0/query/01gD0000002HU6KIAW-2000", you can pull the additional results with either the ID or the full URL (if using the full URL, you must pass ‘True’ as your second argument)
sf.query_more("01gD0000002HU6KIAW-2000")
sf.query_more("/services/data/v26.0/query/01gD0000002HU6KIAW-2000", True)
Here is an example of using this
data = []  # list to hold all the records
SOQL = "SELECT my_field FROM my_model"
results = sf.query(query=SOQL)  # api call

## loop through the results and add the records
for rec in results['records']:
    rec.pop('attributes', None)  # remove extra data
    data.append(rec)  # add the record to the list

## check the 'done' attribute in the response to see if there are more records
## while 'done' == False (more records to fetch) get the next page of records
while results['done'] == False:
    ## attribute 'nextRecordsUrl' holds the url to the next page of records
    results = sf.query_more(results['nextRecordsUrl'], True)
    ## repeat the loop of adding the records
    for rec in results['records']:
        rec.pop('attributes', None)
        data.append(rec)
Looping through the records and using the data:
## loop through the records and get their attribute values
for rec in data:
    # the attribute name will always be the same as the salesforce api name for that value
    print(rec['my_field'])
As the other answer says, though, this can start to use up a lot of resources, but it is what you're looking for if you want to achieve pagination.
Maybe create a more focused SOQL statement to get only the records needed for your use case at that specific moment.
LIMIT and OFFSET aren't really meant to be used like that: what if somebody inserts or deletes a record at an earlier position (not to mention you don't have an ORDER BY in there)? SF will open a proper cursor for you; use it.
https://pypi.org/project/simple-salesforce/ docs for "Queries" say that you can either call query and then query_more, or you can go with query_all. query_all will loop and keep calling query_more until you exhaust the cursor, but this can easily eat your RAM.
Alternatively, look into the bulk query stuff; there's some magic in the API, but I don't know if it fits your use case. These would be asynchronous calls and might not be implemented in the library. It's called PK Chunking. I wouldn't bother unless you have millions of records.
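Depending on your simple_salesforce version, there is also query_all_iter, which follows the cursor lazily instead of materializing everything the way query_all does. A minimal sketch; the object and field names are the question's placeholders and the credentials are dummies:

from simple_salesforce import Salesforce

# Placeholder credentials -- use whatever auth method you already use.
sf = Salesforce(username="user@example.com", password="...", security_token="...")

# query_all_iter returns a generator that follows nextRecordsUrl behind the
# scenes, yielding one record dict at a time instead of building a huge list.
for rec in sf.query_all_iter("SELECT my_field FROM my_model"):
    print(rec["my_field"])

This keeps memory usage flat even over hundreds of thousands of records, at the cost of holding the cursor open while you iterate.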

django-viewflow - getting task URL without request

Knowing a task instance, is there a way to get its URL? For example, in the cookbook: https://github.com/viewflow/cookbook/blob/master/helloworld/demo/helloworld/flows.py - how do I get the URL of the assign task of the approve flow_task?
I know there is flow_task.get_task_url(task, url_type='guess', namespace='', **kwargs), but the point is that, from what I can see, the namespace is usually fetched from self.request.resolver_match.namespace. That's not ideal: what if we are in another part of the app and simply want to provide links to the tasks directly?
Same as with Django's reverse, you need to pass a namespace to get a URL. In the case of the built-in viewflow frontend, the namespace is viewflow:[app_label]:[flow_label], e.g. "viewflow:helloworld:helloworld".
If you have the task object in a template, you can extract the URL the same way with flow_task.get_task_url and render it as a "Task Link" anchor.
This could be added as a template filter if used often.
This is a dirty hack until I understand how the namespaces work.
To get the URL of a task, all you need is the app_name (app namespace), the flow_namespace and the flow_label.
The most challenging item here is the flow_namespace (if you have not used the frontend urls).
To resolve this, you could use a map borrowing from FlowListMixin's ns_map, defining the flow_namespace for every flow in your project.
You then determine the flow's namespace and url_name from the above.
ns_map = {'MyFlow': 'flow_namespace', 'AnotherFlow': 'flow_namespace2'}
# flow_namespace as defined in the urls.py
# e.g. if you defined your flow urls as
# flow_urls = FlowViewSet(MyFlow).urls
# flow_urls2 = FlowViewSet(MyFlow2).urls
# urlpatterns = [url(r'flow_path/', include(flow_urls, name=flow_namespace)),
#                url(r'flow_path2/', include(flow_urls2, name=flow_namespace2)),
# ]
# With this included in the root urls as
# urlpatterns = [
#     url(r'app/', include(app_urls, namespace='app_namespace'))
# ]
What you need is to reverse the flow like this:
reverse('app_name:flow_namespace:flow_label', kwargs={'process_pk': ppk, 'task_pk': tpk})
flow_class_name = task.process.flow_class.__name__
flow_namespace = ns_map.get(flow_class_name)
app_name = task.process.flow_class.__module__.split('.')[0]
flow_label = task.flow_task.name
url_name = "{}:{}:{}".format(app_name, flow_namespace, flow_label)
Then you can reverse your task URL:
url = reverse(url_name, kwargs={"task_pk": task.pk, "process_pk": task.flow_process.pk})
# If you know where you are, you could use resolver_match to determine
# the app namespace. Be sure of this; see more of that [here][1]
NOTE: I am assuming that you namespaced your apps as app_name.
If it is different, you have to find alternatives for finding the app's namespace, but that should not be too difficult.
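Pulling the steps above together, a small helper could look like the sketch below. It only restates the answer's approach: it assumes the ns_map described above, that your apps are namespaced as app_name, and Django >= 1.10 for django.urls.reverse:

from django.urls import reverse

# flow class name -> flow namespace, as described above (your values will differ)
ns_map = {'MyFlow': 'flow_namespace', 'AnotherFlow': 'flow_namespace2'}

def task_url(task, app_name='app_name'):
    """Build the URL of a viewflow task without access to a request object."""
    flow_class = task.process.flow_class
    flow_namespace = ns_map.get(flow_class.__name__)
    url_name = "{}:{}:{}".format(app_name, flow_namespace, task.flow_task.name)
    return reverse(url_name, kwargs={"process_pk": task.flow_process.pk, "task_pk": task.pk})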

How to handle edge cases in creation of an EFS resource on AWS, using boto3

I'm creating AWS Elastic File System resources using the boto3 SDK.
In the boto3 docs for EFS (linked above), there are no waiters (unlike for other actions such as launching EC2 instances). So I can't call a waiter to hold execution until the resource is created, and have to write my own. There are also a bunch of edge cases that spring to mind, and I can't find examples that handle them.
client = # Attach credentials and create an efs boto3 client

def find_or_create_file_system(self, a_token):
    fs = self.client.create_file_system(CreationToken=a_token, PerformanceMode='generalPurpose')
    # Returns either:
    # {
    #     'OwnerId': 'string',
    #     'CreationToken': 'string',
    #     'FileSystemId': 'string',
    #     'CreationTime': datetime(2015, 1, 1),
    #     'LifeCycleState': 'creating'|'available'|'deleting'|'deleted',
    #     'Name': 'string',
    #     'NumberOfMountTargets': 123,
    #     'SizeInBytes': {
    #         'Value': 123,
    #         'Timestamp': datetime(2015, 1, 1)
    #     },
    #     'PerformanceMode': 'generalPurpose'|'maxIO'
    # }
    # Or, if an FS is available with that creation token already, the above returns
    # an error. According to the boto3 docs, the error will contain the existing fs id.
    # Is this an error I need to manage with try/catch? What is the syntax to get
    # the id out of the error?
    if there_is_an_error:
        # EFS already exists
        if fs['LifeCycleState'] == 'creating':
            # Need to wait until it's created, then return its id
        elif fs['LifeCycleState'] != 'available':
            # It is being / has been deleted.
            # What now? Is that token never usable again? Does it eventually disappear so I can reuse it? How long do I have to wait before recreating it?

    # Wait until available
    fs_desc = self.client.describe_file_systems(FileSystemId=fs['FileSystemId'])
    # TODO figure out whether there's a waiter for this
    while fs_desc['FileSystems'][0]['LifeCycleState'] == 'creating':
        time.sleep(5)
        fs_desc = self.client.describe_file_systems(FileSystemId=fs['FileSystemId'])  # refresh metadata
        print("EFS state: {0}".format(fs_desc['FileSystems'][0]['LifeCycleState']))
    return fs['FileSystemId']
Question 1: Am I correct that I have to write my own waiter? Could I hijack/repurpose a waiter from elsewhere in the API, or are there undocumented waiters?
Question 2: How do I catch the error that occurs when a file system with that token already exists? And how do I get the id out of the error message to handle that case?
Question 3: Can tokens be reused once a file system is deleted (i.e. does AWS eventually clean up, or does that token persist)?
The reason I ask Q3 is that there are no Filter={} options in client.describe_file_systems(). So at present, I'm using a token containing a simple unique text handle to create and later retrieve an EFS unique to a customer. I could use a random UUID token and then tag it with the organisation name, but I can't retrieve based on a tag.
Question 4: Is that while loop robust? i.e. is there a circumstance in which AWS will perpetually return 'creating' status (which would throw me into an infinite loop)?
Thanks for any help!
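For what it's worth, here is a minimal sketch of how the find-or-create-then-wait pattern described above could be handled. It is not an authoritative answer to the questions: it assumes the FileSystemAlreadyExists error code and the CreationToken parameter of describe_file_systems, and the poll/timeout values are arbitrary placeholders:

import time
import boto3
from botocore.exceptions import ClientError

client = boto3.client("efs")  # credentials/region taken from your environment

def find_or_create_file_system(token, poll_seconds=5, timeout_seconds=600):
    """Create an EFS file system for `token`, or reuse the one that already owns it,
    then poll until it is 'available' (a hand-rolled waiter, since EFS has none)."""
    try:
        fs = client.create_file_system(CreationToken=token, PerformanceMode="generalPurpose")
        fs_id = fs["FileSystemId"]
    except ClientError as err:
        # FileSystemAlreadyExists is raised when the creation token is already in use.
        if err.response["Error"]["Code"] != "FileSystemAlreadyExists":
            raise
        # Look the existing file system up by its creation token rather than
        # digging the id out of the error payload.
        described = client.describe_file_systems(CreationToken=token)
        fs_id = described["FileSystems"][0]["FileSystemId"]

    # Poll with a deadline so a stuck 'creating' state cannot loop forever.
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        state = client.describe_file_systems(FileSystemId=fs_id)["FileSystems"][0]["LifeCycleState"]
        if state == "available":
            return fs_id
        if state in ("deleting", "deleted"):
            raise RuntimeError("File system {} is {}; pick a new creation token".format(fs_id, state))
        time.sleep(poll_seconds)
    raise TimeoutError("File system {} did not become available in time".format(fs_id))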
