On WARC-Type of entries in StormCrawler WARC files


Following an upgrade of our crawler from StormCrawler 1.8 to 1.14, we noticed that the record type of our WARC entries changed from "WARC-Type: response" to "WARC-Type: resource".
Any suggestions on how to switch back to "WARC-Type: response"?

Nothing has changed in the WARCRecordFormat between 1.8 and 1.14 - if there is a verbatim HTTP response header available, a response record is written. If there is no HTTP header, a WARC resource record is used instead.
In order to store the HTTP headers, the following configuration is required:
http.store.headers: true
http.protocol.implementation: com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
https.protocol.implementation: com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
More information is found in the README of the WARC module.
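To check which record types a crawl actually produced after changing the configuration, the WARC-Type headers can be counted. The following is only a minimal sketch, not part of StormCrawler: the function name count_warc_types is made up for illustration, and it assumes the WARC data is already uncompressed bytes in which every record carries a Content-Length header.

```python
import io
from collections import Counter


def count_warc_types(data: bytes) -> Counter:
    """Count WARC-Type values in uncompressed WARC bytes (hypothetical helper)."""
    counts = Counter()
    stream = io.BytesIO(data)
    while True:
        line = stream.readline()
        if not line:
            break  # end of data
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        # Read the WARC header block up to the first empty line.
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break
            name, _, value = raw.partition(b":")
            headers[name.strip().lower()] = value.strip()
        counts[headers.get(b"warc-type", b"unknown").decode()] += 1
        # Skip the record body so its content is never mistaken for headers.
        stream.seek(int(headers.get(b"content-length", b"0")), io.SEEK_CUR)
    return counts
```

For a gzip-compressed file, the bytes could be obtained with gzip.open(path, "rb").read() before calling the function.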


Python requests module SSLError(CertificateError("hostname 'x.x.x.x' doesn't match 'x.x.x.x")))

I'm using the "requests" module to get JSON from my web service, using the following code:
import requests
import ssl
# With or without the line below, the output is the same
ssl.match_hostname = lambda cert, hostname: True
response = requests.get("MY_URL", cert=("client.pem", "client-key.pem"), verify="CAcert.cer")
The SSL step fails with the following message:
HTTPSConnectionPool(host='x.x.x.x', port=443): Max retries exceeded with url: {WEBSERVICE_URL_PATTERN} (Caused by SSLError(CertificateError("hostname 'x.x.x.x' doesn't match 'x.x.x.x'")))
I'm using Python 3.10.5 with the latest version of the "requests" module.
Does anyone know what could cause this kind of error and how to fix it?

I assume you've redacted actual names which are in fact different, because if you really did have a host named x.x.x.x using a cert with the same name it would match (unless it wasn't really the same because the CA, or a potentially-bogus 'subject'/'subscriber', used lookalike characters).
From the documentation of match_hostname
Changed in version 3.7: The function is no longer used to TLS connections. Hostname matching is now performed by OpenSSL. ...
Deprecated since version 3.7.
At the ssl level, or with http.client or urllib.request, you can still turn off only hostname checking by setting check_hostname=False on the SSLContext. However, AFAIK requests doesn't give you access to the SSL level, except for setting the cert(s) as you do or the sledgehammer option of turning off all verification with verify=False.
If at all possible, you should try to use a hostname and a host cert that do match. Note that changing either the name you request or the cert can accomplish this.
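As a rough illustration of turning off only hostname checking at the ssl level, here is a minimal sketch. The helper name context_without_hostname_check is made up for this example, and the file names in the comment are the ones from the question.

```python
import ssl


def context_without_hostname_check(cafile=None):
    """Build a context that still verifies the chain but skips the hostname match."""
    # With cafile=None the system's default CA certificates are loaded.
    ctx = ssl.create_default_context(cafile=cafile)
    # Only the hostname comparison is disabled; verify_mode stays CERT_REQUIRED.
    ctx.check_hostname = False
    # A client certificate could be added with:
    # ctx.load_cert_chain("client.pem", "client-key.pem")
    return ctx
```

Such a context can be passed as the context argument of http.client.HTTPSConnection or urllib.request.urlopen. It weakens the protection TLS gives you, so a matching name and certificate remain the better fix.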

The problem was solved by using a Subject Alternative Name (SAN) for the server, with its own IP address as the value.
I found out that we use Simple-CA, and that the certificate signing request sent to it used only a Common Name (CN), since we don't have a domain name.
After changing the signing request to use a SAN instead of the CN, the problem was solved.
Thanks to everyone who helped!

Batch headers are not considered for individual requests with Cloud sdk 3.66

Ours is a dwc-based application, Master Data Proxy Service (MDPS).
We are getting an error because the required Dwc headers (dwc-tenant, dwc-subdomain, dwc-jwt, etc.)
are not being propagated to the individual request contexts from a batch request.
I did some debugging on this and here are my observations:
We create a destination with DwcHeaderProvider as header provider with the following code:
DefaultHttpDestination.builder(megacliteUri + MEGACLITE_VERSION + serviceBinding)
.keyStore(dwcUtil.getKeyStore())
.keyStorePassword("changeit")
.proxyType(ProxyType.INTERNET)
.headerProviders(new DwcHeaderProvider())
// The destination that Megaclite should use to perform the request
.header(Constants.DESTINATION_NAME, Constants.DEFAULT_DESTINATION_VALUE)
.build();
DwcHeaderProvider in turn gets all the relevant headers, including the dwc headers. But with the new version this is not happening.
I can see that internally the headers are fetched from a headerFacade, which in the previous versions used to be DefaultRequestHeaderFacade.
Now the facade is initialized as com.sap.cds.integration.cloudsdk.facade.CdsRequestHeaderFacade, which comes from the jar
com/sap/cds/cds-integration-cloud-sdk/1.23.0/cds-integration-cloud-sdk-1.23.0.jar
Can you look into this? It is a high-priority issue for us, since batch requests are not working at all and our UI relies on them.
Thanks,
Sachin

Update: Please use CAP 1.24 - the issue is fixed already.
Until the problem is solved and a proper fix is released, can you try a workaround?
Before instantiating the destination run the following code snippet:
import com.sap.cloud.sdk.cloudplatform.requestheader.RequestHeaderAccessor;
import com.sap.cloud.sdk.cloudplatform.requestheader.DefaultRequestHeaderFacade;
RequestHeaderAccessor.setHeaderFacade(new DefaultRequestHeaderFacade());

Add-PnPApp : The request message was already sent. Cannot send the same request message multiple times

I'm new to Azure DevOps. I keep receiving this error: "Add-PnPApp : The request message was already sent. Cannot send the same request message multiple times."
The Azure DevOps release fails because of the Add-PnPApp error with "...same request".
The build shows a version that changes my (old) version to a new version (gulp's version?).
I'm told that the cause could be the version starting with zero, because SharePoint doesn't like that. I can't seem to change the new version to 1.0.0.1 because it appears to be set in the gulp-file.js. Is there something else that I am missing?

Is it possible that you need the -Overwrite parameter in the Add-PnPApp command? Or might you need to increment the version between releases?
https://learn.microsoft.com/en-us/powershell/module/sharepoint-pnp/add-pnpapp?view=sharepoint-ps

In my case the same message
The request message was already sent. Cannot send the same request message multiple times
was misleading. I was trying to dynamically update an .sppkg package (which is in fact a ZIP file) during an automated deployment, but the file _rels/.rels was getting lost in the process (because of the bug "Unable to compress hidden files with Compress-Archive"), so the resulting package was corrupted.
Once I fixed the package by making sure the file _rels/.rels was kept, the deployment succeeded.
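One way to repack such a package without losing dot-files is to build the archive with a tool that does not skip them. The sketch below uses Python's zipfile module rather than Compress-Archive. The function name pack_directory is made up for illustration, and it assumes the package contents were extracted to a directory and that the output path lies outside that directory.

```python
import os
import zipfile


def pack_directory(src_dir, package_path):
    """Zip every file under src_dir, dot-files included, and return the stored names."""
    with zipfile.ZipFile(package_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # os.walk lists hidden files such as _rels/.rels like any other file.
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # ZIP entry names always use forward slashes.
                arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
                zf.write(full, arcname)
    # Re-open the archive so the caller can check what was actually stored.
    with zipfile.ZipFile(package_path) as zf:
        return sorted(zf.namelist())
```

Checking the returned names for "_rels/.rels" before deploying would catch a corrupted package early.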

Error 404: ProxyServlet: /activiti-explorer

I am trying to get started with Workflow for XPages; however, I am not able to get the activiti-explorer working. I am trying to set up the application named "Sample Application on Activiti" (it is included as one of the samples).
I followed these steps, as listed in the tutorial:
Installed Extension Library and H2 Database; and copied the active server plugins.
Installed the site.xml on client and server.
Started the h2 database.
Updated the activiti.workflow file
According to these steps, when I visit "myServer/activiti-explorer" a login page should appear, but instead I get the error mentioned above. The only solution I could think of was that the activiti-explorer.war file has to be placed somewhere in the Domino directory (similar to webapps in Tomcat, which works very well, by the way). I tried putting it in the "domino/data" directory, with no success.
Also, there is an open topic on the OpenNTF discussion tab of the same project, with no response yet. For reference, the link is:
http://www.openntf.org/main.nsf/project.xsp?r=project/Workflow%20for%20XPages/discussions/6B020B585420E70886257B6E004E5A32
Edit 1: After digging further, I found that Domino throws a different error the very first time I try to access the activiti-explorer, and on all subsequent requests it throws the error mentioned in the title. The console looks like this after the first request:
18-06-2014 12:17:17 HTTP JVM: Activiti Explore Started
18-06-2014 12:17:17 HTTP JVM: Web Container : com.ibm.pvc.internal.webcontainer.webapp.BundleDeployedModule : CWPWC0032E: Error occured while processing annotation for the servlet java.lang.ClassNotFoundException: org.activiti.explorer.servlet.ExplorerApplicationServlet.
18-06-2014 12:17:17 HTTP JVM: 1. The supported annotations are #DeclareRoles,#PreDestroy and #PostConstruct for servlets.
18-06-2014 12:17:17 HTTP JVM: 2. Check that the class resides in the proper package directory.
18-06-2014 12:17:17 HTTP JVM: 3. Check that the classname has been defined in the server using the proper case and fully qualified package.
18-06-2014 12:17:17 HTTP JVM: 4. Check that the class was transfered to the filesystem using a binary transfer mode.
18-06-2014 12:17:17 HTTP JVM: 5. Check that the class was compiled using the proper case (as defined in the class definition).
18-06-2014 12:17:17 HTTP JVM: 6. Check that the class file was not renamed after it was compiled. Thread[Thread-6,5,main]. For more detailed information, please consult error-log-0.xml located in C:/Program Files/IBM/Lotus/Domino/data/domino/workspace/logs
I understand it is a class-not-found exception, but I am not sure how to resolve it.
Any help would be appreciated. Thanks.

Max request length exceeded

I have a user receiving the following error in response to an ItemQueryRq with the QuickBooks Web Connector and IIS 7.
Version:
1.6
Message:
ReceiveResponseXML failed
Description:
QBWC1042: ReceiveResponseXML failed
Error message: There was an exception running the extensions specified in the config file. --> Maximum request length exceeded. See QWCLog for more details. Remember to turn logging on.
The log shows the prior request to be
QBWebConnector.SOAPWebService.ProcessRequestXML() : Response received from QuickBooks: size (bytes) = 3048763
In IIS 7, the max allowed content length is set to 30000000, so I'm not sure what I need to change to allow this response through. Can someone point me in the right direction?

Chances are, your web server is rejecting the Web Connector's HTTP request because you're trying to POST too much data to it. It's tough to tell for sure though, because it doesn't look like you have the Web Connector in VERBOSE mode, and you didn't really post enough of the log to be able to see the rest of what happened, and you didn't post the ItemQuery request you sent or an idea of how many items you're getting back in the response.
If I had to guess, you're sending a very generic ItemQueryRq to try to fetch ALL items, which has a high likelihood of returning A LOT of data, and thus having IIS reject the HTTP request.
Whenever you're fetching a large amount of data using the Web Connector, you should be using iterators. Iterators allow you to break up the result set into smaller chunks.
qbXML Iterator example
other qbXML examples
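To make the iterator idea concrete, here is a minimal sketch that builds the request for the first chunk and for follow-up chunks. It assumes the usual qbXML iterator attributes (iterator="Start" or "Continue", plus iteratorID) and the MaxReturned element, and that your qbXML version supports iterators on ItemQueryRq. The function name build_item_query and the chunk size are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_item_query(request_id, max_returned, iterator_id=None):
    """Return an ItemQueryRq that starts or continues an iterator (illustrative)."""
    attrs = {"requestID": str(request_id)}
    if iterator_id is None:
        # First request: ask QuickBooks to open a new iterator.
        attrs["iterator"] = "Start"
    else:
        # Later requests: continue with the iteratorID taken from the last response.
        attrs["iterator"] = "Continue"
        attrs["iteratorID"] = iterator_id
    rq = ET.Element("ItemQueryRq", attrs)
    # Cap the number of items per response so each HTTP POST stays small.
    ET.SubElement(rq, "MaxReturned").text = str(max_returned)
    return ET.tostring(rq, encoding="unicode")
```

The first call would be build_item_query(1, 100). Each response is then expected to carry an iteratorID and a remaining count, and you keep sending "Continue" requests until nothing remains.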

If you just need to determine whether an item exists in QB, you can simply add IncludeRetElement to your ItemQuery.
So you should post something like:
<ItemQueryRq requestID="55">
<FullName>Prepay Discount</FullName>
<IncludeRetElement>ListID</IncludeRetElement>
</ItemQueryRq>
Then, in the ItemQuery response, just check the status code. If it is 500, the item does not exist and you should push it into QB; if it is 0, the item exists.
This workaround will save plenty of bytes in your response.
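The status check described above could look roughly like this. It is a sketch that assumes the response contains an ItemQueryRs element with a statusCode attribute. The function name item_exists is made up, and any response XML you feed it while testing is your own sample rather than real QuickBooks output.

```python
import xml.etree.ElementTree as ET


def item_exists(response_xml):
    """Interpret statusCode as described above: 0 means found, 500 means not found."""
    root = ET.fromstring(response_xml)
    # iter() also matches the root element, so a bare ItemQueryRs works too.
    rs = next(root.iter("ItemQueryRs"), None)
    if rs is None:
        raise ValueError("no ItemQueryRs element in response")
    code = rs.get("statusCode")
    if code == "0":
        return True
    if code == "500":
        return False
    # Any other status is left to the caller to handle.
    raise ValueError("unexpected statusCode: %r" % code)
```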
