Most site sources can be fetched with a simple request, usually via file_get_contents() or curl_init().
I've tried many combinations of stream_context_create() and curl_setopt(), and none returned anything other than 400 Bad Request.
Is there an explanation for why some sites (like https://phys.org/) do not return the source code via these methods?
Note: if you were able to get the source of the example (https://phys.org/) using file_get_contents() or curl_init(), or any other method with PHP, please post the code. Thanks.
Some websites validate whether the request comes from a real/allowed client (bot/user).
This can have multiple reasons.
Maybe bots are sending too many requests, or the specific site is behind a paywall/firewall. But there are many other people who can explain it better than me.
Here are some known examples of how they do it:
Some sites only accept requests with an API token.
Google's APIs are a great example.
Some sites validate the User-Agent header.
It looks like your example site is doing this.
When I send a custom User-Agent header, the result changes to an error.
And of course, some sites can check the user's IP address :)
I believe for your example there should be a good solution to get a result.
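For example, here is a minimal cURL sketch that sends a browser-like User-Agent. That this particular header is what phys.org checks is an assumption on my part; some sites also look at cookies, IP reputation, or JavaScript support.

<?php
// Sketch: fetch a page while presenting a browser-like User-Agent.
$ch = curl_init('https://phys.org/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36');
$html = curl_exec($ch);
if ($html === false) {
    echo 'cURL error: ' . curl_error($ch);
} else {
    echo substr($html, 0, 200);  // first part of the source, as a sanity check
}
curl_close($ch);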
I want to save the data column from this URL to an array in Python. I tried it with, for instance, pandas.read_table:
import pandas as pd
pd.read_table('https://adventofcode.com/2019/day/1/input', sep='')
but I get HTTPError: HTTP Error 400: Bad Request and I think this is not the right way to do that.
Can someone help me with that?
If you try to open the link in your question in a browser (using incognito mode or something similar, i.e. with your cookies deleted), you'll see that you need to log in to the website to access the page. This is why you're getting a 400 Bad Request error as a response from the server.
From the FAQ section of the website that you're trying to access:
How does authentication work? Advent of Code uses OAuth to confirm your identity through other services. When you log in, you only ever give your credentials to that service - never to Advent of Code. Then, the service you use tells the Advent of Code servers that you're really you. In general, this reveals no information about you beyond what is already public; here are examples from Reddit and GitHub. Advent of Code will remember your unique ID, names, URL, and image from the service you use to authenticate.
The website uses OAuth to handle logins, so the request you make will need the resulting access tokens. You can use a library like python-oauth2 to help you with this (there are others, so you can read around and decide which you'd like to use). Creating and understanding how to make HTTP requests is beyond the scope of this answer. I'd suggest you have a look around on the internet for some explanations and try again; if you get stuck, please ask another question. Otherwise it'll probably be easier to save the file from your browser... But I'll leave this answer here for the next person who runs into the same problem.
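As a starting point, here is a minimal sketch using the requests library with the session cookie your browser stores after you log in. The cookie name "session" and the placeholder value are assumptions; copy the real value from your browser's developer tools.

import io
import pandas as pd
import requests

# Assumed: the site identifies you by a "session" cookie set when you log in.
cookies = {'session': 'paste-your-session-cookie-value-here'}  # hypothetical placeholder
resp = requests.get('https://adventofcode.com/2019/day/1/input', cookies=cookies)
resp.raise_for_status()  # fail loudly instead of parsing an error page

# The input is one integer per line, so no custom separator is needed.
df = pd.read_table(io.StringIO(resp.text), header=None, names=['mass'])
print(df['mass'].to_numpy())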
I have searched all over the internet for an answer, and although I can find a million people with the same question, I cannot find an official solution to the problem I'm experiencing.
I always get "Cannot display preview. You can post as is, or try another link." displayed.
I've stripped a page down to only the required Open Graph meta tags so I know they work (run through multiple OG validators). I've disabled any kind of robots blocking and any kind of redirects, disabled the firewall on a test server, and made sure the LinkedIn bot requests are hitting the server. All I see in the browser console is a status 500 being returned from LinkedIn's preview generator API.
We are hosting on Windows Server with IIS 8.5. If I create a demo and host it somewhere else it works, which makes me think it is server related or an IIS setting.
Reading this post, "LinkedIn post's picture doesn't appear in summary", it seems like a similar issue. We are not serving over SSL, so it's nothing to do with that.
I have already asked this question on LinkedIn's forum but am having no luck, so I'm hoping someone on here can help, or someone from LinkedIn's tech team can help.
Thanks
So we had this issue as well, and it turns out parts of our system that use user-generated themes were not adding the "Content-Type" header to the response.
So examine the response headers coming from your server and make absolutely sure they are correct, and that they include the correct "Content-Type" (with correct encoding) and "Content-Length".
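One quick way to check, sketched in Python (the URL is a placeholder, and the exact User-Agent string the LinkedIn crawler sends is an assumption):

import requests

# Fetch the shared page roughly the way a crawler would and inspect the headers.
resp = requests.get(
    'https://example.com/your-shared-page',
    headers={'User-Agent': 'LinkedInBot/1.0'},  # assumed UA string; the real bot's may differ
)
print(resp.status_code)
print(resp.headers.get('Content-Type'))    # expect something like 'text/html; charset=utf-8'
print(resp.headers.get('Content-Length'))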
I recently began working with JavaScript and am looking at the various GET and POST requests one can send to a server.
For GET, as far as I know, all of the information of the query is contained in the URL that the user triggers. On the server side this has to be dissected to retrieve the necessary parameters.
I was just wondering how larger and more detailed requests are handled with this GET method. For instance, what if I had millions and millions of parameters that make up my whole request? Would they all be jumbled into the URL? Is there a limit on the number of unique URLs one can have? I read this post:
How do URL shorteners guarantee unique URLs when they don't expire?
I would really like some more input.
Thank You!
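To make the mechanics concrete, here is a small JavaScript sketch of how GET parameters ride in the URL and how they are parsed back out (the URL is a placeholder; in practice browsers and servers cap URL length at a few thousand characters, which is why large payloads go in a POST body instead):

// GET parameters are just text appended to the URL.
const params = new URLSearchParams({ q: 'search term', page: '2' });
const url = 'https://example.com/api/items?' + params.toString();
// -> https://example.com/api/items?q=search+term&page=2

// "Dissecting" the same URL to get the parameters back:
const parsed = new URL(url);
console.log(parsed.searchParams.get('q'));    // 'search term'
console.log(parsed.searchParams.get('page')); // '2'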
I know that Stack Overflow is not the correct place to post this question, but I already posted it at Server Fault and the place seems generally dead.
--
I noticed weird log entries (unless there's something I don't understand) in my IIS (7.5) logs.
It's an online dictionary with user-friendly URL rewriting, and most requests are GET. However, I noticed weird POST requests being made by someone who is trying to crawl our content (tens of thousands of such requests).
2013-11-09 20:39:27 GET /dict/mylang/word1 - y.y.y.y Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - 200 296
2013-11-09 20:39:29 GET /dict/mylang/word2 - z.z.z.z Mozilla/5.0+(iPhone;+CPU+iPhone+OS+6_0+like+Mac+OS+X)+AppleWebKit/536.26+(KHTML,+like+Gecko)+Version/6.0+Mobile/10A5376e+Safari/8536.25+(compatible;+Googlebot-Mobile/2.1;++http://www.google.com/bot.html) - 200 468
2013-11-09 20:39:29 POST /dict/mylang/word3 - x.x.x.x - - 200 2593
The first two requests are legitimate. As for the third request, I don't think I have allowed cross-domain POST, if that is what the third log line means.
All those POST requests take a long time, for reasons unknown to me. I would like to know how those POST requests are possible and how I can stop them.
p.s. I have masked the IPs on purpose.
Any help would be appreciated! Thank you in advance.
Blocking POST in general is not an option; I use AJAX extensively. I want to know how he makes this kind of POST request and how to stop him. I've got tens of thousands of requests; I constantly ban IP ranges through the firewall, but he just hops through proxies.
This is how a normal POST request (through AJAX) happens:
2013-11-10 10:16:54 POST /dict/mylang/displaySem.php - 85.73.156.122 Mozilla/5.0+(Windows+NT+6.1;+rv:25.0)+Gecko/20100101+Firefox/25.0 http://www.mydomain.com/dict/mylang/randomword 200 171
HTTP allows anyone to POST a request to your site. Your application (not IIS) should check whether it is a valid request before starting the long processing algorithm.
Some common validation methods are:
If you think he is directly POSTing to your site using an automated script, you could use a CAPTCHA to make it hard for him: http://en.wikipedia.org/wiki/CAPTCHA
If you think he is hijacking the session of other people, you can use a CSRF field in your form: http://en.wikipedia.org/wiki/Cross-site_request_forgery
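Since your endpoints are PHP (your log shows displaySem.php), here is a minimal sketch of the CSRF-token approach in modern PHP; the session-key and field names are illustrative, not from any framework:

<?php
// Sketch: issue a per-session token and require it on every AJAX POST.
// Replayed requests from a crawler that lacks the token are rejected early.
session_start();

if (empty($_SESSION['csrf_token'])) {
    $_SESSION['csrf_token'] = bin2hex(random_bytes(32));
}

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $sent = $_POST['csrf_token'] ?? '';
    if (!hash_equals($_SESSION['csrf_token'], $sent)) {
        http_response_code(403);  // reject before any expensive work
        exit('Invalid request');
    }
    // ... run the long processing only after the check passes ...
}

Your pages would embed $_SESSION['csrf_token'] in a hidden form field (or a JavaScript variable for AJAX) so legitimate requests can send it back.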
I want to know if it's possible to detect which website a user has come from and serve them different content based on which website they have just come from.
So if they've come from any other website on the internet and landed on my page, they will see my normal html and css page, but if they come from a specific website (this specific website would have also been developed by me so I have control over the code server-side and client-side) then I want them to see something slightly different.
It's a very small difference that I want them to see, and that's why I don't want to consider taking them to a different version of the website or a different page.
I'm also not sure whether this solution should be placed on the page they are coming from or the page they are arriving on.
Hope that's clear. Thanks!
I would add a URL parameter like http://example.com?source=othersite. This way you can easily adjust the parameter, and you can use JavaScript to detect it and slightly alter your landing page.
Otherwise, you can use the HTTP referrer sent via the browser to detect where they came from, but you would need to tell us your back end technology to get an example of that, as it differs a bit.
In JavaScript, you can do something as easy as
if (window.location.href.indexOf('source=othersite') !== -1) {
    // alter the DOM here
}
Or you can use a URL Parameter parser as suggested here: How to get the value from the GET parameters?
What you want is the Referer: HTTP header. It will give the URL of the page the user came from. Bear in mind that the Referer can easily be spoofed, so don't rely on it if security is an issue.
Browsers may disable the referer, though. Why not just use a URL parameter?
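For completeness, a client-side sketch that combines both approaches (the referring-site URL here is a hypothetical placeholder):

// document.referrer is empty when the browser suppresses it,
// so fall back to the URL parameter from the other answer.
const fromMyOtherSite =
    document.referrer.indexOf('https://my-other-site.example') === 0 ||
    new URLSearchParams(window.location.search).get('source') === 'othersite';

if (fromMyOtherSite) {
    // alter the DOM here
}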