Use .htaccess to complete the document extension

On my last server I had a .htaccess file that completed URLs without redirecting. I didn't quite understand how it was configured, but I am sure it was done in the .htaccess file.
All the documentation I have read so far only covers replacing a single file extension (e.g. .htm gets redirected to .html).
Let's say I have the following files on my server:
0: 001.jpg
1: 002.jpg
2: 002.png
3: staff.php
when I request example.com/001, file 0 is displayed, and no extension is shown in the address bar
when I request example.com/002, I get a 404 error, since there are two files starting with "002"
when I request example.com/002.j, I get image 1
when I request example.com/staff, I get the output of staff.php
How can I get .htaccess to do that?
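One common way to get extension-less URLs like this is Apache's MultiViews option (content negotiation via mod_negotiation). Whether that is what your old server actually used is an assumption, and it may not reproduce every detail you describe (such as the partial 002.j match), but a minimal .htaccess sketch would be:
# Ask Apache to complete the extension itself via content negotiation:
# a request for /001 is matched against 001.* in the same directory.
Options +MultiViews
Note that this only works if the server's configuration permits AllowOverride Options (or Options=MultiViews) for that directory; how ambiguous names like 002 are handled depends on the negotiation settings.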

Related

Downloading an image from the web and saving

I am trying to download an image from Wikipedia and save it to a file locally (using Python 3.9.x). Following this link I tried:
import urllib.request
http = 'https://en.wikipedia.org/wiki/Abacus#/media/File:Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
However, when I try to open this file (Mac OS) I get an error: The file “test.jpg” could not be opened. It may be damaged or use a file format that Preview doesn’t recognize.
I did some more search and came across this article which suggests modifying the User-Agent. Following that I modified the above code as follows:
import urllib.request
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0')]
urllib.request.install_opener(opener)
http = 'https://en.wikipedia.org/wiki/Abacus#/media/File:Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
However, modifying the User-Agent did NOT help and I still get the same error while trying to open the file: The file “test.jpg” could not be opened. It may be damaged or use a file format that Preview doesn’t recognize.
Another piece of information: the downloaded file (that does not open) is 235 KB. But if I download the image manually (Right Click -> Save Image As...) it is 455 KB.
I was wondering: what else am I missing? Thank you!
The problem is that you're downloading the web page itself and saving it with a .jpg extension.
The link you used is not a direct link to the photo; it points to a web page that contains the photograph.
That's why the photo is 455 KB while the file you're downloading is 235 KB.
Instead of this :
http = 'https://en.wikipedia.org/wiki/Abacus#/media/File:Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
Use this :
http = 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Abacus_4.jpg/800px-Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')
It is better to first open any photo you want to use with your browser's "Open image in new tab" option and then copy the URL from there.
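Putting the two snippets from the question together with the direct file URL, a minimal sketch (the User-Agent value is just an example; the default urllib one may be rejected):
import urllib.request

# Install an opener with a browser-like User-Agent, as in the question
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

# Use the direct file URL on upload.wikimedia.org, not the article/media viewer page
http = 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Abacus_4.jpg/800px-Abacus_4.jpg'
urllib.request.urlretrieve(http, 'test.jpg')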

Python Web Scraping - After Authentication - Traverse - Download from URL with no Extension

In Chrome, I will log in to a website.
I will then inspect the site, go to the Network tab, and clear out the existing entries.
I will then click on a link and grab the request header, which stores the cookie.
In Python, I then store the header in a dictionary and use it for the rest of the code.
import requests
from bs4 import BeautifulSoup as soup

def site_session(url, header_dict):
    session = requests.Session()
    t = session.get(url, headers=header_dict)
    return soup(t.text, 'html.parser')

site = site_session('https://website.com', headers)
# Scrape the site as usual until I reach a file I can't download...
It is a video file, but the link has no extension.
"https://website.sub.com/download"
Clicking on this link will open up the save dialog and I can save it. But not in Python..
Examining the Network, it appears it redirects to another url in which I was able to scrape.
"https://website.sub.com/download/file.mp4?jibberish-jibberish-jibberish
Trying to shorten it to just "https://website.sub.com/download/file.mp4" does not open.
with open(shortened_url, 'wb') as file:
    response = requests.get(shortened_url)
    file.write(response.content)
I've tried both the full url and shortened url and receive:
OSError: [Errno 22] Invalid argument.
Any help with this would be awesome!
Thanks!
I had to use the full URL with the query string and include the headers.
# Read the headers of the first URL to get the real file URL
fake_url = requests.head(url, allow_redirects=True, headers=headers)
# Get the real file URL
real_url = requests.get(fake_url.url, headers=headers)
# The file name is stripped out of the URL here, since this runs in a loop
# Open a file on disk and write the contents of real_url to it
with open(stripped_name, 'wb') as file:
    file.write(real_url.content)
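For context, a self-contained sketch of that flow; the url and headers values are placeholders, and the file name here is simply derived from the redirected URL's path (the answer does not show the exact stripping logic used in the original loop):
import os
from urllib.parse import urlsplit
import requests

headers = {'User-Agent': 'Mozilla/5.0', 'Cookie': 'copied-from-browser'}  # placeholder headers
url = 'https://website.sub.com/download'  # extension-less download link

# Follow the redirect without downloading the body, to discover the real file URL
head = requests.head(url, allow_redirects=True, headers=headers)

# Derive a local file name from the redirected URL's path, ignoring the query string
stripped_name = os.path.basename(urlsplit(head.url).path)  # e.g. 'file.mp4'

# Download using the full redirected URL, query string included
response = requests.get(head.url, headers=headers)
with open(stripped_name, 'wb') as file:
    file.write(response.content)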

Image download with Python

I'm trying to download images with Python 3.9.1.
Other than the first 2-3 images, all of the images are 1 KB in size. How do I download all of the pictures? Please help me.
Sample book: http://web2.anl.az:81/read/page.php?bibid=568450&pno=1
import urllib.request
import os

bibID = input("ID: ")
first = int(input("First page: "))
last = int(input("Last page: "))

if not os.path.exists(bibID):
    os.makedirs(bibID)

for i in range(first, last + 1):
    url = f"http://web2.anl.az:81/read/img.php?bibid={bibID}&pno={i}"
    urllib.request.urlretrieve(url, f"{bibID}/{i}.jpg")
It doesn't look like there is an issue with your script. It has to do with the APIs you are hitting and the sequence of calls required.
A GET http://web2.anl.az:81/read/img.php?bibid=568450&pno=<page> on its own doesn't work right away; instead, it returns "No FILE HERE".
The reason this happens is that retrieval of the images is tied to your cookie. You first need to initiate the read session that is created when you first visit the page and click the TƏSDİQLƏYIRƏM button.
From what I could tell you need to do the following:
POST http://web2.anl.az:81/read/page.php?bibid=568450 with a Content-Type: multipart/form-data body. It should have a single key/value pair of approve: TƏSDİQLƏYIRƏM - this starts a session and generates a cookie for you, which you then have to send with all of your API calls from now on.
E.g.
requests.post('http://web2.anl.az:81/read/page.php?bibid=568450', files=dict(approve='TƏSDİQLƏYIRƏM'))
Do the following in your for-loop over the pages (a sketch follows below):
a. GET http://web2.anl.az:81/read/page.php?bibid=568450&pno=<page number> - page won't show up if you don't do this first
b. GET http://web2.anl.az:81/read/img.php?bibid=568450&pno=<page number> - finally get the image!
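A minimal sketch of that sequence using requests.Session, so the cookie from step 1 is carried automatically (untested against the site; bib_id and the page range are placeholders):
import os
import requests

bib_id = "568450"
first, last = 1, 10  # placeholder page range

session = requests.Session()  # keeps the cookie set by the approval step

# Step 1: "press" the approval button; this starts the read session and sets the cookie
session.post(
    f"http://web2.anl.az:81/read/page.php?bibid={bib_id}",
    files=dict(approve="TƏSDİQLƏYIRƏM"),
)

os.makedirs(bib_id, exist_ok=True)

for page in range(first, last + 1):
    # Step 2a: open the page first, otherwise img.php returns "No FILE HERE"
    session.get(f"http://web2.anl.az:81/read/page.php?bibid={bib_id}&pno={page}")
    # Step 2b: now fetch the image with the same session cookie
    img = session.get(f"http://web2.anl.az:81/read/img.php?bibid={bib_id}&pno={page}")
    with open(f"{bib_id}/{page}.jpg", "wb") as f:
        f.write(img.content)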

requests.get raises IncompleteRead Error with redirecting url

So, I have a URL that redirects you. It works nicely with webbrowser.open, but requests raises an IncompleteRead error. I'm using Python 3.8 on Windows 10.
import requests

Id = '16977332'
D = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id={Id}&cmd=prlinks&retmode=ref'
re_dlink = requests.get(D)
run this, and...
raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)
UPDATE:
I tried the following and it returned 400 Bad Request. Maybe the URL is processed before it gets to the target URL.
import http.client
http.client.HTTPConnection._http_vsn = 10
I can get the redirected URL with Selenium, but I don't want to depend on an external executable. So I decided to take the long way and removed retmode=ref from the URL. Here you can find info about these URLs under the ELink header.
You've written Id as a str. Write it as an int instead, like this:
Id = 16977332
And it's done.
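For completeness, a sketch of the original snippet with just that change applied:
import requests

Id = 16977332  # int instead of the string '16977332'
D = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id={Id}&cmd=prlinks&retmode=ref'
re_dlink = requests.get(D)
print(re_dlink.status_code, re_dlink.url)  # final URL after any redirects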

Nutch not working on Linux environment

I've installed Nutch on a Linux system. When I go into the 'bin' directory and run ./nutch, it shows the following:
Usage: nutch COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
index run the plugin-based indexer on parsed segments and linkdb
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
clean remove HTTP 301 and 404 documents from indexing backends configured via plugins
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
The above output shows that Nutch is correctly installed on this system. But when I then run ./nutch crawlresult urls -dir crawl -depth 3, I get the following output:
./nutch: line 272: /usr/java/jdk1.7.0/bin/java: Success
whereas I was expecting Nutch to begin crawling and show the logs. Can you tell me what is wrong?
You can't run "crawlresult"; it isn't a command. You have to use one of the commands from the nutch script's command list above. If you want to crawl with Nutch, use this tutorial.
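For example, the one-step crawler listed in the usage output above would be invoked like this (it is marked deprecated in favour of the crawl script; the seed directory and depth are just the values from the question):
# deprecated one-step crawler, using the command name from the usage listing
./nutch crawl urls -dir crawl -depth 3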

Resources