Download file from website directly into Linux directory - Python

If I click the button manually, the browser starts downloading a CSV file (2 GB) onto my computer, but I want to automate this.
This is the download link:
https://data.cityofnewyork.us/api/views/bnx9-e6tj/rows.csv?accessType=DOWNLOAD
The issue: when I use either the requests or the pandas library, it just hangs. I have no idea whether the file is being downloaded or not.
My goal is to:
Know whether the file is being downloaded, and
Have the CSV downloaded to a specified directory, e.g.
~/mydirectory
Can someone provide the code to do this?

Try this...
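import os
import requests
URL = "https://data.cityofnewyork.us/api/views/bnx9-e6tj/rows.csv?accessType=DOWNLOAD"
dest = os.path.expanduser("~/mydirectory/downloaded_file.csv")
# stream=True writes the 2 GB body in chunks instead of holding it all in memory
with requests.get(URL, stream=True) as response, open(dest, "wb") as f:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)
print('Download Complete')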
Or you could do it this way and have a progress bar ...
import wget
wget.download('https://data.cityofnewyork.us/api/views/bnx9-e6tj/rows.csv?accessType=DOWNLOAD')
The output will look like this:
11% [........ ] 73728 / 633847
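To address both goals from the question (seeing that the download is progressing, and choosing the target directory), here is a minimal sketch that streams the file in chunks and prints a running byte count. It uses the standard library's urllib instead of the requests or wget packages mentioned above, and the function name download_with_progress and its parameters are made up for this illustration. The total size is only shown if the server sends a Content-Length header, which is not guaranteed for this endpoint.

```python
import os
import urllib.request


def download_with_progress(url, dest_path, chunk_size=1024 * 1024):
    """Stream url to dest_path, printing bytes written so far; return the byte count."""
    dest_path = os.path.expanduser(dest_path)  # allow paths such as ~/mydirectory/file.csv
    written = 0
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        total = response.headers.get("Content-Length")  # None if the server omits it
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
            if total:
                print(f"{written} / {total} bytes")
            else:
                print(f"{written} bytes")
    return written
```

A call such as download_with_progress(URL, "~/mydirectory/downloaded_file.csv") would then write the file into the directory from the question, provided that directory already exists.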

Download xml file from the server with Python3

I am trying to download an XML file from a public data bank:
http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=xml
I tried to do it with requests:
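import requests
url = "http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=xml"
response = requests.get(url)
# Note: encoding only affects response.text; response.content is always raw bytes
response.encoding = 'utf-8'  # or response.apparent_encoding
print(response.content)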
and wget
import wget
wget.download(url, './my.xml')
But both approaches produce garbage instead of a correct file (it looks like a broken encoding, but I cannot fix it).
If I download the file via a web browser, I get a correct UTF-8 XML file.
What am I doing wrong in the code?
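One way to narrow this down is to look at the first few bytes of the response before assuming an encoding problem. The sketch below is a diagnostic only, under the assumption (not confirmed anywhere in the question) that the "mess" might be a compressed payload rather than mis-decoded text. The function name sniff_payload is made up for this example.

```python
def sniff_payload(data: bytes) -> str:
    """Guess the container format of a downloaded body from its leading bytes."""
    if data.startswith(b"PK\x03\x04"):
        return "zip"    # local file header signature of a ZIP archive
    if data.startswith(b"\x1f\x8b"):
        return "gzip"   # gzip magic number
    # Skip UTF-8 byte order mark bytes and leading whitespace before looking for '<'.
    text_start = data.lstrip(b"\xef\xbb\xbf \t\r\n")
    if text_start.startswith(b"<"):
        return "xml"
    return "unknown"
```

Calling sniff_payload(response.content) on the downloaded bytes tells you whether you are looking at XML at all. If it reports "zip", the standard library can open the archive from memory with zipfile.ZipFile(io.BytesIO(response.content)).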

How to get WKHTMLTOPDF working on Heroku?

I created a website that generates PDFs using pdfkit, and I know how to install wkhtmltopdf and set up the environment variable path on Windows. I managed to deploy my first website on Heroku, but now I get the error "No wkhtmltopdf executable found: "b''"" when trying to generate a PDF.
I have no idea how to install and set up wkhtmltopdf on Heroku, because this is the first time I am dealing with Linux.
I tried everything I could find before asking, but even following this question did not work for me:
Python 3 flask install wkhtmltopdf on heroku
If possible, please guide me step by step through installing and setting this up.
I followed all the resources I could find but couldn't make it work. Every time I get the same error.
I'm using Django 2 and Python 3.7.
This is what I get if I run heroku stack:
Available Stacks
cedar-14
container
heroku-16
* heroku-18
This is the error I get when generating the PDF:
No wkhtmltopdf executable found: "b''"
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf
My website works without any problem on localhost, so I assume I have done something wrong when installing wkhtmltopdf.
Thank you
It's non-trivial. If you want to avoid the headache described below, you can just use my service, api2pdf: https://github.com/api2pdf/api2pdf.python. Otherwise, if you want to work through it, see below.
1) Add this to your requirements.txt to install a special wkhtmltopdf pack for heroku as well as pdfkit.
git+git://github.com/johnfraney/wkhtmltopdf-pack.git
pdfkit==0.6.1
2) I created a pdf_manager.py in my flask app. In pdf_manager.py I have a method:
import os
import platform
import subprocess
import pdfkit

def _get_pdfkit_config():
    """wkhtmltopdf lives and functions differently depending on Windows or Linux. We
    need to support both since we develop on windows but deploy on Heroku.
    Returns:
        A pdfkit configuration
    """
    if platform.system() == 'Windows':
        return pdfkit.configuration(wkhtmltopdf=os.environ.get('WKHTMLTOPDF_BINARY', 'C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'))
    else:
        # Ask the shell where the binary lives; the result is a bytes path
        WKHTMLTOPDF_CMD = subprocess.Popen(['which', os.environ.get('WKHTMLTOPDF_BINARY', 'wkhtmltopdf')], stdout=subprocess.PIPE).communicate()[0].strip()
        return pdfkit.configuration(wkhtmltopdf=WKHTMLTOPDF_CMD)
The reason I have the platform statement in there is that I develop on a Windows machine and I have the local wkhtmltopdf binary on my PC. But when I deploy to Heroku, it runs in their Linux containers, so I need to detect which platform we're on before running the binary.
3) Then I created two more methods - one to convert a url to pdf and another to convert raw html to pdf.
def make_pdf_from_url(url, options=None):
    """Produces a pdf from a website's url.
    Args:
        url (str): A valid url
        options (dict, optional): for specifying pdf parameters like landscape
            mode and margins
    Returns:
        pdf of the website
    """
    return pdfkit.from_url(url, False, configuration=_get_pdfkit_config(), options=options)

def make_pdf_from_raw_html(html, options=None):
    """Produces a pdf from raw html.
    Args:
        html (str): Valid html
        options (dict, optional): for specifying pdf parameters like landscape
            mode and margins
    Returns:
        pdf of the supplied html
    """
    return pdfkit.from_string(html, False, configuration=_get_pdfkit_config(), options=options)
I use these methods to convert to PDF.
Just follow these steps to deploy a Django app (pdfkit) on Heroku:
Step 1: Add the following packages to the requirements.txt file
wkhtmltopdf-pack==0.12.3.0
pdfkit==0.6.0
Step 2: Add the lines below to views.py to set the path of the binary file
import os, sys, subprocess, platform
import pdfkit
if platform.system() == "Windows":
    pdfkit_config = pdfkit.configuration(wkhtmltopdf=os.environ.get('WKHTMLTOPDF_BINARY', 'C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'))
else:
    # Make the directory of the Python executable searchable, then locate the binary
    os.environ['PATH'] += os.pathsep + os.path.dirname(sys.executable)
    WKHTMLTOPDF_CMD = subprocess.Popen(['which', os.environ.get('WKHTMLTOPDF_BINARY', 'wkhtmltopdf')],
                                       stdout=subprocess.PIPE).communicate()[0].strip()
    pdfkit_config = pdfkit.configuration(wkhtmltopdf=WKHTMLTOPDF_CMD)
Step 3: Then pass pdfkit_config as an argument, as below
pdf = pdfkit.from_string(html, False, options, configuration=pdfkit_config)
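Both answers above shell out to which and pass whatever it prints to pdfkit. If the binary is not on PATH, which prints nothing and the captured output is the empty bytes string b'', which is consistent with the b'' shown in the error from the question. The sketch below does the same lookup with the standard library's shutil.which and fails early with a clearer message. The function name find_wkhtmltopdf is made up for this example, and the WKHTMLTOPDF_BINARY variable is simply reused from the answers above; this is one possible variant, not what either answer actually uses.

```python
import os
import shutil


def find_wkhtmltopdf(env_var="WKHTMLTOPDF_BINARY", default="wkhtmltopdf"):
    """Return the path of the wkhtmltopdf binary, or raise if it cannot be found."""
    # The environment variable may hold either a bare command name or a full path.
    candidate = os.environ.get(env_var, default)
    path = shutil.which(candidate)
    if path is None:
        raise FileNotFoundError(f"{candidate!r} was not found on PATH; is it installed?")
    return path
```

The result can then be handed to pdfkit.configuration(wkhtmltopdf=find_wkhtmltopdf()), so a missing binary surfaces at configuration time rather than when the first PDF is generated.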

Creating a Spark RDD from a file located in Google Drive using Python on Colab.Research.Google

I have successfully run a Python 3 / Spark 2.2.1 program on Google's Colab.Research platform:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
!tar xf spark-2.2.1-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
This works perfectly when I upload text files from my local computer to the Unix VM using
from google.colab import files
datafile = files.upload()
and read them as follows :
textRDD = spark.read.text('hobbit.txt').rdd
So far so good.
My problem starts when I try to read a file that lies in my Google Drive Colab directory.
Following the instructions, I authenticated the user and created a drive service
from google.colab import auth
auth.authenticate_user()
from googleapiclient.discovery import build
drive_service = build('drive', 'v3')
after which I have been able to access the file lying in the drive as follows :
file_id = '1RELUMtExjMTSfoWF765Hr8JwNCSL7AgH'
import io
from googleapiclient.http import MediaIoBaseDownload
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()
downloaded.seek(0)
print('Downloaded file contents are: {}'.format(downloaded.read()))
Downloaded file contents are: b'The king beneath the mountain\r\nThe king of ......
Even this works perfectly:
downloaded.seek(0)
print(downloaded.read().decode('utf-8'))
and gets the data
The king beneath the mountain
The king of carven stone
The lord of silver fountain ...
Where things finally go wrong is when I try to grab this data and put it into a Spark RDD
downloaded.seek(0)
tRDD = spark.read.text(downloaded.read().decode('utf-8'))
and I get the error:
AnalysisException: 'Path does not exist: file:/content/The king beneath the mountain\ ....
Evidently, I am not using the correct method / parameters to read the file into spark. I have tried quite a few of the methods described
I would be very grateful if someone can help me figure out how to read this file for subsequent processing.
A complete solution to this problem is available in another StackOverflow question that is available at this URL.
Here is the notebook where this solution is demonstrated.
I have tested it and it works!
It seems that spark.read.text expects a file name. But you give it the file content instead. You can try either of these:
save it to a file then give the name
use just downloaded instead of downloaded.read().decode('utf-8')
You can also simplify downloading from Google Drive with pydrive. I gave an example here.
https://gist.github.com/korakot/d56c925ff3eccb86ea5a16726a70b224
Downloading is just
fid = drive.ListFile({'q':"title='hobbit.txt'"}).GetList()[0]['id']
f = drive.CreateFile({'id': fid})
f.GetContentFile('hobbit.txt')
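The first option suggested above (save the bytes to a file, then give Spark the file name) can be sketched as follows. The helper name save_buffer_to_file and the target path /content/hobbit.txt are made up for this example; downloaded and spark are the objects from the question.

```python
import io


def save_buffer_to_file(buffer: io.BytesIO, path: str) -> str:
    """Write the full contents of an in-memory buffer to path and return the path."""
    with open(path, "wb") as f:
        # getvalue() returns everything in the buffer, regardless of the read position.
        f.write(buffer.getvalue())
    return path


# Usage in the notebook from the question (not executed here):
#   path = save_buffer_to_file(downloaded, "/content/hobbit.txt")
#   tRDD = spark.read.text(path).rdd
```

This keeps spark.read.text working on a path, which is what the "Path does not exist" error indicates it expects.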

Is there any way to extract a rar file on cpanel

I have a website script; it is 212 MB and in RAR format. I could not upload it via FileZilla FTP: it gave me a timeout error after some time. I could not upload it from the cPanel file manager either, as it kept showing an error. Then I used a PHP script to upload it directly from a link, but now I cannot extract it because it is RAR, not ZIP. I converted the RAR into a ZIP and have it on Dropbox and Google Drive, but there is no direct link that I can use to upload it via the PHP script. So, is there any way to extract the RAR file from cPanel, using a PHP script, or with some other tweak? I have been working on it for 2 hours now and cannot find a way around it.
Create a PHP file and extract the .rar with it, using the following code (the RarArchive class comes from the PECL rar extension, which must be installed on the host):
$archive = RarArchive::open('archive.rar');
$entries = $archive->getEntries();
foreach ($entries as $entry) {
    $entry->extract('/extract/to/this/path');
}
$archive->close();

opkg-cl update 2 download error

I am trying to update using opkg-cl and am getting the following errors. Does anyone know how I go about troubleshooting this?
[root#wrap /root]$ /etc/opkg/opkg_update.sh
Downloading /Packages.gz.
Downloading file:///mnt/usb/packages/Packages.gz.
Downloading https://beacon-repo.shoppertrak.com/repos/stable/Packages.gz.
Inflating https://beacon-repo.shoppertrak.com/repos/stable/Packages.gz.
Updated list of available packages in /var/lib/opkg/lists/all-remote-shoppertrak.
Downloading https://beacon-repo.shoppertrak.com/repos/base/Packages.gz.
Inflating https://beacon-repo.shoppertrak.com/repos/base/Packages.gz.
Updated list of available packages in /var/lib/opkg/lists/all-remote-base.
Collected errors:
* opkg_download: Failed to download /Packages.gz: URL using bad/illegal format or missing URL.
* copy_file: ///mnt/usb/packages/Packages.gz: No such file or directory.
* file_copy: Failed to copy file ///mnt/usb/packages/Packages.gz to /tmp/opkg-8FAiHb/update-iCH5Eo/all-local.gz.
[root#wrap /root]$ ls /mnt/usb/
[root#wrap /root]$
[root#wrap /root]$
Could you provide more information on your configuration?
Such as:
the opkg version
the content of your opkg feed config files (/etc/opkg/*.conf)
At first glance it looks like you have a local feed configured at file:///mnt/usb/packages/, which is lacking a Packages.gz file.
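The check described above can be automated with a small script that reads the feed configuration and reports every local (file://) feed whose Packages.gz is missing. This is a minimal sketch, assuming feed lines follow the `src/gz <name> <url>` form; the function name missing_local_feeds is made up for this example, and the actual config on the device in question has not been shown.

```python
import os
from urllib.parse import urlparse


def missing_local_feeds(conf_text):
    """Return (name, index_path) for each file:// feed whose Packages.gz is absent."""
    missing = []
    for line in conf_text.splitlines():
        parts = line.split()
        # Assumed feed line layout: src/gz <name> <url>; anything else is skipped.
        if len(parts) != 3 or parts[0] != "src/gz":
            continue
        name, url = parts[1], parts[2]
        parsed = urlparse(url)
        if parsed.scheme != "file":
            continue  # remote feeds are not checked here
        index_path = os.path.join(parsed.path, "Packages.gz")
        if not os.path.isfile(index_path):
            missing.append((name, index_path))
    return missing
```

Feeding it the concatenated contents of /etc/opkg/*.conf would list the local feeds that cannot be updated, for example because the USB drive is not mounted (the empty ls /mnt/usb/ output in the question points in that direction).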
