How can I make web resources available offline? - linux

There is a folder on my Linux PC which contains a website (webpages etc.). The webpages and other complementary files in the folder use CDNs to bring in resources like jQuery, DataTables, etc.
I want to make these resources available offline. I know I could manually search all files for occurrences of "http", download the files from those URLs into the folder, and change the source paths accordingly. But since there are so many files this seems troublesome, so I want to ask whether there is a better, more elegant way of doing this. Thanks in advance.

I made a Python script to do the job:
import re
import os
import aiohttp
import asyncio
import pathlib
import string
import random
import chardet

# Decode a byte sequence with chardet to avoid decode/"Type" errors
def decode_bytes(byte_sequence):
    result = chardet.detect(byte_sequence)
    encoding = result['encoding'] or 'utf-8'
    return byte_sequence.decode(encoding)

VALID_URL_REGEX = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

# Downloader. I have lazily used resp.status as the only success criterion,
# which has logical gaps; you can add further checks.
async def download_file(session, url, local_path):
    async with session.get(url, allow_redirects=True, ssl=False) as resp:
        if resp.status == 200:
            print("Content path is " + str(local_path))
            with open(local_path, "wb") as f:
                while True:
                    chunk = await resp.content.read(4196)
                    if not chunk:
                        break
                    f.write(chunk)

# Maps each downloaded URL to its local path, to avoid redownloading
downloaded_urls = {}

async def process_file(file_path, session):
    print("File during read " + str(file_path))
    with open(file_path, "rb") as f:
        raw = f.read()
    try:
        contents = decode_bytes(raw)
    except (UnicodeDecodeError, TypeError) as e:
        print(f"Error decoding file {file_path}: {e}")
        return
    urls = re.findall(VALID_URL_REGEX, contents)
    try:
        for url in urls:
            file_name = url.split("/")[-1]
            if len(file_name) == 0:
                continue
            if url in downloaded_urls:
                local_path = downloaded_urls[url]
            else:
                # prepend a random string so different URLs with the same
                # file name do not overwrite each other
                res = ''.join(random.choices(string.ascii_uppercase + string.digits, k=5))
                local_path = os.path.join("downloaded", res + file_name)
                if not os.path.exists(local_path):
                    await download_file(session, url, local_path)
                downloaded_urls[url] = local_path
            contents = contents.replace(url, local_path)
    except Exception as e:
        print(f"Error while processing URLs in {file_path}: {e}")
    print("File during write " + str(file_path))
    with open(file_path, "w", encoding="utf-8", errors="ignore") as f:
        f.write(contents)

async def process_directory(directory):
    if not os.path.exists("downloaded"):
        os.makedirs("downloaded")
    conn = aiohttp.TCPConnector(limit=2200, limit_per_host=20, ttl_dns_cache=22)
    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = []
        for filepath in pathlib.Path(directory).glob('**/*'):
            fp = filepath.absolute()
            if str(fp).endswith(".md") or str(fp).endswith(".txt"):
                continue
            if os.path.isfile(fp):
                tasks.append(process_file(fp, session))
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    directory = input("Enter root directory: ")
    asyncio.run(process_directory(directory))
I will also try a "substitution" based approach (re.sub) and update the answer accordingly.
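As a sketch of what that substitution approach might look like (just an outline, not yet part of the script above): re.sub with a callback can rewrite every matched URL in a single pass instead of calling str.replace once per URL. The function name and the url_to_local mapping are placeholders.

import re

VALID_URL_REGEX = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

def rewrite_urls(contents, url_to_local):
    # url_to_local maps each remote URL to the local path it was downloaded to
    def replace(match):
        url = match.group(0)
        # fall back to the original URL if it was never downloaded
        return url_to_local.get(url, url)
    return VALID_URL_REGEX.sub(replace, contents)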

Related

Python-asyncio and subprocess deployment on IIS: returning HTTP response without running another script completely

I'm facing an issue creating real-time status updates for merging new datasets with old ones and for machine learning model creation results via a web framework. The task consists of the following simple steps:
A user/client sends a new dataset as a .CSV file to the server,
On the server side my Windows machine receives the file and sends an acknowledgement,
The new dataset is merged with the old one for new machine learning model creation, and
Another Python script is run (the one that creates a new sequential deep-learning model). Only after that script completes successfully should my code return the response to the client.
I have deployed my Python Flask application on IIS 10. To run the other Python script, the main Flask API script has to wait until the model creation script completes. That script contains several steps such as loading datasets, tokenizing, one-hot encoding, padding, model training for 100 epochs, and finally prediction results.
My exact goal is that this Flask API should wait until the entire process completes. I'm sure it takes 8-9 minutes to complete the whole script invoked in subprocess.run(). While testing this code in development mode it works excellently without any issues, but in production mode on IIS it does not wait for the whole process and returns the response to the client within 6-7 seconds.
For debugging purposes I added logging to record all events in both the Flask script and the machine learning model creation script. Through that I found that the model creation script only ran about 10%. First I tried simple approaches with async def and await around subprocess.run(), which didn't help. Then I added threading, get_event_loop() and run_until_complete() to make the parent code wait until the whole process finishes, but I still couldn't find a working solution. Please let me know what I did wrong. Thank you.
Configurations:
Python 3.7.9
Windows Server 2019, and
IIS 10.0 Express
My code:
import os
import time
import glob
import subprocess
import pandas as pd
from flask import Flask, request, jsonify
from werkzeug.utils import secure_filename
from datetime import datetime
import logging
import asyncio
from concurrent.futures import ThreadPoolExecutor

ALLOWED_EXTENSIONS = {'csv', 'xlsx'}
_executor = ThreadPoolExecutor(1)

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = "C:\\inetpub\\wwwroot\\iAssist_IT_support\\New_IT_support_datasets"
currentDateTime = datetime.now()
filenames = None

logger = logging.getLogger(__name__)
app.logger.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s:%(name)s:%(message)s')
file_handler = logging.FileHandler('model-creation-status.log')
file_handler.setFormatter(formatter)
# stream_handler = logging.StreamHandler()
# stream_handler.setFormatter(formatter)
app.logger.addHandler(file_handler)
# app.logger.addHandler(stream_handler)

def allowed_file(filename):
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

@app.route('/file_upload')
def home():
    return jsonify("Hello, This is a file-upload API, To send the file, use http://13.213.81.139/file_upload/send_file")

@app.route('/file_upload/status1', methods=['POST'])
def upload_file():
    app.logger.debug("/file_upload/status1 is executing")
    # check if the post request has the file part
    if 'file' not in request.files:
        app.logger.debug("No file part in the request")
        response = jsonify({'message': 'No file part in the request'})
        response.status_code = 400
        return response
    file = request.files['file']
    if file.filename == '':
        app.logger.debug("No file selected for uploading")
        response = jsonify({'message': 'No file selected for uploading'})
        response.status_code = 400
        return response
    if file and allowed_file(file.filename):
        filename = secure_filename(file.filename)
        file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
        print(filename)
        print(file)
        app.logger.debug("Spreadsheet received successfully")
        response = jsonify({'message': 'Spreadsheet uploaded successfully'})
        response.status_code = 201
        return response
    else:
        app.logger.debug("Allowed file types are csv or xlsx")
        response = jsonify({'message': 'Allowed file types are csv or xlsx'})
        response.status_code = 400
        return response

@app.route('/file_upload/status2', methods=['POST'])
def status1():
    global filenames
    app.logger.debug("file_upload/status2 route is executed")
    if request.method == 'POST':
        # Get data in json format
        if request.get_json():
            filenames = request.get_json()
            app.logger.debug(filenames)
            filenames = filenames['data']
            # print(filenames)
            folderpath = glob.glob('C:\\inetpub\\wwwroot\\iAssist_IT_support\\New_IT_support_datasets\\*.csv')
            latest_file = max(folderpath, key=os.path.getctime)
            # print(latest_file)
            time.sleep(3)
            if filenames in latest_file:
                df1 = pd.read_csv("C:\\inetpub\\wwwroot\\iAssist_IT_support\\New_IT_support_datasets\\" +
                                  filenames, names=["errors", "solutions"])
                df1 = df1.drop(0)
                # print(df1.head())
                df2 = pd.read_csv("C:\\inetpub\\wwwroot\\iAssist_IT_support\\existing_tickets.csv",
                                  names=["errors", "solutions"])
                combined_csv = pd.concat([df2, df1])
                combined_csv.to_csv("C:\\inetpub\\wwwroot\\iAssist_IT_support\\new_tickets-chatdataset.csv",
                                    index=False, encoding='utf-8-sig')
                time.sleep(2)
                # return redirect('/file_upload/status2')
                return jsonify('New data merged with existing datasets')

@app.route('/file_upload/status3', methods=['POST'])
def status2():
    app.logger.debug("file_upload/status3 route is executed")
    if request.method == 'POST':
        # Get data in json format
        if request.get_json():
            message = request.get_json()
            message = message['data']
            app.logger.debug(message)
            return jsonify("New model training is in progress don't upload new file")

@app.route('/file_upload/status4', methods=['POST'])
def model_creation():
    app.logger.debug("file_upload/status4 route is executed")
    if request.method == 'POST':
        # Get data in json format
        if request.get_json():
            message = request.get_json()
            message = message['data']
            app.logger.debug(message)
            app.logger.debug(currentDateTime)

            def model_run():
                app.logger.debug("model script starts to run")
                subprocess.run("python C:\\.....\\IT_support_chatbot-master\\"
                               "Python_files\\main.py", shell=True)
                # time.sleep(20)
                app.logger.debug("script ran successfully")

            async def subprocess_call():
                # run blocking function in another thread,
                # and wait for its result:
                app.logger.debug("sub function execution starts")
                await loop.run_in_executor(_executor, model_run)

            asyncio.set_event_loop(asyncio.SelectorEventLoop())
            loop = asyncio.get_event_loop()
            loop.run_until_complete(subprocess_call())
            loop.close()
            return jsonify("Model created successfully for sent file %s" % filenames)

if __name__ == "__main__":
    app.run()
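For what it's worth, subprocess.run() already blocks the calling thread until the child process exits, so the extra event loop and executor in the status4 handler do not add any waiting. Below is a minimal sketch of the simplified handler I would try first; it reuses app, request, app.logger and filenames from the code above and keeps the original script path. Whether a request then survives 8-9 minutes under IIS depends on the FastCGI activity/request timeout settings, which may need to be raised separately.

import subprocess
from flask import jsonify, request

@app.route('/file_upload/status4', methods=['POST'])
def model_creation():
    app.logger.debug("file_upload/status4 route is executed")
    if request.get_json():
        message = request.get_json()['data']
        app.logger.debug(message)
    # subprocess.run blocks until main.py exits; no event loop or executor needed
    completed = subprocess.run(
        "python C:\\.....\\IT_support_chatbot-master\\Python_files\\main.py",
        shell=True)
    app.logger.debug("model script exited with return code %s", completed.returncode)
    return jsonify("Model created successfully for sent file %s" % filenames)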

how to download and iterate over csv file

I'm trying to download and iterate over a CSV file, but I'm only reading the headers and no further lines after them.
I tried using this answer but with no luck.
This is my code:
from datetime import datetime
import requests
import csv

def main():
    print("python main function")
    datetime_object = datetime.now().date()
    url = f'https://markets.cboe.com/us/equities/market_statistics/volume_reports/day/{datetime_object}/csv/?mkt=bzx'
    print(url)
    response = requests.get(url, stream=True)
    csv_content = response.content.decode('utf-8')
    print(csv_content)
    cr = csv.reader(csv_content.splitlines(), delimiter='~')
    my_list = list(cr)
    for row in my_list:
        print(row)

if __name__ == '__main__':
    main()
cr = csv.reader(csv_content.splitlines(), delimiter='~')
change to
cr = csv.reader(csv_content.splitlines(), delimiter=',')
And check whether you downloaded the full file or only the header by opening the URL in a browser ;)
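If the endpoint does return the full file, a streaming variant is also possible: requests can hand decoded lines straight to csv.reader without holding the whole body in memory. A small sketch, assuming the same URL construction as in the question is passed in (iter_csv_rows is just a placeholder name):

import csv
import requests

def iter_csv_rows(url):
    # stream the response and feed decoded lines straight into csv.reader
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        lines = response.iter_lines(decode_unicode=True)
        for row in csv.reader(lines, delimiter=','):
            yield row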

Writing to same file(s) with multiprocessing (avoid lock)

I'm running a script on multiple CSV files using multiprocessing.
If a line matches the regex, the line is written to (a) new file(s) (the new file name equals the match).
I've noticed a problem writing to the same file(s) from different processes (file lock). How can I fix this?
My code:
import re
import glob
import os
import multiprocessing

pattern = 'abc|def|ghi|jkl|mno'
regex = re.compile(pattern, re.IGNORECASE)

def process_files(file):
    res_path = r'd:\results'
    with open(file, 'r+', buffering=1) as ifile:
        for line in ifile:
            matches = set(regex.findall(line))
            for match in matches:
                res_file = os.path.join(res_path, match + '.csv')
                with open(res_file, 'a') as rf:
                    rf.write(line)

def main():
    p = multiprocessing.Pool()
    for file in glob.iglob(r'D:\csv_files\**\*.csv', recursive=True):
        p.apply_async(process_files, [file])
    p.close()
    p.join()

if __name__ == '__main__':
    main()
Thanks in advance!
Make the filename unique for each subprocess:

def process_files(file, id):
    res_path = r'd:\results'
    with open(file, 'r') as ifile:
        for line in ifile:
            matches = set(regex.findall(line))
            for match in matches:
                filename = "{}_{}.csv".format(match, id)
                res_file = os.path.join(res_path, filename)
                with open(res_file, 'a') as rf:
                    rf.write(line)

def main():
    p = multiprocessing.Pool()
    for id, file in enumerate(glob.iglob(r'D:\csv_files\**\*.csv', recursive=True)):
        p.apply_async(process_files, [file, id])
    p.close()
    p.join()

Then you will have to add some code to consolidate the different "<match>_<id>.csv" files into single "<match>.csv" files.
Concurrent writes to the same file are something you want to avoid: either you have no file locks and end up with corrupted data, or you have file locks and they slow down the process, which defeats the whole point of parallelizing it.
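A rough sketch of that consolidation step, assuming the per-process files follow the "<match>_<id>.csv" naming from the snippet above and live in d:\results (the consolidate function name is just a placeholder):

import glob
import os
import re

def consolidate(res_path=r'd:\results'):
    # group the per-process files by the match name before the "_<id>" suffix
    for part in glob.glob(os.path.join(res_path, '*_*.csv')):
        m = re.match(r'(.+)_\d+\.csv$', os.path.basename(part))
        if not m:
            continue
        target = os.path.join(res_path, m.group(1) + '.csv')
        # append the partial file to the consolidated file, then remove it
        with open(target, 'a') as out, open(part, 'r') as src:
            out.write(src.read())
        os.remove(part)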

How can I increase the speed of this python requests session?

I am using Anaconda - Python 3.5.2
I have a list of 280,000 URLs.
I am grabbing the data and trying to keep track of the URL-to-data mapping.
I've made about 30K requests, averaging 1 request per second.
import requests
import pandas as pd
from pandas.io.json import json_normalize  # pd.json_normalize in newer pandas

response_df = pd.DataFrame()
# create the session
with requests.Session() as s:
    # loop through the list of urls
    for url in url_list:
        # call the resource
        resp = s.get(url)
        # check the response
        if resp.status_code == requests.codes.ok:
            # create a new dataframe with the response
            ftest = json_normalize(resp.json())
            ftest['url'] = url
            response_df = response_df.append(ftest, ignore_index=True)
        else:
            print("Something went wrong! Hide your wife! Hide the kids!")
response_df.to_csv(results_csv)
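One thing worth noting independently of the HTTP library: response_df.append() inside the loop copies the whole frame on every iteration, which gets slower as the frame grows. A minimal sketch of collecting the rows first and building the DataFrame once, under the assumption that each response body is a single JSON object (same url_list and results_csv names as above):

import requests
from pandas.io.json import json_normalize  # pd.json_normalize in newer pandas

records = []
with requests.Session() as s:
    for url in url_list:
        resp = s.get(url)
        if resp.status_code == requests.codes.ok:
            data = resp.json()
            data['url'] = url
            records.append(data)

# build the DataFrame once instead of appending row by row
response_df = json_normalize(records)
response_df.to_csv(results_csv)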
I ended up ditching requests and used asyncio and aiohttp instead. I was pulling about 1 request per second with requests; the new method averages about 5 per second and only uses about 20% of my system resources. I ended up using something very similar to this:
https://www.blog.pythonlibrary.org/2016/07/26/python-3-an-intro-to-asyncio/
import aiohttp
import asyncio
import async_timeout
import os

async def download_coroutine(session, url):
    with async_timeout.timeout(10):
        async with session.get(url) as response:
            filename = os.path.basename(url)
            with open(filename, 'wb') as f_handle:
                while True:
                    chunk = await response.content.read(1024)
                    if not chunk:
                        break
                    f_handle.write(chunk)
            return await response.release()

async def main(loop):
    urls = ["http://www.irs.gov/pub/irs-pdf/f1040.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040a.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040ez.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040es.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040sb.pdf"]
    async with aiohttp.ClientSession(loop=loop) as session:
        for url in urls:
            await download_coroutine(session, url)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))
also, this was helpful:
https://snarky.ca/how-the-heck-does-async-await-work-in-python-3-5/
http://www.pythonsandbarracudas.com/blog/2015/11/22/developing-a-computational-pipeline-using-the-asyncio-module-in-python-3
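Since the original goal was keeping a URL-to-data mapping for JSON responses rather than saving PDFs, here is a rough sketch of the same idea that gathers all requests concurrently and caps concurrency with a semaphore. The names fetch_json and fetch_all and the concurrency value are placeholders, not from the original code.

import asyncio
import aiohttp

async def fetch_json(session, sem, url):
    # the semaphore caps the number of requests in flight at once
    async with sem:
        async with session.get(url) as resp:
            if resp.status == 200:
                return url, await resp.json()
            return url, None

async def fetch_all(urls, concurrency=20):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_json(session, sem, u) for u in urls]
        # gather preserves order and returns (url, data) pairs
        return dict(await asyncio.gather(*tasks))

# usage: url_to_data = asyncio.get_event_loop().run_until_complete(fetch_all(url_list))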

segmentation fault in python3

I am running python3 on an Ubuntu machine and have noticed that the following block of code is fickle. Sometimes it runs just fine; other times it produces a segmentation fault. I don't understand why. Can someone explain what might be going on?
Basically, the code tries to read the list of S&P 500 companies from Wikipedia and write the list of tickers to a file in the same directory as the script. If no connection to Wikipedia can be established, the script instead tries to read an existing list from file.
from urllib import request
from urllib.error import URLError
from bs4 import BeautifulSoup
import os
import pickle
import dateutil.relativedelta as dr
import sys

sys.setrecursionlimit(100000)

def get_standard_and_poors_500_constituents():
    fname = (
        os.path.abspath(os.path.dirname(__file__)) + "/sp500_constituents.pkl"
    )
    try:
        # URL request, URL opener, read content.
        req = request.Request(
            "http://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
        )
        opener = request.urlopen(req)
        # Convert bytes to UTF-8.
        content = opener.read().decode()
        soup = BeautifulSoup(content, "lxml")
        # HTML table we actually need is the first.
        tables = soup.find_all("table")
        external_class = tables[0].findAll("a", {"class": "external text"})
        c = [ext.string for ext in external_class if not "reports" in ext]
        with open(fname, "wb") as f:
            pickle.dump(c, f)
    except URLError:
        with open(fname, "rb") as f:
            c = pickle.load(f)
    finally:
        return c

sp500_constituents = get_standard_and_poors_500_constituents()
spdr_etf = "SPY"
sp500_index = "^GSPC"

def main():
    X = get_standard_and_poors_500_constituents()
    print(X)

if __name__ == "__main__":
    main()
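Not a definitive diagnosis, but one likely culprit given the raised recursion limit: ext.string is a BeautifulSoup NavigableString that keeps a reference back to the parse tree, so pickle.dump(c, f) can recurse very deeply, and with sys.setrecursionlimit(100000) that recursion can overflow the C stack and segfault instead of raising RecursionError. A minimal change worth trying is converting each entry to a plain string before pickling:

# plain str() copies drop the link back to the BeautifulSoup tree,
# so pickle no longer needs to traverse it
c = [str(ext.string) for ext in external_class if not "reports" in ext]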
