I am trying to read a PDF in AWS Lambda. The PDF is stored in an S3 bucket. I need to extract the text from the PDF and translate it into any required language. My code runs in my notebook, but when I run it on Lambda I get this error message in my CloudWatch logs: Task timed out after 3.01 seconds.
import fitz
import base64
from io import BytesIO
from PIL import Image
import boto3
def lambda_handler(event, context):
    s3 = boto3.client('s3')
    client_textract = boto3.client('textract')
    translate_client = boto3.client('translate')
    try:
        print("Inside handler")
        s3_bucket = "my_bucket"
        pdf_file_name = 'sample.pdf'
        pdf_file = s3.get_object(Bucket=s3_bucket, Key=pdf_file_name)
        file_content = pdf_file['Body'].read()
        print("Before reading ")
        with fitz.open(stream=file_content, filetype="pdf") as doc:
Try increasing the function timeout, which defaults to 3 seconds.
If that does not help, try increasing the allocated memory.
Also, consider moving
s3 = boto3.client('s3')
client_textract = boto3.client('textract')
translate_client = boto3.client('translate')
out of your handler, right after the imports. The clients are then created once per execution environment and reused by later invocations, so the function runs more efficiently when it is invoked frequently.
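The timeout and memory settings mentioned above can be changed in the Lambda console or from code. A minimal sketch, assuming boto3's Lambda client and its update_function_configuration call; the helper name, the function name "pdf-translate" and the values 60 s / 512 MB are placeholders and not taken from the question:

```python
def raise_lambda_limits(lambda_client, function_name, timeout_s=60, memory_mb=512):
    """Raise the timeout (seconds) and memory (MB) of an existing Lambda function.

    `lambda_client` is expected to be boto3.client("lambda"); it is passed in
    so the helper can be exercised without AWS credentials.
    """
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_s,
        MemorySize=memory_mb,
    )

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   raise_lambda_limits(boto3.client("lambda"), "pdf-translate")  # hypothetical name
```

PDF parsing with fitz can be memory-hungry, so the right values depend on the size of the documents; treat the defaults here as a starting point only.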
I'm working on personal documentation for the matplotlib (MPL) library which, unlike the documentation MPL itself provides, is organised around the submodules and packages I'm interested in. I'm writing a Python script that I hope will automate document generation for future MPL releases.
I selected the submodules/packages of interest and want to list their main classes, from which I'll generate a list and process it with pydoc.
The problem is that I can't find a way to instruct Python to load a submodule from a string. Here is an example of what I tried:
import matplotlib.text as text
x = dir(text)
.
i = __import__('matplotlib.text')
y = dir(i)
.
j = __import__('matplotlib')
z = dir(j)
And here is a 3-way comparison of the above lists through pprint:
I don't understand what's loaded in the y object: it's base matplotlib plus something else, but it lacks the information I wanted, namely the main classes from the matplotlib.text package. That is the top blue-coloured part of the screenshot (the x list).
Please don't suggest Sphinx as an alternative approach.
The __import__ function can be a bit hard to understand.
If you change
i = __import__('matplotlib.text')
to
i = __import__('matplotlib.text', fromlist=[''])
then i will refer to matplotlib.text.
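The difference is easier to see on a standard-library package. Without fromlist, __import__ imports the submodule but returns the top-level package (which is why y looked like "base matplotlib plus something else"); with a non-empty fromlist it returns the submodule itself. A small sketch, using json.decoder as a stand-in for matplotlib.text so that it runs without matplotlib installed:

```python
# json.decoder stands in for matplotlib.text (assumption made for portability)
top = __import__('json.decoder')                 # top-level package is returned
sub = __import__('json.decoder', fromlist=[''])  # the submodule is returned

print(top.__name__)                 # json
print(sub.__name__)                 # json.decoder
print(hasattr(sub, 'JSONDecoder'))  # True
```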
In Python 3.1 or later, you can use importlib:
import importlib
i = importlib.import_module("matplotlib.text")
Some notes:
If you're trying to import something from a sub-folder, e.g. ./feature/email.py, the code will look like importlib.import_module("feature.email").
Before Python 3.3 you could not import anything if there was no __init__.py in the folder containing the file you were trying to import (see caveats before deciding if you want to keep the file for backward compatibility, e.g. with pytest).
importlib.import_module is what you are looking for. It returns the imported module.
import importlib
# equiv. of your `import matplotlib.text as text`
text = importlib.import_module('matplotlib.text')
You can thereafter access anything in the module as text.myclass, text.myfunction, etc.
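Since the goal in the question is to list a module's main classes, the loaded module can be passed straight to inspect. This is only a sketch: list_classes is a hypothetical helper name, and json.decoder is used as a stand-in so the example runs without matplotlib.

```python
import importlib
import inspect

def list_classes(module_name):
    """Return the names of the classes defined in the module named by a string."""
    module = importlib.import_module(module_name)
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isclass)
        # skip classes that were merely imported into the module
        if obj.__module__ == module.__name__
    )

print(list_classes('json.decoder'))
```

With matplotlib installed, list_classes('matplotlib.text') should give the kind of list the question asks for, ready to be fed to pydoc.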
I spent some time trying to import modules from a list, and this is the thread that got me most of the way there, but I didn't grasp the use of __import__.
So here's how to import a module from a string and get the same behavior as a plain import, with try/except for the error case, too. :)
import sys

pipmodules = ['pycurl', 'ansible', 'bad_module_no_beer']
for module in pipmodules:
    try:
        # because we want to import using a variable, do it this way
        module_obj = __import__(module)
        # create a global object containing our module
        globals()[module] = module_obj
    except ImportError:
        sys.stderr.write("ERROR: missing python module: " + module + "\n")
        sys.exit(1)
And yes, on Python 2.7 and later you have other options, but this also works on 2.6 and earlier.
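On newer interpreters the same loop can be written with importlib, which also handles dotted names. The sketch below collects failures instead of exiting on the first one; import_all is a hypothetical helper name, and the module list is illustrative.

```python
import importlib

def import_all(names):
    """Import each (possibly dotted) name; return (loaded, missing)."""
    loaded, missing = {}, []
    for name in names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return loaded, missing

loaded, missing = import_all(['json', 'xml.dom', 'bad_module_no_beer'])
print(sorted(loaded))  # ['json', 'xml.dom']
print(missing)         # ['bad_module_no_beer']
```

Keeping the modules in a dictionary avoids writing into globals(), at the cost of accessing them as loaded['json'] rather than json.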
Apart from using importlib, one can also use exec to import a module from a string variable.
Here is an example of importing combinations from the itertools module using exec:
MODULES = [
    ['itertools', 'combinations'],
]
for ITEM in MODULES:
    import_str = "from {0} import {1}".format(ITEM[0], ', '.join(str(i) for i in ITEM[1:]))
    exec(import_str)
ar = list(combinations([1, 2, 3, 4], 2))
for elements in ar:
    print(elements)
Output:
(1, 2)
(1, 3)
(1, 4)
(2, 3)
(2, 4)
(3, 4)
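If executing generated source is a concern (for example when the names come from user input), the same "from X import Y" effect can be sketched with importlib and getattr. import_names is a hypothetical helper, not part of any library:

```python
import importlib

def import_names(module_name, *names):
    """Rough equivalent of `from module_name import name1, name2` without exec."""
    module = importlib.import_module(module_name)
    return {name: getattr(module, name) for name in names}

imported = import_names('itertools', 'combinations')
combinations = imported['combinations']
print(list(combinations([1, 2, 3, 4], 2)))
```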
Module auto-install & import from list
The script below works with both submodules and pseudo-submodules.
# PyPI imports
import pkg_resources, subprocess, sys
modules = {'lxml.etree', 'pandas', 'screeninfo'}
required = {m.split('.')[0] for m in modules}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed
if missing:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--upgrade', 'pip'])
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', *missing])
for module in set.union(required, modules):
    globals()[module] = __import__(module)
Tests:
print(pandas.__version__)
print(lxml.etree.LXML_VERSION)
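One caveat with the script above: pkg_resources is deprecated in recent setuptools releases, and it compares distribution names, which do not always match import names. A possible alternative for the "is it missing?" check is importlib.util.find_spec, sketched below with a hypothetical helper name and placeholder module names:

```python
import importlib.util

def missing_modules(names):
    """Return the top-level import names that cannot be found on this interpreter."""
    top_level = {name.split('.')[0] for name in names}
    return sorted(n for n in top_level if importlib.util.find_spec(n) is None)

print(missing_modules({'json.decoder', 'os', 'no_such_module_xyz'}))
# ['no_such_module_xyz']
```

The result is a list of import names; mapping them to the package names that pip expects is still up to the caller.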
I developed these 3 useful functions:
def loadModule(moduleName):
    module = None
    try:
        # drop any cached copy so the module is imported afresh
        import sys
        del sys.modules[moduleName]
    except BaseException as err:
        pass
    try:
        import importlib
        module = importlib.import_module(moduleName)
    except BaseException as err:
        serr = str(err)
        print("Error loading the module '" + moduleName + "': " + serr)
    return module

def reloadModule(moduleName):
    module = loadModule(moduleName)
    # parse "<module 'name' from 'path'>" to recover the module's file path
    moduleName, modulePath = str(module).replace("' from '", "||").replace("<module '", '').replace("'>", '').split("||")
    if (modulePath.endswith(".pyc")):
        # remove the stale bytecode file, then import again
        import os
        os.remove(modulePath)
        module = loadModule(moduleName)
    return module

def getInstance(moduleName, param1, param2, param3):
    module = reloadModule(moduleName)
    # assumes the module defines a class with the same name as the module
    instance = eval("module." + moduleName + "(param1, param2, param3)")
    return instance
Every time I want to load a fresh instance, I just have to call getInstance() like this:
myInstance = getInstance("MyModule", myParam1, myParam2, myParam3)
Finally, I can call all the functions of the new instance:
myInstance.aFunction()
The only thing you need to customize is the parameter list (param1, param2, param3) of your instance.
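A shorter variant of the same idea, offered as a sketch rather than a drop-in replacement: importlib.reload re-executes the module's source, and getattr looks the class up by name, so neither eval nor parsing of str(module) is needed. get_instance is a hypothetical name, and fractions.Fraction is used only so the example runs anywhere.

```python
import importlib

def get_instance(module_name, class_name, *args, **kwargs):
    """Re-import `module_name` and build a fresh `class_name` instance."""
    module = importlib.import_module(module_name)
    module = importlib.reload(module)   # re-executes the module's source
    cls = getattr(module, class_name)   # look the class up by name, no eval
    return cls(*args, **kwargs)

half = get_instance('fractions', 'Fraction', 1, 2)
print(half)  # 1/2
```

Accepting *args and **kwargs removes the need to edit the parameter list for each class.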
You can also use the exec built-in function, which executes any string as Python code.
In [1]: module = 'pandas'
...: function = 'DataFrame'
...: alias = 'DF'
In [2]: exec(f"from {module} import {function} as {alias}")
In [3]: DF
Out[3]: pandas.core.frame.DataFrame
For me this was the most readable way to solve my problem.
I am trying to run the following Python script:
#!/usr/bin/env python
#Author Jared Adolf-Bryfogle
#Python Imports
import os
import sys
from pathlib import Path
from typing import Union
#Append Python Path
p = os.path.split(os.path.abspath(__file__))[0]+"/src"
sys.path.append(p) #Allows all modules to use all other modules, without needing to update pythonpath
#Project Imports
from pic2.modules.chains.ProposedIgChain import ProposedIgChain
from pic2.modules.chains.IgChainSet import IgChainSet
from pic2.modules.chains.IgChainFactory import IgChainFactory
from pic2.modules.chains.AbChainFactory import AbChainFactory
from pic2.tools.fasta import *
class IgClassifyFASTA:
    """
    Identify CDRs from a Fasta File
    """
    def __init__(self, fasta_path: Union[Path, str]):
        self.fasta_path = str(fasta_path)
        self.outdir = os.getcwd()
        self.outname = "classified"
        self.fasta_paths = split_fasta_from_fasta(os.path.abspath(self.fasta_path), "user_fasta_split_"+self.outname, self.outdir)
    def __exit__(self):
        for path in self.fasta_paths:
            if os.path.exists(path):
                os.remove(path)
    def set_output_path(self, outdir: Union[Path, str]):
        self.outdir = str(outdir)
    def set_output_name(self, outname: Union[Path, str]):
        self.outname = str(outname)
My Python version is 3.8, and pic2 is a conda env. I get the following error:
File "IgClassifyFASTA.py", line 29
def __init__(self, fasta_path:Union[Path, str]):
SyntaxError: invalid syntax
I cannot figure out what's wrong with this line. Could you kindly give me a hint as to what is wrong? I would appreciate any help.
Best regards!
I'm unable to capture stdout of runpy.run_module into a variable using StringIO.
To demonstrate the problem, I created a script called runpy_test.py (code below) that takes an argument switch:
0 = do not redirect stdout.
1 = redirect using StringIO, capture into variable, print variable.
Console Output
(base) PS C:\Users\justi\Documents> python .\runpy_test.py 0
pip 20.0.2 from C:\ProgramData\Anaconda3\lib\site-packages\pip (python 3.6)
(base) PS C:\Users\justi\Documents> python .\runpy_test.py 1
(base) PS C:\Users\justi\Documents>
I was expecting python .\runpy_test.py 1 to print pip 20.0.2 from C:\ProgramData\Anaconda3\lib\site-packages\pip (python 3.6), but as you can see from the above console capture, I'm getting nothing.
runpy_test.py
import io
import sys
import runpy
import copy
capture_stdout = bool(sys.argv[1] == "1")
if capture_stdout:
    _stdout = sys.stdout
    sys.stdout = io.StringIO()
_argv = copy.deepcopy(sys.argv)
sys.argv = ['', '-V']
runpy.run_module("pip", run_name="__main__")
sys.argv = _argv
if capture_stdout:
    result = sys.stdout.getvalue()
    sys.stdout = _stdout
    print(f"result: {result}")
I'm guessing sys.stdout is not being correctly re-initialised before I print because of something related to runpy.run_module, but I'm not really sure how to debug this. Any ideas would be great, solutions even better.
My environment is Python 3.6.10 using conda 4.8.3.
Thanks in advance.
Using subprocess.check_output instead of runpy.run_module solved my problem.
See Installing python module within code
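A minimal sketch of that subprocess approach. The helper name run_module_capture is made up for illustration, and the platform module is used in the demo call only because it is always available; for the case in the question the call would be run_module_capture('pip', '-V').

```python
import subprocess
import sys

def run_module_capture(module, *args):
    """Run `python -m module args...` in a child process and return its stdout."""
    return subprocess.check_output(
        [sys.executable, '-m', module, *args],
        universal_newlines=True,  # decode bytes to str (also works on Python 3.6)
    )

result = run_module_capture('platform')
print(f"result: {result}")
```

Because the module runs in its own interpreter, whatever it does to sys.stdout, sys.argv or the exit status cannot affect the calling script.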
You can use capsys in the pytest framework:
import runpy

def test_main(capsys):
    runpy.run_module(
        "helloworld",
        init_globals=None,
        run_name="__main__",
        alter_sys=False)
    captured = capsys.readouterr()
    # print() appends a newline, so strip before comparing
    assert captured.out.strip() == "Hello, World!"
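Outside pytest, an in-process capture can be sketched with contextlib.redirect_stdout. One assumption is worth stating loudly: command-line modules such as pip typically finish by calling sys.exit(), and the resulting SystemExit propagates out of runpy.run_module. That would explain why nothing after the run_module call was executed in the question's script, but it is an inference, not something confirmed in the thread. The helper name below is hypothetical.

```python
import contextlib
import io
import runpy
import sys

def run_module_in_process(module, argv):
    """Run `module` as __main__ in this process; return (stdout_text, exit_code)."""
    buffer = io.StringIO()
    saved_argv = sys.argv
    sys.argv = [module, *argv]
    exit_code = 0
    try:
        with contextlib.redirect_stdout(buffer):
            try:
                runpy.run_module(module, run_name="__main__", alter_sys=True)
            except SystemExit as exc:
                # assumption: the module ends with sys.exit(); without this
                # handler the exception would end the calling script too
                exit_code = exc.code
    finally:
        sys.argv = saved_argv  # always restore the caller's arguments
    return buffer.getvalue(), exit_code

# e.g. out, code = run_module_in_process("pip", ["-V"])
```

This only captures text written through Python's sys.stdout; output produced by child processes or C-level writes would still bypass the buffer, which is one reason the subprocess route can be more robust.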
import os
import json
from collections import namedtuple
from ansible import context
from ansible.module_utils.common.collections import ImmutableDict
from ansible.utils.vars import load_extra_vars
from ansible.parsing.dataloader import DataLoader
from ansible.vars.manager import VariableManager
from ansible.inventory.manager import InventoryManager
from ansible.playbook.play import Play
from ansible.executor.playbook_executor import PlaybookExecutor
def execute_ansible_playbook(CLOUD_TO_USE=None, PLAYBOOK=None):
    playbook_path = PLAYBOOK
    #inventory_path = "hosts"
    #Options = namedtuple('Options', ['connection', 'module_path', 'forks', 'become', 'become_method', 'become_user', 'check', 'diff', 'listhosts', 'listtasks', 'listtags', 'syntax'])
    loader = DataLoader()
    passwords = dict(vault_pass='secret')
    inventory = InventoryManager(loader=loader, sources='inventory/' + CLOUD_TO_USE)
    #inventory = InventoryManager(loader=loader, sources='localhost')
    variable_manager = VariableManager(loader=loader, inventory=inventory)
    executor = PlaybookExecutor(
        playbooks=[playbook_path],
        inventory=inventory,
        variable_manager=variable_manager,
        loader=loader,
        passwords=passwords
    )
    results = executor.run()
    print(results)
I got this code from "Run Ansible playbook programmatically?".
It runs properly for other playbooks, but now I want to pass extra_vars to a playbook, and I couldn't find a proper answer.
How can I do that?
FWIW, use ansible-runner. The documentation is not complete, but all parameters are described in the source.
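A rough sketch of what that could look like, assuming the ansible-runner package is installed and that its run() function accepts private_data_dir, playbook and extravars keyword arguments. Check the source for your installed version, as the answer suggests. The helper names, the playbook name site.yml and the variable cloud are hypothetical.

```python
def build_runner_kwargs(playbook, extra_vars, private_data_dir="."):
    """Collect the keyword arguments for ansible_runner.run()."""
    return {
        "private_data_dir": private_data_dir,  # directory with project/, inventory/, ...
        "playbook": playbook,
        "extravars": dict(extra_vars),         # handed to the playbook as extra vars
    }

def run_playbook(playbook, extra_vars, private_data_dir="."):
    """Run a playbook through ansible-runner and return (status, return code)."""
    import ansible_runner  # third-party: pip install ansible-runner

    runner = ansible_runner.run(
        **build_runner_kwargs(playbook, extra_vars, private_data_dir)
    )
    return runner.status, runner.rc

# Usage (hypothetical names):
#   status, rc = run_playbook("site.yml", {"cloud": "aws"}, private_data_dir="/tmp/demo")
```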