I am trying to write an Ansible playbook to crawl a website and then store its contents as a static file in an AWS S3 bucket. Here is the crawler code:
"""
Handling pages with the Next button
"""
import sys
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
url = "https://xyz.co.uk/"
file_name = "web_content.txt"
while True:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    raw_html = soup.prettify()
    # append so each page is added instead of overwriting the previous one
    file = open(file_name, 'ab')
    print('Collecting the website contents')
    file.write(raw_html.encode())
    file.close()
    print('Saved to %s' % file_name)
    #print(type(raw_html))
    # Finding next page
    next_page_element = soup.select_one('li.next > a')
    if next_page_element:
        next_page_url = next_page_element.get('href')
        url = urljoin(url, next_page_url)
    else:
        break
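As a side note on the pagination step: urljoin resolves the next link against the current page URL, whether the href is relative or absolute. A quick check with hypothetical catalogue paths:

```python
# urljoin handles both relative and absolute "next" hrefs; the paths
# below are hypothetical examples.
from urllib.parse import urljoin

print(urljoin("https://xyz.co.uk/catalogue/page-1.html", "page-2.html"))
# -> https://xyz.co.uk/catalogue/page-2.html
print(urljoin("https://xyz.co.uk/", "https://xyz.co.uk/page-2.html"))
# -> https://xyz.co.uk/page-2.html
```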
This is my ansible-playbook:
---
- name: create s3 bucket and upload static website content into it
  hosts: localhost
  connection: local
  tasks:
    - name: create a s3 bucket
      amazon.aws.aws_s3:
        bucket: testbucket393647914679149
        region: ap-south-1
        mode: create
    - name: create a folder in the bucket
      amazon.aws.aws_s3:
        bucket: testbucket393647914679149
        object: /my/directory/path
        mode: create
    - name: Upgrade pip
      pip:
        name: pip
        version: 21.1.3
    - name: install virtualenv via pip
      pip:
        requirements: /root/ansible/requirements.txt
        virtualenv: /root/ansible/myvenv
        virtualenv_python: python3.6
      environment:
        PATH: "{{ ansible_env.PATH }}:{{ ansible_user_dir }}/.local/bin"
    - name: Run script to crawl the website
      script: /root/ansible/beautiful_crawl.py
    - name: copy file into bucket folder
      amazon.aws.aws_s3:
        bucket: testbucket393647914679149
        object: /my/directory/path/web_content.text
        src: web_content.text
        mode: put
The problem is that when I run this, it runs fine up to the task name: install virtualenv via pip, and then throws the following error while executing the task name: Run script to crawl the website:
fatal: [localhost]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 2, "stderr_lines": [
    "/root/.ansible/tmp/ansible-tmp-1625137700.8854306-13026-97983643645466/beautiful_crawl.py: line 1: import: command not found",
    "/root/.ansible/tmp/ansible-tmp-1625137700.8854306-13026-97983643645466/beautiful_crawl.py: line 2: from: command not found",
    "/root/.ansible/tmp/ansible-tmp-1625137700.8854306-13026-97983643645466/beautiful_crawl.py: line 3: import: command not found",
    "/root/.ansible/tmp/ansible-tmp-1625137700.8854306-13026-97983643645466/beautiful_crawl.py: line 4: from: command not found",
    "/root/.ansible/tmp/ansible-tmp-1625137700.8854306-13026-97983643645466/beautiful_crawl.py: line 6: url: command not found",
    "/root/.ansible/tmp/ansible-tmp-1625137700.8854306-13026-97983643645466/beautiful_crawl.py: line 7: file_name: command not found",
    "/root/.ansible/tmp/ansible-tmp-1625137700.8854306-13026-97983643645466/beautiful_crawl.py: line 10: syntax error near unexpected token `('",
    "/root/.ansible/tmp/ansible-tmp-1625137700.8854306-13026-97983643645466/beautiful_crawl.py: line 10: `response = requests.get(url)'"
], "stdout": "", "stdout_lines": []}
What am I doing wrong here?
You have multiple problems.
Check the documentation.
No. 1: The script module runs scripts through the shell by default, not through Python. If you want to run a Python script, you need to add a shebang like #!/usr/bin/env python3 as the first line of the script, or use the executable parameter.
No. 2: You create a venv, so I assume you want to run the script in that venv. You can't do that out of the box with the script module, so you need to work around it.
This should work for you (you don't need the shebang, as you tell the script module to run it with python in the venv using the executable parameter):
- name: Run script to crawl the website
  script:
    cmd: /root/ansible/beautiful_crawl.py
    executable: /root/ansible/myvenv/bin/python
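The error in the question is reproducible outside Ansible: handing a shebang-less Python file to a POSIX shell yields the same "import: command not found" messages. A minimal sketch (assumes a POSIX /bin/sh is available):

```python
# Run Python source through /bin/sh, as the script module does when the
# script has no shebang: the shell treats each line as a command.
import os
import subprocess
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("import sys\nurl = 'https://xyz.co.uk/'\n")
    path = f.name

result = subprocess.run(["/bin/sh", path], capture_output=True, text=True)
print(result.returncode)  # non-zero: the shell cannot execute Python source
print(result.stderr)      # lines like "import: command not found"
os.unlink(path)
```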
When running yapf over a file from the command line, my flags are the following:
-i --verbose --style "google"
When using the same above as args for pre-commit, my pre-commit hook always returns "Pass".
This was tested against the same file for the same changes, so I would have expected similar results. If I exclude --style "google", my pre-commit hook will at least change the format of my file, but not to the style that I want it to.
Can someone tell me what I am doing wrong with the args?
Python File that contains an example:
def hello_world():
    print("hello world")
    if 5 == 5: print("goodbye world")
.pre-commit-config.yaml file:
- repo: https://github.com/pre-commit/pre-commit-hooks.git
  sha: v4.0.1
  hooks:
    - id: trailing-whitespace
    - id: end-of-file-fixer
- repo: https://github.com/google/yapf
  rev: v0.31.0
  hooks:
    - id: yapf
      name: "yapf"
On commit, my file will change and pre-commit has told me yapf has changed my file to the following:
def hello_world():
    print("hello world")
    if 5 == 5: print("goodbye world")
If I go back to the same python file and update my .pre-commit-config.yaml file to this:
- repo: https://github.com/pre-commit/pre-commit-hooks.git
  sha: v4.0.1
  hooks:
    - id: trailing-whitespace
    - id: end-of-file-fixer
- repo: https://github.com/google/yapf
  rev: v0.31.0
  hooks:
    - id: yapf
      name: "yapf"
      args: [--style "google" ]
Running a commit now just reports Passed instead of making any changes, even the ones from before.
Edit 1:
The .pre-commit-config.yaml file was updated to:
- repo: https://github.com/pre-commit/pre-commit-hooks.git
  sha: v4.0.1
  hooks:
    - id: trailing-whitespace
    - id: end-of-file-fixer
- repo: https://github.com/google/yapf
  rev: v0.31.0
  hooks:
    - id: yapf
      name: "yapf"
      args: [--style, google]
Running pre-commit run only shows Passed instead of reformatting. I've also tried putting in pep8, and other arbitrary words as a replacement for google. These all result in Passed. Maybe there is something on my end where the style arg is being ignored and causing all of yapf to fail?
This post was from a while ago, but for anyone else reading it in the future: I was able to control the yapf style in pre-commit with a .style.yapf file in my parent directory, as outlined in the yapf documentation: https://github.com/google/yapf
This was the .style.yapf I used
[style]
based_on_style = google
I have an issue with accessing GitHub Secrets in my CI workflow.
The tests part of the main.yml file -
# Run our unit tests
- name: Run unit tests
  env:
    CI: true
    MONGO_USER: ${{ secrets.MONGO_USER }}
    MONGO_PWD: ${{ secrets.MONGO_PWD }}
    ADMIN: ${{ secrets.ADMIN }}
  run: |
    pipenv run python app.py
I have a database.py file in which I am accessing these environment variables
import os
import urllib
from typing import Dict, List, Union
import pymongo
from dotenv import load_dotenv
load_dotenv()
print("Mongodb user: ", os.environ.get("MONGO_USER"))
class Database:
    try:
        client = pymongo.MongoClient(
            "mongodb+srv://" +
            urllib.parse.quote_plus(os.environ.get("MONGO_USER")) +
            ":" +
            urllib.parse.quote_plus(os.environ.get("MONGO_PWD")) +
            "#main.rajun.mongodb.net/myFirstDatabase?retryWrites=true&w=majority"
        )
        DATABASE = client.Main
    except TypeError as NoCredentialsError:
        print("MongoDB credentials not available")
        raise Exception(
            "MongoDB credentials not available"
        ) from NoCredentialsError
    ...
    ...
This is the issue I get in the build:
Traceback (most recent call last):
Mongodb user: None
MongoDB credentials not available
Followed by urllib raising bytes expected error
I have followed the documentation here but I still cannot find out my mistake
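The two symptoms in the log fit together: os.environ.get returns None when a variable is unset, and urllib's quoting functions raise a TypeError ("bytes expected") when handed None instead of a string. A minimal reproduction, independent of GitHub Actions:

```python
# Simulate MONGO_USER being absent from the environment: os.environ.get
# yields None, and quoting None raises TypeError before any connection
# is attempted -- which the Database class above turns into
# "MongoDB credentials not available".
import urllib.parse

missing = None  # stands in for os.environ.get("MONGO_USER") when unset
try:
    urllib.parse.quote_plus(missing)
    reached = False
except TypeError:
    reached = True
```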
I have an AWS Lambda function based on Python 3.7 and am trying to use the module dicttoxml via AWS Layers. My Python code is as below:
import json
import dicttoxml

def lambda_handler(event, context):
    xml = dicttoxml.dicttoxml({"name": "Foo"})
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
At my local machine, it works perfectly fine but Lambda gives error as below:
{
    "errorMessage": "module 'dicttoxml' has no attribute 'dicttoxml'",
    "errorType": "AttributeError",
    "stackTrace": [
        "  File \"/var/task/lambda_function.py\", line 4, in lambda_handler\n    xml = dicttoxml.dicttoxml({\"name\": \"Ankur\"})\n"
    ]
}
The directory structure of dicttoxml layer is as below:
dicttoxml.zip > python > dicttoxml > dicttoxml.py
I feel puzzled, what is wrong here?
I created a custom layer with dicttoxml and can confirm that it works.
The technique uses the docker tool described in the recent AWS blog:
How do I create a Lambda layer using a simulated Lambda environment with Docker?
Thus for this question, I verified it as follows:
Create empty folder, e.g. mylayer.
Go to the folder and create requirements.txt file with the content of
echo dicttoxml > ./requirements.txt
Run the following docker command:
docker run -v "$PWD":/var/task "lambci/lambda:build-python3.7" /bin/sh -c "pip install -r requirements.txt -t python/lib/python3.7/site-packages/; exit"
Create layer as zip:
zip -9 -r mylayer.zip python
Create lambda layer based on mylayer.zip in the AWS Console. Don't forget to specify Compatible runtimes to python3.7.
Test the layer in lambda using the following lambda function:
import dicttoxml

def lambda_handler(event, context):
    print(dir(dicttoxml))
The function executes correctly:
['LOG', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', 'collections', 'convert', 'convert_bool', 'convert_dict', 'convert_kv', 'convert_list', 'convert_none', 'default_item_func', 'dicttoxml', 'escape_xml', 'get_unique_id', 'get_xml_type', 'ids', 'key_is_valid_xml', 'logging', 'long', 'make_attrstring', 'make_id', 'make_valid_xml_name', 'numbers', 'parseString', 'randint', 'set_debug', 'unicode', 'unicode_literals', 'unicode_me', 'version', 'wrap_cdata']
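The layout in the question also explains the AttributeError: with python > dicttoxml > dicttoxml.py, `import dicttoxml` binds the directory (a namespace package), and the function lives one level deeper in the dicttoxml.dicttoxml submodule. A sketch of that mechanism, using a stand-in package name (fakelib) to avoid clashing with a real install:

```python
# Recreate the question's layer layout with a stand-in name ("fakelib"
# instead of dicttoxml): a directory fakelib/ without __init__.py,
# containing fakelib.py with the function inside.
import os
import sys
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "fakelib"))
with open(os.path.join(root, "fakelib", "fakelib.py"), "w") as f:
    f.write("def fakelib(obj):\n    return '<xml/>'\n")

sys.path.insert(0, root)
import fakelib  # binds the *directory*: a namespace package

# Same shape as the Lambda error: the top-level name has no such attribute
had_attr_before = hasattr(fakelib, "fakelib")  # False

# The function actually lives one level deeper, in the submodule:
from fakelib.fakelib import fakelib as convert
print(convert({"name": "Foo"}))  # <xml/>
```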
I am trying to generate a file by template rendering to pass to the user data of an EC2 instance. I am using a third-party Terraform provider to generate an Ignition file from the YAML.
data "ct_config" "worker" {
  content      = data.template_file.file.rendered
  strict       = true
  pretty_print = true
}

data "template_file" "file" {
  ...
  ...
  template = file("${path.module}/example.yml")
  vars = {
    script = file("${path.module}/script.sh")
  }
}
example.yml
storage:
  files:
    - path: "/opt/bin/script"
      mode: 0755
      contents:
        inline: |
          ${script}
Error:
Error: Error unmarshaling yaml: yaml: line 187: could not find expected ':'
on ../../modules/launch_template/launch_template.tf line 22, in data "ct_config" "worker":
22: data "ct_config" "worker" {
If I change ${script} to sample data, then it works. Also, no matter what I put in script.sh, I get the same error.
You want this outcome (pseudocode):
storage:
  files:
    - path: "/opt/bin/script"
      mode: 0755
      contents:
        inline: |
          {{content of script file}}
In your current implementation, every line of script.sh after the first is interpolated without indentation, so the YAML decoder does not treat the full script.sh content as part of the block scalar.
Using indent you can correct the indentation, and using the newer templatefile function you get a slightly cleaner setup for the template:
data "ct_config" "worker" {
  content      = local.ct_config_content
  strict       = true
  pretty_print = true
}

locals {
  ct_config_content = templatefile("${path.module}/example.yml", {
    script = indent(10, file("${path.module}/script.sh"))
  })
}
For clarity, here is the example.yml template file (from the original question) to use with the code above:
storage:
  files:
    - path: "/opt/bin/script"
      mode: 0755
      contents:
        inline: |
          ${script}
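The indentation failure this answer describes can be reproduced outside Terraform; a minimal Python sketch, with textwrap.indent standing in for Terraform's indent function and a hypothetical script body standing in for script.sh:

```python
# A stand-in script body; in the question this is the content of script.sh.
import textwrap

script = "#!/bin/bash\necho one\necho two\n"
prefix = " " * 10  # the 10-space indent of the "inline: |" block scalar

# Plain ${script} interpolation: only the first line inherits the template's
# indentation, so later lines land at column 0 and fall out of the scalar.
naive = prefix + script
stray = [line for line in naive.splitlines()[1:] if not line.startswith(prefix)]
print(stray)  # ['echo one', 'echo two'] -> these trigger the YAML error

# Re-indenting every line (what indent(10, file(...)) achieves; Terraform's
# indent skips the first line since the template already indents it):
fixed = textwrap.indent(script, prefix)
print(all(line.startswith(prefix) for line in fixed.splitlines()))  # True
```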
I had this exact issue with ct_config and figured it out today. You need to base64encode your script to ensure it's written correctly without newlines; without that, newlines in your script make it to CT, which attempts to build an Ignition file, which cannot have newlines, causing the error you ran into originally.
Once encoded, you then just need to tell CT to !!binary the file so that Ignition correctly base64-decodes it on deploy:
data "template_file" "file" {
  ...
  ...
  template = file("${path.module}/example.yml")
  vars = {
    script = base64encode(file("${path.module}/script.sh"))
  }
}
storage:
  files:
    - path: "/opt/bin/script"
      mode: 0755
      contents:
        inline: !!binary |
          ${script}
Via an S3 bucket, I've uploaded a lambda function along with its dependencies as a ZIP file. The lambda function is a web scraper with the following initial code to get the scraper started:
import json
import os
import pymysql
import boto3
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--window-size=1280x1696')
chrome_options.add_argument('--user-data-dir=/tmp/user-data')
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('--enable-logging')
chrome_options.add_argument('--log-level=0')
chrome_options.add_argument('--v=99')
chrome_options.add_argument('--single-process')
chrome_options.add_argument('--data-path=/tmp/data-path')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--homedir=/tmp')
chrome_options.add_argument('--disk-cache-dir=/tmp/cache-dir')
chrome_options.binary_location = os.getcwd() + "/bin/headless-chromium"
browser = webdriver.Chrome(executable_path=ChromeDriverManager().install(), options=chrome_options)
When I try to test the lambda function, I get the following error in the console:
{
    "errorMessage": "Could not get version for Chrome with this command: google-chrome --version",
    "errorType": "ValueError",
    "stackTrace": [
        "  File \"/var/task/lambda_function.py\", line 67, in lambda_handler\n    browser = webdriver.Chrome(executable_path=ChromeDriverManager().install(), options=chrome_options)\n",
        "  File \"/var/task/webdriver_manager/chrome.py\", line 24, in install\n    driver_path = self.download_driver(self.driver)\n",
        "  File \"/var/task/webdriver_manager/manager.py\", line 32, in download_driver\n    driver_version, is_latest = self.__get_version_to_download(driver)\n",
        "  File \"/var/task/webdriver_manager/manager.py\", line 23, in __get_version_to_download\n    return self.__get_latest_driver_version(driver), True\n",
        "  File \"/var/task/webdriver_manager/manager.py\", line 17, in __get_latest_driver_version\n    return driver.get_latest_release_version()\n",
        "  File \"/var/task/webdriver_manager/driver.py\", line 54, in get_latest_release_version\n    self._latest_release_url + '_' + chrome_version())\n",
        "  File \"/var/task/webdriver_manager/utils.py\", line 98, in chrome_version\n    .format(cmd)\n"
    ]
}
In response, I tried editing the utils.py file in the webdriver_manager dependency folder, using other commands such as 'chrome --version' and 'chromium-browser --version' in place of 'google-chrome --version' in the 'chrome_version()' function definition, but got a similar error about not being able to get the Chrome version from the new command:
def chrome_version():
    pattern = r'\d+\.\d+\.\d+'
    cmd_mapping = {
        OSType.LINUX: 'google-chrome --version',
        OSType.MAC: r'/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --version',
        OSType.WIN: r'reg query "HKEY_CURRENT_USER\Software\Google\Chrome\BLBeacon" /v version'
    }
    cmd = cmd_mapping[os_name()]
    stdout = os.popen(cmd).read()
    version = re.search(pattern, stdout)
    if not version:
        raise ValueError(
            'Could not get version for Chrome with this command: {}'
            .format(cmd)
        )
    return version.group(0)
Can anyone tell me what command I should be using instead of 'google-chrome --version'?
By default, Google Chrome does not exist on the container that runs our Lambda functions.
I'm implementing similar solutions, but with JavaScript, and the way I solved it was by using a micro-browser (Chromium) via the following packages:
"chrome-aws-lambda": "^1.19.0",
"puppeteer-core": "^1.19.0"
For Python, here is a tutorial that might help in your situation.
https://robertorocha.info/setting-up-a-selenium-web-scraper-on-aws-lambda-with-python/