pig_hook in airflow doesn't work for python3

Using Python 3.5.2, airflow 1.9.0
Trying to set up a pig hook, looking at the documentation here:
https://github.com/apache/incubator-airflow/blob/master/airflow/hooks/pig_hook.py
following the example in lines 50-52
>>> ph = PigCliHook()
>>> result = ph.run_cli("ls /;")
>>> ("hdfs://" in result)
Gives the following error:
File "python3.5/site-packages/airflow/hooks/pig_hook.py", line 53, in run_cli
f.write(pig)
File "python3.5/tempfile.py", line 622, in func_wrapper
return func(*args, **kwargs)
TypeError: a bytes-like object is required, not 'str'
If updated to run as:
>>> ph = PigCliHook()
>>> result = ph.run_cli("ls /;".encode('utf-8'))
>>> ("hdfs://" in result)
The error becomes:
File "python3.5/site-packages/airflow/hooks/pig_hook.py", line 74, in run_cli
stdout += line
TypeError: Can't convert 'bytes' object to str implicitly
And later in the same pig_hook.py file it does assume a string type for this field, so I don't think passing the input as a bytes object is correct.
I believe the object causing the problem is NamedTemporaryFile (from line 52 in pig_hook.py) which is opened by default in mode 'w+b' as described in the following post:
https://bugs.python.org/issue29245
But if I change line 53 in pig_hook.py to read:
with NamedTemporaryFile(dir=tmp_dir,'w') as f:
or
with NamedTemporaryFile(dir=tmp_dir, mode='w') as f:
it still expects a byte array resulting in the same error:
File "python3.5/site-packages/airflow/hooks/pig_hook.py", line 53, in run_cli
f.write(pig)
File "python3.5/tempfile.py", line 622, in func_wrapper
return func(*args, **kwargs)
TypeError: a bytes-like object is required, not 'str'
Does anyone know how I can solve this issue? I can't seem to get NamedTemporaryFile to open in a mode that uses string, and the rest of the code assumes a string.

It turns out that the error after calling:
>>> ph = PigCliHook()
>>> result = ph.run_cli("ls /;".encode('utf-8'))
>>> ("hdfs://" in result)
which was
File "python3.5/site-packages/airflow/hooks/pig_hook.py", line 74, in run_cli
stdout += line
TypeError: Can't convert 'bytes' object to str implicitly
was not coming from my own input but from the logging. So I added:
line = line.decode('utf-8')
to line 75 of the pig_hook and it seems to be working just fine now.
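For reference, the two str/bytes boundaries involved here can be sketched outside Airflow. This is only a hypothetical stand-in for the relevant part of run_cli, not the actual pig_hook.py code: write_script and run_and_collect are made-up names, and the command being run is the current Python interpreter rather than pig. The temporary file is opened in text mode so that a str can be written, and each line of subprocess output is decoded before it is appended to a str.

```python
import subprocess
import sys
from tempfile import NamedTemporaryFile


def write_script(text, tmp_dir=None):
    """Write a str to a temp file opened in text mode and read it back."""
    # mode="w+" (instead of the default "w+b") makes write() accept str.
    with NamedTemporaryFile(mode="w+", dir=tmp_dir) as f:
        f.write(text)
        f.flush()
        f.seek(0)
        return f.read()


def run_and_collect(cmd):
    """Run cmd and return its combined stdout/stderr as a single str."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    stdout = ""
    # On Python 3 the pipe yields bytes, so decode before concatenating.
    for line in iter(proc.stdout.readline, b""):
        stdout += line.decode("utf-8")
    proc.stdout.close()
    proc.wait()
    return stdout


# Stand-in for the pig call: a child process that prints one line.
result = run_and_collect([sys.executable, "-c", "print('hdfs://demo')"])
print("hdfs://" in result)
```

If the temporary file has to stay in binary mode, the alternative would be to encode the script once when writing it, and still decode the output lines as above.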

pymongo - bson.errors.InvalidDocument raised only sometimes for no apparent reason

my documents look like this : { "_id" : 5, "hunger" : 5, "energy" : 50 }
I'm calling this function..
def getEnergy(_id) -> int:
    record = db.systems.find({"_id":_id}) # systems is the collection
    return record[0]['energy']
and getting this error..
(...)
File "C:\Users\mateo\AppData\Roaming\Python\Python39\site-packages\pymongo\cursor.py", line 692, in __getitem__
for doc in clone:
File "C:\Users\mateo\AppData\Roaming\Python\Python39\site-packages\pymongo\cursor.py", line 1238, in next
if len(self.__data) or self._refresh():
File "C:\Users\mateo\AppData\Roaming\Python\Python39\site-packages\pymongo\cursor.py", line 1155, in _refresh
self.__send_message(q)
File "C:\Users\mateo\AppData\Roaming\Python\Python39\site-packages\pymongo\cursor.py", line 1044, in __send_message
response = client._run_operation(
File "C:\Users\mateo\AppData\Roaming\Python\Python39\site-packages\pymongo\mongo_client.py", line 1424, in _run_operation
return self._retryable_read(
File "C:\Users\mateo\AppData\Roaming\Python\Python39\site-packages\pymongo\mongo_client.py", line 1525, in _retryable_read
return func(session, server, sock_info, secondary_ok)
File "C:\Users\mateo\AppData\Roaming\Python\Python39\site-packages\pymongo\mongo_client.py", line 1420, in _cmd
return server.run_operation(
File "C:\Users\mateo\AppData\Roaming\Python\Python39\site-packages\pymongo\server.py", line 98, in run_operation
message = operation.get_message(
File "C:\Users\mateo\AppData\Roaming\Python\Python39\site-packages\pymongo\message.py", line 351, in get_message
request_id, msg, size, _ = _op_msg(
File "C:\Users\mateo\AppData\Roaming\Python\Python39\site-packages\pymongo\message.py", line 743, in _op_msg
return _op_msg_uncompressed(
bson.errors.InvalidDocument: cannot encode object: <pymongo.cursor.Cursor object at 0x0000021E52535670>, of type: <class 'pymongo.cursor.Cursor'>
Sometimes the function works just fine, and sometimes it throws an error. It seems to be a problem with the server but I can't figure out what exactly the problem is.
If the error is in that function (which we can't tell for certain, as you've only posted half the stack trace), you will get this error if you are passing a cursor into the function. This snippet reproduces the error:
from pymongo import MongoClient
db = MongoClient()['mydatabase']
def getEnergy(_id) -> int:
    record = db.systems.find({"_id":_id}) # systems is the collection
    return record[0]['energy']
foo = db.somecollection.find()
getEnergy(foo)
error:
bson.errors.InvalidDocument: cannot encode object: <pymongo.cursor.Cursor object at 0x0000019DB0A0B070>, of type: <class 'pymongo.cursor.Cursor'>
You need to examine where you are calling getEnergy() and check that the parameter you are passing in isn't a cursor object.
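To find the offending call site faster, the function can reject anything that is not a plain id before it ever reaches the driver. This is a rough sketch under two assumptions: ids are integers (as in the sample document), and the collection is passed in as a parameter, which the original function does not do. get_energy is a hypothetical variant, and collection can be any object with a find_one method, such as a pymongo collection.

```python
from numbers import Integral


def get_energy(collection, _id):
    """Return the 'energy' field of the document with the given integer _id."""
    # Fail early, with a readable message, if a cursor or other object is passed.
    if isinstance(_id, bool) or not isinstance(_id, Integral):
        raise TypeError("expected an integer _id, got %s" % type(_id).__name__)
    record = collection.find_one({"_id": _id})
    if record is None:
        raise KeyError("no document with _id=%r" % _id)
    return record["energy"]
```

With this guard, passing a cursor fails immediately with a TypeError that names the offending type, instead of surfacing deep inside the BSON encoder.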

Why does django-q throw exception with arrow time

I'm trying to create a Django-q schedule. Following the documentation, I use arrow for the next run, but I get the following error with the schedule:
schedule(
    func='test.tasks.test_task',
    name='test_task_nightly',
    schedule_type=Schedule.DAILY,
    next_run=arrow.utcnow().replace(hour=23, minute=30),
    q_options={'timeout': 10800, 'max_attempts': 1},
)
Traceback (most recent call last):
File "/usr/lib/python3.8/code.py", line 90, in runcode
exec(code, self.locals)
File "<console>", line 1, in <module>
schedule(
File "/home/user/PycharmProjects/app/venv/lib/python3.8/site-packages/django_q/tasks.py", line 122, in schedule
s.full_clean()
File "/home/user/PycharmProjects/app/venv/lib/python3.8/site-packages/django/db/models/base.py", line 1209, in full_clean
self.clean_fields(exclude=exclude)
File "/home/user/PycharmProjects/app/venv/lib/python3.8/site-packages/django/db/models/base.py", line 1251, in clean_fields
setattr(self, f.attname, f.clean(raw_value, self))
File "/home/user/PycharmProjects/app/venv/lib/python3.8/site-packages/django/db/models/fields/__init__.py", line 650, in clean
value = self.to_python(value)
File "/home/user/PycharmProjects/app/venv/lib/python3.8/site-packages/django/db/models/fields/__init__.py", line 1318, in to_python
parsed = parse_datetime(value)
File "/home/user/PycharmProjects/app/venv/lib/python3.8/site-packages/django/utils/dateparse.py", line 107, in parse_datetime
match = datetime_re.match(value)
TypeError: expected string or bytes-like object
Not sure why it's not accepting the time format similar to the example given in the django-q documentation page.
EDIT:
The task being scheduled:
def test_task():
    print('Executed test task')
Nothing too complex just for testing purposes
The Django ORM (version 3.2 at time of writing) won't accept an Arrow object in any DateTimeField.
Arrow objects emulate Python's datetime object interface, but they are not real datetime objects. So any code receiving an Arrow object will fail if it explicitly checks if your value is an honest-to-goodness datetime. Which is exactly what the code in
django.db.models.fields.DateTimeField.to_python appears to be doing:
def to_python(self, value):
    if value is None:
        return value
    if isinstance(value, datetime.datetime):
        return value
    if isinstance(value, datetime.date):
        value = datetime.datetime(value.year, value.month, value.day)
        ...
    try:
        parsed = parse_datetime(value)
As you can see, when it doesn't match a datetime or date instance, Django hands it off to the parse_datetime() function to deal with, which expects a string. Which explains your error: TypeError: expected string or bytes-like object
You can get around this by getting the .datetime property, which will return a plain old python datetime, i.e.
schedule(
    func='test.tasks.test_task',
    name='test_task_nightly',
    schedule_type=Schedule.DAILY,
    next_run=arrow.utcnow().replace(hour=23, minute=30).datetime,
    q_options={'timeout': 10800, 'max_attempts': 1},
)
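If the same conversion is needed in several places, it can be wrapped in a small helper. This is just a sketch: to_plain_datetime and next_run_at are hypothetical names, not part of Django, django-q or Arrow. The first unwraps any object that exposes a real datetime through a .datetime attribute, as Arrow objects do. The second builds the same "today at 23:30 UTC" value with the standard library alone.

```python
import datetime


def to_plain_datetime(value):
    """Return a real datetime.datetime for datetime-like wrapper objects."""
    if isinstance(value, datetime.datetime):
        return value
    # Arrow objects expose the wrapped datetime as `.datetime`.
    inner = getattr(value, "datetime", None)
    if isinstance(inner, datetime.datetime):
        return inner
    raise TypeError("cannot convert %s to datetime" % type(value).__name__)


def next_run_at(hour, minute):
    """Today's date at hour:minute UTC, built with the standard library only."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return now.replace(hour=hour, minute=minute, second=0, microsecond=0)
```

Either to_plain_datetime(arrow.utcnow().replace(hour=23, minute=30)) or next_run_at(23, 30) then yields a value that passes the isinstance(value, datetime.datetime) check shown above.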

Python Multiprocessing( TypeError: cannot serialize '_io.BufferedReader' object )

I'm trying to run a dictionary attack on a zip file, using Pool to increase speed.
But I get the following error in Python 3.6, while it works in Python 2.7:
Traceback (most recent call last):
File "zip_crack.py", line 42, in <module>
main()
File "zip_crack.py", line 28, in main
for result in results:
File "/usr/lib/python3.6/multiprocessing/pool.py", line 761, in next
raise value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 450, in _handle_tasks
put(task)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.BufferedReader' object
I searched for the same error but couldn't find an answer that helps here.
Code looks like this
def crack(pwd, f):
    try:
        key = pwd.strip()
        f.extractall(pwd=key)
        return True
    except:
        pass
z_file = zipfile.ZipFile("../folder.zip")
with open('words.dic', 'r') as passes:
    start = time.time()
    lines = passes.readlines()
    pool = Pool(50)
    results = pool.imap_unordered(partial(crack, f=z_file), lines)
    pool.close()
    for result in results:
        if result:
            pool.terminate()
            break
    pool.join()
I also tried another approach using map
with contextlib.closing(Pool(50)) as pool:
    pool.map(partial(crack, f=z_file), lines)
which worked great and found passwords quickly in Python 2.7, but throws the same exception in Python 3.6.
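The traceback suggests the failure happens while the pool pickles each task: the partial carries the open ZipFile, which wraps an _io.BufferedReader that cannot be serialized. One way around this is to pass only the path and open the file inside the worker, assuming it is acceptable to reopen the archive in every worker call. The sketch below is minimal and untuned. The names try_password, crack_zip and dest_dir are made up for illustration. Note that on Python 3 extractall() expects pwd as bytes.

```python
import zipfile
from functools import partial
from multiprocessing import Pool


def try_password(pwd, zip_path, dest_dir):
    """Return the stripped password if it extracts the archive, else None."""
    key = pwd.strip()
    try:
        # Open the archive inside the worker: only the path (a str) is pickled.
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(path=dest_dir, pwd=key.encode("utf-8"))
        return key
    except Exception:
        # A wrong password surfaces as RuntimeError, BadZipFile or zlib.error.
        return None


def crack_zip(zip_path, wordlist_path, dest_dir, processes=4):
    """Try every word in wordlist_path; return the first that works, or None."""
    with open(wordlist_path, "r") as passes:
        words = passes.readlines()
    worker = partial(try_password, zip_path=zip_path, dest_dir=dest_dir)
    with Pool(processes) as pool:
        for found in pool.imap_unordered(worker, words):
            if found is not None:
                return found  # leaving the with-block terminates the pool
    return None
```

Call it from under an if __name__ == "__main__": guard, for example crack_zip("../folder.zip", "words.dic", "out"). Reopening the archive for each word costs some speed, but every argument sent to the workers stays picklable.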

Proper Use Of Python 3.x AMFY Module

How am I supposed to use the Amfy module? I try to use it like the JSON module (amfy.loads or amfy.load), but it just gives me errors:
C:\Users\Other>"C:\Users\Other\Desktop\Python3.5.2\test amf.py"
Traceback (most recent call last):
File "C:\Users\Other\Desktop\Python3.5.2\test amf.py", line 4, in <module>
print(amfy.load(cn_rsp.text))
File "C:\Users\Other\Desktop\Python3.5.2\lib\site-packages\amfy\__init__.py", line 9, in load
return Loader().load(input, proto=proto)
File "C:\Users\Other\Desktop\Python3.5.2\lib\site-packages\amfy\core.py", line 33, in load
return self._read_item3(stream, context)
File "C:\Users\Other\Desktop\Python3.5.2\lib\site-packages\amfy\core.py", line 52, in _read_item3
marker = stream.read(1)[0]
AttributeError: 'str' object has no attribute 'read'
this is what I wrote:
import requests
import amfy
cn_rsp = requests.get("http://realm498.c10.castle.rykaiju.com/api/locales/en/get_serialized_new")
print(amfy.load(cn_rsp.text))
After tinkering around and googling some stuff, I found a fix:
New code:
import amfy, requests, json
url = "http://realm416.c9.castle.rykaiju.com/api/locales/en/get_serialized_static"
req = requests.get(url)
if req.status_code == 200:
    ret = req.json() if "json" in req.headers["content-type"] else amfy.loads(req.content)
else:
    ret = {"failed": req.reason}
with open("doa manifest.txt", 'w', encoding='utf-8') as dump:
    # json.dump writes to the file; json.dumps only returns a string
    json.dump(ret, dump)
The Terminal throws a UnicodeEncodeError, but I was able to fix that by entering chcp 65001 and then set PYTHONIOENCODING=utf-8
The load method expects an input stream, but you are passing it a string. Convert the string into an in-memory buffer that supports the read method, like this:
import io
print(amfy.load(io.BytesIO(cn_rsp.text.encode())))
Unfortunately, serialization fails when using this. Is there another url where it would work, a test URL maybe?
File "C:\Python34\lib\site-packages\amfy\core.py", line 146, in _read_vli
byte = stream.read(1)[0]
IndexError: index out of range
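One possible reason for this failure (an assumption, not something confirmed in this thread) is that the response body is binary, so going through .text and .encode() changes the bytes before the parser sees them. In requests, .content holds the raw bytes and .text is a decoded str. The sketch below uses a made-up four-byte payload instead of a real AMF response, and does not call amfy at all. It only shows that a BytesIO wrapper provides read(), and that a decode/encode round trip is lossy for binary data.

```python
import io

# Hypothetical payload: arbitrary binary bytes standing in for a response body.
raw = b"\x0a\x0b\x01\xff"

# Wrapping the raw bytes gives an object with read(), as a stream-based load() needs.
stream = io.BytesIO(raw)
first_byte = stream.read(1)[0]

# Going through text is lossy for binary data: decoding with a guessed charset
# and re-encoding as UTF-8 yields different bytes.
round_tripped = raw.decode("latin-1").encode("utf-8")

print(first_byte, round_tripped == raw)
```

If that is what happens here, wrapping the raw body instead, io.BytesIO(cn_rsp.content), would be the stream-based counterpart of the amfy.loads(req.content) call used in the question's own fix.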

python: TypeError: expected string or buffer: can't reproduce error with prompt

I am trying to debug this function:
def check_domains(url):
    global num_websites,domain_queue,domains,doc_queue,stanford_tagger
    the_domain = re.match(r'^(:?https?:\/\/[^.]*\.)?([^/#?&]+).*$',url)
    if the_domain is not None:
        if the_domain.groups(0)[1] not in domains.keys():
            domains[the_domain.groups(0)[1]] = website(doc_queue,the_domain.groups(0)[1])
            domains[the_domain.groups(0)[1]].add_initial_url(url)
            domain_queue.append(domains[the_domain.groups(0)[1]])
            num_websites = num_websites + 1
        else:
            domains[the_domain.groups(0)[1]].add_url(url)
error:
File "web_crawler.py", line 178, in getdoc
check_domains(check)
File "web_crawler.py", line 133, in check_domains
the_domain = re.match(r'^(:?https?:\/\/[^.]*\.)?([^/#?&]+).*$',url)
File "/usr/local/lib/python2.7/re.py", line 137, in match
return _compile(pattern, flags).match(string)
TypeError: expected string or buffer
I try and be a good boy and test in interactive mode:
>>> def check_domains(url):
...     the_domain = re.match(r'^(:?https?:\/\/[^.]*\.)?([^/#?&]+).*$',url) #right here
...     if the_domain is not None:
...         print the_domain.groups(0)[1]
...     else:
...         print "NOOOO!!!!!"
...
>>>
>>> check_domains("http://www.hulu.com/watch/6704")
hulu.com
>>> check_domains("https://docs.python.org/2/library/datetime.html")
python.org
So this does what I want it to do, and I didn't change that line. So why does it fail in the actual script?
The value of url can still be None and that's what gives this error:
>>> re.match(r'^(:?https?:\/\/[^.]*\.)?([^/#?&]+).*$', None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 137, in match
return _compile(pattern, flags).match(string)
TypeError: expected string or buffer
So you should check whether the object that you're passing for url is indeed a string. It may even be a number or something else but it's not a string which is what the matching function expects.
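A guard along these lines keeps a stray None (or any other non-string) from reaching re.match. It is only a sketch, written for Python 3: extract_domain is a hypothetical helper, not part of the crawler above, and on Python 2.7 the check would be against basestring instead of str. The pattern is the one from the question, unchanged.

```python
import re

DOMAIN_RE = re.compile(r'^(:?https?:\/\/[^.]*\.)?([^/#?&]+).*$')


def extract_domain(url):
    """Return the domain part of url, or None if url is not a usable string."""
    if not isinstance(url, str):
        # None, numbers, etc. would make re.match raise TypeError.
        return None
    match = DOMAIN_RE.match(url)
    return match.group(2) if match else None


print(extract_domain("http://www.hulu.com/watch/6704"))
print(extract_domain(None))
```

Logging the rejected value and its type at that point, rather than silently returning None, would also show which caller is producing the bad url.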
