Finding the source code of methods implemented in C? - python-3.x

Please note I am asking this question for informational purposes only
I know the title sounds like a duplicate of Finding the source code for built-in Python functions?. But let me explain.
Say, for example, I want to find the source code of the most_common method of the collections.Counter class. Since the Counter class is implemented in Python, I can use the inspect module to get its source code,
i.e.:
>>> import inspect
>>> import collections
>>> print(inspect.getsource(collections.Counter.most_common))
This will print
def most_common(self, n=None):
    '''List the n most common elements and their counts from the most
    common to the least. If n is None, then list all element counts.
    >>> Counter('abcdeabcdabcaba').most_common(3)
    [('a', 5), ('b', 4), ('c', 3)]
    '''
    # Emulate Bag.sortedByCount from Smalltalk
    if n is None:
        return sorted(self.items(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.items(), key=_itemgetter(1))
But if the method or class is implemented in C, inspect.getsource will raise a TypeError.
>>> my_list = []
>>> print(inspect.getsource(my_list.append))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\abdul.niyas\AppData\Local\Programs\Python\Python36-32\lib\inspect.py", line 968, in getsource
    lines, lnum = getsourcelines(object)
  File "C:\Users\abdul.niyas\AppData\Local\Programs\Python\Python36-32\lib\inspect.py", line 955, in getsourcelines
    lines, lnum = findsource(object)
  File "C:\Users\abdul.niyas\AppData\Local\Programs\Python\Python36-32\lib\inspect.py", line 768, in findsource
    file = getsourcefile(object)
  File "C:\Users\abdul.niyas\AppData\Local\Programs\Python\Python36-32\lib\inspect.py", line 684, in getsourcefile
    filename = getfile(object)
  File "C:\Users\abdul.niyas\AppData\Local\Programs\Python\Python36-32\lib\inspect.py", line 666, in getfile
    'function, traceback, frame, or code object'.format(object))
TypeError: <built-in method append of list object at 0x00D3A378> is not a module, class, method, function, traceback, frame, or code object.
So my question is: is there any way (perhaps using a third-party package?) to find the source code of a class or method implemented in C as well?
i.e., something like this:
>>> print(some_how_or_some_custom_package([].append))
int
PyList_Append(PyObject *op, PyObject *newitem)
{
    if (PyList_Check(op) && (newitem != NULL))
        return app1((PyListObject *)op, newitem);
    PyErr_BadInternalCall();
    return -1;
}

No, there is not. There is no metadata accessible from Python that will let you find the original source file. Such metadata would have to be created explicitly by the Python developers, and it is not clear what benefit that would bring.
First and foremost, the vast majority of Python installations do not include the C source code. Next, while you could conceivably expect users of the Python language to be able to read Python source code, Python's user base is very broad, and a large number of them either do not know C or are not interested in how the C code works. Finally, even developers who know C can't be expected to have read the Python C API documentation, something that quickly becomes a requirement if you want to understand the CPython codebase.
C files do not directly map to a specific output file, unlike Python bytecode cache files and scripts. Unless you create a debug build with a symbol table, the compiler doesn't retain the source filename in the generated object file (.o) it outputs, nor will the linker record what .o files went into the result it produces. Nor do all C files end up contributing to the same executable or dynamic shared object file; some become part of the Python binary, others become loadable extensions, and the mix is configurable and dependent on what external libraries are available at the time of compilation.
And between makefiles, setup.py and C pre-processor macros, the combination of input files, and which lines of source code are actually used to create each of the output files, also varies. Last but not least, because the C source files are no longer consulted at runtime, they can't be expected to still be available in their original location, so even if there were some metadata stored you still couldn't map it back to the original.
So it's easier to just remember a few base rules about how the Python C API works, then map those back to the C code with a few informed code searches.
Alternatively, download the Python source code and create a debug build, and use a good IDE to help you map symbols and such back to source files. Different compilers, platforms and IDEs have different methods of supporting symbol tables for debugging.
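For the "informed code search" approach, here is a rough sketch (purely illustrative; the helper name where_is is made up) of detecting that an object has no Python source and turning its qualified name into a term you can grep for in the CPython repository:

import inspect

def where_is(obj):
    """Return the Python source file, or a hint for searching the C sources."""
    try:
        return inspect.getsourcefile(obj)    # works for pure-Python objects
    except TypeError:
        # No Python source available; the qualified name is a good term to
        # grep for in CPython's Objects/ and Modules/ directories.
        name = getattr(obj, '__qualname__', getattr(obj, '__name__', repr(obj)))
        return "implemented in C; search the CPython sources for %r" % name

print(where_is(inspect.getsource))   # a path ending in inspect.py
print(where_is([].append))           # suggests searching for 'list.append'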

There could be a way if you had the full debug information (which is usually stripped).
You would then locate the .so or .pyd and use platform-specific tools to extract the debug information (stored in the .so itself, or in a .pdb file on Windows) for the required function. You may want to have a look at the DWARF format for Linux (on Windows there is no public documentation, AFAIK).
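As a small aside (an illustration only, not a source lookup), on CPython you can at least confirm that the C symbol from the question is exported by the running interpreter; that symbol name is what you would feed to such platform-specific tools:

import ctypes

# CPython-specific: looks up the exported C-API symbol in the interpreter itself.
print(ctypes.pythonapi.PyList_Append)   # e.g. <_FuncPtr object at 0x...>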

Related

ValueError: Comments are not supported by the python backend

The ijson module has a documented option allow_comments=True, but when I include it,
an error message is produced:
ValueError: Comments are not supported by the python backend
Below is a transcript using the file test.py:
import ijson
for o in ijson.items(open(0), 'item'):
    print(o)
Please note that I have no problem with a similar documented option, multiple_values=True.
Transcript
$ python3 --version
Python 3.10.9
$ python3 test.py <<< [1,2]
1
2
# Now change the call to: ijson.items(open(0), 'item', allow_comments=True)
$ python3 test.py <<< [1,2]
Traceback (most recent call last):
  File "/Users/user/test.py", line 5, in <module>
    for o in ijson.items(open(0), 'item', allow_comments=True):
  File "/usr/local/lib/python3.10/site-packages/ijson/utils.py", line 51, in coros2gen
    f = chain(events, *coro_pipeline)
  File "/usr/local/lib/python3.10/site-packages/ijson/utils.py", line 29, in chain
    f = coro_func(f, *coro_args, **coro_kwargs)
  File "/usr/local/lib/python3.10/site-packages/ijson/backends/python.py", line 284, in basic_parse_basecoro
    raise ValueError("Comments are not supported by the python backend")
ValueError: Comments are not supported by the python backend
$
Take a look at the Backends section of the documentation, which says:
Ijson provides several implementations of the actual parsing in the form of backends located in ijson/backends:
yajl2_c: a C extension using YAJL 2.x. This is the fastest, but might require a compiler and the YAJL development files to be present when installing this package. Binary wheel distributions exist for major platforms/architectures to spare users from having to compile the package.
yajl2_cffi: wrapper around YAJL 2.x using CFFI.
yajl2: wrapper around YAJL 2.x using ctypes, for when you can’t use CFFI for some reason.
yajl: deprecated YAJL 1.x + ctypes wrapper, for even older systems.
python: pure Python parser, good to use with PyPy
And later on in the FAQ it says:
Q: Are there any differences between the backends?
...
The python backend doesn't support allow_comments=True. It also internally works with str objects, not bytes, but this is an internal detail that users shouldn't need to worry about, and might change in the future.
If you want support for allow_comments=True, you need to be using one of the yajl-based backends. According to the docs:
Importing the top level library as import ijson uses the first available backend in the same order of the list above, and its name is recorded under ijson.backend. If the IJSON_BACKEND environment variable is set its value takes precedence and is used to select the default backend.
You'll need the necessary libraries, etc, installed on your system in order for this to work.
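Putting that together, a minimal sketch (the file name data.json is made up) that forces a yajl backend through the environment variable quoted above and then retries the failing call:

import os
os.environ['IJSON_BACKEND'] = 'yajl2_c'   # must be set before ijson is imported

import ijson
print(ijson.backend)                      # confirm a yajl-based backend was picked

with open('data.json') as f:
    for o in ijson.items(f, 'item', allow_comments=True):
        print(o)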

Julia: Using ProtoBuf to read messages from gzipped file

A sensor provides a stream of frames containing object coordinates, which are stored in ProtoBuf format in a gzipped file. I would like to read this file in Julia.
Using protoc, I have generated the Protobuf files for both Python and Julia, coordinate_push.py and coordinate_push.jl
My Python code is as follows:
import gzip
from google.protobuf.internal.decoder import _DecodeVarint32  # decodes the varint length prefix
import coordinate_push  # module generated by protoc

frameList = []
with gzip.open(filePath) as f:
    data = f.read()

next_pos, pos = 0, 0
while pos < len(data):
    msg = coordinate_push.CoordinatesFrame()
    # each message is preceded by a varint giving its size in bytes
    next_pos, pos = _DecodeVarint32(data, pos)
    msg.ParseFromString(data[pos:pos + next_pos])
    frameList.append(msg)
    pos += next_pos
I'd like to rewrite the above in Julia, and don't know where to start. Part of the problem is that I haven't fully understood the Python script (IO is not my strong point).
I understand that I need:
to open the gzip file, presumably using using GZip; file = GZip.open(file_path, "r")
to read in the data, along the lines of using ProtoBuf; data = readproto(iob, CoordinatesFrame())
What I don't understand is:
how to define iob, and especially how to link it to file (in the Julia Protobuf manual, we had iob = PipeBuffer(), but here it's a gzip-file that we'd like to read)
how to replicate the while-loop in Julia, and in particular the mysterious _DecodeVarint32 (I'm on Windows, if it's related to that.)
whether the file coordinate_push.jl has to be in the same directory as my main file, and if not, how I can properly import it (it is currently in a proto subfolder, and in Python I'd import it using from src.proto import coordinate_push)
Insight on any of the three points would be highly appreciated.
You should open an issue on the GZip GitHub repo and ask the first part of your question there (I am not a GZip expert, unfortunately).
On the second point, I suggest looking at https://github.com/JuliaIO/FileIO.jl/blob/master/README.md for lots of examples of FileIO loops, which seems to be exactly what you need to replicate that Python loop. For the second part of this question, your best bet is to try to hunt down the definition of that function on GitHub or in the docs somewhere.
For the third question, coordinate_push.jl does not need to be in the same folder as your "main file" (I am not sure what you mean by this, so perhaps it would help to add context on the structure of your files). To import that file, all you need to do is add include("path/to/coordinate_push.jl") at the top of the file you want to call/run the code from. It's worth noting that the path can be either the absolute path or the relative project path (in some cases).
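Since most of the confusion seems to be about what _DecodeVarint32 does: each protobuf message in the file is preceded by a base-128 ("varint") length prefix. A rough Python re-implementation, purely for illustration (keep using _DecodeVarint32 in real code):

def decode_varint(buf, pos):
    # Read one base-128 varint from buf starting at pos; return (value, new_pos).
    result, shift = 0, 0
    while True:
        b = buf[pos]            # buf is a bytes object, so b is an int
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):      # high bit clear -> last byte of the varint
            return result, pos
        shift += 7

So the loop in the question reads one varint to learn the size of the next message, parses exactly that many bytes into a CoordinatesFrame, and then advances pos; that framing is what the Julia port has to replicate after decompressing the bytes.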

Loading .npz with Python 3.5 always crashes

In this simple tutorial written in Python 2.7, they have a line loading the numpy array.
train_data = np.load(open('../musicnet.npz','rb'))
Then they get the data by indexing with different keys:
X,Y = train_data['2494']
Everything works well in Python 2.7.
The data type of train_data is numpy.lib.npyio.NpzFile.
My problem
However, when I try to do the same in Python 3.5, most of the lines work fine, but the line X,Y = train_data['2494'] just freezes forever. I would like to use Python 3.5 because my other projects are written in Python 3.5.
How can I rewrite this line so that it runs with Python 3.5?
Error Message
I finally managed to get the error message in terminal
It freezes there because there's a huge amount of output right after the error message; my Jupyter notebook just cannot handle that much information.
Solution
Change the encoding to 'bytes'
train_data = np.load('../musicnet.npz', encoding='bytes')
Then everything works fine.
You first said things crashed, now you say it freezes when trying to access a specific array. numpy has the same syntax in 3.5 as in 2.7, so you shouldn't have to rewrite anything.
np.load does have a couple of parameters that deal with differences between Py2 and Py3, but I'm not sure these are an issue for you.
fix_imports : bool, optional
    Only useful when loading Python 2 generated pickled files on Python 3,
    which includes npy/npz files containing object arrays. If `fix_imports`
    is True, pickle will try to map the old Python 2 names to the new names
    used in Python 3.
encoding : str, optional
    What encoding to use when reading Python 2 strings. Only useful when
    loading Python 2 generated pickled files in Python 3, which includes
    npy/npz files containing object arrays. Values other than 'latin1',
    'ASCII', and 'bytes' are not allowed, as they can corrupt numerical
    data. Default: 'ASCII'
Try
print(list(train_data.keys()))
This should show the array names that were saved to the zip archive. Do they match the names in the Py2 load? Do they include the '2494' name?
A couple of things are unusual about:
X,Y = train_data['2494']
Naming an array in the zip archive by a string number, and unpacking the load into two variables.
Do you know anything about how this was saved (np.savez)? What was saved?
Another question - are you loading this file from the same machine that Py2 worked on? Or has the file been transferred from another machine, and possibly corrupted?
As those parameters indicate, there are differences in the pickle code between Py2 and Py3. If the original save included object dtype arrays, or non-array objects, then they will be pickled and there might be incompatibilities in the pickle versions.
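If the archive does contain Python 2 pickled objects, a minimal sketch of loading it under Python 3 might look like this (allow_pickle is only required for object arrays in newer numpy releases, and whether 'bytes' or 'latin1' is the right encoding depends on what was saved):

import numpy as np

train_data = np.load('../musicnet.npz', allow_pickle=True, encoding='bytes')
print(list(train_data.keys()))   # inspect the stored array names first
X, Y = train_data['2494']        # then unpack, as in the original code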
Try this,
with np.load('../musicnet.npz') as train_data:
    X,Y = train_data['2494']
There are two ways out, in my view:
Re-edit your code from
train_data = np.load(open('../musicnet.npz','rb'))
to
train_data = np.load(open('../musicnet.npz','r'))
because the difference between the 'r' and 'rb' modes in Python 2.7 / 3.5 may be what matters in your situation.
Use the default debugger to pinpoint the significant error (this usually works, in my experience).

Python 3 combining file open and read commands - a need to close a file and how?

I am working through "Learn Python 3 the Hard Way" and am trying to make the code more concise. Lines 11 to 18 of the program below (line 1 starts at # program: p17.py) are relevant to my question. Opening and reading a file are very easy, and it is easy to see how you close the file you opened when working with files this way. The original section is commented out, and I include the concise version on line 16. I commented out the line of code that causes an error (on line 20):
$ python3 p17_aside.py p17_text.txt p17_to_file_3.py
Copying from p17_text.txt to p17_to_file_3.py
This is text.
Traceback (most recent call last):
  File "p17_aside.py", line 20, in <module>
    indata.close()
AttributeError: 'str' object has no attribute 'close'
Code is below:
# program: p17.py
# This program copies one file to another. It uses the argv function as well
# as exists - from sys and os.path modules respectively
from sys import argv
from os.path import exists
script, from_file, to_file = argv
print(f"Copying from {from_file} to {to_file}")
# we could do these two on one line, how?
#in_file = open(from_file)
#indata = in_file.read()
#print(indata)
# THE ANSWER -
indata = open(from_file).read()
# The next line was used for testing
print(indata)
# indata.close()
So my question is: should I just avoid the practice of combining commands as done above, or is there a way to properly deal with that situation so files are closed when they should be? Is it necessary to deal with closing the file at all in this situation?
A context manager and the with statement are a comfortable way to make sure your file is closed as needed:
with open(from_file) as fobj:
    indata = fobj.read()
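As a quick sanity check (just to illustrate the point, reusing the from_file variable from your script), you can confirm that the handle really is closed once the block exits:

with open(from_file) as fobj:
    indata = fobj.read()

print(fobj.closed)   # True: the with statement closed the file on exiting the block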
Nowadays, you can also use Path-like objects and their read_text and read_bytes methods:
from pathlib import Path

indata = Path(from_file).read_text()
The error you were seeing is because you were not trying to close the file, but the str into which you had read its contents. You'd need to assign the object returned by open a name, and then read from and close that one:
fobj = open(from_file)
indata = fobj.read()
fobj.close() # This is OK
Strictly speaking, you would not need to close that file, as dangling file descriptors are cleaned up when the process exits. Especially in a short example like this, it is of relatively little concern.
I hope I got the follow-up question in the comment correctly, to extend on this a bit more.
If you wanted a single command, look at the pathlib.Path example above.
With open as such, you cannot perform read and close in a single operation without assigning the result of open to a variable, because both read and close have to be performed on the same object returned by open. If you do:
var = fobj.read()
Now, var refers to the content read out of the file (so it is not something you could close; it has no close method).
If you did:
open(from_file).close()
After (but also before; at any point), this would simply open that file again and close it immediately. BTW, it returns None, just in case you wanted the return value. It would not affect previously opened file handles and file-like objects, and it serves no practical purpose except perhaps making sure you can open the file.
But again: it's good practice to perform the housekeeping, but strictly speaking (and especially in short code like this), if you did not close the file and relied on the OS to clean up after your process, it would work fine.
How about the following:
# to open the file and read it
indata = open(from_file).read()
print(indata)
# this opens the file again and closes that handle - just the opposite of opening and reading
open(from_file).close()

Creating parse trees in NLTK using tagged sentence

Using the methods defined in the NLTK book, I want to create a parse tree of a sentence that has already been POS tagged. From what I understand from the chapter linked above, any words you want to be able to recognize need to be in the grammar. This seems ridiculous, seeing as there's a built-in POS tagger that would make hand-writing the parts of speech for each word completely redundant. Am I missing some functionality of the parsing methods that allows for this?
With the Stanford parser, POS tags are not needed to get a parse for a sentence, as tagging is built into the model. The StanfordParser and its models are not available out of the box and need to be downloaded.
Most people see this error when trying to use the StanfordParser in NLTK:
>>> from nltk.parse import stanford
>>> sp = stanford.StanfordParser()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/anaconda3/lib/python3.5/site-packages/nltk/parse/stanford.py", line 51, in __init__
    key=lambda model_name: re.match(self._JAR, model_name)
  File "/home/user/anaconda3/lib/python3.5/site-packages/nltk/internals.py", line 714, in find_jar_iter
    raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))
LookupError:
===========================================================================
NLTK was unable to find stanford-parser\.jar! Set the CLASSPATH
environment variable.
For more information, on stanford-parser\.jar, see:
<http://nlp.stanford.edu/software/lex-parser.shtml>
===========================================================================
To fix this, download the Stanford Parser to a directory and extract the contents. Let's use the example directory /usr/local/lib/stanfordparser on a *nix system. The file stanford-parser.jar must be located there, along with the other files.
Once all the files are there, set the environment variables for the location of the parser and models:
>>> import os
>>> os.environ['STANFORD_PARSER'] = '/usr/local/lib/stanfordparser'
>>> os.environ['STANFORD_MODELS'] = '/usr/local/lib/stanfordparser'
Now you can use the parser to export the possible parses for the sentence you have, for example:
>>> sp = stanford.StanfordParser()
>>> sp.parse("this is a sentence".split())
<list_iterator object at 0x7f53b93a2dd8>
>>> trees = [tree for tree in sp.parse("this is a sentence".split())]
>>> trees[0] # example parsed sentence
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['this'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['sentence'])])])])])
An iterator object is returned since there can be more than one parse for a given sentence.
There are two different kinds of technology involved here. The chapter you link to is about hand-written context-free grammars, which typically have a few dozen rules and can handle a tiny subset of English (or any other language you cover). While it is possible to create a large-coverage system from a very large number of such rules (plus other technologies), the CFG implementation in the NLTK is only intended for teaching or demonstration purposes; put differently, it's a toy. Don't even think about using it for general-purpose parsing.
For parsing real text, there are probabilistic parsers like the Stanford parser (for which the nltk has an interface in nltk.parse.stanford). Such parsers are generally trained on large treebanks, they can handle unknown words, and as you would expect they either take POS-tagged text as input, or do their own POS tagging.
All this said, it's not hard to tweak the NLTK's CFG machinery to handle unknown words, if you have reason to do that: Write grammars over POS tags rather than over words (e.g., you'd write NP => "DT" "NN", so that the POS tags are the terminals); then extract the POS tags from your tagged sentence, build a parse tree over them, and put the words back in. (This won't be enough if your CFG contains rules that mix terminals and non-terminals, like "give" NP "to" NP.)
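To make that last suggestion concrete, here is a minimal sketch (the toy grammar and the tagged sentence are invented for illustration) that parses over the POS tags and then puts the words back in; the tags are introduced through preterminal rules (DT -> 'DT' and so on) so that terminals and non-terminals are not mixed in one rule:

import nltk

# A toy grammar whose terminals are POS tags rather than words.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT | DT NN
VP -> VBZ NP
DT -> 'DT'
NN -> 'NN'
VBZ -> 'VBZ'
""")

tagged = [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN')]
tags = [tag for word, tag in tagged]

parser = nltk.ChartParser(grammar)
tree = next(parser.parse(tags))          # parse over the tag sequence

# Put the original words back in place of the POS-tag leaves.
for i, leaf_pos in enumerate(tree.treepositions('leaves')):
    tree[leaf_pos] = tagged[i][0]

print(tree)

After the substitution, tree is an ordinary nltk.Tree over the original words, with the POS categories as the preterminal nodes.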
