Articles in the python category

Investigating the Python bound method

Work with Python for a while and you'll become familiar with printing a method and getting something like:

<bound method Foo.function of <__main__.Foo instance at 0xb736960c>>

I think there is room for one more explanation on the internet, since I've never seen it diagrammed out (maybe for good reason!).

An illustration of Python bound methods

In the above diagram on the left, we have the fairly simple conceptual model of a class with a function. One naturally tends to think of the function as a part of the class and your instance calls into that function. This is conceptually correct, but a little abstracted from what's actually happening.

The right attempts to illustrate the underlying process in some more depth. The first step, on the top right, is building something like the following:

class Foo():
   def function(self):
       print "hi!"

As this illustrates, the above code results in two things happening; firstly a function object for function is created and secondly the __dict__ attribute of the class is given a key function that points to this function object.

Now the thing about this function object is that it implements the descriptor protocol. In short, if an object implements a __get__ method, then when that object lives in a class and is accessed as an attribute through the class or one of its instances, the __get__ method is called. You can read up on the descriptor protocol, but the important part to remember is that __get__ is passed the context from which it was accessed; that is, the object on which the attribute was looked up.
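As a minimal sketch of the protocol (written for Python 3, and with the names Announce and Holder made up purely for illustration), a class that defines __get__ gets control whenever one of its instances is looked up as an attribute through another class:

```python
class Announce(object):
    """A hypothetical descriptor that reports how it was accessed."""

    def __get__(self, instance, owner):
        # instance is the object the attribute was looked up on (None when
        # looked up on the class itself); owner is the class.
        return "accessed via %r of %s" % (instance, owner.__name__)


class Holder(object):
    attr = Announce()   # lives in Holder.__dict__, just like a function would

    def __repr__(self):
        return "<Holder>"


h = Holder()
print(h.attr)        # __get__ is called with instance=h
print(Holder.attr)   # __get__ is called with instance=None
```

Function objects do the same thing, except that their __get__ hands back a method object rather than a string.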

So, for example, when we then do the following:

f = Foo()
f.function()

what happens is that we get the attribute function of f and then call it. f itself doesn't actually know anything about function as such; what it does know is its class, and so Python goes searching the __dict__ of that class (and then of any parent classes) to try and find the function attribute. It finds it there and, as per the descriptor protocol, accessing the attribute calls the __get__ method of the underlying function object.

What happens now is that the function's __get__ method returns essentially a wrapper object that stores the information needed to bind the function to the object. This wrapper object is of type types.MethodType and you can see it stores some important attributes: im_func, which is the function to call, and im_self, which is the object it was accessed through. Passing the object through to im_self is how function gets its first self argument (the calling object).

So when you print the value of f.function (note: without calling it) you see it report itself as a bound method. Hopefully this illustrates that a bound method is just a special object that knows how to call an underlying function with context about the object that's calling it.

To try and make this a little more concrete, consider the following program:

import types

class Foo():

    def function(self):
        print "hi!"

f = Foo()

# this is a function object
print Foo.__dict__['function']

# this is a method as returned by
#   Foo.__dict__['function'].__get__()
print f.function

# we can check that this is an instance of MethodType
print type(f.function) == types.MethodType

# the im_func field of the MethodType is the underlying function
print f.function.im_func
print Foo.__dict__['function']

# these are the same object
print f.function.im_self
print f

Running this gives output something like

$ python ./foo.py
<function function at 0xb73540d4>
<bound method Foo.function of <__main__.Foo instance at 0xb736960c>>
True
<function function at 0xb73540d4>
<function function at 0xb73540d4>
<__main__.Foo instance at 0xb72c060c>
<__main__.Foo instance at 0xb72c060c>

To pull it apart: we can see that Foo.__dict__['function'] is a function object, but then f.function is a bound method. The bound method's im_func is the underlying function object, and the im_self is the object f. Calling the bound method therefore amounts to im_func(im_self), which is calling function with the correct object as the first argument self.
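The program above is Python 2. As a rough Python 3 equivalent (a sketch, not a port of the exact output), the same pieces can be poked at by calling __get__ by hand; in Python 3 the attributes are spelled __func__ and __self__ rather than im_func and im_self:

```python
import types


class Foo(object):
    def function(self):
        return "hi!"


f = Foo()

# The plain function object stored in the class __dict__.
func = Foo.__dict__['function']

# Calling __get__ by hand does what attribute access normally does for us.
bound = func.__get__(f, Foo)

print(type(bound) is types.MethodType)   # it is a bound method
print(bound.__func__ is func)            # Python 3 name for im_func
print(bound.__self__ is f)               # Python 3 name for im_self

# Calling the bound method is the same as calling the function with f.
print(bound() == func(f))
```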

So the main point is to shift from thinking about a function as some intrinsic part of a class to thinking of it as a separate object, abstracted from the class, that gets bound to an instance as required. The class is in some ways a template and namespacing tool to allow you to find the right function objects; it doesn't actually implement the functions as such.

There is plenty more information if you search for "descriptor protocol" and Python binding rules and lots of advanced tricks you can play. But hopefully this is a useful introduction to get an initial handle on what's going on!

Python and --prefix

Something interesting I discovered about Python and --prefix that I can't see a lot of documentation on...

When you build Python you can use the standard --prefix flag to configure to home the installation as you require. You might expect that this would hard-code the location to look for the support libraries to the value you gave; however in reality it doesn't quite work like that.

Python will only look in the directory specified by prefix after it first searches relative to the path of the executing binary. Specifically, it looks at argv[0] and works through a few steps — is argv[0] a symlink? then dereference it. Does argv[0] have any slashes in it? if not, then search the $PATH for the binary. After this, it starts searching for dirname(argv[0])/lib/pythonX.Y/os.py, then dirname(argv[0])/../lib and so on, until it reaches the root. Only after these searches fail does the interpreter then fall back to the hard-coded path specified in the --prefix when configured.
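To make the search order a little more concrete, here is a simplified model of it in Python 3. This is only a sketch of the steps described above: find_prefix and its arguments are hypothetical names, and the real logic (in C) handles more cases than this does.

```python
import os
import shutil
import sys


def find_prefix(argv0, landmark, fallback):
    """Return the first directory above the binary that contains landmark."""
    # No slash in argv[0]? Then look the binary up on $PATH.
    if os.sep not in argv0:
        argv0 = shutil.which(argv0) or argv0
    # Dereference symlinks, then start from the binary's directory.
    directory = os.path.dirname(os.path.realpath(argv0))
    while True:
        parent = os.path.dirname(directory)
        if parent == directory:
            # Reached the root without a match; the root itself is not
            # checked, so fall back to the configure-time --prefix.
            return fallback
        if os.path.isfile(os.path.join(directory, landmark)):
            return directory
        directory = parent


# Illustrative call only: the landmark layout and the fallback value are
# assumptions and will not match every platform.
landmark = os.path.join("lib", "python%d.%d" % sys.version_info[:2], "os.py")
print(find_prefix(sys.executable, landmark, "/usr/local"))
```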

What are the practical implications? It means you can move a Python installation tree around and have it all "just work", which is nice. In my situation, I noticed this because we have a completely self-encapsulated build toolchain, but we wish to ship the same interpreter on the thing that we're building (during the build, we run the interpreter to create .pyc files for distribution, and we need to be sure that when we do this we don't accidentally pick up any of the build host's Python; only the toolchain Python).

The PYTHONHOME environment variable does override this behaviour; if it is set then the search stops there. Another interesting thing is that sys.prefix is therefore not the value passed in by --prefix during configure, but the value of the current dynamically determined prefix value.

If you run an strace, you can see this in operation.

readlink("/usr/bin/python", "python2.7", 4096) = 9
readlink("/usr/bin/python2.7", 0xbf8b014c, 4096) = -1 EINVAL (Invalid argument)
stat64("/usr/bin/Modules/Setup", 0xbf8af0a0) = -1 ENOENT (No such file or directory)
stat64("/usr/bin/lib/python2.7/os.py", 0xbf8af090) = -1 ENOENT (No such file or directory)
stat64("/usr/bin/lib/python2.7/os.pyc", 0xbf8af090) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/python2.7/os.py", {st_mode=S_IFREG|0644, st_size=26300, ...}) = 0
stat64("/usr/bin/Modules/Setup", 0xbf8af0a0) = -1 ENOENT (No such file or directory)
stat64("/usr/bin/lib/python2.7/lib-dynload", 0xbf8af0a0) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/python2.7/lib-dynload", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0

Firstly it dereferences symlinks. Then it looks for Modules/Setup to see if it is running out of the build tree. Then it starts looking for os.py, walking its way upwards. One interesting thing that may either be a bug or a feature, I haven't decided, is that if you set the prefix to / then the interpreter will not go back to the root and then look in /lib. This is probably pretty obscure usage though!

All this is implemented in Modules/getpath.c which has a nice big comment at the top explaining the rules in detail.

pylint and hiding of attributes

I recently came across the pylint error:

E:  3,4:Foo.foo: An attribute affected in foo line 12 hide this method

in code that boiled down to essentially:

class Foo:

    def foo(self):
        return True

    def foo_override(self):
        return False

    def __init__(self, override=False):
        if override:
            self.foo = self.foo_override

Unfortunately that message isn't particularly helpful in figuring out what's going on. I still can't claim to be 100% sure what the message is intended to convey, but I can construct something that maybe it's talking about.

Consider the following, using the above class:

foo = Foo()
moo = Foo(override=True)

print "expect True  : %s" % foo.foo()
print "expect False : %s" % moo.foo()
print "expect True  : %s" % Foo.foo(foo)
print "expect False : %s" % Foo.foo(moo)

which gives output of:

$ python ./foo.py
expect True  : True
expect False : False
expect True  : True
expect False : True

Now, if you read just about any Python tutorial, it will say something along the lines of:

... the special thing about methods is that the object is passed as the first argument of the function. In our example, the call x.f() is exactly equivalent to MyClass.f(x). In general, calling a method with a list of n arguments is equivalent to calling the corresponding function with an argument list that is created by inserting the method’s object before the first argument. [Official Python Tutorial]

The official tutorial above is careful to say in general; others often don't.

The important point to remember is how Python internally resolves attribute references, as described by the data model. The moo.foo() call is really moo.__dict__["foo"](); the value stored there is already a bound method, so it needs no extra argument. Examining the __dict__ for the moo object we can see that foo has been re-assigned:

>>> print moo.__dict__
{'foo': <bound method Foo.foo_override of <__main__.Foo instance at 0xb72838ac>>}

Our Foo.foo(moo) call is really Foo.__dict__["foo"](moo) -- the fact that we reassigned foo in moo is never noticed. If we were to do something like Foo.foo = Foo.foo_override we would modify the class __dict__, but that doesn't give us the original semantics.
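The two lookups can be spelled out by hand. This is a Python 3 sketch of the same class; the behaviour shown relies on functions being non-data descriptors, so an entry in the instance __dict__ takes precedence over the one in the class:

```python
class Foo(object):
    def foo(self):
        return True

    def foo_override(self):
        return False

    def __init__(self, override=False):
        if override:
            # Stores a bound method in the *instance* __dict__.
            self.foo = self.foo_override


moo = Foo(override=True)

# Instance lookup finds the entry in moo.__dict__ first. It is already
# bound to moo, so it is called with no extra argument.
print(moo.__dict__["foo"]())      # same result as moo.foo()

# Class lookup never consults moo.__dict__.
print(Foo.__dict__["foo"](moo))   # same result as Foo.foo(moo)
```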

So I postulate that the main point of this warning is to suggest to you that you're creating an instance that now behaves differently to its class. Because the equivalence of calling x.f() and calling MyClass.f(x) is so widely assumed, you might end up getting some strange behaviour, especially if you start with heavy-duty introspection of classes.

Thinking about various hacks and ways to re-write this construct is kind of interesting. I think I might have found a hook for a decent interview question :)

Handling hostnames, UDP and IPv6 in Python

So, you have some application where you want the user to specify a remote host/port, and you want to support IPv4 and IPv6.

For literal addresses, things are fairly simple. IPv4 addresses need no special treatment, and RFC2732 has IPv6 covered by putting the address within square brackets, so that its colons can't be confused with the port separator.

It gets more interesting as to what you should do with hostnames. The problem is that getaddrinfo can return you multiple addresses, but without extra disambiguation from the user it is very difficult to know which one to choose. RFC4472 discusses this, but there does not appear to be any good solution.

Possibly you can do something like ping/ping6 and have a separate program name or configuration flag to choose IPv6. This comes at a cost of transparency.

The glibc implementation of getaddrinfo() puts considerable effort into deciding if you have an IPv6 interface up and running before it will return an IPv6 address. It will even recognise link-local addresses and sort addresses more likely to work to the front of the returned list as described here. However, there is still a small possibility that the IPv6 interface doesn't actually work, and so the library will sort the IPv6 address as first in the returned list when maybe it shouldn't be.

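If you are using TCP, you can connect to each address in turn to find one that works. With UDP, however, the connect essentially does nothing: it only records the peer address locally and no packets are exchanged, so it can't tell you whether the remote end is reachable.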
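A quick way to convince yourself of this is to connect a UDP socket to a port where nothing is listening. The sketch below is Python 3; udp_connect is a hypothetical helper, and port 9 on localhost is just an assumption about a port with no listener:

```python
import socket


def udp_connect(address):
    """Connect a UDP socket and return the peer address it recorded."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No packet is sent; the kernel only records the destination.
        sock.connect(address)
        return sock.getpeername()
    finally:
        sock.close()


# Nothing needs to be listening for this to succeed.
print(udp_connect(("127.0.0.1", 9)))
```

The connect can still fail for local reasons, such as there being no route for that address family, which is about all the checking you get.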

So I believe probably the best way to handle hostnames for UDP connections, at least on Linux/glibc, is to trust getaddrinfo to return the sanest values first, try a connect on the socket anyway just for extra security and then essentially hope it works. Below is some example code to do that (literal address splitter bit stolen from Python's httplib).

import socket

DEFAULT_PORT = 123

host = '[fe80::21c:a0ff:fb27:7196]:567'

# the port will be anything after the last :
p = host.rfind(":")

# ipv6 literals should have a closing bracket
b = host.rfind("]")

# if the last : is outside the [addr] part (or if we don't have []'s)
if (p > b):
    try:
        port = int(host[p+1:])
    except ValueError:
        print "Non-numeric port"
        raise
    host = host[:p]
else:
    port = DEFAULT_PORT

# now strip off ipv6 []'s if there are any
if host and host[0] == '[' and host[-1] == ']':
    host = host[1:-1]

print "host = <%s>, port = <%d>" % (host, port)

the_socket = None

res = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_DGRAM)

# go through all the returned values in the order getaddrinfo sorted
# them, and use the first one we can create a socket for and connect.
for r in res:
    af,socktype,proto,cname,sa = r

    try:
        the_socket = socket.socket(af, socktype, proto)
        the_socket.connect(sa)
    except socket.error, e:
        # connect failed!  forget this socket and try the next one
        the_socket = None
        continue

    break

if the_socket is None:
    raise socket.error, "Could not get address!"

# ready to send!
the_socket.send("hi!")
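For reuse, the literal address splitting can be pulled out into a function. This is a Python 3 sketch of the same logic as above; split_host_port is a hypothetical name, and like the original it assumes IPv6 literals are always given in brackets:

```python
def split_host_port(spec, default_port=123):
    """Split 'host', 'host:port' or '[v6addr]:port' into (host, port)."""
    colon = spec.rfind(":")
    bracket = spec.rfind("]")
    if colon > bracket:
        # The last ':' comes after any ']', so it introduces a port.
        # int() raises ValueError for a non-numeric port.
        host, port = spec[:colon], int(spec[colon + 1:])
    else:
        host, port = spec, default_port
    # Strip the brackets from an IPv6 literal, if there are any.
    if host.startswith("[") and host.endswith("]"):
        host = host[1:-1]
    return host, port


print(split_host_port("[fe80::21c:a0ff:fb27:7196]:567"))
print(split_host_port("example.com"))
```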

numbers2ppm

I'm not sure how to describe this best, but numbers2ppm.py is a Python script to turn a list of numbers into a (plain format) PPM image filled with coloured boxes. Perhaps an example is best.

$ cat test.in
01234567899876543210

$ ./numbers2ppm.py -W 40 -H 40 -c 10 ./test.in > test.ppm

$ convert test.ppm test.png

You should end up with

numbers2ppm script output

If you're me, you could use something like this to read a dump of reference counts of physical frames of memory dumped from the kernel, creating a nice graphical view of memory usage and sharing. I imagine it may come in handy for other things too.
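For anyone who hasn't met the plain PPM format, the idea can be sketched in a few lines of Python 3. This is not the numbers2ppm.py script itself: digits_to_ppm, its box argument and the grey palette are all assumptions made for illustration.

```python
def digits_to_ppm(digits, box=2):
    """Render a string of digits 0-9 as one row of boxes in plain PPM."""
    # Hypothetical palette: a grey ramp from black (0) towards white (9).
    palette = [(28 * d, 28 * d, 28 * d) for d in range(10)]
    width, height = box * len(digits), box
    # Header: magic number, dimensions, maximum colour value.
    lines = ["P3", "%d %d" % (width, height), "255"]
    for _ in range(height):
        for ch in digits:
            # One "R G B" triple per line, repeated box times per digit.
            lines.extend(["%d %d %d" % palette[int(ch)]] * box)
    return "\n".join(lines) + "\n"


print(digits_to_ppm("09", box=1), end="")
```

The output can be redirected to a file and handed to convert in the same way as above.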

Printing files side-by-side

I'm really not sure if there is an easier way to do this, but here is my newly most-used utility. It puts two files beside each other; but as opposed to sdiff/diff -y it doesn't analyse them, and as opposed to paste it keeps to fixed widths. Here is a screen shot.

$  python ./side-by-side.py --width 40 /tmp/afile.txt /tmp/afile2.txt
this is a file                           |  i have here another file
that has lines of                        |  that also has some text and
text.  to read                           |  some lines.  although it
                                         |  is slightly longer than the other
this is a really really really really re *  file with all these words
                                         |  in
                                         |  it

I'd love to hear from all the Python freaks how you could get the LOC even lower; every time I do something like this I find out about a new, quicker, way to recurse a list :)

#!/usr/bin/python

import sys, os
from optparse import OptionParser

class InFile:
    def __init__(self, filename):
        try:
            self.lines=[]
            self.maxlen = 0
            for l in open(filename).readlines():
                self.lines.append(l.rstrip())
        except IOError, (error, message):
            print "Can't read input %s : %s" % (filename, message)
            sys.exit(1)

        self.nlines = len(self.lines)
        if self.nlines == 0:
            self.lines.append("")
        self.maxlen = max(map(len, self.lines))

    # pad to the max len, with an extra space then the delimiter
    def pad_lines(self, nlines, width=0, nodiv=False, notrunc=False):
        if width == 0:
            width = self.maxlen
        pad = []
        for i in range(0, width):
            pad += " "
        # add on some extra for the divider and spaces
        pad += "   "
        padlen = len(pad)
        for i in range(0, nlines):
            try:
                linelen = len(self.lines[i])
            except IndexError:
                self.lines.append("")
                linelen = 0
            if (linelen > width):
                linelen = width
                if not notrunc:
                    pad[-2] = "*"
            elif nodiv:
                pad[-2] = " "
            else:
                pad[-2] = "|"
            self.lines[i] = self.lines[i][:linelen] +  "".join(pad[linelen - padlen:])

usage= "side-by-side [-w width] file1 file2 ... filen"

parser = OptionParser(usage, version=".1")

parser.add_option("-w", "--width", dest="width", action="store", type="int",
                      help="Set fixed width for each file", default=0)
parser.add_option("--last-div", dest="lastdiv", action="store_true",
                  help="Print divider after last file", default=False)
parser.add_option("--no-div", dest="nodiv", action="store_true",
                  help="Don't print any divider characters", default=False)
parser.add_option("--no-trunc", dest="notrunc", action="store_true",
                  help="Don't show truncation with a '*'", default=False)

(options, args) = parser.parse_args()

flist = []
if (len(args) == 0):
    print usage
    sys.exit(1)
for f in args:
    flist.append(InFile(f))

max_lines = max(map(lambda f: f.nlines, flist))

for i in range(0,len(flist)):
    if (len(flist)-1 == i):
        options.nodiv = not options.lastdiv
    flist[i].pad_lines(max_lines, options.width, options.nodiv, options.notrunc)

for l in range(0, max_lines):
    for f in flist:
        print f.lines[l],
    print

update: Leon suggests

pr -Tm file1 file2

Which is pretty close, but doesn't seem to put any divider between the files. Still might be a handy tool for your toolbox. It seems, from the pr man page, the word for doing this sort of thing is columnate.

update 2: Told you I'd learn new ways to iterate! Stephen Thorne came up with a neat solution. Some extra tricks he used

  • ' ' * n gives you n spaces. Pretty obvious when you think of it!
  • Usage of generators with the yield statement.
  • The itertools module with its modified zip-like izip function.

All very handy tips for your Python toolbox. If you're learning Python I'd recommend solving this problem as you can really put to use some of the niceties of the language.
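To show how those three tricks fit together, here is a small Python 3 sketch. It is not Stephen's solution; pad and columns are made-up names, and zip_longest is the Python 3 relative of the izip family that keeps going until the longest file runs out.

```python
from itertools import zip_longest   # izip_longest in Python 2


def pad(cell, width):
    """Truncate cell to width, then pad it out with spaces."""
    cell = cell[:width]
    return cell + " " * (width - len(cell))   # ' ' * n gives n spaces


def columns(texts, width, divider=" | "):
    """Yield one output line per row, each text in a fixed-width column."""
    split = [t.splitlines() for t in texts]
    # zip_longest fills in "" once the shorter texts run out of lines.
    for row in zip_longest(*split, fillvalue=""):
        yield divider.join(pad(cell, width) for cell in row).rstrip()


left = "this is a file\nthat has lines of\ntext."
right = "i have here\nanother file"
for line in columns([left, right], width=20):
    print(line)
```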

Convert an IP address to hexadecimal

Here is a Python script to convert IP addresses into hexadecimal, which may be required to name files for your bootloader if you are trying to netboot, for example. You can specify a mask if you have a large group of machines on a network; the mask is the number of low-order bits to drop (e.g. 10.1.3.2 with a mask of 24 will just give you 0x0A == 10d, while a mask of 16 gives you 0x0A01).

import re
import sys
import socket

if (not len(sys.argv) == 2):
    print "Usage: ip2hex.py hostname|ip address/mask"
    sys.exit(1)

try:
    (in_str, mask) = sys.argv[1].split("/")
    # sanity check mask
    mask = int(mask)
    if (mask > 32 or mask < 0):
        print "Mask out of range"
        sys.exit(1)
except ValueError:
    mask = 0
    in_str = sys.argv[1]

try:
    ip_addr = socket.gethostbyname(in_str)
except:
    print "Invalid address!"
    sys.exit(1)

#gethostbyname really checks this for us, but you never know
ip_regex = re.compile('(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.' \
                      '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.' \
                      '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.' \
                      '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)')
ip_match = ip_regex.match(ip_addr)

if (ip_match == None):
    print "Invalid address"
    sys.exit(1)

hex_ip_addr = 0
for i in range(1,5):
    hex_ip_addr += int(ip_match.group(i)) << (4-i)*8

fmt = "%%0%dX" % ((32 - mask) / 4)
print fmt % (hex_ip_addr >> mask)
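If you only need the conversion itself, much of the above can be replaced by the socket and struct modules. This is a Python 3 sketch rather than a port of the script (it does no hostname lookup), and ip_to_hex is a hypothetical name; the mask keeps the same meaning of dropping that many low-order bits.

```python
import socket
import struct


def ip_to_hex(address, mask=0):
    """Return an IPv4 address as hex with the low mask bits dropped."""
    if not 0 <= mask <= 32:
        raise ValueError("mask out of range")
    # inet_aton parses the dotted quad into 4 bytes in network order.
    (value,) = struct.unpack("!I", socket.inet_aton(address))
    digits = (32 - mask) // 4
    # "%0*X" takes the field width from the argument tuple.
    return "%0*X" % (digits, value >> mask)


print(ip_to_hex("10.1.3.2"))
print(ip_to_hex("10.1.3.2", 24))
print(ip_to_hex("10.1.3.2", 16))
```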

on asterisks in Python

Asterisks have several meanings in Python. Firstly, consider their use in a function definition:

>>> def function(arg, *vargs, **kargs):
    print arg
    print vargs
    print kargs
>>> function(1, 2,3,4,5, test1="abc", test2="def")
1
(2, 3, 4, 5)
{'test1': 'abc', 'test2': 'def'}

  • *vargs puts all left-over non-keyword arguments into a tuple called vargs.
  • **kargs puts all left-over keyword arguments into a dictionary called kargs.

On the other hand, you can use the asterisk with a tuple when calling a function to expand out the elements of the tuple into positional arguments

>>> def function(arg1, arg2, arg3):
    print arg1, arg2, arg3
>>> args = (1,2,3)
>>> function(*args)
1 2 3

You can do a similar thing with keyword arguments and a dictionary with the double asterisk operator

>>> def function(arg1=None, arg2=None):
     print arg1, arg2
>>> kwargs = {"arg1":"1", "arg2":"2"}
>>> function(**kwargs)
1 2
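The two directions are often used together, for example in a wrapper that collects whatever arguments it is given and passes them straight on. A small Python 3 sketch, with logged and add being made-up names:

```python
def logged(func):
    """Hypothetical wrapper that forwards whatever arguments it is given."""
    def wrapper(*args, **kwargs):
        # Left-over arguments are collected into a tuple and a dict here...
        print("calling %s with %r %r" % (func.__name__, args, kwargs))
        # ...and expanded back into positional and keyword arguments here.
        return func(*args, **kwargs)
    return wrapper


def add(a, b=0):
    return a + b


logged_add = logged(add)
print(logged_add(1, b=2))
```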