Last updated: June 8, 2014
This is a preliminary list of common pull request fixes requested. It’s presumed that your pull request should already pass the Travis buildbot, including docstring and code formatting checks.
Python supports breaking a logical line over multiple file lines in a number of ways. One is to use backslashes before the line ending. Another is to enclose the broken section is parentheses (), ([] also works but you should only use this if you are otherwise creating a list). Note that if you have open parentheses from a function call you do not need additional parentheses.
In Pylearn2 we generally prefer parentheses, because it means there’s less markup to maintain and leads to less spurious errors.
Yes:
assert some_complicated_conditional_thing, (
"This is the assertion error on a separate line."
)
No:
assert some_complicated_conditional_thing, \
"This just gets annoying, especially if there are multiple " \
"lines of text."
Note that string concatenation across lines is automatic, no need for +. If enclosed in parentheses you don’t need a either:
# Valid Python.
print ("The quick brown fox jumps over the lazy dog. And then "
"the fox did it again.")
See the PEP8 indentation recommendations for how to arrange indentation for continuation lines.
Pylearn2 grew rapidly in the beginning, often without proper attention to testing. Modifying a piece of code in the codebase may alter how it works; if you make such a modification, you should not only verify that tests pass but that tests _exist_ for the piece of code you’re modifying. You should verify that those tests exist and update them as needed, including a test case for the behaviour you’re adding or modifying.
Usually tests for a module foo are found in tests/test_foo.py.
Tests that test for previously existing bugs are particularly critical, as further modification of the code may reintroduce the bug by those who are not aware of the subtleties that led to it in the first place.
Your pull request description (or a commit message for one of the commits) should include one of the supported variants of the syntax so that the issue is auto-closed upon merge.
Pull requests with lots and lots of tiny commits are hard to review. Lots of commits that subsequently introduce a minor bug and then fix them can also make bisecting a pain.
Your final pull request should comprise as few commits as logically make sense. Each commit should ideally leave the repository in a working state (tests passing, functionality preserved).
You can squash commits using git rebase -i and following the instructions. Note that you will have to git push –force origin my_branch_name after a rebase.
You should squash to a minimal set of semantically distinct commits before asking for a review, and then possibly squash again if you’ve made lots of commits in response to feedback (note that you can reorder the commits in the editor window given by git rebase -i).
The order of iteration over dictionaries in Python is not guaranteed to remain the same across different invocations of the same program. This is a result of a randomized hashing algorithm and is actually an important security feature for preventing certain kinds of Denial-of-Service attacks. Unfortunately, where such data structures are employed in scientific simulations, this can pose reproducibility problems.
The main reason for this is that computations in floating point do not precisely obey the typical laws of arithmetic (commutativity, associativity, distributivity), and slight differences in the order of operations can introduce small differences in result, which can have butterfly effects that significantly alter the results of a long-running job. The order of operations can be altered by the order in which a Theano graph is assembled, and the precise form it takes can unfortunately sometimes alter which compile-time graph optimizations are performed.
In order to stamp out inconsistencies introduced by an unpredictable iteration order, we make extensive use of the OrderedDict class. This class is part of the collections module in Python 2.7 and Python 3.x, however, in order to maintain Python 2.6 compatibility, we import it from theano.compat.python2x, which provides an equivalent pure-Python implementation if the built-in version is not available.
You should consider carefully whether the iteration order over a dictionary you’re using could result in different behaviour. If in doubt, use an OrderedDict. For the updates parameter when creating Theano functions, you _must_ use an OrderedDict, or a list of (shared_variable, update_expression) tuples.
When iterating over sets, consider whether you should first sort your set. The sorted() built-in function is a simple way of doing this.
In most cases you should be using logging statements instead. You can initialize a logger in a new module with:
import logging
log = logging.getLogger(__name__)
And subsequently call into it with log.info(), log.warning(), etc.
If so, consider using the faster and more memory efficient versions.
Note that zip and izip truncate the sequence of tuples they produce to the length of the shortest input sequence. If you expect, as is often the case, that the sequences you are zipping together should be the same length, use safe_zip or safe_izip defined in pylearn2.utils.
Also see itertools.izip_longest if you want to zip together sequences of unequal length with a fill value.
For values() and items() consider whether itervalues() or iteritems() would be more appropriate, if you’re only iterating over them once, not keeping the result around for any length of time, and don’t need random access.
Also, don’t bother with keys() or iterkeys() at all if you’re just going to iterate over it. for k in my_dictionary iterates over keys by default.
An exception to these rules is if you are _modifying_ the dictionary within the loop. Then you probably want to duplicate things with the keys(), values() and items() calls.
If you are using the update() method of a dictionary or OrderedDict and you expect that none of the keys in the argument should already be in the dictionary, use safe_update() defined in pylearn2.utils.
You should almost never have a bare except: in library code. Use:
except Exception:
...
instead. This catches any subclass of Exception but lets through certain low-level exceptions like KeyboardInterrupt, SystemExit, etc. that inherit from BaseException instead. You almost certainly do not want your code to catch these.
Don’t raise a new exception, use the reraise_as method from pylearn2.utils.exc instead.:
except Exception:
reraise_as(ValueError("Informative error message here"))
This retains the traceback and original error message, allowing for easier debugging using a tool like pdb.
In places where a list, tuple, or other iterable object (say, a deque) will suffice, use pylearn2.utils.is_iterable.
Unless you have a very good reason you should probably be using isinstance(foo, basestring) which correctly handles both str and unicode instances.
Usually such checks are unnecessary but where they might be, we’ve defined some helpful constants.
Use isinstance(foo, pylearn2.utils.py_number_types). This checks against Python builtins as well as NumPy-defined numerical types.
Use isinstance(foo, pylearn2.utils.py_integer_types). This checks against Python builtins as well as NumPy-defined integer types.
First, ask yourself: do you really need to? Would passing an integer here be inappropriate in all circumstances? Would a cast (i.e. float() be sufficient?
If you really need to, use isinstance(foo, pylearn2.utils.py_float_types). This checks against Python builtins as well as NumPy-defined float types.
Again, ask yourself whether passing a real here would be an error, and whether you can get away with a cast.
If you really need to, use isinstance(foo, pylearn2.utils.py_complex_types). This checks against Python builtins as well as NumPy-defined complex types.
If so, use pylearn2.utils.contains_nan or pylearn2.utils.contains_inf. To check for either np.nan or np.inf, use pylearn2.utils.isfinite. These functions are faster and more memory effecient than np.any(np.isnan(X)) or np.any(np.isinf(X)).
If you’re building Theano functions, use pylearn2.utils.function. This disables the on_unused_input check, which in most cases you don’t want to consider an error if you’re doing any kind of generic graph building.
Use pylearn2.utils.as_floatX to cast symbolic quantities to the default floating point type, and use constantX to create symbolic constants from a scalar or ndarray with the dtype specified in theano.config.floatX.
Example:
stuff = []
for i in range(50):
for j in range(20):
for k in range(30):
stuff.append((i, j, k))
Consider whether itertools.product will get the job done more readably and probably more efficiently.
itertools contains the functions permutations, combinations and combinations_with_replacement that will probably get the job done more efficiently than your own code.
Use the decorator pylearn2.utils.wraps to inherit the docstring if it is unchanged. If you add a docstring to a function that is wrapped in this fashion, it will be appended below the inherited docstring.
If you are using the NumPy generator, are you providing a way to seed it as well as a default seed ? You should never be using numpy.random functions directly. Use pylearn2.utils.rng.make_np_rng with a user-provided seed and a default_seed argument.
If you are using the Theano RNG you should create it similarly with pylearn2.utils.rng.make_theano_rng.
Use os.path.join rather than concatenating together with ‘/’. This ensures the code still works on Windows.
Use os.path.basename and os.path.dirname to ensure Windows compatibility.
Use the with statement, i.e.:
with open(fname, 'w') as f:
f.write('blah blah blah')
This is cleaner and ensures that the file always gets closed, even in an error condition.