.. _pull_request_checklist:

###############################
Pylearn2 Pull Request Checklist
###############################

*Last updated: June 8, 2014*

This is a preliminary list of commonly requested pull request fixes. It is
assumed that your pull request already passes the Travis buildbot, including
the docstring and code formatting checks.

.. contents::
    :local:

Are you breaking a statement over multiple lines?
=================================================

Python supports breaking a logical line over multiple file lines in a number
of ways. One is to use backslashes before the line ending. Another is to
enclose the broken section in parentheses `()` (square brackets `[]` also
work, but you should only use them if you are otherwise creating a list). Note
that if you already have open parentheses from a function call, you do not
need additional parentheses.

In Pylearn2 we generally prefer parentheses, because they mean there is less
markup to maintain and lead to fewer spurious errors.

Yes::

    assert some_complicated_conditional_thing, (
        "This is the assertion error on a separate line."
    )

No::

    assert some_complicated_conditional_thing, \
        "This just gets annoying, especially if there are multiple " \
        "lines of text."

Note that adjacent string literals are concatenated automatically; there is no
need for `+`. If the pieces are enclosed in parentheses you don't need a
backslash either::

    # Valid Python.
    print ("The quick brown fox jumps over the lazy dog. And then "
           "the fox did it again.")

See the `PEP8 indentation recommendations`_ for how to arrange indentation for
continuation lines.

.. _PEP8 indentation recommendations: http://www.python.org/dev/peps/pep-0008/#indentation

Do tests exist for the code you're modifying?
=============================================

Pylearn2 grew rapidly in the beginning, often without proper attention to
testing. Modifying a piece of code may alter how it works; if you make such a
modification, verify not only that the existing tests pass but that tests
*exist* for the piece of code you're modifying. Update them as needed,
including a test case for the behaviour you're adding or changing. Usually
tests for a module `foo` are found in `tests/test_foo.py`.

Are you fixing a bug? Did you add a regression test?
====================================================

Tests that cover previously existing bugs are particularly critical, as
further modification of the code may reintroduce a bug through changes made by
someone who is not aware of the subtleties that led to it in the first place.
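To make this concrete, here is a minimal sketch of a regression test in the
style used under `tests/`. The file name and the bug scenario are invented
purely for illustration; `pylearn2.utils.contains_nan` is used only because it
is a small utility mentioned later in this checklist::

    # tests/test_foo.py -- hypothetical regression test; the bug it pins
    # down is invented to illustrate the pattern.
    import numpy as np

    from pylearn2.utils import contains_nan


    def test_contains_nan_detects_single_nan():
        """Suppose a past bug let a lone NaN go undetected; this test keeps
        that bug from quietly coming back."""
        x = np.zeros(5)
        x[2] = np.nan
        assert contains_nan(x)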
Are you fixing an issue that is on the issue tracker?
=====================================================

Your pull request *description* (or a commit message for one of the commits)
should include `one of the supported variants`_ of the syntax so that the
issue is auto-closed upon merge.

.. _one of the supported variants: https://help.github.com/articles/closing-issues-via-commit-messages

Have you squashed out any nuisance commits?
===========================================

Pull requests with lots and lots of tiny commits are hard to review. A string
of commits that introduces a minor bug and then fixes it a few commits later
also makes bisecting a pain. Your final pull request should consist of as few
commits as make logical sense, and each commit should ideally leave the
repository in a working state (tests passing, functionality preserved). You
can squash commits using `git rebase -i` and following the instructions. Note
that you will have to `git push --force origin my_branch_name` after a rebase.

You should squash down to a minimal set of semantically distinct commits
before asking for a review, and then possibly squash again if you've made lots
of commits in response to feedback (note that you can reorder commits in the
editor window given by `git rebase -i`).

Are you using OrderedDict where necessary? Are you iterating over sets?
=======================================================================

The order of iteration over dictionaries in Python is not guaranteed to remain
the same across different invocations of the same program. This is a result of
a randomized hashing algorithm and is actually an important security feature
for preventing certain kinds of denial-of-service attacks. Unfortunately,
where such data structures are employed in scientific simulations, it can pose
reproducibility problems.

The main reason is that computations in `floating point`_ do not precisely
obey the usual laws of arithmetic (notably associativity and distributivity),
so slight differences in the order of operations can introduce small
differences in the result, which can have `butterfly effects`_ that
significantly alter the outcome of a long-running job. The order of operations
can be altered by the order in which a Theano graph is assembled, and the
precise form the graph takes can unfortunately sometimes alter which
compile-time graph optimizations are performed.

.. _floating point: http://en.wikipedia.org/wiki/Floating_point
.. _butterfly effects: http://en.wikipedia.org/wiki/Butterfly_effect

In order to stamp out inconsistencies introduced by an unpredictable iteration
order, we make extensive use of the `OrderedDict` class. This class is part of
the `collections` module in Python 2.7 and Python 3.x; however, to maintain
Python 2.6 compatibility, we import it from `theano.compat.python2x`, which
provides an equivalent pure-Python implementation when the built-in version is
not available.

You should consider carefully whether the iteration order over a dictionary
you're using could result in different behaviour. If in doubt, use an
`OrderedDict`. For the `updates` parameter when creating Theano functions, you
*must* use an `OrderedDict` or a list of
`(shared_variable, update_expression)` tuples.

When iterating over sets, consider whether you should first sort the set. The
`sorted()` built-in function is a simple way of doing this.

Are you using print statements?
===============================

In most cases you should be using logging statements instead. You can
initialize a logger in a new module with::

    import logging

    log = logging.getLogger(__name__)

and subsequently call into it with `log.info()`, `log.warning()`, etc.

Are you creating a sequence and then immediately iterating over it?
===================================================================

If so, consider using the faster and more memory-efficient versions:

* `xrange` instead of `range`.
* `from theano.compat.six.moves import zip as izip` instead of `zip`. This
  import is for Python 3 compatibility.

Are you using `zip()`/`izip()` on sequences you expect to be the same length?
=============================================================================

**Note that zip and izip truncate the sequence of tuples they produce to the
length of the shortest input sequence.** If you expect, as is often the case,
that the sequences you are zipping together should be the same length, use
`safe_zip` or `safe_izip` defined in `pylearn2.utils`. Also see
`itertools.izip_longest` if you want to zip together sequences of unequal
length with a fill value.
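A minimal sketch of the difference, assuming only that `safe_zip` is
importable from `pylearn2.utils` as stated above (the exact exception type it
raises may differ)::

    from pylearn2.utils import safe_zip

    params = ['W', 'b']
    updates = ['W_new', 'b_new', 'extra']   # one element too many

    # The built-in zip silently drops 'extra', hiding the length mismatch.
    assert len(list(zip(params, updates))) == 2

    # safe_zip behaves like zip on equal-length inputs...
    pairs = safe_zip(params, ['W_new', 'b_new'])

    # ...but raises on mismatched lengths instead of truncating silently:
    # safe_zip(params, updates)  # error, which is what you want here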
Are you using the dict/`OrderedDict` methods `keys()`/`values()`/`items()`?
===========================================================================

For `values()` and `items()`, consider whether `itervalues()` or `iteritems()`
would be more appropriate: they avoid building an intermediate list if you're
only iterating over the result once, not keeping it around for any length of
time, and don't need random access.

Also, don't bother with `keys()` or `iterkeys()` at all if you're just going
to iterate: `for k in my_dictionary` iterates over the keys by default.

An exception to these rules is if you are *modifying* the dictionary within
the loop; in that case you want the copies produced by the `keys()`,
`values()` and `items()` calls.

Are you updating a dictionary or OrderedDict with `.update()`?
==============================================================

If you are using the `update()` method of a dictionary or `OrderedDict` and
you expect that none of the keys in the argument should already be in the
dictionary, use `safe_update()` defined in `pylearn2.utils`.

Do you have an `except:` block?
===============================

You should almost never have a bare `except:` in library code. Use::

    except Exception:
        ...

instead. This catches any subclass of `Exception` but lets through certain
low-level exceptions like `KeyboardInterrupt`, `SystemExit`, etc. that inherit
from `BaseException` instead. You almost certainly do not want your code to
catch these.

Don't raise a new exception; use `reraise_as` from `pylearn2.utils.exc`
instead::

    except Exception:
        reraise_as(ValueError("Informative error message here"))

This retains the traceback and original error message, allowing for easier
debugging with a tool like pdb.

Are you checking to see if an argument is iterable?
===================================================

In places where a list, tuple, or other iterable object (say, a deque) will
suffice, use `pylearn2.utils.is_iterable`.

Are you checking if something is a string?
==========================================

Unless you have a very good reason, you should probably be using
`isinstance(foo, basestring)`, which correctly handles both `str` and
`unicode` instances.

Are you checking if something is a number?
==========================================

Usually such checks are unnecessary, but where they are needed we've defined
some helpful constants; a short sketch of their use follows these subsections.

Are you checking if something is *any* kind of number?
------------------------------------------------------

Use `isinstance(foo, pylearn2.utils.py_number_types)`. This checks against
Python builtins as well as NumPy-defined numerical types.

Are you checking if something is an integer?
--------------------------------------------

Use `isinstance(foo, pylearn2.utils.py_integer_types)`. This checks against
Python builtins as well as NumPy-defined integer types.

Are you checking if something is a float?
-----------------------------------------

First, ask yourself: do you really need to? Would passing an integer here be
inappropriate in all circumstances? Would a cast (i.e. `float(foo)`) be
sufficient? If you really need to, use
`isinstance(foo, pylearn2.utils.py_float_types)`. This checks against Python
builtins as well as NumPy-defined float types.

Are you checking if something is a complex number?
--------------------------------------------------

Again, ask yourself whether passing a real number here would be an error, and
whether you can get away with a cast. If you really need to, use
`isinstance(foo, pylearn2.utils.py_complex_types)`. This checks against Python
builtins as well as NumPy-defined complex types.

Are you checking for the presence of `np.nan` or `np.inf` in an array?
----------------------------------------------------------------------

If so, use `pylearn2.utils.contains_nan` or `pylearn2.utils.contains_inf`. To
check for either `np.nan` or `np.inf`, use `pylearn2.utils.isfinite`. These
functions are faster and more memory efficient than `np.any(np.isnan(X))` or
`np.any(np.isinf(X))`.
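As a concrete sketch, assuming these constants are importable directly from
`pylearn2.utils` as described above (the function and argument names are made
up for illustration)::

    import numpy as np

    from pylearn2.utils import py_integer_types, py_number_types


    def set_batch_size(batch_size):
        # Accept plain Python ints and NumPy integer scalars alike.
        if not isinstance(batch_size, py_integer_types):
            raise TypeError("batch_size must be an integer, got %r"
                            % (batch_size,))
        return batch_size

    set_batch_size(32)            # builtin int: accepted
    set_batch_size(np.int64(32))  # NumPy integer scalar: also accepted
    assert isinstance(0.5, py_number_types)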
Are you creating Theano functions?
==================================

If you're building Theano functions, use `pylearn2.utils.function`. This
disables the `on_unused_input` check, since an unused input is usually not
something you want treated as an error when doing any kind of generic graph
building.

Are you creating Theano shared variables?
=========================================

If your model, cost, etc. creates floating point shared variables, you should
probably create them with `pylearn2.utils.sharedX`, which creates shared
variables with the dtype set to `theano.config.floatX`.

Are you casting symbols/constants to a Theano floating point type?
==================================================================

Use `pylearn2.utils.as_floatX` to cast symbolic quantities to the default
floating point type, and use `constantX` to create symbolic constants from a
scalar or ndarray with the dtype specified in `theano.config.floatX`.

Do you have big nested loops for generating a Cartesian product?
================================================================

Example::

    stuff = []
    for i in range(50):
        for j in range(20):
            for k in range(30):
                stuff.append((i, j, k))

Consider whether `itertools.product` will get the job done more readably and
probably more efficiently (here,
`itertools.product(range(50), range(20), range(30))` yields the same tuples).

Are you generating combinations or permutations of a set (or list, ...)?
========================================================================

`itertools` contains the functions `permutations`, `combinations` and
`combinations_with_replacement`, which will probably get the job done more
efficiently than your own code.

Are you overriding methods in your class?
=========================================

Use the decorator `pylearn2.utils.wraps` to inherit the docstring if it is
unchanged. If you add a docstring to a function that is wrapped in this
fashion, it will be appended below the inherited docstring.

Are you writing functions that use pseudo-random numbers?
==========================================================

If you are using the NumPy generator, are you providing a way to seed it, as
well as a default seed? You should **never** be using `numpy.random` functions
directly. Use `pylearn2.utils.rng.make_np_rng` with a user-provided seed and a
`default_seed` argument. If you are using the Theano RNG, you should create it
similarly with `pylearn2.utils.rng.make_theano_rng`.

Are you assembling filesystem paths with `dir` + `/` + `filename` or similar?
=============================================================================

Use `os.path.join` rather than concatenating pieces together with '/'. This
ensures the code still works on Windows.

Are you extracting the directory name or base filename from a file path?
========================================================================

Use `os.path.basename` and `os.path.dirname` to ensure Windows compatibility.
A short sketch of both of these points follows below.
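A small sketch using only the standard library (the directory and file names
are made up)::

    import os

    # os.path.join picks the right separator for the platform.
    data_dir = os.path.join('datasets', 'mnist')
    pkl_path = os.path.join(data_dir, 'train.pkl')

    os.path.dirname(pkl_path)   # 'datasets/mnist' ('datasets\\mnist' on Windows)
    os.path.basename(pkl_path)  # 'train.pkl'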
Are you opening/closing files?
==============================

Use the `with` statement, e.g.::

    with open(fname, 'w') as f:
        f.write('blah blah blah')

This is cleaner and ensures that the file always gets closed, even in an error
condition.