
Working with large datasets in Pylearn2

By default, Pylearn2 loads the entire dataset into main memory (not GPU memory). This can be problematic for large datasets. There exist multiple Python/NumPy solutions for dealing with large data, such as numpy.memmap, PyTables, and h5py.

Pylearn2 currently only supports PyTables and h5py. (memmap support has been introduced in the latest version of Theano, but it has not been tested with Pylearn2 yet.)

PyTables

pylearn2.datasets.dense_design_matrix.DenseDesignMatrixPyTables is designed to mimic the behaviour of DenseDesignMatrix, but underneath it stores the data in a PyTables HDF5 file. pylearn2.datasets.svhn.SVHN is a good example of how to create a DenseDesignMatrixPyTables object and store your data in it.
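As a rough sketch of the pattern used in the SVHN example: first write the design matrix and targets to an HDF5 file with PyTables, then hand the resulting on-disk arrays to DenseDesignMatrixPyTables. The file name, shapes, and random data below are made up for illustration, and the constructor arguments are assumed to mirror DenseDesignMatrix, as described above; check the SVHN module in your Pylearn2 version for the exact usage.

    import numpy as np
    import tables
    from pylearn2.datasets.dense_design_matrix import (
        DenseDesignMatrixPyTables, DefaultViewConverter)

    n_examples, image_size, n_classes = 1000, 32 * 32 * 3, 10  # hypothetical sizes

    # Create the HDF5 file and two on-disk arrays (older PyTables releases use
    # the camelCase names openFile / createGroup / createCArray instead).
    h5file = tables.open_file('my_dataset.h5', mode='w')  # hypothetical path
    group = h5file.create_group(h5file.root, 'Data', 'Data')
    atom = tables.Float32Atom()
    X_node = h5file.create_carray(group, 'X', atom=atom,
                                  shape=(n_examples, image_size))
    y_node = h5file.create_carray(group, 'y', atom=atom,
                                  shape=(n_examples, n_classes))

    # Fill the arrays in mini-batches rather than all at once (see "Known issues").
    rng = np.random.RandomState(0)
    batch_size = 100
    for start in range(0, n_examples, batch_size):
        stop = start + batch_size
        X_node[start:stop] = rng.uniform(
            size=(batch_size, image_size)).astype('float32')
        y_node[start:stop] = np.eye(n_classes, dtype='float32')[
            rng.randint(0, n_classes, batch_size)]
    h5file.flush()

    # Wrap the on-disk arrays in a dataset object; the arguments are assumed to
    # mirror DenseDesignMatrix (X, y, view_converter), as in the SVHN example.
    view_converter = DefaultViewConverter((32, 32, 3), axes=('b', 0, 1, 'c'))
    dataset = DenseDesignMatrixPyTables(X=group.X, y=group.y,
                                        view_converter=view_converter)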

h5py

If your data is already saved in HDF5 format, you can use the pylearn2.datasets.hdf5.HDF5Dataset class to access it in Pylearn2. For an example of how to save data in HDF5 format and load it with HDF5Dataset, take a look at pylearn2.datasets.tests.test_hdf5.TestHDF5Dataset.
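A minimal sketch along the lines of that test: save a design matrix and targets to an HDF5 file with h5py, then point HDF5Dataset at the dataset names inside the file. The file name and array shapes are made up, and the keyword names (filename, X, y, where X and y name the HDF5 datasets) follow the referenced test; double-check them against your Pylearn2 version.

    import h5py
    import numpy as np
    from pylearn2.datasets.hdf5 import HDF5Dataset

    # Hypothetical data: 500 examples with 100 features and 10-way one-hot targets.
    rng = np.random.RandomState(0)
    X = rng.uniform(size=(500, 100)).astype('float32')
    y = np.eye(10, dtype='float32')[rng.randint(0, 10, 500)]

    # Write the arrays to an HDF5 file under the keys 'X' and 'y'.
    with h5py.File('my_data.h5', 'w') as f:  # hypothetical file name
        f.create_dataset('X', data=X)
        f.create_dataset('y', data=y)

    # Load them back without reading everything into memory; the X/y arguments
    # name the HDF5 datasets to use (see TestHDF5Dataset for the full example).
    dataset = HDF5Dataset(filename='my_data.h5', X='X', y='y')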

PyTables vs. h5py

Each library's documentation provides its own comparison of the two.

One advantage of h5py over PyTables is that h5py can read HDF5 files created with other libraries, whereas PyTables HDF5 files are not standard. On the other hand, PyTables claims to be faster and supports compression.

Known issues

  • Both HDF5-based solutions are known to crash when the data is accessed in random order. To avoid this issue, we suggest using the ‘sequential’ or ‘batchwise_shuffled_sequential’ iteration schemes, as shown in the sketch after this list.
  • Writing a large amount of data to HDF5 at once is known to cause crashes, so it is advised to write the data to file in mini-batches. Some of the preprocessing functions have a mini-batch option, but not all of them.
  • Users should be aware that any change made to the data will also change the data on disk.
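Regarding the first point, here is a minimal sketch of requesting a sequential iterator, assuming dataset is one of the HDF5-backed datasets built above; the batch size is arbitrary, and older Pylearn2 versions may expect slightly different iterator keywords.

    # Iterate over the data in order, so the underlying HDF5 file is read
    # sequentially instead of with random access.
    it = dataset.iterator(mode='sequential', batch_size=100)
    for X_batch in it:
        pass  # X_batch is a slice of the design matrix, read from disk in order

    # 'batchwise_shuffled_sequential' shuffles whole batches while keeping each
    # batch contiguous on disk, which also avoids random element access.
    it = dataset.iterator(mode='batchwise_shuffled_sequential', batch_size=100)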