Minibatching
============

This isn't a topic that deserves to have its own page, but here it is.

"Minibatch-first" convention
----------------------------

Penne assumes that:

- A minibatch of n-dimensional arrays is an (n+1)-dimensional array, whose first axis ranges over the instances in the minibatch.

- A minibatch of ints (representing one-hot vectors) is a list of ints.

Because Penne follows NumPy's broadcasting rules, many operations automatically work on minibatches. For example, elementwise addition works correctly if either or both arguments are minibatches. But of course there are lots of exceptions:

- ``dot(x,y)`` sums over the last axis of ``x`` and the second-to-last axis of ``y``, so it behaves differently depending on whether ``y`` is a vector or a minibatch of vectors.

  - If ``x`` is a vector or a minibatch of vectors, use ``vecdot`` instead.

  - If ``x`` is a matrix/tensor, the solution that the ``Layer`` class uses is to write ``dot(y, x)`` instead.

- ``concatenate`` and ``stack`` default to ``axis=0``. Use negative axis numbers to get code that works both with and without minibatches.

(A short NumPy sketch at the end of this page illustrates these shape conventions.)

The functions and modules that Penne provides are (as far as I know) safe to use on minibatches.

The ``penne.lm`` module provides a simple utility function for grouping a sequence of training examples into a sequence of minibatches (lists) of training examples:

.. autofunction:: penne.lm.batches

Sequences
---------

With sequence models, the sentences in a minibatch are not all the same length. The simplest solution is to pad all the sentences with a dummy symbol so that they are the same length. The ``penne.lm`` module provides some utility functions for making this easier.

.. autofunction:: penne.lm.pack_batch

.. autofunction:: penne.lm.unpack_batch

Training
--------

You may need to scale some training parameters by the minibatch size (see the sketch after this list):

- ``learning_rate`` should be divided by the minibatch size.
- ``clip_gradients`` should be multiplied by the minibatch size.
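For instance, if the hyperparameters were tuned for training one example at a time, and the minibatch loss is a sum (not a mean) over the examples in the batch, the adjustment is just the arithmetic below. The names ``base_learning_rate``, ``base_clip``, and ``batch_size`` are placeholders for illustration, not part of Penne's API::

    batch_size = 64

    # Values tuned for single-example (batch size 1) training.
    base_learning_rate = 0.1
    base_clip = 1.0

    # With a summed minibatch loss, the gradient is roughly batch_size
    # times larger than for a single example, so compensate:
    learning_rate = base_learning_rate / batch_size
    clip_gradients = base_clip * batch_size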
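Finally, here is a plain-NumPy sketch of the shape conventions described above. It uses NumPy's ``dot`` and ``concatenate`` as stand-ins for Penne's (which follow the same axis rules), and the shapes and variable names are made up for illustration::

    import numpy as np

    batch_size, d_in, d_out = 3, 5, 4

    W = np.random.rand(d_in, d_out)       # a weight matrix (not a minibatch)
    b = np.random.rand(d_out)             # a bias vector
    x = np.random.rand(d_in)              # a single input vector
    X = np.random.rand(batch_size, d_in)  # a minibatch of input vectors

    # Elementwise operations broadcast over the leading minibatch axis,
    # so the same expression works with or without a minibatch:
    y_single = np.dot(x, W) + b           # shape (d_out,)
    y_batch = np.dot(X, W) + b            # shape (batch_size, d_out)

    # Writing dot(input, W) rather than dot(W, input) is what makes the
    # minibatch case come out right: dot sums over the last axis of its
    # first argument and the second-to-last axis of its second argument,
    # so the leading minibatch axis of X passes through untouched.

    # Negative axis numbers keep concatenate/stack indifferent to whether
    # a leading minibatch axis is present:
    pair = np.concatenate([X, X], axis=-1)   # shape (batch_size, 2 * d_in)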