Mill Computing, Inc. Forums › The Mill › Markets › Stochastic rounding for machine learning

  • jrfeenst
    Participant
    Post count: 4

    A stochastic rounding mode for fixed-point data could be a big efficiency advantage for the Mill:

    http://arxiv.org/pdf/1502.02551.pdf

    With neural networks in the 10-billion-parameter range, there would be a big advantage to reducing the precision of the data. Current off-the-shelf devices fail to train at very low precision because deterministic rounding kills the gradient calculations. I don’t think it would need to be crypto-grade randomness, so maybe it’s not too expensive in hardware?
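
    For concreteness, here is roughly what stochastic rounding to low-precision fixed point looks like in software. The word width, fraction bits, and the rand() source are arbitrary choices for illustration, not anything taken from the paper:

        #include <math.h>
        #include <stdint.h>
        #include <stdlib.h>

        /* Round x to signed fixed point with FRAC_BITS fractional bits.
         * The discarded fraction becomes the probability of rounding up,
         * so the result equals x in expectation. */
        #define FRAC_BITS 8    /* purely illustrative */

        int16_t stochastic_round_fixed(float x)
        {
            float scaled = x * (float)(1 << FRAC_BITS);
            float lower  = floorf(scaled);
            float frac   = scaled - lower;                 /* in [0, 1) */
            float u      = (float)rand() / ((float)RAND_MAX + 1.0f);
            long  q      = (long)lower + (u < frac ? 1 : 0);

            /* saturate to the 16-bit container */
            if (q > INT16_MAX) q = INT16_MAX;
            if (q < INT16_MIN) q = INT16_MIN;
            return (int16_t)q;
        }

    Rounding up with probability equal to the discarded fraction makes the result unbiased, which is what keeps tiny gradient updates from being systematically lost.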

  • Ivan Godard
    Keymaster
    Post count: 689

    We once had a stochastic rounding mode defined; the idea was to explore the rounding sensitivity of an algorithm. The other members of the IEEE 754 (FP) committee dumped all over the idea, for reasons too technical for here even if I remember them correctly (I am not a math analysis guy).

    However, your note is the first I have heard of stochastic rounding being of use in neural networks. Could you post a few links to (elementary, tutorial) material on the subject?

  • jrfeenst
    Participant
    Post count: 4

    I’m just learning about neural networks myself (the Stanford Coursera class). I immediately thought about using lower-precision fixed point instead of the standard 32-bit floating point that it seems everybody uses. I ran across that paper and looked around to see if there was any hardware support for the idea, and found none.

    I think the lack of information is due to the lack of available hardware :). The paper I linked to had to use an FPGA to test out the idea.

    One more paper I found describes an 8-bit FP representation with 16-bit accumulation. It doesn’t seem to discuss training much, just feed-forward (which has much less dependence on precision):

    http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/37631.pdf

    It is the back-propagation phase, which computes the gradients, that requires the precision.

    This paper is referenced from the original one:

    http://arxiv.org/pdf/1412.7024v4.pdf

    It is mostly about comparing different representations, plus a sort of dynamic fixed point where the exponent changes as the gradients shrink over the training iterations.
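
    As far as I can tell, “dynamic fixed point” just means a whole tensor shares one power-of-two scale, and that scale gets nudged as the magnitudes change over training. A rough sketch of the bookkeeping, with made-up thresholds (the paper’s own update policy may differ):

        #include <stddef.h>
        #include <stdint.h>

        /* A tensor of 16-bit values sharing one power-of-two scale:
         * real_value = data[i] * 2^(-frac_bits) */
        struct dyn_fixed {
            int16_t *data;
            size_t   n;
            int      frac_bits;   /* the shared "exponent" */
        };

        /* After an update pass, nudge the shared scale: if values crowd the
         * top of the range, give up a fraction bit for headroom; if all of
         * them have become tiny (gradients shrinking), take one back. */
        void dyn_fixed_rescale(struct dyn_fixed *t)
        {
            size_t big = 0, small = 0;
            for (size_t i = 0; i < t->n; i++) {
                int v = t->data[i];
                if (v < 0) v = -v;
                if (v > INT16_MAX / 2)   big++;
                if (v < INT16_MAX / 256) small++;
            }
            if (big > t->n / 100) {              /* saturation risk */
                for (size_t i = 0; i < t->n; i++) t->data[i] /= 2;
                t->frac_bits--;
            } else if (small == t->n) {          /* everything tiny */
                for (size_t i = 0; i < t->n; i++) t->data[i] *= 2;
                t->frac_bits++;
            }
        }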

  • gideony2
    Participant
    Post count: 9

    If you go for stochastic rounding, make sure the random numbers are decent (some are not — see https://en.wikipedia.org/wiki/RANDU ).

    Simplest fix: a stream cipher with an exportable key length (or encrypt 0, 1, 2, … under a block cipher, also exportable, if jail time is an issue).
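
    The counter-under-a-block-cipher version is only a few lines; here is a sketch with XTEA standing in for the cipher (picked only because it fits in a dozen lines, not as a recommendation):

        #include <stdint.h>

        /* Standard XTEA: 64-bit block, 128-bit key, 32 rounds. */
        static void xtea_encrypt(uint32_t v[2], const uint32_t key[4])
        {
            uint32_t v0 = v[0], v1 = v[1], sum = 0;
            const uint32_t delta = 0x9E3779B9;
            for (int i = 0; i < 32; i++) {
                v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
                sum += delta;
                v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
            }
            v[0] = v0; v[1] = v1;
        }

        /* "Encrypt 0,1,2,...": each call encrypts the next counter value and
         * returns 64 pseudo-random bits.  The key is the seed; a fixed key
         * gives a repeatable sequence. */
        uint64_t ctr_random(uint64_t *counter, const uint32_t key[4])
        {
            uint32_t block[2] = { (uint32_t)*counter, (uint32_t)(*counter >> 32) };
            (*counter)++;
            xtea_encrypt(block, key);
            return ((uint64_t)block[1] << 32) | block[0];
        }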

  • Ivan Godard
    Keymaster
    Post count: 689

    For training 16-bit perceptrons you shouldn’t need crypto-quality random; 16 bits sampled from the middle of a 32-bit LFSR should be fine. What I don’t understand (haven’t read the cites yet) is why regular FP rounding doesn’t work.
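
    Something like this, for concreteness (the tap mask is one commonly cited maximal-length choice for a 32-bit Galois LFSR, the constants are illustrative, and nothing here is Mill-specific):

        #include <stdint.h>

        static uint32_t lfsr_state = 0xACE1u;    /* any nonzero seed */

        /* 32-bit Galois LFSR step; mask 0x80200003 = taps 32,22,2,1. */
        static uint32_t lfsr_next(void)
        {
            uint32_t lsb = lfsr_state & 1u;
            lfsr_state >>= 1;
            if (lsb)
                lfsr_state ^= 0x80200003u;
            return lfsr_state;
        }

        /* Stochastically round a Q16.16 value to a 16-bit integer: compare
         * the 16-bit fraction against 16 bits from the middle of the LFSR.
         * (Assumes the usual arithmetic right shift for negative values.) */
        int16_t round_q16_16(int32_t x)
        {
            uint16_t frac = (uint16_t)(x & 0xFFFF);
            uint16_t rnd  = (uint16_t)(lfsr_next() >> 8);   /* middle bits */
            int32_t  r    = (x >> 16) + (rnd < frac ? 1 : 0);
            if (r > INT16_MAX) r = INT16_MAX;
            if (r < INT16_MIN) r = INT16_MIN;
            return (int16_t)r;
        }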

  • gideony2
    Participant
    Post count: 9

    (1) Decent crypto is just plain cheap. And does an LFSR _have_ a middle? (If all the bits were rotated around by 17 positions, it would be just as good an LFSR.)

    I don’t know how to think about pseudo-random (it is supposed to look random, unless you play “unfairly”); I do know how to think about crypto: if the key is secret, distinguishing the output from random is _VERY_ expensive.

    (2) Without randomized rounding, 1 + 0.1 + 0.1 + 0.1 + 0.1 … (all the way across 100 pages, rounded to integer after every add) will still be 1. Randomized, it will be roughly 1 + 0.1 × (number of adds), ± epsilon.
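
    That example is easy to reproduce in a toy program, with rand() standing in for whatever the hardware would provide:

        #include <stdio.h>
        #include <stdlib.h>

        /* Add 0.1 to an integer-valued accumulator N times, rounding back
         * to an integer after every add, both ways. */
        int main(void)
        {
            int n = 10000;
            long det = 1, sto = 1;
            for (int i = 0; i < n; i++) {
                /* round-to-nearest: the 0.1 is always rounded away */
                det = (long)(det + 0.1 + 0.5);
                /* stochastic rounding of (sto + 0.1): up with probability 0.1 */
                double u = (double)rand() / ((double)RAND_MAX + 1.0);
                sto += (u < 0.1) ? 1 : 0;
            }
            printf("deterministic: %ld   stochastic: %ld   expected: ~%g\n",
                   det, sto, 1.0 + 0.1 * n);
            return 0;
        }

    The deterministic accumulator never moves off 1; the stochastic one lands near 1001.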

  • gideony2
    Participant
    Post count: 9

    The A5/1 cipher should be as cheap as three small LFSRs.

    • Ivan Godard
      Keymaster
      Post count: 689

      The problem with a hardware random number generator is that programs want a repeatable sequence, and there may be more than one program concurrently using it. Hence each app must have its own RNG state, which must be saved/restored at switch. That save/restore cost is paid by everyone, whether they use RNG or not, and the Mill tends to avoid such things in the basic architecture.

      The alternative is to give the RNG the semantics of an i/o device: you open it, you own it. That works for a bunch of embedded-type uses, but not for the general case. Or you can treat it as a functional unit, like an ALU, and have the app maintain the state in its own space. That also works, but the load/store cost of invoking the unit would swamp the actual RNG computation, and keeping the state in the app invites back-channel attacks.

      We have considered a seed generator in hardware that would use a physical process (a back-biased diode or some such) to give a low-precision value that could be used as the seed for a conventional software RNG. Repeatability of sequence is then available by using a fixed seed, while the physical seed would be unique each time and so would not need save/restore. That still might happen, although it’s not in the current sim.
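
      In software terms the split would look something like this; xorshift32 is just a stand-in for “a conventional software RNG”, and hw_seed() is a placeholder for the hypothetical physical seed source (nothing like it is in the current sim):

          #include <stdint.h>

          /* Placeholder for the proposed physical seed source (back-biased
           * diode or some such).  Hypothetical; does not exist today. */
          extern uint32_t hw_seed(void);

          /* A conventional software RNG: the app owns this state, so there
           * is nothing for the hardware to save/restore on a task switch. */
          static uint32_t rng_state;

          void rng_init(int repeatable)
          {
              /* A fixed seed gives a repeatable sequence for debugging; the
               * physical seed gives a fresh sequence on every run. */
              rng_state = repeatable ? 0x12345678u : (hw_seed() | 1u);
          }

          uint32_t rng_next(void)   /* Marsaglia xorshift32 */
          {
              uint32_t x = rng_state;
              x ^= x << 13;
              x ^= x >> 17;
              x ^= x << 5;
              return rng_state = x;
          }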
