Each instruction (really a half-instruction) on each side comprises some number of blocks, each of which carries a variable amount of fixed-size info. On the Mill there are three blocks per side; the encoding mechanism can handle any number, though an odd number is more sensible. On each side, one block decodes in the initial cycle, just as you describe; thereafter two blocks can decode per cycle, because we decode the array of blocks from both ends. This is why an odd number of blocks makes most sense.
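To make the odd-number point concrete, here is a rough sketch of the cycle count implied by "one block in the first cycle, two per cycle after that". The function name decode_cycles is made up for illustration, and the model is only the counting argument from the paragraph above, not a description of the actual decoder hardware.

```python
def decode_cycles(num_blocks: int) -> int:
    """Cycles needed to decode one half-instruction's blocks.

    Model taken from the text: one block decodes in the initial cycle,
    then up to two blocks decode per cycle (one from each end).
    """
    if num_blocks < 1:
        raise ValueError("a half-instruction needs at least one block")
    remaining = num_blocks - 1
    # Ceiling division: each later cycle retires up to two blocks.
    return 1 + (remaining + 1) // 2


for n in range(1, 7):
    print(n, "blocks ->", decode_cycles(n), "cycles")
```

Under this model three blocks take two cycles (D0 and D1), and four blocks take three cycles, the same as five. An even block count pays for a cycle whose second decode position goes unused, which is the sense in which odd counts are the natural choice.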
On the exu side (where the usual two-in, N-out ops reside) the info in all three blocks is treated as operations. Consequently, we get one block’s worth of ops at the end of D0, and two at the end of D1. D2 is used to match the logical belt numbers from the ops to the physical data location, and the crossbar routing happens during the transition between D2 and X0 (aka D3). One-cycle ops (typically ALU ops) execute in X0, and their result is available to other ops in the following cycle. Thus the basic pipeline has four stages, three decode and one execute.
The flow side is slightly more complicated. Only the initial (D0) block contains operations; the remaining two blocks contain arguments to those operations. All flow ops contain two two-bit fields that specify how many arguments that op takes from the other two blocks. The flow decoder can pick out those fields at the start of D0, run them through what amounts to a priority encoder, and have control lines at the start of D1 to route the arguments from their block to the intended flow operation.
Of the two arg blocks, one is a simple vector of morsels. Each morsel is big enough to hold a belt operand number, so (depending on member) morsels are 3/4/5/6 bits long, and the two-bit field selects 0/1/2/3 of them. The morsel arguments are used for belt references, small literals and other incidental data. The other flow-side arg block is a byte array. The two-bit field in the flow op selects 0/1/2/4 bytes from this array.
Consequently each flow-side op has a main opcode (from the initial block), up to three argument morsels, and up to 4 bytes of immediate data. The opcode is available at the end of D0, the args at the end of D1, and D2 is used to map logical belt numbers to physical, just like the exu side. In fact the two half-instructions become one during D2.
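The argument routing can be modeled in software as a pair of prefix sums over the two-bit fields. The sketch below is only an illustration of that idea: the function name route_flow_args, the tuple layout, the assumption that arguments are packed in the same order as the ops, and the mapping of field codes 0..3 onto the counts are all assumptions, not the actual Mill encoding.

```python
from itertools import accumulate

# Counts come from the text; which two-bit code means which count is assumed.
MORSEL_COUNTS = (0, 1, 2, 3)
BYTE_COUNTS = (0, 1, 2, 4)


def route_flow_args(ops, morsels, data):
    """Attach each flow op's arguments from the two argument blocks.

    ops     : list of (opcode, morsel_field, byte_field), fields in 0..3
    morsels : sequence of morsel values (the morsel block)
    data    : bytes-like object (the byte-array block)
    Assumes arguments are packed in op order (hypothetical layout).
    """
    m_counts = [MORSEL_COUNTS[m] for _, m, _ in ops]
    b_counts = [BYTE_COUNTS[b] for _, _, b in ops]
    # Running totals give every op's start offset without a serial parse,
    # which is the job the priority encoder does in hardware.
    m_starts = [0, *accumulate(m_counts)]
    b_starts = [0, *accumulate(b_counts)]
    if m_starts[-1] > len(morsels) or b_starts[-1] > len(data):
        raise ValueError("argument blocks are too short for these ops")
    routed = []
    for i, (opcode, _, _) in enumerate(ops):
        op_morsels = list(morsels[m_starts[i]:m_starts[i + 1]])
        op_bytes = bytes(data[b_starts[i]:b_starts[i + 1]])
        routed.append((opcode, op_morsels, op_bytes))
    return routed


# Two made-up ops: one with 2 morsels and 4 bytes, one with 0 morsels and 1 byte.
example = route_flow_args(
    [("call", 2, 3), ("br", 0, 1)],
    morsels=[5, 7],
    data=bytes([1, 2, 3, 4, 9]),
)
print(example)
```

The point of the model is that every offset depends only on the small count fields in the first block, so all of them can be computed at once instead of walking the arguments one op at a time.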
Back to your question: where is the bitmap? In the literal argument, of course. And it only uses as many bytes as are actually needed. And there can be two or more ops using the bitfields, because the arg bytes that hold the bitfields are not in the block that has the opcodes and the two-bit fields.
So the only part of Mill encoding that is actually variable length is the block, and each block’s content is fixed-length. We do have a linear parse at the block level, but not at the operation level. It’s as if we needed to parse two consecutive x86 instructions, and had two cycles to do it in, which is easy.
BTW, some flow ops can need more arg data than the four bytes plus three morsels that come with the slot: a call with a long arg list, for example, or a con(stant) op with a quad (128-bit) argument. For these we gang additional slots to the main slot, each adding four bytes plus three morsels to the available data arguments. By ganging you lose a potential operation in that instruction (the gang slot has a unique opcode), but you can have all the data you want up to the maximum that the block shifters handle. The shifters can handle the maximal single operation/argument combination, so all possible operations can be encoded, if necessary in one-op instructions. The specializer knows the instruction capacity, and won’t put more ops (with their arguments) into an instruction than will fit.
So if you have three slots that support the call op, a Mill instruction can contain three calls: each has the code branch offset in its immediate, and has morsels for up to three belt arguments to the call. If one call takes more arguments, that reduces the number of calls you can have in the instruction, because the big call grabs slots for its additional arguments. In practice slot overflow for call is uncommon; most functions have three or fewer arguments. The big consumer of flow-side argument encoding is floating-point literals.
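As a back-of-the-envelope check on the ganging arithmetic, here is a small sketch of the slot count an op needs and whether a set of ops fits in an instruction. The names slots_needed and fits are invented for illustration. The sketch assumes every gang slot contributes its full four bytes and three morsels, that every flow slot can host the op in question, and that a call's branch offset takes four bytes; it ignores the block-shifter maximum. It is not the specializer's actual algorithm.

```python
BYTES_PER_SLOT = 4    # from the text: each slot carries up to 4 bytes
MORSELS_PER_SLOT = 3  # from the text: each slot carries up to 3 morsels


def slots_needed(num_bytes: int, num_morsels: int) -> int:
    """Main slot plus gang slots needed for one flow op (assumed model)."""
    if num_bytes < 0 or num_morsels < 0:
        raise ValueError("argument sizes cannot be negative")
    by_bytes = -(-num_bytes // BYTES_PER_SLOT)        # ceiling division
    by_morsels = -(-num_morsels // MORSELS_PER_SLOT)  # ceiling division
    # Even an op with no arguments occupies its own slot.
    return max(1, by_bytes, by_morsels)


def fits(ops, flow_slots: int) -> bool:
    """ops is a list of (num_bytes, num_morsels), one pair per flow op."""
    return sum(slots_needed(b, m) for b, m in ops) <= flow_slots


small_call = (4, 3)  # assumed 4-byte offset, three belt arguments
big_call = (4, 5)    # same offset, five belt arguments
print(fits([small_call] * 3, flow_slots=3))            # three ordinary calls
print(fits([big_call, small_call, small_call], 3))     # big call crowds one out
print(slots_needed(16, 0))                             # con with a quad literal
```

With three flow slots, three ordinary calls fit, while one five-argument call needs a gang slot and leaves room for only one more call. A con with a 16-byte quad literal needs four slots under this model, which illustrates why large literals are the bigger consumer of argument encoding.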