Many core mill (GPU)

Author
Posts
Joe Taber
Participant
January 5, 2014 at 8:37 pm
Post count: 25
#350
Would the mill architecture map well to the highly parallel / many core market? I’m thinking on the order of 100 tiny mills on a chip. Would this be feasible or even useful?
Ivan Godard
Keymaster
January 5, 2014 at 10:44 pm
Post count: 689
#354
A massively parallel Mill is possible, but not as a GPU, which is a very different architecture. Think Sun’s Niagara or Tilera. Currently most of that market seems to be 32-bit, and Mills would be overkill, but it may evolve up into our range.
GPUs are wavefront machines, lacking real branches and recursion. The Mill would do a good job of software graphics for business-type machines, but for game-type horsepower you want a GPU handling the triangles.
Or so we have expected. Won’t know until we see what people use them for 🙂
- harrison partch
  Participant
  January 20, 2014 at 8:58 am
  Post count: 4
  #527
  [F]or game-type horsepower you want a GPU handling the triangles.
  Not for long. CPU tracing will soon beat GPU rasterization.
- jimrandomh
  Participant
  April 15, 2014 at 9:05 pm
  Post count: 4
  #987
  You may be selling it short; I think many game developers (and users) would be willing to pay a pretty hefty performance penalty to have their graphics handled by LLVM instead of by NVidia and ATI’s drivers.
imbecile
Participant
January 6, 2014 at 12:07 am
Post count: 48
#361
Well, Intel tried Larrabee. I would expect the Mill architecture to be much more suited for something like that than x86.
AMD tries hUMA too. In my ignorant lay person opinion, once the memory loads and access patterns can be served by one shared memory it shouldn’t be too much harder to plug two different sets of Mill cores into it. One set for application code, one set for float and graphics code.
Will_Edwards
Moderator
January 6, 2014 at 1:57 am
Post count: 98
#363
It will be interesting to see how Intel’s new Knights Landing (72 in-order x86 cores giving 3 TFlops double-precision(!)) is received. I’ve chatted to someone who played with Knights Corner but as I recall they struggled to apply it to their problems. Sadly I’ve forgotten any deep insights they may have mentioned.
I guess the big challenge when you have a lot of independent cores flying in close formation is meshing them together? And the granularity of the tasks has to be really quite large I imagine; if you play with, say, Intel’s Threading Building Blocks or OpenMP (where parallelism is in-lined, rather than explicitly crafting a large number of tasks), you’ll be staggered at how many iterations of a loop you need to propose to do before it’s worth spreading them across multiple cores.
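To put a rough number on the granularity point above, here is a minimal sketch of a break-even estimate. It assumes a deliberately simple cost model that is not from the post: a fixed fork/join overhead per parallel region, a constant cost per iteration, and perfect scaling across cores. The function name and all figures are hypothetical placeholders, not measurements of TBB or OpenMP.

```python
import math


def break_even_iterations(overhead_ns: float, iter_ns: float, cores: int) -> int:
    """Smallest trip count at which the parallel loop is no slower than the serial one.

    Assumed model (illustrative only):
      serial time   = n * iter_ns
      parallel time = overhead_ns + n * iter_ns / cores
    """
    if cores < 2:
        raise ValueError("need at least two cores to parallelize")
    # Time saved per iteration by spreading the work over `cores` cores.
    saved_per_iter = iter_ns * (1.0 - 1.0 / cores)
    return math.ceil(overhead_ns / saved_per_iter)


if __name__ == "__main__":
    # Hypothetical figures: 10 microseconds of fork/join overhead,
    # 2 ns of work per iteration, 4 cores.
    print(break_even_iterations(10_000, 2, 4))
```

Under this model the break-even count grows linearly with the overhead and shrinks as each iteration gets heavier, which is one way to read why cheap loop bodies need so many iterations before going parallel pays off.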
Of course the Go goroutines and Erlang lightweight processes for CSP can perhaps use some more cores in a mainstream way, for server workloads.
The other approach to massively parallel on-chip is GPGPU, which is notoriously non-GP-friendly and hard to apply to many otherwise-parallel problems. I persevered with hybrid CPU (4 core i7) and CUDA (meaty card, Fermi IIRC, I was borrowing it on a remote machine, forget spec) when I was doing recmath contest entries, and typically the CUDA would give me nearly 2x the total performance of the 4xi7, which is not to be sneezed at but hardly unleashing all those flops! And conditional branches really killed it.
AMD is pushing hard towards the APU and Intel also unified the address space for their integrated GPUs IIRC, so things do come to pass pretty much as John Carmack predicts each QuakeCon. His views on raytracing triangles for games are terribly exciting, and suggest to me a move towards more GP and MIMD GPUs in future too.
So it’ll be exciting to see how people innovate with the Mill.
- harrison partch
  Participant
  January 20, 2014 at 8:47 am
  Post count: 4
  #526
  Carmack is wrong. Tracing triangles on the GPU is not the way forward.
bhurt
Participant
January 12, 2014 at 9:46 am
Post count: 5
#464
I don’t think the Mill would be significantly *better* than a GPU at what a GPU is good for, but with a goodly number of FP functional units, I think the Mill could be more or less *equal* to a GPU at what a GPU does. At least a mid-range GPU.
Where this becomes interesting is in situations where you don’t want to spend the power and cost budget for specialized GPUs: think tablets, smart phones, and netbooks. Here, a CPU that can do “triple duty” (the power and cost of an embedded CPU, desktop CPU performance on general-purpose workloads, *and* a decent GPU as needed) would be something *very* interesting. NVidia might not be afraid of the Mill, but ARM should be freaking paranoid.
- Ivan Godard
  Keymaster
  January 12, 2014 at 10:54 am
  Post count: 689
  #466
  Reply to bhurt #464
  For graphics-like loads you would configure a Mill member that was narrow (6-8 total slots) but very high (perhaps 64-byte vector size) so you would have 16-element single-precision SIMD in each of possibly two arithmetic slots. That would give you a respectable number of shaders, but the problem is the load on the memory hierarchy. Each one of those vectors is a cache line, so to saturate the function units you are pulling four and pushing two lines every cycle. Granted, everything used for the drawing is going to live in a whopping big LLC, but the sheer bandwidth at the top is going to be hard.
  There are ways to handle this – don’t use cache for the data, but configure NUMA in-core memory for example and push the problem to the software. But the result is pretty special-purpose; a chip with one of those and a handful of regular Mill cores is possible; we’d do fine for less graphics-intensive work. Nevertheless, for Call of Duty go to Nvidia.
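The bandwidth figures in the reply above can be worked through as simple arithmetic. The sketch below takes the 64-byte vectors, two arithmetic slots, and "four lines in, two lines out per cycle" from the post, reading the latter as two source operands and one result per slot. The clock rate is purely an assumption for illustration; the post does not give one.

```python
LINE_BYTES = 64          # vector size from the post, equal to one cache line
LANE_BYTES = 4           # one single-precision float
LINES_IN_PER_CYCLE = 4   # from the post: two operands for each of two slots
LINES_OUT_PER_CYCLE = 2  # from the post: one result for each of two slots


def lanes_per_vector() -> int:
    """Number of single-precision elements in one 64-byte vector."""
    return LINE_BYTES // LANE_BYTES


def bytes_per_cycle() -> int:
    """Traffic at the top of the memory hierarchy needed to keep both slots busy."""
    return (LINES_IN_PER_CYCLE + LINES_OUT_PER_CYCLE) * LINE_BYTES


def bandwidth_gb_per_s(clock_ghz: float) -> float:
    """Sustained bandwidth in GB/s at an ASSUMED clock rate (not from the post)."""
    return bytes_per_cycle() * clock_ghz


if __name__ == "__main__":
    print(lanes_per_vector())       # elements per vector
    print(bytes_per_cycle())        # bytes moved per cycle
    print(bandwidth_gb_per_s(1.0))  # GB/s at a hypothetical 1 GHz clock
```

At a hypothetical 1 GHz this comes to 384 bytes per cycle, or 384 GB/s for a single core, which gives a sense of why the post calls the bandwidth at the top of the hierarchy the hard part.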
  - PeterH
    Participant
    January 13, 2014 at 5:16 am
    Post count: 41
    #470
    I’m strictly an amateur at GPU design, but I agree that memory bandwidth is a major issue. What I figure is you want to read in a cache line worth of pixels, process those pixels against a cached set of polygons and then write the results out. This minimizes bandwidth to the output framebuffer. Scenes that are not too complex (limited by the cache space for polygons, and with no switching of pixel shaders) might be rendered reading and writing back each set of pixels only once. Cache space and bandwidth for texture buffers is still an issue without specialized hardware.
  - bhurt
    Participant
    January 13, 2014 at 1:50 pm
    Post count: 5
    #475
    Reply to Ivan Godard #466
    Also, there’s a reason why graphics cards use different memory chips (GDDR) from regular CPUs. The biggest difference (as I understand it) is that GDDR chips have higher throughput but also higher latency.
    I think I would advocate the Mill have single-cycle FP add/compare, for one reason: Javascript. Javascript uses FP for its numbers, and in addition to being the browser-side language, is increasingly used on the server side (for reasons passing my understanding). So lots of computations that in a C or Java program would be integers are done as FP computations in Javascript. Javascript compilers do, I think, convert some (many? most?) of these to integer ops, but high speed simple FP computations would still be an enormous benefit.
    - Ivan Godard
      Keymaster
      January 15, 2014 at 8:36 am
      Post count: 689
      #485
      Standard-conforming FP addition can’t be as fast as integer (one cycle) unless the integer is very slowed down. Normalization and rounding must be done before (and after) the integer add that is inside the FP add. Some GPUs and other special-purpose engines just do everything in FP, and an integer is just an FP number with a zero exponent; for those machines int and float trivially take the same time, but that’s not something you’d do in a general-purpose machine.
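The extra work wrapped around the integer add can be seen in a toy model. The sketch below is an assumption-laden illustration, not how any real FPU or the Mill is built: it handles only non-negative values, uses a made-up 8-bit significand, represents a number as a hypothetical `(mantissa, exponent)` pair meaning `mantissa * 2**exponent`, and aligns by shifting left with exact Python integers rather than using the guard, round, and sticky bits that hardware typically uses. The rounded result is the same; only the mechanism differs.

```python
from typing import Tuple

P = 8  # toy significand width in bits (an assumption; IEEE single precision uses 24)

Toy = Tuple[int, int]  # (mantissa, exponent), value = mantissa * 2**exponent


def normalize_round(m: int, e: int) -> Toy:
    """Bring m * 2**e to exactly P significant bits, rounding ties to even."""
    if m == 0:
        return (0, 0)
    extra = m.bit_length() - P
    if extra > 0:
        half = 1 << (extra - 1)
        rem = m & ((1 << extra) - 1)   # bits that will be shifted out
        m >>= extra
        e += extra
        if rem > half or (rem == half and (m & 1)):
            m += 1
            if m.bit_length() > P:     # rounding carried out of the top bit
                m >>= 1
                e += 1
    elif extra < 0:
        m <<= -extra
        e += extra
    return (m, e)


def fp_add(a: Toy, b: Toy) -> Toy:
    """Add two non-negative toy floats: align, integer add, normalize, round."""
    (ma, ea), (mb, eb) = a, b
    e = min(ea, eb)
    ma <<= ea - e                      # step 1: align to the smaller exponent
    mb <<= eb - e
    total = ma + mb                    # step 2: the plain integer add
    return normalize_round(total, e)   # steps 3 and 4: normalize and round


def value(x: Toy) -> float:
    """Decode a toy float to a Python float for inspection."""
    return x[0] * 2.0 ** x[1]


if __name__ == "__main__":
    print(value(fp_add((0b10000000, 0), (0b10000000, -7))))  # 128 + 1
    print(value(fp_add((0b11111111, 0), (0b10000000, -8))))  # 255 + 0.5, a tie
```

Only the `total = ma + mb` line corresponds to what an integer adder does; everything else is the alignment, normalization, and rounding that has to fit in the same cycle for a one-cycle FP add.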
harrison partch
Participant
January 20, 2014 at 8:21 am
Post count: 4
#524
A manycore mill could run the raytracing code I have been working on; a GPU cannot. If it could it would be a CPU and not a GPU, since the algorithm traces pointers in main memory along each ray. This movie was traced on [CPU~Quad core Intel Core i7-2700K CPU (-HT-MCP-) clocked at Min:1832.578Mhz Max:2702.656Mhz]:
The postprocessing sped up the video by 2.5 times, but the frame times of every tenth frame (in ms) are visible in the corner. This is within an order of magnitude of acceptable fullscreen realtime performance; at that point (CPU) tracing wins over (GPU) rasterization. The tracer parallelizes almost perfectly across cores, so one wants as many of them as possible; since each ray follows pointers, cache misses are (probably) more important than other execution time, so more slower cores would be better than fewer faster cores. The code is open source and on github.

This is the first footage ever from the original tracer and dates to 2004 or 2005:
- This reply was modified 10 years, 6 months ago by harrison partch.