#1: they could, and perhaps some members might be implemented that way. However the store might be only partly overlapping the load, so the logic to do the grab might have to do a shift and merge, which is non-trivial hardware and there are a lot of retire stations. The L1 already has the shift-and-merge logic (it must, to put stored data into the right place in the cache line), and aliasing interference is rare, so it is cheaper to let the store go to the L1 and reissue the load from the station.
Note that the first try for the load will have caused the loaded line to be hoisted in the cache hierarchy, so the retry will find the data (both the newly stored part and any that had not been overwritten) higher – and faster to get – than it did on the first try.
#2: Cache policies are full of knobs, levers and buttons in all architectures, and Mill is no exception. It is quite likely that both the complement of policy controls and their settings will vary among Mill family members. What you suggest is one possible such policy. The current very-alpha-grade policies in the sim try to preemptively push updates down, except if the line has never been read; this distinction is an attempt the avoid pushing partial lines that are output-only, to avoid the power cost of the repeated partial pushes. This and other policies are certain to change as we get more code through and start to move off sim so we get real power numbers. Fortunately, none of this is visible in the machine model so apps can ignore it all.