Notes on pretraining parallelisms and failed training runs.
Wrote up some flashcards here to help myself retain all the stuff below.

On why pretraining runs fail

I had an interesting chat with someone on why pretraining runs often fail. It gave me a sense of all the tangible ways that things can get fucked, and why training is such a precarious operation. At a high level, breaking causality and adding bias seem to be the key culprits.

Breaking causality:

When you do expert routing, you first go through the router, which gives you a score of ...