Extrapolating the spectacular performance of GPT-3 into the future suggests that the answer to life, the universe, and everything is just 4.398 trillion parameters.
Could convolving dense layers across the masked self-attention input layers (and hidden layers) in the decoders help keep GPT-N's weight count in check while allowing for larger context windows? A rough sketch of the idea follows.
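A minimal PyTorch sketch of this, assuming a causal 1D convolution stands in for the dense layer; the module name `ConvolvedDense` and the `kernel_size` value are hypothetical choices for illustration, not anything from GPT-3 itself. The point is that the weight count depends only on the model width and kernel size, not on how long the context window grows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvolvedDense(nn.Module):
    """Hypothetical sketch: instead of a dense layer whose weight count grows
    with the context window, slide a small dense kernel (a 1D convolution)
    across the sequence. Parameter count depends only on d_model and
    kernel_size, not on the window length."""
    def __init__(self, d_model: int, kernel_size: int = 9):
        super().__init__()
        # Left-only padding so position t never sees positions > t,
        # preserving the decoder's autoregressive masking.
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(d_model, d_model, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> Conv1d expects (batch, d_model, seq_len)
        h = x.transpose(1, 2)
        h = F.pad(h, (self.pad, 0))  # causal (left) padding
        h = self.conv(h)
        return h.transpose(1, 2)

# Usage: the same layer (same weights) handles a 2k or a 32k context window.
layer = ConvolvedDense(d_model=512)
print(layer(torch.randn(1, 2048, 512)).shape)   # (1, 2048, 512)
print(layer(torch.randn(1, 32768, 512)).shape)  # (1, 32768, 512)
```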
Similarly: could convolving the masked self-attention itself across large input context windows do the same? Sketched below.
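A rough sketch of that second idea, assuming "convolving attention" means a sliding-window (banded) causal pattern where each query attends only to its most recent neighbours; `windowed_causal_attention` and the `window` parameter are made-up names for illustration. Note this toy version still materializes the full score matrix, so the memory saving is only conceptual here; a real kernel would exploit the band structure.

```python
import torch
import torch.nn.functional as F

def windowed_causal_attention(q, k, v, window: int = 256):
    """Hypothetical sketch: each query attends only to the `window` most
    recent positions (including itself), so the receptive field is fixed
    and the useful work grows as O(seq_len * window) rather than
    O(seq_len ** 2). This dense toy version still builds the full
    (seq_len, seq_len) score matrix for clarity."""
    seq_len, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5        # (..., seq, seq)
    pos = torch.arange(seq_len)
    # Key j is visible to query i iff i - window < j <= i (causal + local).
    offset = pos[:, None] - pos[None, :]
    mask = (offset < 0) | (offset >= window)
    scores = scores.masked_fill(mask, float('-inf'))
    return F.softmax(scores, dim=-1) @ v

# Toy usage on a longer sequence with a fixed attention window.
q = k = v = torch.randn(1, 1024, 64)
out = windowed_causal_attention(q, k, v, window=128)
print(out.shape)  # (1, 1024, 64)
```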