For several years now, we’ve taken for granted that many transformer-based large language models (LLMs) use a technique known as autoregression. This machine learning technique aligns well with how many languages work, in that it processes and generates each word, or token, sequentially from left to right, with every new token conditioned on the ones that came before it. But as AI-generated text has grown longer and more complex, inference costs and latency have risen with it.
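To make that sequential process concrete, here is a minimal Python sketch of an autoregressive decoding loop. The `model` object and its `predict_next` method are hypothetical stand-ins for a real LLM and its sampling step, not the API of any particular library.

```python
def generate(model, prompt_tokens, max_new_tokens=32, eos_token=None):
    """Autoregressive decoding: emit one token at a time, each conditioned
    on the prompt plus every token generated so far."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # One full forward pass of the model per generated token.
        next_token = model.predict_next(tokens)
        tokens.append(next_token)
        # Stop early if the model emits an end-of-sequence token.
        if eos_token is not None and next_token == eos_token:
            break
    return tokens
```

Because each new token requires another pass over everything generated so far, the work is inherently serial, which is one reason longer outputs translate directly into higher latency and inference cost.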
However, there may be a better way, thanks to the recent release of Mercury by US-based Inception Labs, the…