Databricks, the company behind Apache Spark, has released Dolly 2.0, reportedly the first open-source, instruction-following large language model (LLM) fine-tuned on a human-generated data set and licensed for commercial use. Dolly could serve as a compelling starting point for homebrew ChatGPT competitors.
Dolly 2.0 is a 12 billion-parameter model based on EleutherAI’s Pythia model family, fine-tuned exclusively on a training data set called “databricks-dolly-15k” that was crowdsourced from Databricks employees. That fine-tuning gives Dolly the ability to answer questions and engage in dialogue as a chatbot, in the style of OpenAI’s ChatGPT.
Dolly 1.0 faced limitations on commercial use because its training data contained output from ChatGPT and was therefore subject to OpenAI’s terms of service. To address this issue, Databricks crowdsourced roughly 15,000 demonstrations of instruction-following behavior from more than 5,000 of its employees between March and April 2023.
The resulting data set, along with Dolly’s model weights and training code, has been released as fully open source under a Creative Commons license, enabling anyone to use, modify, or extend the data set for any purpose, including commercial applications.
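Because the weights and training code are public, anyone can run the model locally. As a rough illustration, here is a minimal sketch of assembling a prompt in an instruction-following format. The exact template (the intro sentence and the “### Instruction:”/“### Response:” markers) is an assumption modeled on Databricks’ published generation pipeline, not an authoritative spec:

```python
# Sketch: wrapping a raw user request in a Dolly-style instruction prompt.
# The template below is an assumption based on Databricks' published
# generation code; consult the released training code for the exact format.

INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"

def build_prompt(instruction: str) -> str:
    """Frame a plain instruction with the markers the model was tuned on."""
    return f"{INTRO}\n\n{INSTRUCTION_KEY}\n{instruction}\n\n{RESPONSE_KEY}\n"

prompt = build_prompt("Explain why open model weights matter for businesses.")
print(prompt)
# The resulting string would then be fed to the released 12B model, e.g.
# through a text-generation pipeline loaded from the published weights.
```

Formatting prompts this way matters because an instruction-tuned model only behaves like a chatbot when its input matches the structure of its fine-tuning examples.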
Dolly’s open-source nature sets it apart from proprietary models like OpenAI’s ChatGPT, which requires users to pay for API access and adhere to specific terms of service. And Meta’s LLaMA, which recently spawned a wave of derivatives after its weights leaked on BitTorrent, does not permit commercial use.
AI researcher Simon Willison called Dolly 2.0 “a really big deal” on Mastodon, praising its fine-tuning instruction set, which was hand-built by 5,000 Databricks employees and released under a CC license. This release could inspire more companies to develop and release their own LLMs, enabling businesses and organizations to create and customize their own chatbots without relying on third-party services.