Large Language Models (or LLMs) are machine learning models (built on top of transformer neural network) that aim to solve language problems like auto-completion, text classification, text generation, and more. It is a subset of Deep Learning.
The "large" refers to the model being pre-trained by a massive set of data (could be in sizes of petabytes), accompanied by millions or billions of parameters and further fine-tuned for a specific use-case.
While LLMs made projects like ChatGPT and Google Gemini possible, they are proprietary. So, there's a lack of transparency, and the entire control of the LLM (and its usage) is at the discretion of the company that owns it.
So, what's the solution to this? Open source LLMs.
With an open-source LLM, you can have multiple benefits like transparency, no vendor lock-in, free to use for commercial purposes, and total control over customization of the model (in terms of performance, carbon footprint, and more).
Similar reasons why we need open-source ChatGPT chatbot alternatives:
Now that you know why we need open-source LLMs, let me highlight some of the best ones available.
1. Falcon 180B
The Technology Innovation Institute (TII) in the United Arab Emirates (UAE) launched an open LLM, which performs close to its proprietary competitors.
The model includes 180 billion parameters and was trained on 3.5 trillion tokens.
In the Falcon AI model family, there are smaller/newer models like Falcon 2 11B, which is a scalable solution, and outperforms similar models from Meta and Google.
β’ License: Falcon-180B TII License (based on Apache 2.0)
β’ Usage: Allowed for commercial use and should be fine-tuned for specific tasks
2. Dolly 2.0
Dolly 2.0 is one of the most prominent open-source LLM models developed by Databricks, fine-tuned on a human-generated instruction dataset. Unlike its predecessor, the dataset is its original, generated by more than 5000 employees of the company.
It includes 12 billion parameters, and is geared towards summarization, QA, classification, and some brainstorming. Dolly 2.0 does not aim to compete with existing models, but the focus was more on providing a human-generated dataset along with an LLM. They claimed it to be the first of its kind back then that tried to mimic ChatGPT's interaction.
β’ License: MIT
β’ Usage: Allowed for research and commercial use
3. Cerebras-GPT
Cerebras-GPT is a family of seven GPT-3 open-source LLM models ranging from 111 million to 13 billion parameters.
These models aim for higher accuracy with a faster training times, lower cost, and consume less energy. It is trained using DeepMind's Chinchilla formula.
β’ License: Apache 2.0
β’ Usage: Allowed for research and commercial use and fine-tuned for ChatGPT like interaction
4. Bloom
Bloom is an open-access multilingual LLM model that is trained to continue text from prompts, and can be trained for other text-based tasks.
The model includes 176 billion parameters, and can output in 46 different languages and 13 programming languages.
β’ License: RAIL License v1.0
β’ Usage: Allowed for research and non-commercial entities
5. DLite V2
DLite by AI Squared aims to present lightweight LLMs that you can use almost anywhere.
The family of model includes variations ranging from 123 million to 1.5 billion parameters. It utilizes the same database built by Databricks for Dolly 2.0. Hence, it is tuned for the same use-case.
β’ License: Apache 2.0
β’ Usage: Allowed for research and commercial use and fine-tuned for ChatGPT like interaction
6. MPT-7B
MPT-7B is another prominent open-source model by Databricks. Unlike Dolly 2.0, it competes and surpasses Meta's LLaMA-7B in many ways.
The model was trained using a mixed set of data from various sources, including 6.7 billion parameters. It also aims to be an affordable alternative to the closed source LLM models available.
β’ License: Apache 2.0
β’ Usage: Allowed for research and commercial use and fine-tuned for ChatGPT like interaction
7. XGen 7B
XGen 7B is an LLM model by Salesforce, one of the leading companies in its industry. The model has been trained for text generation and code tasks, claiming performance benefits over similar models like MPT.
β’ License: Apache 2.0
β’ Usage: Allowed for research and commercial use and fine-tuned for text generation/code tasks
8. OpenLLaMA
OpenLLaMA is a reproduction of Meta's popular LLaMa model trained with a different dataset to make it available under a permissive license, when compared to Meta's custom commercial license.
Currently, you can find OpenLLaMA 7Bv2 and a 3Bv3 model, with all of them trained using a mixture of datasets.
You can learn more about it on its GitHub page.
β’ License: Apache 2.0
β’ Usage: Allowed for research and commercial use
9. StarCoder
StarCoder is an interesting open-source LLM trained using data from GitHub (commits, issues etc) fine-tuned for coding tasks.
It includes 15 billion parameters and trained on 80+ programming languages. Considering the data is from GitHub, you will need to add attribution wherever necessary when you use the model.
β’ License: Open RAIL-M v1
β’ Usage: Allowed for research and commercial use (with restrictions not to misuse it)
10. CodeT5
Another open-source from Salesforce, fine-tuned for coding tasks and code generation. CodeT5 is one of the most competing code LLMs that features 16 billion parameters, and claims to outperform OpenAI's code-cushman001 model.
β’ License: BSD 3-Clause
β’ Usage: Allowed for research and non-commercial use
11. Yi-1.5
Yi-1.5 is an exciting open-source LLM featuring a whopping 34.4 billion parameters.
It is fine-tuned for coding, math, reasoning, and instruction-following capability. Originally, this model wasn't open-source, but later it was. And, that's a good thing.
β’ License: Apache 2.0
β’ Usage: Allowed for research and commercial use
12. SOLAR-10.7B
SOLAR-10.7B is a pre-trained open-source LLM by Upstage.ai that just generates random text. You need to fine-tune it to adapt to your particular requirements.
When compared to models with larger parameters, it promises to provide a good model performance efficiency and scalability even with 10.7 billion parameters. It is an incredibly popular adaptable language model.
β’ License: Apache 2.0
β’ Usage: Allowed for research and commercial use
13. Mistral 7B
Mistral 7B is a powerful language model that claims to outperform LLaMa 1/2 in various ways. The large language model is capable of math, reasoning, word knowledge, QA, and more.
It is a pre-trained generative text model with 7 billion parameters. Hence, it does not have any moderation mechanisms. So, you need to fine-tune it to output moderated results.
β’ License: Apache 2.0
β’ Usage: Allowed for research and commercial use
14. OLMo-7B
OLMo-7B is yet another open language model pre-trained with Dolma dataset featuring three trillion token open corpus. The entire training data, model weights, and the code is open-source, and available to use without restriction.
β’ License: Apache 2.0
β’ Usage: Allowed for research and commercial use
π¬ What is your favorite open-source LLM? Have you tried running any of them? Share your thoughts in the comments below!