Primer for Using LLMs

Greycroft
Mar 28, 2023


Our team saw a big influx of questions from our portfolio companies about generative AI broadly and LLMs specifically over the past few months as companies looked to be early adopters. This led to many great discussions with these management teams and with other leaders in the space who have implemented LLMs. We decided to distill the learnings from those discussions into the primer below and share it with our portfolio. The reception was strong, so we decided to open-source the primer with the hope that it helps accelerate the adoption of LLMs for more companies and builders.

The space is moving rapidly with new best practices, resources, and software emerging constantly. This is meant to be a living and breathing document, so all feedback and recommendations are appreciated! Feel free to email Nick Crance at nicholas@greycroft.com.

We’d also love to hear from you if you’re building tools for the below processes or are building an LLM-based application!

Overview

The power of generative AI, and of the foundation models that serve as its engine, is undeniable today. Generative AI and one of its most popular applications, ChatGPT, feel like they’re everywhere, likely leading you to frequently think about use cases that could accelerate product roadmaps, increase employee productivity, and reduce operating costs. Those thoughts may be closely followed by a feeling of unease about whether your team has the resources or know-how to build on foundation models.

The truth is that foundation models have existed for years and are based on the Transformer architecture invented in 2017 by a team of Google researchers. The recent explosion in popularity is driven by the rise of tooling that abstracts the complexity of accessing these models, which means you can now tap into these models even if your engineers don’t have PhDs in AI or NLP.

The goal of this primer is to provide general information, best practices, and resources to help you navigate being a successful early adopter. The primer focuses on large language models (or “LLMs”) and their emerging suite of tools because of their ubiquity.

The guide is written to be valuable regardless of whether you’re early in the journey with only a use case in mind or you’re already using LLMs. All content will be in layman’s terms to support a wider audience, but links with technical detail are included if you want to explore further.

As a disclaimer: this primer will not focus on why and how you should use AI. The decision to use LLMs in your internal systems or external products isn’t generalizable. It is company-specific and high stakes because it is time, cost, and resource intensive. It’s critical to ensure your team is not falling for shiny object syndrome and to ask yourself whether there is a clear business case for using these models. If you don’t have a clear answer of “it makes our product 10x better,” “it significantly reduces opex,” or “it accelerates revenue generation,” then it’s worth reassessing that decision. There are many great resources highlighting the range of use cases for LLMs, and our team is available to give company-specific guidance.

This primer is split into two major sections: setting up your model(s) and operationalizing your model(s).

Setting up your model(s)

Accessing large language models has never been easier. You can sign up for OpenAI, enter your credit card info, and access GPT-3 models (and soon to be GPT-4!) via API within an hour. That may be the best route for your team, but we recommend considering a broader set of options given your LLM (or set of LLMs) will likely be the engine for your internal systems and external customer-facing products.
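For instance, a first call can be as small as the sketch below (using the openai Python package as it existed at the time of writing; the model name and client interface change over time, so treat this as illustrative rather than definitive):

```python
import os
import openai

# Assumes you've created an OpenAI account and exported OPENAI_API_KEY.
openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    model="text-davinci-003",  # one of the GPT-3 family of models
    prompt="Summarize the benefits of large language models in two sentences.",
    max_tokens=100,
)
print(response["choices"][0]["text"].strip())
```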

Picking the right LLM

The first and most critical decision in the LLM process is sifting through the dozens of viable models available to you and your team and picking the best model(s) for your use case. There are many tradeoffs to consider as you make that choice.

Open-source vs closed-source: Closed-source can provide access to LLM capabilities without AI / ML engineering resources and likely does so at a lower cost. The downside is that this abstraction layer may create vendor lock-in and may lead to higher total costs at larger scale (vs hosting your own model). If you have technical resources, an open-source model can provide more flexibility and higher performance. However, you may incur higher cost — both in paying those technical employees and in cloud bills for hosting those models.

Quality: Models have varying performance across multiple dimensions such as output accuracy, response time, bias, etc. Objective performance measurement can be difficult. Benchmarks like Stanford’s HELM framework exist, but it’s important to define your objectives and find a model that fits your needs. In general, larger models (i.e. more parameters) are often more performant across general tasks, but you should review performance within the lens of your use case. You should explore and test several models to see how outputs compare across them. Further, it’s important to keep in mind that performance can be enhanced in several ways (e.g., prompt engineering, fine-tuning) — more info on this topic in the next section.

Cost: Tied to the above. Larger models often produce higher quality outputs, but this size can also lead to a higher cost per inference. OpenAI’s pricing highlights this relationship. Keep this in mind as you build the business case for utilizing an LLM. A higher cost model may break the business case, but you may be able to make the numbers work with a lower cost model.

Data privacy & protection: The data that flows through your LLMs will likely contain sensitive information (e.g., PII) and proprietary datasets that can serve as your competitive advantage. If using a third-party managed model (e.g., Anthropic, OpenAI), then it’s important to understand what will happen with your data — where it is stored, how is it secured, how it is used, what legal rights they have to your data, etc. — more info on this topic in the next section

Single vs multiple models: If you have several use cases for LLMs, it may be best to select the individual model that is best for each task. You may use multiple models each developed by a different group or all developed by a specific provider. For example, OpenAI’s GPT-3 is actually 4 models with varying levels of performance and costs. They even have a comparison tool to help you pick the best models for each use case.

The above information can be ascertained through discussions with vendors and comprehensively testing a range of models to find the best fit. It’s critical to test for your specific use case (vs. generalized benchmarks or ones provided by the vendor) and conduct testing on a frequent cadence given how rapidly new models are being released.

During testing, it’s best to experiment with each model’s settings, which you can adjust to optimize the output for your use case. For example, increasing the “temperature” parameter encourages the LLM to return less common words or sequences. This may be desirable for applications where variation in responses is valuable, such as creative writing and even chatbots. Cohere has a good overview of these settings if you want to read further.
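For instance, a quick sweep over temperature values (sketched below with the openai package; the same idea applies to any provider’s SDK) makes the effect easy to see side by side:

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]  # assumes your key is configured

prompt = "Write a one-line tagline for a travel planning app."

# Compare how output variety changes as temperature increases.
for temperature in (0.0, 0.5, 1.0):
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=temperature,  # higher values favor less common words / sequences
        max_tokens=30,
    )
    print(f"temperature={temperature}: {response['choices'][0]['text'].strip()}")
```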

In the future, we believe tools will emerge that allow users to route their inputs to multiple model providers and use the optimal output based on pre-defined criteria (e.g., response time, cost, etc.). In other words, what Kubernetes provides for the cloud, but for LLMs. LangChain’s model comparison module and OpenAI’s Evals may be early versions of this functionality.

Prompt engineering & design

Prompt engineering, sometimes called prompt design, is the process of constructing the input to these models to include key details that improve the output. This can take many forms, but often includes chain-of-thought reasoning and few-shot examples. Many generative AI applications today use this to generate relevant results from generalized models (e.g., GPT-3).

Designing and identifying the optimal prompts for your specific use case or application is an iterative process. There are many resources on prompt engineering best practices (Snorkel, OpenAI, Cohere). Further, there are many prompt engineering platforms that make it easy to design, compare, store, and access these prompts, including LangChain, Dust.tt, Vellum, Baseplate, and Promptable.
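As a concrete illustration of few-shot prompting, the sketch below prepends a handful of labeled examples to the new input (the classification task, examples, and labels are made up for illustration):

```python
# A hypothetical few-shot prompt for classifying support tickets.
FEW_SHOT_PROMPT = """Classify the support ticket as 'billing', 'bug', or 'other'.

Ticket: I was charged twice for my subscription this month.
Category: billing

Ticket: The export button does nothing when I click it.
Category: bug

Ticket: {ticket}
Category:"""


def build_prompt(ticket: str) -> str:
    # Insert the new ticket into the template; the LLM completes the category.
    return FEW_SHOT_PROMPT.format(ticket=ticket)


print(build_prompt("How do I update my credit card on file?"))
```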

Prompt engineering can materially impact the accuracy of your application, but it is not a defensible moat against competitors, for two reasons:

Low defensibility: Prompts can be easily replicated (and potentially beaten) by competitors using the same model. They may even be able to expose your prompt, as a Stanford student did to Bing with prompt injection.

Lack of interoperability: Prompt engineering may work well for an individual model, but likely will not be as reliable across models including upgrades of existing models (e.g., transitioning from GPT-3 to GPT-4 after the release last week). There is also speculation that future models may be good enough that prompt engineering won’t improve outputs.

Emerging frameworks such as Demonstrate-Search-Predict (or “DSP”) highlight that the AI community is dedicating resources to improving in-context learning.

Improving model output quality with adapters & fine-tuning

In addition to prompt engineering, LLM output quality can be improved through fine-tuning the weights of an existing model or adding more parameters (i.e., adapter modules). It’s important to consider both options when assessing the right option for you.

Fine-tuning models is the process of unfreezing a pre-trained LLM and re-training the model with a custom domain-specific dataset to improve output quality for a given application or set of applications.

Closed-model providers may offer fine-tuning; see OpenAI’s guide for fine-tuning here. However, it comes at a higher cost than using base models. For example, inference costs for a fine-tuned model on OpenAI are 4–6x those of the base versions, and that doesn’t include training.
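At the time of writing, OpenAI’s fine-tuning flow expects training data as JSONL prompt/completion pairs; the sketch below shows roughly what preparing such a file looks like (the examples and file name are illustrative, and the guide linked above covers the exact upload and training commands):

```python
import json

# Illustrative domain-specific examples; in practice you'd export these
# from your own labeled data.
examples = [
    {"prompt": "Customer: Where is my order? ->",
     "completion": " Let me check the tracking details on your order for you."},
    {"prompt": "Customer: Can I get a refund? ->",
     "completion": " Refunds are available within 30 days of purchase."},
]

# Fine-tuning data is uploaded as one JSON object per line (JSONL).
with open("fine_tune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```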

Open-source models can be fine-tuned with tools like Humanloop and Vellum. I won’t dive deep into this process of fine-tuning your self-hosted model in this guide. Instead I’d recommend reading Hugging Face’s primer on the topic here.

Adapter modules are an alternative to fine-tuning. The process includes adding incremental parameter layers on top of the “frozen” generalized model and training them on task-specific data. This layer can be materially smaller than the base model (i.e., <10% the number of parameters). There are a few approaches including linear probing and inserting adapter layers between transformer layers. This approach is less expensive than full fine-tuning due to a lower number of weights being trained (only incremental parameters vs all). However, it does require using open-source models and may require more experimentation to find the right approach to optimize output.
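To make the adapter idea more concrete, here is a minimal PyTorch-style sketch (not any specific library’s implementation) of a bottleneck adapter that would sit on top of a frozen transformer layer’s hidden states:

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """A small bottleneck module trained on task data while the base model stays frozen."""

    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.activation = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter learns a small correction on top of
        # the frozen model's hidden states.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))


# Only the adapter's parameters (a small fraction of the base model's) get trained.
adapter = Adapter(hidden_size=768)
print(f"trainable adapter parameters: {sum(p.numel() for p in adapter.parameters()):,}")
```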

Data privacy & private models

We’ve received several questions about how the data in LLM prompts is handled: Are these APIs secure if customers input sensitive data? Do prompts and the associated reinforcement learning from our customers help our competitors? Can competitors somehow reverse engineer the data we’re inputting into these models?

These are all valid questions. Based on our discussions with model providers and operators in the space, we believe it will become standard practice for model providers to offer “private models” similar to how cloud service providers offer both public and private clouds. In practice, these function similarly to customer-specific or use case specific fine-tuned models but use the weights of the base pre-trained models. They will likely be more expensive to run inference on.

To re-emphasize an earlier point: it’s important to understand what will happen with your data when using a third-party hosted model, including where it is stored, how it is secured, how it is used, and what legal rights the provider has to it. We recommend asking these questions during discussions with vendors.

Operationalizing your model(s)

Now that you’ve picked your model(s) and are spinning them up in the right way, it’s time to focus on how to stand up your LLM-based application. We’ll cover a few of the most important operational details you’ll want to consider as you set up your application.

Chaining together an LLM-based application

Many LLM applications may be simple enough to generate relevant output for a user based on their input and some prompt engineering. Other LLM applications may require a more complex series of tasks — both with the LLM and external apps, services, or APIs. Enter the concept of “Chains”.

Borrowing from Langchain’s documentation:

“Chains allow us to combine multiple components together to create a single, coherent application. For example, we can create a chain that takes user input, formats it with a [template for your prompt], and then passes the formatted response to an LLM. We can build more complex chains by combining multiple chains together, or by combining chains with other components.”

Further, their documentation includes a simple example of an application where a user inputs a description of their company and the LLM generates two outputs: first, a name for the company based on that description; second, a slogan based on the description and the name. This requires two separate, sequential prompts to your LLM, as sketched below.
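A rough sketch of that two-step example using LangChain follows; the library’s interfaces evolve quickly, so treat the exact imports and class names as a snapshot rather than the definitive API (an OpenAI API key is assumed to be configured):

```python
from langchain.chains import LLMChain, SequentialChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

llm = OpenAI(temperature=0.7)  # assumes OPENAI_API_KEY is set in the environment

# Chain 1: company description -> company name.
name_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["description"],
        template="Suggest one name for a company that {description}.",
    ),
    output_key="name",
)

# Chain 2: description + name -> slogan.
slogan_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["description", "name"],
        template="Write a slogan for {name}, a company that {description}.",
    ),
    output_key="slogan",
)

# Run both prompts sequentially as a single, coherent application.
overall = SequentialChain(
    chains=[name_chain, slogan_chain],
    input_variables=["description"],
    output_variables=["name", "slogan"],
)
print(overall({"description": "sells handmade ceramic coffee mugs"}))
```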

LangChain’s documentation has an overview of LLM Chains with key concepts, examples of chains, and how to get started. It’s worth a read to familiarize yourself with the concept. Other emerging frameworks such as Dust and Klu also enable chaining.

Data augmentation in prompts

Enhancing prompts with data is an example of “chains.” As discussed in the prompt engineering section, incorporating more context into prompts can materially improve model output accuracy for a task. While this context can be generalized, it is often more powerful to incorporate relevant customer-specific data into LLMs. This requires designing an application that can both fetch this data and insert it into prompts with low latency. Many prompt engineering platforms, such as LangChain, Dust.tt, and Fixie, have modules that support this since data can enhance prompt design.
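A stripped-down version of this pattern is sketched below; the lookup function is hypothetical and stands in for whatever low-latency data source (CRM, database, internal API) your application would query before calling the LLM:

```python
PROMPT_TEMPLATE = """You are a support assistant for {company}.

Customer context:
- Plan: {plan}
- Open tickets: {open_tickets}

Answer the customer's question using the context above.
Question: {question}
Answer:"""


def fetch_customer_context(customer_id: str) -> dict:
    # Hypothetical lookup: in practice this would query your CRM or database.
    return {"company": "Acme Co", "plan": "Enterprise", "open_tickets": 2}


def build_augmented_prompt(customer_id: str, question: str) -> str:
    # Fetch customer-specific data and insert it into the prompt template.
    context = fetch_customer_context(customer_id)
    return PROMPT_TEMPLATE.format(question=question, **context)


print(build_augmented_prompt("cust_123", "Why was I billed twice this month?"))
```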

In many cases, your data may be larger than the token constraints allowed by model provider APIs or be large enough that it’s costly to run through your own model every time. Embeddings may help with this.

Embeddings & vector stores

There are many LLM use cases where it’s valuable to access large volumes of unstructured data. For example, you may want to allow users to ask questions or run semantic search against a large repository of documents, or identify customer segments based on underlying characteristics. This is where embeddings come in. If you’re an ML / NLP engineer, this likely isn’t new to you.

Text embeddings allow you to represent unstructured text as a structured numeric vector in a lower-dimensional space. Text with similar meanings often has similar numeric representations. This is important because transformer-based models, including LLMs, don’t read text directly. Embeddings make words, sentences, and larger blocks of text readable for LLMs.

Embeddings can be created in many ways, but it is generally a 2-step process.

  1. Tokenize the desired text using a tokenizer.
  2. Convert tokens into embeddings, which can represent words or even full sentences.
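As an illustration of those two steps with open-source tooling (using Hugging Face’s transformers and a small sentence-embedding model as an example; the model name and pooling choice are just one reasonable option):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # example embedding model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "Embeddings turn text into vectors that capture meaning."

# Step 1: tokenize the text.
tokens = tokenizer(text, return_tensors="pt")

# Step 2: convert tokens into embeddings, then pool them into one sentence vector
# (simple mean pooling here; a production pipeline would respect the attention mask).
with torch.no_grad():
    token_embeddings = model(**tokens).last_hidden_state
sentence_embedding = token_embeddings.mean(dim=1).squeeze()
print(sentence_embedding.shape)  # a 384-dimensional vector for this model
```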

Many platforms and the engineers that build them rely on pre-trained embeddings models that have an associated tokenizer, but it’s important to take the approach that’s natural to your team.

If you have a technical team that is using an open-source LLM, then you may want to use an open-source embeddings model. You may also need a vector store / database like Weaviate, Pinecone, or Chroma if you go this route.

If you’re using a closed-source model and want to use managed services for many parts of your stack, then the simplest option is to use an end-to-end embedding tool offered by the LLM providers (see OpenAI’s embedding tool and Cohere’s embedding tool).
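For example, a minimal semantic-similarity check against OpenAI’s embeddings endpoint might look like the sketch below (the model name and client interface change over time, so treat this as illustrative; an API key is assumed to be configured):

```python
import os

import numpy as np
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]


def embed(text: str) -> np.ndarray:
    # Returns a dense vector representation of the input text.
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])


query = embed("How do I reset my password?")
document = embed("Instructions for recovering your account credentials.")

# Cosine similarity: values closer to 1 mean the texts are semantically closer.
similarity = float(
    np.dot(query, document) / (np.linalg.norm(query) * np.linalg.norm(document))
)
print(f"cosine similarity: {similarity:.3f}")
```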

If you’re looking for more information on embeddings, Jay Alammar goes deeper on the technical aspects of contextual embeddings and the history of embedding models. Lenny Rachitsky also provides a clear and succinct step-by-step guide on how he created embeddings for all of his previous writing to power a chatbot that answers readers’ questions based on that writing.

Augmented Language Models

Meta recently released a report summarizing the growing body of research focused on augmenting language models with reasoning skills (i.e., decomposing complex tasks into simpler subtasks) and the ability to use tools (i.e., calling external applications for a task). These “Augmented Language Models” are often focused on addressing common limitations of LLMs.

For example, Retrieval Augmented Generation (“RAG”) models leverage an LLM’s knowledge of language trained into its parameters (“parametric knowledge”) and pair it with a retrieval system that can incorporate external data from proprietary, enterprise-specific data sources (“non-parametric knowledge”). Meta’s Atlas and Google DeepMind’s RETRO are examples of recent innovation in the space, which has high potential to address LLMs’ lack of traceability and output consistency.
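A highly simplified sketch of the retrieval-augmented pattern follows; the retriever here is a naive word-overlap placeholder standing in for a real embedding search over a vector store, and the resulting prompt would then be passed to whichever LLM you’ve chosen:

```python
def _words(text: str) -> set[str]:
    # Very rough normalization for the toy retriever below.
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    # Placeholder retriever: a real system would embed the query and search a
    # vector store; here we rank by naive word overlap purely for illustration.
    return sorted(
        documents, key=lambda doc: len(_words(query) & _words(doc)), reverse=True
    )[:top_k]


def build_rag_prompt(query: str, documents: list[str]) -> str:
    # Ground the generation step in retrieved, non-parametric knowledge.
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )


docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Shipping typically takes 3-5 business days.",
    "Enterprise customers receive a dedicated account manager.",
]
print(build_rag_prompt("What is the refund policy?", docs))
```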

We believe these approaches to augmenting language models have high potential to address current limitations of LLMs, unlocking higher levels of adoption from enterprises.

****************

Disclaimers: Notwithstanding Greycroft’s publication of this article to its website, the views expressed herein are the personal views of Nick Crance and do not necessarily reflect the views of Greycroft or the strategies and products that Greycroft offers or invests in. Further, the inclusion of such views herein does not constitute a representation that Greycroft has adopted or intends to adopt practices that align with such views. Nothing contained herein constitutes investment, legal, tax or other advice nor is it to be relied on in making an investment or other decision. This publication was prepared solely for information purposes and should not be viewed as a current or past recommendation or a solicitation of an offer to buy or sell any securities or to adopt any investment strategy. This publication contains projections or other forward-looking statements, which are based on beliefs, assumptions and expectations that may change as a result of many possible events or factors. Companies noted throughout are not [all] Greycroft portfolio companies or affiliates and are intended as a curated list of providers in the relevant spaces. The information contained herein is only as current as of the date indicated and may be superseded by subsequent market events or for other reasons. The information in this document has been developed internally and/or obtained from sources believed to be reliable; however, Greycroft does not guarantee the accuracy, adequacy, or completeness of such information. Greycroft does not assume any duty to, nor undertakes to update forward looking statements.
