The pg_tiktoken extension
Efficiently tokenize data in your Postgres database using OpenAI's `tiktoken` library
The pg_tiktoken extension enables fast and efficient tokenization of data in your Postgres database using OpenAI's tiktoken library.
This topic describes how to install the extension, use its functions for tokenization and token management, and integrate the extension with ChatGPT models.
What is a token?
Language models process text in units called tokens. A token can be as short as a single character or as long as a complete word, such as "a" or "apple." In some languages, tokens may comprise less than a single character or even extend beyond a single word.
For example, consider the sentence "Neon is serverless Postgres." It can be divided into seven tokens: ["Ne", "on", "is", "server", "less", "Post", "gres"].
pg_tiktoken functions
The pg_tiktoken extension offers two functions:
- tiktoken_encode: Accepts text inputs and returns tokenized output, allowing you to tokenize your text data.
- tiktoken_count: Counts the number of tokens in a given text. This feature helps you adhere to text length limits, such as those set by OpenAI's language models.
Install the pg_tiktoken extension
You can install the pg_tiktoken extension by running the following CREATE EXTENSION statement in the Neon SQL Editor or from a client such as psql that is connected to Neon.
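A minimal form of that statement is sketched here. The IF NOT EXISTS clause is an addition in this sketch so that the statement can be re-run without error; it assumes the extension is available on your Postgres instance.

```sql
-- Enable pg_tiktoken in the current database.
-- IF NOT EXISTS (an addition in this sketch) makes the statement safe to re-run.
CREATE EXTENSION IF NOT EXISTS pg_tiktoken;
```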
For information about using the Neon SQL Editor, see Query with Neon's SQL Editor. For information about using the psql client with Neon, see Connect with psql.
Use the tiktoken_encode function
The tiktoken_encode function tokenizes text input and returns a tokenized output. The function accepts encoding names and OpenAI model names as the first argument and the text you want to tokenize as the second argument, as shown:
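As a minimal sketch, a call might look like the following. The cl100k_base encoding and the sample sentence are arbitrary choices made for illustration.

```sql
-- Tokenize a sample sentence with the cl100k_base encoding.
-- First argument: an encoding name or OpenAI model name.
-- Second argument: the text to tokenize.
SELECT tiktoken_encode('cl100k_base', 'A long time ago in a galaxy far, far away');
```

The result is an array of integer token IDs, one element per token.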
The function tokenizes text using the Byte Pair Encoding (BPE) algorithm.
Use the tiktoken_count function
The tiktoken_count function counts the number of tokens in a text. The function accepts encoding names and OpenAI model names as the first argument and text as the second argument, as shown:
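A minimal sketch of such a call, again using an illustrative encoding name and sample sentence:

```sql
-- Count the tokens in a sample sentence using the cl100k_base encoding.
-- The argument order matches tiktoken_encode: encoding or model name first, then the text.
SELECT tiktoken_count('cl100k_base', 'A long time ago in a galaxy far, far away');
```

The function returns a single number, which makes it convenient to use in a WHERE clause or to store alongside the text.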
Supported models
The tiktoken_count and tiktoken_encode functions accept both encoding and OpenAI model names as the first argument:
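The two forms can be sketched side by side. The sample text is illustrative, and gpt-3.5-turbo is used here as an example of a ChatGPT model name.

```sql
-- Select the tokenizer by encoding name ...
SELECT tiktoken_encode('cl100k_base', 'A long time ago in a galaxy far, far away');

-- ... or by OpenAI model name; the extension resolves the model to its encoding.
SELECT tiktoken_count('gpt-3.5-turbo', 'A long time ago in a galaxy far, far away');
```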
The following models are supported:
Encoding name | OpenAI model |
---|---|
cl100k_base | ChatGPT models, text-embedding-ada-002 |
p50k_base | Code models, text-davinci-002, text-davinci-003 |
p50k_edit | Edit models, such as text-davinci-edit-001 and code-davinci-edit-001 |
r50k_base (or gpt2) | GPT-3 models like davinci |
Integrate pg_tiktoken with ChatGPT models
The pg_tiktoken extension allows you to store chat message history in a Postgres database and retrieve messages that comply with OpenAI's model limitations.
For example, consider the message table below:
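One possible definition of such a table is sketched here. The column names (role, content, created, n_tokens) and their types are assumptions made for illustration; adjust them to your own schema.

```sql
-- Hypothetical schema for storing chat history; names and types are assumptions.
CREATE TABLE message (
  role     VARCHAR(50) NOT NULL,              -- 'system', 'user', or 'assistant'
  content  TEXT NOT NULL,                     -- the message text
  created  TIMESTAMP NOT NULL DEFAULT NOW(),  -- insertion time, used to order the history
  n_tokens INTEGER                            -- number of tokens in content
);
```

Storing the token count with each row avoids re-tokenizing the content every time the history is read.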
The gpt-3.5-turbo chat model requires specific parameters:
The messages parameter is an array of message objects, with each object containing two pieces of information: the role of the message sender (either system, user, or assistant) and the actual message content. Conversations can be brief, with just one message, or span multiple pages as long as the combined message tokens do not exceed the 4096-token limit.
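To make the shape of these parameters concrete, the following sketch assembles a request body from stored rows. It assumes a message table with role, content, and created columns (hypothetical names) and uses the built-in Postgres functions jsonb_build_object and jsonb_agg. It only builds JSON; it does not call the OpenAI API.

```sql
-- Build a JSON document with a "model" field and a "messages" array,
-- where each array element carries a role and its content.
SELECT jsonb_build_object(
         'model', 'gpt-3.5-turbo',
         'messages', jsonb_agg(
           jsonb_build_object('role', role, 'content', content)
           ORDER BY created            -- oldest message first
         )
       ) AS request_body
FROM message;
```

If the table is empty, jsonb_agg returns NULL, so the messages field is null rather than an empty array.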
To insert role, content, and the number of tokens into the database, use the following query:
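A sketch of such an insert, assuming a message table with role, content, and n_tokens columns (hypothetical names) and an illustrative user message:

```sql
-- Store a user message together with its token count.
-- tiktoken_count computes the count at insert time using the cl100k_base encoding.
INSERT INTO message (role, content, n_tokens)
VALUES (
  'user',
  'Hello, how are you?',
  tiktoken_count('cl100k_base', 'Hello, how are you?')
);
```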
Manage text tokens
When a conversation contains more tokens than a model can process (e.g., over 4096 tokens for gpt-3.5-turbo), you will need to truncate the text to fit within the model's limit.
Additionally, lengthy conversations may result in incomplete replies. For example, if a gpt-3.5-turbo conversation spans 4090 tokens, the response will be limited to just six tokens.
The following query retrieves messages up to your desired token limits:
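One way to write such a query is with a running total computed by a window function. The sketch assumes a message table with role, content, created, and n_tokens columns (hypothetical names). The literal 4000 stands in for <MAX_HISTORY_TOKENS> so that the statement runs as written.

```sql
WITH cte AS (
  SELECT role, content, created, n_tokens,
         -- Running total of tokens, accumulated from the newest message backwards.
         SUM(n_tokens) OVER (ORDER BY created DESC) AS cumulative_sum
  FROM message
)
SELECT role, content, created, n_tokens, cumulative_sum
FROM cte
WHERE cumulative_sum <= 4000   -- replace 4000 with your <MAX_HISTORY_TOKENS> value
ORDER BY created;              -- return the kept messages oldest first
```

Because the sum runs from the most recent message backwards, the oldest messages are the first to be dropped when the history exceeds the budget.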
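<MAX_HISTORY_TOKENS> represents the conversation history you want to keep for chat completion. It is what remains of the model's token limit after you set aside tokens for the system message and for the completion.
For example, assume the desired completion length is 90 tokens (NUM_COMPLETION_TOKENS=90). With the 4096-token limit of gpt-3.5-turbo, that leaves at most 4006 tokens for the system message and the conversation history combined.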
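The arithmetic can be sketched directly in SQL. The names MODEL_MAX_TOKENS and NUM_SYSTEM_TOKENS are labels introduced here for illustration, and the system message size of 6 tokens is an assumed value, not one taken from the text.

```sql
-- MAX_HISTORY_TOKENS = MODEL_MAX_TOKENS - NUM_SYSTEM_TOKENS - NUM_COMPLETION_TOKENS
SELECT 4096   -- MODEL_MAX_TOKENS: the gpt-3.5-turbo limit
     - 6      -- NUM_SYSTEM_TOKENS: assumed size of the system message (illustrative)
     - 90     -- NUM_COMPLETION_TOKENS: tokens reserved for the reply
     AS max_history_tokens;   -- 4000
```

In practice, the system message size could be measured with tiktoken_count instead of being hard-coded.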
Conclusion
In conclusion, the pg_tiktoken extension is a valuable tool for tokenizing text data and managing tokens within Postgres databases. By leveraging OpenAI's tiktoken library, it simplifies the process of tokenization and working with token limits, enabling you to integrate more easily with OpenAI's language models.
As you explore the capabilities of the pg_tiktoken extension, we encourage you to provide feedback and suggest features you'd like to see added in future updates. We look forward to seeing the innovative natural language processing applications you create using pg_tiktoken.
Resources