Get Started with the LangChain Integration
Note
This tutorial uses LangChain's Python library. For a tutorial that uses the JavaScript library, see Get Started with the LangChain JS/TS Integration.
You can integrate Atlas Vector Search with LangChain to build LLM applications and implement retrieval-augmented generation (RAG). This tutorial demonstrates how to start using Atlas Vector Search with LangChain to perform semantic search on your data and build a RAG implementation. Specifically, you perform the following actions:
Set up the environment.
Store custom data on Atlas.
Create an Atlas Vector Search index on your data.
Run the following vector search queries:
Semantic search.
Semantic search with score.
Semantic search with metadata pre-filtering.
Implement RAG by using Atlas Vector Search to answer questions on your data.
Work with a runnable version of this tutorial as a Python notebook.
Background
LangChain is an open-source framework that simplifies the creation of LLM applications through the use of "chains." Chains are LangChain-specific components that can be combined for a variety of AI use cases, including RAG.
By integrating Atlas Vector Search with LangChain, you can use Atlas as a vector database and use Atlas Vector Search to implement RAG by retrieving semantically similar documents from your data. To learn more about RAG, see Retrieval-Augmented Generation (RAG) with Atlas Vector Search.
Prerequisites
To complete this tutorial, you must have the following:
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
An OpenAI API Key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
An environment to run interactive Python notebooks such as Colab.
Set Up the Environment
Set up the environment for this tutorial.
Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook allows you to run Python code snippets individually, and you'll use it to run the code in this tutorial.
To set up your notebook environment:
Install and import dependencies.
Run the following command:
pip install --quiet --upgrade langchain langchain-community langchain-core langchain-mongodb langchain-openai pymongo pypdf
Then, run the following code to import the required packages:
import os, pymongo, pprint
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel
Define environment variables.
Run the following code, replacing the placeholders with the following values:
Your OpenAI API Key.
Your Atlas cluster's SRV connection string.
os.environ["OPENAI_API_KEY"] = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"
Note
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
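If you want to confirm your connection string before continuing, an optional check is to ping the cluster with PyMongo:

# Optional: verify the connection string by pinging the cluster
client = MongoClient(ATLAS_CONNECTION_STRING)
client.admin.command("ping")  # raises an exception if the cluster is unreachable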
Use Atlas as a Vector Store
Then, load custom data into Atlas and instantiate Atlas as a vector database, also called a vector store. Copy and paste the following code snippets into your notebook.
Load the sample data.
For this tutorial, you use a publicly accessible PDF document about a recent MongoDB earnings report as the data source for your vector store.
To load the sample data, run the following code snippet. It does the following:
Retrieves the PDF from the specified URL and loads the raw text data.
Uses a text splitter to split the data into smaller documents.
Specifies chunk parameters, which determine the number of characters in each document and the number of characters that should overlap between two consecutive documents.
# Load the PDF
loader = PyPDFLoader("https://investors.mongodb.com/node/13176/pdf")
data = loader.load()

# Split PDF into documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
docs = text_splitter.split_documents(data)

# Print the first document
docs[0]
Document(metadata={'producer': 'West Corporation using ABCpdf', 'creator': 'PyPDF', 'creationdate': '2025-03-05T21:06:26+00:00', 'title': 'MongoDB, Inc. Announces Fourth Quarter and Full Year Fiscal 2025 Financial Results', 'source': 'https://investors.mongodb.com/node/13176/pdf', 'total_pages': 9, 'page': 0, 'page_label': '1'}, page_content='MongoDB, Inc. Announces Fourth Quarter and Full Year Fiscal 2025 Financial Results\nMarch 5, 2025\nFourth Quarter Fiscal 2025 Total Revenue of $548.4 million, up 20% Year-over-Year')
Instantiate the vector store.
Run the following code to create a vector store instance named vector_store from the sample documents. This snippet specifies the following:
The connection string to your Atlas cluster.
langchain_db.test as the Atlas namespace to store the documents.
The text-embedding-3-large embedding model from OpenAI to convert the text into vector embeddings for the embedding field.
vector_index as the index to use for querying the vector store.
# Instantiate the vector store using your MongoDB connection string
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string = ATLAS_CONNECTION_STRING,
    namespace = "langchain_db.test",
    embedding = OpenAIEmbeddings(model="text-embedding-3-large"),
    index_name = "vector_index"
)

# Add documents to the vector store
vector_store.add_documents(documents=docs)
After running the sample code, you can view your vector embeddings in the Atlas UI by navigating to the langchain_db.test collection in your cluster.
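If you also want to verify the data programmatically, a quick PyMongo check like the following works. This is an optional sketch that reuses the ATLAS_CONNECTION_STRING you defined earlier:

# Optional check: count the stored chunks and inspect one document
client = MongoClient(ATLAS_CONNECTION_STRING)
collection = client["langchain_db"]["test"]

print(collection.count_documents({}))                         # number of chunks stored
print(collection.find_one({}, {"embedding": {"$slice": 5}}))  # one document with the first 5 embedding values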
Create the Atlas Vector Search Index
Note
To create an Atlas Vector Search index, you must have Project Data Access Admin or higher access to the Atlas project.
To enable vector search queries on your vector store, create an Atlas Vector Search index on the langchain_db.test collection by using the LangChain helper method or the PyMongo driver method.
Run the following code in your notebook for your preferred method. The index definition specifies indexing the following fields:
The embedding field as the vector type. The embedding field contains the embeddings created using OpenAI's text-embedding-3-large embedding model. The index definition specifies 3072 vector dimensions and measures similarity using cosine.
The page_label field as the filter type for pre-filtering data by the page number in the PDF.
# Use helper method to create the vector search index
vector_store.create_vector_search_index(
    dimensions = 3072, # The dimensions of the vector embeddings to be indexed
    filters = [ "page_label" ]
)
# Connect to your Atlas cluster and specify the collection
client = MongoClient(ATLAS_CONNECTION_STRING)
atlas_collection = client["langchain_db"]["test"]

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
    definition = {
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 3072,
                "similarity": "cosine"
            },
            {
                "type": "filter",
                "path": "page_label"
            }
        ]
    },
    name = "vector_index",
    type = "vectorSearch"
)
atlas_collection.create_search_index(model=search_index_model)
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
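If you want your notebook to wait until the index is queryable before moving on, you can poll the collection's search indexes. The following sketch assumes the atlas_collection object from the PyMongo method above and relies on the queryable field that Atlas reports for each search index:

import time

# Poll the index status until Atlas reports that it is queryable
while True:
    indexes = list(atlas_collection.list_search_indexes("vector_index"))
    if indexes and indexes[0].get("queryable"):
        break
    time.sleep(5)
print("vector_index is ready to query")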
Run Vector Search Queries
Once Atlas builds your index, run vector search queries on your data. The following examples demonstrate various queries that you can run on your vectorized data.
The following query uses the similarity_search method to perform a basic semantic search for the string MongoDB acquisition. It returns a list of documents ranked by relevance.
query = "MongoDB acquisition" results = vector_store.similarity_search(query) pprint.pprint(results)
[Document(id='67f0259b8bb2babc06924409', metadata={ ... }, page_content='SOURCE MongoDB, Inc.'), Document(id='67f0259b8bb2babc0692432f', metadata={ ... }, page_content='MongoDB platform. In fiscal year 2026 we expect to see stable consumption growth in Atlas, our main growth driver," said Dev Ittycheria, President\nand Chief Executive Officer of MongoDB .'), Document(id='67f0259b8bb2babc06924355', metadata={ ... }, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'), Document(id='67f0259b8bb2babc069243a6', metadata={ ... }, page_content="MongoDB's unified, intelligent data platform was built to power the next generation of applications, and MongoDB is the most widely available, globally")]
The following query uses the similarity_search_with_score method to perform a semantic search for the string MongoDB acquisition and specifies the k parameter to limit the number of documents to return to 3.
Note
The k parameter in this example refers to the similarity_search_with_score method option, not the knnBeta operator option of the same name.
It returns the three most relevant documents and a relevance score between 0 and 1.
query = "MongoDB acquisition" results = vector_store.similarity_search_with_score( query = query, k = 3 ) pprint.pprint(results)
[(Document(id='67f0259b8bb2babc06924409', metadata={ ... }, page_content='SOURCE MongoDB, Inc.'), 0.8193451166152954), (Document(id='67f0259b8bb2babc0692432f', metadata={ ... }, page_content='MongoDB platform. In fiscal year 2026 we expect to see stable consumption growth in Atlas, our main growth driver," said Dev Ittycheria, President\nand Chief Executive Officer of MongoDB .'), 0.7815237045288086), (Document(id='67f0259b8bb2babc06924355', metadata={ ... }, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'), 0.7788857221603394)]
You can pre-filter your data by using an MQL match expression that compares the indexed field with another value in your collection. You must index any metadata fields that you want to filter by as the filter type. To learn more, see How to Index Fields for Vector Search.
Note
You specified the page_label field as a filter when you created the index for this tutorial.
The following query uses the similarity_search_with_score method to perform a semantic search for the string MongoDB acquisition. It also specifies the following:
The k parameter to limit the number of documents to return to 3.
A pre-filter on the page_label field that uses the $eq operator to match documents appearing on page 2 only.
It returns the three most relevant documents from page 2 and a relevance score between 0 and 1.
query = "MongoDB acquisition" results = vector_store.similarity_search_with_score( query = query, k = 3, pre_filter = { "page_label": { "$eq": 2 } } ) pprint.pprint(results)
[(Document(id='67f0259b8bb2babc06924355', metadata={ ... 'page_label': '2'}, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'), 0.7788857221603394), (Document(id='67f0259b8bb2babc06924351', metadata={ ... 'page_label': '2'}, page_content='Measures."\nFourth Quarter Fiscal 2025 and Recent Business Highlights\nMongoDB acquired Voyage AI, a pioneer in state-of-the-art embedding and reranking models that power next-generation'), 0.7606035470962524), (Document(id='67f0259b8bb2babc06924354', metadata={ ... 'page_label': '2'}, page_content='data.\nMongoDB completed the redemption of 2026 Convertible Notes, eliminating all debt from the balance sheet. Additionally, in'), 0.7583936452865601)]
Tip
For a full list of semantic search methods, refer to the API reference.
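For example, the vector store also supports LangChain's standard max_marginal_relevance_search method, which trades off relevance against diversity in the returned documents. The following sketch assumes the generic LangChain VectorStore signature for this method (query, k, and fetch_k):

# Sketch: MMR search returns relevant but less redundant documents
results = vector_store.max_marginal_relevance_search(
    query = "MongoDB acquisition",
    k = 3,        # number of documents to return
    fetch_k = 20  # number of candidates to fetch before MMR reranking
)
pprint.pprint(results)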
Answer Questions on Your Data
This section demonstrates how to implement RAG in your application with Atlas Vector Search and LangChain. Now that you've used Atlas Vector Search to retrieve semantically similar documents, run the following code examples to prompt the LLM to answer questions based on those documents.
This example does the following:
Instantiates Atlas Vector Search as a retriever to query for similar documents, including the optional k parameter to search for only the 10 most relevant documents.
Defines a LangChain prompt template to instruct the LLM to use these documents as context for your query. LangChain passes these documents to the {context} input variable and your query to the {question} variable.
Constructs a chain that specifies the following:
Atlas Vector Search as the retriever to search for documents to use as context.
The prompt template that you defined.
The gpt-4o chat model from OpenAI to generate a context-aware response.
Invokes the chain with a sample query.
Returns the LLM's response and the documents used as context. The generated response might vary.
# Instantiate Atlas Vector Search as a retriever
retriever = vector_store.as_retriever(
    search_type = "similarity",
    search_kwargs = { "k": 10 }
)

# Define a prompt template
template = """
Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
"""
prompt = PromptTemplate.from_template(template)
model = ChatOpenAI(model="gpt-4o")

# Construct a chain to answer questions on your data
chain = (
    { "context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

# Prompt the chain
question = "What was MongoDB's latest acquisition?"
answer = chain.invoke(question)

print("Question: " + question)
print("Answer: " + answer)

# Return source documents
documents = retriever.invoke(question)
print("\nSource documents:")
pprint.pprint(documents)
Question: What was MongoDB's latest acquisition? Answer: MongoDB's latest acquisition was Voyage AI, a pioneer in state-of-the-art embedding and reranking models. Source documents: [Document(id='67f0259b8bb2babc06924409', metadata={'_id': '67f0259b8bb2babc06924409', ... 'page_label': '9'}, page_content='SOURCE MongoDB, Inc.'), Document(id='67f0259b8bb2babc06924351', metadata={'_id': '67f0259b8bb2babc06924351', ... 'page_label': '2'}, page_content='Measures."\nFourth Quarter Fiscal 2025 and Recent Business Highlights\nMongoDB acquired Voyage AI, a pioneer in state-of-the-art embedding and reranking models that power next-generation'), Document(id='67f0259b8bb2babc0692432f', metadata={'_id': '67f0259b8bb2babc0692432f', ... 'page_label': '1'}, page_content='MongoDB platform. In fiscal year 2026 we expect to see stable consumption growth in Atlas, our main growth driver," said Dev Ittycheria, President\nand Chief Executive Officer of MongoDB .'), Document(id='67f0259b8bb2babc06924355', metadata={'_id': '67f0259b8bb2babc06924355', ... 'page_label': '2'}, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'), Document(id='67f0259b8bb2babc069243a6', metadata={'_id': '67f0259b8bb2babc069243a6', ... 'page_label': '4'}, page_content="MongoDB's unified, intelligent data platform was built to power the next generation of applications, and MongoDB is the most widely available, globally"), Document(id='67f0259b8bb2babc06924329', metadata={'_id': '67f0259b8bb2babc06924329', ... 'page_label': '1'}, page_content='MongoDB, Inc. Announces Fourth Quarter and Full Year Fiscal 2025 Financial Results\nMarch 5, 2025\nFourth Quarter Fiscal 2025 Total Revenue of $548.4 million, up 20% Year-over-Year'), Document(id='67f0259b8bb2babc069243a7', metadata={'_id': '67f0259b8bb2babc069243a7', ... 'page_label': '4'}, page_content='distributed database on the market. With integrated capabilities for operational data, search, real-time analytics, and AI-powered retrieval, MongoDB'), Document(id='67f0259b8bb2babc069243a5', metadata={'_id': '67f0259b8bb2babc069243a5', ... 'page_label': '4'}, page_content="Headquartered in New York, MongoDB's mission is to empower innovators to create, transform, and disrupt industries with software and data."), Document(id='67f0259b8bb2babc06924354', metadata={'_id': '67f0259b8bb2babc06924354', ... 'page_label': '2'}, page_content='data.\nMongoDB completed the redemption of 2026 Convertible Notes, eliminating all debt from the balance sheet. Additionally, in'), Document(id='67f0259b8bb2babc069243a9', metadata={'_id': '67f0259b8bb2babc069243a9', ... 'page_label': '4'}, page_content='50,000 customers across almost every industry—including 70% of the Fortune 100—rely on MongoDB for their most important applications. To learn\nmore, visit mongodb.com .\nInvestor Relations')]
This example does the following:
Instantiates Atlas Vector Search as a retriever to query for similar documents, including the following optional parameters:
k to search for only the 10 most relevant documents.
score_threshold to use only documents with a relevance score above 0.75.
Note
This parameter refers to a relevance score that LangChain uses to normalize your results, not the relevance score used in Atlas Search queries. To use Atlas Search scores in your RAG implementation, define a custom retriever that uses the similarity_search_with_score method and filters by the Atlas Search score. A sketch of such a retriever appears after the example output below.
pre_filter to filter on the page_label field for documents that appear on page 2 only.
Defines a LangChain prompt template to instruct the LLM to use these documents as context for your query. LangChain passes these documents to the {context} input variable and your query to the {question} variable.
Constructs a chain that specifies the following:
Atlas Vector Search as the retriever to search for documents to use as context.
The prompt template that you defined.
The gpt-4o chat model from OpenAI to generate a context-aware response.
Invokes the chain with a sample query.
Returns the LLM's response and the documents used as context. The generated response might vary.
# Instantiate Atlas Vector Search as a retriever
retriever = vector_store.as_retriever(
    search_type = "similarity",
    search_kwargs = {
        "k": 10,
        "score_threshold": 0.75,
        "pre_filter": { "page_label": { "$eq": 2 } }
    }
)

# Define a prompt template
template = """
Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
"""
prompt = PromptTemplate.from_template(template)
model = ChatOpenAI(model="gpt-4o")

# Construct a chain to answer questions on your data
chain = (
    { "context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

# Prompt the chain
question = "What was MongoDB's latest acquisition?"
answer = chain.invoke(question)

print("Question: " + question)
print("Answer: " + answer)

# Return source documents
documents = retriever.invoke(question)
print("\nSource documents:")
pprint.pprint(documents)
Question: What was MongoDB's latest acquisition? Answer: MongoDB's latest acquisition was Voyage AI, a pioneer in state-of-the-art embedding and reranking models. Source documents: [Document(id='67f0259b8bb2babc06924351', metadata={'_id': '67f0259b8bb2babc06924351', ... 'page_label': '2'}, page_content='Measures."\nFourth Quarter Fiscal 2025 and Recent Business Highlights\nMongoDB acquired Voyage AI, a pioneer in state-of-the-art embedding and reranking models that power next-generation'), Document(id='67f0259b8bb2babc06924355', metadata={'_id': '67f0259b8bb2babc06924355', ... 'page_label': '2'}, page_content='conjunction with the acquisition of Voyage, MongoDB is announcing a stock buyback program of $200 million, to offset the\ndilutive impact of the acquisition consideration.'), Document(id='67f0259b8bb2babc06924354', metadata={'_id': '67f0259b8bb2babc06924354', ... 'page_label': '2'}, page_content='data.\nMongoDB completed the redemption of 2026 Convertible Notes, eliminating all debt from the balance sheet. Additionally, in'), Document(id='67f0259b8bb2babc06924358', metadata={'_id': '67f0259b8bb2babc06924358', ... 'page_label': '2'}, page_content='Lombard Odier, a Swiss private bank, partnered with MongoDB to migrate and modernize its legacy banking technology'), Document(id='67f0259b8bb2babc06924352', metadata={'_id': '67f0259b8bb2babc06924352', ... 'page_label': '2'}, page_content="AI applications. Integrating Voyage AI's technology with MongoDB will enable organizations to easily build trustworthy,"), Document(id='67f0259b8bb2babc0692435a', metadata={'_id': '67f0259b8bb2babc0692435a', ... 'page_label': '2'}, page_content='applications from a legacy relational database to MongoDB 20 times faster than previous migrations.\nFirst Quarter and Full Year Fiscal 2026 Guidance'), Document(id='67f0259b8bb2babc06924356', metadata={'_id': '67f0259b8bb2babc06924356', ... 'page_label': '2'}, page_content='For the third consecutive year, MongoDB was named a Leader in the 2024 Gartner® Magic Quadrant™ for Cloud'), Document(id='67f0259b8bb2babc0692434d', metadata={'_id': '67f0259b8bb2babc0692434d', ... 'page_label': '2'}, page_content='compared to $121.5 million of cash from operations in the year-ago period. MongoDB used $29.6 million of cash in capital'), Document(id='67f0259b8bb2babc0692434c', metadata={'_id': '67f0259b8bb2babc0692434c', ... 'page_label': '2'}, page_content='Cash Flow: During the year ended January 31, 2025, MongoDB generated $150.2 million of cash from operations,'), Document(id='67f0259b8bb2babc06924364', metadata={'_id': '67f0259b8bb2babc06924364', ... 'page_label': '2'}, page_content='MongoDB will host a conference call today, March 5, 2025, at 5:00 p.m. (Eastern Time) to discuss its financial results and business outlook. A live')]
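As noted earlier, if you want to filter on the score that Atlas Vector Search returns rather than LangChain's normalized relevance score, you can wrap the similarity_search_with_score method in a custom retriever. The following is a minimal sketch rather than part of the tutorial itself; the 0.75 cutoff and the RunnableLambda wrapper are illustrative choices, and you can substitute the resulting retriever_with_score for the retriever in the chain above.

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

# Sketch of a custom retriever that filters on the score returned by
# similarity_search_with_score (the Atlas Vector Search score)
def retrieve_with_atlas_score(query: str) -> list[Document]:
    results = vector_store.similarity_search_with_score(query, k=10)
    return [doc for doc, score in results if score >= 0.75]  # illustrative cutoff

# Wrap the function so it can serve as the "context" step of a chain
retriever_with_score = RunnableLambda(retrieve_with_atlas_score)
pprint.pprint(retriever_with_score.invoke("MongoDB acquisition"))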