Langchain: Document Splitting

In the last blog post, we learned how to load documents into a standard format using LangChain's document loaders. Once the documents are loaded, the next step is to split them into smaller chunks. This may seem straightforward at first, but there are subtleties and important considerations that can significantly impact the performance and accuracy of downstream tasks.

Why is Document Splitting Important?

Document splitting is crucial because it ensures that semantically relevant content is grouped together within the same chunk. This is particularly important when answering questions or performing other tasks that rely on the contextual information present in the documents.


Consider the following example: Let's say we have a sentence about the Toyota Camry and its specifications. If we split this sentence naively, without considering the context, we could end up with one chunk containing part of the sentence and another chunk containing the remaining part. As a result, when attempting to answer a question about the Camry's specifications, we would not have the complete information in either chunk, leading to an incorrect or incomplete answer.
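To make the problem concrete, here is a minimal sketch of a naive fixed-size split (the sentence, the 203-horsepower figure, and the 40-character chunk size are all made up for illustration):

# A hypothetical sentence about the Toyota Camry and its specifications.
sentence = "The Toyota Camry has a 2.5L engine producing 203 horsepower."

# Naive fixed-size split: cut every 40 characters with no regard for meaning.
naive_chunks = [sentence[i : i + 40] for i in range(0, len(sentence), 40)]
print(naive_chunks)
# ['The Toyota Camry has a 2.5L engine produ', 'cing 203 horsepower.']

Neither chunk on its own tells you which car produces 203 horsepower: the model name and the specification land in different chunks, which is exactly the failure mode a good splitting strategy avoids.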

How Does Document Splitting Work in LangChain?

All text splitters in LangChain work on the same basis: they split the text into chunks of a specified size, with an optional overlap between adjacent chunks. This is illustrated in the following diagram:

[Diagram: consecutive chunks of size chunk_size, with a chunk_overlap-sized region of text shared between adjacent chunks]

The chunk_size corresponds to the size of each chunk, which can be measured in characters or tokens (we'll discuss both approaches). The chunk_overlap is a portion of text that is shared between consecutive chunks, allowing for context to be maintained across chunk boundaries.

All text splitters in LangChain expose two main methods: create_documents() and split_documents(). They follow the same logic under the hood but take different inputs: create_documents() takes a list of raw text strings, while split_documents() takes a list of already-loaded Document objects.
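As a rough sketch of the difference (the splitter settings, file names, and texts here are placeholders, not from the original post):

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)

# create_documents() takes raw strings (plus optional per-text metadata)
# and returns a list of Document objects.
docs_from_strings = splitter.create_documents(
    ["some long text...", "another long text..."],
    metadatas=[{"source": "a.txt"}, {"source": "b.txt"}],
)

# split_documents() takes Documents you already loaded (e.g. from a document
# loader) and returns smaller Documents that keep the original metadata.
smaller_docs = splitter.split_documents(docs_from_strings)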

Types of Text Splitters

LangChain provides several types of text splitters, each with its own strengths and use cases. Here are some of the most commonly used splitters:


CharacterTextSplitter

The CharacterTextSplitter is a simpler splitter that splits the text on a single separator string (by default a double newline, "\n\n", though you can set it to a space, a newline, or any other string). It is useful when your text has a clear delimiter or when you want to split at specific points.

RecursiveCharacterTextSplitter

The RecursiveCharacterTextSplitter is recommended for generic text splitting. It splits the text based on a hierarchy of separators, starting with double newlines (\n\n), then single newlines (\n), spaces ( ), and finally, individual characters. This approach aims to preserve the structure and coherence of the text by prioritizing splitting at natural boundaries like paragraphs and sentences.

TokenTextSplitter

The TokenTextSplitter splits the text based on token count rather than character count. This is useful because many language models have context windows measured in tokens rather than characters. A token is often roughly four characters long, so splitting on tokens gives a better picture of how the language model will actually see the text.

MarkdownHeaderTextSplitter

The MarkdownHeaderTextSplitter is designed to split Markdown documents based on their header structure. It preserves the header metadata in the resulting chunks, allowing for context-aware splitting and potential downstream tasks that use the document structure.

Hands-on Examples

Let's explore some hands-on examples to better understand how these text splitters work and how to use them effectively.

Setting up the Environment

First, we'll set up the environment by importing the necessary libraries and loading the OpenAI API key:

import os
from dotenv import load_dotenv, find_dotenv
from langchain_openai import OpenAI

# Load environment variables (including OPENAI_API_KEY) from a .env file
_ = load_dotenv(find_dotenv())

# The OpenAI LLM isn't used for splitting itself, but it's handy to have
# set up for the downstream steps that consume the chunks.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

Next, we'll import two of the most commonly used text splitters:

from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)

Splitting with RecursiveCharacterTextSplitter and CharacterTextSplitter

Let's start by defining some toy examples to understand how these splitters work:

chunk_size = 26
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

text1 = "abcdefghijklmnopqrstuvwxyz"
print(r_splitter.split_text(text1))  
# Output: ['abcdefghijklmnopqrstuvwxyz']

text2 = "abcdefghijklmnopqrstuvwxyzabcdefg"
print(r_splitter.split_text(text2))  
# Output: ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(r_splitter.split_text(text3))  
# Output: ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
print(c_splitter.split_text(text3))  
# Output: ['a b c d e f g h i j k l m n o p q r s t u v w x y z']

# Set the separator for CharacterTextSplitter
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=" "
)
print(c_splitter.split_text(text3))  
# Output: ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

These examples show chunk_size and chunk_overlap in action: text1 fits within 26 characters, so it isn't split; text2 is split into two chunks that share the 4-character overlap "wxyz"; and text3 is split into chunks whose boundaries repeat "l m" and "w x" because of the overlap. Note that the CharacterTextSplitter doesn't split text3 at first because its default separator is a double newline, which never appears in the string; once we set separator=" ", it produces the same chunks as the RecursiveCharacterTextSplitter.

Splitting Real-World Examples

Now, let's try splitting some real-world examples:

some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

c_splitter = CharacterTextSplitter(chunk_size=450, chunk_overlap=0, separator=" ")
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450, chunk_overlap=0, separators=["\n\n", "\n", " ", ""]
)

chunks = c_splitter.split_text(some_text)
print("Chunks: ", chunks)
print("Length of chunks: ", len(chunks))
# Chunks:  ['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,', 'have a space.and words are separated by space.']
# Length of chunks:  2

chunks = r_splitter.split_text(some_text)
print("Chunks: ", chunks)
print("Length of chunks: ", len(chunks))
# Chunks:  ["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.", 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
# Length of chunks:  2

In this example, we use both the CharacterTextSplitter and the RecursiveCharacterTextSplitter to split a longer text. Both produce two chunks here, but notice the difference: the RecursiveCharacterTextSplitter breaks cleanly at the double newline between the two paragraphs, while the CharacterTextSplitter, which only knows about spaces, breaks mid-sentence once the 450-character limit is reached.
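One refinement worth knowing about: you can add a sentence boundary to the separator list so that chunks tend to end after a period rather than mid-sentence. A hedged sketch (the chunk size is arbitrary, and depending on your version of langchain-text-splitters you may need is_separator_regex=True for the lookbehind pattern to be treated as a regular expression):

# "(?<=\. )" is a zero-width regex lookbehind that matches the position right
# after a period followed by a space, so splits land at sentence boundaries.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True,
)
print(r_splitter.split_text(some_text))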

We can also split real-world documents, such as PDFs and Notion databases:

from langchain.document_loaders import PyPDFLoader, NotionDirectoryLoader

# Load a PDF document
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n", chunk_size=1000, chunk_overlap=150, length_function=len
)
docs = text_splitter.split_documents(pages)

print("Pages in the original document: ", len(pages))
print("Length of chunks after splitting pages: ", len(docs))
# Pages in the original document:  22
# Length of chunks after splitting pages:  353

This code loads a PDF document using the PyPDFLoader, splits the pages into smaller chunks using the CharacterTextSplitter, and prints the number of original pages and the number of resulting chunks.
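Each chunk is still a Document and keeps the metadata of the page it came from, which is what lets downstream retrieval point back to the original source (a quick sanity check; the exact metadata keys depend on the loader):

# Each chunk retains the source file and page number from the PDF loader.
print(docs[0].metadata)
# e.g. {'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}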

# Load a Notion database
loader = NotionDirectoryLoader("docs/Notion_DB")
notion_db = loader.load()

docs = text_splitter.split_documents(notion_db)

print("Pages in the original notion document: ", len(notion_db))
print("Length of chunks after splitting pages: ", len(docs))
# Pages in the original notion document:  52
# Length of chunks after splitting pages:  353

Similarly, we can load a Notion database using the NotionDirectoryLoader, split the documents into chunks, and print the number of original documents and the resulting chunks.

Token-based Splitting

In addition to character-based splitting, LangChain also supports token-based splitting, which is useful when working with language models whose context windows are measured in tokens:

from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
print(text_splitter.split_text(text1))  
# Output: ['foo', ' bar', ' b', 'az', 'zy', 'foo']

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
print(docs[0])  
# Output: Document(page_content='MachineLearning-Lecture01  \n', metadata={'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0})
print(pages[0].metadata)
# Output: {'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}

In this example, we use the TokenTextSplitter to split text based on token count. We can adjust the chunk_size and chunk_overlap parameters to control the splitting behavior.
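By default the TokenTextSplitter counts tokens with tiktoken; if you are targeting a particular model, you can pass model_name (or encoding_name) so the counts match that model's tokenizer. You can also keep recursive character splitting but measure chunk length in tokens. A sketch, assuming tiktoken is installed and using gpt-3.5-turbo as an example model:

# Count tokens with the tokenizer of a specific model rather than the default.
token_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo", chunk_size=500, chunk_overlap=50
)

# Or keep recursive character splitting, but measure length in tokens:
r_token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-3.5-turbo", chunk_size=500, chunk_overlap=50
)
token_docs = r_token_splitter.split_documents(pages)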

Context-aware Splitting

LangChain also provides tools for context-aware splitting, which aims to preserve the document structure and semantic context during the splitting process. One such tool is the MarkdownHeaderTextSplitter, which splits Markdown documents based on their header structure and preserves the header metadata in the resulting chunks:

from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)

print(md_header_splits[0])  
# Output: Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})
print(md_header_splits[1])  
# Output: Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})

In this example, we define a Markdown document with headers and use the MarkdownHeaderTextSplitter to split the document based on the header structure. The resulting chunks preserve the header metadata, which can be useful for downstream tasks that leverage the document structure.

We can also apply this splitter to real-world Markdown files, such as a Notion database:

loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = " ".join([d.page_content for d in docs])

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(txt)

print(md_header_splits[0])
# Output: Document(page_content="This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change.  \n**Everything related to working at Blendle and the people of Blendle, made public.**  \nThese are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.  \nWe've made this document public because we want to learn from you. We're very much interested in your feedback (including weeding out typo's and Dunglish ;)). Email us at hr@blendle.com. If you're starting your own company or if you're curious as to how we do things at Blendle, we hope that our employee handbook inspires you.  \nIf you want to work at Blendle you can check our [job ads here](https://blendle.homerun.co/). If you want to be kept in the loop about Blendle, you can sign up for [our behind the scenes newsletter](https://blendle.homerun.co/yes-keep-me-posted/tr/apply?token=8092d4128c306003d97dd3821bad06f2).", metadata={'Header 1': "Blendle's Employee Handbook"})

This code loads a Notion database, joins the document contents into a single string, splits the string using the MarkdownHeaderTextSplitter, and prints the first resulting chunk.
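Because the header splits are ordinary Document objects, you can pass them to any other splitter when individual sections are still too large for your context window. A minimal sketch of that second pass (chunk sizes are arbitrary):

# Second pass: split within each header-delimited section while keeping the
# header metadata attached to every resulting chunk.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
splits = text_splitter.split_documents(md_header_splits)

print(splits[0].metadata)
# Still carries the header metadata, e.g. {'Header 1': "Blendle's Employee Handbook"}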

Conclusion

Document splitting is a crucial step in the LangChain pipeline, as it ensures that semantically relevant content is grouped together within the same chunk. LangChain provides a variety of text splitters, each with its own strengths and use cases, allowing you to choose the most appropriate splitter for your specific needs.

Whether you're working with generic text, Markdown documents, code snippets, or other types of content, LangChain's text splitters offer flexibility and customization options to split your documents effectively. By understanding the nuances and considerations involved in document splitting, you can optimize the performance and accuracy of your language models and downstream tasks.

Source Code

https://github.com/RutamBhagat/LangChainHCCourse2/blob/main/course_2/document_splitting.ipynb
