
Adobe PDF Extract API

Adobe PDF Extract API is a machine-learning-based service that extracts content from PDF files, including text, images, and tables.

The current loader implementation for Adobe PDF Extract API can return the content as a single document in JSON format, split it into chunks (optimized for RAG), or extract only figures and tables.

Figures are represented as placeholders inside the generated chunks, with the base64-encoded image stored in the metadata. To use the chunks with an LLM that supports vision, you can split the message at the placeholder and insert the image as a separate message.

Extracted tables are formatted in markdown, and figures are extracted as base64 encoded images.
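The split-at-placeholder idea can be sketched as a small helper. This is only an illustration: the placeholder token and the place in the metadata where the base64 image lives are not specified on this page, so the function takes both as arguments, and the name build_vision_content is hypothetical. The output uses the OpenAI-style list of content parts that many vision chat models accept, which is an assumption about the target model.

```python
from typing import Any


def build_vision_content(
    text: str,
    placeholder: str,
    image_b64: str,
    mime_type: str = "image/png",  # assumption: the figure's image format is not stated here
) -> list[dict[str, Any]]:
    """Split chunk text at the first figure placeholder and interleave the image.

    Returns OpenAI-style content parts: text before the placeholder, the image
    as a data URL, then text after the placeholder. Empty text parts are skipped.
    """
    before, found, after = text.partition(placeholder)
    parts: list[dict[str, Any]] = []
    if before.strip():
        parts.append({"type": "text", "text": before.strip()})
    if found:  # the placeholder was present, so the image goes here
        parts.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{image_b64}"},
            }
        )
    if after.strip():
        parts.append({"type": "text", "text": after.strip()})
    return parts
```

The helper handles one placeholder per call; a chunk with several figures would need a loop over the remaining text.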

Setupโ€‹

Adobe PDF Extract API credentials: follow this document to obtain them if you don't already have them. You will pass <client_id> and <client_secret> as parameters to the parser.
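Rather than hard-coding the credentials, one option is to read them from environment variables. The sketch below assumes the hypothetical variable names ADOBE_CLIENT_ID and ADOBE_CLIENT_SECRET; neither name is defined by Adobe or LangChain, so adjust them to your own setup.

```python
import os


def load_adobe_credentials(env=os.environ):
    """Return (client_id, client_secret) read from the environment.

    The variable names are hypothetical placeholders chosen for this sketch.
    """
    client_id = env.get("ADOBE_CLIENT_ID")
    client_secret = env.get("ADOBE_CLIENT_SECRET")
    missing = [
        name
        for name, value in (
            ("ADOBE_CLIENT_ID", client_id),
            ("ADOBE_CLIENT_SECRET", client_secret),
        )
        if not value
    ]
    if missing:
        # Fail early with a clear message instead of sending empty credentials.
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return client_id, client_secret
```

The returned pair can then be used wherever the examples on this page set client_id and client_secret.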

%pip install --upgrade --quiet langchain langchain-community adobe-pdf-extraction

"chunk" mode

The first example uses a local file, which is sent to Adobe PDF Extract API.

Create an AdobePDFExtractParser with your credentials and pass it to a GenericLoader:

from langchain_community.document_loaders import GenericLoader
from langchain_community.document_loaders.parsers import AdobePDFExtractParser

file_path = "<filepath>"
client_id = "<client_id>"
client_secret = "<client_secret>"

parser = AdobePDFExtractParser(
    client_id=client_id,
    client_secret=client_secret,
    mode="chunk",
)
loader = GenericLoader.from_filesystem(file_path, parser=parser)

documents = loader.load()

The output contains Documents with the extracted chunks.

documents

"json" mode

The extraction result can also be returned in raw JSON format.

from langchain_community.document_loaders import GenericLoader
from langchain_community.document_loaders.parsers import AdobePDFExtractParser

file_path = "<filepath>"
client_id = "<client_id>"
client_secret = "<client_secret>"

parser = AdobePDFExtractParser(
    client_id=client_id,
    client_secret=client_secret,
    mode="json",
)

loader = GenericLoader.from_filesystem(file_path, parser=parser)

documents = loader.load()
documents

"data" mode

To extract only figures and tables from the PDF, set mode="data".

from langchain_community.document_loaders import GenericLoader
from langchain_community.document_loaders.parsers import AdobePDFExtractParser

file_path = "<filepath>"
client_id = "<client_id>"
client_secret = "<client_secret>"

parser = AdobePDFExtractParser(
    client_id=client_id,
    client_secret=client_secret,
    mode="data",
)
loader = GenericLoader.from_filesystem(file_path, parser=parser)

documents = loader.load()

The output is a list of LangChain Documents containing the extracted figures and tables.

for document in documents:
    if document.metadata["content_type"] == "markdown":
        print(f"Table Content: {document.page_content}")
    elif document.metadata["content_type"] == "base64":
        print(f"Figure Content: {document.page_content}")
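Printing a base64 string is rarely useful on its own, so a possible next step is to separate tables from figures and decode the figures into raw bytes. The sketch below is an illustration under two assumptions: that page_content of a "base64" document holds the encoded image itself, as the loop above suggests, and that a small stand-in class (Doc, hypothetical) is an acceptable substitute for a LangChain Document so the example runs without the library.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document: page_content plus metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_tables_and_figures(documents):
    """Return (tables, figures) from "data" mode output.

    tables: list of markdown strings.
    figures: list of decoded image bytes (assumes page_content is base64 text).
    Documents with any other content_type are ignored.
    """
    tables, figures = [], []
    for doc in documents:
        content_type = doc.metadata.get("content_type")
        if content_type == "markdown":
            tables.append(doc.page_content)
        elif content_type == "base64":
            figures.append(base64.b64decode(doc.page_content))
    return tables, figures
```

The decoded bytes can be written to disk or passed to an image library; the image format is not stated on this page, so check the bytes before choosing a file extension.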
