Unstructured
The unstructured package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents. This page covers how to use the unstructured ecosystem within LangChain.
Installation and Setup
If you are using a loader that runs locally, use the following steps to get unstructured and its
dependencies running.
- For the smallest installation footprint and to take advantage of features not available in the open-source unstructured package, install the Python SDK with pip install unstructured-client along with pip install langchain-unstructured to use the UnstructuredLoader and partition remotely against the Unstructured API. This loader lives in a LangChain partner repo instead of the langchain-community repo, and you will need an api_key, which you can generate for free here.
  - Unstructured's documentation for the SDK can be found here: https://docs.unstructured.io/api-reference/api-services/sdk
- To run everything locally, install the open-source Python package with pip install unstructured along with pip install langchain-community and use the same UnstructuredLoader as mentioned above.
  - You can install document-specific dependencies with extras, e.g. pip install "unstructured[docx]".
  - To install the dependencies for all document types, use pip install "unstructured[all-docs]".
- Install the following system dependencies if they are not already available on your system, e.g. with brew install on Mac. Depending on what document types you're parsing, you may not need all of these.
  - libmagic-dev (filetype detection)
  - poppler-utils (images and PDFs)
  - tesseract-ocr (images and PDFs)
  - qpdf (PDFs)
  - libreoffice (MS Office docs)
  - pandoc (EPUBs)
- When running locally, Unstructured also recommends using Docker by following this guide to ensure all system dependencies are installed correctly.
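As a quick sanity check on the steps above, the following sketch reports which of the Python packages and system tools are visible in the current environment. The import names (unstructured, unstructured_client, langchain_unstructured, langchain_community) and the executable names (pdftoppm for poppler-utils, tesseract, qpdf, soffice for libreoffice, pandoc) are assumptions about how these packages are usually exposed and may differ by platform. libmagic is a shared library rather than a command, so it is not checked here. The helper names check_environment and format_report are hypothetical.

```python
import importlib.util
import shutil

# Assumed import names for the pip packages mentioned above.
PYTHON_MODULES = {
    "unstructured": "unstructured",
    "unstructured-client": "unstructured_client",
    "langchain-unstructured": "langchain_unstructured",
    "langchain-community": "langchain_community",
}

# Assumed executable names for the system dependencies listed above.
SYSTEM_TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "qpdf": "qpdf",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def _module_available(module_name):
    """Return True if the module can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def check_environment():
    """Return two dicts mapping each package name to True (found) or False."""
    modules = {
        pip_name: _module_available(module_name)
        for pip_name, module_name in PYTHON_MODULES.items()
    }
    tools = {
        package: shutil.which(executable) is not None
        for package, executable in SYSTEM_TOOLS.items()
    }
    return modules, tools


def format_report(modules, tools):
    """Render the results as one 'name: found/missing' line per package."""
    lines = []
    for name, found in {**modules, **tools}.items():
        lines.append(f"{name}: {'found' if found else 'missing'}")
    return "\n".join(lines)
```

Calling print(format_report(*check_environment())) prints one line per package. A "missing" entry only matters if you parse the document types that depend on it.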
The Unstructured API requires API keys to make requests. You can request an API key here and start using it today! Check out the README here to get started making API calls. We'd love to hear your feedback; let us know how it goes in our community Slack. And stay tuned for improvements to both quality and performance! Check out the instructions here if you'd like to self-host the Unstructured API or run it locally.
Data Loaders
The primary usage of Unstructured is in data loaders.
UnstructuredLoader
See a usage example for how to use this loader to partition both locally and remotely with the serverless Unstructured API.
from langchain_unstructured import UnstructuredLoader
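A minimal sketch of switching between local and remote partitioning might look like the following. It assumes UnstructuredLoader accepts file_path, api_key, and partition_via_api keyword arguments, and that the key is kept in an environment variable named UNSTRUCTURED_API_KEY. The helper names (api_key_from_env, build_loader_kwargs, load_documents) are hypothetical and not part of the library.

```python
import os


def api_key_from_env(var_name="UNSTRUCTURED_API_KEY"):
    """Return the API key from the environment, or None if it is unset or empty."""
    return os.environ.get(var_name) or None


def build_loader_kwargs(file_path, api_key=None):
    """Build loader arguments: remote partitioning if a key is given, else local."""
    kwargs = {"file_path": file_path}
    if api_key:
        # Assumption: these two arguments route partitioning through the API.
        kwargs["api_key"] = api_key
        kwargs["partition_via_api"] = True
    return kwargs


def load_documents(file_path, api_key=None):
    """Load one file into a list of LangChain Document objects."""
    # Imported lazily so the helpers above work without the package installed.
    from langchain_unstructured import UnstructuredLoader

    loader = UnstructuredLoader(**build_loader_kwargs(file_path, api_key))
    return loader.load()
```

With this in place, load_documents("example.pdf", api_key=api_key_from_env()) would partition remotely when the variable is set and locally otherwise. The path example.pdf is a placeholder.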
UnstructuredCHMLoader
CHM means Microsoft Compiled HTML Help.
from langchain_community.document_loaders import UnstructuredCHMLoader
UnstructuredCSVLoader
A comma-separated values (CSV) file is a delimited text file that uses
a comma to separate values. Each line of the file is a data record.
Each record consists of one or more fields, separated by commas.
See a usage example.
from langchain_community.document_loaders import UnstructuredCSVLoader
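To make the record and field structure concrete, the sketch below writes a small CSV file with the standard library, reads it back, and then hands the same file to the loader. The sample data and the helper names are hypothetical. The loader call assumes UnstructuredCSVLoader takes a file_path and a mode argument ("single" or "elements").

```python
import csv
import os
import tempfile


def write_sample_csv(rows):
    """Write rows (lists of fields) to a temporary .csv file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return path


def read_records(path):
    """Read the file back as a list of records, each a list of string fields."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def load_csv_documents(path, mode="single"):
    """Load the CSV file into LangChain Document objects."""
    # Imported lazily so the helpers above work without the package installed.
    from langchain_community.document_loaders import UnstructuredCSVLoader

    loader = UnstructuredCSVLoader(file_path=path, mode=mode)
    return loader.load()
```

For example, write_sample_csv([["name", "city"], ["Ada", "London"]]) produces a two-line file whose first line is name,city. Passing the returned path to load_csv_documents requires langchain-community and unstructured to be installed.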
UnstructuredEmailLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredEmailLoader
UnstructuredEPubLoader
EPUB is an e-book file format that uses
the “.epub” file extension. The term is short for electronic publication and
is sometimes styled ePub. EPUB is supported by many e-readers, and compatible
software is available for most smartphones, tablets, and computers.
See a usage example.
from langchain_community.document_loaders import UnstructuredEPubLoader
UnstructuredExcelLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredExcelLoader
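When a pipeline handles several of these file types, one option is to pick the loader class from the file extension. The sketch below covers the langchain_community loaders listed so far. The extension mapping and the helper names (loader_name_for, make_loader) are assumptions for illustration, not part of LangChain.

```python
import importlib
from pathlib import Path

# Assumed mapping from file extension to the loader classes named above.
LOADER_BY_EXTENSION = {
    ".chm": "UnstructuredCHMLoader",
    ".csv": "UnstructuredCSVLoader",
    ".eml": "UnstructuredEmailLoader",
    ".msg": "UnstructuredEmailLoader",
    ".epub": "UnstructuredEPubLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".xls": "UnstructuredExcelLoader",
}


def loader_name_for(path):
    """Return the loader class name for a path, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in LOADER_BY_EXTENSION:
        raise ValueError(f"No loader registered for extension {suffix!r}")
    return LOADER_BY_EXTENSION[suffix]


def make_loader(path, **kwargs):
    """Instantiate the matching loader; extra kwargs are passed through."""
    # Imported lazily so loader_name_for works without the package installed.
    module = importlib.import_module("langchain_community.document_loaders")
    loader_cls = getattr(module, loader_name_for(path))
    return loader_cls(str(path), **kwargs)
```

A call such as make_loader("report.xlsx", mode="elements").load() would then return the parsed documents, where report.xlsx is a placeholder path.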