📄️ AI21SemanticTextSplitter
This example goes over how to use AI21SemanticTextSplitter in LangChain.
📄️ Beautiful Soup
Beautiful Soup is a Python package for parsing
📄️ Cross Encoder Reranker
This notebook shows how to implement a reranker in a retriever with your own cross encoder, using either Hugging Face cross encoder models or Hugging Face models that implement a cross encoder function (example: BAAI/bge-reranker-base). SagemakerEndpointCrossEncoder enables you to use these Hugging Face models loaded on SageMaker.
📄️ DashScope Reranker
This notebook shows how to use DashScope Reranker for document compression and retrieval. DashScope is the generative AI service from Alibaba Cloud (Aliyun).
📄️ Doctran: extract properties
We can extract useful features of documents using the Doctran library, which uses OpenAI's function calling feature to extract specific metadata.
📄️ Doctran: interrogate documents
Documents used in a vector store knowledge base are typically stored in a narrative or conversational format. However, most user queries are in question format. If we convert documents into Q&A format before vectorizing them, we can increase the likelihood of retrieving relevant documents, and decrease the likelihood of retrieving irrelevant documents.
📄️ Doctran: language translation
Comparing documents through embeddings has the benefit of working across multiple languages. "Harrison says hello" and "Harrison dice hola" will occupy similar positions in the vector space because they have the same meaning.
📄️ Google Cloud Vertex AI Reranker
The Vertex Search Ranking API is one of the standalone APIs in Vertex AI Agent Builder. It takes a list of documents and reranks those documents based on how relevant the documents are to a query. Compared to embeddings, which look only at the semantic similarity of a document and a query, the ranking API can give you precise scores for how well a document answers a given query. The ranking API can be used to improve the quality of search results after retrieving an initial set of candidate documents.
📄️ Google Cloud Document AI
Document AI is a document understanding platform from Google Cloud to transform unstructured data from documents into structured data, making it easier to understand, analyze, and consume.
📄️ Google Translate
Google Translate is a multilingual neural machine translation service developed by Google to translate text, documents and websites from one language into another.
📄️ HTML to text
html2text is a Python package that converts a page of HTML into clean, easy-to-read plain ASCII text.
📄️ Infinity Reranker
Infinity is a high-throughput, low-latency REST API for serving text embeddings, reranking models, and CLIP.
📄️ Jina Reranker
This notebook shows how to use Jina Reranker for document compression and retrieval.
📄️ Markdownify
markdownify is a Python package that converts HTML documents to Markdown format, with customizable options for handling tags (links, images, ...), heading styles, and more.
📄️ Nuclia
Nuclia automatically indexes your unstructured data from any internal and external source, providing optimized search results and generative answers. It can handle video and audio transcription, image content extraction, and document parsing.
📄️ OpenAI metadata tagger
It can often be useful to tag ingested documents with structured metadata, such as the title, tone, or length of a document, to allow for a more targeted similarity search later. However, for large numbers of documents, performing this labelling process manually can be tedious.
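To make the idea of metadata-targeted search concrete, here is a minimal sketch in plain Python. It does not use the OpenAI metadata tagger or any LangChain API: the `Doc` class, the `word_overlap` score, and the `filtered_search` helper are hypothetical names introduced for illustration, and the toy word-overlap score stands in for a real embedding similarity. It assumes the documents have already been tagged.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document with attached metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def word_overlap(query: str, text: str) -> float:
    """Toy similarity: Jaccard overlap of lowercase word sets (not an embedding)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def filtered_search(docs, query, where, k=2):
    """Keep docs whose metadata matches every key in `where`, then rank by similarity."""
    candidates = [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda d: word_overlap(query, d.text), reverse=True)
    return ranked[:k]


# Assumed, hand-written tags; in practice a tagger would produce these.
docs = [
    Doc("Quarterly revenue grew on strong cloud sales", {"tone": "positive", "title": "Q3 report"}),
    Doc("Cloud outage hurt revenue this quarter", {"tone": "negative", "title": "Incident review"}),
    Doc("The team picnic was a great success", {"tone": "positive", "title": "Newsletter"}),
]

results = filtered_search(docs, "cloud revenue", where={"tone": "positive"})
for doc in results:
    print(doc.metadata["title"], "-", doc.text)
```

Filtering on metadata first narrows the candidate set, so the similarity ranking only has to choose among documents that already satisfy the structured constraint.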