GitHub PDF Connector
The GitHub PDF connector indexes all PDF files in a GitHub repository, including those in subdirectories. This is useful for indexing technical documentation, datasheets, hardware schematics, and other PDF-based resources. The PDFs must contain extractable text: scanned pages that are only images, with no OCR text layer, will not produce any usable text.
Features
- Recursive Traversal: Automatically discovers all PDF files in the entire repository directory structure
- Text Extraction: Extracts text content from PDFs using the pypdf library
- Path Filtering: Optional path filter to index only PDFs in specific directories
- Branch/Tag Support: Specify a particular branch, tag, or commit ref to index
- Chunking: Automatically chunks large PDF documents for optimal retrieval
Configuration
Required Parameters
- type: Must be "github_pdf"
- repo_owner: GitHub repository owner/organization name
- repo_name: GitHub repository name
Optional Parameters
- ref: Branch, tag, or commit SHA to index (defaults to the repository's default branch)
- path_filter: Path prefix to limit indexing to specific directories (e.g., "docs/" to only index PDFs in the docs folder)
Example Usage
Basic Example - Index All PDFs
{
  "name": "hardware-docs",
  "description": "Hardware schematics and datasheets from our hardware repository",
  "connector": {
    "type": "github_pdf",
    "repo_owner": "your-org",
    "repo_name": "hardware-repo"
  }
}
With Path Filter
Index only PDFs in the schematics/ directory:
{
  "name": "schematics-only",
  "description": "Hardware schematics from the schematics directory",
  "connector": {
    "type": "github_pdf",
    "repo_owner": "your-org",
    "repo_name": "hardware-repo",
    "path_filter": "schematics/"
  }
}
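To make the effect of path_filter concrete, the following is a minimal Python sketch of the selection rule as this page describes it: keep files whose name ends in .pdf (case-insensitive) and whose path starts with the filter. The function name select_pdfs and the sample paths are hypothetical, and treating the filter as a plain string prefix is an assumption about the connector, not a statement of its actual implementation.

```python
def select_pdfs(paths, path_filter=None):
    """Keep paths ending in .pdf (any case) that start with path_filter.

    Hypothetical helper: assumes path_filter is a plain string prefix.
    """
    selected = []
    for path in paths:
        # Case-insensitive extension check: "a.PDF" counts as a PDF.
        if not path.lower().endswith(".pdf"):
            continue
        # With no filter, every PDF in the repository is kept.
        if path_filter and not path.startswith(path_filter):
            continue
        selected.append(path)
    return selected


# Sample repository paths, invented for illustration.
repo_paths = [
    "README.md",
    "schematics/board-rev-a.pdf",
    "schematics/power/regulator.PDF",
    "schematics-archive/old-board.pdf",
    "docs/datasheet.pdf",
]

matched = select_pdfs(repo_paths, path_filter="schematics/")
print(matched)
```

Under this prefix assumption the trailing slash matters: "schematics/" skips schematics-archive/old-board.pdf, whereas "schematics" without the slash would also match it.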
With Specific Branch
Index PDFs from the develop branch:
{
  "name": "dev-hardware-docs",
  "description": "Hardware documentation from the develop branch",
  "connector": {
    "type": "github_pdf",
    "repo_owner": "your-org",
    "repo_name": "hardware-repo",
    "ref": "develop"
  }
}
Requirements
- A GitHub personal access token must be configured in the GITHUB_TOKEN environment variable
- The token must have read access to the target repository
- PDFs must contain extractable text (scanned images without OCR will not yield text content)
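As a rough illustration of how the token requirement comes into play, the sketch below builds (but does not send) an authenticated request to GitHub's Git Trees REST endpoint using only the Python standard library. The helper name build_tree_request is hypothetical, and whether the connector constructs its requests this way is an assumption; it may be useful mainly as a quick preflight check that GITHUB_TOKEN is set before indexing.

```python
import os
import urllib.request


def build_tree_request(repo_owner, repo_name, ref, env=os.environ):
    """Build a request for the recursive tree of a repository at ref.

    Hypothetical helper: reads the token from GITHUB_TOKEN and fails
    early if it is missing. The request is only constructed, not sent.
    """
    token = env.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set")
    url = (
        f"https://api.github.com/repos/{repo_owner}/{repo_name}"
        f"/git/trees/{ref}?recursive=1"
    )
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


# A placeholder token is passed explicitly so the example runs offline.
request = build_tree_request(
    "your-org", "hardware-repo", "develop",
    env={"GITHUB_TOKEN": "example-token"},
)
print(request.full_url)
```

Sending the request with urllib.request.urlopen(request) and getting a 404 or 401 response would suggest the token lacks read access or the repository name, owner, or ref is wrong.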
How It Works
- Discovery: The connector fetches the complete repository tree recursively from GitHub's API
- Filtering: Identifies all files with .pdf extension (case-insensitive)
- Path Filtering: If path_filter is specified, only PDFs matching the path prefix are processed
- Download: Each PDF is downloaded via GitHub's blob API
- Text Extraction: Text is extracted from each page of the PDF
- Chunking: The extracted text is split into chunks (default 512 tokens with 50 token overlap)
- Indexing: Each chunk is indexed with embeddings for semantic search
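The chunking step above can be sketched as a sliding window over tokens. This is a minimal illustration, not the connector's implementation: the function name chunk_tokens is hypothetical, and because this page does not say which tokenizer is used, the example treats any list of items as tokens. Only the defaults (512 tokens per chunk, 50 tokens of overlap) come from the description above.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split tokens into windows of chunk_size that share overlap tokens.

    Hypothetical helper: each chunk starts chunk_size - overlap tokens
    after the previous one, so neighbours repeat `overlap` tokens.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Stand-in for the tokens of one extracted PDF (1000 dummy tokens).
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])  # prints [512, 512, 76]
```

The overlap means text that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some tokens twice.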
Limitations
- PDFs must contain extractable text. Scanned images or image-based PDFs without OCR will appear empty
- Very large PDFs may take longer to process
- GitHub API rate limits apply (the connector includes automatic rate limit handling)
- Password-protected or encrypted PDFs are not supported
Troubleshooting
No text extracted from PDFs
- Ensure PDFs contain actual text, not just scanned images
- Some PDFs may use non-standard encodings or fonts that make text extraction difficult
PDFs not being found
- Verify the repository name and owner are correct
- Check that the ref (if specified) exists in the repository
- Ensure the path_filter (if specified) matches your repository structure
- Verify your GitHub token has read access to the repository