# AI Document OCR Provider

A Google Document AI provider for Drupal's [AI module](https://www.drupal.org/project/ai), enabling OCR text extraction from PDF files, images, and other supported document formats.

## Features

- **Google Document AI Integration**: Leverages Google Cloud's Document AI OCR service
- **AI Module Provider**: Integrates as a provider for Drupal's AI module
- **Multiple File Format Support**: PDF, JPEG, PNG, GIF, TIFF, BMP, WebP
- **Structured Data Extraction**: Extracts text, paragraphs, pages, and document structure
- **Dynamic Processor Loading**: Auto-loads available processors from your Google Cloud project
- **Key Module Integration**: Secure credential storage using Drupal's Key module
- **AI Automators Support**: Automatic document processing when files are uploaded

## Requirements

- Drupal 10.0+ or Drupal 11.0+
- [AI module](https://www.drupal.org/project/ai)
- [AI Automators module](https://www.drupal.org/project/ai_automators) (for automatic processing)
- [Key module](https://www.drupal.org/project/key) (recommended)
- Google Cloud Account with Document AI API enabled
- PHP 8.1+ (Drupal 11 itself requires PHP 8.3+)

## Installation

1. Install the module using Composer:
   ```bash
   composer require drupal/ai_document_ocr
   ```

2. Enable the module:
   ```bash
   drush en ai_document_ocr
   ```

## Google Cloud Setup

1. **Create a Google Cloud Project**
   - Go to [Google Cloud Console](https://console.cloud.google.com/)
   - Create a new project or select an existing one

2. **Enable Document AI API**
   - Navigate to APIs & Services > Library
   - Search for "Document AI API" and enable it

3. **Create Document AI Processors**
   - Go to Document AI > Processors
   - Create processors as needed (Document OCR, Form Parser, etc.)
   - Note the Processor IDs for configuration

4. **Create Service Account**
   - Go to IAM & Admin > Service Accounts
   - Create a new service account
   - Grant the "Document AI API User" role (`roles/documentai.apiUser`)
   - Generate and download a JSON key file

## Configuration

1. **Set up Google Cloud Credentials**
   - Store your service account JSON key using Drupal's Key module
   - Go to Configuration > System > Keys
   - Create a new key with the "File" key provider
   - Set the file location to the path of your Google Cloud service account JSON file (ideally outside the web root)

2. **Configure AI Document OCR Provider**
   - Navigate to Configuration > AI > AI Providers > Document OCR (`/admin/config/ai/providers/document-ocr`)
   - Select "AI Document OCR Provider" as the AI Provider
   - Choose your Google Cloud service account key
   - Select your Google Cloud region
   - Choose a processor from the auto-loaded list

3. **AI Provider Selection**
   - The module integrates with Drupal's AI module
   - Select "AI Document OCR Provider" in the AI provider dropdown
   - Configure your Google Cloud settings in the provider configuration
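
The project ID (from the key), the region, and the processor together identify a single Document AI processor. As a rough illustration of how those three values combine, the sketch below builds the REST endpoint in the form Google documents for the `process` call. It is not the module's actual code: the function name `build_process_endpoint()` and the sample values are hypothetical.

```php
/**
 * Builds the Document AI "process" endpoint URL.
 *
 * Hypothetical helper for illustration only; not part of this module.
 */
function build_process_endpoint(string $project_id, string $region, string $processor_id): string {
  // Document AI uses regional hosts such as "eu-documentai.googleapis.com",
  // and the same region appears again in the resource path.
  return sprintf(
    'https://%s-documentai.googleapis.com/v1/projects/%s/locations/%s/processors/%s:process',
    rawurlencode($region),
    rawurlencode($project_id),
    rawurlencode($region),
    rawurlencode($processor_id)
  );
}

// Placeholder values, not real identifiers.
echo build_process_endpoint('my-project', 'eu', 'abc123def456'), PHP_EOL;
```

A processor created in one region is not reachable through another region's host, which is why the region chosen in the form has to match the region where the processor was created.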

## Usage

### Via AI Module

```php
// Get the AI provider manager
$ai_provider_manager = \Drupal::service('ai.provider');

// Create an instance of the Document OCR provider
$provider = $ai_provider_manager->createInstance('ai_document_ocr');

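// Load the file to process. $fid is a placeholder for a managed file ID.
$file = \Drupal\file\Entity\File::load($fid);
$file_content = file_get_contents($file->getFileUri());
$mime_type = $file->getMimeType();
$filename = $file->getFilename();

// Prepare document input (file content is passed base64-encoded)
$input = new \Drupal\ai_document_ocr\OperationType\DocumentToText\DocumentToTextInput(
  base64_encode($file_content),
  $mime_type,
  $filename
);

// Process document. 'model_id' is a placeholder; use the model ID
// configured for the provider.
$output = $provider->documentToText($input, 'model_id');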

// Get results
$extracted_text = $output->getText();
$confidence = $output->getConfidence();
$structured_data = $output->getStructuredData();
```
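
For orientation, the Document AI REST API accepts inline documents as a `rawDocument` object holding base64-encoded content and a MIME type, which is presumably why the input class above takes base64 content. The following is a minimal sketch of that request body, assuming the documented `rawDocument` shape. It does not show the module's internals, and `build_raw_document_payload()` is a hypothetical name.

```php
/**
 * Builds the JSON-serializable body for an inline Document AI request.
 *
 * Hypothetical helper for illustration only; not part of this module.
 */
function build_raw_document_payload(string $file_bytes, string $mime_type): array {
  return [
    'rawDocument' => [
      // Binary content must be base64-encoded to travel inside JSON.
      'content' => base64_encode($file_bytes),
      'mimeType' => $mime_type,
    ],
  ];
}

// Sample bytes stand in for a real PDF read from disk.
$payload = build_raw_document_payload('%PDF-1.7 sample bytes', 'application/pdf');
echo json_encode($payload, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), PHP_EOL;
```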

### Via AI Automators (Automatic Processing)

The module includes AI Automators integration for automatic document processing:

1. **Set Up Fields**: Create an image/file field (source) and a text field (target) on your content type
2. **Configure Automator**: Go to the target text field settings and enable the Document Processor automator
3. **Select Source Field**: Choose your image/file field as the base field for processing
4. **Configure OCR Settings**: Set confidence threshold and structured data extraction options

The automator will automatically:
- Process images and PDFs when nodes are saved
- Extract text using Google Document AI OCR
- Store results in the target text field
- Handle processing errors with logging
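
The confidence threshold mentioned above can be pictured as a simple gate on the OCR result. The sketch below is one possible reading, assuming confidence is reported on a 0.0 to 1.0 scale and that low-confidence text is discarded rather than stored. The automator's real behavior may differ, and `apply_confidence_threshold()` is a hypothetical name.

```php
/**
 * Returns the OCR text only if its confidence meets the threshold.
 *
 * Hypothetical helper; assumes confidence values between 0.0 and 1.0.
 */
function apply_confidence_threshold(string $text, float $confidence, float $threshold): ?string {
  if ($threshold < 0.0 || $threshold > 1.0) {
    throw new InvalidArgumentException('Threshold must be between 0.0 and 1.0.');
  }
  // Keep the text when confidence is at or above the threshold.
  return $confidence >= $threshold ? $text : NULL;
}

var_dump(apply_confidence_threshold('Invoice 42', 0.93, 0.8));
var_dump(apply_confidence_threshold('Inv0ice 4Z', 0.41, 0.8));
```

The first call keeps the text and the second returns `NULL`, so a caller can leave the target field untouched when the OCR result is too uncertain.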

### Supported File Types

- `application/pdf`
- `image/jpeg`, `image/jpg`
- `image/png`
- `image/gif`
- `image/tiff`
- `image/bmp`
- `image/webp`
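
Checking the MIME type before sending a file avoids a wasted API call. Here is a minimal sketch of such a pre-flight check, using the list above. The constant and function names are hypothetical and not part of the module.

```php
// MIME types copied from the list above.
const SUPPORTED_MIME_TYPES = [
  'application/pdf',
  'image/jpeg',
  'image/jpg',
  'image/png',
  'image/gif',
  'image/tiff',
  'image/bmp',
  'image/webp',
];

/**
 * Reports whether a MIME type appears in the supported list.
 *
 * Hypothetical helper for illustration only.
 */
function is_supported_mime_type(string $mime_type): bool {
  // MIME types are case-insensitive and may carry parameters
  // such as "; charset=binary", so normalize before comparing.
  $normalized = strtolower(trim(explode(';', $mime_type)[0]));
  return in_array($normalized, SUPPORTED_MIME_TYPES, TRUE);
}

var_dump(is_supported_mime_type('image/png'));
var_dump(is_supported_mime_type('text/plain'));
```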

## Key Classes

- `\Drupal\ai_document_ocr\Plugin\AiProvider\DocumentOcrProvider`: Main AI provider plugin
- `\Drupal\ai_document_ocr\Plugin\AiAutomatorType\DocumentProcessor`: AI Automator plugin for automatic processing
- `\Drupal\ai_document_ocr\OperationType\DocumentToText\DocumentToTextInput`: Input class
- `\Drupal\ai_document_ocr\OperationType\DocumentToText\DocumentToTextOutput`: Output class
- `\Drupal\ai_document_ocr\Form\AiProviderConfigForm`: Configuration form

## Security

This module follows secure practices for handling Google Cloud credentials:

- **Key Module Integration**: All sensitive credentials are stored using Drupal's Key module with file-based storage
- **No Database Storage**: Credentials are never stored in the Drupal database
- **Automatic Project ID**: Project ID is extracted from the service account key, avoiding manual entry
- **Secure File Storage**: Service account JSON files should be stored outside the web root
- **Minimal Permissions**: Service accounts only need the "Document AI API User" role
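
To illustrate how a project ID can be read from the key instead of being entered by hand, the sketch below parses a service account key and checks a few fields that Google's JSON key files contain. It is an assumption-laden illustration, not the module's implementation: `extract_project_id()` is a hypothetical name and the sample key holds placeholder values only.

```php
/**
 * Validates a service account key and returns its project ID.
 *
 * Hypothetical helper for illustration only; not part of this module.
 */
function extract_project_id(string $key_json): string {
  $data = json_decode($key_json, TRUE);
  if (!is_array($data)) {
    throw new InvalidArgumentException('Key is not valid JSON.');
  }
  // Fields present in Google service account key files.
  foreach (['type', 'project_id', 'private_key', 'client_email'] as $field) {
    if (!isset($data[$field]) || !is_string($data[$field]) || $data[$field] === '') {
      throw new InvalidArgumentException("Key is missing the '$field' field.");
    }
  }
  if ($data['type'] !== 'service_account') {
    throw new InvalidArgumentException('Key is not a service account key.');
  }
  return $data['project_id'];
}

// Placeholder key; never commit a real key to version control.
$sample_key = json_encode([
  'type' => 'service_account',
  'project_id' => 'my-project',
  'private_key' => 'placeholder-private-key',
  'client_email' => 'ocr@my-project.iam.gserviceaccount.com',
]);

echo extract_project_id($sample_key), PHP_EOL;
```

The same check is a quick way to rule out a malformed key when diagnosing authentication errors.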

## Troubleshooting

### Common Issues

1. **No AI Provider Option**
   - Ensure the AI module is enabled
   - Clear Drupal caches

2. **Authentication Errors**
   - Verify service account JSON key format
   - Check Document AI API is enabled in Google Cloud
   - Ensure service account has proper permissions

3. **No Processors Loaded**
   - Verify Google Cloud region selection
   - Check service account key has access to the project
   - Ensure processors exist in the selected region

4. **Processing Failures**
   - Check file format is supported
   - Verify the file is within the size and page limits documented for Google Cloud Document AI
   - Review processor ID format

5. **AI Automators Not Working**
   - Ensure AI Automators module is enabled
   - Check automator configuration at Configuration > AI > AI Automators
   - Verify file fields are properly configured to trigger automators
   - Check queue processing for background operations

### Logging

Check Drupal logs at **Reports > Recent log messages** for any error information.

## License

This project is licensed under GPL-2.0-or-later.