# Text Chunking Implementation - Final Documentation

**Issue:** [#3552783 - Enforce 10k-char limit per request to Lara Translate API](https://www.drupal.org/project/tmgmt_laratranslate/issues/3552783)

**Last Updated:** 24 October 2025

## Table of Contents

1. [Overview](#overview)
2. [Implementation Architecture](#implementation-architecture)
3. [RecursiveCharacterTextSplitter Service](#recursivecharactertextsplitter-service)
4. [TextSplitterValidator Service](#textsplittervalidator-service)
5. [Key Features](#key-features)
6. [Configuration Options](#configuration-options)
7. [Usage Examples](#usage-examples)
8. [Integration with LaraTranslator](#integration-with-laratranslator)
9. [Testing Strategy](#testing-strategy)
10. [Performance Considerations](#performance-considerations)
11. [Comparison with Previous Approaches](#comparison-with-previous-approaches)
12. [Conclusion](#conclusion)

---

## Overview

The Lara Translate API enforces a strict limit of **10,000 characters per request**. To handle content that exceeds this limit, we implemented a text chunking system based on a PHP port of LangChain's `RecursiveCharacterTextSplitter`.

### Why RecursiveCharacterTextSplitter?

Unlike simple sentence-boundary splitting, the recursive approach:

- ✅ **Hierarchically tries different separators** - starts with larger semantic units (paragraphs) and recursively tries smaller units (sentences, words) if needed
- ✅ **Preserves context better** - configurable chunk overlap ensures translation context is maintained
- ✅ **Language-specific splitting** - supports different splitting patterns for code (PHP, JavaScript, Python, etc.)
- ✅ **Flexible and extensible** - highly configurable with multiple separator strategies
- ✅ **Battle-tested** - based on LangChain's proven implementation used in production AI/ML systems
- ✅ **UTF-8 safe** - proper multibyte character handling

### What We Implemented

We have implemented **two services**:

1. **RecursiveCharacterTextSplitter** - Core chunking logic with recursive separator strategy
2. **TextSplitterValidator** - HTML validation utilities for quality assurance

---

## Implementation Architecture

### Service Structure

```
src/Service/
├── RecursiveCharacterTextSplitter.php  # Main chunking service
└── TextSplitterValidator.php           # Validation utilities
```

---

## RecursiveCharacterTextSplitter Service

### Core Concept

The splitter recursively tries different separators to find the best split point:

1. Try splitting by **paragraphs** (`\n\n`)
2. If chunks still too large, try **lines** (`\n`)
3. If still too large, try **words** (` `)
4. Last resort: **characters** (`""`)

This ensures the largest semantic units are preserved while respecting size limits.

### Class Overview

```php
namespace Drupal\tmgmt_laratranslate\Service;

use Drupal\tmgmt_laratranslate\Enum\KeepSeparator;
use Psr\Log\LoggerInterface;

/**
 * Recursive character text splitter service.
 *
 * Recursively splits text by trying different separators to find one that
 * works. This is a PHP port of LangChain's RecursiveCharacterTextSplitter.
 *
 * @see https://github.com/langchain-ai/langchain/blob/master/libs/text-splitters/langchain_text_splitters/character.py
 */
class RecursiveCharacterTextSplitter {
  // Implementation...
}
```

### Default Configuration

```php
// Default separator hierarchy
$this->separators = ["\n\n", "\n", " ", ""];

// Keep separators in output by default
$this->keepSeparator = KeepSeparator::Yes;

// Separators are literal strings (not regex)
$this->isSeparatorRegex = FALSE;

// Maximum chunk size: 4000 characters
$this->chunkSize = 4000;

// Overlap between chunks: 200 characters
$this->chunkOverlap = 200;

// UTF-8 aware length function
$this->lengthFunction = fn (string $text): int => mb_strlen($text, 'UTF-8');
```

### Public Methods

#### `configure(array $config): static`

Configure the splitter with custom settings.

**Parameters:**
- `separators` (array) - Separator strings to try in order
- `keep_separator` (KeepSeparator|string|bool) - How to handle separators
- `is_separator_regex` (bool) - Whether separators are regex patterns
- `chunk_size` (int) - Maximum chunk size in characters
- `chunk_overlap` (int) - Overlap between chunks
- `length_function` (callable) - Custom length calculation function

**Returns:** `$this` (fluent interface)

**Example:**
```php
$splitter->configure([
  'chunk_size' => 9900,
  'chunk_overlap' => 200,
  'separators' => ["\n\n", "\n", ". ", " ", ""],
  'keep_separator' => KeepSeparator::End,
]);
```

#### `splitText(string $text): array`

Split text into chunks using the configured settings.

**Parameters:**
- `$text` (string) - The text to split

**Returns:** Array of text chunks

**Example:**
```php
$text = str_repeat('This is a long paragraph. ', 500);
$chunks = $splitter->splitText($text);
// Returns: ['chunk1', 'chunk2', ...]
```

#### `forLanguage(string $language, array $config = []): static`

Create a splitter configured for a specific programming language.

**Supported Languages:**
- `php` - PHP code
- `python` - Python code
- `javascript`, `js` - JavaScript code
- `typescript`, `ts` - TypeScript code
- `java` - Java code
- `go` - Go code
- `rust` - Rust code
- `markdown`, `md` - Markdown documents
- `html` - HTML documents

**Parameters:**
- `$language` (string) - Language identifier
- `$config` (array) - Additional configuration options

**Returns:** New configured instance

**Example:**
```php
$phpSplitter = $splitter->forLanguage('php', [
  'chunk_size' => 5000,
  'chunk_overlap' => 100,
]);

$phpCode = file_get_contents('MyClass.php');
$chunks = $phpSplitter->splitText($phpCode);
```

#### `reassembleChunks(array $chunks): string`

Reassemble chunks back into a single text.

**Parameters:**
- `$chunks` (array) - Array of text chunks

**Returns:** Reassembled text

**Example:**
```php
$originalText = $splitter->reassembleChunks($chunks);
```
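How `reassembleChunks()` treats the overlap between neighbouring chunks is not shown in this document. One possible approach, sketched here as an assumption and not as the module's implementation, is to drop the longest prefix of each chunk that repeats the tail of the text assembled so far. The function name `joinWithoutOverlap()` and the `$maxOverlap` parameter are hypothetical.

```php
/**
 * Joins chunks, removing text repeated between neighbours (hypothetical).
 */
function joinWithoutOverlap(array $chunks, int $maxOverlap): string {
  $result = '';
  foreach ($chunks as $chunk) {
    $limit = min($maxOverlap, mb_strlen($result, 'UTF-8'), mb_strlen($chunk, 'UTF-8'));
    $shared = 0;
    // Find the longest suffix of $result that is also a prefix of $chunk.
    for ($length = $limit; $length > 0; $length--) {
      if (mb_substr($result, -$length, NULL, 'UTF-8') === mb_substr($chunk, 0, $length, 'UTF-8')) {
        $shared = $length;
        break;
      }
    }
    // Append only the part of the chunk that is not already present.
    $result .= mb_substr($chunk, $shared, NULL, 'UTF-8');
  }
  return $result;
}

$chunks = ['AAAA BBBB CCCC', 'CCCC DDDD EEEE'];
$joined = joinWithoutOverlap($chunks, 10);
// $joined is "AAAA BBBB CCCC DDDD EEEE".
```

Exact matching like this only works while the overlapping text is identical in both chunks, which is the case for untranslated chunks.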

#### `getTextLength(string $text): int`

Get the length of text using the configured length function.

**Parameters:**
- `$text` (string) - The text to measure

**Returns:** Length in characters (UTF-8 aware)

**Example:**
```php
$length = $splitter->getTextLength('Hello 世界'); // Returns: 8
```
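To see why the length function matters, the following sketch compares PHP's byte-oriented `strlen()` with `mb_strlen()` on the same string. It uses only built-in functions; the variable names are illustrative.

```php
// "Hello 世界" has 8 characters, but each CJK character takes 3 bytes in UTF-8.
$text = 'Hello 世界';

// Counts bytes: 6 single-byte characters plus 2 × 3 bytes = 12.
$bytes = strlen($text);

// Counts characters: 8.
$chars = mb_strlen($text, 'UTF-8');

printf("bytes=%d chars=%d\n", $bytes, $chars);
```

A byte-based check would therefore treat multibyte text as longer than it is when the limit is expressed in characters.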

### Separator Strategies (KeepSeparator Enum)

The `KeepSeparator` enum defines how separators are handled:

```php
enum KeepSeparator: string {
  case Yes = 'yes';     // Keep as separate elements
  case No = 'no';       // Discard completely
  case Start = 'start'; // Attach to start of following text
  case End = 'end';     // Attach to end of preceding text
}
```

**Examples:**

```php
// Original text, split on the literal separator ". "
// (illustrative; assumes no whitespace trimming)
$text = "Sentence one. Sentence two. Sentence three.";

// KeepSeparator::Yes (default)
// Split: ["Sentence one", ". ", "Sentence two", ". ", "Sentence three."]

// KeepSeparator::No
// Split: ["Sentence one", "Sentence two", "Sentence three."]

// KeepSeparator::End
// Split: ["Sentence one. ", "Sentence two. ", "Sentence three."]

// KeepSeparator::Start
// Split: ["Sentence one", ". Sentence two", ". Sentence three."]

### Language-Specific Separators

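When using `forLanguage()`, the splitter swaps the default separator hierarchy for a list of patterns suited to the selected language, such as function and class boundaries for code or heading markers for Markdown.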
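The exact lists are not reproduced in this document. As orientation, the sketch below uses an abridged PHP list modelled on the upstream LangChain implementation; whether the PHP port uses exactly these entries is an assumption. Both function names are hypothetical.

```php
/**
 * Returns an ordered separator list for a language (hypothetical helper).
 */
function separatorsFor(string $language): array {
  $generic = ["\n\n", "\n", " ", ""];
  $specific = [
    // Modelled on LangChain's PHP list: definitions first, then control flow.
    'php' => ["\nfunction ", "\nclass ", "\nif ", "\nforeach ", "\nwhile ", "\nswitch "],
  ];
  // Unknown languages fall back to the generic hierarchy.
  return array_merge($specific[$language] ?? [], $generic);
}

/**
 * Picks the first separator that occurs in the text.
 */
function firstMatchingSeparator(string $text, array $separators): string {
  foreach ($separators as $separator) {
    if ($separator === '' || str_contains($text, $separator)) {
      return $separator;
    }
  }
  return '';
}

$code = "namespace App;\n\nfunction a() {}\n\nclass Foo {}";
$chosen = firstMatchingSeparator($code, separatorsFor('php'));
// $chosen is "\nfunction ", so the first split happens at function boundaries.
```

Placing the language-specific entries ahead of the generic ones means the splitter only falls back to paragraphs, lines and words when no structural boundary is available.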

### How Recursive Splitting Works

**Algorithm Flow:**

1. **Select Separator**: Try separators in order until one matches the text
2. **Split**: Split text using the selected separator
3. **Check Splits**: For each resulting split:
   - If length < `chunk_size`: Add to "good splits"
   - If length >= `chunk_size`: **Recursively split** using remaining separators
4. **Merge**: Merge "good splits" into chunks respecting size and overlap limits
5. **Overlap**: Maintain `chunk_overlap` characters between adjacent chunks for context

**Example:**

```php
$text = "Paragraph one.\n\nParagraph two.\n\nParagraph three with very long content...";

// Step 1: Try "\n\n" separator
// Result: ["Paragraph one.", "Paragraph two.", "Paragraph three with very long content..."]

// Step 2: Third chunk too large, recursively try "\n"
// On "Paragraph three with very long content..." split by lines

// Step 3: Merge chunks with overlap
// Final: ["Paragraph one.\n\nParagraph two.", "Paragraph two.\n\nParagraph three first part", "..."]
```
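The flow above can be condensed into a short function. This is a simplified sketch under stated assumptions, not the service's implementation: it has no overlap, drops separators while splitting and re-inserts them when merging, and the name `recursiveSplit()` is hypothetical.

```php
/**
 * Recursively splits text so that no chunk exceeds $chunkSize characters.
 */
function recursiveSplit(string $text, array $separators, int $chunkSize): array {
  if ($chunkSize < 1) {
    throw new \InvalidArgumentException('Chunk size must be at least 1.');
  }
  if (mb_strlen($text, 'UTF-8') <= $chunkSize) {
    return [$text];
  }
  // Use the first separator found in the text; '' means "split per character".
  $separator = '';
  $finer = [];
  foreach ($separators as $index => $candidate) {
    if ($candidate === '' || str_contains($text, $candidate)) {
      $separator = $candidate;
      $finer = array_slice($separators, $index + 1);
      break;
    }
  }
  $pieces = $separator === ''
    ? mb_str_split($text, 1, 'UTF-8')
    : explode($separator, $text);

  $chunks = [];
  $current = '';
  foreach ($pieces as $piece) {
    $joined = $current === '' ? $piece : $current . $separator . $piece;
    if (mb_strlen($joined, 'UTF-8') <= $chunkSize) {
      // Still fits: keep growing the current chunk.
      $current = $joined;
      continue;
    }
    if ($current !== '') {
      $chunks[] = $current;
      $current = '';
    }
    if (mb_strlen($piece, 'UTF-8') <= $chunkSize) {
      $current = $piece;
    }
    else {
      // The piece alone is too large: retry with the finer separators.
      array_push($chunks, ...recursiveSplit($piece, $finer, $chunkSize));
    }
  }
  if ($current !== '') {
    $chunks[] = $current;
  }
  return $chunks;
}

$text = "Paragraph one.\n\nParagraph two.\n\n" . str_repeat('word ', 10) . 'end';
$chunks = recursiveSplit($text, ["\n\n", "\n", " ", ""], 30);
// 3 chunks, none longer than 30 characters.
```

The two short paragraphs stay together in the first chunk, and only the oversized third paragraph is broken down at word boundaries.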

### Chunk Overlap Feature

Overlap ensures translation context is preserved between chunks:

```php
// Small values so that the short sample below actually splits
$splitter->configure([
  'chunk_size' => 24,
  'chunk_overlap' => 10,
]);

// Text: "AAAA BBBB CCCC DDDD EEEE FFFF GGGG" (34 characters)
// Chunks with overlap:
// Chunk 1: "AAAA BBBB CCCC DDDD EEEE"
// Chunk 2: "DDDD EEEE FFFF GGGG"  // ← "DDDD EEEE" overlaps with chunk 1
```

**Benefits:**
- Translation context is maintained across chunk boundaries
- Improves translation quality for long documents
- Helps preserve meaning that spans multiple chunks
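The merge step that produces the overlap can be sketched as follows. The logic is modelled on LangChain's merge routine and is an assumption about how the PHP port behaves; `mergeWithOverlap()` is a hypothetical name, and the sketch assumes `$overlap` is smaller than `$chunkSize`. It uses a chunk size of 24 and an overlap of 10 so that the short sample text actually splits.

```php
/**
 * Merges small splits into chunks, carrying an overlap into the next chunk.
 */
function mergeWithOverlap(array $splits, string $separator, int $chunkSize, int $overlap): array {
  $length = fn (string $s): int => mb_strlen($s, 'UTF-8');
  $sepLength = $length($separator);
  $chunks = [];
  $current = [];
  // Length of implode($separator, $current), maintained incrementally.
  $total = 0;

  foreach ($splits as $split) {
    $size = $length($split);
    if ($current !== [] && $total + $size + $sepLength > $chunkSize) {
      $chunks[] = implode($separator, $current);
      // Drop leading splits until the remainder fits the overlap budget
      // and leaves room for the incoming split.
      while ($current !== [] && ($total > $overlap || $total + $size + $sepLength > $chunkSize)) {
        $total -= $length($current[0]) + (count($current) > 1 ? $sepLength : 0);
        array_shift($current);
      }
    }
    $total += $size + ($current === [] ? 0 : $sepLength);
    $current[] = $split;
  }
  if ($current !== []) {
    $chunks[] = implode($separator, $current);
  }
  return $chunks;
}

$splits = explode(' ', 'AAAA BBBB CCCC DDDD EEEE FFFF GGGG');
$chunks = mergeWithOverlap($splits, ' ', 24, 10);
// ["AAAA BBBB CCCC DDDD EEEE", "DDDD EEEE FFFF GGGG"]
```

The overlap is made of whole splits, so the repeated text ("DDDD EEEE", 9 characters) can be shorter than the configured overlap but never cuts a word in half.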

---

## TextSplitterValidator Service

### Purpose

Provides optional HTML validation utilities for quality assurance. This service is primarily used for:

- Debugging HTML structure issues
- Quality assurance logging
- Final result validation (optional, non-blocking)

### Class Overview

```php
namespace Drupal\tmgmt_laratranslate\Service;

use Psr\Log\LoggerInterface;

/**
 * Service for HTML validation.
 */
class TextSplitterValidator {

  public function __construct(
    private readonly LoggerInterface $logger,
  ) {}

  /**
   * Validates that HTML is well-formed.
   */
  public function validateHtml(string $html): bool {
    // Implementation using DOMDocument
  }
}
```

### Public Methods

#### `validateHtml(string $html): bool`

Validates that HTML is well-formed using PHP's DOMDocument.

**Parameters:**
- `$html` (string) - The HTML to validate

**Returns:** `TRUE` if valid, `FALSE` otherwise

**Example:**
```php
$html = '<div><p>Valid HTML</p></div>';
$isValid = $validator->validateHtml($html); // Returns: TRUE

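// Stray </span>: unlike a missing </p>, which the HTML parser may imply
// silently, an end tag with no matching start tag is reported as an error.
$html = '<div><p>Invalid HTML</span></p></div>';
$isValid = $validator->validateHtml($html); // Returns: FALSE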
```

**Implementation Details:**

```php
public function validateHtml(string $html): bool {
  if ($html === '') {
    $this->logger->warning('Empty HTML provided for validation.');
    return TRUE;
  }

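  $dom = new \DOMDocument();
  // Collect libxml errors internally and remember the previous setting.
  $previous = libxml_use_internal_errors(TRUE);

  // Load as HTML fragment (no doctype/html/body)
  $result = $dom->loadHTML(
    '<?xml encoding="UTF-8">' . $html,
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
  );

  $errors = libxml_get_errors();
  libxml_clear_errors();
  // Restore the caller's error handling mode instead of forcing it off.
  libxml_use_internal_errors($previous);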

  return $result && $errors === [];
}
```
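For debugging it can help to see what libxml actually reported instead of a bare boolean. The following sketch is not part of the service: `collectHtmlErrors()` is a hypothetical helper that reuses the same loading flags and returns the messages as strings. Which constructs are reported depends on the installed libxml2 version.

```php
/**
 * Returns libxml's messages for an HTML fragment (hypothetical helper).
 */
function collectHtmlErrors(string $html): array {
  if ($html === '') {
    return [];
  }
  $dom = new \DOMDocument();
  // Remember the caller's setting so it can be restored afterwards.
  $previous = libxml_use_internal_errors(TRUE);
  $dom->loadHTML(
    '<?xml encoding="UTF-8">' . $html,
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
  );
  $messages = [];
  foreach (libxml_get_errors() as $error) {
    // Each entry is a \LibXMLError with level, code, line and message.
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
  }
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  return $messages;
}

$messages = collectHtmlErrors('<div><p>Some content</p></div>');
foreach ($messages as $message) {
  echo $message, "\n";
}
```

The returned strings can be passed to the logger as context, which keeps the validation non-blocking while still recording why a fragment was flagged.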

### Usage Philosophy

**Important:** This validator is for **optional validation only**.

❌ **DO NOT** use for:
- Validating individual chunks (unnecessary overhead)
- Blocking translation operations (too strict)
- Production critical path (performance impact)

✅ **DO** use for:
- Debugging HTML issues during development
- Quality assurance logging (warnings, not errors)
- Final reassembled result validation (optional)

**Reasoning:**

Multiple layers of protection already exist:
1. **Drupal Text Formats** - Input HTML is filtered and sanitized
2. **Translation API** - Handles HTML processing internally
3. **Drupal Rendering** - Final output is filtered again

Additional validation is often redundant and adds unnecessary overhead.

---

## Key Features

### 1. UTF-8 Multibyte Support

All text operations use UTF-8 aware functions:

```php
// Default length function
$this->lengthFunction = fn (string $text): int => mb_strlen($text, 'UTF-8');

// Custom length functions are supported
$splitter->configure([
  'length_function' => function(string $text): int {
    // Count space-separated words (a rough token proxy) instead of characters;
    // chunk_size and chunk_overlap are then measured in words.
    return count(explode(' ', $text));
  },
]);
```

### 2. Configurable Separator Strategy

Choose how to handle separators:

```php
// Keep separators with preceding text
$splitter->configure([
  'separators' => ['. ', '! ', '? '],
  'keep_separator' => KeepSeparator::End,
]);

// Result: ["Sentence one.", "Sentence two!", "Question?"]
```

### 3. Regex Support

Use regex patterns for advanced splitting:

```php
$splitter->configure([
  // Single quotes keep the backslashes, so PCRE sees \n as a newline escape.
  'separators' => ['\n#{1,6} ', '\n\n', '\n'],
  'is_separator_regex' => TRUE,
]);

// Splits on Markdown headings and paragraphs
```

### 4. Context Preservation

Chunk overlap maintains context:

```php
$splitter->configure([
  'chunk_size' => 1000,
  'chunk_overlap' => 200, // Last 200 chars repeated in next chunk
]);
```

### 5. Language-Specific Splitting

Optimized for different content types:

```php
// Code splitting
$phpSplitter = $splitter->forLanguage('php');
$jsSplitter = $splitter->forLanguage('javascript');

// Markdown splitting
$mdSplitter = $splitter->forLanguage('markdown');

// HTML splitting
$htmlSplitter = $splitter->forLanguage('html');
```

### 6. Fluent Configuration

Chainable configuration methods:

```php
$splitter
  ->configure(['chunk_size' => 5000])
  ->configure(['chunk_overlap' => 100])
  ->configure(['keep_separator' => KeepSeparator::End]);

$chunks = $splitter->splitText($text);
```

---

## Configuration Options

### Complete Configuration Example

```php
use Drupal\tmgmt_laratranslate\Service\RecursiveCharacterTextSplitter;
use Drupal\tmgmt_laratranslate\Enum\KeepSeparator;

$splitter = \Drupal::service(RecursiveCharacterTextSplitter::class);

$splitter->configure([
  // Separators to try in order (paragraphs → lines → words → chars)
  'separators' => ["\n\n", "\n", " ", ""],

  // How to handle separators in output
  'keep_separator' => KeepSeparator::End,

  // Whether separators are regex patterns
  'is_separator_regex' => FALSE,

  // Maximum chunk size (characters)
  'chunk_size' => 9900,

  // Overlap between chunks (for context)
  'chunk_overlap' => 200,

  // Custom length function (optional)
  'length_function' => fn(string $text): int => mb_strlen($text, 'UTF-8'),
]);
```

### Configuration for Lara Translate API

For the Lara Translate 10k character limit:

```php
$splitter->configure([
  'chunk_size' => 9900,        // 100 char safety buffer
  'chunk_overlap' => 200,      // Preserve context
  'separators' => ["\n\n", "\n", ". ", " ", ""],
  'keep_separator' => KeepSeparator::End,
]);
```
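These settings also determine how many API requests a long text needs. As a rough estimate, the first chunk covers up to `chunk_size` characters and every later chunk adds at most `chunk_size - chunk_overlap` new ones. The helper below is a hypothetical illustration of that arithmetic; real chunk counts can be higher because chunks end at separators and are rarely filled completely.

```php
/**
 * Lower bound on the number of chunks for a text of $length characters.
 */
function minimumChunkCount(int $length, int $chunkSize, int $overlap): int {
  if ($overlap < 0 || $overlap >= $chunkSize) {
    throw new \InvalidArgumentException('Overlap must be >= 0 and smaller than the chunk size.');
  }
  if ($length <= $chunkSize) {
    return 1;
  }
  // The first chunk covers $chunkSize characters; every later chunk adds at
  // most ($chunkSize - $overlap) new ones.
  return 1 + (int) ceil(($length - $chunkSize) / ($chunkSize - $overlap));
}

// 50,000 characters with the Lara settings: 1 + ceil(40100 / 9700) = 6.
$requests = minimumChunkCount(50000, 9900, 200);
```

With six chunks, at most (6 - 1) × 200 = 1,000 characters are sent twice because of the overlap.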

---

## Usage Examples

### Example 1: Basic Text Splitting

```php
use Drupal\tmgmt_laratranslate\Service\RecursiveCharacterTextSplitter;

// Get service via dependency injection or service container
$splitter = \Drupal::service(RecursiveCharacterTextSplitter::class);

// Configure for Lara API
$splitter->configure([
  'chunk_size' => 9900,
  'chunk_overlap' => 200,
]);

// Split long text
$longText = file_get_contents('long-article.txt');
$chunks = $splitter->splitText($longText);

echo "Split into " . count($chunks) . " chunks\n";

foreach ($chunks as $i => $chunk) {
  $length = $splitter->getTextLength($chunk);
  echo "Chunk " . ($i + 1) . ": {$length} characters\n";
}
```

### Example 2: Language-Specific Splitting

```php
// Split PHP code
$phpCode = file_get_contents('MyClass.php');

$phpSplitter = $splitter->forLanguage('php', [
  'chunk_size' => 5000,
  'chunk_overlap' => 100,
]);

$codeChunks = $phpSplitter->splitText($phpCode);

// Each chunk splits at function/class boundaries when possible
```

### Example 3: Custom Separator Strategy

```php
// Split on sentences, keep punctuation
$splitter->configure([
  'separators' => ["\n\n", ". ", "! ", "? ", " ", ""],
  'keep_separator' => KeepSeparator::End,
  'chunk_size' => 1000,
  'chunk_overlap' => 50,
]);

$text = "First sentence. Second sentence! Question? More text.";
$chunks = $splitter->splitText($text);

// This short sample is well below chunk_size, so it is returned as a single chunk.
// On input longer than chunk_size, breaks fall after sentence punctuation where possible.
```

### Example 4: Markdown Document Splitting

```php
// Split Markdown document at heading boundaries
$mdSplitter = $splitter->forLanguage('markdown', [
  'chunk_size' => 10000,
  'chunk_overlap' => 500,
]);

$markdown = file_get_contents('documentation.md');
$sections = $mdSplitter->splitText($markdown);

// Splits at headings (# ## ###), code blocks (```), horizontal rules
```

### Example 5: HTML Content Splitting

```php
// Split HTML at tag boundaries
$htmlSplitter = $splitter->forLanguage('html', [
  'chunk_size' => 8000,
  'chunk_overlap' => 200,
]);

$html = '<div><p>Paragraph 1</p><p>Paragraph 2</p>...</div>';
$htmlChunks = $htmlSplitter->splitText($html);

// Splits at block element boundaries when possible
```

### Example 6: Reassembling Chunks

```php
$originalText = "Very long text content...";

// Split
$chunks = $splitter->splitText($originalText);

// Process chunks (e.g., translate)
$processedChunks = array_map(function($chunk) {
  return mb_strtoupper($chunk, 'UTF-8'); // Example transformation (multibyte-safe)
}, $chunks);

// Reassemble
$result = $splitter->reassembleChunks($processedChunks);
```

### Example 7: With HTML Validation

```php
use Drupal\tmgmt_laratranslate\Service\RecursiveCharacterTextSplitter;
use Drupal\tmgmt_laratranslate\Service\TextSplitterValidator;

$splitter = \Drupal::service(RecursiveCharacterTextSplitter::class);
$validator = \Drupal::service(TextSplitterValidator::class);

$html = '<div><p>Long HTML content...</p></div>';

// Split HTML content
$chunks = $splitter->forLanguage('html')->splitText($html);

// Optional: Validate each chunk (for debugging only)
foreach ($chunks as $chunk) {
  if (!$validator->validateHtml($chunk)) {
    \Drupal::logger('tmgmt_laratranslate')
      ->warning('Chunk has HTML structure issues');
  }
}

// Process chunks... (translateChunks() is a placeholder for your own translation step)
$translatedChunks = translateChunks($chunks);

// Reassemble
$result = $splitter->reassembleChunks($translatedChunks);

// Optional: Validate final result
if (!$validator->validateHtml($result)) {
  \Drupal::logger('tmgmt_laratranslate')
    ->warning('Final HTML may have structural issues');
}
```

---

## Integration with LaraTranslator

### Current Integration Status

The services are implemented and ready for integration with the `LaraTranslator` plugin.

### Recommended Integration Approach

```php
namespace Drupal\tmgmt_laratranslate\Plugin\tmgmt\Translator;

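use Drupal\tmgmt\TranslatorInterface;
use Drupal\tmgmt\TranslatorPluginBase;
use Drupal\tmgmt_laratranslate\Enum\KeepSeparator;
use Drupal\tmgmt_laratranslate\Service\RecursiveCharacterTextSplitter;
use Psr\Log\LoggerInterface;
use Symfony\Component\DependencyInjection\ContainerInterface;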

final class LaraTranslator extends TranslatorPluginBase {

  public function __construct(
    array $configuration,
    string $plugin_id,
    mixed $plugin_definition,
    private readonly LoggerInterface $logger,
    // ... other dependencies
    private readonly RecursiveCharacterTextSplitter $textSplitter,
  ) {
    parent::__construct($configuration, $plugin_id, $plugin_definition);
  }

  public static function create(
    ContainerInterface $container,
    array $configuration,
    $plugin_id,
    $plugin_definition
  ): static {
    // ... retrieve other services

    $text_splitter = $container->get(RecursiveCharacterTextSplitter::class);
    \assert($text_splitter instanceof RecursiveCharacterTextSplitter);

    return new static(
      $configuration,
      $plugin_id,
      $plugin_definition,
      $logger,
      // ... other services
      $text_splitter,
    );
  }

  private function translateText(
    string $text,
    string $source_langcode,
    string $target_langcode,
    LaraTranslatorSDK $lara_translator,
    TranslatorInterface $translator,
  ): string {
    // Configure splitter for Lara API limits
    $this->textSplitter->configure([
      'chunk_size' => 9900,      // 100 char safety buffer
      'chunk_overlap' => 200,    // Preserve context
      'keep_separator' => KeepSeparator::End,
    ]);

    $totalLength = $this->textSplitter->getTextLength($text);

    // No chunking needed
    if ($totalLength <= 9900) {
      return $this->performSingleTranslation(
        $text,
        $source_langcode,
        $target_langcode,
        $lara_translator,
        $translator
      );
    }

    // Chunk the text
    $chunks = $this->textSplitter->splitText($text);

    $this->logger->info('Text exceeds 9.9k limit (@length chars), split into @chunks chunks', [
      '@length' => $totalLength,
      '@chunks' => count($chunks),
    ]);

    // Translate each chunk
    $translations = [];
    foreach ($chunks as $index => $chunk) {
      $this->logger->debug('Translating chunk @num/@total', [
        '@num' => $index + 1,
        '@total' => count($chunks),
      ]);

      $translations[] = $this->performSingleTranslation(
        $chunk,
        $source_langcode,
        $target_langcode,
        $lara_translator,
        $translator
      );
    }

    // Reassemble translated chunks
    $result = $this->textSplitter->reassembleChunks($translations);

    $this->logger->info('Successfully reassembled @chunks chunks', [
      '@chunks' => count($chunks),
    ]);

    return $result;
  }

  private function performSingleTranslation(
    string $text,
    string $source_langcode,
    string $target_langcode,
    LaraTranslatorSDK $lara_translator,
    TranslatorInterface $translator,
  ): string {
    // Existing translation logic
  }
}
```
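The control flow of `translateText()` can be exercised without Drupal or the Lara SDK. The sketch below is a stand-alone illustration under stated assumptions: `translateWithLimit()` is a hypothetical function, and the `$split` and `$translate` callables are stand-ins for the splitter service and a single API request. It joins the results directly and does not model overlap.

```php
/**
 * Translates text, chunking only when it exceeds the limit (sketch).
 *
 * @param callable $split
 *   Hypothetical: fn(string): string[] returning chunks within the limit.
 * @param callable $translate
 *   Hypothetical: fn(string): string performing one API request.
 */
function translateWithLimit(string $text, int $limit, callable $split, callable $translate): string {
  // Short text goes out in a single request.
  if (mb_strlen($text, 'UTF-8') <= $limit) {
    return $translate($text);
  }
  $translated = [];
  foreach ($split($text) as $chunk) {
    // Guard against a misconfigured splitter before calling the API.
    if (mb_strlen($chunk, 'UTF-8') > $limit) {
      throw new \LengthException('Splitter returned a chunk above the limit.');
    }
    $translated[] = $translate($chunk);
  }
  return implode('', $translated);
}

// Stand-ins for the splitter and the API client.
$split = fn (string $text): array => mb_str_split($text, 10, 'UTF-8');
$translate = fn (string $text): string => mb_strtoupper($text, 'UTF-8');

$short = translateWithLimit('hello', 10, $split, $translate);
$long = translateWithLimit('hello wonderful world', 10, $split, $translate);
```

Checking each chunk against the limit before sending it turns a splitter misconfiguration into an immediate local error instead of a rejected API request.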

### Integration Considerations

**1. Character Limit Enforcement**

The API limit applies to the **complete text** sent to the API:

```php
// ✅ CORRECT: Check total text length
$totalLength = $this->textSplitter->getTextLength($text);

if ($totalLength <= 9900) {
  // No chunking needed
}
```

**2. Configuration per Content Type**

Different content types may need different configurations:

```php
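// For HTML content: forLanguage() returns a new instance, so keep the result.
if ($isHtml) {
  $splitter = $this->textSplitter->forLanguage('html', [
    'chunk_size' => 9900,
    'chunk_overlap' => 200,
  ]);
}
// For plain text: configure() returns the same (fluent) instance.
else {
  $splitter = $this->textSplitter->configure([
    'chunk_size' => 9900,
    'chunk_overlap' => 200,
    'separators' => ["\n\n", "\n", ". ", " ", ""],
  ]);
}

$chunks = $splitter->splitText($text);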
```

**3. Error Handling**

If a chunk translation fails, fail the entire text:

```php
foreach ($chunks as $chunk) {
  try {
    $translations[] = $this->performSingleTranslation(
      $chunk,
      $source_langcode,
      $target_langcode,
      $lara_translator,
      $translator
    );
  }
  catch (\Exception $e) {
    // Log error and re-throw
    $this->logger->error('Chunk translation failed: @message', [
      '@message' => $e->getMessage(),
    ]);
    throw $e; // Fail entire job item
  }
}
```
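When a long text is split into many chunks, it is useful to know which one failed. The following sketch is a variation on the loop above, not the plugin's code: `translateAllOrFail()` is a hypothetical function that wraps the original exception and adds the chunk position to the message.

```php
/**
 * Translates chunks in order and reports which chunk failed (sketch).
 */
function translateAllOrFail(array $chunks, callable $translate): array {
  $translations = [];
  foreach ($chunks as $index => $chunk) {
    try {
      $translations[] = $translate($chunk);
    }
    catch (\Throwable $e) {
      // Wrap the original exception so the chunk position is not lost.
      throw new \RuntimeException(
        sprintf('Chunk %d of %d failed: %s', $index + 1, count($chunks), $e->getMessage()),
        0,
        $e
      );
    }
  }
  return $translations;
}

// Stand-in translator that fails on the second chunk.
$failing = function (string $chunk): string {
  if ($chunk === 'b') {
    throw new \RuntimeException('API unavailable');
  }
  return strtoupper($chunk);
};

try {
  translateAllOrFail(['a', 'b', 'c'], $failing);
  $message = 'no failure';
}
catch (\RuntimeException $e) {
  $message = $e->getMessage();
}
// $message is "Chunk 2 of 3 failed: API unavailable".
```

Passing the original exception as the third constructor argument keeps it available through `getPrevious()` for logging.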

**4. Logging**

Comprehensive logging for debugging:

```php
$this->logger->info('Text chunking: @total chars → @chunks chunks', [
  '@total' => $totalLength,
  '@chunks' => count($chunks),
]);

foreach ($chunks as $i => $chunk) {
  $this->logger->debug('Chunk @num: @length chars', [
    '@num' => $i + 1,
    '@length' => $this->textSplitter->getTextLength($chunk),
  ]);
}
```

---

## Testing Strategy

### Unit Tests for RecursiveCharacterTextSplitter

```php
namespace Drupal\Tests\tmgmt_laratranslate\Unit\Service;

use Drupal\Tests\UnitTestCase;
use Drupal\tmgmt_laratranslate\Service\RecursiveCharacterTextSplitter;
use Drupal\tmgmt_laratranslate\Enum\KeepSeparator;

class RecursiveCharacterTextSplitterTest extends UnitTestCase {

  private RecursiveCharacterTextSplitter $splitter;

  protected function setUp(): void {
    parent::setUp();

    $logger = $this->createMock(\Psr\Log\LoggerInterface::class);
    $this->splitter = new RecursiveCharacterTextSplitter($logger);
  }

  public function testSplitTextNoChunkingNeeded(): void {
    $text = 'Short text.';
    $chunks = $this->splitter->splitText($text);

    $this->assertCount(1, $chunks);
    $this->assertEquals($text, $chunks[0]);
  }

  public function testSplitTextMultipleChunks(): void {
    $this->splitter->configure([
      'chunk_size' => 100,
      'chunk_overlap' => 20,
    ]);

    $text = str_repeat('This is a test sentence. ', 50);
    $chunks = $this->splitter->splitText($text);

    $this->assertGreaterThan(1, count($chunks));

    foreach ($chunks as $chunk) {
      $this->assertLessThanOrEqual(100, mb_strlen($chunk, 'UTF-8'));
    }
  }

  public function testKeepSeparatorStrategies(): void {
    $text = "Sentence one. Sentence two. Sentence three.";

    // Test KeepSeparator::End
    $this->splitter->configure([
      'separators' => ['. '],
      'keep_separator' => KeepSeparator::End,
      'chunk_size' => 20,
      // Keep the overlap below the chunk size (the default is 200).
      'chunk_overlap' => 0,
    ]);

    $chunks = $this->splitter->splitText($text);

    foreach ($chunks as $chunk) {
      if (strlen($chunk) > 0) {
        $this->assertStringEndsWith('.', trim($chunk));
      }
    }
  }

  public function testForLanguage(): void {
    $phpCode = "<?php\n\nfunction test() {\n  return true;\n}\n\nclass Foo {}";

    $phpSplitter = $this->splitter->forLanguage('php', [
      'chunk_size' => 50,
      // Keep the overlap below the chunk size (the default is 200).
      'chunk_overlap' => 0,
    ]);

    $chunks = $phpSplitter->splitText($phpCode);

    $this->assertGreaterThan(1, count($chunks));
  }

  public function testChunkOverlap(): void {
    $this->splitter->configure([
      'chunk_size' => 50,
      'chunk_overlap' => 10,
      'separators' => [' '],
      'keep_separator' => KeepSeparator::No,
    ]);

    // Distinct words, so that repeated text can only come from the overlap.
    $words = array_map(fn (int $i): string => sprintf('w%02d', $i), range(1, 50));
    $chunks = $this->splitter->splitText(implode(' ', $words));

    $this->assertGreaterThan(1, count($chunks));

    // Each chunk should start with a word already present in the previous one.
    for ($i = 0; $i < count($chunks) - 1; $i++) {
      $firstWord = explode(' ', trim($chunks[$i + 1]))[0];
      $this->assertStringContainsString($firstWord, $chunks[$i]);
    }
  }

  public function testReassembleChunks(): void {
    $text = str_repeat('Test sentence. ', 100);

    $this->splitter->configure([
      'chunk_size' => 200,
      'chunk_overlap' => 20,
    ]);

    $chunks = $this->splitter->splitText($text);
    $reassembled = $this->splitter->reassembleChunks($chunks);

    // Length should be similar (accounting for overlap removal)
    $this->assertGreaterThan(
      mb_strlen($text, 'UTF-8') * 0.9,
      mb_strlen($reassembled, 'UTF-8')
    );
  }

  public function testUtf8Support(): void {
    $text = str_repeat('Hello 世界! ', 200);

    // Keep the overlap below the chunk size (the default is 200).
    $this->splitter->configure(['chunk_size' => 100, 'chunk_overlap' => 0]);
    $chunks = $this->splitter->splitText($text);

    foreach ($chunks as $chunk) {
      $this->assertLessThanOrEqual(100, mb_strlen($chunk, 'UTF-8'));
    }
  }
}
```

### Unit Tests for TextSplitterValidator

```php
namespace Drupal\Tests\tmgmt_laratranslate\Unit\Service;

use Drupal\Tests\UnitTestCase;
use Drupal\tmgmt_laratranslate\Service\TextSplitterValidator;

class TextSplitterValidatorTest extends UnitTestCase {

  private TextSplitterValidator $validator;

  protected function setUp(): void {
    parent::setUp();

    $logger = $this->createMock(\Psr\Log\LoggerInterface::class);
    $this->validator = new TextSplitterValidator($logger);
  }

  public function testValidateValidHtml(): void {
    $html = '<div><p>Valid HTML</p></div>';
    $this->assertTrue($this->validator->validateHtml($html));
  }

  public function testValidateInvalidHtml(): void {
    // Stray </span>: a missing </p> alone may be implied silently by the parser.
    $html = '<div><p>Invalid HTML</span></p></div>';
    $this->assertFalse($this->validator->validateHtml($html));
  }

  public function testValidateEmptyHtml(): void {
    $html = '';
    $this->assertTrue($this->validator->validateHtml($html));
  }

  public function testValidateComplexHtml(): void {
    $html = '
      <div class="content">
        <h1>Title</h1>
        <p>Paragraph with <strong>bold</strong> and <em>italic</em>.</p>
        <ul>
          <li>Item 1</li>
          <li>Item 2</li>
        </ul>
      </div>
    ';

    $this->assertTrue($this->validator->validateHtml($html));
  }
}
```

### Integration Tests

```php
namespace Drupal\Tests\tmgmt_laratranslate\Kernel;

use Drupal\KernelTests\KernelTestBase;
use Drupal\tmgmt_laratranslate\Service\RecursiveCharacterTextSplitter;
use Drupal\tmgmt_laratranslate\Service\TextSplitterValidator;

class TextChunkingIntegrationTest extends KernelTestBase {

  protected static $modules = ['tmgmt_laratranslate'];

  public function testServicesAreRegistered(): void {
    $splitter = $this->container->get(RecursiveCharacterTextSplitter::class);
    $validator = $this->container->get(TextSplitterValidator::class);

    $this->assertInstanceOf(RecursiveCharacterTextSplitter::class, $splitter);
    $this->assertInstanceOf(TextSplitterValidator::class, $validator);
  }

  public function testEndToEndChunkingAndValidation(): void {
    $splitter = $this->container->get(RecursiveCharacterTextSplitter::class);
    $validator = $this->container->get(TextSplitterValidator::class);

    // Create long HTML content
    $html = '<div>' . str_repeat('<p>Test paragraph.</p>', 500) . '</div>';

    // Split using HTML language mode
    $htmlSplitter = $splitter->forLanguage('html', [
      'chunk_size' => 1000,
      'chunk_overlap' => 100,
    ]);

    $chunks = $htmlSplitter->splitText($html);

    $this->assertGreaterThan(1, count($chunks));

    // Validate final reassembled result
    $reassembled = $splitter->reassembleChunks($chunks);
    $this->assertTrue($validator->validateHtml($reassembled));
  }
}
```

---

## Performance Considerations

### Memory Usage

The recursive splitter is memory-efficient:

- Processes text in chunks, not entire document at once
- No DOM parsing (unless using validator)
- Minimal overhead from tag stack (language-specific mode)

### CPU Usage

Algorithm complexity:
- **Best case:** O(n) - single separator matches entire text
- **Worst case:** O(n × m) - where n = text length, m = number of separators
- **Typical case:** O(n) - usually finds match early in separator list

### Optimization Tips

**1. Choose appropriate chunk size:**
```php
// Larger chunks = fewer API calls but risk of hitting limits
$splitter->configure(['chunk_size' => 9900]);

// Smaller chunks = more API calls but safer
$splitter->configure(['chunk_size' => 5000]);
```

**2. Optimize separator order:**
```php
// Put most likely separators first
$splitter->configure([
  'separators' => ["\n\n", "\n", " ", ""], // Paragraphs first
]);
```

**3. Minimize overlap for performance:**
```php
// Less overlap = fewer redundant characters
$splitter->configure(['chunk_overlap' => 100]); // vs 500
```

**4. Avoid unnecessary validation:**
```php
// ❌ Don't validate every chunk
foreach ($chunks as $chunk) {
  $validator->validateHtml($chunk); // Slow!
}

// ✅ Only validate final result if needed
$result = $splitter->reassembleChunks($chunks);
if (!$validator->validateHtml($result)) {
  // Log warning
}
```

---

## Comparison with Previous Approaches

### Previous Approach: Simple Sentence Splitting

**From `TEXT_CHUNKING_IMPLEMENTATION.md`:**

```php
// Old approach: Simple regex split
$sentences = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

$chunks = [];
$currentChunk = '';

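foreach ($sentences as $sentence) {
  // Start a new chunk when adding this sentence would exceed the limit.
  if ($currentChunk !== '' && mb_strlen($currentChunk . $sentence, 'UTF-8') > 9900) {
    $chunks[] = rtrim($currentChunk);
    $currentChunk = '';
  }
  $currentChunk .= $sentence . ' ';
}

// Flush the last chunk, which the loop itself never appends.
if ($currentChunk !== '') {
  $chunks[] = rtrim($currentChunk);
}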
```

**Limitations:**
- ❌ No hierarchical splitting (only sentences)
- ❌ No overlap for context
- ❌ No language-specific handling
- ❌ No HTML-aware splitting
- ❌ Fixed separator strategy

### Current Approach: RecursiveCharacterTextSplitter

```php
// New approach: Recursive hierarchical splitting
$splitter->configure([
  'chunk_size' => 9900,
  'chunk_overlap' => 200,           // ← Context preservation
  'separators' => ["\n\n", "\n", ". ", " ", ""], // ← Hierarchical
  'keep_separator' => KeepSeparator::End,
]);

$chunks = $splitter->splitText($text);
```

**Advantages:**
- ✅ Hierarchical splitting (paragraphs → sentences → words → chars)
- ✅ Chunk overlap preserves context
- ✅ Language-specific modes (PHP, JavaScript, HTML, Markdown, etc.)
- ✅ Flexible separator strategies (Yes, No, Start, End)
- ✅ Regex support for advanced patterns
- ✅ Battle-tested algorithm (from LangChain)
- ✅ Highly configurable and extensible

### Comparison Table

| Feature | Simple Sentence Split | RecursiveCharacterTextSplitter |
|---------|----------------------|-------------------------------|
| Hierarchical splitting | ❌ | ✅ |
| Context preservation (overlap) | ❌ | ✅ |
| Language-specific modes | ❌ | ✅ (9 modes: 7 programming languages, Markdown, HTML) |
| Configurable separators | ❌ | ✅ |
| Separator strategies | Fixed | 4 strategies (Yes/No/Start/End) |
| Regex support | Limited | ✅ Full support |
| Custom length functions | ❌ | ✅ |
| UTF-8 support | ✅ | ✅ |
| Code complexity | Low | Medium |
| Flexibility | Low | High |
| Production-ready | For simple cases | For all cases |

---

## Conclusion

The `RecursiveCharacterTextSplitter` and `TextSplitterValidator` services provide a robust, flexible, and production-ready solution for handling the Lara Translate API's 10k character limit.

### Key Takeaways

1. **RecursiveCharacterTextSplitter** is the core service for text chunking
2. **Hierarchical splitting** preserves semantic units (paragraphs → sentences → words)
3. **Chunk overlap** maintains translation context across boundaries
4. **Language-specific modes** optimize for different content types
5. **Highly configurable** via fluent configuration interface
6. **UTF-8 safe** with proper multibyte character handling
7. **TextSplitterValidator** provides optional HTML validation
8. **Battle-tested** algorithm based on LangChain's implementation

### Next Steps

1. **Integration**: Integrate with `LaraTranslator` plugin
2. **Testing**: Comprehensive unit and integration tests
3. **Documentation**: Update module README with chunking info
4. **Configuration**: Add admin UI for chunk size configuration (optional)
5. **Monitoring**: Add logging for chunk statistics and performance

### References

- **LangChain Implementation**: [`RecursiveCharacterTextSplitter` in `character.py`](https://github.com/langchain-ai/langchain/blob/master/libs/text-splitters/langchain_text_splitters/character.py)
- **Issue Tracker**: [#3552783](https://www.drupal.org/project/tmgmt_laratranslate/issues/3552783)
- **Previous Documentation**:
  - `TEXT_CHUNKING_IMPLEMENTATION_OLD.md` - Initial analysis and design
  - `TEXT_CHUNKING_IMPLEMENTATION.md` - Simplified sentence-boundary approach

---

**Document Version:** 1.0
**Author:** TMGMT LaraTranslate Development Team
