23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- `DictionaryBuilder` for build-time dictionary construction with `normalizeKey` injection
- `Dictionary::fromRows()` as the canonical dictionary factory
- `tests/DictionaryTest.php` unit tests for dictionary authoring and validation

### Changed

- Dictionary author format is now `term`, `category`, and `severity` only; `normalized` is derived at build time
- `Entry::fromRow()` replaces `Entry::fromArray()` for internal row construction
- `data/tr.php` seed dictionary migrated to author-only rows
- `TurkishProfile` builds its dictionary via `NormalizationPipeline` so index keys match runtime normalization

### Removed

- `Dictionary::fromArray()` — use `Dictionary::fromRows($rows, $normalizeKey)` instead
- `Entry::fromArray()`

### Breaking changes (v0.2)

- `Dictionary::fromArray()` removed; custom profiles must use `Dictionary::fromRows()` with a `normalizeKey` callable
- Author dictionary rows no longer accept a `normalized` field

## [0.1.0] - 2026-07-01

Initial public release of VerbaGuard — a framework-independent PHP moderation engine for language-aware text analysis.
17 changes: 13 additions & 4 deletions README.md
@@ -178,12 +178,15 @@ Use `VerbaGuard::forLanguages()` with a custom profile when you need a productio

## Language Profiles

A language profile bundles a dictionary and profile-specific normalizers:
A language profile bundles a dictionary and profile-specific normalizers.

**v0.2 dictionary authoring:** write only `term`, `category`, and `severity` in dictionary rows. Do not author `normalized` — it is derived at build time via `Dictionary::fromRows()` and a `normalizeKey` callable that must match the profile's runtime normalization chain.

```php
use VerbaGuard\Contracts\LanguageProfile;
use VerbaGuard\Dictionary\Dictionary;
use VerbaGuard\Normalizer\Normalizer;
use VerbaGuard\Pipeline\NormalizationPipeline;
use VerbaGuard\VerbaGuard;

final class ExampleProfile implements LanguageProfile
@@ -195,14 +198,20 @@ final class ExampleProfile implements LanguageProfile

public function dictionary(): Dictionary
{
return Dictionary::fromArray([
$rows = [
[
'term' => 'badword',
'normalized' => 'badword',
'category' => 'profanity',
'severity' => 'medium',
],
]);
];

$normalization = new NormalizationPipeline($this->normalizers());

return Dictionary::fromRows(
$rows,
static fn (string $term): string => $normalization->normalize($term),
);
}

public function normalizers(): array
8 changes: 2 additions & 6 deletions data/tr.php
@@ -3,39 +3,35 @@
declare(strict_types=1);

/**
* Turkish seed dictionary for VerbaGuard v0.1.
* Turkish seed dictionary for VerbaGuard. Author rows contain term, category,
* and severity only; normalized keys are derived at build time.
*
* Contains a minimal set of offensive terms for testing purposes only.
* See README.md offensive language notice.
*/
return [
[
'term' => 'amk',
'normalized' => 'amk',
'category' => 'profanity',
'severity' => 'medium',
],
[
'term' => 'aq',
'normalized' => 'aq',
'category' => 'profanity',
'severity' => 'low',
],
[
'term' => 'siktir',
'normalized' => 'siktir',
'category' => 'profanity',
'severity' => 'high',
],
[
'term' => 'orospu',
'normalized' => 'orospu',
'category' => 'profanity',
'severity' => 'high',
],
[
'term' => 'mal',
'normalized' => 'mal',
'category' => 'insult',
'severity' => 'low',
],
50 changes: 40 additions & 10 deletions docs/specification.md
@@ -86,26 +86,55 @@ interface LanguageProfile

Dictionaries are plain PHP arrays loaded from files such as `data/tr.php`.

Each entry contains:
### Author format (v0.2+)

| Field | Description |
|-------------|--------------------------------------------------|
| `term` | Canonical dictionary term |
| `normalized`| Normalized form used for matching |
| `category` | Semantic category, e.g. `profanity`, `insult` |
| `severity` | One of `clean`, `low`, `medium`, `high` |
Each author row contains only user-written canonical fields:

Example:
| Field | Description |
|------------|-----------------------------------------------|
| `term` | Canonical dictionary term |
| `category` | Semantic category, e.g. `profanity`, `insult` |
| `severity` | One of `clean`, `low`, `medium`, `high` |

Do **not** include `normalized` in author rows. It is derived at dictionary build time.

Example author row:

```php
[
'term' => 'amk',
'normalized' => 'amk',
'category' => 'profanity',
'severity' => 'medium',
]
```
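
The row shape described above can be checked mechanically. The following is a minimal sketch, not the library's validator: `AUTHOR_FIELDS` and `authorRowProblems()` are hypothetical names chosen for illustration, and the function returns a list of problems where the `DictionaryBuilder` in this change throws `InvalidArgumentException`.

```php
<?php

declare(strict_types=1);

// Hypothetical constant: the three fields an author row may contain.
const AUTHOR_FIELDS = ['term', 'category', 'severity'];

/**
 * Illustrative check only (not part of VerbaGuard).
 *
 * @param array<string, string> $row
 * @return list<string> human-readable problems; empty when the row is valid
 */
function authorRowProblems(array $row): array
{
    $problems = [];

    // Any key outside the author format, including a legacy `normalized`.
    foreach (array_diff(array_keys($row), AUTHOR_FIELDS) as $field) {
        $problems[] = "unexpected field: {$field}";
    }

    // Any required author field that is absent.
    foreach (array_diff(AUTHOR_FIELDS, array_keys($row)) as $field) {
        $problems[] = "missing field: {$field}";
    }

    return $problems;
}

$ok = authorRowProblems([
    'term' => 'amk',
    'category' => 'profanity',
    'severity' => 'medium',
]);

$legacy = authorRowProblems([
    'term' => 'amk',
    'normalized' => 'amk',
    'category' => 'profanity',
    'severity' => 'medium',
]);
```

Here `$ok` is empty, while `$legacy` reports the hand-written `normalized` key as unexpected.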

### Build-time construction

Use `Dictionary::fromRows()` with a `normalizeKey` callable. The callable must apply the same normalization chain the matcher uses at runtime (typically the profile's `NormalizationPipeline`).

```php
Dictionary::fromRows(
rows: $rows,
normalizeKey: fn (string $term): string => $normalization->normalize($term),
);
```

At build time, each `term` is passed through `normalizeKey` to produce the derived `normalized` lookup key stored on `Entry`.
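
To make the reason for sharing one callable concrete, here is a self-contained sketch that does not use the VerbaGuard classes. It assumes a stand-in normalizer (`strtolower` plus `trim`) in place of the profile's `NormalizationPipeline`, and a plain array in place of `Dictionary`. The duplicate-key check mirrors what `DictionaryBuilder::build()` does in this change.

```php
<?php

declare(strict_types=1);

// Assumption: a stand-in for the profile's normalization chain.
$normalizeKey = static fn (string $term): string => strtolower(trim($term));

$rows = [
    ['term' => 'BadWord', 'category' => 'profanity', 'severity' => 'medium'],
    ['term' => 'Insult ', 'category' => 'insult', 'severity' => 'low'],
];

// Build time: derive each lookup key from the authored term.
$index = [];
foreach ($rows as $i => $row) {
    $key = $normalizeKey($row['term']);

    // Two terms that normalize to the same key would shadow each other.
    if (isset($index[$key])) {
        throw new InvalidArgumentException(
            sprintf('Duplicate normalized key "%s" at row %d.', $key, $i),
        );
    }

    $index[$key] = $row;
}

// Run time: normalize the input with the same callable before the lookup.
$hit = $index[$normalizeKey('  BADWORD ')] ?? null;

// Raw, un-normalized input does not match any derived key.
$miss = $index['BadWord'] ?? null;
```

Because keys and lookups pass through the same function, a change to the normalization chain cannot leave stale hand-written keys behind.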

### Runtime `Entry` fields

| Field | Source | Description |
|-------------|----------|--------------------------------------------------|
| `term` | Author | Canonical dictionary term |
| `category` | Author | Semantic category |
| `severity` | Author | Severity level |
| `normalized`| Derived | Build-time normalized form used for matching |

### Breaking changes in v0.2

- `Dictionary::fromArray()` removed — use `Dictionary::fromRows()` instead.
- Author dictionary rows no longer accept a `normalized` field.
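
Migrating an existing v0.1 dictionary file mostly means dropping the hand-written key. A possible sketch follows; `toAuthorRow()` is a hypothetical helper, not something the library ships.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical migration helper (not part of VerbaGuard): removes the
 * hand-written `normalized` key from a v0.1 row and keeps the rest as-is.
 *
 * @param array<string, string> $legacyRow
 * @return array<string, string>
 */
function toAuthorRow(array $legacyRow): array
{
    unset($legacyRow['normalized']);

    return $legacyRow;
}

$legacyRows = [
    ['term' => 'amk', 'normalized' => 'amk', 'category' => 'profanity', 'severity' => 'medium'],
    ['term' => 'mal', 'normalized' => 'mal', 'category' => 'insult', 'severity' => 'low'],
];

$authorRows = array_map('toAuthorRow', $legacyRows);
```

The resulting `$authorRows` can then be passed to `Dictionary::fromRows()` together with the profile's `normalizeKey` callable.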

---

## Normalization stages
@@ -257,7 +286,8 @@ The final score is the sum of all unique match severities.
## Future compatibility notes

- New normalization stages belong in the global pipeline unless language-specific.
- Dictionary entries should remain array-based so existing language files keep working.
- Dictionary author rows remain plain PHP arrays with `term`, `category`, and `severity`.
- Derived fields such as `normalized` are produced at build time via `Dictionary::fromRows()`.
- Additional severity levels or scoring policies require explicit interfaces in future versions.
- Framework adapters should live in separate packages depending on this core library.
- Matcher changes are bug-fix only while frozen; see `FOUNDATION.md`.
11 changes: 3 additions & 8 deletions src/Dictionary/Dictionary.php
@@ -20,16 +20,11 @@ public function __construct(array $entries)
}

/**
* @param list<array{term: string, normalized: string, category: string, severity: string}> $rows
* @param list<array{term: string, category: string, severity: string}> $rows
*/
public static function fromArray(array $rows): self
public static function fromRows(array $rows, callable $normalizeKey): self
{
$entries = array_map(
static fn (array $row): Entry => Entry::fromArray($row),
$rows,
);

return new self($entries);
return (new DictionaryBuilder($normalizeKey))->build($rows);
}

public function find(string $normalized): ?Entry
116 changes: 116 additions & 0 deletions src/Dictionary/DictionaryBuilder.php
@@ -0,0 +1,116 @@
<?php

declare(strict_types=1);

namespace VerbaGuard\Dictionary;

use InvalidArgumentException;
use VerbaGuard\Severity;
use ValueError;

final class DictionaryBuilder
{
/** @var list<string> */
private const AUTHOR_FIELDS = ['term', 'category', 'severity'];

/** @var callable(string): string */
private $normalizeKey;

/**
* @param callable(string): string $normalizeKey
*/
public function __construct(callable $normalizeKey)
{
$this->normalizeKey = $normalizeKey;
}

/**
* @param list<array<string, string>> $rows
*/
public function build(array $rows): Dictionary
{
/** @var array<string, Entry> $byNormalized */
$byNormalized = [];

foreach ($rows as $index => $row) {
$this->assertAuthorRowShape($row, $index);

$term = $row['term'];
$category = $row['category'];
$severity = $row['severity'];

$this->assertNonEmptyString($term, 'term', $index);
$this->assertNonEmptyString($category, 'category', $index);
$this->assertValidSeverity($severity, $index);

$normalized = ($this->normalizeKey)($term);

if (isset($byNormalized[$normalized])) {
throw new InvalidArgumentException(
sprintf('Duplicate normalized key "%s" at row %d.', $normalized, $index),
);
}

$byNormalized[$normalized] = Entry::fromRow(
[
'term' => $term,
'category' => $category,
'severity' => $severity,
],
$normalized,
);
}

return new Dictionary(array_values($byNormalized));
}

/**
* @param array<string, string> $row
*/
private function assertAuthorRowShape(array $row, int $index): void
{
if (array_key_exists('normalized', $row)) {
throw new InvalidArgumentException(
sprintf('Author dictionary rows must not include "normalized" (row %d).', $index),
);
}

foreach (array_keys($row) as $field) {
if (! in_array($field, self::AUTHOR_FIELDS, true)) {
throw new InvalidArgumentException(
sprintf('Unknown author field "%s" at row %d.', $field, $index),
);
}
}

foreach (self::AUTHOR_FIELDS as $field) {
if (! array_key_exists($field, $row)) {
throw new InvalidArgumentException(
sprintf('Missing required author field "%s" at row %d.', $field, $index),
);
}
}
}

private function assertNonEmptyString(string $value, string $field, int $index): void
{
if ($value === '') {
throw new InvalidArgumentException(
sprintf('Author field "%s" must not be empty at row %d.', $field, $index),
);
}
}

private function assertValidSeverity(string $severity, int $index): void
{
try {
Severity::fromString($severity);
} catch (ValueError $exception) {
throw new InvalidArgumentException(
sprintf('Invalid severity "%s" at row %d.', $severity, $index),
0,
$exception,
);
}
}
}
18 changes: 12 additions & 6 deletions src/Dictionary/Entry.php
@@ -6,6 +6,12 @@

final class Entry
{
/**
* @param string $term Author field — canonical dictionary term.
* @param string $normalized Derived field — build-time normalized lookup key.
* @param string $category Author field — semantic category.
* @param string $severity Author field — one of clean, low, medium, high.
*/
public function __construct(
public readonly string $term,
public readonly string $normalized,
@@ -15,15 +21,15 @@ public function __construct(
}

/**
* @param array{term: string, normalized: string, category: string, severity: string} $data
* @param array{term: string, category: string, severity: string} $row
*/
public static function fromArray(array $data): self
public static function fromRow(array $row, string $normalized): self
{
return new self(
$data['term'],
$data['normalized'],
$data['category'],
$data['severity'],
$row['term'],
$normalized,
$row['category'],
$row['severity'],
);
}
}
9 changes: 7 additions & 2 deletions src/Language/TurkishProfile.php
@@ -8,6 +8,7 @@
use VerbaGuard\Dictionary\Dictionary;
use VerbaGuard\Normalizer\Normalizer;
use VerbaGuard\Normalizer\TurkishNormalizer;
use VerbaGuard\Pipeline\NormalizationPipeline;

final class TurkishProfile implements LanguageProfile
{
@@ -21,9 +22,13 @@ public function code(): string
public function dictionary(): Dictionary
{
if ($this->dictionary === null) {
/** @var list<array{term: string, normalized: string, category: string, severity: string}> $rows */
/** @var list<array{term: string, category: string, severity: string}> $rows */
$rows = require dirname(__DIR__, 2) . '/data/tr.php';
$this->dictionary = Dictionary::fromArray($rows);
$normalization = new NormalizationPipeline($this->normalizers());
$this->dictionary = Dictionary::fromRows(
$rows,
static fn (string $term): string => $normalization->normalize($term),
);
}

return $this->dictionary;