Commit 9a2080e

feat: add deduplicate package for document similarity analysis
- Add @docen/deduplicate package with Levenshtein distance based text comparison
- Implement extractParagraphs, calculateSimilarity, findDuplicates functions
- Add compareDocuments for cross-document similarity detection
- Add findMostSimilar for finding best match from candidates
- Support multilingual text (English, Chinese) with configurable options
- Add comprehensive test suite with 50% Chinese test cases
- Update root README.md with package documentation links
- Configure @funish/basis build system for the new package
1 parent b4cd152 commit 9a2080e

11 files changed

Lines changed: 1246 additions & 5 deletions

README.md

Lines changed: 5 additions & 4 deletions
@@ -9,10 +9,11 @@
99
1010
## Packages
1111

12-
- **[docen](./packages/docen)** - Universal document converter with unified API for Markdown, HTML, and DOCX transformations
13-
- **[@docen/extensions](./packages/extensions)** - Comprehensive TipTap extension collection with full TypeScript types
14-
- **[@docen/export-docx](./packages/export-docx)** - Export TipTap/ProseMirror content to Microsoft Word DOCX format
15-
- **[@docen/import-docx](./packages/import-docx)** - Import Microsoft Word DOCX files to TipTap/ProseMirror content
12+
- **[docen](./packages/docen/README.md)** - Universal document converter with unified API for Markdown, HTML, and DOCX transformations
13+
- **[@docen/extensions](./packages/extensions/README.md)** - Comprehensive TipTap extension collection with full TypeScript types
14+
- **[@docen/export-docx](./packages/export-docx/README.md)** - Export TipTap/ProseMirror content to Microsoft Word DOCX format
15+
- **[@docen/import-docx](./packages/import-docx/README.md)** - Import Microsoft Word DOCX files to TipTap/ProseMirror content
16+
- **[@docen/deduplicate](./packages/deduplicate/README.md)** - Document deduplication and similarity analysis utilities
1617

1718
## Quick Start
1819

packages/deduplicate/README.md

Lines changed: 357 additions & 0 deletions
@@ -0,0 +1,357 @@
1+
# @docen/deduplicate
2+
3+
![npm version](https://img.shields.io/npm/v/@docen/deduplicate)
4+
![npm downloads](https://img.shields.io/npm/dw/@docen/deduplicate)
5+
![npm license](https://img.shields.io/npm/l/@docen/deduplicate)
6+
7+
> Document deduplication and similarity analysis utilities for Tiptap/ProseMirror JSON content.
8+
9+
## Features
10+
11+
- 🔍 **Duplicate Detection** - Find duplicate or similar paragraphs within documents
12+
- 📊 **Similarity Calculation** - Calculate text similarity ratios from 0 to 1 (0-100%) using Levenshtein distance
13+
- 🔗 **Cross-Document Comparison** - Compare two documents and find similar paragraphs
14+
- 🎯 **Smart Matching** - Find the most similar text from a list of candidates
15+
- 🌐 **Multilingual Support** - Works with English, Chinese, and other languages
16+
- ⚙️ **Configurable Options** - Adjust similarity thresholds, whitespace handling, and case sensitivity
17+
- 🚀 **High Performance** - Optimized algorithms for fast similarity calculations
18+
- 🔒 **Full Type Safety** - Comprehensive TypeScript definitions for all functions
19+
20+
## Installation
21+
22+
```bash
23+
# Install with npm
24+
$ npm install @docen/deduplicate
25+
26+
# Install with yarn
27+
$ yarn add @docen/deduplicate
28+
29+
# Install with pnpm
30+
$ pnpm add @docen/deduplicate
31+
```
32+
33+
## Quick Start
34+
35+
```typescript
36+
import { findDuplicates } from "@docen/deduplicate";
37+
38+
// Your Tiptap/ProseMirror editor content
39+
const document = {
40+
type: "doc",
41+
content: [
42+
{
43+
type: "paragraph",
44+
content: [
45+
{ type: "text", text: "机器学习是人工智能的一个重要分支。" },
46+
],
47+
},
48+
{
49+
type: "paragraph",
50+
content: [
51+
{ type: "text", text: "机器学习是人工智能的一个重要分支。" },
52+
],
53+
},
54+
{
55+
type: "paragraph",
56+
content: [
57+
{ type: "text", text: "深度学习是机器学习的子领域。" },
58+
],
59+
},
60+
],
61+
};
62+
63+
// Find duplicate paragraphs (85% similarity threshold)
64+
const duplicates = findDuplicates(document, { threshold: 0.85 });
65+
66+
console.log(duplicates);
67+
// Output:
68+
// [
69+
// {
70+
// index: 0,
71+
// text: "机器学习是人工智能的一个重要分支。",
72+
// duplicates: [1],
73+
// similarities: [1.0]
74+
// }
75+
// ]
76+
```
77+
78+
## API Reference
79+
80+
### `extractParagraphs(doc)`
81+
82+
Extracts all paragraph text from a Tiptap JSON document.
83+
84+
**Parameters:**
85+
- `doc: JSONContent` - Tiptap/ProseMirror document
86+
87+
**Returns:** `string[]` - Array of paragraph texts
88+
89+
```typescript
90+
import { extractParagraphs } from "@docen/deduplicate";
91+
92+
const paragraphs = extractParagraphs(document);
93+
// ["机器学习是人工智能的一个重要分支。", "深度学习是机器学习的子领域。"]
94+
```
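
The extraction step can be pictured as a depth-first walk over the document tree. The following is a minimal sketch under stated assumptions, not the package's implementation: `JSONNode`, `nodeText`, and `collectParagraphs` are hypothetical names, and whether the real `extractParagraphs` keeps empty paragraphs or descends into nested blocks is not confirmed by this README.

```typescript
// Minimal structural type for Tiptap/ProseMirror JSON nodes.
// Hypothetical: the package exports its own JSONContent type.
interface JSONNode {
  type?: string;
  text?: string;
  content?: JSONNode[];
}

// Concatenate the text of every text node below `node`.
function nodeText(node: JSONNode): string {
  if (node.type === "text") return node.text ?? "";
  return (node.content ?? []).map((child) => nodeText(child)).join("");
}

// Depth-first walk that records one string per paragraph node.
function collectParagraphs(node: JSONNode, out: string[] = []): string[] {
  if (node.type === "paragraph") {
    out.push(nodeText(node));
    return out;
  }
  for (const child of node.content ?? []) collectParagraphs(child, out);
  return out;
}

const sampleDoc: JSONNode = {
  type: "doc",
  content: [
    {
      type: "paragraph",
      content: [
        { type: "text", text: "First " },
        { type: "text", text: "paragraph." },
      ],
    },
    {
      type: "blockquote",
      content: [
        { type: "paragraph", content: [{ type: "text", text: "Nested paragraph." }] },
      ],
    },
  ],
};

console.log(collectParagraphs(sampleDoc));
// ["First paragraph.", "Nested paragraph."]
```

Adjacent text nodes (for example, runs with different marks) are joined so that each paragraph yields a single string.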
95+
96+
### `calculateSimilarity(text1, text2, options?)`
97+
98+
Calculates similarity ratio between two texts using Levenshtein distance.
99+
100+
**Parameters:**
101+
- `text1: string` - First text
102+
- `text2: string` - Second text
103+
- `options?: DeduplicateOptions` - Configuration options
104+
105+
**Returns:** `number` - Similarity ratio between 0 (completely different) and 1 (identical)
106+
107+
```typescript
108+
import { calculateSimilarity } from "@docen/deduplicate";
109+
110+
const similarity = calculateSimilarity(
111+
"机器学习是人工智能的一个重要分支。",
112+
"机器学习是人工智能的重要分支。",
113+
{ ignoreCase: true, ignoreWhitespace: true }
114+
);
115+
116+
console.log(similarity); // 0.94 (94% similar)
117+
```
118+
119+
**Options:**
120+
121+
```typescript
122+
interface DeduplicateOptions {
123+
threshold?: number; // Similarity threshold (0-1), default: 0.85
124+
ignoreWhitespace?: boolean; // Ignore whitespace differences, default: true
125+
ignoreCase?: boolean; // Ignore case differences, default: true
126+
}
127+
```
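
To make the options concrete, here is a minimal sketch of one way a Levenshtein-based ratio with these two flags could work. It is an illustration, not the package's code: `editDistance`, `normalize`, and `similarity` are hypothetical names, collapsing whitespace runs to a single space is an assumed reading of `ignoreWhitespace`, and dividing by the longer length is only one of several common normalizations, so exact scores may differ from what `calculateSimilarity` returns.

```typescript
interface SimilarityOptions {
  ignoreWhitespace?: boolean; // assumed default: true
  ignoreCase?: boolean; // assumed default: true
}

// Two-row dynamic-programming Levenshtein distance over Unicode code points.
function editDistance(a: string, b: string): number {
  let s = Array.from(a);
  let t = Array.from(b);
  if (s.length < t.length) [s, t] = [t, s]; // keep the shorter text in the row
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[t.length];
}

// Apply the option flags before comparing (assumed behavior).
function normalize(text: string, opts: SimilarityOptions): string {
  let out = text;
  if (opts.ignoreWhitespace ?? true) out = out.replace(/\s+/g, " ").trim();
  if (opts.ignoreCase ?? true) out = out.toLowerCase();
  return out;
}

// Ratio in [0, 1]: 1 minus the distance divided by the longer length.
function similarity(a: string, b: string, opts: SimilarityOptions = {}): number {
  const x = normalize(a, opts);
  const y = normalize(b, opts);
  const longest = Math.max(Array.from(x).length, Array.from(y).length);
  if (longest === 0) return 1; // two empty texts count as identical
  return 1 - editDistance(x, y) / longest;
}

console.log(editDistance("kitten", "sitting")); // 3
console.log(similarity("Hello World", "hello   world")); // 1
console.log(similarity("abcd", "abcf")); // 0.75
```

Comparing code points rather than UTF-16 code units keeps characters outside the Basic Multilingual Plane from counting as two edits.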
128+
129+
### `findDuplicates(doc, options?)`
130+
131+
Finds duplicate/similar paragraphs in a document.
132+
133+
**Parameters:**
134+
- `doc: JSONContent` - Tiptap/ProseMirror document
135+
- `options?: DeduplicateOptions` - Configuration options
136+
137+
**Returns:** `DuplicateMatch[]` - Array of duplicate matches
138+
139+
```typescript
140+
import { findDuplicates } from "@docen/deduplicate";
141+
142+
const duplicates = findDuplicates(document, {
143+
threshold: 0.85,
144+
ignoreWhitespace: true,
145+
ignoreCase: true,
146+
});
147+
148+
// Result type:
149+
interface DuplicateMatch {
150+
index: number; // Index of first occurrence
151+
text: string; // The paragraph text
152+
duplicates: number[]; // Indices of duplicate occurrences
153+
similarities: number[]; // Similarity scores for each duplicate
154+
}
155+
```
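
The shape of `DuplicateMatch` suggests a pairwise grouping pass. The sketch below shows one plausible way to produce it, assuming each later paragraph is attached to the first earlier paragraph it matches; `groupDuplicates` and `Match` are hypothetical names, and the scoring function is passed in so that any similarity measure (such as a Levenshtein ratio) can be used.

```typescript
interface Match {
  index: number; // index of the first occurrence
  text: string; // text of the first occurrence
  duplicates: number[]; // indices of later paragraphs that match it
  similarities: number[]; // score of each later paragraph
}

type Scorer = (a: string, b: string) => number;

// Compare every paragraph with the ones after it; a paragraph already
// recorded as a duplicate is not used as the start of a new group.
function groupDuplicates(paragraphs: string[], score: Scorer, threshold = 0.85): Match[] {
  const claimed = new Set<number>();
  const matches: Match[] = [];
  for (let i = 0; i < paragraphs.length; i++) {
    if (claimed.has(i)) continue;
    const match: Match = { index: i, text: paragraphs[i], duplicates: [], similarities: [] };
    for (let j = i + 1; j < paragraphs.length; j++) {
      if (claimed.has(j)) continue;
      const s = score(paragraphs[i], paragraphs[j]);
      if (s >= threshold) {
        match.duplicates.push(j);
        match.similarities.push(s);
        claimed.add(j);
      }
    }
    if (match.duplicates.length > 0) matches.push(match);
  }
  return matches;
}

// Stand-in scorer for the demo: exact equality only.
const exact: Scorer = (a, b) => (a === b ? 1 : 0);

const groups = groupDuplicates(
  ["First paragraph.", "Duplicate paragraph.", "Unique paragraph.", "Duplicate paragraph."],
  exact,
);
console.log(JSON.stringify(groups));
// [{"index":1,"text":"Duplicate paragraph.","duplicates":[3],"similarities":[1]}]
```

This approach performs up to p×(p−1)/2 comparisons for p paragraphs, which is worth keeping in mind for long documents.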
156+
157+
### `compareDocuments(doc1, doc2, options?)`
158+
159+
Compares two documents and finds similar paragraphs.
160+
161+
**Parameters:**
162+
- `doc1: JSONContent` - First document
163+
- `doc2: JSONContent` - Second document
164+
- `options?: DeduplicateOptions` - Configuration options
165+
166+
**Returns:** `DocumentComparison[]` - Array of similar paragraph pairs
167+
168+
```typescript
169+
import { compareDocuments } from "@docen/deduplicate";
170+
171+
const comparisons = compareDocuments(doc1, doc2, {
172+
threshold: 0.7, // Lower threshold for cross-document comparison
173+
});
174+
175+
// Result type:
176+
interface DocumentComparison {
177+
fromDoc1: { index: number; text: string };
178+
fromDoc2: { index: number; text: string };
179+
similarity: number;
180+
}
181+
```
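
Cross-document comparison can be sketched as a nested loop over the two paragraph lists. This is an assumption about the general approach rather than the package's code: `comparePairs` and `PairMatch` are hypothetical names, and `tokenJaccard` is a simple word-overlap scorer used only as a stand-in for the Levenshtein ratio.

```typescript
interface PairMatch {
  fromDoc1: { index: number; text: string };
  fromDoc2: { index: number; text: string };
  similarity: number;
}

// Score every paragraph of the first list against every paragraph of the
// second and keep the pairs at or above the threshold.
function comparePairs(
  first: string[],
  second: string[],
  score: (a: string, b: string) => number,
  threshold = 0.85,
): PairMatch[] {
  const pairs: PairMatch[] = [];
  first.forEach((textA, i) => {
    second.forEach((textB, j) => {
      const similarity = score(textA, textB);
      if (similarity >= threshold) {
        pairs.push({
          fromDoc1: { index: i, text: textA },
          fromDoc2: { index: j, text: textB },
          similarity,
        });
      }
    });
  });
  return pairs;
}

// Stand-in scorer: Jaccard overlap of lowercase whitespace-separated tokens.
function tokenJaccard(a: string, b: string): number {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  setA.forEach((token) => {
    if (setB.has(token)) shared++;
  });
  return shared / (setA.size + setB.size - shared);
}

const pairs = comparePairs(
  ["the cat sat on the mat", "completely unrelated text"],
  ["something else entirely", "the cat sat on a mat"],
  tokenJaccard,
  0.7,
);
console.log(pairs.map((p) => [p.fromDoc1.index, p.fromDoc2.index]));
// [[0, 1]]
```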
182+
183+
### `findMostSimilar(targetText, candidates, options?)`
184+
185+
Finds the most similar text from a list of candidates.
186+
187+
**Parameters:**
188+
- `targetText: string` - Target text to match
189+
- `candidates: string[]` - Array of candidate texts
190+
- `options?: DeduplicateOptions` - Configuration options
191+
192+
**Returns:** `MostSimilarResult | null` - Best match or null if no candidates
193+
194+
```typescript
195+
import { findMostSimilar } from "@docen/deduplicate";
196+
197+
const target = "人工智能的快速发展给各个行业带来了巨大的变化。";
198+
const candidates = [
199+
"区块链技术不断发展和影响全球金融部门。",
200+
"人工智能的快速增长正在以显著的方式改变不同行业。",
201+
"气候变化仍然是人类面临的最紧迫的挑战之一。",
202+
];
203+
204+
const result = findMostSimilar(target, candidates);
205+
206+
// Result:
207+
// {
208+
// text: "人工智能的快速增长正在以显著的方式改变不同行业。",
209+
// index: 1,
210+
// similarity: 0.33
211+
// }
212+
```
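
Selecting the best candidate amounts to an argmax over similarity scores. The sketch below illustrates this under stated assumptions: `pickMostSimilar` and `BestMatch` are hypothetical names, ties are resolved in favor of the earliest candidate (the package's tie-breaking is not documented here), and `charJaccard` is a stand-in scorer rather than the Levenshtein ratio.

```typescript
interface BestMatch {
  text: string;
  index: number;
  similarity: number;
}

// Return the highest-scoring candidate, or null when the list is empty.
function pickMostSimilar(
  target: string,
  candidates: string[],
  score: (a: string, b: string) => number,
): BestMatch | null {
  let best: BestMatch | null = null;
  for (let index = 0; index < candidates.length; index++) {
    const similarity = score(target, candidates[index]);
    // Strict ">" keeps the earliest candidate when scores tie.
    if (best === null || similarity > best.similarity) {
      best = { text: candidates[index], index, similarity };
    }
  }
  return best;
}

// Stand-in scorer: Jaccard overlap of the distinct characters in each text.
function charJaccard(a: string, b: string): number {
  const setA = new Set(Array.from(a));
  const setB = new Set(Array.from(b));
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  setA.forEach((ch) => {
    if (setB.has(ch)) shared++;
  });
  return shared / (setA.size + setB.size - shared);
}

const bestMatch = pickMostSimilar("kitten", ["sitting", "kitchen", "kit"], charJaccard);
console.log(bestMatch?.text, bestMatch?.index); // kitchen 1
console.log(pickMostSimilar("kitten", [], charJaccard)); // null
```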
213+
214+
### `distance(str1, str2)` & `closest(target, candidates)`
215+
216+
Calculate edit distance and find closest string matches.
217+
218+
```typescript
219+
import { distance, closest } from "@docen/deduplicate";
220+
221+
// Calculate edit distance between two strings
222+
const dist = distance("kitten", "sitting");
223+
console.log(dist); // 3
224+
225+
// Find the closest string from candidates
226+
const closestStr = closest("kitten", ["kitchen", "sitting", "kit"]);
227+
console.log(closestStr); // "kitchen"
228+
```
229+
230+
## Usage Examples
231+
232+
### Basic Duplicate Detection
233+
234+
```typescript
235+
import { findDuplicates } from "@docen/deduplicate";
236+
237+
const document = {
238+
type: "doc",
239+
content: [
240+
{ type: "paragraph", content: [{ type: "text", text: "First paragraph." }] },
241+
{ type: "paragraph", content: [{ type: "text", text: "Duplicate paragraph." }] },
242+
{ type: "paragraph", content: [{ type: "text", text: "Unique paragraph." }] },
243+
{ type: "paragraph", content: [{ type: "text", text: "Duplicate paragraph." }] },
244+
],
245+
};
246+
247+
const duplicates = findDuplicates(document);
248+
249+
duplicates.forEach((dup) => {
250+
console.log(`Found "${dup.text}" at index ${dup.index}`);
251+
console.log(` Duplicates at: ${dup.duplicates.join(", ")}`);
252+
console.log(` Similarities: ${dup.similarities.map(s => (s * 100).toFixed(1) + "%").join(", ")}`);
253+
});
254+
```
255+
256+
### Document Comparison for Plagiarism Detection
257+
258+
```typescript
259+
import { compareDocuments } from "@docen/deduplicate";
260+
261+
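const studentEssay = parseEssay(studentSubmission); // parseEssay: placeholder for your own converter to Tiptap JSON
262+
const referenceEssay = parseEssay(referenceMaterial); // studentSubmission / referenceMaterial: your raw inputs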
263+
264+
const comparisons = compareDocuments(studentEssay, referenceEssay, {
265+
threshold: 0.75,
266+
});
267+
268+
comparisons.forEach((comp) => {
269+
console.log(`Suspicious similarity detected:`);
270+
console.log(` Student: "${comp.fromDoc1.text}"`);
271+
console.log(` Reference: "${comp.fromDoc2.text}"`);
272+
console.log(` Similarity: ${(comp.similarity * 100).toFixed(1)}%`);
273+
});
274+
```
275+
276+
### Custom Similarity Thresholds
277+
278+
```typescript
279+
import { findDuplicates, calculateSimilarity } from "@docen/deduplicate";
280+
281+
// High precision (fewer false positives)
282+
const exactDuplicates = findDuplicates(document, { threshold: 0.95 });
283+
284+
// High recall (catch more potential duplicates)
285+
const looseMatches = findDuplicates(document, { threshold: 0.70 });
286+
287+
// Manual similarity calculation
288+
const similarity = calculateSimilarity(
289+
"The quick brown fox jumps over the lazy dog.",
290+
"The quick brown cat jumps over the lazy dog.",
291+
{ ignoreCase: true, ignoreWhitespace: true }
292+
);
293+
294+
console.log(`Similarity: ${(similarity * 100).toFixed(1)}%`);
295+
```
296+
297+
### Language-Specific Options
298+
299+
```typescript
300+
import { calculateSimilarity } from "@docen/deduplicate";
301+
302+
// English: Case-insensitive comparison
303+
const enSimilarity = calculateSimilarity(
304+
"Hello World",
305+
"hello world",
306+
{ ignoreCase: true }
307+
);
308+
// 1.0 (100% similar)
309+
310+
// Chinese: No case sensitivity needed
311+
const zhSimilarity = calculateSimilarity(
312+
"机器学习是人工智能的重要分支",
313+
"机器学习是人工智能的重要分支",
314+
{ ignoreCase: true }
315+
);
316+
// 1.0 (100% similar)
317+
318+
// Whitespace handling
319+
const wsSimilarity = calculateSimilarity(
320+
"机器学习  是 一个 领域",
321+
"机器学习 是 一个 领域",
322+
{ ignoreWhitespace: true }
323+
);
324+
// 1.0 (100% similar, full-width vs half-width spaces)
325+
```
326+
327+
## Performance
328+
329+
Optimized algorithms for efficient document processing:
330+
331+
- **Time Complexity:** O(n×m) per text pair, where n and m are the lengths of the two texts
332+
- **Space Complexity:** O(min(n,m)) per comparison
333+
- **Scalability:** Handles large documents efficiently
334+
335+
For very large documents, consider processing in chunks or using Web Workers.
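
As a rough illustration of chunked processing, the helper below splits a paragraph list into fixed-size batches that could be handled one at a time or handed to separate workers. `chunk` is a hypothetical helper, not part of the package, and duplicates that fall into different batches would still need a separate cross-batch comparison.

```typescript
// Split `items` into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const batches = chunk(["p1", "p2", "p3", "p4", "p5"], 2);
console.log(JSON.stringify(batches));
// [["p1","p2"],["p3","p4"],["p5"]]
```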
336+
337+
## TypeScript Types
338+
339+
All functions are fully typed with TypeScript:
340+
341+
```typescript
342+
import type {
343+
DeduplicateOptions,
344+
DuplicateMatch,
345+
DocumentComparison,
346+
MostSimilarResult,
347+
JSONContent,
348+
} from "@docen/deduplicate";
349+
```
350+
351+
## Contributing
352+
353+
Contributions are welcome! Please read our [Contributor Covenant](https://www.contributor-covenant.org/version/2/1/code_of_conduct/) and submit pull requests to the [main repository](https://github.com/DemoMacro/docen).
354+
355+
## License
356+
357+
- [MIT](LICENSE) © [Demo Macro](https://imst.xyz/)
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
import { defineBuildConfig } from "@funish/basis/config";
2+
3+
export default defineBuildConfig({
4+
entries: ["src/index"],
5+
});
