feat(transcript): Implement dictionary-based capitalization and censorship #2104

AhmedAlian7 · 2026-02-09T18:11:55Z

[FIX]

In raising this pull request, I confirm the following (please check boxes):

I have read and understood the contributors guide.
I have checked that another pull request for this purpose does not exist.
I have considered, and confirmed that this submission will be valuable to others.
I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
I give this submission freely, and claim no ownership to its content.
I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

I have never used CCExtractor.
I have used CCExtractor just a couple of times.
I absolutely love CCExtractor, but have not contributed previously.
I am an active contributor to CCExtractor.

Description

closes #2103
This Pull Request introduces dictionary-based spelling correction and profanity censorship to the transcript encoder (ccx_encoders_transcript.c). It leverages the shared correct_spelling_and_censor_words helper function to process subtitle text before it is written to the transcript output. This aligns the transcript output quality with other subtitle formats.

Changes

Enhanced Transcript Processing: In write_cc_subtitle_as_transcript, the code now calls correct_spelling_and_censor_words to apply capitalization rules and censorship filters defined in the user's configuration.
Improved Robustness: Added a check that sub->data is not NULL before the text is processed or its length is read. This prevents a null pointer dereference when a subtitle arrives without text data.
Code Cleanup: Removed legacy placeholder comments and simplified the control flow for applying text corrections.
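
For reviewers who want a picture of the control flow described above, here is a minimal, self-contained sketch. It is not the code from this PR: `demo_sub`, `demo_correct_and_censor`, `demo_prepare_transcript_text` and both word lists are hypothetical stand-ins, and the real signature of correct_spelling_and_censor_words may differ. It only illustrates the pattern of checking the data pointer first and then correcting the text in place.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the subtitle struct; only the field used here. */
struct demo_sub {
    char *data; /* subtitle text, may be NULL */
};

/* Hypothetical dictionaries; the real ones come from user configuration. */
static const char *demo_caps[] = {"NASA", "London"};
static const char *demo_censored[] = {"darn"};

/* Case-insensitive comparison of the first n bytes of a and b. */
static int demo_equal_nocase(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Stand-in for the shared helper: rewrites words in place. */
static void demo_correct_and_censor(char *line, size_t length)
{
    assert(line != NULL); /* the caller's NULL check guarantees this */
    size_t i = 0;
    while (i < length) {
        /* Skip separators, then measure the next alphabetic word. */
        while (i < length && !isalpha((unsigned char)line[i]))
            i++;
        size_t start = i;
        while (i < length && isalpha((unsigned char)line[i]))
            i++;
        size_t wlen = i - start;
        if (wlen == 0)
            continue;

        /* Capitalization: copy the dictionary spelling over the word. */
        for (size_t k = 0; k < sizeof demo_caps / sizeof demo_caps[0]; k++)
            if (strlen(demo_caps[k]) == wlen &&
                demo_equal_nocase(line + start, demo_caps[k], wlen))
                memcpy(line + start, demo_caps[k], wlen);

        /* Censorship: mask every character of a listed word. */
        for (size_t k = 0; k < sizeof demo_censored / sizeof demo_censored[0]; k++)
            if (strlen(demo_censored[k]) == wlen &&
                demo_equal_nocase(line + start, demo_censored[k], wlen))
                memset(line + start, '*', wlen);
    }
}

/* Mirrors the guarded call: returns -1 without touching anything
 * when there is no text, 0 after the text has been corrected. */
static int demo_prepare_transcript_text(struct demo_sub *sub)
{
    if (sub == NULL || sub->data == NULL)
        return -1;
    demo_correct_and_censor(sub->data, strlen(sub->data));
    return 0;
}
```

Because both rewrites keep the word length unchanged, the buffer never needs to grow, which is why the sketch can work in place. Whether the real helper has the same property is an assumption here.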


ccextractor-bot · 2026-02-09T18:50:48Z

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit dd29311...:

Report Name	Tests Passed
Broken	13/13
CEA-708	14/14
DVB	6/7
DVD	3/3
DVR-MS	2/2
General	27/27
Hardsubx	1/1
Hauppage	3/3
MP4	3/3
NoCC	10/10
Options	85/86
Teletext	21/21
WTV	13/13
XDS	34/34

Your PR breaks these cases:

ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
ccextractor --out=spupng c83f765c66...

Congratulations: Merging this PR would fix the following tests:

ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

ccextractor-bot · 2026-02-09T19:09:39Z

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit dd29311...:

Report Name	Tests Passed
Broken	13/13
CEA-708	14/14
DVB	6/7
DVD	3/3
DVR-MS	2/2
General	25/27
Hardsubx	1/1
Hauppage	3/3
MP4	3/3
NoCC	10/10
Options	81/86
Teletext	21/21
WTV	13/13
XDS	34/34

Your PR breaks these cases:

ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...

Congratulations: Merging this PR would fix the following tests:

ccextractor --out=spupng c83f765c66..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

Commit 2ac017f: feat(transcript): Implement dictionary-based capitalization and censorship
