Copy cs to sk for prototyping by Adrijaned · Pull Request #132 · common-voice/cv-sentence-extractor

Adrijaned · 2020-12-07T22:56:54Z

Mainly to get the extraction running and to get an idea of how much more work will need to be done.

Hrano · 2020-12-08T14:46:34Z

@@ -0,0 +1,17 @@
+allowed_symbols_regex="[A-Za-zěščřžýáíéóďťňúůĚŠČŘŽÝÁÍÉÓĎŤŇäöüÚ‚–\\. \"„“]"


allowed_symbols_regex="[A-Za-zěščŕřžýáíéóôďťňúůĺľÁÄĚŠČŔŘŽÝÁÍÉÓÔĎŤŇĹĽäöüÚ‚–\. "„“]"
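The practical effect of the suggested character class can be checked with a small sketch. It is written in Python rather than the extractor's own code, and it assumes the class is tested against each character of a sentence; how the extractor actually applies `allowed_symbols_regex` is not shown in this thread. The function name `disallowed_chars` and the sample sentence are made up for illustration.

```python
import re

# Character classes copied from the diff line (Czech) and from the suggested
# replacement (Slovak), rewritten as Python raw strings. TOML needs "\\." and
# "\"" where a raw Python string needs only "\." and '"'.
CS_ALLOWED = r'[A-Za-zěščřžýáíéóďťňúůĚŠČŘŽÝÁÍÉÓĎŤŇäöüÚ‚–\. "„“]'
SK_ALLOWED = r'[A-Za-zěščŕřžýáíéóôďťňúůĺľÁÄĚŠČŔŘŽÝÁÍÉÓÔĎŤŇĹĽäöüÚ‚–\. "„“]'


def disallowed_chars(sentence, allowed_class):
    """Return the characters of `sentence` that the class does not match.

    Assumption: the check is done one character at a time.
    """
    pattern = re.compile(allowed_class)
    return [ch for ch in sentence if not pattern.fullmatch(ch)]


sentence = "Kôň stál pri ceste."
print(disallowed_chars(sentence, CS_ALLOWED))  # the Czech class lacks "ô"
print(disallowed_chars(sentence, SK_ALLOWED))  # the suggested class accepts it
```

Under that assumption, the copied Czech class would reject ordinary Slovak sentences containing ô, ľ, ĺ or ŕ, which is what the suggestion addresses.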

Hrano · 2020-12-08T14:47:43Z

+needs_uppercase_start = true
+even_symbols = ["\""]
+broken_whitespace = ["  ", " ,", " .", " ?", " !", " ;"]
+abbreviation_patterns = ["[A-ZĚŠČŘŽÝÁÍÉĎŤŇÓÚ]+\\.*[a-z]*[A-ZĚŠČŘŽÝÁÍÉĎŤŇÓÚ]+", "atd\\.", "\\baj\\.", "tj\\.", "\\brec\\.", "[nN]apř\\.", 


abbreviation_patterns = ["[A-ZĹĽĚŠČŔŘŽÝÁÍÉĎŤŇÓÔÚ]+\.[a-z][A-ZĹĽĚŠČŔŘŽÝÁÍÉĎŤŇÓÔÚ]+", "a i\.", "a pod\.", "atď\.", "\baj\.", "tj\.", "\brec\.", "[nN]apr\.",
""."", "\s[^aikosuvzáó]\s", "zkr\.", "[Tt]zv\.", "[dD]r\.", "\b[aAeE]d\.", "\b[sS]?[tT]r\.", "[aA]rch\.", "Inc\.", "Ltd\.", "[pP]opr\.",
"\b[fF]r\.", "\b[A-Z]+DR\b", "[pP]ozn\.", "[sS]rov\.", "\b[eE][a-z]\.", "[zZ]ejm\.", "[JS]r\.", "\b[lL][lL]",
"Mgr\.", "[mM]j\.", "\b[sS]tol\.", "\b[pP]ol\.", "Ing\.", "[cCkK]pt\.", "\b[lL]t\.", "Mr?s?\.", "\s[^\\s]{1,2}\.", "\bviz\.", "\b[sS]at\."]
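A rough illustration of how such a pattern list might be used, in Python and under the assumption that a sentence is discarded as soon as any pattern matches anywhere in it. Only a handful of patterns from the suggestion are included, the helper name `contains_abbreviation` is hypothetical, and the sample sentences are invented.

```python
import re

# A few patterns taken from the suggested Slovak list, as Python raw strings.
ABBREVIATION_PATTERNS = [
    r"a pod\.",
    r"atď\.",
    r"\baj\.",
    r"[nN]apr\.",
    r"Ing\.",
    r"[dD]r\.",
]
COMPILED = [re.compile(p) for p in ABBREVIATION_PATTERNS]


def contains_abbreviation(sentence):
    """True if any abbreviation pattern matches somewhere in the sentence.

    Assumption: a single match is enough to reject the sentence.
    """
    return any(p.search(sentence) for p in COMPILED)


print(contains_abbreviation("Kúpil chlieb, mlieko atď."))  # matches "atď\."
print(contains_abbreviation("Kúpil chlieb a mlieko."))     # no pattern matches
```

Patterns without a leading `\b` can also match the end of a longer word, so a list like this is worth testing against a sample of real sentences before relying on it.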

Adrijaned · 2020-12-09T18:31:26Z

Blocklist generated from words of frequency 60 and lower
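For readers unfamiliar with the idea, a frequency-based blocklist can be sketched as below. This is not the tooling used for this pull request, which the comment does not name. The function `build_blocklist`, the lowercasing step, and the `\w+` tokenisation are all assumptions made for illustration.

```python
import re
from collections import Counter


def build_blocklist(sentences, max_frequency=60):
    """Return the words that occur at most `max_frequency` times.

    Assumptions: words are lowercased and split on runs of word characters.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"\w+", sentence.lower()))
    return sorted(word for word, n in counts.items() if n <= max_frequency)


# Tiny corpus with a threshold of 1 so the effect is visible.
corpus = ["pes a mačka", "pes a kôň", "pes beží"]
print(build_blocklist(corpus, max_frequency=1))
```

Rare words tend to be typos, foreign words, or proper nouns, which is why they are candidates for blocking.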

Hrano · 2020-12-14T09:04:34Z

I downloaded the sample and sent it for review to five native speakers.
The reviewers correct erroneous sentences in the second column, next to each original, in xls format.
Will that be OK?

MichaelKohler · 2021-05-25T20:26:29Z

Sorry, I missed that comment.

> The reviewers correct erroneous sentences in the second column, next to each original, in xls format.
> Will that be OK?

No. We can't accept corrected sentences, because we need to run a new, fresh export once the rules are added. This is needed to make sure that we fulfil all legal requirements. As sentences are picked at random, any changes to them would be lost.

Copy cs to sk for prototyping

3b67123

Hrano reviewed Dec 8, 2020

Incorporate suggested changes

b41fc93

MichaelKohler marked this pull request as draft March 14, 2021 13:49

MichaelKohler added the waiting on error rate review label May 25, 2021

Adrijaned wants to merge 2 commits into common-voice:main from Adrijaned:patch-1