Failure to parse multi-tier etymology section into "etymology_number" #1619

@konstantinhenke

Context: Many pages of Wiktionary have numbered "Etymology" sections. wiktextract usually parses the number of the etymology section into the etymology_number field of its jsonl output.

Problem: Arabic نفس has not only single-tier sections like "Etymology 1", "Etymology 2", etc., but also two-tiered sections like "Etymology 1.1", "Etymology 1.2", etc. Currently, only the line representing Etymology 3 in the jsonl file produced by wiktextract has an etymology_number field. The other eight lines, which are otherwise produced correctly, lack it:

{"senses": [{"links": [["priceless", "pricel ……… ss?.vn:نَفَس,نِفَاس,نَفَاسَة.ap:++,+"}}], "word": "نفس", "lang": "Arabic", "lang_code": "ar"}
{"senses": [{"links": [["stingy", "stingy"], ……… "I/i~a.nopass.vn:نَفَس.ap:نُفَسَاء"}}], "word": "نفس", "lang": "Arabic", "lang_code": "ar"}
{"senses": [{"links": [["envy", "envy"], ["b ……… "I/i~a.pass.vn:نَفَاسَة.ap:نُفَسَاء"}}], "word": "نفس", "lang": "Arabic", "lang_code": "ar"}
{"senses": [{"head_nr": 2, "links": [["child ……… s": {"1": "I.onlypass.vn:نِفَاس"}}], "word": "نفس", "lang": "Arabic", "lang_code": "ar"}
{"senses": [{"links": [["comfort", "comfort" ………  "ar-conj", "args": {"1": "II"}}], "word": "نفس", "lang": "Arabic", "lang_code": "ar"}
{"senses": [{"links": [["self", "self"]], "g ……… }], "sounds": [{"ipa": "/nafs/"}], "word": "نفس", "lang": "Arabic", "lang_code": "ar"}
{"senses": [{"links": [["creature", "creatur ……… }], "sounds": [{"ipa": "/nafs/"}], "word": "نفس", "lang": "Arabic", "lang_code": "ar"}
{"senses": [{"links": [["نَفِسَ", "نفس#Arabic"] ……… /Ar-%D9%86%D9%81%D8%B3.ogg.mp3"}], "word": "نفس", "lang": "Arabic", "lang_code": "ar"}
{"senses": [{"tags": ["form-i", "no-gloss"]} ……… zation"]}], "etymology_number": 3, "word": "نفس", "lang": "Arabic", "lang_code": "ar"}
                                                             ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
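
To make the report easier to reproduce, here is a minimal sketch of a check that lists which entries for a word lack the field. It assumes only what is visible in the output above (one JSON object per line, with "word", "lang_code" and an optional "etymology_number" key). The function name `missing_etymology_number` and the shortened sample lines are made up for illustration and are not part of wiktextract.

```python
import json


def missing_etymology_number(lines, word, lang_code):
    """Return the 0-based indexes of lines for `word` that lack etymology_number.

    `lines` is any iterable of JSON Lines strings, e.g. an open file handle.
    """
    missing = []
    for index, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        entry = json.loads(line)
        if entry.get("word") != word or entry.get("lang_code") != lang_code:
            continue  # not the entry we are checking
        if "etymology_number" not in entry:
            missing.append(index)
    return missing


# Shortened stand-ins for the real output lines (senses omitted).
sample = [
    '{"senses": [], "word": "نفس", "lang": "Arabic", "lang_code": "ar"}',
    '{"senses": [], "etymology_number": 3, "word": "نفس", "lang": "Arabic", "lang_code": "ar"}',
]

print(missing_etymology_number(sample, "نفس", "ar"))  # prints [0]
```

Against a real extraction, the same function can be given an open file handle instead of the sample list, for example `open(path, encoding="utf-8")` where `path` points to the jsonl file.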
