Segfault or encoding error when parsing a URL #59

Description

@lopuhin

See #58 (comment) and #58 (comment)

Repeating the traceback here as well:

Traceback (most recent call last):
  File "./bin/triage_links", line 34, in get_url_parts
    link = urljoin(record.url, record.href)
  File "scurl/cgurl.pyx", line 308, in scurl.cgurl.urljoin
  File "scurl/cgurl.pyx", line 353, in scurl.cgurl.urljoin
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 503: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/triage_links", line 102, in <module>
    main()
  File "./bin/triage_links", line 13, in main
    CSVPipeline(callback=process).execute()
  File "./bin/../crawler/utils/csvpipeline.py", line 42, in execute
    self.save_csv()
  File "./bin/../crawler/utils/csvpipeline.py", line 96, in save_csv
    df = df.compute()
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/threaded.py", line 82, in get
Segmentation fault: 11

To reproduce, run a broad crawl on this dataset and extract all links:

https://www.kaggle.com/cheedcheed/top1m

Then call urljoin() and urlsplit() on each extracted link.
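
A minimal sketch of how the failing inputs could be narrowed down, in case it helps. The helper name `find_failures` and the sample pairs are hypothetical and not taken from the crawler. The function defaults to the standard library's `urllib.parse` so it runs anywhere; to exercise scurl, pass its functions in instead (assuming they are importable from `scurl.cgurl`, as the traceback suggests).

```python
from urllib.parse import urljoin as std_urljoin, urlsplit as std_urlsplit


def find_failures(pairs, urljoin=std_urljoin, urlsplit=std_urlsplit):
    """Call urljoin() then urlsplit() on each (base, href) pair.

    Returns a list of (base, href, error) tuples for inputs that raised.
    `find_failures` is a hypothetical helper, not part of scurl or the crawler.
    """
    failures = []
    for base, href in pairs:
        try:
            link = urljoin(base, href)
            urlsplit(link)
        except Exception as exc:
            # Keep the exact input so it can be turned into a small test case.
            failures.append((base, href, repr(exc)))
    return failures


# Hypothetical sample input; the real pairs would come from the crawl.
sample_pairs = [
    ("http://example.com/a/", "b.html"),
    ("http://example.com/", "/caf\u00e9?q=\U0001f600"),
]

# To test scurl instead of the stdlib (assumed import path):
#   from scurl.cgurl import urljoin, urlsplit
#   failures = find_failures(sample_pairs, urljoin=urljoin, urlsplit=urlsplit)
failures = find_failures(sample_pairs)
for base, href, error in failures:
    print(base, href, error)
```

This only catches Python-level exceptions such as the UnicodeDecodeError above. A segmentation fault kills the process and cannot be caught with try/except, so for that case it may help to log each pair before the call, or to enable the standard library's faulthandler module to get a Python traceback at the point of the crash.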
