feat: add a crawl delay function in kage clone to honor the Crawl-del…#57

Merged
tamnd merged 1 commit into tamnd:main from Xirui:feat/rate-limiting on Jun 22, 2026
Xirui commented Jun 21, 2026

Contributor

Add a crawl delay function in kage clone to honor the Crawl-delay directive parsed from robots.txt.

The next step would be to add a user flag (--crawl-delay) under the clone command to resolve #6.

tamnd commented Jun 21, 2026

Owner

Love this. It adds more respect for robots.txt and lets users control the delay.

However, I found an awesome package that already implements the rate-limiting logic, and we could reuse it: https://pkg.go.dev/golang.org/x/time/rate

import "golang.org/x/time/rate"

Then add a field:

type Cloner struct {
    ...
    crawlLimiter *rate.Limiter
}

And the setup is just:

func (c *Cloner) setupCrawlDelayLimiter() {
    if !c.cfg.RespectRobots || c.robots == nil || c.robots.CrawlDelay <= 0 {
        c.crawlLimiter = nil
        return
    }

    c.crawlLimiter = rate.NewLimiter(rate.Every(c.robots.CrawlDelay), 1)
}

Then your function becomes:

func (c *Cloner) waitForCrawlDelay(ctx context.Context) bool {
    if c.crawlLimiter == nil {
        return true
    }

    return c.crawlLimiter.Wait(ctx) == nil
}

Under the hood, it is almost the same logic: https://cs.opensource.google/go/x/time/+/refs/tags/v0.15.0:rate/rate.go;l=253

…e parsed from robots.txt during clone.

      2. add --crawl-delay flag to specify/override robots directive.
Xirui force-pushed the feat/rate-limiting branch from 20552bf to 4de9338 on June 21, 2026 23:50

Xirui commented Jun 21, 2026

Contributor Author

Thx for the suggestion!

I've amended the commit to use the rate package and added the --crawl-delay user flag.

tamnd commented Jun 22, 2026

Owner

LGTM, I will cut a new release soon!

tamnd merged commit 8bc9c8b into tamnd:main on Jun 22, 2026
9 checks passed
Development

Successfully merging this pull request may close these issues.

limit crawling