Simple api #39

@RustGrow

I have some experience parsing with XPath, and I was very disappointed to find that there isn't a proper crate for parsing websites in Rust. Previously I used antchfx/htmlquery (used by 11214), which is practically the only decent library for this in Go, if you don't count go-colly.

I would simply suggest a similar syntax to htmlquery because names like XpathItemTree look intimidating.

The README.md file should show how to install the crate:
cargo add skyscraper

The dependencies in the Cargo.toml file:

[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
skyscraper = "0.6.4"
reqwest = { version = "0.12.4", features = ["default", "blocking", "cookies", "json", "socks"] }
tokio = { version = "1", features = ["full"] }

And functions with clear names:
use skyscraper::html::{Query, Find, FindOne, SelectAttr, SelectOneAttr};
Find and SelectAttr return a Vec of values.
FindOne and SelectOneAttr return a single &str value.

As well as a similar API with very simple, understandable examples:
From url

  1. Load an HTML document from a URL. The default timeout is 30s.
    let doc = Query::url("http://example.com/").expect("");

  2. Load an HTML document from a URL with custom client settings.
    let doc = Query::url_client("http://example.com/", &client).expect("");

From file

let file_path = "/home/user/sample.html";
let doc = Query::file(file_path).expect("");

From text

let text = r#"<html>....</html>"#;
let doc = Query::text(text).expect("");
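
A rough sketch of how the file and text loaders could fit together, using only the standard library. Everything here is an assumption for illustration: `Query` is a hypothetical type that simply stores the raw HTML, `html()` is a made-up accessor, and `io::Result` stands in for whatever error type the crate would actually use. A real implementation would parse the HTML at this point, and `Query::url` is left out because it needs an HTTP client such as reqwest.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical document handle. In this sketch it only owns the raw HTML
/// text; a real implementation would hold a parsed tree instead.
#[derive(Debug)]
pub struct Query {
    html: String,
}

impl Query {
    /// Build a document from an in-memory string.
    /// Returns a `Result` so that `.expect("")` works as in the examples above,
    /// even though this sketch itself never fails here.
    pub fn text(text: &str) -> io::Result<Query> {
        Ok(Query { html: text.to_owned() })
    }

    /// Read the file at `path` and delegate to `Query::text`.
    /// Fails with the underlying I/O error if the file cannot be read.
    pub fn file<P: AsRef<Path>>(path: P) -> io::Result<Query> {
        let html = fs::read_to_string(path)?;
        Query::text(&html)
    }

    /// Hypothetical accessor for the stored HTML, used only for inspection.
    pub fn html(&self) -> &str {
        &self.html
    }
}
```

Having `file` delegate to `text` keeps a single place where parsing would happen, so a URL loader could reuse it the same way once the response body has been downloaded.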

Also, add Find and FindOne functions:
Find all A elements.
let list = Find(&doc, "//a").expect("");

Find all A elements that have an href attribute.
let list = Find(&doc, "//a[@href]").expect("");

Find all A elements with an href attribute and return only the links.
let list = Find(&doc, "//a/@href").expect("");

Find the first A element.
let a = FindOne(&doc, "//a[1]").expect("");

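Find the third A element. (Strictly speaking, //a[3] matches an A that is the third A child of its parent; the third A in the whole document is (//a)[3].)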
let a = FindOne(&doc, "//a[3]").expect("");
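
To make the intended return types concrete, here is a toy sketch of the two functions. It is not an XPath engine and not how skyscraper works: `Doc`, `Element`, `QueryError` and the `el` helper are all hypothetical, the document is a flat list of already-parsed elements, and only the three shapes `//tag`, `//tag[@attr]` and `//tag/@attr` are recognised. Positional predicates such as `//a[1]` are rejected as unsupported. I also assume that an element match yields the element's text and an attribute match yields the attribute value, and I use snake_case names (`find`, `find_one`) as is conventional in Rust.

```rust
/// Hypothetical, already-parsed element (flat model, no nesting).
#[derive(Debug, Clone)]
pub struct Element {
    pub tag: String,
    pub attrs: Vec<(String, String)>,
    pub text: String,
}

impl Element {
    /// Look up an attribute value by name.
    fn attr(&self, name: &str) -> Option<&str> {
        self.attrs.iter().find(|(k, _)| k == name).map(|(_, v)| v.as_str())
    }
}

/// Hypothetical document: elements in document order.
#[derive(Debug, Default)]
pub struct Doc {
    pub elements: Vec<Element>,
}

#[derive(Debug, PartialEq)]
pub enum QueryError {
    /// The expression is outside the tiny subset this sketch understands.
    Unsupported(String),
    /// `find_one` matched nothing.
    NotFound,
}

fn unsupported(xpath: &str) -> QueryError {
    QueryError::Unsupported(xpath.to_owned())
}

/// Very loose check for a tag or attribute name.
fn is_name(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}

/// Return every match as a `&str` borrowed from the document.
pub fn find<'a>(doc: &'a Doc, xpath: &str) -> Result<Vec<&'a str>, QueryError> {
    let rest = xpath.strip_prefix("//").ok_or_else(|| unsupported(xpath))?;

    // Shape `//tag/@attr`: collect attribute values.
    if let Some((tag, attr)) = rest.split_once("/@") {
        if !is_name(tag) || !is_name(attr) {
            return Err(unsupported(xpath));
        }
        return Ok(doc
            .elements
            .iter()
            .filter(|e| e.tag == tag)
            .filter_map(|e| e.attr(attr))
            .collect());
    }

    // Shapes `//tag[@attr]` and `//tag`: collect element text.
    let (tag, required_attr) = match rest.split_once("[@") {
        Some((tag, tail)) => {
            let attr = tail.strip_suffix(']').ok_or_else(|| unsupported(xpath))?;
            (tag, Some(attr))
        }
        None => (rest, None),
    };
    if !is_name(tag) || !required_attr.map_or(true, is_name) {
        return Err(unsupported(xpath));
    }
    Ok(doc
        .elements
        .iter()
        .filter(|e| e.tag == tag)
        .filter(|e| required_attr.map_or(true, |a| e.attr(a).is_some()))
        .map(|e| e.text.as_str())
        .collect())
}

/// Return only the first match, or `NotFound` if there is none.
pub fn find_one<'a>(doc: &'a Doc, xpath: &str) -> Result<&'a str, QueryError> {
    find(doc, xpath)?.into_iter().next().ok_or(QueryError::NotFound)
}

/// Hypothetical helper for building test elements.
pub fn el(tag: &str, attrs: &[(&str, &str)], text: &str) -> Element {
    Element {
        tag: tag.to_owned(),
        attrs: attrs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect(),
        text: text.to_owned(),
    }
}
```

The explicit `'a` lifetime is the part worth noticing: returning `&str` means the values borrow from the document, so the document has to outlive the results.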

Dedicated functions for selecting attributes are possible but unnecessary, because an attribute of an element can already be retrieved with XPath. The documentation should simply include an example of how to do this, for those who have forgotten XPath:
//a/@href
//div/@inner_parameter

--
Select all attributes

let attr = SelectAttr(&doc, "//img", "src").expect("");

Select one attribute

let attr = SelectOneAttr(&doc, "//img[1]", "src").expect("");
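
Since the attribute can be expressed in the XPath itself, the attribute helpers could be thin wrappers that rewrite the expression and hand it to the finder. A minimal sketch, assuming hypothetical names (`attr_xpath`, `select_attr`) and taking the finder as a closure so the example stays independent of any real query engine:

```rust
/// Build the XPath that selects attribute `attr` on every node matched by
/// `xpath`, e.g. `//img` + `src` becomes `//img/@src`.
pub fn attr_xpath(xpath: &str, attr: &str) -> String {
    format!("{}/@{}", xpath, attr)
}

/// Hypothetical wrapper: rewrite the expression, then delegate to `find`,
/// which stands in for whatever function actually evaluates XPath.
pub fn select_attr<F, T, E>(find: F, xpath: &str, attr: &str) -> Result<T, E>
where
    F: FnOnce(&str) -> Result<T, E>,
{
    find(&attr_xpath(xpath, attr))
}
```

With this shape, the attribute helpers add no query logic of their own; they only save the caller from writing the `/@attr` suffix.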

--
Get count of elements.

let list = Find(&doc, "//a").expect("");
let count = list.len();

But this is just a subjective example of an API that looks simple and understandable.
