I have some experience parsing with XPath, and I was very disappointed to find that there isn’t a proper crate for parsing websites in Rust. Previously, I used antchfx/htmlquery (used by 11214), which is practically the only decent library for this in Go, if you don’t count go-colly.
I would suggest an API similar to htmlquery's, because names like XpathItemTree look intimidating.
In the README.md file:
How to install the crate
cargo add skyscraper
Dependencies in the Cargo.toml file
[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
skyscraper = "0.6.4"
reqwest = { version = "0.12.4", features = ["default", "blocking", "cookies", "json", "socks"] }
tokio = { version = "1", features = ["full"] }
And functions with clear names:
use skyscraper::html::{Query, Find, FindOne, SelectAttr, SelectOneAttr};
Find and SelectAttr return a Vec of values.
FindOne and SelectOneAttr return a single &str value.
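To make the return shapes concrete, here is a minimal sketch of what such functions could look like in idiomatic Rust (snake_case names, Option for the single-value variants). Everything in it is an assumption for illustration: Document, Element, find, find_one, select_attr and select_one_attr are hypothetical, none of them exist in skyscraper, and the toy matches on tag name only instead of evaluating XPath.

```rust
/// Toy stand-in for a parsed element (hypothetical, not a skyscraper type).
pub struct Element {
    pub tag: String,
    pub attrs: Vec<(String, String)>,
}

/// Toy stand-in for a parsed document: a flat list of elements in document order.
pub struct Document {
    pub elements: Vec<Element>,
}

impl Element {
    /// Look up an attribute value by name.
    pub fn attr(&self, name: &str) -> Option<&str> {
        self.attrs
            .iter()
            .find(|(k, _)| k.as_str() == name)
            .map(|(_, v)| v.as_str())
    }
}

/// Counterpart of the proposed `Find`: every element with the given tag.
pub fn find<'a>(doc: &'a Document, tag: &str) -> Vec<&'a Element> {
    doc.elements.iter().filter(|e| e.tag == tag).collect()
}

/// Counterpart of the proposed `FindOne`: the first match, or None.
pub fn find_one<'a>(doc: &'a Document, tag: &str) -> Option<&'a Element> {
    doc.elements.iter().find(|e| e.tag == tag)
}

/// Counterpart of the proposed `SelectAttr`: the attribute value of every
/// matching element that carries the attribute.
pub fn select_attr<'a>(doc: &'a Document, tag: &str, attr: &str) -> Vec<&'a str> {
    find(doc, tag).into_iter().filter_map(|e| e.attr(attr)).collect()
}

/// Counterpart of the proposed `SelectOneAttr`: the attribute of the first match.
pub fn select_one_attr<'a>(doc: &'a Document, tag: &str, attr: &str) -> Option<&'a str> {
    find_one(doc, tag).and_then(|e| e.attr(attr))
}
```

Returning Option from the single-value functions lets the caller decide what a missing element means, instead of forcing an expect at every call site.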
As well as a similar API with very simple, understandable examples:
From url
- Load an HTML document from a URL. The default timeout is 30s.
let doc = Query::url("http://example.com/").expect("");
- Load an HTML document from a URL with custom client settings.
let doc = Query::url_client("http://example.com/", &client).expect("");
From file
let file_path = "/home/user/sample.html";
let doc = Query::file(file_path).expect("");
From text
let text = r#"<html>....</html>"#;
let doc = Query::text(text).expect("");
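As a rough sketch of how the file and text loaders could report failures through Result rather than leaving every caller with expect(""), something like the following would do. The names Query, Document and LoadError are hypothetical and not part of skyscraper, the "document" only stores the raw text (no parsing happens), and the URL variant is left out because it needs an HTTP client.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Toy stand-in for a loaded document: it only stores the raw HTML text.
pub struct Document {
    pub html: String,
}

/// Hypothetical error type for the loaders below.
#[derive(Debug)]
pub enum LoadError {
    /// Reading the file failed.
    Io(io::Error),
    /// The input contained nothing but whitespace.
    Empty,
}

/// Hypothetical loader namespace mirroring the `Query` name used above.
pub struct Query;

impl Query {
    /// Build a document from an in-memory string.
    pub fn text(html: &str) -> Result<Document, LoadError> {
        if html.trim().is_empty() {
            return Err(LoadError::Empty);
        }
        Ok(Document { html: html.to_owned() })
    }

    /// Read a document from disk, passing I/O errors on to the caller.
    pub fn file<P: AsRef<Path>>(path: P) -> Result<Document, LoadError> {
        let html = fs::read_to_string(path).map_err(LoadError::Io)?;
        Query::text(&html)
    }
}
```

With this shape a caller can still write Query::file(path).expect("...") in a quick script, but library code can match on LoadError or propagate it with the ? operator.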
Also, add Find and FindOne functions:
Find all A elements.
let list = Find(&doc, "//a").expect("");
Find all A elements that have an href attribute.
let list = Find(&doc, "//a[@href]").expect("");
Find all A elements with an href attribute and return only the link values.
let list = Find(&doc, "//a/@href").expect("");
Find the first A element.
let a = FindOne(&doc, "//a[1]").expect("");
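Find the third A element (in XPath the position is counted among the children of each parent; (//a)[3] selects the third A in the whole document).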
let a = FindOne(&doc, "//a[3]").expect("");
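One XPath detail is worth spelling out in the docs next to these examples: a positional predicate such as //a[3] is evaluated per parent, while (//a)[3] indexes the whole result set. The sketch below only illustrates those two semantics on a hand-built toy tree; Node, nth_per_parent and nth_in_document are hypothetical names, this is not an XPath engine, and the root node plays the role of the document node (it is never matched itself).

```rust
/// Toy element tree; `label` only exists so results can be told apart.
pub struct Node {
    pub tag: &'static str,
    pub label: &'static str,
    pub children: Vec<Node>,
}

/// Collect every descendant with the given tag, in document (pre-order) order.
fn collect_all<'a>(node: &'a Node, tag: &str, out: &mut Vec<&'a Node>) {
    for child in &node.children {
        if child.tag == tag {
            out.push(child);
        }
        collect_all(child, tag, out);
    }
}

/// Semantics of `//tag[n]`: descendants that are the n-th `tag` child of
/// their own parent (1-based), returned in document order.
pub fn nth_per_parent<'a>(root: &'a Node, tag: &str, n: usize) -> Vec<&'a Node> {
    fn visit<'a>(node: &'a Node, tag: &str, n: usize, out: &mut Vec<&'a Node>) {
        let mut seen = 0;
        for child in &node.children {
            if child.tag == tag {
                seen += 1;
                if seen == n {
                    out.push(child);
                }
            }
            visit(child, tag, n, out);
        }
    }
    let mut out = Vec::new();
    visit(root, tag, n, &mut out);
    out
}

/// Semantics of `(//tag)[n]`: the n-th match in the whole document (1-based).
pub fn nth_in_document<'a>(root: &'a Node, tag: &str, n: usize) -> Option<&'a Node> {
    let mut all = Vec::new();
    collect_all(root, tag, &mut all);
    n.checked_sub(1).and_then(|i| all.get(i).copied())
}
```

For a page with two div blocks holding two and three links respectively, nth_per_parent with n = 3 yields only the last link of the second block, whereas nth_in_document with n = 3 yields the first link of the second block.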
Dedicated attribute selectors are possible but not strictly necessary, since XPath can already retrieve an attribute of an element. The documentation should simply include an example of how to do this, for those who have forgotten XPath:
//a/@href
//div/@inner_parameter
--
Select the attribute from all matching elements
let attr = SelectAttr(&doc, "//img", "src").expect("");
Select the attribute from a single element
let attr = SelectOneAttr(&doc, "//img[1]", "src").expect("");
--
Get count of elements.
let list = Find(&doc, "//a").expect("");
let count = list.len();
But this is just a subjective example of an API that looks simple and understandable.