diff --git a/.github/workflows/wordlist.txt b/.github/workflows/wordlist.txt
index 63f5b77..16072a8 100644
--- a/.github/workflows/wordlist.txt
+++ b/.github/workflows/wordlist.txt
@@ -140,3 +140,5 @@ iframes
 webhooks
 unix
 customizable
+subfolder
+SHA
diff --git a/docs/Web Scraper Cloud.md b/docs/Web Scraper Cloud.md
index cc3f0cf..0a412df 100644
--- a/docs/Web Scraper Cloud.md
+++ b/docs/Web Scraper Cloud.md
@@ -21,6 +21,7 @@ enabled, scraper changes the IP address and retries to scrape the page.
 * [API][api]
 * [Parser][parser]
 * [Data export][data-export]
+* [Image export][image-export]
 * [Data quality control][data-quality-control]
 * [Notifications][notifications]
 * [Sitemap sync][sitemap-sync]
@@ -110,6 +111,7 @@ element click selector. If the timeout is reached, no data will be scraped from
 [api]: Web%20Scraper%20Cloud/API.md
 [parser]: Web%20Scraper%20Cloud/Parser.md
 [data-export]: Web%20Scraper%20Cloud/Data%20Export.md
+[image-export]: Web%20Scraper%20Cloud/Image%20Export.md
 [scraping-job-performance-graph]: ./images/cloud/scraping-job-performance-graph.png?raw=true
 [parallel-tasks]: images/cloud/parallel-tasks.png
 [Subscription manager]: https://cloud.webscraper.io/subscription-manager
diff --git a/docs/Web Scraper Cloud/Image Export.md b/docs/Web Scraper Cloud/Image Export.md
new file mode 100644
index 0000000..63324ae
--- /dev/null
+++ b/docs/Web Scraper Cloud/Image Export.md
@@ -0,0 +1,33 @@
+# Image Export
+
+Web Scraper Cloud supports automated image export to `Amazon S3, Google Cloud Storage, and Azure Blob Storage`. This feature is available exclusively for `Scale` plan users. Image downloading is performed during the execution of the scraping job. As pages are processed, associated images are downloaded in parallel with data extraction.
+
+## Image Export Configuration
+
+The Image Export tab will be visible when the sitemap contains at least one `Image` selector.
+
+![Fig. 1: Image Export Tab in Web Scraper Cloud][image-export-tab-web-scraper-cloud]
+
+## Exported Image Location
+
+Images are exported to the same path as the data export, within an `images` subfolder. For example, if data is exported to `bucket/web-scraper/my-sitemap` in S3, images will be exported to `bucket/web-scraper/my-sitemap/images`.
+
+## Image Columns
+
+Each `Image` selector creates a separate column in the exported data. The column name follows the format `{image_selector_id}_stored_filename`. File names are generated using the SHA-256 hash of the image URL.
+
+![Fig. 2: Image Export Column Name][image-export-column-name]
+
+## Image Column Structure Based on Selector Configuration
+
+Example image selector ID: `product_image`
+
+* **First record only** - `product_image, product_image_stored_filename`
+* **Multiple records in multiple columns** - `product_image_1, product_image_1_stored_filename, product_image_2, product_image_2_stored_filename, ...`
+* **Multiple records in one column** - `product_image, product_image_stored_filename` (file names separated by newlines)
+
+[image-export-tab-web-scraper-cloud]: ../images/image-export/image-export-tab-web-scraper-cloud.png?raw=true
+[image-export-column-name]: ../images/image-export/image-export-column-name.png?raw=true
+
+description: Web Scraper Cloud supports automated image export to Amazon S3, Google Cloud Storage, and Azure Blob Storage
+keywords: image export, image download, automated image export, web scraper cloud image export
diff --git a/docs/images/image-export/image-export-column-name.png b/docs/images/image-export/image-export-column-name.png
new file mode 100644
index 0000000..1b697e8
Binary files /dev/null and b/docs/images/image-export/image-export-column-name.png differ
diff --git a/docs/images/image-export/image-export-tab-web-scraper-cloud.png b/docs/images/image-export/image-export-tab-web-scraper-cloud.png
new file mode 100644
index 0000000..95893f7
Binary files /dev/null and b/docs/images/image-export/image-export-tab-web-scraper-cloud.png differ
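
Reviewer note: the naming rules that the new `Image Export.md` page describes (an `images` subfolder under the data export path, `{image_selector_id}_stored_filename` columns, and SHA-256-based file names) can be illustrated with a minimal Python sketch. This is not Web Scraper Cloud's implementation. The function names are hypothetical, and the sketch assumes the hash is taken over the UTF-8 bytes of the URL and rendered as a hex digest; the page does not say how the hash is encoded or whether a file extension is appended.

```python
import hashlib
from typing import List


def stored_filename(image_url: str) -> str:
    # Assumption: hex digest of the UTF-8 encoded URL, no file extension.
    # The docs only state that file names use the SHA-256 hash of the image URL.
    return hashlib.sha256(image_url.encode("utf-8")).hexdigest()


def images_path(data_export_path: str) -> str:
    # Images are exported to an `images` subfolder of the data export path.
    return data_export_path.rstrip("/") + "/images"


def image_columns(selector_id: str, multiple_columns: int = 0) -> List[str]:
    # multiple_columns == 0 covers "First record only" and
    # "Multiple records in one column": one value column plus one
    # `_stored_filename` column.
    if multiple_columns <= 0:
        return [selector_id, f"{selector_id}_stored_filename"]
    # "Multiple records in multiple columns": numbered column pairs.
    columns: List[str] = []
    for i in range(1, multiple_columns + 1):
        columns.append(f"{selector_id}_{i}")
        columns.append(f"{selector_id}_{i}_stored_filename")
    return columns


if __name__ == "__main__":
    print(images_path("bucket/web-scraper/my-sitemap"))
    print(image_columns("product_image", multiple_columns=2))
    print(stored_filename("https://example.com/a.png"))
```

Because the file name depends only on the URL under these assumptions, the same image URL always maps to the same stored name, which would let repeated references to one image share a single exported file.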