downflux



Class: WikimediaProvider

Defined in: packages/providers/wikimedia/WikimediaProvider.ts:5

Generic provider for sites that can use the default extraction pipeline while site-specific parsers are still being built.

Extends

GenericContentProvider

Constructors

Constructor

new WikimediaProvider(url): WikimediaProvider

Defined in: packages/providers/wikimedia/WikimediaProvider.ts:6

Parameters

url

string

Returns

WikimediaProvider

Overrides

GenericContentProvider.constructor

Properties

executionOptions

protected executionOptions: ExecutionOptions = {}

Defined in: packages/base/BaseProvider.ts:39

Inherited from

GenericContentProvider.executionOptions


httpOptions

protected httpOptions: HttpFetchOptions = {}

Defined in: packages/base/BaseProvider.ts:40

Inherited from

GenericContentProvider.httpOptions


deps

protected readonly deps: CoordinatorDependencies

Defined in: packages/base/BaseProvider.ts:41

Inherited from

GenericContentProvider.deps


provider

protected readonly provider: Provider

Defined in: packages/base/BaseProvider.ts:42

Inherited from

GenericContentProvider.provider


urlPattern

protected readonly urlPattern: RegExp

Defined in: packages/base/BaseProvider.ts:43

Inherited from

GenericContentProvider.urlPattern


providerMetadata

protected readonly providerMetadata: ProviderMetadata

Defined in: packages/base/BaseProvider.ts:44

Inherited from

GenericContentProvider.providerMetadata


url

protected readonly url: string

Defined in: packages/base/BaseProvider.ts:52

Inherited from

GenericContentProvider.url


config

protected config: ProviderConfig

Defined in: packages/base/BaseProvider.ts:53

Inherited from

GenericContentProvider.config

Accessors

metadata

Get Signature

get protected metadata(): ProviderMetadata

Defined in: packages/base/BaseProvider.ts:47

Provider capabilities, integration status, and access restrictions.

Returns

ProviderMetadata

Inherited from

GenericContentProvider.metadata


ORIGIN

Get Signature

get protected ORIGIN(): string

Defined in: packages/base/BaseProvider.ts:81

Returns

string

Inherited from

GenericContentProvider.ORIGIN


HOST_NAME

Get Signature

get protected HOST_NAME(): string

Defined in: packages/base/BaseProvider.ts:85

Returns

string

Inherited from

GenericContentProvider.HOST_NAME

Methods

isValidHostName()

protected isValidHostName(): boolean

Defined in: packages/base/BaseProvider.ts:89

Returns

boolean

Inherited from

GenericContentProvider.isValidHostName


setAuth()

setAuth(auth): this

Defined in: packages/base/BaseProvider.ts:110

Sets authentication credentials for the provider.

Parameters

auth

AuthenticatedCrawlOptions

Authentication options including cookie, bearer token, CSRF token, API key, client ID, and user agent

Returns

this

Remarks

Configures HTTP headers and user agent based on provided authentication credentials. Supports multiple authentication methods: cookies, bearer tokens, CSRF tokens, API keys, and client IDs.

Inherited from

GenericContentProvider.setAuth
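
The remarks above say that credentials are turned into HTTP headers and a user agent. The following is a minimal sketch of what such a mapping could look like, not the library's implementation. The `AuthSketchOptions` field names and every header name (`X-CSRF-Token`, `X-API-Key`, `Client-ID`) are assumptions made for illustration; the real `AuthenticatedCrawlOptions` type may differ.

```typescript
// Hypothetical option shape; the real AuthenticatedCrawlOptions fields may be named differently.
interface AuthSketchOptions {
  cookie?: string;
  bearerToken?: string;
  csrfToken?: string;
  apiKey?: string;
  clientId?: string;
  userAgent?: string;
}

// Maps each supplied credential to one request header.
// Header names other than Cookie, Authorization, and User-Agent are assumptions.
function buildAuthHeaders(auth: AuthSketchOptions): Record<string, string> {
  const headers: Record<string, string> = {};
  if (auth.cookie) headers["Cookie"] = auth.cookie;
  if (auth.bearerToken) headers["Authorization"] = `Bearer ${auth.bearerToken}`;
  if (auth.csrfToken) headers["X-CSRF-Token"] = auth.csrfToken;
  if (auth.apiKey) headers["X-API-Key"] = auth.apiKey;
  if (auth.clientId) headers["Client-ID"] = auth.clientId;
  if (auth.userAgent) headers["User-Agent"] = auth.userAgent;
  return headers;
}

const authHeaders = buildAuthHeaders({
  bearerToken: "abc123",
  userAgent: "example-agent/1.0",
});
console.log(authHeaders);
```

Only the credentials that are actually supplied produce headers, so several authentication methods can be combined in one call.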


setHeaders()

setHeaders(headers): this

Defined in: packages/base/BaseProvider.ts:129

Sets custom HTTP headers.

Parameters

headers

Record<string, string>

Request header map

Returns

this

Inherited from

GenericContentProvider.setHeaders


setTimeout()

setTimeout(timeoutMs): this

Defined in: packages/base/BaseProvider.ts:138

Sets HTTP timeout.

Parameters

timeoutMs

number

Timeout in milliseconds

Returns

this

Inherited from

GenericContentProvider.setTimeout


setRetries()

setRetries(retries): this

Defined in: packages/base/BaseProvider.ts:147

Sets fetch retry count.

Parameters

retries

number

Retry attempt count

Returns

this

Inherited from

GenericContentProvider.setRetries


setTransformOutput()

setTransformOutput(transform?): this

Defined in: packages/base/BaseProvider.ts:156

Controls whether output is transformed to the provider-specific result type.

Parameters

transform?

boolean = true

Default is true, which applies the default transformation. Set to false to return raw extracted data.

Returns

this

Inherited from

GenericContentProvider.setTransformOutput


setHttpOptions()

setHttpOptions(opts): this

Defined in: packages/base/BaseProvider.ts:165

Sets HTTP fetch options.

Parameters

opts

HttpFetchOptions

HTTP options to merge

Returns

this

Inherited from

GenericContentProvider.setHttpOptions
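
Every setter on this class returns `this`, so calls can be chained, and `setHttpOptions` is documented as merging its argument into the existing options. The sketch below illustrates that pattern with a stand-in class. `FluentConfigSketch`, `HttpOptionsSketch`, and `snapshot()` are hypothetical names, and the assumptions that `setTimeout` and `setRetries` write into the same options object and that the merge is shallow are not confirmed by the reference.

```typescript
// Hypothetical stand-in for HttpFetchOptions; field names are assumptions.
interface HttpOptionsSketch {
  timeoutMs?: number;
  retries?: number;
  headers?: Record<string, string>;
}

class FluentConfigSketch {
  private httpOptions: HttpOptionsSketch = {};

  setTimeout(timeoutMs: number): this {
    this.httpOptions.timeoutMs = timeoutMs;
    return this; // returning `this` is what enables chaining
  }

  setRetries(retries: number): this {
    this.httpOptions.retries = retries;
    return this;
  }

  setHttpOptions(opts: HttpOptionsSketch): this {
    // Shallow merge: keys in `opts` win, keys not mentioned are kept.
    this.httpOptions = { ...this.httpOptions, ...opts };
    return this;
  }

  // Helper for inspection only; not part of the documented API.
  snapshot(): HttpOptionsSketch {
    return { ...this.httpOptions };
  }
}

const fluentResult = new FluentConfigSketch()
  .setTimeout(10000)
  .setRetries(3)
  .setHttpOptions({ timeoutMs: 5000 })
  .snapshot();
console.log(fluentResult);
```

Under these assumptions the later `setHttpOptions` call overrides the timeout while the retry count set earlier survives.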


setNoDownload()

setNoDownload(noDownload?): this

Defined in: packages/base/BaseProvider.ts:175

Sets the no-download flag, which controls whether the download phase is skipped.

Parameters

noDownload?

boolean = false

When true, extraction runs but nothing is downloaded.

Returns

this

Default Value

false - set to true to skip the download phase and only perform extraction (useful for debugging or when you only need metadata)

Inherited from

GenericContentProvider.setNoDownload


setTranscodeOptions()

setTranscodeOptions(opts): this

Defined in: packages/base/BaseProvider.ts:188

Sets transcode options.

Parameters

opts

TranscodeOptions

Transcode options. Depending on the operating system, a downloaded video may not play.

In such cases, set transcode options to re-encode the video with ffmpeg, which should resolve most compatibility issues. Make sure your system can handle the re-encoding workload.

Returns

this

Inherited from

GenericContentProvider.setTranscodeOptions
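
To make the re-encoding step concrete, here is a rough sketch of how an ffmpeg argument list for a compatibility re-encode might be assembled. It is not taken from the library: `TranscodeSketchOptions` and `buildFfmpegArgs` are hypothetical, and the real `TranscodeOptions` type may expose different fields. The ffmpeg flags themselves (`-y`, `-i`, `-c:v`, `-c:a`, `-movflags +faststart`) are standard.

```typescript
// Hypothetical option shape; the real TranscodeOptions type is not shown in this reference.
interface TranscodeSketchOptions {
  videoCodec?: string; // ffmpeg encoder name, e.g. "libx264"
  audioCodec?: string; // ffmpeg encoder name, e.g. "aac"
  fastStart?: boolean; // move the MP4 index to the front of the file
}

// Builds the argument list that would follow the `ffmpeg` executable name.
function buildFfmpegArgs(
  input: string,
  output: string,
  opts: TranscodeSketchOptions = {},
): string[] {
  const args = [
    "-y", // overwrite the output file without prompting
    "-i", input,
    "-c:v", opts.videoCodec ?? "libx264",
    "-c:a", opts.audioCodec ?? "aac",
  ];
  if (opts.fastStart ?? true) {
    args.push("-movflags", "+faststart");
  }
  args.push(output);
  return args;
}

const ffmpegArgs = buildFfmpegArgs("in.mp4", "out.mp4");
console.log(["ffmpeg", ...ffmpegArgs].join(" "));
```

H.264 video with AAC audio is chosen as the default here because that combination plays in most players; it is an illustrative choice, not a documented default.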


setPreferredFormat()

setPreferredFormat(format): this

Defined in: packages/base/BaseProvider.ts:197

Sets preferred video format.

Parameters

format

VideoFormat

Video format (hls or mp4)

Returns

this

Inherited from

GenericContentProvider.setPreferredFormat


setPreferredCodec()

setPreferredCodec(codec): this

Defined in: packages/base/BaseProvider.ts:211

Sets preferred video codec.

Parameters

codec

VideoCodec

Video codec (h264 or av1)

This feature is still experimental and not yet implemented for all providers.

It allows you to specify a preferred video codec which can help with compatibility or performance in some cases. If the provider supports it, it will try to download the video in the specified codec. If not available, it will fall back to the default behavior.

Returns

this

Inherited from

GenericContentProvider.setPreferredCodec
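
The fallback behaviour described above can be sketched as a small selection function. This is an illustration under assumptions, not the library's logic: `VariantSketch` and `pickVariant` are hypothetical, and "default behavior" is modelled here as taking the first available variant.

```typescript
// Local stand-in for the library's VideoCodec values.
type CodecSketch = "h264" | "av1";

// Hypothetical description of one downloadable rendition of a video.
interface VariantSketch {
  url: string;
  codec: CodecSketch;
}

// Returns the variant in the preferred codec when one exists,
// otherwise falls back to the first variant (assumed default).
function pickVariant(
  variants: VariantSketch[],
  preferred?: CodecSketch,
): VariantSketch | undefined {
  const match = preferred
    ? variants.find((variant) => variant.codec === preferred)
    : undefined;
  return match ?? variants[0];
}

const sampleVariants: VariantSketch[] = [
  { url: "https://example.org/video-h264.mp4", codec: "h264" },
];
const chosenVariant = pickVariant(sampleVariants, "av1");
console.log(chosenVariant);
```

In the usage above no AV1 rendition exists, so the H.264 variant is returned instead of failing.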


setJobOptions()

setJobOptions(opts): this

Defined in: packages/base/BaseProvider.ts:220

Sets ExecutionCoordinator options.

Parameters

opts

ExecutionOptions

Job options to merge

Returns

this

Inherited from

GenericContentProvider.setJobOptions


setAgentOptions()

setAgentOptions(opts): this

Defined in: packages/base/BaseProvider.ts:229

Sets HTTP agent options.

Parameters

opts

HttpAgentOptions

HTTP agent options to merge

Returns

this

Inherited from

GenericContentProvider.setAgentOptions


setMaxDownloads()

setMaxDownloads(maxDownloads): this

Defined in: packages/base/BaseProvider.ts:238

Sets maximum downloads.

Parameters

maxDownloads

number

Download limit

Returns

this

Inherited from

GenericContentProvider.setMaxDownloads


setAllowedExtensions()

setAllowedExtensions(...extensions): this

Defined in: packages/base/BaseProvider.ts:247

Sets allowed file extensions.

Parameters

extensions

...AllowedExtension[]

File extensions such as jpg or png

Returns

this

Inherited from

GenericContentProvider.setAllowedExtensions
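
As an illustration of what an extension allow-list does, the sketch below filters a list of URLs by file extension. `filterByExtension` is a hypothetical helper, not part of the library, and the matching rules (case-insensitive, optional leading dot, query string and fragment ignored) are assumptions.

```typescript
// Keeps only URLs whose last path segment ends in one of the allowed extensions.
function filterByExtension(urls: string[], ...extensions: string[]): string[] {
  // Normalise "JPG" and ".jpg" to "jpg".
  const allowed = new Set(
    extensions.map((ext) => ext.toLowerCase().replace(/^\./, "")),
  );
  return urls.filter((raw) => {
    const withoutQuery = raw.replace(/[?#].*$/, ""); // drop query string and fragment
    const lastSegment = withoutQuery.slice(withoutQuery.lastIndexOf("/") + 1);
    const dot = lastSegment.lastIndexOf(".");
    if (dot === -1) return false; // no extension at all
    return allowed.has(lastSegment.slice(dot + 1).toLowerCase());
  });
}

const candidateUrls = [
  "https://example.org/a.JPG?x=1",
  "https://example.org/b.png",
  "https://example.org/c.webm",
  "https://example.org/page",
];
const allowedUrls = filterByExtension(candidateUrls, "jpg", ".png");
console.log(allowedUrls);
```

The rest parameter mirrors the documented signature, so extensions are passed as separate arguments rather than as an array.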


onProgress()

onProgress(handler): this

Defined in: packages/base/BaseProvider.ts:256

Sets progress handler.

Parameters

handler

(event) => void

Progress event callback

Returns

this

Inherited from

GenericContentProvider.onProgress


setProgressLogging()

setProgressLogging(enabled?): this

Defined in: packages/base/BaseProvider.ts:266

Enables or disables console progress logging.

Parameters

enabled?

boolean = true

Console logging flag

Returns

this

Default Value

true

Inherited from

GenericContentProvider.setProgressLogging


setOutput()

setOutput(type, config?): this

Defined in: packages/base/BaseProvider.ts:277

Sets output type.

Parameters

type

OutputType

Job output mode

config?

DirectoryOutputOptions = {}

Directory output configuration

Returns

this

Default Value

OutputType.JSON

Inherited from

GenericContentProvider.setOutput


setExecutionType()

setExecutionType(type): this

Defined in: packages/base/BaseProvider.ts:298

Sets execution strategy.

Parameters

type

ExecutionType

Execution mode

Returns

this

Default Value

ExecutionType.SEQUENTIAL

This feature is still experimental and not yet implemented for all providers. It allows you to specify the execution strategy for the extraction and download process.

  • SEQUENTIAL: Extracts and downloads items one by one. This is the most compatible mode and should work with all providers, but can be slower for large batches.

  • PARALLEL: Extracts all items first, then downloads them in parallel. This can be faster for large batches, but may cause issues with providers that have strict rate limits or anti-bot measures. Use with caution and test thoroughly if you choose to use PARALLEL execution.

Inherited from

GenericContentProvider.setExecutionType
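
The difference between the two modes can be sketched as a step plan. The function below is a simplified model, not the coordinator's implementation: `ExecutionModeSketch` and `planSteps` are hypothetical, and it assumes that extraction in PARALLEL mode still happens item by item before a single concurrent download batch. Each inner array is a batch whose steps may run at the same time.

```typescript
// Local stand-in for the library's ExecutionType values.
type ExecutionModeSketch = "SEQUENTIAL" | "PARALLEL";

// Returns ordered batches of steps; steps inside one batch may run concurrently.
function planSteps(mode: ExecutionModeSketch, items: string[]): string[][] {
  const batches: string[][] = [];
  if (mode === "SEQUENTIAL") {
    // One item at a time: extract it, download it, then move on.
    for (const item of items) {
      batches.push([`extract:${item}`], [`download:${item}`]);
    }
    return batches;
  }
  // PARALLEL: extract everything first...
  for (const item of items) {
    batches.push([`extract:${item}`]);
  }
  // ...then download all items in one concurrent batch.
  if (items.length > 0) {
    batches.push(items.map((item) => `download:${item}`));
  }
  return batches;
}

const sequentialPlan = planSteps("SEQUENTIAL", ["a", "b"]);
const parallelPlan = planSteps("PARALLEL", ["a", "b"]);
console.log(JSON.stringify(sequentialPlan));
console.log(JSON.stringify(parallelPlan));
```

The single large download batch in the parallel plan is where rate limits and anti-bot measures are most likely to be triggered, which is why the reference advises caution.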


buildRequest()

protected buildRequest(overrides?): WikimediaExecArgs

Defined in: packages/base/BaseProvider.ts:309

Builds the execution request passed to the coordinator layer.

Parameters

overrides?

Partial<WikimediaExecArgs>

Provider method options that should override defaults.

Returns

WikimediaExecArgs

A typed request containing provider metadata and execution options.

Inherited from

GenericContentProvider.buildRequest


execute()

protected execute<TResult>(overrides): Promise<TResult>

Defined in: packages/base/BaseProvider.ts:330

Runs extraction and optional downloads through the shared coordinator.

Type Parameters

TResult

TResult

Parameters

overrides

{ entryUrl?: string; } | WikimediaExecArgs & object

Provider method request data, including execution shape.

Returns

Promise<TResult>

Extracted output in the shape requested by the provider method.

Inherited from

GenericContentProvider.execute


makeTargets()

protected makeTargets(sourceUrl, range, provider, method, addTrailingSlash?): object

Defined in: packages/base/BaseProvider.ts:359

Builds paginated target URLs for list-like provider methods.

Parameters

sourceUrl

string

Base URL before the page number.

range

Range

Page or start/end range to expand.

provider

Provider

Provider used for range validation errors.

method

string

Provider method used for range validation errors.

addTrailingSlash?

boolean = true

Whether generated target URLs should end with /.

Returns

object

Provider, method, and generated target URLs.

targets

targets: string[]

provider

provider: Provider

method

method: string

Inherited from

GenericContentProvider.makeTargets
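
A minimal sketch of the range expansion that `makeTargets` describes follows. It is an approximation under assumptions: the real `Range` type is not shown in this reference, so `PageRangeSketch` (a single page number or an inclusive start/end pair) is a guess, and `makeTargetsSketch` omits the `provider` and `method` arguments that the real method uses for error reporting.

```typescript
// Assumed shape for Range: a single page or an inclusive start/end pair.
type PageRangeSketch = number | { start: number; end: number };

// Expands a page range into one target URL per page.
function makeTargetsSketch(
  sourceUrl: string,
  range: PageRangeSketch,
  addTrailingSlash = true,
): string[] {
  const start = typeof range === "number" ? range : range.start;
  const end = typeof range === "number" ? range : range.end;
  // Validation rules here are assumptions: positive integers, start <= end.
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 1 || end < start) {
    throw new RangeError(`Invalid page range: ${start}..${end}`);
  }
  const targets: string[] = [];
  for (let page = start; page <= end; page++) {
    // sourceUrl is the base URL before the page number.
    targets.push(`${sourceUrl}${page}${addTrailingSlash ? "/" : ""}`);
  }
  return targets;
}

const pageTargets = makeTargetsSketch("https://example.org/list/page/", { start: 2, end: 4 });
console.log(pageTargets);
```

Passing `false` as the third argument produces the same URLs without the trailing `/`, which matters for sites that redirect or reject one of the two forms.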


getMetadata()

getMetadata(): Promise<DefaultExecutionResult<unknown>>

Defined in: packages/providers/shared/GenericContentProvider.ts:29

Returns

Promise<DefaultExecutionResult<unknown>>

Inherited from

GenericContentProvider.getMetadata


getLinks()

getLinks(): Promise<string[]>

Defined in: packages/providers/shared/GenericContentProvider.ts:39

Returns

Promise<string[]>

Inherited from

GenericContentProvider.getLinks


getImages()

getImages(): Promise<string[]>

Defined in: packages/providers/shared/GenericContentProvider.ts:44

Returns

Promise<string[]>

Inherited from

GenericContentProvider.getImages


getVideos()

getVideos(): Promise<string[]>

Defined in: packages/providers/shared/GenericContentProvider.ts:49

Returns

Promise<string[]>

Inherited from

GenericContentProvider.getVideos


getAudio()

getAudio(): Promise<string[]>

Defined in: packages/providers/shared/GenericContentProvider.ts:54

Returns

Promise<string[]>

Inherited from

GenericContentProvider.getAudio


getAllUrls()

getAllUrls(): Promise<string[]>

Defined in: packages/providers/shared/GenericContentProvider.ts:59

Returns

Promise<string[]>

Inherited from

GenericContentProvider.getAllUrls


getDownloadableResources()

getDownloadableResources(): Promise<string[]>

Defined in: packages/providers/shared/GenericContentProvider.ts:72

Returns

Promise<string[]>

Inherited from

GenericContentProvider.getDownloadableResources