The pdftools package can import PDF files, but it does not work well
on pages with multiple columns or text boxes. Because
pdftools::pdf_text() processes text line by line, it often loses the
context of the text: the output may contain sentences interrupted by
unrelated fragments from other parts of the page. Words that are taken
out of context in this way are unsuitable for text analysis.
pdftextclusteR instead groups the words on a page into clusters using a density-based spatial clustering algorithm. Words in the same cluster are likely to be contextually related and are therefore suitable for text analysis. The package directly uses the clustering algorithms implemented in the dbscan package.
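The difference between the two approaches can be illustrated with a minimal sketch. It is not the package's internal implementation: the word coordinates are made up, the columns x, y and text only mimic part of what pdftools::pdf_data() returns, DBSCAN is picked as one of the algorithms offered by the dbscan package, and the eps and minPts values are illustrative assumptions.

```r
# Toy layout with made-up coordinates: two text columns, three lines each.
# Requires the dbscan package: install.packages("dbscan")
library(dbscan)

words <- data.frame(
  x    = c(10, 40, 10, 40, 10, 40, 200, 230, 200, 230, 200, 230),
  y    = c(10, 10, 22, 22, 34, 34, 10, 10, 22, 22, 34, 34),
  text = c("Left", "one", "left", "two", "left", "three",
           "Right", "one", "right", "two", "right", "three")
)

# Reading line by line (top to bottom, then left to right) mixes the columns.
by_line <- words[order(words$y, words$x), ]
line_text <- paste(by_line$text, collapse = " ")
line_text
#> [1] "Left one Right one left two right two left three right three"

# Cluster the words on their (x, y) positions.
# eps and minPts are illustrative values chosen for this toy layout.
fit <- dbscan(as.matrix(words[, c("x", "y")]), eps = 35, minPts = 2)
words$cluster <- fit$cluster

# Paste the words of each cluster together in reading order.
ordered <- words[order(words$cluster, words$y, words$x), ]
cluster_text <- tapply(ordered$text, ordered$cluster, paste, collapse = " ")
cluster_text[["1"]]
#> [1] "Left one left two left three"
cluster_text[["2"]]
#> [1] "Right one right two right three"
```

In this toy case the gap between the columns (160 units) is much larger than eps, so each column ends up in its own cluster. On real pages a suitable eps depends on font size and layout.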
See vignette("pdftextclusteR") for more information on the usage of
the package.
You can install the current version of pdftextclusteR using the following code.
devtools::install_github("coeneisma/pdftextclusteR")
The latest version can be found on the development branch. You can install it using the following code.
devtools::install_github("coeneisma/pdftextclusteR", ref = "development")
This is a basic example of the capabilities of this package.
library(pdftextclusteR)
library(pdftools)
# Read a PDF file with pdftools::pdf_data()
ka <- pdf_data("https://www.rijksoverheid.nl/binaries/rijksoverheid/documenten/rapporten/2024/06/10/kwaliteitsagenda-2024-2027-mediacollege-amsterdam/Kwaliteitsagenda+2024-2027+Mediacollege+Amsterdam.pdf")
# Detect clusters on the 7th page
ka_clusters <- ka[[7]] |>
  pdf_detect_clusters()
# Plot the clusters detected on the 7th page
ka_clusters |>
  pdf_plot_clusters()
Compared with the original document, the clustering is quite accurate.
The text of each cluster can be extracted for further analysis:
ka_clusters_text <- ka_clusters |>
  pdf_extract_clusters()
ka_clusters_text
#> # A tibble: 7 × 3
#> .cluster word_count text
#> <fct> <int> <chr>
#> 1 0 2 "Inleiding\n 7\n"
#> 2 1 89 "Waar liggen de belangrijkste ontwikkelopgaven voor onze …
#> 3 2 89 "Met deze Kwaliteitsagenda 2024-2027 wil MA haar\n ambiti…
#> 4 3 3 "Kwaliteitsagenda 2024-2027\n"
#> 5 4 63 "We bouwen voort op de doelstellingen uit de\n Kwaliteits…
#> 6 5 53 "De strategie en daarmee de prioriteiten voor de MA\n Kwa…
#> 7 6 93 "Ook het werkveld is binnen verschillende overleggen\n en…
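As one possible next step, short clusters such as page numbers and running titles can be dropped before analysis. The sketch below uses a made-up data frame that only mimics the columns shown above (.cluster, word_count, text); the min_words threshold is a hypothetical value, not a package default.

```r
# Made-up stand-in with the same columns as the extracted output above.
clusters_text <- data.frame(
  .cluster   = factor(c(0, 1, 2)),
  word_count = c(2L, 89L, 3L),
  text       = c("Header\n 7\n",
                 "A longer paragraph\n of body text ...",
                 "Running title\n"),
  stringsAsFactors = FALSE
)

# Hypothetical threshold: keep only clusters with at least this many words.
min_words <- 10
body_text <- clusters_text[clusters_text$word_count >= min_words, ]

# Collapse line breaks and repeated whitespace into single spaces.
body_text$text <- trimws(gsub("\\s+", " ", body_text$text))
body_text$text
#> [1] "A longer paragraph of body text ..."
```

The same filter can be applied to ka_clusters_text, since it has a word_count column; the right threshold depends on the document.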
