Commit 901b82f

fix: replace Korean characters with Unicode escapes in getUrlList.R
- Convert Korean characters in regex patterns to \uXXXX escape sequences
- Korean regex patterns remain functional (조선비즈, 뉴스, 연합, 종합, etc.)
- Resolves R CMD check non-ASCII warning for portable packages

R CMD check no longer reports non-ASCII characters.
1 parent 71f55ed commit 901b82f

1 file changed

Lines changed: 3 additions & 4 deletions

R/getUrlList.R

@@ -22,11 +22,10 @@ getUrlList <- function(turl = url) {
   news_links <- all_href[news_mask]
 
   # Clean up titles
-  news_title <- gsub("\\s+(조선비즈|뉴스1|연합뉴스|MBN|YTN|중앙일보|동아일보|한겨레|경향신문|서울신문|한국경제|매일경제|시사IN|주간동아|시사매거진|여성신문|서울신문|[가-힣A-Za-z]+)\\s+\\d+분? 전\\s*$", "", news_title)
+  news_title <- gsub("\\s+(\uc870\uc120\ube44\uc988|\ub274\uc2a41|\uc5f0\ud569\ub274\uc2a4|MBN|YTN|\uc911\uc559\uc77c\ubcf4|\ub3d9\uc544\uc77c\ubcf4|\ud55c\uaca8\ub808|\uacbd\ud5a5\uc2e0\ubb38|\uc11c\uc6b8\uc2e0\ubb38|\ud55c\uad6d\uacbd\uc81c|\ub9e4\uc77c\uacbd\uc81c|\uc2dc\uc0acIN|\uc8fc\uac04\ub3d9\uc544|\uc2dc\uc0ac\ub9e4\uac70\uc9c4|\uc5ec\uc131\uc2e0\ubb38|\uc11c\uc6b8\uc2e0\ubb38|[\uac00-\ud7a3A-Za-z]+)\\s+\\d+\ubd84? \uc804\\s*$", "", news_title)
   # Remove trailing patterns with source prefix
-  news_title <- gsub("\\s*\\([가-힣A-Za-z]+=[가-힣A-Za-z]+\\)\\s*[가-힣A-Za-z]+(?:\\s+기자)?\\s*=.*$", "", news_title)
-  # Remove trailing "(종합)" pattern
-  news_title <- gsub("\\s*\\(종합\\)\\s*.*$", "", news_title)
+  news_title <- gsub("\\s*\\([\uac00-\ud7a3A-Za-z]+=[\uac00-\ud7a3A-Za-z]+\\)\\s*[\uac00-\ud7a3A-Za-z]+(?:\\s+\uae30\uc790)?\\s*=.*$", "", news_title)
+  news_title <- gsub("\\s*\\(\uc885\ud569\\)\\s*.*$", "", news_title)
   # 4. For longer texts (with article content), truncate to 150 chars
   news_title <- sapply(news_title, function(x) {
     if (nchar(x) > 150) {
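
Hand-transcribing \uXXXX escapes is error-prone, so it may help to generate them from the original strings instead. The following is a minimal sketch in base R, not part of this commit: `to_unicode_escapes` is a hypothetical helper name, and it assumes a UTF-8 session and characters in the Basic Multilingual Plane (which covers all Hangul syllables; code points above U+FFFF would need the `\UXXXXXXXX` form).

```r
# Hypothetical helper (not in getUrlList.R): convert a UTF-8 string into a
# string of \uXXXX escapes, leaving ASCII characters as they are.
to_unicode_escapes <- function(x) {
  cps <- utf8ToInt(enc2utf8(x))                # one integer code point per character
  chars <- intToUtf8(cps, multiple = TRUE)     # the same characters, one per element
  out <- ifelse(cps < 128L, chars, sprintf("\\u%04x", cps))
  paste(out, collapse = "")
}

# Print the escaped form so it can be pasted into a regex string literal.
cat(to_unicode_escapes("연합뉴스"), "\n")   # \uc5f0\ud569\ub274\uc2a4
cat(to_unicode_escapes("뉴스1"), "\n")      # \ub274\uc2a41

# Round-trip check: the escaped literal should equal the original text.
stopifnot(identical("\uc885\ud569", "종합"))
```

Running the helper over each alternative in the original pattern and comparing against the committed escapes is one way to confirm that the conversion preserved every character. `tools::showNonASCIIfile()` can then be used to check that no non-ASCII characters remain in the source file.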
