Academic Abstract Word Cloud Pipeline

License: MIT | Python 3.9+

An NLP pipeline that extracts domain-specific terminology from biomedical conference abstracts and visualizes it as a word cloud. Supports mixed Japanese/English input.

Figure: JHBM2026 word cloud. Example generated from JHBM2026 abstracts, shaped by the conference logo.


Features

  • Bilingual support: Japanese abstracts are automatically translated to English via neural machine translation (Helsinki-NLP/opus-mt-ja-en) before term extraction.
  • Biomedical NER: SciSpaCy identifies domain-specific entities (brain regions, methods, disorders, etc.).
  • POS-constrained filtering: Only noun and proper-noun entities are retained, removing verbs and adjectives that add noise to word clouds.
  • Academic stopword removal: Common academic boilerplate terms ("study", "results", "significant", etc.) are filtered out. Stopwords are fully customizable via external text files (--stopwords).
  • Custom mask support: Shape your word cloud using any silhouette image (e.g., a brain, a conference logo).
  • Logo → Mask conversion: Automatically convert a logo image into a word cloud mask, with auto-detection and removal of text portions (make_mask.py).

Installation

Requirements

  • Python 3.9+
  • ~2 GB disk space for models (SciSpaCy + Opus-MT)

Setup

# Clone the repository
git clone https://github.com/RIKEN-BCIL/jhbm-wordcloud.git
cd jhbm-wordcloud

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

Note: The first run downloads the Opus-MT translation model (~300 MB) and the SciSpaCy model automatically. Subsequent runs use the cached models.

Usage

Step 0 (optional): Create mask from logo

A pre-made silhouette mask for JHBM is included as custom_silhouette_jhbm.png. To generate a mask from a different logo, make_mask.py can extract a silhouette automatically:

# Auto-detect and remove text, then create silhouette
python make_mask.py --input logo.png --crop-bottom-auto --out mask.png

# Manual crop (remove bottom 20%) + save cropped image
python make_mask.py --input logo.png --crop-bottom 0.2 \
    --save-cropped logo_no_text.png --out mask.png
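
To make the two cropping modes more concrete, here is a minimal sketch in plain Python. It is not the code in make_mask.py: `crop_bottom_fraction` and `auto_crop_row` are hypothetical helpers, and the auto-detection heuristic (cut at the last blank band of rows that has content both above and below it) is only an assumption about one way text under a graphic could be located.

```python
def crop_bottom_fraction(height, fraction):
    """Return how many rows remain after removing the bottom `fraction`."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return height - int(round(height * fraction))


def auto_crop_row(rows, background=255):
    """Guess the row index at which to cut off text below a graphic.

    `rows` is a grayscale image given as a list of rows of 0-255 values.
    Assumed heuristic: find the last blank band with non-background pixels
    both above and below it, and cut at the start of that band.
    Returns len(rows) (no crop) if no such band exists.
    """
    has_ink = [any(px != background for px in row) for row in rows]
    cut = len(rows)
    seen_ink = False
    band_start = None
    for y, ink in enumerate(has_ink):
        if ink:
            if band_start is not None and seen_ink:
                cut = band_start  # blank band enclosed by content
            band_start = None
            seen_ink = True
        elif band_start is None:
            band_start = y
    return cut
```

With `--crop-bottom 0.2` on a 100-row image, the first helper keeps 80 rows. The second helper would fail on logos whose graphic itself contains horizontal gaps, which is one reason a manual crop option is useful.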

Note: For complex logos with fine internal structure, manually prepared silhouette images may produce better results.

Step 1: Extract terms

# Japanese + English input
python extract_terms.py --input_ja abstracts_ja.txt --input_en abstracts_en.txt --out freq.txt

# With custom stopwords
python extract_terms.py --input_en abstracts_en.txt \
    --stopwords custom_stopword_jhbm2026.txt --out freq.txt

# English only, no default stopwords — use only your custom list
python extract_terms.py --input_en abstracts_en.txt \
    --stopwords custom_stopword_jhbm2026.txt --no-default-stopwords --out freq.txt

Input format: Plain text files containing abstracts (one per line or continuous text). HTML tags are automatically stripped.
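
As an illustration of the tag stripping mentioned above, the following sketch uses only the standard library. `strip_html` is a hypothetical helper, not necessarily how extract_terms.py does it; it also decodes character references such as `&amp;` and collapses runs of whitespace, both of which are assumptions about desirable behavior.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Remove tags, decode character references, normalize whitespace."""
    collector = _TextCollector()
    collector.feed(text)
    collector.close()  # flush any buffered trailing text
    return " ".join("".join(collector.parts).split())
```

Note that text on either side of a tag is joined without a space, so adjacent block elements such as `<p>a</p><p>b</p>` would run together; a real pipeline might insert a space for block-level tags.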

Output: A TSV file with term frequencies:

connectivity    45
fmri            38
network         35
brain activity  28
...
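
If you want to inspect or post-process the frequency file yourself, a small reader could look like the sketch below. It assumes each line is `term<TAB>count` with an integer count, as in the sample above; `parse_frequencies` is a hypothetical helper and not part of this repository.

```python
def parse_frequencies(lines):
    """Parse 'term<TAB>count' lines into a dict of term -> int count."""
    freqs = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so multi-word terms stay intact.
        term, sep, count = line.rpartition("\t")
        if not sep or not term.strip():
            raise ValueError(f"line {number}: expected 'term<TAB>count'")
        freqs[term.strip()] = int(count)
    return freqs
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `parse_frequencies(open("freq.txt", encoding="utf-8"))`.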

Step 2: Generate word cloud

# With mask image (custom shape)
python make_wordcloud.py --freq freq.txt --mask custom_silhouette_jhbm.png --out wordcloud.png

# Basic (rectangular, no mask)
python make_wordcloud.py --freq freq.txt --out wordcloud.png

# Custom options
python make_wordcloud.py --freq freq.txt --mask custom_silhouette_jhbm.png \
    --bg-color "#050814" --max-words 300 --out wordcloud.png

Mask image requirements:

  • White (255) regions → excluded (background)
  • Black (0) regions → words are placed here
  • PNG format recommended
  • A pre-made silhouette for JHBM is included: custom_silhouette_jhbm.png
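
To check that a mask follows the convention above before rendering, something like the following sketch may help. It works on a grayscale image given as nested lists of 0-255 values so that it stays independent of any image library. The helper names and the threshold of 128 for anti-aliased edge pixels are assumptions, not behavior of make_wordcloud.py.

```python
def binarize(pixels, threshold=128):
    """Map each grayscale value to pure black (0) or pure white (255)."""
    return [[0 if px < threshold else 255 for px in row] for row in pixels]


def word_area_fraction(pixels, threshold=128):
    """Fraction of pixels dark enough to count as word-placement area."""
    total = sum(len(row) for row in pixels)
    if total == 0:
        raise ValueError("empty image")
    dark = sum(px < threshold for row in pixels for px in row)
    return dark / total
```

A fraction close to 0 or 1 usually suggests the mask is inverted or blank, which is worth checking before generating the word cloud.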

Options

make_mask.py

Option              Default     Description
--input             (required)  Input logo/icon image (PNG/JPG)
--out               mask.png    Output mask image
--crop-bottom-auto  off         Auto-detect and remove text below the graphic
--crop-bottom       None        Manual crop: fraction of height to remove (e.g. 0.2)
--close-iterations  8           Morphological closing iterations
--blur              3.0         Edge smoothing radius
--padding           30          White padding around mask (px)
--save-cropped      None        Save intermediate cropped image

extract_terms.py

Option                  Description
--input_ja              Japanese text file
--input_en              English text file
--out                   Output frequency file (default: freq.txt)
--stopwords             External stopword file, one word per line
--no-default-stopwords  Disable built-in defaults; use only the --stopwords file
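
The interaction between --stopwords and --no-default-stopwords can be sketched as follows. This is an illustration only: `DEFAULT_STOPWORDS` here contains just the three example terms named in this README, and `load_stopwords`, along with its lowercasing of entries, is an assumption rather than the script's actual implementation.

```python
# Illustrative defaults only; the real built-in list is larger.
DEFAULT_STOPWORDS = {"study", "results", "significant"}


def load_stopwords(lines=None, use_defaults=True):
    """Combine optional built-in defaults with one-word-per-line entries."""
    stopwords = set(DEFAULT_STOPWORDS) if use_defaults else set()
    for line in lines or []:
        word = line.strip().lower()  # assumed: matching is case-insensitive
        if word:
            stopwords.add(word)
    return stopwords
```

Passing `use_defaults=False` mirrors --no-default-stopwords: only the entries from the supplied file are used.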

make_wordcloud.py

Option           Default        Description
--freq           (required)     Frequency file (TSV)
--mask           None           Mask image (PNG). Black = word area, white = excluded
--out            wordcloud.png  Output image file
--bg-color       #1E1D34        Silhouette fill color (dark navy)
--outside-color  #FFFFFF        Color outside the mask (white)
--max-words      200            Maximum words to display
--width          1200           Image width (px, ignored with mask)
--height         800            Image height (px, ignored with mask)
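
For readers who want to wrap or script the tool, the option table above maps naturally onto an argparse declaration. The sketch below reproduces the documented names and defaults; `build_parser` and the help strings are illustrative and may differ from the parser actually defined in make_wordcloud.py.

```python
import argparse


def build_parser():
    """Declare the documented make_wordcloud.py options and defaults."""
    parser = argparse.ArgumentParser(
        description="Render a word cloud from a term-frequency file."
    )
    parser.add_argument("--freq", required=True, help="Frequency file (TSV)")
    parser.add_argument("--mask", default=None, help="Mask image (PNG)")
    parser.add_argument("--out", default="wordcloud.png", help="Output image")
    parser.add_argument("--bg-color", default="#1E1D34",
                        help="Silhouette fill color")
    parser.add_argument("--outside-color", default="#FFFFFF",
                        help="Color outside the mask")
    parser.add_argument("--max-words", type=int, default=200)
    parser.add_argument("--width", type=int, default=1200)
    parser.add_argument("--height", type=int, default=800)
    return parser
```

argparse converts dashes in option names to underscores, so `--bg-color` is read back as `args.bg_color`.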

How It Works

Term Extraction Strategy

  1. Translation: Japanese text is translated to English in batches using Opus-MT, a lightweight neural MT model trained on OPUS parallel corpora.

  2. Named Entity Recognition: SciSpaCy (en_core_sci_sm) identifies biomedical entities — brain regions, proteins, disorders, techniques, etc.

  3. POS Filtering: Each candidate entity is re-analyzed for part-of-speech tags. Only entities composed entirely of nouns (NOUN) and proper nouns (PROPN) are retained. This eliminates verbal phrases ("was measured", "significantly increased") that NER may capture.

  4. Stopword Filtering: A curated set of academic boilerplate terms is removed. A small built-in default set is applied unless --no-default-stopwords is given, and users can supply a custom stopword file (--stopwords) for conference-specific tuning. An example file for JHBM2026 is included as custom_stopword_jhbm2026.txt.

  5. Noise Cleanup: Subword tokenization artifacts (e.g., repeated hyphenated fragments like "t-t-t-t") are detected and removed via regex heuristics.
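
Steps 3 to 5 can be sketched without loading any model by operating on entities that have already been tagged. In the sketch below each entity is a list of `(text, pos)` pairs, which in a spaCy-based pipeline could come from each token's `text` and `pos_` attributes. The function names, the lowercasing of terms, and the exact regex for repeated fragments are assumptions for illustration, not the patterns used in extract_terms.py.

```python
import re
from collections import Counter

ALLOWED_POS = {"NOUN", "PROPN"}
# Assumed pattern: a 1-3 character fragment repeated at least three times
# with hyphens in between, e.g. "t-t-t-t".
REPEAT_FRAGMENT = re.compile(r"^(\w{1,3})(?:-\1){2,}$")


def is_noun_phrase(tagged_tokens):
    """True if the entity is non-empty and every token is NOUN or PROPN."""
    return bool(tagged_tokens) and all(
        pos in ALLOWED_POS for _, pos in tagged_tokens
    )


def is_noise(term):
    """True for repeated hyphenated fragments left by subword tokenization."""
    return bool(REPEAT_FRAGMENT.match(term))


def count_terms(entities, stopwords):
    """Apply POS, stopword, and noise filters, then count surviving terms."""
    counts = Counter()
    for tagged_tokens in entities:
        if not is_noun_phrase(tagged_tokens):
            continue
        term = " ".join(text.lower() for text, _ in tagged_tokens)
        if term in stopwords or is_noise(term):
            continue
        counts[term] += 1
    return counts
```

Checking the POS constraint before building the term string keeps verbal phrases such as "was measured" out of the stopword and noise checks entirely.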

Word Cloud Generation

  • Word size is proportional to term frequency.
  • Rainbow HSL coloring with high saturation against a dark background.
  • When a mask is used, the word cloud is rendered with a transparent (RGBA) background and alpha-composited onto a two-tone canvas (dark silhouette + white exterior), producing clean edges without contour artifacts.
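
The coloring and compositing described above can be illustrated per pixel with the standard library alone. This is a sketch under stated assumptions: the saturation and lightness values, the mask threshold of 128, and all three helper names are hypothetical, while the default canvas colors are taken from the --bg-color and --outside-color defaults listed in the options.

```python
import colorsys
import random


def rainbow_color(rng, saturation=0.9, lightness=0.6):
    """Pick a random hue at high saturation; return (r, g, b) as 0-255 ints."""
    hue = rng.random()
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return tuple(round(c * 255) for c in (r, g, b))


def canvas_color(mask_value, inside=(0x1E, 0x1D, 0x34), outside=(255, 255, 255)):
    """Two-tone canvas: dark fill where the mask is black, white elsewhere."""
    return inside if mask_value < 128 else outside


def composite_pixel(fg_rgba, bg_rgb):
    """Alpha-composite one RGBA foreground pixel over an opaque background."""
    r, g, b, a = fg_rgba
    alpha = a / 255
    return tuple(
        round(f * alpha + k * (1 - alpha)) for f, k in zip((r, g, b), bg_rgb)
    )
```

Fully transparent word cloud pixels leave the canvas color untouched, so the silhouette boundary is defined by the canvas alone.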

Authors

  • Natsuko Kashida (柏田 夏子) — RIKEN Center for Biosystems Dynamics Research, Brain Connectomics Imaging Lab (BCIL)
  • Takuya Hayashi (林 拓也) — RIKEN Center for Biosystems Dynamics Research, Brain Connectomics Imaging Lab (BCIL)

Citation

If you use this tool in your research, please cite:

JHBM Abstract Word Cloud Pipeline
Natsuko Kashida & Takuya Hayashi
RIKEN Center for Biosystems Dynamics Research, Brain Connectomics Imaging Lab
https://github.com/RIKEN-BCIL/jhbm-wordcloud

License

This project is licensed under the MIT License — see LICENSE for details.

Acknowledgments