Academic Abstract Word Cloud Generation Pipeline
An NLP pipeline that extracts domain-specific terminology from biomedical conference abstracts and visualizes it as a word cloud. Supports mixed Japanese/English input.
Example: Word cloud generated from JHBM2026 abstracts, shaped by the conference logo.
- Bilingual support: Japanese abstracts are automatically translated to English via neural machine translation (Helsinki-NLP/opus-mt-ja-en) before term extraction.
- Biomedical NER: SciSpaCy identifies domain-specific entities (brain regions, methods, disorders, etc.).
- POS-constrained filtering: Only noun and proper-noun entities are retained, removing verbs and adjectives that add noise to word clouds.
- Academic stopword removal: Common academic boilerplate terms ("study", "results", "significant", etc.) are filtered out. Stopwords are fully customizable via external text files (--stopwords).
- Custom mask support: Shape your word cloud using any silhouette image (e.g., a brain, a conference logo).
- Logo → Mask conversion: Automatically convert a logo image into a word cloud mask, with auto-detection and removal of text portions (make_mask.py).
- Python 3.9+
- ~2 GB disk space for models (SciSpaCy + Opus-MT)
# Clone the repository
git clone https://github.com/RIKEN-BCIL/jhbm-wordcloud.git
cd jhbm-wordcloud
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Note: The first run automatically downloads the Opus-MT translation model (~300 MB) and the SciSpaCy model. Subsequent runs use the cached models.
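The download described in the note happens when the translation model is first loaded. As a rough illustration only (not necessarily how extract_terms.py is written), batch translation with Helsinki-NLP/opus-mt-ja-en through the Hugging Face transformers library could look like the sketch below. The names chunked and translate_ja_to_en and the batch size of 8 are hypothetical choices made for this example.

```python
from typing import Iterator, List

MODEL_NAME = "Helsinki-NLP/opus-mt-ja-en"


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_ja_to_en(sentences: List[str], batch_size: int = 8) -> List[str]:
    """Translate Japanese sentences to English, one batch at a time.

    Hypothetical helper. transformers is imported lazily so the model
    is fetched only when translation is actually requested.
    """
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
    model = MarianMTModel.from_pretrained(MODEL_NAME)

    translated: List[str] = []
    for batch in chunked(sentences, batch_size):
        # Pad to the longest sentence in the batch; truncate over-long input.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        translated.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translated
```

Batching keeps memory use bounded on long abstract collections; the right batch size depends on available RAM or GPU memory.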
A pre-made silhouette mask for JHBM is included as custom_silhouette_jhbm.png.
To generate a mask from a different logo, use make_mask.py, which can auto-extract a silhouette:
# Auto-detect and remove text, then create silhouette
python make_mask.py --input logo.png --crop-bottom-auto --out mask.png
# Manual crop (remove bottom 20%) + save cropped image
python make_mask.py --input logo.png --crop-bottom 0.2 \
--save-cropped logo_no_text.png --out mask.png
Note: For complex logos with fine internal structure, a manually prepared silhouette image may produce better results.
# Japanese + English input
python extract_terms.py --input_ja abstracts_ja.txt --input_en abstracts_en.txt --out freq.txt
# With custom stopwords
python extract_terms.py --input_en abstracts_en.txt \
--stopwords custom_stopword_jhbm2026.txt --out freq.txt
# English only, no default stopwords — use only your custom list
python extract_terms.py --input_en abstracts_en.txt \
--stopwords custom_stopword_jhbm2026.txt --no-default-stopwords --out freq.txt
Input format: Plain text files containing abstracts (one per line or continuous text). HTML tags are automatically stripped.
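Tag stripping of this kind can be approximated with the Python standard library. The sketch below is an assumption about the general approach, not the code used in extract_terms.py; strip_html and _TextExtractor are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(text: str) -> str:
    """Return `text` without tags, with runs of whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    # Join with spaces so words in adjacent block elements do not fuse.
    return " ".join(" ".join(parser.parts).split())
```

Joining text nodes with a space favors word separation between adjacent elements such as paragraphs, at the cost of splitting a word that is interrupted by an inline tag.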
Output: A TSV file with term frequencies:
connectivity 45
fmri 38
network 35
brain activity 28
...
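For downstream use, the frequency file can be read back into a dictionary. This is a minimal sketch that assumes each line has the form term, tab, count; multiword terms such as "brain activity" are the reason it splits on the last tab rather than on spaces. The function name parse_frequencies is hypothetical.

```python
from typing import Dict, Iterable


def parse_frequencies(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'term<TAB>count' lines into a {term: count} mapping."""
    freqs: Dict[str, int] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        term, sep, count = line.rpartition("\t")
        if not sep:
            continue  # no tab: not a data line
        try:
            freqs[term.strip()] = int(count)
        except ValueError:
            continue  # count column is not an integer
    return freqs


# Usage with a file on disk:
#     with open("freq.txt", encoding="utf-8") as fh:
#         freqs = parse_frequencies(fh)
```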
# With mask image (custom shape)
python make_wordcloud.py --freq freq.txt --mask custom_silhouette_jhbm.png --out wordcloud.png
# Basic (rectangular, no mask)
python make_wordcloud.py --freq freq.txt --out wordcloud.png
# Custom options
python make_wordcloud.py --freq freq.txt --mask custom_silhouette_jhbm.png \
--bg-color "#050814" --max-words 300 --out wordcloud.png
Mask image requirements:
- White (255) regions → excluded (background)
- Black (0) regions → words are placed here
- PNG format recommended
- A pre-made silhouette for JHBM is included: custom_silhouette_jhbm.png
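The black/white convention above can be checked before rendering. The following sketch works on plain nested lists of grayscale values so that it has no dependencies; in practice the pixels would come from an image library. The threshold of 128 and the function names are assumptions made for this example.

```python
from typing import List, Sequence

WHITE = 255  # excluded background
BLACK = 0    # area where words are placed


def binarize_mask(gray: Sequence[Sequence[int]], threshold: int = 128) -> List[List[int]]:
    """Snap every grayscale pixel to pure black or pure white."""
    return [[BLACK if px < threshold else WHITE for px in row] for row in gray]


def drawable_fraction(mask: Sequence[Sequence[int]]) -> float:
    """Return the share of pixels that are black, i.e. available for words."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    black = sum(1 for row in mask for px in row if px == BLACK)
    return black / total
```

A very small drawable fraction suggests the mask is inverted or mostly background, in which case few words will fit.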
make_mask.py options:
| Option | Default | Description |
|---|---|---|
| --input | (required) | Input logo/icon image (PNG/JPG) |
| --out | mask.png | Output mask image |
| --crop-bottom-auto | off | Auto-detect and remove text below graphic |
| --crop-bottom | None | Manual crop: fraction of height to remove (e.g. 0.2) |
| --close-iterations | 8 | Morphological closing iterations |
| --blur | 3.0 | Edge smoothing radius |
| --padding | 30 | White padding around mask (px) |
| --save-cropped | None | Save intermediate cropped image |
extract_terms.py options:
| Option | Description |
|---|---|
| --input_ja | Japanese text file |
| --input_en | English text file |
| --out | Output frequency file (default: freq.txt) |
| --stopwords | External stopword file, one word per line |
| --no-default-stopwords | Disable built-in defaults, use only --stopwords file |
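How --stopwords and --no-default-stopwords might combine can be sketched as follows. This is an illustration under assumptions, not the tool's code: the three default words are taken from the examples in this README rather than from the real built-in list, lowercasing is assumed, and the function names are hypothetical.

```python
from typing import Iterable, Set

# Illustrative defaults only; the tool's built-in list is not reproduced here.
DEFAULT_STOPWORDS = frozenset({"study", "results", "significant"})


def parse_stopwords(lines: Iterable[str]) -> Set[str]:
    """Read one word per line, skipping blanks and lowercasing (assumed)."""
    return {line.strip().lower() for line in lines if line.strip()}


def build_stopwords(custom_lines: Iterable[str] = (), use_defaults: bool = True) -> Set[str]:
    """Merge custom stopwords with the defaults unless defaults are disabled."""
    words = set(DEFAULT_STOPWORDS) if use_defaults else set()
    words |= parse_stopwords(custom_lines)
    return words
```

Passing use_defaults=False mirrors the documented behavior of --no-default-stopwords, where only the words from the custom file apply.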
make_wordcloud.py options:
| Option | Default | Description |
|---|---|---|
| --freq | (required) | Frequency file (TSV) |
| --mask | None | Mask image (PNG). Black=word area, White=excluded |
| --out | wordcloud.png | Output image file |
| --bg-color | #1E1D34 | Silhouette fill color (dark navy) |
| --outside-color | #FFFFFF | Color outside the mask (white) |
| --max-words | 200 | Maximum words to display |
| --width | 1200 | Image width (px, ignored with mask) |
| --height | 800 | Image height (px, ignored with mask) |
- Translation: Japanese text is translated to English in batches using Opus-MT, a lightweight neural MT model trained on OPUS parallel corpora.
- Named Entity Recognition: SciSpaCy (en_core_sci_sm) identifies biomedical entities such as brain regions, proteins, disorders, and techniques.
- POS Filtering: Each candidate entity is re-analyzed for part-of-speech tags. Only entities composed entirely of nouns (NOUN) and proper nouns (PROPN) are retained. This eliminates verbal phrases ("was measured", "significantly increased") that NER may capture.
- Stopword Filtering: A curated set of academic boilerplate terms is removed. A small built-in default set is applied unless --no-default-stopwords is given, and users can supply a custom stopword file (--stopwords) for conference-specific tuning. An example file for JHBM2026 is included as custom_stopword_jhbm2026.txt.
- Noise Cleanup: Subword tokenization artifacts (e.g., repeated hyphenated fragments like "t-t-t-t") are detected and removed via regex heuristics.
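The filtering steps above can be sketched without loading any model by representing each candidate entity as a list of (token, POS tag) pairs; with spaCy, the tags would typically come from each token's pos_ attribute. The regex, the lowercasing, and all function names below are assumptions for illustration, not the heuristics actually used by extract_terms.py.

```python
import re
from collections import Counter
from typing import Iterable, Sequence, Set, Tuple

ALLOWED_POS = {"NOUN", "PROPN"}

# Hypothetical heuristic: a fragment of 1-3 characters repeated at least
# three times and joined by hyphens, such as "t-t-t-t".
REPEATED_FRAGMENT = re.compile(r"^(\w{1,3})(?:-\1){2,}$")

Entity = Sequence[Tuple[str, str]]  # (token text, coarse POS tag)


def is_nominal(entity: Entity) -> bool:
    """True if the entity is non-empty and every token is NOUN or PROPN."""
    return len(entity) > 0 and all(pos in ALLOWED_POS for _, pos in entity)


def is_noise(term: str) -> bool:
    """True if the term looks like a repeated-fragment artifact."""
    return REPEATED_FRAGMENT.match(term) is not None


def count_terms(entities: Iterable[Entity], stopwords: Set[str]) -> Counter:
    """Apply POS, stopword, and noise filters, then count surviving terms."""
    counts: Counter = Counter()
    for entity in entities:
        if not is_nominal(entity):
            continue
        term = " ".join(token.lower() for token, _ in entity)
        if term in stopwords or is_noise(term):
            continue
        counts[term] += 1
    return counts
```

In this sketch the POS check runs first, so verbal phrases are discarded before any string normalization takes place.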
- Word size is proportional to term frequency.
- Rainbow HSL coloring with high saturation against a dark background.
- When a mask is used, the word cloud is rendered with a transparent (RGBA) background and alpha-composited onto a two-tone canvas (dark silhouette + white exterior), producing clean edges without contour artifacts.
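The rainbow coloring can be illustrated with a color function of the kind the wordcloud library accepts through its color_func parameter. This is a sketch under assumptions: the saturation of 90% and lightness of 60% are guesses chosen to read well on a dark background, not values taken from make_wordcloud.py, and rainbow_color_func is a hypothetical name.

```python
import random
from typing import Optional


def rainbow_color_func(word, font_size, position, orientation,
                       random_state: Optional[random.Random] = None, **kwargs) -> str:
    """Return a high-saturation HSL color string with a random hue."""
    rng = random_state if random_state is not None else random.Random()
    hue = rng.randint(0, 359)
    # Fixed saturation and lightness (assumed values) keep words vivid.
    return f"hsl({hue}, 90%, 60%)"
```

A function like this could be passed as color_func when constructing wordcloud.WordCloud, which calls it once per word.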
- Natsuko Kashida (柏田 夏子) — RIKEN Center for Biosystems Dynamics Research, Brain Connectomics Imaging Lab (BCIL)
- Takuya Hayashi (林 拓也) — RIKEN Center for Biosystems Dynamics Research, Brain Connectomics Imaging Lab (BCIL)
If you use this tool in your research, please cite:
JHBM Abstract Word Cloud Pipeline
Natsuko Kashida & Takuya Hayashi
RIKEN Center for Biosystems Dynamics Research, Brain Connectomics Imaging Lab
https://github.com/RIKEN-BCIL/jhbm-wordcloud
This project is licensed under the MIT License — see LICENSE for details.
- SciSpaCy — Neumann et al., 2019
- Helsinki-NLP/Opus-MT — Tiedemann & Thottingal, 2020
- wordcloud — Andreas Mueller
- Developed for the 28th Annual Meeting of the Japanese Society for Human Brain Mapping (JHBM2026)