Academic Abstract Word Cloud Pipeline

An NLP pipeline that extracts domain-specific terminology from biomedical conference abstracts and visualizes it as a word cloud. Mixed Japanese/English input is supported.

Example: Word cloud generated from JHBM2026 abstracts, shaped by the conference logo.

Features

Bilingual support: Japanese abstracts are automatically translated to English with a neural machine translation model (Helsinki-NLP/opus-mt-ja-en) before term extraction.
Biomedical NER: SciSpaCy identifies domain-specific entities (brain regions, methods, disorders, etc.).
POS-constrained filtering: Only entities made up of nouns and proper nouns are retained, which removes the verbs and adjectives that add noise to word clouds.
Academic stopword removal: Common academic boilerplate terms ("study", "results", "significant", etc.) are filtered out. The stopword list can be customized through an external text file (--stopwords).
Custom mask support: Shape the word cloud with any silhouette image (e.g., a brain or a conference logo).
Logo → Mask conversion: make_mask.py converts a logo image into a word cloud mask, automatically detecting and removing text portions of the logo.

Installation

Requirements

Python 3.9+
~2 GB disk space for models (SciSpaCy + Opus-MT)

Setup

# Clone the repository
git clone https://github.com/RIKEN-BCIL/jhbm-wordcloud.git
cd jhbm-wordcloud

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

Note: The first run automatically downloads the Opus-MT translation model (~300 MB) and the SciSpaCy model. Subsequent runs use the cached models.

Usage

Step 0 (optional): Create mask from logo

A pre-made silhouette mask for JHBM is included as custom_silhouette_jhbm.png. To generate a mask from a different logo, use make_mask.py to extract a silhouette automatically:

# Auto-detect and remove text, then create silhouette
python make_mask.py --input logo.png --crop-bottom-auto --out mask.png

# Manual crop (remove bottom 20%) + save cropped image
python make_mask.py --input logo.png --crop-bottom 0.2 \
    --save-cropped logo_no_text.png --out mask.png

Note: For complex logos with fine internal structure, a manually prepared silhouette image may produce better results.

Step 1: Extract terms

# Japanese + English input
python extract_terms.py --input_ja abstracts_ja.txt --input_en abstracts_en.txt --out freq.txt

# With custom stopwords
python extract_terms.py --input_en abstracts_en.txt \
    --stopwords custom_stopword_jhbm2026.txt --out freq.txt

# English only, no default stopwords — use only your custom list
python extract_terms.py --input_en abstracts_en.txt \
    --stopwords custom_stopword_jhbm2026.txt --no-default-stopwords --out freq.txt

Input format: Plain-text files containing abstracts, either one per line or as continuous text. HTML tags are stripped automatically.

Output: A TSV file with term frequencies:

connectivity    45
fmri            38
network         35
brain activity  28
...
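
As a rough illustration of how the frequency file could be consumed by other tools, the sketch below parses `term<TAB>count` lines into a dictionary. The helper name `parse_frequencies` is hypothetical and is not part of this repository. It assumes one tab between term and count, and it skips anything that does not fit that shape.

```python
def parse_frequencies(lines):
    """Parse 'term<TAB>count' lines into a dict, skipping blank or malformed lines."""
    freqs = {}
    for line in lines:
        # Split on the last tab so multi-word terms such as "brain activity" stay whole.
        term, sep, count = line.rpartition("\t")
        term, count = term.strip(), count.strip()
        if not sep or not term or not count.isdigit():
            continue  # not a 'term<TAB>count' line
        freqs[term] = freqs.get(term, 0) + int(count)
    return freqs


# Inline sample standing in for the contents of freq.txt
sample = ["connectivity\t45", "fmri\t38", "brain activity\t28", "", "..."]
freqs = parse_frequencies(sample)
print(freqs)
```

An open file handle can be passed in place of the sample list, since the function only iterates over lines.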

Step 2: Generate word cloud

# With mask image (custom shape)
python make_wordcloud.py --freq freq.txt --mask custom_silhouette_jhbm.png --out wordcloud.png

# Basic (rectangular, no mask)
python make_wordcloud.py --freq freq.txt --out wordcloud.png

# Custom options
python make_wordcloud.py --freq freq.txt --mask custom_silhouette_jhbm.png \
    --bg-color "#050814" --max-words 300 --out wordcloud.png

Mask image requirements:

White (255) regions → excluded (background)
Black (0) regions → words are placed here
PNG format recommended
A pre-made silhouette for JHBM is included: custom_silhouette_jhbm.png
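
To make the black/white convention above concrete, the following sketch estimates how much of a mask is available for words. It works on a plain grid of grayscale values, so it runs without an image library. The helper name `word_area_fraction` is hypothetical. Treating only pure white (255) as excluded follows the wordcloud library's documented convention and is an assumption about this pipeline, since make_wordcloud.py may threshold the mask differently.

```python
def word_area_fraction(rows):
    """Fraction of pixels where words may be placed (anything that is not pure white)."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    # Assumption: only value 255 counts as excluded background.
    placeable = sum(1 for row in rows for value in row if value != 255)
    return placeable / total


# 3x4 toy mask: 0 = black (word area), 255 = white (excluded)
toy_mask = [
    [255, 0, 0, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
]
print(word_area_fraction(toy_mask))
```

A very small fraction suggests the silhouette leaves little room for words, so fewer terms will fit at a readable size.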

Options

`make_mask.py`

Option	Default	Description
`--input`	(required)	Input logo/icon image (PNG/JPG)
`--out`	`mask.png`	Output mask image
`--crop-bottom-auto`	off	Auto-detect and remove text below graphic
`--crop-bottom`	None	Manual crop: fraction of height to remove (e.g. `0.2`)
`--close-iterations`	8	Morphological closing iterations
`--blur`	3.0	Edge smoothing radius
`--padding`	30	White padding around mask (px)
`--save-cropped`	None	Save intermediate cropped image
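
The `--crop-bottom` fraction can be read as "drop this share of the image rows, counted from the bottom". The sketch below shows that arithmetic. The helper `crop_box` is hypothetical, and the rounding used by make_mask.py itself is not documented here, so treat the rounding as an assumption. The returned tuple follows Pillow's `(left, upper, right, lower)` box convention.

```python
def crop_box(width, height, crop_bottom):
    """Return a (left, upper, right, lower) box that drops the bottom fraction of rows."""
    if not 0.0 <= crop_bottom < 1.0:
        raise ValueError("crop_bottom must be in [0, 1)")
    # Assumption: the number of removed rows is rounded to the nearest integer.
    keep_rows = height - int(round(height * crop_bottom))
    return (0, 0, width, keep_rows)


print(crop_box(800, 600, 0.2))  # (0, 0, 800, 480)
```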

`extract_terms.py`

Option	Description
`--input_ja`	Japanese text file
`--input_en`	English text file
`--out`	Output frequency file (default: `freq.txt`)
`--stopwords`	External stopword file, one word per line
`--no-default-stopwords`	Disable built-in defaults, use only `--stopwords` file
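
The interaction between `--stopwords` and `--no-default-stopwords` can be pictured as building one set from two sources. This is a minimal sketch, not the code in extract_terms.py. `DEFAULT_STOPWORDS` holds only the three example words quoted in this README, the helper `build_stopwords` is hypothetical, and lowercasing the entries is an assumption.

```python
DEFAULT_STOPWORDS = {"study", "results", "significant"}  # illustrative subset only


def build_stopwords(custom_lines=(), use_defaults=True):
    """Combine optional built-in defaults with words from a one-word-per-line file."""
    stopwords = set(DEFAULT_STOPWORDS) if use_defaults else set()
    for line in custom_lines:
        word = line.strip().lower()  # assumption: matching is case-insensitive
        if word:
            stopwords.add(word)
    return stopwords


# Lines as they would be read from a custom stopword file
custom = ["JHBM\n", "\n", "abstract\n"]
print(sorted(build_stopwords(custom)))
print(sorted(build_stopwords(custom, use_defaults=False)))
```

Passing `use_defaults=False` mirrors the `--no-default-stopwords` case, where only the custom list is applied.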

`make_wordcloud.py`

Option	Default	Description
`--freq`	(required)	Frequency file (TSV)
`--mask`	None	Mask image (PNG). Black=word area, White=excluded
`--out`	`wordcloud.png`	Output image file
`--bg-color`	`#1E1D34`	Silhouette fill color (dark navy)
`--outside-color`	`#FFFFFF`	Color outside the mask (white)
`--max-words`	200	Maximum words to display
`--width`	1200	Image width (px, ignored with mask)
`--height`	800	Image height (px, ignored with mask)

How It Works

Term Extraction Strategy

Translation: Japanese text is translated to English in batches using Opus-MT, a lightweight neural MT model trained on OPUS parallel corpora.
Named Entity Recognition: SciSpaCy (en_core_sci_sm) identifies biomedical entities — brain regions, proteins, disorders, techniques, etc.
POS Filtering: Each candidate entity is re-analyzed for part-of-speech tags. Only entities composed entirely of nouns (NOUN) and proper nouns (PROPN) are retained. This eliminates verbal phrases ("was measured", "significantly increased") that NER may capture.
Stopword Filtering: A curated set of academic boilerplate terms is removed. A small built-in default set is applied unless --no-default-stopwords is given, and users can supply a custom stopword file (--stopwords) for conference-specific tuning. An example file for JHBM2026 is included as custom_stopword_jhbm2026.txt.
Noise Cleanup: Subword tokenization artifacts (e.g., repeated hyphenated fragments like "t-t-t-t") are detected and removed via regex heuristics.
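
The filtering steps above (POS filtering, stopword filtering, noise cleanup) can be sketched without loading any model. In this sketch the POS tags are supplied by hand as `(text, pos)` pairs, whereas the real pipeline gets them from SciSpaCy. The function `count_terms`, the regex, and the choice to match stopwords against the whole lowercased term are all assumptions made for illustration, not the code in extract_terms.py.

```python
import re
from collections import Counter

ALLOWED_POS = {"NOUN", "PROPN"}
# Assumed heuristic: a short fragment repeated three or more times with hyphens, e.g. "t-t-t-t"
REPEATED_FRAGMENT = re.compile(r"^(\w{1,3})(?:-\1){2,}$")


def count_terms(entities, stopwords):
    """Count entities that pass the POS, stopword, and noise checks.

    `entities` is an iterable of token lists; each token is a (text, pos) pair.
    """
    counts = Counter()
    for tokens in entities:
        # Keep an entity only if every token is a noun or proper noun.
        if not tokens or any(pos not in ALLOWED_POS for _, pos in tokens):
            continue
        term = " ".join(text.lower() for text, _ in tokens)
        if term in stopwords or REPEATED_FRAGMENT.match(term):
            continue
        counts[term] += 1
    return counts


# Hand-tagged candidates standing in for NER output
entities = [
    [("brain", "NOUN"), ("activity", "NOUN")],
    [("fMRI", "PROPN")],
    [("was", "AUX"), ("measured", "VERB")],  # dropped: verbal phrase
    [("study", "NOUN")],                     # dropped: stopword
    [("t-t-t-t", "NOUN")],                   # dropped: repeated fragment
    [("fMRI", "PROPN")],
]
print(count_terms(entities, stopwords={"study"}))
```

The regex requires at least three repetitions of the same fragment, so an ordinary hyphenated term such as "t-test" is left alone.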

Word Cloud Generation

Word size is proportional to term frequency.
Words are colored with high-saturation rainbow hues (HSL) that stand out against the dark background.
When a mask is used, the word cloud is rendered with a transparent (RGBA) background and alpha-composited onto a two-tone canvas (dark silhouette + white exterior), producing clean edges without contour artifacts.
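
The rainbow coloring described above could be implemented as a color function in the style the wordcloud library accepts through its `color_func` argument. The sketch below is an assumption about how such a function might look, not the one used in make_wordcloud.py. The name `rainbow_color_func` and the saturation and lightness values (100% and 60%) are hypothetical.

```python
import random


def rainbow_color_func(word, font_size, position, orientation,
                       random_state=None, **kwargs):
    """Return a random fully saturated hue as an 'hsl(h, s%, l%)' string."""
    # wordcloud passes a random.Random instance; fall back to the module otherwise.
    rng = random_state if random_state is not None else random
    hue = rng.randint(0, 359)
    return f"hsl({hue}, 100%, 60%)"


# Called directly here; a seeded generator makes the result repeatable.
color = rainbow_color_func("connectivity", 40, (0, 0), None,
                           random_state=random.Random(0))
print(color)
```

Because the hue is drawn uniformly and the lightness is fixed above 50%, every word stays bright enough to read on a dark fill color.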

Authors

Natsuko Kashida (柏田夏子) — RIKEN Center for Biosystems Dynamics Research, Brain Connectomics Imaging Lab (BCIL)
Takuya Hayashi (林拓也) — RIKEN Center for Biosystems Dynamics Research, Brain Connectomics Imaging Lab (BCIL)

Citation

If you use this tool in your research, please cite:

JHBM Abstract Word Cloud Pipeline
Natsuko Kashida & Takuya Hayashi
RIKEN Center for Biosystems Dynamics Research, Brain Connectomics Imaging Lab
https://github.com/RIKEN-BCIL/jhbm-wordcloud

License

This project is licensed under the MIT License — see LICENSE for details.

Acknowledgments

SciSpaCy — Neumann et al., 2019
Helsinki-NLP/Opus-MT — Tiedemann & Thottingal, 2020
wordcloud — Andreas Mueller
Developed for the 28th Annual Meeting of the Japanese Society for Human Brain Mapping (JHBM2026)
