The sections below detail the steps taken to generate files and run scripts for this project.
These scripts are capable of populating the database with structured paper and figure information for future OCR runs.
This URL returns more than 90,000 figures from PMC articles matching "signaling pathways". Approximately 80% of these are actually pathway figures, which makes them a reasonably efficient source of sample figures for testing methods. Consider other search terms and other sources when scaling up.
http://www.ncbi.nlm.nih.gov/pmc/?term=signaling+pathway&report=imagesdocsum&dispmax=100
You can add publication dates to the query with additional terms. Note the use of a colon for date ranges.
https://www.ncbi.nlm.nih.gov/pmc/?term=signaling+pathway+AND+2018+[pdat]&report=imagesdocsum&dispmax=100
https://www.ncbi.nlm.nih.gov/pmc/?term=signaling+pathway+AND+2016+:+2018+[pdat]&report=imagesdocsum&dispmax=100
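As a minimal sketch of how these query URLs could be assembled in code, the helper below rebuilds the same URLs from a search term and an optional year or year range. The function name `pmc_figure_search_url` and its parameters are hypothetical and not part of this repository; only the URL layout is taken from the examples above.

```python
from urllib.parse import urlencode

PMC_BASE = "https://www.ncbi.nlm.nih.gov/pmc/"


def pmc_figure_search_url(term, start_year=None, end_year=None, per_page=100):
    """Build a PMC image-search URL (hypothetical helper, not in this repo)."""
    query = term
    if start_year is not None and end_year is not None:
        # A colon separates the two ends of a publication-date range.
        query += f" AND {start_year} : {end_year} [pdat]"
    elif start_year is not None:
        query += f" AND {start_year} [pdat]"
    params = {"term": query, "report": "imagesdocsum", "dispmax": per_page}
    # Leave '[', ']' and ':' unescaped so the result matches the URLs above.
    return PMC_BASE + "?" + urlencode(params, safe="[]:")


print(pmc_figure_search_url("signaling pathway", 2016, 2018))
```

Calling it with a single year instead of a range gives the `2018 [pdat]` form of the query.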
For sample sets you can simply save dozens of pages of results and quickly get 1000s of pathway figures. Consider automating this step when scaling up.
Save the raw HTML to a designated folder, e.g., pmc/20150501/rawhtml
Next, configure and run this PHP script to generate annotated sets of image and HTML files.
php pmc_image_parse.php
- depends on simple_html_dom.php
- outputs images as "PMC######__.
- outputs caption as "PMC######__..html
Consider loading caption information directly into the database and skipping the export of this HTML file.
These files are exported to a designated folder, e.g., pmc/20150501/images
Another manual step here increases the accuracy of downstream counts. Make a copy of the images dir, renaming it to images_pruned. View the extracted images (in Finder, for example) and delete the pairs of files associated with figures that are not actually pathways. In this first sample run, ~20% of images were pruned away. The most common non-pathway figures were of gel electrophoresis runs. To scale this step up, consider automated ways to either exclude gel figures or select only pathway images.
Before any of these steps, be sure you've entered the nix-shell:
nix-shell
Create database and load pmc and organism data. To get the data, see sections
gene2pubmed, pmc2pmid & organism2pubmed and
gene2pubtator & organism2pubtator below. Change database name, if desired, in
the sql files. Then run:
psql
\i database/create_tables.sql
\i database/load_data.sql
\q
Change dbname in get_pg_conn.py, if desired, then load figure data:
First time (update with your image dir):
./pfocr.py load_figures ../pmc/20181216/images/
After the first time, use this to copy everything:
sh ./copy_tables.sh
Or this to copy everything except the previously loaded figures:
sh copy_all_except_figures.sh
These scripts are capable of reading selected sets of figures from the database and performing individual runs of OCR.
- figures (filepath)
Exploration of settings to improve OCR by pre-processing of image:
convert test1.jpg -colorspace gray test1_gr.jpg
convert test1_gr.jpg -threshold 50% test1_gr_th.jpg
convert test1_gr_th.jpg -define connected-components:verbose=true -define connected-components:area-threshold=400 -connected-components 4 -auto-level -depth 8 test1_gr_th_cc.jpg
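If these three ImageMagick steps need to be applied to many figures, they could be driven from Python. The sketch below only assembles the same `convert` argument lists for a given input file. The helper names are hypothetical, and `run_preprocessing` assumes ImageMagick's `convert` is on the PATH.

```python
import subprocess


def preprocess_commands(src, stem):
    """Return the three convert invocations above as argv lists (hypothetical helper)."""
    gray = f"{stem}_gr.jpg"
    thresholded = f"{stem}_gr_th.jpg"
    cleaned = f"{stem}_gr_th_cc.jpg"
    return [
        # 1. Convert to grayscale.
        ["convert", src, "-colorspace", "gray", gray],
        # 2. Threshold at 50%.
        ["convert", gray, "-threshold", "50%", thresholded],
        # 3. Connected-components pass with an area threshold of 400.
        ["convert", thresholded,
         "-define", "connected-components:verbose=true",
         "-define", "connected-components:area-threshold=400",
         "-connected-components", "4", "-auto-level", "-depth", "8", cleaned],
    ]


def run_preprocessing(src, stem):
    """Run each step in order, stopping at the first failure."""
    for argv in preprocess_commands(src, stem):
        subprocess.run(argv, check=True)


for argv in preprocess_commands("test1.jpg", "test1"):
    print(" ".join(argv))
```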
- Set parameters
- 'LanguageCode':'en' - to restrict to English language characters
- Produce JSON files
Enter nix-shell:
nix-shell
Caution: if you don't specify a limit value, it'll run until the last figure. Default start value is 0.
./pfocr.py ocr gcv --preprocessor noop --start 1 --limit 20
Note: This command calls ocr_pmc.py at the end, passing along args and functions. The ocr_pmc.py script then:
- gets an ocr_processor_id corresponding to the unique hash of processing parameters
- retrieves all figure rows and steps through rows, starting with start
  - runs image pre-processing
  - performs OCR
  - populates ocr_processors__figures with ocr_processor_id, figure_id and result
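To illustrate the idea of keying an OCR run on a hash of its processing parameters, here is a minimal sketch. The actual hashing scheme used by ocr_pmc.py is not shown in this README, so the choice of SHA-256 over canonical JSON, the function name `processor_hash` and the field names are all assumptions.

```python
import hashlib
import json


def processor_hash(engine, preprocessor, parameters):
    """Hash one set of processing parameters (assumed scheme, not the repo's own)."""
    canonical = json.dumps(
        {"engine": engine, "preprocessor": preprocessor, "parameters": parameters},
        sort_keys=True,             # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


print(processor_hash("gcv", "noop", {"LanguageCode": "en"}))
```

Because the JSON is serialized with sorted keys, two runs with the same settings map to the same identifier regardless of the order in which the parameters were supplied.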
Example psql query to select words from result:
select substring(regexp_replace(ta->>'description', E'[\\n\\r]+',',','g'),1,45) as word from ocr_processors__figures opf, json_array_elements(opf.result::json->'textAnnotations') AS ta ;
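The same extraction can be sketched outside the database. The function below mirrors the query: it walks the textAnnotations array of a stored result, replaces runs of newlines with commas and truncates each description to 45 characters. The function name is hypothetical, and the sample JSON is made up for illustration.

```python
import json
import re


def words_from_result(result_json, max_len=45):
    """Mirror the psql query above on one stored OCR result (hypothetical helper)."""
    result = json.loads(result_json)
    words = []
    for annotation in result.get("textAnnotations", []):
        description = annotation.get("description", "")
        # Same regexp_replace as the query: newline runs become commas.
        flattened = re.sub(r"[\n\r]+", ",", description)
        words.append(flattened[:max_len])
    return words


# Made-up sample in the shape the query expects.
sample = json.dumps({"textAnnotations": [
    {"description": "WNT\nAKT1\n"},
    {"description": "WNT"},
    {"description": "AKT1"},
]})
print(words_from_result(sample))
```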
These scripts are capable of processing the results from one or more ocr runs previously stored in the database.
-n for normalizations
-m for mutations
Enter nix-shell:
nix-shell
bash run.sh
- Extracts words from JSON in ocr_processors__figures.result
- Applies transforms (see transforms/*.py)
- Populates words with unique occurrences of normalized words
- Populates match_attempts with all figure_id and word_id occurrences
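The relationship between the two tables can be sketched in memory. This is not the code in run.sh or transforms/*.py: the function name, the in-memory layout and the use of a single `transform` callable are assumptions made for illustration.

```python
def build_word_tables(figure_words, transform):
    """Build in-memory stand-ins for the words and match_attempts tables.

    figure_words maps figure_id -> list of raw OCR words (assumed layout).
    """
    words = {}            # normalized word -> word_id, one entry per unique word
    match_attempts = []   # every (figure_id, word_id) occurrence
    for figure_id, raw_words in figure_words.items():
        for raw in raw_words:
            normalized = transform(raw)
            if not normalized:
                continue  # skip words that normalize to nothing
            word_id = words.setdefault(normalized, len(words) + 1)
            match_attempts.append((figure_id, word_id))
    return words, match_attempts


words, attempts = build_word_tables({1: ["Wnt", "AKT1"], 2: ["WNT"]}, str.upper)
print(words)
print(attempts)
```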
- xrefs (id, xref)
- figures__xrefs (ocr_processor_id, figure_id, xref, symbol, unique_wp_hs, filepath)
Example psql query to rank order figures by unique xrefs:
select figure_id, count(unique_wp_hs) as unique from figures__xrefs where unique_wp_hs = TRUE group by figure_id order by 2 desc;
- Export a table view to file. Can only write to /tmp dir; then sftp to download.
copy (select * from figures__xrefs) to '/tmp/filename.csv' with csv;
or
copy (\i database/pubtator_gene_matches.sql) to '/tmp/filename.csv' with csv;
- Words extracted for a given paper:
select pmcid,figure_number,result from ocr_processors__figures join figures on figures.id=figure_id join papers on papers.id=figures.paper_id where pmcid='PMC2780819';
- All paper figures for a given word:
select pmcid, figure_number, word from match_attempts join words on words.id=word_id join figures on figures.id=figure_id join papers on papers.id=paper_id where word = 'AC' group by pmcid, figure_number,word;
- batches__ocr_processors (batch_id, ocr_processor_id)
- batches (timestamp, parameters, paper_count, figure_count, total_word_gross, total_word_unique, total_xrefs_gross, total_xrefs_unique)
Do not apply upper() or remove non-alphanumerics during lexicon construction. These normalizations will be applied in parallel to both the lexicon and extracted words during post-processing.
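A minimal sketch of why the lexicon is kept raw: the same normalization is applied to both sides at match time, so lexicon symbols and OCR words meet on a common key. The `normalize` and `match_words` names and the placeholder identifiers are hypothetical. The actual transforms live in transforms/*.py.

```python
import re


def normalize(symbol):
    """Uppercase, then drop everything except A-Z and 0-9."""
    return re.sub(r"[^A-Z0-9]", "", symbol.upper())


def match_words(lexicon, extracted_words):
    """Match OCR words to lexicon symbols after normalizing both sides.

    lexicon maps raw symbol -> identifier (placeholder ids in this example).
    """
    normalized_lexicon = {}
    for symbol, identifier in lexicon.items():
        normalized_lexicon.setdefault(normalize(symbol), set()).add(identifier)
    hits = {}
    for word in extracted_words:
        key = normalize(word)
        if key in normalized_lexicon:
            hits[word] = normalized_lexicon[key]
    return hits


print(match_words({"NF-kB": "id-1", "Akt1": "id-2"}, ["NFKB", "akt 1", "foo"]))
```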
- Download protein-coding-gene TXT file from http://www.genenames.org/cgi-bin/statistics
- Import TXT into Excel, first setting all columns to "skip" then explicitly choosing "text" for symbol, alias_symbol, prev_symbol and entrez_id columns during import wizard (to avoid date conversion of SEPT1, etc)
- Delete rows without entrez_id mappings
- In separate tabs, expand 'alias symbol' and 'prev symbol' lists into single-value rows, maintaining entrez_id mappings for each row. Used Data>Text to Columns>Other:|>Column types:Text. Delete empty rows. Collapse multiple columns by pasting entrez_id before each column, sorting and stacking.
- Filter each list for unique pairs (only affected alias and prev)
- For prev and alias, only keep symbols of 3 or more characters, using:
IF(LEN(B2)<3,"",B2)
- Enter these formulas into columns C and D, next to the sorted alias list, in order to "tag" all instances of symbols that match more than one entrez_id:
MATCH(B2,B3:B$###,0) and MATCH(B2,B$1:B1,0), where ### is the last row in the sheet.
- Then delete (ignore) all of these instances (i.e., rather than picking one arbitrarily via a unique function), using:
IF(AND(ISNA(C2),ISNA(D2)),A2,"") and IF(AND(ISNA(C2),ISNA(D2)),B2,"")
- Export as separate CSV files.
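The filtering logic in the spreadsheet steps above (unique pairs, a minimum length of 3 characters, and dropping every symbol that maps to more than one entrez_id) can be sketched in a few lines. This is an illustration of the rules, not a replacement for the documented Excel workflow. The function name and the sample pairs are made up.

```python
from collections import defaultdict


def clean_symbol_pairs(pairs, min_length=3):
    """Filter (entrez_id, symbol) pairs as described in the steps above."""
    # Unique pairs only, and drop symbols shorter than min_length.
    unique_pairs = {(entrez_id, symbol) for entrez_id, symbol in pairs
                    if len(symbol) >= min_length}
    # Collect every entrez_id each symbol maps to.
    ids_by_symbol = defaultdict(set)
    for entrez_id, symbol in unique_pairs:
        ids_by_symbol[symbol].add(entrez_id)
    # Drop all instances of ambiguous symbols rather than picking one.
    return sorted((entrez_id, symbol) for entrez_id, symbol in unique_pairs
                  if len(ids_by_symbol[symbol]) == 1)


# Made-up pairs for illustration.
pairs = [("1", "ABC"), ("1", "ABC"), ("2", "XY"), ("3", "DUP"), ("4", "DUP"), ("5", "GENE5")]
print(clean_symbol_pairs(pairs))
```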
- Start with this file from our fork of bioentities: https://raw.githubusercontent.com/wikipathways/bioentities/master/relations.csv. It captures complexes, generic symbols and gene families, e.g., "WNT" mapping to each of the WNT## entries.
- Import CSV into Excel, setting identifier columns to import as "text".
- Delete "isa" column. Add column names: type, symbol, type2, bioentities. Turn column filters on.
- Filter on 'type' and make separate tabs for rows with "BE" and "HGNC" values. Sort "be" tab by "symbol" (Column B).
- Add a column to "hgnc" tab based on =VLOOKUP(D2,be!B$2:D$116,3,FALSE). Copy/paste B and D into new tab and copy/paste-special B and E to append the list. Sort bioentities and remove rows with #N/A.
- Copy f_symbol tab (from hgnc protein-coding_gene workbook) and sort symbol column. Then add entrez_id column to bioentities via lookup on hgnc symbol using =LOOKUP(A2,n_symbol.csv!$B$2:$B$19177,n_symbol.csv!$A$2:$A$19177).
- Copy/paste-special columns of entrez_id and bioentities into new tab. Filter for unique pairs.
- Export as CSV file.
- Download human GMT from http://data.wikipathways.org/current/gmt/
- Import GMT file into Excel
- Select complete matrix and name 'matrix' (upper left text field)
- Insert a column and paste this into A1
- =OFFSET(matrix,TRUNC((ROW()-ROW($A$1))/COLUMNS(matrix)),MOD(ROW()-ROW($A$1),COLUMNS(matrix)),1,1)
- Copy equation down to bottom of sheet, e.g., at least to =ROWS(matrix)*COLUMNS(matrix)
- Filter out '0', then filter for unique
- Export as CSV file.
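The flatten-and-deduplicate steps above could also be sketched in code. This version assumes the common GMT layout (tab-delimited, set name in the first column, a description in the second, members from the third column on) and collects only the member columns, whereas the spreadsheet approach flattens the whole matrix. The function name and sample rows are made up.

```python
def unique_gmt_members(lines):
    """Collect the unique member ids from GMT rows (assumed layout, see above)."""
    members = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the name and description columns; ignore empty cells.
        members.update(field for field in fields[2:] if field)
    return sorted(members)


# Made-up rows for illustration.
rows = ["P1\tdesc\t10\t20\n", "P2\tdesc\t20\t30\t\n"]
print(unique_gmt_members(rows))
```

To run it on a downloaded file, pass an open file handle, since iterating over a text file yields its lines.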
Taxonomy names file (names.dmp):
- tax_id -- the id of node associated with this name
- name_txt -- name itself
- unique name -- the unique variant of this name if name not unique
- name class -- (synonym, common name, ...)
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz names.dmp
sed -r 's/\|\t(Microtetraspora parvosata subsp. kistnae)"\t\|/|\t"\1"\t|/g' names.dmp | sed -r 's/\t\|$//g' | sed -r 's/\t\|\t/\t/g' > organism_names.tsv
rm names.dmp taxdump.tar.gz
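For reference, the field-splitting part of the sed pipeline above (dropping the trailing tab-pipe and turning each tab-pipe-tab delimiter into a column break) can be sketched as follows. The function name is hypothetical, and the special-case quote fix from the first sed expression is not reproduced here.

```python
def parse_names_line(line):
    """Split one names.dmp row into its four fields."""
    body = line.rstrip("\n")
    if body.endswith("\t|"):
        body = body[:-2]          # same as: sed 's/\t\|$//'
    return body.split("\t|\t")    # same as: sed 's/\t\|\t/\t/g', then split on tabs


tax_id, name_txt, unique_name, name_class = parse_names_line(
    "1\t|\tall\t|\t\t|\tsynonym\t|\n"
)
print(tax_id, name_txt, repr(unique_name), name_class)
```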
wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz
gunzip gene2pubmed.gz
mv gene2pubmed gene2pubmed.tsv
head -n 1 gene2pubmed.tsv | cut -f 1,3 > organism2pubmed.tsv
tail -n +2 gene2pubmed.tsv | cut -f 1,3 | sort -u >> organism2pubmed.tsv
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz
gunzip PMC-ids.csv.gz
# There is a weird section that has a newline in the middle of what should be
# one row (lines 5286267 and 5286268):
# Transbound Emerg Dis,1865-1674,1865-1682,2017,65,Suppl.
# 1,199,10.1111/tbed.12682,PMC6190748,28984428,,live^M
# It appears that most rows are delimited by Windows \r\n,
# but that one line ends with just \n.
# Here's a fix:
tr -d '\n' < PMC-ids.csv | sed -e "s/\r/\n/g" > PMC-ids.unix.csv
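The logic of that fix is easy to check on a small sample: deleting every \n first joins the broken row, and converting the remaining \r characters to \n then restores one row per line. A sketch with a hypothetical function name and made-up sample bytes:

```python
def fix_line_endings(data: bytes) -> bytes:
    """Equivalent of: tr -d '\\n' | sed -e 's/\\r/\\n/g'."""
    # Drop all LF bytes, including the stray one in the middle of a row,
    # then turn each remaining CR into a row-ending LF.
    return data.replace(b"\n", b"").replace(b"\r", b"\n")


# Made-up sample: two CRLF-terminated rows, the second split by a bare LF.
sample = b"a,b\r\nc,Suppl.\n1,d\r\n"
print(fix_line_endings(sample))
```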
wget ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/gene2pubtator.gz
gunzip gene2pubtator.gz
# There can be multiple genes per row. Reshape wide -> long.
awk -F '\t' -v OFS='\t' '{split($2,a,/,|;/); for(i in a) print $1,a[i],$3,$4}' gene2pubtator > gene2pubtator.tsv
head -n 1 gene2pubtator.tsv | cut -f 1,2 > gene2pubtator_uniq.tsv
tail -n +2 gene2pubtator.tsv | cut -f 1,2 | sort -u >> gene2pubtator_uniq.tsv
rm gene2pubtator
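The wide-to-long reshape done by the awk command can be sketched as below: the second column is split on commas and semicolons, and one output row is emitted per value. The function name is hypothetical. Unlike awk's `for (i in a)`, whose iteration order is unspecified, this version keeps the values in their original order.

```python
import re


def reshape_long(rows):
    """Yield one tab-delimited row per id found in the second column."""
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        fields += [""] * (4 - len(fields))  # awk prints empty strings for missing fields
        for value in re.split(r"[,;]", fields[1]):
            yield "\t".join([fields[0], value, fields[2], fields[3]])


# Made-up row for illustration.
for line in reshape_long(["123\t1,2;3\tfoo\tbar\n"]):
    print(line)
```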
wget ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/species2pubtator.gz
gunzip species2pubtator.gz
# There are some incorrect PMIDs. The first sed is needed to fix those.
# The second sed removes leading zeros from pmids.
# There can be multiple organisms per row. Reshape wide -> long with awk.
sed -E 's/^2[0-9]*?(27[0-9]{6}\t)/\1/g' species2pubtator |\
sed -E 's/^0*//g' |\
awk -F '\t' -v OFS='\t' '{split($2,a,/,|;/); for(i in a) print $1,a[i],$3,$4}' > organism2pubtator.tsv
head -n 1 organism2pubtator.tsv | cut -f 1,2 > organism2pubtator_uniq.tsv
tail -n +2 organism2pubtator.tsv | cut -f 1,2 | sort -u >> organism2pubtator_uniq.tsv
rm species2pubtator
See database/README.md