-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathparentage_report.pl
More file actions
executable file
·270 lines (232 loc) · 9.58 KB
/
parentage_report.pl
File metadata and controls
executable file
·270 lines (232 loc) · 9.58 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
#!/usr/bin/env perl
# Program: parentage_report.pl
# Program description: see usage message.
# Steven Cannon 2024
use feature say;
use warnings;
use strict;
use Getopt::Long;
use JSON;
use FindBin qw($Bin);
use lib "$Bin";
my ($query, $verbose, $table_out, $text_out, $outfile, $help, $pretty);
my $format = "string";
my $line_ct = 0;
my $max_count = 999;
my $parents = "data/parentage.tsv";
my $synonyms = "data/parentage-synonyms.tsv";
my $comments = "data/parentage-comments.tsv";
my $plist;
my $usage = <<EOS;
Usage:
parentage_report.pl -query ID [-options]
Three data files are required, and an optional fourth ensures a much faster run.
If the data files named as follows and available in a directory "data" at
the same location as the script, the script will use these file names and locations
by default, so they don't need to be specified explicitly:
-parentage data/parentage.tsv
-synonyms data/parentage-synonyms.tsv
-comments data/parentage-comments.tsv
-plist data/parentage-list.tsv # Optional but recommended
Given the requried input data, generate a report about an individual, including the pedigree,
any aliases/synonyms for the line, the lines which have the individual in their pedigree,
and any available comments about the individual.
In the invocation without -plist, the parentage.tsv file is taken in as data for calculating
pedigrees for all lines, and then the query is checked against those pedigrees to find which lines
contain the query individual in the pedigree. This option is space-efficient (parentage.tsv is small)
but relatively time-consuming to run (it takes several seconds to recalculate all pedigrees).
In the invocation WITH -plist, the parentage-list.tsv file has, for each individual, the lines in the
pedigrees of that individual. The query is checked against each of those lists to find which lines
contain the query individual in the pedigree. This option is relatively space-inefficient
(the parentage-list.tsv file may be several megabytes) but fast to run.
The parentage-list.tsv can be generated by the script parentage.pl:
./parentage.pl -par data/parentage.tsv -f list -outfile data/parentage-list.tsv
To generate a standard report in JSON format
./parentage_report.pl -plist data/parentage-list.tsv -q Hardin
To generate just the three-column pedigree table suitable for submitting to the Helium viewer:
./parentage_report.pl -plist data/parentage-list.tsv -table -q Hardin
Some other lines to try, to check various characteristics of the data:
Hardin, Hayes, Hamlin, Gnome, Franklin, Flyer, Flambeau, Williams, "Williams 82", Lee
Required:
-query ID of an individual for which to generate a report
Required, with defaults indicated above:
-parents File with three columns: individuals and parents individuals and the parents;
-synonyms File with two columns: individual and synonym (if multiple synonyms, one line for each);
-comments File with two columns: individual and comments
Options:
-plist Tab-separated file with individual (first column) and all progenitors for that individual
-text_out Print a plain-text report to STDOUT; otherwise to JSON (default)
-pretty For JSON output, format for human viewing (with line returns, indentation)
-table_out Print pedigree table, suitable for submitting to the Helium viewer, to STDOUT.
With this option, the other output is not reported (neither JSON nor test_out)
-max_count The maximum number of individuals in the pedigree to report.
-verbose Report some intermediate information.
-help This message.
EOS
GetOptions(
'query=s' => \$query, # required
'parents:s' => \$parents, # required but with default value
'synonyms:s' => \$synonyms, # required but with default value
'comments:s' => \$comments, # required but with default value
'plist:s' => \$plist,
'text_out' => \$text_out,
'table_out' => \$table_out,
'pretty' => \$pretty,
'outfile:s' => \$outfile,
'max_count:i' => \$max_count,
'v|verbose' => \$verbose,
'h|help' => \$help,
);
unless (-e $parents && -e $synonyms && -e $comments ){ die "$usage" }
unless ( $query ){ die "$usage" }
if ($text_out && $table_out){
die "The options -text_out and -table_out are mutually exclusive. Please specify at most one.\n";
}
open (my $PAR_FH, "<", $parents) or die "Can't open in parents: $parents $!\n";
open (my $SYN_FH, "<", $synonyms) or die "Can't open in synonyms: $synonyms $!\n";
open (my $COM_FH, "<", $comments) or die "Can't open in comments: $comments $!\n";
my $PLIST_FH;
my @progenitors;
my %PED_HSH;
my @matches;
if ($plist){
open ($PLIST_FH, "<", $plist) or die "Can't open in plist: $plist $!\n";
while (<$PLIST_FH>){
chomp;
my ($ind, @progenitors) = split(/\t+/, $_);
if ( grep { $query eq $_ } @progenitors ){
push @matches, $ind;
}
}
}
else {
# Use parentage.pl to calculate lists of strains in the pedigree of each individual. The serialized structure
# is a hash of arrays, with the hash key being the individual and the strains being the array values:
# { indivd [strain1 strain2 strain3] }
my @args1 = ( "-parents", "$parents", "-format", "list" );
#say "perl $Bin/parentage.pl @args1";
my $serialized_result1 = `perl "$Bin/parentage.pl" @args1`;
my @ped_ary = split(/\n/, $serialized_result1);
for my $ped_line (@ped_ary){
my ($ind, @ped_list) = split(/\t/, $ped_line);
$PED_HSH{ $ind } = [ @ped_list ];
if ( grep { $query eq $_ } @ped_list ){
push @matches, $ind;
}
}
}
# Check if query is in the parentage file
my %seen_query;
while (<$PAR_FH>){
chomp;
next if (/^#/ || /^Strain/);
my $line = $_;
my ($ind, $p1, $p2) = split(/\t+/, $line);
$seen_query{$ind}++;
}
unless ($seen_query{$query}){
my $error_no_q = "Error: Query $query was not found in the parentage file $parents";
die $error_no_q;
}
my @synonyms;
while (<$SYN_FH>){
chomp;
my $line = $_;
my ($ind, $alt) = split(/\t+/, $_);
if ($query eq $ind){ push @synonyms, $alt }
}
my @q_comments;
while (<$COM_FH>){
chomp;
my $line = $_;
my ($ind, $comment) = split(/\t+/, $_);
if ($query eq $ind){ push @q_comments, $comment }
}
# Generate parentage result for the given query
my @args2 = ( "-parents", "$parents", "-query", "\"$query\"", "-max", "$max_count" );
#say "perl $Bin/parentage.pl @args2";
my $serialized_ped_constr = `perl "$Bin/parentage.pl" @args2`;
my $serialized_ped_table;
if ($serialized_ped_constr){ # Report the following only if a pedigree is returned.
# Call parentage.pl to produce tabular output for https://helium.hutton.ac.uk/#/pedigree
my @args3;
if ($outfile){
@args3 = ( "-parents", "$parents", "-query", "\"$query\"", "-format", "table0", "-outfile", "$outfile" );
}
else { # print table to STDOUT -- either in JSON object or plain-text
@args3 = ( "-parents", "$parents", "-query", "\"$query\"", "-format", "table0" );
}
#say "perl $Bin/parentage.pl @args3";
$serialized_ped_table = `perl "$Bin/parentage.pl" @args3`;
if ($table_out){
print "$serialized_ped_table";
exit;
}
}
else {
say "No pedigree is available for this individual.";
}
my %ped_components; # data structure to hold all components of the report
my @qry_ref;
$qry_ref[0] = $query; # Store query in an array, for consistency with all other data in %ped_components
$ped_components{"query"} = \@qry_ref;
if ( $serialized_ped_constr ){
my @ped_constr_ary = split("\n", $serialized_ped_constr);
$ped_components{"construction"} = \@ped_constr_ary;
}
else { $ped_components{"construction"} = "NULL" }
if ( $serialized_ped_table ){
my @ped_table_ary = split("\n", $serialized_ped_table);
# Split each line into a three element array: individual parent1 paren2
my @ped_table_AoA;
for my $line ( @ped_table_ary ){
my @individ_and_parents = split("\t", $line);
push @ped_table_AoA, \@individ_and_parents;
}
$ped_components{"table"} = \@ped_table_AoA;
}
else { $ped_components{"table"} = "NULL" }
if ( @matches ){ $ped_components{"matches"} = \@matches }
else { $ped_components{"matches"} = "NULL" }
if ( @synonyms ){ $ped_components{"synonyms"} = \@synonyms }
else { $ped_components{"synonyms"} = "NULL" }
if ( @q_comments ){ $ped_components{"comments"} = \@q_comments }
else { $ped_components{"comments"} = "NULL" }
my $json = JSON->new->allow_nonref;
my $json_text = $json->encode( \%ped_components );
my $JSON;
if ($pretty){
$JSON = $json->pretty->encode( \%ped_components );
}
else {
$JSON = $json->encode( \%ped_components );
}
if ($text_out){ # If -table_out, we printed previously and exited.
my $hsh_ref = decode_json $JSON;
for my $key (sort keys %$hsh_ref){
say "$key:";
if ($hsh_ref->{$key} eq "NULL"){ say " NULL" }
else {
foreach my $item (@{$hsh_ref->{$key}}) {
if ($key =~ /table/){ # Table needs to be further split for display.
say " ", join ("\t", @{$item});
}
else {
say " $item";
}
}
}
say "";
}
}
else {
say $JSON;
}
__END__
Versions
2024-11-01 Initial version
2024-11-03 Calculate pedigrees from parentage table using parentage.pl, rather than take in as a precalculated file
2024-11-04 Fix call to parentage.pl to permit query with spaces. Also generate a text file that can be uploaded to Helium.
2024-11-08 Set defaults for required data files. For increased speed, optionally take in -plist data/parentage-list.tsv
2024-11-19 Encode all data in a JSON object
2024-11-20 Add pretty-print option for JSON, and option -table to print three-column pedigree table to STDOUT