|
2 | 2 |
|
3 | 3 | ### Operational Routine for the Ingest and Output of Networks |
4 | 4 |
|
5 | | -This package takes data sets from various sources and converts them into Knowledge Graphs. |
| 5 | +ORION ingests data from knowledge sources and converts them into [Biolink Model](https://biolink.github.io/biolink-model/) knowledge graphs in [KGX](https://github.com/biolink/kgx) format. |
6 | 6 |
|
7 | | -Each data source will go through the following pipeline before it can be included in a graph: |
| 7 | +Each data source goes through the following pipeline: |
8 | 8 |
|
9 | | -1. Fetch (retrieve an original data source) |
10 | | -2. Parse (convert the data source into KGX files) |
11 | | -3. Normalize (use normalization services to convert identifiers and ontology terms to preferred synonyms) |
12 | | -4. Supplement (add supplementary knowledge specific to that source) |
| 9 | +1. **Fetch** - retrieve the original data source |
| 10 | +2. **Parse** - transform the data into KGX files |
| 11 | +3. **Normalize** - use normalization services to convert identifiers and ontology terms to preferred synonyms |
| 12 | +4. **Supplement** - add supplementary knowledge specific to that source |
13 | 13 |
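As a rough illustration of how the four stages chain together, here is a minimal sketch in plain Python. It is not ORION's implementation: the stage functions, the `run_pipeline` helper, the `EXAMPLE:`/`PREFERRED:` identifiers, and the in-memory state dictionary are all hypothetical stand-ins chosen to show the order of operations.

```python
from typing import Callable

# Hypothetical types: each stage takes the pipeline state and returns a new one.
State = dict
Stage = Callable[[State], State]


def fetch(state: State) -> State:
    # Stand-in for retrieving the original data source (made-up records).
    return {**state, "raw": ["EXAMPLE:a\trelated_to\tEXAMPLE:b"]}


def parse(state: State) -> State:
    # Stand-in for transforming raw records into edge dictionaries.
    edges = []
    for line in state["raw"]:
        subject, predicate, obj = line.split("\t")
        edges.append({"subject": subject, "predicate": predicate, "object": obj})
    return {**state, "edges": edges}


def normalize(state: State) -> State:
    # Stand-in for a normalization service: map identifiers to preferred synonyms.
    preferred = {"EXAMPLE:a": "PREFERRED:1", "EXAMPLE:b": "PREFERRED:2"}
    edges = [
        {
            **edge,
            "subject": preferred.get(edge["subject"], edge["subject"]),
            "object": preferred.get(edge["object"], edge["object"]),
        }
        for edge in state["edges"]
    ]
    return {**state, "edges": edges}


def supplement(state: State) -> State:
    # Stand-in for adding supplementary knowledge specific to the source.
    extra = {"subject": "PREFERRED:1", "predicate": "related_to", "object": "PREFERRED:3"}
    return {**state, "edges": state["edges"] + [extra]}


def run_pipeline(source_id: str, stages: list[Stage]) -> State:
    # Apply the stages in order, passing the state from one to the next.
    state: State = {"source_id": source_id}
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline("ExampleSource", [fetch, parse, normalize, supplement])
```

The point of the sketch is only the ordering: normalization runs on parsed output, and supplementation runs on normalized identifiers.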
|
14 | | -To build a graph use a Graph Spec yaml file to specify the sources you want. Some examples live in `graph_specs` folder. |
| 14 | +Sources are defined in a Graph Spec yaml file (see examples in the `graph_specs/` directory). ORION automatically runs each specified source through the pipeline and merges them into a Knowledge Graph. |
15 | 15 |
|
16 | | -ORION will automatically run each data source specified through the necessary pipeline. Then it will merge the specified sources into a Knowledge Graph. |
| 16 | +### Installation |
17 | 17 |
|
18 | | -### Installing and Configuring ORION |
| 18 | +ORION requires [uv](https://docs.astral.sh/uv/) for dependency management. |
19 | 19 |
|
20 | | -Create a parent directory: |
21 | | - |
22 | | -``` |
23 | | -mkdir ~/ORION_root |
24 | | -``` |
25 | | - |
26 | | -Clone the code repository: |
27 | | - |
28 | | -``` |
29 | | -cd ~/ORION_root |
| 20 | +```bash |
30 | 21 | git clone https://github.com/RobokopU24/ORION.git |
| 22 | +cd ORION |
| 23 | +uv sync --extra robokop |
31 | 24 | ``` |
32 | 25 |
|
33 | | -Next create directories where data sources, graphs, and logs will be stored. |
34 | | - |
35 | | -**ORION_STORAGE** - for storing data sources |
| 26 | +The core library is also available on PyPI (`pip install robokop-orion`), but the full repository is needed to use the ingest modules from the [ROBOKOP](https://robokop.renci.org/) project.
36 | 27 |
|
37 | | -**ORION_GRAPHS** - for storing knowledge graphs |
| 28 | +### CLI Commands |
38 | 29 |
|
39 | | -**ORION_LOGS** - for storing logs |
| 30 | +After installation, the following commands are available (prefix with `uv run` if not using a uv-managed shell): |
40 | 31 |
|
41 | | -You can do this manually, or use the script indicated below to set up a default workspace. |
| 32 | +| Command | Description | |
| 33 | +|---|-------------------------------------------------------| |
| 34 | +| `orion-build` | Build complete knowledge graphs from a Graph Spec | |
| 35 | +| `orion-ingest` | Run the ingest pipeline for individual data sources | |
| 36 | +| `orion-merge` | Merge KGX node/edge files | |
| 37 | +| `orion-meta-kg` | Generate MetaKG and test data files | |
| 38 | +| `orion-redundant-kg` | Generate edge files with redundant biolink predicates | |
| 39 | +| `orion-ac` | Generate AnswerCoalesce files | |
| 40 | +| `orion-neo4j-dump` | Generate Neo4j database dumps | |
| 41 | +| `orion-memgraph-dump` | Generate Memgraph database dumps | |
42 | 42 |
|
43 | | -Option 1: Use this script to create the directories and set the environment variables: |
| 43 | +### Configuring ORION |
44 | 44 |
|
45 | | -``` |
46 | | -cd ~/ORION_root/ORION/ |
47 | | -source ./set_up_test_env.sh |
48 | | -``` |
| 45 | +ORION uses three directories for its data, configured via environment variables: |
49 | 46 |
|
50 | | -Option 2: Create three directories and set environment variables specifying paths to the locations of those directories. |
| 47 | +| Variable | Purpose | |
| 48 | +|---|--------------------------------------| |
| 49 | +| `ORION_STORAGE` | Data ingest pipeline storage | |
| 50 | +| `ORION_GRAPHS` | Knowledge graph outputs | |
| 51 | +| `ORION_LOGS` | Log files | |
51 | 52 |
|
52 | | -``` |
53 | | -mkdir ~/ORION_root/storage/ |
54 | | -export ORION_STORAGE=~/ORION_root/storage/ |
| 53 | +You can set these up manually or use the provided script: |
55 | 54 |
|
56 | | -mkdir ~/ORION_root/graphs/ |
57 | | -export ORION_GRAPHS=~/ORION_root/graphs/ |
58 | | -
|
59 | | -mkdir ~/ORION_root/logs/ |
60 | | -export ORION_LOGS=~/ORION_root/logs/ |
| 55 | +```bash |
| 56 | +source ./set_up_test_env.sh |
61 | 57 | ``` |
62 | 58 |
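For the manual route, the following sketch creates the three directories under a chosen root and sets the matching environment variables. The `storage`, `graphs`, and `logs` subdirectory names follow the layout used in the earlier version of this README; the `set_up_workspace` function itself is a hypothetical helper, not part of ORION.

```python
import os
from pathlib import Path


def set_up_workspace(root: Path) -> dict[str, str]:
    """Create the three ORION directories under root and export their paths."""
    # Variable name -> subdirectory name (layout assumed from the earlier README).
    layout = {
        "ORION_STORAGE": "storage",
        "ORION_GRAPHS": "graphs",
        "ORION_LOGS": "logs",
    }
    paths = {}
    for variable, subdir in layout.items():
        directory = root / subdir
        directory.mkdir(parents=True, exist_ok=True)
        # Only affects this process and any child processes it starts.
        os.environ[variable] = str(directory)
        paths[variable] = str(directory)
    return paths
```

Because environment changes made in Python do not propagate back to the calling shell, this is useful mainly when ORION is launched from the same process, for example via `subprocess`. For an interactive shell, `export` the variables directly instead.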
|
63 | | -#### Specify Graph Spec file. |
64 | | - |
65 | | -Next create or select a Graph Spec yaml file, where the content of knowledge graphs to be built is specified. |
| 59 | +#### Graph Spec |
66 | 60 |
|
67 | | -Set either of the following environment variables, but not both: |
| 61 | +A Graph Spec yaml file defines which sources to include in a knowledge graph. Set one of the following environment variables (not both): |
68 | 62 |
|
69 | | -Option 1: ORION_GRAPH_SPEC - the name of a Graph Spec file located in the graph_specs directory of ORION |
70 | | - |
71 | | -``` |
| 63 | +```bash |
| 64 | +# Option 1: Name of a file in the graph_specs/ directory |
72 | 65 | export ORION_GRAPH_SPEC=example-graph-spec.yaml |
73 | | -``` |
74 | | - |
75 | | -Option 2: ORION_GRAPH_SPEC_URL - a URL pointing to a Graph Spec yaml file |
76 | 66 |
|
77 | | -``` |
| 67 | +# Option 2: URL pointing to a Graph Spec yaml file |
78 | 68 | export ORION_GRAPH_SPEC_URL=https://stars.renci.org/var/data_services/graph_specs/default-graph-spec.yaml |
79 | 69 | ``` |
80 | 70 |
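Since only one of the two variables may be set, a small pre-flight check can catch a misconfigured environment before a build starts. This is a sketch under that single assumption; `resolve_graph_spec` is a hypothetical helper, and ORION's own handling of these variables may differ.

```python
import os
from typing import Mapping, Optional


def resolve_graph_spec(env: Optional[Mapping[str, str]] = None) -> tuple[str, str]:
    """Return ("file", name) or ("url", url), requiring exactly one variable."""
    env = os.environ if env is None else env
    name = env.get("ORION_GRAPH_SPEC")
    url = env.get("ORION_GRAPH_SPEC_URL")
    if name and url:
        raise ValueError("Set ORION_GRAPH_SPEC or ORION_GRAPH_SPEC_URL, not both.")
    if not name and not url:
        raise ValueError("Set one of ORION_GRAPH_SPEC or ORION_GRAPH_SPEC_URL.")
    return ("file", name) if name else ("url", url)
```

Passing a plain dictionary instead of relying on `os.environ` keeps the check easy to test.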
|
81 | | -#### Building graph |
82 | | - |
83 | | -To build a custom graph, alter a Graph Spec file, which is composed of a list of graphs. |
84 | | - |
85 | | -For each graph, specify: |
| 71 | +Here is a simple Graph Spec example: |
86 | 72 |
|
87 | | -**graph_id** - a unique identifier string for the graph, with no spaces |
88 | | - |
89 | | -**sources** - a list of sources identifiers for data sources to include in the graph |
90 | | - |
91 | | -See the full list of data sources and their identifiers in the [data sources file](https://github.com/RobokopU24/ORION/blob/master/orion/data_sources.py). |
92 | | - |
93 | | -Here is a simple example. |
94 | | - |
95 | | -``` |
| 73 | +```yaml |
96 | 74 | graphs: |
97 | 75 | - graph_id: Example_Graph |
98 | 76 | graph_name: Example Graph |
99 | 77 | graph_description: A free text description of what is in the graph. |
100 | 78 | output_format: neo4j |
101 | 79 | sources: |
102 | | - - source_id: CTD |
| 80 | + - source_id: DrugCentral |
103 | 81 | - source_id: HGNC |
104 | 82 | ``` |
105 | 83 |
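A Graph Spec like the one above can be sanity-checked once it has been loaded into a dictionary (for instance with PyYAML's `yaml.safe_load`). The sketch below checks the rules stated in the earlier README text, that each `graph_id` is unique and contains no spaces, plus an assumed rule that every source entry carries a `source_id`. The `check_graph_spec` function is hypothetical, and ORION may enforce different or additional rules.

```python
def check_graph_spec(spec: dict) -> list[str]:
    """Return a list of human-readable problems found in a loaded Graph Spec."""
    problems = []
    seen_ids = set()
    for index, graph in enumerate(spec.get("graphs", [])):
        graph_id = graph.get("graph_id")
        if not graph_id:
            problems.append(f"graph #{index} is missing graph_id")
            continue
        graph_id = str(graph_id)
        # graph_id should be a single token with no whitespace.
        if any(ch.isspace() for ch in graph_id):
            problems.append(f"graph_id '{graph_id}' contains whitespace")
        # graph_id should be unique across the spec.
        if graph_id in seen_ids:
            problems.append(f"graph_id '{graph_id}' is duplicated")
        seen_ids.add(graph_id)
        # Assumed rule: every listed source needs a source_id.
        for source in graph.get("sources", []):
            if "source_id" not in source:
                problems.append(f"a source in '{graph_id}' is missing source_id")
    return problems


example_spec = {
    "graphs": [
        {
            "graph_id": "Example_Graph",
            "graph_name": "Example Graph",
            "output_format": "neo4j",
            "sources": [{"source_id": "DrugCentral"}, {"source_id": "HGNC"}],
        }
    ]
}
problems = check_graph_spec(example_spec)
```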
|
106 | | -There are variety of ways to further customize a knowledge graph. The following are parameters you can set for a particular data source. Mostly, these parameters are used to indicate that you'd like to use a previously built version of a data source or a specific normalization of a source. If you specify versions that are not the latest, and haven't previously built a data source or graph with those versions, it probably won't work. |
107 | | - |
108 | | -**source_version** - the version of the data source, as determined by ORION |
109 | | - |
110 | | -**parsing_version** - the version of the parsing code in ORION for this source |
111 | | - |
112 | | -**merge_strategy** - used to specify alternative merge strategies |
113 | | - |
114 | | -The following are parameters you can set for the entire graph, or for an individual data source: |
115 | | - |
116 | | -**node_normalization_version** - the version of the node normalizer API (see: https://nodenormalization-sri.renci.org/openapi.json) |
117 | | - |
118 | | -**edge_normalization_version** - the version of biolink model used to normalize predicates and validate the KG |
| 84 | +See the full list of data sources and their identifiers in the [data sources file](https://github.com/RobokopU24/ORION/blob/master/orion/data_sources.py). |
119 | 85 |
|
120 | | -**strict_normalization** - True or False specifying whether to discard nodes, node types, and edges connected to those nodes when they fail to normalize |
| 86 | +#### Graph Spec Parameters |
121 | 87 |
|
122 | | -**conflation** - True or False flag specifying whether to conflate genes with proteins and chemicals with drugs |
| 88 | +The following parameters can be set per data source: |
123 | 89 |
|
124 | | -For example, we could customize the previous example: |
| 90 | +- **merge_strategy** - selects an alternative merge strategy for the source
| 91 | +- **strict_normalization** - whether to discard nodes that fail to normalize (true/false) |
| 92 | +- **conflation** - whether to conflate genes with proteins and chemicals with drugs (true/false) |
125 | 93 |
|
126 | | -``` |
127 | | -graphs: |
128 | | - - graph_id: Example_Graph |
129 | | - graph_name: Example Graph |
130 | | - graph_description: A free text description of what is in the graph. |
131 | | - output_format: neo4j |
132 | | - sources: |
133 | | - - source_id: CTD |
134 | | - - source_id: HGNC |
135 | | -``` |
| 94 | +The following can be set at the graph level: |
136 | 95 |
|
137 | | -See the `graph_specs` directory for more examples. |
| 96 | +- **add_edge_id** - whether to add unique identifiers to edges (true/false) |
| 97 | +- **edge_id_type** - the type of identifier to generate when `add_edge_id` is true (`uuid` or `orion`)
138 | 98 |
|
139 | | -### Running ORION |
| 99 | +See the `graph_specs/` directory for more examples. |
140 | 100 |
|
141 | | -Install Docker to create and run the necessary containers. |
| 101 | +### Running with Docker |
142 | 102 |
|
143 | | -Use the following command to build the necessary images. |
| 103 | +Build the image: |
144 | 104 |
|
145 | | -``` |
| 105 | +```bash |
146 | 106 | docker compose build |
147 | 107 | ``` |
148 | 108 |
|
149 | | -To build every graph in your Graph Spec use the following command. This runs `orion-build all` on the image. |
| 109 | +Build all graphs in the configured Graph Spec: |
150 | 110 |
|
151 | | -``` |
| 111 | +```bash |
152 | 112 | docker compose up |
153 | 113 | ``` |
154 | 114 |
|
155 | | -#### Building specific graphs |
156 | | - |
157 | | -To build an individual graph use `orion-build` with a graph_id from the Graph Spec. |
| 115 | +Build a specific graph: |
158 | 116 |
|
159 | | -Usage: `orion-build [-h] graph_id` |
160 | | -positional arguments: |
161 | | -`graph_id` : ID of the graph to build. Must match an ID from the configured Graph Spec. |
162 | | - |
163 | | -Example command to create a graph from a Graph Spec with graph_id: Example_Graph: |
164 | | - |
165 | | -``` |
| 117 | +```bash |
166 | 118 | docker compose run --rm orion orion-build Example_Graph |
167 | 119 | ``` |
168 | 120 |
|
169 | | -#### Run ORION Pipeline on a single data source. |
| 121 | +Run the ingest pipeline for a single data source: |
170 | 122 |
|
171 | | -To run the ORION pipeline for a single data source and transform it into KGX files, you can use `orion-load`. |
172 | | - |
173 | | -``` |
174 | | -optional arguments: |
175 | | - -h, --help : show this help message and exit |
176 | | - -t, --test_mode : Test mode will process a small sample version of the data. |
177 | | - -f, --fresh_start_mode : Fresh start mode will ignore previous states and overwrite previous data. |
178 | | - -l, --lenient_normalization : Lenient normalization mode will allow nodes that do not normalize to persist in the finalized kgx files. |
| 123 | +```bash |
| 124 | +docker compose run --rm orion orion-ingest DrugCentral |
179 | 125 | ``` |
180 | 126 |
|
181 | | -Example command to convert data source CTD to KGX files. |
| 127 | +See available data sources and options: |
182 | 128 |
|
183 | | -``` |
184 | | -docker compose run --rm orion orion-load CTD |
| 129 | +```bash |
| 130 | +docker compose run --rm orion orion-ingest -h |
185 | 131 | ``` |
186 | 132 |
|
187 | | -To see the available arguments and a list of supported data sources: |
| 133 | +### Development |
188 | 134 |
|
189 | | -``` |
190 | | -docker compose run --rm orion orion-load -h |
191 | | -``` |
| 135 | +Install dev dependencies with [uv](https://docs.astral.sh/uv/): |
192 | 136 |
|
193 | | -#### Testing and Troubleshooting |
| 137 | +```bash |
| 138 | +uv sync --extra robokop --group dev |
| 139 | +``` |
194 | 140 |
|
195 | | -If you are experiencing issues or errors you may want to run tests: |
| 141 | +Run tests: |
196 | 142 |
|
197 | | -``` |
198 | | -docker-compose run --rm orion pytest /ORION |
| 143 | +```bash |
| 144 | +uv run pytest tests/ |
199 | 145 | ``` |
200 | 146 |
|
201 | | -#### Contributing to ORION |
| 147 | +### Contributing |
202 | 148 |
|
203 | | -Contributions are welcome, see the [Contributer README](README-CONTRIBUTER.md). |
| 149 | +Contributions are welcome; see the [Contributor README](README-CONTRIBUTER.md).