Project Plan :)
Sorghum is one of the earliest domesticated food crops (Wendorf et al. 1992). The cereal is thought to have originated in West Africa, and it is one of the most important cereal crops today. Several studies have investigated the genome of the crop, most of them comparing wild and domesticated types (Mace et al. 2013, Lozano et al. 2021). Wild types are more diverse, while the domesticated variants show signs of selection (Mace et al. 2013). Allele frequencies are skewed towards certain features, many of which are known to benefit grain production, which suggests that the selection was artificial. Many of these features lie in regulatory sequences.
Several studies have investigated the linguistic groups of Africa, correlating their geographical position, migration patterns and genetic variation (Schlebusch & Jakobsson 2018). One of the methods used is Principal Component Analysis (PCA) of SNPs from different linguistic populations. The resulting representation of genetic diversity has been found to significantly resemble the geographic locations of the sampled populations (Gurdasani et al. 2015).
There are multiple proposed origins of agricultural practice in Africa. One proposed origin is East Africa (Gurdasani et al. 2015), where the practices were introduced through interaction with Eurasian populations, as shown by ancient admixture. The practices then spread south, as evidenced by lactase persistence variants of East African origin in southern African populations (Schlebusch et al. 2012). Another possible origin is the Bantu group, which expanded from West Africa eastwards and southwards 3000-4000 years ago (Patin et al. 2017) and had adopted agricultural practices before the expansion. This expansion may be particularly relevant to this project because of the suspected origin of Sorghum domestication.
Many studies investigate either Sorghum or human populations, but as far as the client and the project group could find, none correlates the two. This project will deliver a variant calling pipeline to the client. The pipeline can then be used to study correlations between genetic variation in crop plants and ancient human practices of domestication and seed exchange.
The large-scale goal is to assist the client in finding variation between the indigenous species of Sorghum. This variation should then be used to investigate how crop diversity is related to domestication by different linguistic groups of Africa and to their migration. Examining patterns of neutral as well as adaptive evolution in relation to agricultural practices could give insight into the interactions between crop species and human farming populations.
To assist the client, our group's primary goal is to implement a variant calling procedure as a pipeline. If time allows, the secondary goal is to perform a Principal Component Analysis (PCA) of the crop variants and correlate the results with geography and the linguistic groups of Africa. This may allow the group to compare the patterns found in the cereal with the patterns found in human studies, which may in turn allow us to hypothesise about agricultural practice patterns during the early stages of civilisation.
The client provided the project group with Illumina short-read sequence data from 152 Sorghum bicolor samples from seed banks. The samples are either cultivated or wild, and we have information about where each sample was collected and where it was sequenced. The samples originate from different geographic regions of Africa, for which the passport information is available.
The first step is to develop a preprocessing stage for the autosomal sequence data: quality control of the reads with FastQC (Andrews et al. 2019) or fastp (Chen et al. 2018), trimming if necessary, and finally mapping the reads to a reference genome. The standard tool for mapping, which is also recommended in the GATK4 best practices (Auwera & O’Connor 2020), is the Burrows-Wheeler Alignment tool (BWA) (Li & Durbin 2009). Some time may have to be devoted to learning how to set up Nextflow (Seqera Labs). Next comes the development of the variant calling pipeline using the GATK4 toolkit (Auwera & O’Connor 2020). GATK is a standardised toolkit for variant calling that has been used in several other studies on Sorghum. One of the post-processing steps is functional annotation of the SNPs. This can be done with the SnpEff tool (Cingolani et al.), which can annotate the VCF files produced by GATK4, or with another tool if a better one is found during the project. The tools to be used in this pipeline are available on UPPMAX.
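The order of the steps described above can be sketched as a small Python helper that only builds the command lines for one paired-end sample. This is an illustration under assumptions, not the delivered pipeline: the function name, file naming scheme and `results` directory are hypothetical, samtools is assumed for sorting and indexing (it is not named in this plan), and steps such as duplicate marking and joint genotyping are left out. In the actual project each command would live in a Nextflow process.

```python
def build_commands(sample, r1, r2, reference, outdir="results", threads=4):
    """Return shell commands for one paired-end sample, in execution order.

    All file names are hypothetical placeholders. The reference is assumed
    to be indexed already (bwa index, samtools faidx, sequence dictionary).
    """
    trimmed_r1 = f"{outdir}/{sample}_R1.trimmed.fastq.gz"
    trimmed_r2 = f"{outdir}/{sample}_R2.trimmed.fastq.gz"
    bam = f"{outdir}/{sample}.sorted.bam"
    gvcf = f"{outdir}/{sample}.g.vcf.gz"
    # GATK needs read-group information; "\\t" keeps a literal \t for bwa.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"

    return [
        # 1. Quality control and trimming of the raw reads.
        f"fastp -i {r1} -I {r2} -o {trimmed_r1} -O {trimmed_r2}",
        # 2. Map to the reference and sort the alignments by coordinate.
        f"bwa mem -t {threads} -R '{read_group}' {reference} "
        f"{trimmed_r1} {trimmed_r2} | samtools sort -o {bam} -",
        # 3. Index the sorted BAM so GATK can access it by region.
        f"samtools index {bam}",
        # 4. Call variants per sample in GVCF mode.
        f"gatk HaplotypeCaller -R {reference} -I {bam} -O {gvcf} -ERC GVCF",
    ]


if __name__ == "__main__":
    for command in build_commands("S1", "S1_R1.fastq.gz", "S1_R2.fastq.gz", "ref.fa"):
        print(command)
```

Returning the commands as strings instead of running them keeps the sketch usable on a machine where the tools are not installed, and makes the step order easy to review before it is translated into Nextflow.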
The group will then attempt some simple analyses using PCA and admixture analysis. These could be used to compare crop diversity with geographical data (coordinates) to find patterns of domestication. According to our client, this may require downsampling beforehand. The tools planned for this step are EIGENSOFT smartPCA (Harvard University) for the PCA, and Admixture (Alexander et al. 2009) and PONG (Behr et al. 2016) for the admixture analysis.
If further time is available, the project may also include running the variant calling pipeline on the chloroplast genome. This will only be done if the project is ahead of schedule by the 12th of December. The tools listed here are our preliminary choice for the pipeline and may change during the course of the project.
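To make the idea behind the planned PCA concrete, the following is a minimal NumPy sketch of PCA on a genotype matrix. It is not smartPCA and omits smartPCA's normalisation and outlier handling. The function name and the toy data are hypothetical, and genotypes are assumed to be coded as alternate-allele counts (0, 1 or 2) with one row per sample.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Project samples onto the top principal components.

    genotypes: array of shape (n_samples, n_snps) with allele counts 0/1/2.
    Returns (coordinates, explained_variance_ratio).
    """
    g = np.asarray(genotypes, dtype=float)
    # Monomorphic SNPs carry no information about structure; drop them.
    g = g[:, g.std(axis=0) > 0]
    # Centre each SNP so the components describe variation around the mean.
    centered = g - g.mean(axis=0)
    # SVD of the centred matrix gives the principal axes directly.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return coords, explained[:n_components]


if __name__ == "__main__":
    # Toy data: two groups of three samples, five SNPs (the last is monomorphic).
    toy = np.array([
        [0, 0, 1, 2, 0],
        [0, 1, 0, 2, 0],
        [0, 0, 0, 2, 0],
        [2, 2, 1, 0, 0],
        [2, 1, 2, 0, 0],
        [2, 2, 2, 0, 0],
    ])
    coords, explained = genotype_pca(toy)
    print(coords[:, 0])   # the sign of PC1 separates the two groups
    print(explained)
```

In the real analysis the PC coordinates of each sample would be plotted against its passport coordinates or linguistic region to look for the kind of correspondence reported in the human studies.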
At the end of the project, the group will deliver to the client the final pipeline, the report and the annotated variant call (VCF) files for the new sequences. The report will include the functional annotations, the PCA and admixture results, and their respective interpretations. These analyses will be done with the guidance of the client, and a hypothesis will be formulated based on the results.
The project will run for 10 weeks. Week 44 and half of week 45 will be spent working on the project plan. The project work starts on 9th November, after the revised project plan has been submitted and approved. The project work, with report structuring in parallel, is scheduled for weeks 45 to 51. The remaining weeks are dedicated solely to writing the report, assuming no additional time is needed to finish the project work. Time is also set aside to prepare for the scheduled presentations and the journal clubs. Meetings with the supervisor and the client will be held weekly, and each meeting will be scheduled at the end of the previous one to fit all of our schedules.
Since the group is small, having a single leader is redundant. The group believes it can reach decisions through discussion and, if needed, through voting. Roles for each meeting will be assigned during the meeting, but certain communication and documentation roles may be fixed (see Group Contract).
Weekly tasks for the group will be fixed and divided each week according to the workflow. To streamline this, each part of the project will have a responsible person (see the table below). This responsibility includes reading up on that part, and all potential sub-parts, before the group starts on it (according to the time plan), and making sure that the part is finished on time (according to the time plan). Each person will NOT be solely responsible for performing their respective parts. The group is supposed to work together, and the whole group is responsible for executing each part of the workflow. Presentations will be quality checked by Carl, while the preparation and the actual presentation will be done by all the group members.
| Processes | Responsibility |
|---|---|
| Preprocessing | Kaavya |
| Pipeline design (Nextflow) | Carl |
| Variant calling (GATK4) | Andreas |
| Post Processing | Kaavya |
| Admixture | Carl* |
| PCA Analysis | Carl* |
| Chloroplast | Andreas |
| Presentations | Carl |

\* Preliminary; depends on how far the project has progressed at that point.

Table 1: Responsible group members for different parts of the project.

The group will strive to document as much as possible during the project. Since the project includes coding, we will create a Git repository on GitHub with a wiki page (https://github.com/RoaringDragon/Fruitloop/wiki). Note that only the code and its documentation will be stored there, and no result or input data. This is partly because of the file sizes and partly at the request of the client. Meetings will be documented with meeting protocols stored in a folder on Google Drive that is accessible to all group members. Andreas will be responsible for keeping track of the protocols. The group will record its work hours in a work log.
| Stakeholder | Purpose | Form of communication | Frequency | Responsible |
|---|---|---|---|---|
| Group | Coordination & Work | | All times | Group |
| Group | Coordination & Work | Meetings | Alternating days | Group |
| Client | Insight & Advice | | As needed | Kaavya |
| Client | Insight & Advice | Meetings | Weekly | Group |
| Supervisor | Insight & Advice | | As needed | Carl |
| Supervisor | Insight & Advice | Meetings | Weekly | Group |
| The class | Peer review | Peer review & Seminars | According to schedule | Everyone |

| Risk | Probability (1-5) | Consequence (1-5) | Risk value (Prob*Con) | Proactive action | Responsible |
|---|---|---|---|---|---|
| Fail to reach primary goal | 1 | 5 | 5 | Work efficiently, Keep schedule | Group is responsible for keeping pace |
| UPPMAX maintenance | 3 | 2.5 | 7.5 | Plan around | University to inform, Group needs to pay attention to announcements |
| Sickness | 3 | 1.5 | 4.5 | Stay warm, Extra work after recovery | Self |
| Tools | 2.5 | 3 | 7.5 | Find alternative, Seek support | Group |
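
The risk values in the table follow from multiplying probability by consequence. As a small sketch, the snippet below recomputes them and ranks the risks from highest to lowest. The variable and function names are hypothetical; the numbers are copied from the table.

```python
# (risk, probability 1-5, consequence 1-5), copied from the risk table.
RISKS = [
    ("Fail to reach primary goal", 1, 5),
    ("UPPMAX maintenance", 3, 2.5),
    ("Sickness", 3, 1.5),
    ("Tools", 2.5, 3),
]


def risk_value(probability, consequence):
    """Risk value as defined in the table header: Prob * Con."""
    return probability * consequence


def rank_risks(risks):
    """Return (risk, value) pairs sorted from highest to lowest risk value."""
    scored = [(name, risk_value(p, c)) for name, p, c in risks]
    # sorted() is stable, so ties keep their order from the table.
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, value in rank_risks(RISKS):
        print(f"{value:>4}  {name}")
```

Ranked this way, UPPMAX maintenance and tool problems share the highest value and are the risks most worth planning around.
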
Gurdasani D, Carstensen T, Tekola-Ayele F, Pagani L, Tachmazidou I, Hatzikotoulas K, Karthikeyan S, Iles L, Pollard MO, Choudhury A, Ritchie GRS, Xue Y, Asimit J, Nsubuga RN, Young EH, Pomilla C, Kivinen K, Rockett K, Kamali A, Doumatey AP, Asiki G, Seeley J, Sisay-Joof F, Jallow M, Tollman S, Mekonnen E, Ekong R, Oljira T, Bradman N, Bojang K, Ramsay M, Adeyemo A, Bekele E, Motala A, Norris SA, Pirie F, Kaleebu P, Kwiatkowski D, Tyler-Smith C, Rotimi C, Zeggini E, Sandhu MS. 2015. The African Genome Variation Project shapes medical genetics in Africa. Nature 517: 327–332.
Mace ES, Tai S, Gilding EK, Li Y, Prentis PJ, Bian L, Campbell BC, Hu W, Innes DJ, Han X, Cruickshank A, Dai C, Frère C, Zhang H, Hunt CH, Wang X, Shatte T, Wang M, Su Z, Jun L, Lin X, Godwin ID, Jordan DR, Wang J. 2013. Whole-genome sequencing reveals untapped genetic potential in Africa’s indigenous cereal crop sorghum. WWW document 27 August 2013: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3759062/. Accessed 1 November 2022.
Patin E, Lopez M, Grollemund R, Verdu P, Harmant C, Quach H, Laval G, Perry GH, Barreiro LB, Froment A, Heyer E, Massougbodji A, Fortes-Lima C, Migot-Nabias F, Bellis G, Dugoujon J-M, Pereira JB, Fernandes V, Pereira L, Van der Veen L, Mouguiama-Daouda P, Bustamante CD, Hombert J-M, Quintana-Murci L. 2017. Dispersals and genetic adaptation of Bantu-speaking populations in Africa and North America. Science 356: 543–546.
Schlebusch CM, Jakobsson M. 2018. Tales of Human Migration, Admixture, and Selection in Africa. Annual Review of Genomics and Human Genetics 19: 405–428.
Schlebusch CM, Skoglund P, Sjödin P, Gattepaille LM, Hernandez D, Jay F, Li S, De Jongh M, Singleton A, Blum MGB, Soodyall H, Jakobsson M. 2012. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science (New York, NY) 338: 374–379.
Wendorf F, Close AE, Schild R, Wasylikowa K, Housley RA, Harlan JR, Królik H. 1992. Saharan exploitation of plants 8,000 years BP. Nature 359: 721–724.