Define data structures that would accommodate general purpose genetic workflows #15
Comments
BGEN
|
PLINK
|
Scikit-Allele
|
Hail
|
I'm a bit confused as to how copy number variants and ploidy are related. I would imagine coding the copy number as a distinct allele. For example, if at a certain locus across the population there are alleles with 1, 2, or 3 copies, then support for multiple alleles seems sufficient to capture this information. If these genomes are phased, then you can put a copy number on each copy of the chromosome, but the ploidy should not change, unless my understanding of the molecular biology of CNVs is off? Also, on the support for multiple alleles: are there any explicit carve-outs for the MHC locus? If we want to use this software for infectious disease susceptibility analyses, for example, that locus becomes critical, and is obviously highly polymorphic so likely stresses whatever support is available for multiple alleles at a locus. |
Good point. "Effective ploidy" is a term I first saw in the scikit-allele docs and does seem to be used elsewhere as being synonymous with copy number variation (1 2). I agree that thinking of a variant on a single chromosome as having some kind of variable ploidy is weird though. If I'm understanding the scikit-allele GenotypeAlleleCountsArray model correctly, I believe it is implying that, for example, a diploid organism could have two different alleles that are each copied some different number of times. In your example, are you saying that for 2 alleles and each of 1, 2 or 3 copies there would be 6 possible alleles (i.e. one per allele + copy number combination)? My guess would be that PLINK and scikit-allele use separate data structures for CNVs entirely, since the analysis they run on them requires the count numbers in some matrix form anyhow, so they simply never need to find a way to represent them in a multi-allelic encoding like you mentioned. Separately, this is probably worth a read at some point if it's not already on the paper queue: Whole-genome sequencing reveals high complexity of copy number variation at insecticide resistance loci in malaria mosquitoes |
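To make the two encodings discussed above concrete, here is a minimal sketch in plain Python. It does not call scikit-allele; the helper name `calls_to_allele_counts` and the example genotypes are assumptions made purely for illustration. It contrasts an allele-counts representation (where the counts at a site can sum to more than the nominal ploidy) with the "one composite allele per allele + copy number combination" idea.

```python
from collections import Counter
from itertools import product


def calls_to_allele_counts(calls, n_alleles):
    """Convert a list of allele indices (one per observed copy) into per-allele counts.

    Hypothetical helper for illustration only; not part of any library.
    """
    counts = Counter(calls)
    return [counts.get(allele, 0) for allele in range(n_alleles)]


# Ordinary diploid heterozygote at a site with 3 possible alleles
het = calls_to_allele_counts([0, 1], n_alleles=3)

# Same individual, but allele 1 is present in two copies (a duplication)
cnv = calls_to_allele_counts([0, 1, 1], n_alleles=3)

# The counts no longer sum to 2; this sum is what "effective ploidy" would refer to
effective_ploidy = sum(cnv)

# Alternative encoding: treat every (allele, copy number) pair as its own allele
composite_alleles = list(product(["A", "B"], [1, 2, 3]))

print(het, cnv, effective_ploidy, len(composite_alleles))
```

Under these assumptions, the allele-counts form keeps the number of alleles fixed and lets the per-sample total vary, while the composite form keeps ploidy fixed and grows the allele list (2 alleles x 3 copy numbers = 6 entries).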
I am imagining that the best starting point for a data structure is an N-d array with shape (variants, samples, allele, ploidy) and with values equal to either dosages, probabilities, or copy number (entirely equal to 1 in most cases), whichever is appropriate for the analysis. Let me know if my understanding of copy number representation in the last comment doesn't match with yours. I'm also planning to start on a prototype that defines this and some appropriate degenerate cases (e.g. haplotypes for diploid, bi-allelic hard calls) as Xarray subclasses. My plan is for the prototype to work like this, when it comes to data structures:

import xarray as xr
# Load genotype data, possibly from PLINK/bgen
arr = xr.DataArray(...)
# Build subclass object (this is where all preconditions will be applied)
gt = api.GeneticDataArray(arr)
# Do something within the API, which will not operate on raw Xarray classes
gt = api.ld_prune(gt)
# Do something outside the API
def my_custom_analysis(gt: xr.Dataset) -> xr.Dataset:
    ...  # Do some stuff that only operates on xarray structures
    return gt
gt = my_custom_analysis(gt)
# Convert back to API structures if necessary by invoking the appropriate constructor
gt = api.GeneticDataArray(gt)

Any thoughts/concerns? |
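As a rough illustration of the proposed (variants, samples, allele, ploidy) layout, the sketch below builds such an array with NumPy from made-up diploid, bi-allelic hard calls and then collapses it to the degenerate cases mentioned above. The input values and variable names are assumptions for the example; the prototype itself would presumably wrap the same data in an `xr.DataArray` with named dimensions.

```python
import numpy as np

# Hypothetical hard calls with shape (variants, samples, ploidy);
# each value is the index of the allele carried on that chromosome copy.
calls = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[0, 1], [1, 1], [0, 0]],
])
n_alleles = 2

# One-hot encode the allele index. Indexing the identity matrix yields
# shape (variants, samples, ploidy, allele), so move the allele axis forward
# to obtain (variants, samples, allele, ploidy).
gt = np.eye(n_alleles, dtype=np.int8)[calls]
gt = np.moveaxis(gt, -1, 2)

# Degenerate case 1: sum over ploidy to get allele counts per sample,
# shape (variants, samples, allele).
allele_counts = gt.sum(axis=3)

# Degenerate case 2: bi-allelic alternate-allele dosage, shape (variants, samples).
alt_dosage = allele_counts[:, :, 1]

print(gt.shape)
print(alt_dosage)
```

With hard calls every entry is 0 or 1, which matches the "entirely equal to 1 in most cases" remark; storing probabilities or copy numbers would only change the values and dtype, not the shape.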
I've got what I think is a pretty solid start on this up now in this notebook. The internals for it are in this folder. |
I refactored this nearly from scratch into data_stuctures.ipynb (internals). This version is much better, with some notable differences being:
|
I am working on genomes, and my background is in computer science. Please advise which dataset, or type of dataset, I should use, and with which tool (CRISPR or anything else). Thanks in advance. |
Closing and adding any more updates related to xarray prototyping in #5 |
It would be useful to document the capabilities of underlying data structures for existing genetic data file formats and processing platforms. This issue should serve as a place to collect findings on what existing data structures look like and what dimensions they support, without an emphasis on how they are implemented. At a minimum, this should cover:
Some questions worth asking for these are: