Process the clinical matrix to extract sample attributes #10
Comments
We would like to extract sample information for two purposes:
To begin building a
The more that I think about it, the more I am liking the idea of scraping the
I think having a service that describes the mutations across tissues/genders/age/etc. would be great, but we have to be careful not to reinvent the wheel here, since many other services already do this. See COSMIC, NCI GDC, Broad Firehose, or cBioPortal.
@gwaygenomics, provenance of
@dhimmel my keyboard!
A few questions. re: tissue dictionary re: mutations re: covariates
@ypar thanks for these questions!
The TCGA acronym is how they identify "tissue source site", but you're right, they're not strictly "tissues", and "diseases" would be more appropriate. E.g. LUAD is "lung adenocarcinoma" and LUSC is "lung squamous cell carcinoma". TCGA has adopted this broad terminology, however, and to stay consistent, so will we. Your point about tumor vs. normal is definitely something we should consider in the final model. We'll need to filter out "normal", which is really "adjacent normal": normal tissue from the same individual, taken in close proximity to the tumor during the actual tumor debulking surgery. We will also probably want to filter out "metastasis" and patients measured twice. Much of this sample curation is performed before the data is made public, but a lot is left in intentionally or sneaks past the filters. We can use a combination of the representative columns and official TCGA Barcodes to create an official sample list. For unsupervised feature construction, however, I think it is important to leave all the samples in!
Right now the mutation selector is as follows: the user selects a gene or genes, and cognoma builds a Y matrix of 1's and 0's corresponding to samples in the expression matrix (X), indicating presence or absence of mutation. I think this is the minimum case example, and we should focus on getting it implemented before we try to get fancy. How we define impactful mutations is another story. We will use the official mutation calls from the mutation matrix to determine if the sample has a mutation in a given input gene. Currently, the plan is to filter only
I am not sure how to handle covariates at the moment... I think some sort of adjustment should be discussed, but I don't know of the optimal solution. Right now I think it would be best to include performance of the model across different covariates in the results viewer.
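For illustration, the mutation selector described above could be sketched as follows. This is a minimal, dependency-free sketch and not cognoma's actual code: it assumes the mutation calls are available as a mapping from sample ID to the set of mutated genes, and the names `build_y`, `expression_samples`, and `mutations`, as well as the toy sample IDs, are hypothetical.

```python
def build_y(expression_samples, mutations, genes):
    """Return {sample_id: 0 or 1} for every sample in the expression matrix.

    A sample is labeled 1 if at least one of the selected genes is
    mutated in it, and 0 otherwise (including samples with no calls).
    """
    selected = set(genes)
    return {
        sample: int(bool(selected & mutations.get(sample, set())))
        for sample in expression_samples
    }


# Toy data (hypothetical sample IDs and mutation calls).
expression_samples = ["S1", "S2", "S3"]
mutations = {"S1": {"TP53", "KRAS"}, "S2": {"EGFR"}}

# Labels aligned to the rows of the expression matrix X.
y = build_y(expression_samples, mutations, ["TP53"])
```

Iterating over the expression samples, rather than over the mutation calls, keeps Y aligned with the rows of X even when a sample has no recorded mutations.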
@gwaygenomics in the case of multiple samples per individual are you sure we want to leave those in? I think some unsupervised approaches will assume independent observations.
That's not our current implementation. We ignore all code orange and code green mutations, based on a classification system developed by the Xena Browser team. See #2 (comment) for more information and a table of mutation counts by classification.
Good point. Yeah, we should remove those
IMO, it is particularly important for unsupervised methods to have the cleanest possible data, although one could argue that it is equally important for supervised methods.
Discussion on this issue has become off topic. So if we want to keep discussing issues that are not related to processing the
Keeps only samples with type equal to "Primary Tumor". This filters out multiple samples from the same patient, which could cause an issue for machine learning due to dependent observations (discussed in cognoma#10). This filter reduced the number of samples with expression and mutation from 7,705 to 7,306.
Closes cognoma#10: all variables in PANCAN_clinicalMatrix that could help with sample selection or covariates are extracted to `data/samples.tsv`. Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in cognoma#14.
Closes cognoma#17: only sample_ids with expression, mutation, and clinical data are output to `data/`.
* Extract sample info from PANCAN_clinicalMatrix
Keeps only samples with type equal to "Primary Tumor". This filters out multiple samples from the same patient, which could cause an issue for machine learning due to dependent observations (discussed in #10). This filter reduced the number of samples with expression and mutation from 7,705 to 7,306.
Closes #10: all variables in PANCAN_clinicalMatrix that could help with sample selection or covariates are extracted to `data/samples.tsv`. Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in #14.
Closes #17: only sample_ids with expression, mutation, and clinical data are output to `data/`.
* Retain primary blood cancers
Retain cancers whose type is "Primary Blood Derived Cancer - Peripheral Blood". See #20 (comment)
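The sample selection described in these commits can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the commits: it assumes the clinical matrix has been reduced to a mapping from sample ID to sample type, and the names `KEEP_TYPES`, `select_samples`, and the toy sample IDs are hypothetical.

```python
# Sample types retained, as named in the commit messages.
KEEP_TYPES = {
    "Primary Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
}


def select_samples(clinical, expression_ids, mutation_ids):
    """Return sorted sample IDs that have a retained sample type and are
    present in the clinical, expression, and mutation data.

    clinical: {sample_id: sample_type} (hypothetical reduced clinical matrix)
    """
    retained = {s for s, sample_type in clinical.items() if sample_type in KEEP_TYPES}
    return sorted(retained & set(expression_ids) & set(mutation_ids))


# Toy data (hypothetical sample IDs).
clinical = {
    "A": "Primary Tumor",
    "B": "Solid Tissue Normal",
    "C": "Primary Blood Derived Cancer - Peripheral Blood",
    "D": "Primary Tumor",
}
selected = select_samples(clinical, ["A", "B", "C"], ["A", "C", "D"])
```

Taking the set intersection last means a sample missing from any one of the three data sources is dropped, which matches the stated goal of outputting only sample_ids with expression, mutation, and clinical data.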
An issue has been raised in today's meeting.
The clinical matrix should be carefully analyzed to select a specific covariate or a set of covariates we can use for analyses.
The relevant notebook is the "tcga notebook for data download", and the dataset is named PANCAN-clinicalMatrix.