Current dataset seems to have an issue #99
The following returns
import pandas
url = 'https://github.com/cognoma/cancer-data/raw/383668e12a80ccbcc75a4930023aed16afbd208b/data/samples.tsv'
sample_df = pandas.read_table(url)
url = 'https://github.com/cognoma/cancer-data/raw/383668e12a80ccbcc75a4930023aed16afbd208b/data/mutation-matrix.tsv.bz2'
mutation_df = pandas.read_table(url)
all(mutation_df.sample_id == sample_df.sample_id)
Let me check the figshare: https://doi.org/10.6084/m9.figshare.3487685.v7.
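An element-wise comparison like the one above gives only a yes/no answer, and it fails with an error rather than a useful result when the two columns differ in length. A more diagnostic check lists which IDs appear in one table but not the other. This is a minimal sketch on made-up IDs; the `diff_sample_ids` helper and the toy frames are hypothetical and not part of any cognoma repository.

```python
import pandas


def diff_sample_ids(sample_df, mutation_df, column="sample_id"):
    """Return (missing_from_samples, missing_from_mutations) as sorted lists."""
    sample_ids = set(sample_df[column])
    mutation_ids = set(mutation_df[column])
    return sorted(mutation_ids - sample_ids), sorted(sample_ids - mutation_ids)


# Toy stand-ins for samples.tsv and mutation-matrix.tsv.bz2 (invented IDs)
sample_df = pandas.DataFrame({"sample_id": ["TCGA-02-0001-01", "TCGA-05-0002-01"]})
mutation_df = pandas.DataFrame({"sample_id": ["TCGA-02-0001-01", "TCGA-04-0003-01"]})

missing_from_samples, missing_from_mutations = diff_sample_ids(sample_df, mutation_df)
print("in mutations but not samples:", missing_from_samples)
print("in samples but not mutations:", missing_from_mutations)
```

Running the same helper on the real tables would show directly whether the mutation matrix carries rows with no matching sample.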
Checksums I compute locally for cognoma/cancer-data@383668e:
$ md5sum data/*.tsv.bz2 data/samples.tsv
32d561ba90aa6efd2cf3fc29125118fc data/expression-matrix.tsv.bz2
acd20fbc57ef4f61a515a89c09fb86a3 data/mutation-matrix.tsv.bz2
ef0b6dc40658f66934937e29c5749cca data/samples.tsv
According to the figshare API, these checksums match the v7 figshare deposits. See https://api.figshare.com/v2/articles/3487685/versions/7
So I think the
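The same checksum comparison can be scripted so it is repeatable. The sketch below computes MD5 digests with Python's standard `hashlib` and compares them against the values listed above. The `md5sum` and `verify` helpers are hypothetical names, and the demonstration runs on a temporary file so that it does not depend on a local checkout of cancer-data.

```python
import hashlib
import os
import tempfile

# Checksums reported above for cognoma/cancer-data@383668e
EXPECTED = {
    "data/expression-matrix.tsv.bz2": "32d561ba90aa6efd2cf3fc29125118fc",
    "data/mutation-matrix.tsv.bz2": "acd20fbc57ef4f61a515a89c09fb86a3",
    "data/samples.tsv": "ef0b6dc40658f66934937e29c5749cca",
}


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(expected):
    """Map each path to True or False depending on whether its MD5 matches."""
    return {path: md5sum(path) == checksum for path, checksum in expected.items()}


# Self-contained demonstration on a temporary file
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello\n")
    demo_path = handle.name
demo_digest = md5sum(demo_path)
demo_result = verify({demo_path: demo_digest})
os.remove(demo_path)
print(demo_digest, demo_result)
```

From the root of a cancer-data checkout at that commit, `verify(EXPECTED)` would report which files match.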
No matches from the following:
So I guess the question is why is core-service requesting an ID that doesn't exist? Perhaps
hmm, this is interesting. I am glad that core-services caught this, even if it wasn't supposed to. I did some digging into why. It looks like tumors with
Clinical Matrix all Primary Samples (that weren't redacted)
path = os.path.join('download', 'Survival_SupplementalTable_S1_20171025_xena_sp.tsv.gz')
# Mapping to rename and filter columns
renamer = collections.OrderedDict([
('sample', 'sample_id'),
('_PATIENT', 'patient_id'),
('cancer type abbreviation', 'acronym'),
('__placeholder__', 'disease'),
('age_at_initial_pathologic_diagnosis', 'age_diagnosed'),
('gender', 'gender'),
('race', 'race'),
('ajcc_pathologic_tumor_stage', 'ajcc_stage'),
('clinical_stage', 'clinical_stage'),
('histological_type', 'histological_type'),
('histological_grade', 'histological_grade'),
('initial_pathologic_dx_year', 'initial_pathologic_dx_year'),
('menopause_status', 'menopause_status'),
('birth_days_to', 'birth_days_to'),
('vital_status', 'vital_status'),
('tumor_status', 'tumor_status'),
('last_contact_days_to', 'last_contact_days_to'),
('death_days_to', 'death_days_to'),
('cause_of_death', 'cause_of_death'),
('new_tumor_event_type', 'new_tumor_event_type'),
('new_tumor_event_site', 'new_tumor_event_site'),
('new_tumor_event_site_other', 'new_tumor_event_site_other'),
('new_tumor_event_dx_days_to', 'days_recurrence_free'),
('treatment_outcome_first_course', 'treatment_outcome_first_course'),
('margin_status', 'margin_status'),
('residual_tumor', 'residual_tumor'),
('_EVENT', 'event_status'),
('_TIME_TO_EVENT', 'event_days'),
('OS', 'dead'),
('OS.time', 'days_survived'),
('DSS', 'disease_specific_survival_status'),
('DSS.time', 'disease_specific_survival_days'),
('DFI', 'disease_free_interval_status'),
('DFI.time', 'disease_free_interval_days'),
('PFI', 'progression_free_interval_status'),
('PFI.time', 'progression_free_interval_days'),
('Redaction', 'Redaction')
])
clinmat_df = (
pandas.read_table(path)
.rename(columns=renamer)
.merge(disease_df, how='left')
[list(renamer.values())]
.set_index('sample_id', drop=False)
.sort_values('sample_id')
)
# Fix capitalization of gender and race
clinmat_df.gender = clinmat_df.gender.str.title()
clinmat_df.race = clinmat_df.race.str.title()
# Extract sample-type with the code dictionary
clinmat_df = clinmat_df.assign(sample_type = clinmat_df.sample_id.str[-2:])
clinmat_df.sample_type = clinmat_df.sample_type.replace(sampletype_codes_dict)
# Remove "Redacted" tumors
# These patients either withdrew consent or had genome data mismatch errors
clinmat_df = clinmat_df.query('Redaction != "Redacted"').drop("Redaction", axis=1)
# Keep only these sample types
# filters duplicate samples per patient
sample_types = {
'Primary Solid Tumor',
'Primary Blood Derived Cancer - Peripheral Blood',
}
clinmat_df.query("sample_type in @sample_types", inplace=True)
Results in:
# Get acronym value counts
clinmat_df.acronym.value_counts()
BRCA 1092
GBM 588
UCEC 547
OV 537
KIRC 536
HNSC 528
LUAD 519
LGG 515
THCA 507
LUSC 504
PRAD 498
COAD 457
STAD 443
BLCA 409
LIHC 376
CESC 307
KIRP 291
SARC 261
LAML 200
ESCA 185
PAAD 185
PCPG 179
READ 166
TGCT 134
THYM 124
SKCM 108
ACC 92
MESO 87
UVM 80
KICH 66
UCS 57
DLBC 48
CHOL 36
Name: acronym, dtype: int64
Filtered Clinical Matrix (samples with mutation, RNAseq, and Clinical attributes)
sample_ids = sorted(clinmat_df.index & gene_mutation_mat_df.index & expr_df.index)
# Filter expression (x) and mutation (y) matrices for common samples
sample_df = clinmat_df.loc[sample_ids, :]
Results in:
sample_df.acronym.value_counts()
BRCA 787
LGG 510
LUAD 508
HNSC 499
PRAD 492
THCA 489
LUSC 477
UCEC 436
STAD 411
BLCA 403
KIRC 366
LIHC 357
COAD 287
CESC 286
KIRP 280
SARC 234
ESCA 183
PCPG 178
PAAD 168
GBM 149
TGCT 128
THYM 118
SKCM 103
READ 89
MESO 81
UVM 80
ACC 79
KICH 66
UCS 57
DLBC 37
CHOL 36
OV 14
Name: acronym, dtype: int64
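The two filtering steps above (keep primary sample types, then keep only samples present in the clinical, mutation, and expression tables) can be sketched end to end on toy data. Everything here is illustrative: the sample IDs are invented, `sampletype_codes` is only a small subset of the TCGA sample-type codes rather than the dictionary cancer-data uses, and `Index.intersection` stands in for the `&` operator, which recent pandas versions no longer treat as a set operation on an Index.

```python
import pandas

# Subset of TCGA sample-type codes (last two characters of the sample ID)
sampletype_codes = {
    "01": "Primary Solid Tumor",
    "03": "Primary Blood Derived Cancer - Peripheral Blood",
    "11": "Solid Tissue Normal",
}

# Toy clinical matrix with invented sample IDs
clinmat_df = pandas.DataFrame({
    "sample_id": ["TCGA-AA-0001-01", "TCGA-AA-0002-01", "TCGA-AA-0002-11",
                  "TCGA-BB-0003-01", "TCGA-BB-0004-01"],
    "acronym": ["OV", "OV", "OV", "BRCA", "BRCA"],
}).set_index("sample_id", drop=False)

# Decode the sample type and keep only primary samples
clinmat_df["sample_type"] = clinmat_df["sample_id"].str[-2:].replace(sampletype_codes)
keep_types = {"Primary Solid Tumor", "Primary Blood Derived Cancer - Peripheral Blood"}
clinmat_df = clinmat_df[clinmat_df["sample_type"].isin(keep_types)]

# Toy indexes standing in for the mutation and expression matrices
mutation_index = pandas.Index(["TCGA-AA-0001-01", "TCGA-BB-0003-01", "TCGA-BB-0004-01"])
expression_index = pandas.Index(["TCGA-AA-0001-01", "TCGA-AA-0002-01",
                                 "TCGA-BB-0003-01", "TCGA-BB-0004-01"])

# Samples present in all three tables
common_ids = sorted(
    clinmat_df.index.intersection(mutation_index).intersection(expression_index)
)
sample_df = clinmat_df.loc[common_ids]

# Per-acronym counts before and after the intersection
retention = pandas.DataFrame({
    "before": clinmat_df["acronym"].value_counts(),
    "after": sample_df["acronym"].value_counts(),
}).fillna(0).astype(int)
print(retention)
```

Putting the before and after counts side by side makes a drop like the OV one (537 to 14 in the real tables above) visible at a glance.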
Why does OV drop so many samples?
My thought was that many of these OV tumors were filtered because they didn't have observed mutations. Indeed, here are all of the samples that were filtered from the mutation set.
# Sorted list rather than a set, since recent pandas versions reject set indexers in .loc
sample_ids = sorted(set(clinmat_df.index) - set(gene_mutation_mat_df.index))
# Select the clinical rows for samples absent from the mutation matrix
sample_df = clinmat_df.loc[sample_ids, :]
sample_df.acronym.value_counts()
OV 475
BRCA 303
GBM 279
LAML 200
COAD 169
KIRC 168
UCEC 100
READ 77
SARC 25
LUSC 24
HNSC 21
CESC 18
THCA 16
LIHC 14
DLBC 11
KIRP 10
PAAD 10
LUAD 7
MESO 6
TGCT 6
LGG 5
STAD 5
PRAD 5
SKCM 4
THYM 2
BLCA 2
PCPG 1
ESCA 1
Name: acronym, dtype: int64
Besides decimating the ovarian cancer sample set, this step also removes many additional tumors. My thought now is that we remove tumors without any "red" mutations. Relaxing this guideline a bit would be an easy win and could recover many samples, but there are a few more things to consider here. I intend to follow up with this search in cognoma/cancer-data
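In the counts above, 475 of the 537 primary OV samples (about 88%) are absent from the mutation matrix. The fraction lost per cancer type can be computed directly, as in this sketch on toy data. The sample IDs and variable names are invented for illustration; `Index.difference` is used because it returns a sorted Index that `.loc` accepts.

```python
import pandas

# Toy clinical matrix indexed by invented sample IDs
clinmat_df = pandas.DataFrame(
    {"acronym": ["OV", "OV", "OV", "BRCA", "BRCA"]},
    index=pandas.Index(["S1", "S2", "S3", "S4", "S5"], name="sample_id"),
)
# Toy index standing in for the mutation matrix rows
mutation_index = pandas.Index(["S1", "S4", "S5"])

# Clinical samples with no row in the mutation matrix
dropped_ids = clinmat_df.index.difference(mutation_index)
dropped_df = clinmat_df.loc[dropped_ids]

# Fraction of each cancer type that is lost
total = clinmat_df["acronym"].value_counts()
dropped = dropped_df["acronym"].value_counts().reindex(total.index, fill_value=0)
fraction_dropped = (dropped / total).sort_values(ascending=False)
print(fraction_dropped)
```

Sorting by fraction rather than by raw count highlights the cancer types that a relaxed mutation filter would help most.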
I was wondering why our HGSC numbers were so low. Can you use the PanCancerAtlas criteria?
Ah well this explains it. I just reran the scripts without changing the locations of the data because I thought it had been updated in place. @dhimmel can you confirm? Please remember that I have no context for why that script is the way it is, nor why some of them are coming from raw.githubusercontent.com and some are coming from figshare. Likewise, I have no idea what the relation is between the figshare URLs you have posted and the raw.githubusercontent.com links you used in your Python code above.
I support always using versioning such that the raw data URLs we use never point to changed versions of the data.
In the past, we used figshare to store large files, which did not fit in GitHub. However, we recently updated cancer-data to use Git LFS. As such, we can now download all files from GitHub. There is no need for figshare. Therefore, I'd suggest switching everything to versioned GitHub URLs. All the files will be in this directory. So in other words, we should either be using ALL figshare or ALL GitHub URLs at this point. Given that uploading to figshare requires an extra step, we may want to switch to all GitHub URLs, which will be easiest going forward. You can add a single
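One way to keep every download pinned to the same version is to build all URLs from a single commit hash. This is only a sketch: the `data_url` helper and the constant names are hypothetical, the URL pattern is copied from the links earlier in this thread, and whether that pattern serves Git LFS tracked files is assumed from the discussion rather than verified.

```python
# Commit hash taken from the URLs quoted earlier in this thread
COMMIT = "383668e12a80ccbcc75a4930023aed16afbd208b"
BASE_URL = "https://github.com/cognoma/cancer-data/raw/{commit}/data/{filename}"


def data_url(filename, commit=COMMIT):
    """Build a versioned URL for a file in the cancer-data data directory."""
    return BASE_URL.format(commit=commit, filename=filename)


# Every dataset URL derives from the same commit, so they cannot drift apart
urls = {
    name: data_url(name)
    for name in ("samples.tsv", "mutation-matrix.tsv.bz2", "expression-matrix.tsv.bz2")
}
for name, url in urls.items():
    print(name, url)
```

Moving to a new data release then means changing one value instead of editing each URL separately.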
Cool, sounds good to me. I can point to that directory.
One issue with trying to use that directory: there's no
Is there a reason that's not consistent? I can make this Just Work with that being its location, but it involves a bit of extra code to correct for something that seems like it may not have been intentional.
download is for inputs that we incorporate from external locations. data is for datasets we produce. Honestly,
Ah okay, I don't care much one way or another, just didn't want to write code to work around a bug. However, while testing my code for the
Currently, core-service/api/management/commands/acquiredata.py Lines 31 to 34 in b9b2e4f
If you continue to retrieve it from However, we'll need to look into how the genes table gets used. It may be possible that
New commit for cancer-data datasets: cognoma/cancer-data@93e4c53
Note to self, use this for diseases: https://github.com/cognoma/cancer-data/blob/master/mapping/diseases.tsv
While testing #98 I ran into the following issue while it was trying to load the Mutations table:
Checking https://raw.githubusercontent.com/cognoma/cancer-data/master/data/samples.tsv for the string TCGA-04 turned up nothing, so either the samples table is missing data or the mutations table has rows it shouldn't.