Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The percentage of selected cell is: 0.000% #42

Open
hyjforesight opened this issue Dec 13, 2022 · 4 comments
Open

The percentage of selected cell is: 0.000% #42

hyjforesight opened this issue Dec 13, 2022 · 4 comments

Comments

@hyjforesight
Copy link

Hello Scissor,
I'm running Scissor, but no matter what alpha I used, it always came out 0 selected cells.
Could you please help me with this issue?
Thanks!
Best,
Yuanjian

infos <- Scissor_change(bulk_dataset=GeneMatrix_HGNC_input, sc_dataset=Merge2.combined, phenotype=phenotype, tag = tag, alpha = seq(1,5,1)/10000000, cutoff = 0.01, family = "binomial", Save_file = "Merge2_Scissor.RData")
[1] "|**************************************************|"
[1] "Performing quality-check for the correlations"
[1] "The five-number summary of correlations:"
         0%         25%         50%         75%        100% 
-0.04986036  0.14333164  0.23372394  0.35075314  0.79686469 
[1] "|**************************************************|"
[1] "Current phenotype contains 48 Wild type and 17 Mutated samples."
[1] "Perform logistic regression on the given phenotypes:"
[1] "alpha = 1e-07"
[1] "Scissor identified 0 Scissor+ cells and 0 Scissor- cells."
[1] "The percentage of selected cell is: 0.000%"
@sunduanchen
Copy link
Owner

sunduanchen commented Dec 13, 2022 via email

@hyjforesight
Copy link
Author

hyjforesight commented Dec 13, 2022

Hello Duanchen,
Thanks for the response.
However, your email address was not shown in the reply.
Now, I found 2 genes that cannot be computed Scissor selected cells in my case. They're TP53 and MSH6. I want a cutoff of 0.2 for both.
For MSH6, a=0.00001, Scissor selected 82.336% cells; a=0.000011, 81.302%; a=0.000012-0.00005, 0.
For TP53, a>0.0000001, 0.
Here I shared the link for downloading my data and the code I used. Thanks for the debug!

Data on Google drive: https://drive.google.com/file/d/1ZWuyQD7LVIfS1DBHDjdoP5-ZoAo16NBY/view?usp=sharing (2.1GB)
如果你在国内的话,我可以上传到百度网盘。

Sys.setenv(LANG = "en_US")
library(Seurat)
library(ggplot2)
library(sctransform)
library(dplyr)
library(patchwork)
library(SeuratDisk)
library(TCGAbiolinks)
library(SummarizedExperiment)
library(DT)
library(biomaRt)
library(Scissor)
memory.limit(512000)
options(future.globals.maxSize = 256000*1024^2)

# 1. Get the patient list with the interested gene mutated
# download TCGA CNV information
query_1 <- GDCquery(project = "TCGA-COAD", data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation", workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking", legacy = FALSE, access= 'open', data.format='TXT')
GDCdownload(query_1)
CNV <- GDCprepare(query=query_1, save=TRUE, save.filename='C:/Users/Park_Lab/Documents/TCGA_COAD_CNV.rda')
write.csv(CNV, file="C:/Users/Park_Lab/Documents/TCGA_COAD_CNV.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# retrieve the patient list with the interested gene mutated
mutation_df <- CNV[CNV$Hugo_Symbol == 'MSH6',]    # subset CNV table into a table with only interested gene mutated
write.csv(mutation_df, file="C:/Users/Park_Lab/Documents/mutation_df.csv", sep = ",", row.names=TRUE, col.names = TRUE)

mutation_patient=as.data.frame(mutation_df$Tumor_Sample_Barcode)    # use column `Tumor_Sample_Barcode` to build a new data frame
colnames(mutation_patient) <- c("Patient_ID")    # rename the name of column 0 to Patient_ID
mutation_patient$Patient_ID <- substr(mutation_patient$Patient_ID, 0, 12)    # only keep the first 12 character as Patient_ID, these patients are with interested gene mutated, note that this patient list includes all cancer types
mutation_patient$Status <- 1    # set 1 as the mutation status
write.csv(mutation_patient, file="C:/Users/Park_Lab/Documents/mutation_patient_list.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 2. Get the patient list with interested pathological type
# download TCGA-COAD clinical index data
clinical_index <- GDCquery_clinic(project = "TCGA-COAD", type = "Clinical", save.csv = TRUE)    # will generate TCGA-COAD_clinical.csv automatically

# retrieve the patient list with the interested pathological type
disease_clinical <- clinical_index %>% filter(primary_diagnosis=="Mucinous adenocarcinoma")    # only keep the mucinous cancer patients clinical data
disease_patient=as.data.frame(disease_clinical$submitter_id)    # use column `submitter_id` to build a new data frame
colnames(disease_patient) <- c("Patient_ID")    # rename the name of column 0 to Patient_ID, note that this patient list includes all mucinous cancer with or without interested gene mutated 
write.csv(disease_patient, file="C:/Users/Park_Lab/Documents/disease_patient_list1.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 3. Prepare phenotype list (For patient diagnosed with one pathological type, set mutated ones as 1, wild type as 0)
disease_patient$Status=mutation_patient$Status[match(disease_patient$Patient_ID, mutation_patient$Patient_ID)]    # copy mutation_patient status (1) to disease_patient if disease_patient also exist in mutation_patient data frame
disease_patient[is.na(disease_patient)] = 0    # set left patient status as 0, indicating wild type
write.csv(disease_patient, file="C:/Users/Park_Lab/Documents/disease_patient_list2.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 4. Prepare human gene symbol annotated TCGA-COAD RNA-seq gene matrix 
# download TCGA-COAD gene matrix
query_2 <- GDCquery(project = "TCGA-COAD", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", legacy = FALSE, access= 'open', data.format='TXT', experimental.strategy='RNA-Seq', sample.type=c('Primary Tumor'))    # didn't download 'Solid Tissue Normal' data, because this script is for tumor CNV
GDCdownload(query_2)
RNA_seq <- GDCprepare(query=query_2, save=TRUE, save.filename='C:/Users/Park_Lab/Documents/TCGA_COAD_RNA_seq.rda')
GeneMatrix <- assay(RNA_seq)
write.csv(GeneMatrix, file="C:/Users/Park_Lab/Documents/TCGA_COAD_GeneMatrix.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# convert human ensembl_id of gene matrix to gene_symbol
human <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
ensembl_id <- rownames(GeneMatrix)
gene_symbol <- getBM(attributes = c('ensembl_gene_id_version', 'hgnc_symbol'), filters = 'ensembl_gene_id_version', values = ensembl_id, mart = human)
head(gene_symbol)

gene_symbol <- gene_symbol[!(gene_symbol$hgnc_symbol==""),]    # remove rows with empty human gene
gene_symbol <- gene_symbol[!duplicated(gene_symbol$ensembl_gene_id_version),]    # remove rows with duplicated human ensembl
gene_symbol <- gene_symbol[!duplicated(gene_symbol$hgnc_symbol),]    # remove rows with duplicated human gene

class(GeneMatrix)
GeneMatrix_df<- as.data.frame(GeneMatrix)
class(GeneMatrix_df)

GeneMatrix_df$genes <- gene_symbol$hgnc_symbol[match(rownames(GeneMatrix_df), gene_symbol$ensembl_gene_id_version)]    # copy gene symbols to gene expression matrix, matching human ensembl id
GeneMatrix_df <- na.omit(GeneMatrix_df)    # remove rows with 'NA' in 'genes' column
rownames(GeneMatrix_df) <- GeneMatrix_df$genes    # set 'genes' as rownames
GeneMatrix_HGNC = GeneMatrix_df[,!(colnames(GeneMatrix_df) %in% c('genes'))]    # remove 'genes' column
write.csv(GeneMatrix_HGNC, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC.csv", sep = ",", row.names=TRUE, col.names = TRUE)    # gene matrix with gene_symbol

# 5. Match gene matrix and phenotype list
# subset gene matrix into the patients that only exist in the phenotype list
colnames(GeneMatrix_HGNC) <- substr(colnames(GeneMatrix_HGNC), 0, 12)    # only keep the first 12 character of sample ID to for matching with Patient_ID later
GeneMatrix_HGNC <- GeneMatrix_HGNC[, !duplicated(colnames(GeneMatrix_HGNC))]    # 1 patient may have >1 samples, after simplify sample ID, there will be duplicates, but phenotype list has no duplicates, so we have to remove these sample duplicates
GeneMatrix_HGNC <- subset(GeneMatrix_HGNC, select = colnames(GeneMatrix_HGNC) %in% disease_patient$Patient_ID)    # only include patients that also exist in the phenotype list
write.csv(GeneMatrix_HGNC, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC_disease.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# make gene matrix and phenotype list the same order
GeneMatrix_HGNC_sorted <- GeneMatrix_HGNC[ , order(match(colnames(GeneMatrix_HGNC), disease_patient$Patient_ID))]    # sort gene matrix the same order as patient ids of phenotype list
all(colnames(GeneMatrix_HGNC_sorted) == disease_patient$Patient_ID)    # check whether the orders of sample ids in gene matrix and phenotype list are the same
write.csv(GeneMatrix_HGNC_sorted, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC_disease_sorted.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# make the sample id annotated numeric phenotype list
class(disease_patient)
rownames(disease_patient) <- disease_patient$Patient_ID    # set Patient_ID as data frame index
disease_patient_input = disease_patient[,!(colnames(disease_patient) %in% c('Patient_ID'))]    # remove column 'Patient_ID'
disease_patient_input    # disease_patient becomes numeric, only the values of column 'Status' will show
names(disease_patient_input) = rownames(disease_patient)    # copy the rownames back
disease_patient_input
class(disease_patient_input)
table(disease_patient_input)    # show statistics of status 0 and 1
write.csv(disease_patient_input, file="C:/Users/Park_Lab/Documents/disease_patient_input.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 6. Prepare scRNA-seq data
Merge2.combined <- readRDS(file = "C:/Users/Park_Lab/Documents/Merge2.combined_all.rds")

# 7. Run Scissor
phenotype <- disease_patient_input
tag <- c('Wild type', 'Mutated')    # name the phenotype 0 (Wild type) first, then phenotype 1 (Mutated)
GeneMatrix_HGNC_input  <- as.matrix(GeneMatrix_HGNC_sorted)    # convert gene matrix (data.frame) to 'matrix'
# data is too big, the original Scissor() got errors. To fix this error, run the code of Scissor_change.R and as_matrix.R first, then run below steps
infos <- Scissor_change(bulk_dataset=GeneMatrix_HGNC_input, sc_dataset=Merge2.combined, phenotype=phenotype, tag = tag, alpha = 0.000012, cutoff = 0.01, family = "binomial", Save_file = "Merge2_Scissor.RData")
# MSH6 0.00001, 82.336%; 0.000011, 81.302%; 0.000012-0.00005, 0;
# TP53 >0.0000001, 0

@Zhao-yufeiii
Copy link

Hello Duanchen, Thanks for the response. However, your email address was not shown in the reply. Now, I found 2 genes that cannot be computed Scissor selected cells in my case. They're TP53 and MSH6. I want a cutoff of 0.2 for both. For MSH6, a=0.00001, Scissor selected 82.336% cells; a=0.000011, 81.302%; a=0.000012-0.00005, 0. For TP53, a>0.0000001, 0. Here I shared the link for downloading my data and the code I used. Thanks for the debug!

Data on Google drive: https://drive.google.com/file/d/1ZWuyQD7LVIfS1DBHDjdoP5-ZoAo16NBY/view?usp=sharing (2.1GB) 如果你在国内的话,我可以上传到百度网盘。

Sys.setenv(LANG = "en_US")
library(Seurat)
library(ggplot2)
library(sctransform)
library(dplyr)
library(patchwork)
library(SeuratDisk)
library(TCGAbiolinks)
library(SummarizedExperiment)
library(DT)
library(biomaRt)
library(Scissor)
memory.limit(512000)
options(future.globals.maxSize = 256000*1024^2)

# 1. Get the patient list with the interested gene mutated
# download TCGA CNV information
query_1 <- GDCquery(project = "TCGA-COAD", data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation", workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking", legacy = FALSE, access= 'open', data.format='TXT')
GDCdownload(query_1)
CNV <- GDCprepare(query=query_1, save=TRUE, save.filename='C:/Users/Park_Lab/Documents/TCGA_COAD_CNV.rda')
write.csv(CNV, file="C:/Users/Park_Lab/Documents/TCGA_COAD_CNV.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# retrieve the patient list with the interested gene mutated
mutation_df <- CNV[CNV$Hugo_Symbol == 'MSH6',]    # subset CNV table into a table with only interested gene mutated
write.csv(mutation_df, file="C:/Users/Park_Lab/Documents/mutation_df.csv", sep = ",", row.names=TRUE, col.names = TRUE)

mutation_patient=as.data.frame(mutation_df$Tumor_Sample_Barcode)    # use column `Tumor_Sample_Barcode` to build a new data frame
colnames(mutation_patient) <- c("Patient_ID")    # rename the name of column 0 to Patient_ID
mutation_patient$Patient_ID <- substr(mutation_patient$Patient_ID, 0, 12)    # only keep the first 12 character as Patient_ID, these patients are with interested gene mutated, note that this patient list includes all cancer types
mutation_patient$Status <- 1    # set 1 as the mutation status
write.csv(mutation_patient, file="C:/Users/Park_Lab/Documents/mutation_patient_list.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 2. Get the patient list with interested pathological type
# download TCGA-COAD clinical index data
clinical_index <- GDCquery_clinic(project = "TCGA-COAD", type = "Clinical", save.csv = TRUE)    # will generate TCGA-COAD_clinical.csv automatically

# retrieve the patient list with the interested pathological type
disease_clinical <- clinical_index %>% filter(primary_diagnosis=="Mucinous adenocarcinoma")    # only keep the mucinous cancer patients clinical data
disease_patient=as.data.frame(disease_clinical$submitter_id)    # use column `submitter_id` to build a new data frame
colnames(disease_patient) <- c("Patient_ID")    # rename the name of column 0 to Patient_ID, note that this patient list includes all mucinous cancer with or without interested gene mutated 
write.csv(disease_patient, file="C:/Users/Park_Lab/Documents/disease_patient_list1.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 3. Prepare phenotype list (For patient diagnosed with one pathological type, set mutated ones as 1, wild type as 0)
disease_patient$Status=mutation_patient$Status[match(disease_patient$Patient_ID, mutation_patient$Patient_ID)]    # copy mutation_patient status (1) to disease_patient if disease_patient also exist in mutation_patient data frame
disease_patient[is.na(disease_patient)] = 0    # set left patient status as 0, indicating wild type
write.csv(disease_patient, file="C:/Users/Park_Lab/Documents/disease_patient_list2.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 4. Prepare human gene symbol annotated TCGA-COAD RNA-seq gene matrix 
# download TCGA-COAD gene matrix
query_2 <- GDCquery(project = "TCGA-COAD", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", legacy = FALSE, access= 'open', data.format='TXT', experimental.strategy='RNA-Seq', sample.type=c('Primary Tumor'))    # didn't download 'Solid Tissue Normal' data, because this script is for tumor CNV
GDCdownload(query_2)
RNA_seq <- GDCprepare(query=query_2, save=TRUE, save.filename='C:/Users/Park_Lab/Documents/TCGA_COAD_RNA_seq.rda')
GeneMatrix <- assay(RNA_seq)
write.csv(GeneMatrix, file="C:/Users/Park_Lab/Documents/TCGA_COAD_GeneMatrix.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# convert human ensembl_id of gene matrix to gene_symbol
human <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
ensembl_id <- rownames(GeneMatrix)
gene_symbol <- getBM(attributes = c('ensembl_gene_id_version', 'hgnc_symbol'), filters = 'ensembl_gene_id_version', values = ensembl_id, mart = human)
head(gene_symbol)

gene_symbol <- gene_symbol[!(gene_symbol$hgnc_symbol==""),]    # remove rows with empty human gene
gene_symbol <- gene_symbol[!duplicated(gene_symbol$ensembl_gene_id_version),]    # remove rows with duplicated human ensembl
gene_symbol <- gene_symbol[!duplicated(gene_symbol$hgnc_symbol),]    # remove rows with duplicated human gene

class(GeneMatrix)
GeneMatrix_df<- as.data.frame(GeneMatrix)
class(GeneMatrix_df)

GeneMatrix_df$genes <- gene_symbol$hgnc_symbol[match(rownames(GeneMatrix_df), gene_symbol$ensembl_gene_id_version)]    # copy gene symbols to gene expression matrix, matching human ensembl id
GeneMatrix_df <- na.omit(GeneMatrix_df)    # remove rows with 'NA' in 'genes' column
rownames(GeneMatrix_df) <- GeneMatrix_df$genes    # set 'genes' as rownames
GeneMatrix_HGNC = GeneMatrix_df[,!(colnames(GeneMatrix_df) %in% c('genes'))]    # remove 'genes' column
write.csv(GeneMatrix_HGNC, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC.csv", sep = ",", row.names=TRUE, col.names = TRUE)    # gene matrix with gene_symbol

# 5. Match gene matrix and phenotype list
# subset gene matrix into the patients that only exist in the phenotype list
colnames(GeneMatrix_HGNC) <- substr(colnames(GeneMatrix_HGNC), 0, 12)    # only keep the first 12 character of sample ID to for matching with Patient_ID later
GeneMatrix_HGNC <- GeneMatrix_HGNC[, !duplicated(colnames(GeneMatrix_HGNC))]    # 1 patient may have >1 samples, after simplify sample ID, there will be duplicates, but phenotype list has no duplicates, so we have to remove these sample duplicates
GeneMatrix_HGNC <- subset(GeneMatrix_HGNC, select = colnames(GeneMatrix_HGNC) %in% disease_patient$Patient_ID)    # only include patients that also exist in the phenotype list
write.csv(GeneMatrix_HGNC, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC_disease.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# make gene matrix and phenotype list the same order
GeneMatrix_HGNC_sorted <- GeneMatrix_HGNC[ , order(match(colnames(GeneMatrix_HGNC), disease_patient$Patient_ID))]    # sort gene matrix the same order as patient ids of phenotype list
all(colnames(GeneMatrix_HGNC_sorted) == disease_patient$Patient_ID)    # check whether the orders of sample ids in gene matrix and phenotype list are the same
write.csv(GeneMatrix_HGNC_sorted, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC_disease_sorted.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# make the sample id annotated numeric phenotype list
class(disease_patient)
rownames(disease_patient) <- disease_patient$Patient_ID    # set Patient_ID as data frame index
disease_patient_input = disease_patient[,!(colnames(disease_patient) %in% c('Patient_ID'))]    # remove column 'Patient_ID'
disease_patient_input    # disease_patient becomes numeric, only the values of column 'Status' will show
names(disease_patient_input) = rownames(disease_patient)    # copy the rownames back
disease_patient_input
class(disease_patient_input)
table(disease_patient_input)    # show statistics of status 0 and 1
write.csv(disease_patient_input, file="C:/Users/Park_Lab/Documents/disease_patient_input.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 6. Prepare scRNA-seq data
Merge2.combined <- readRDS(file = "C:/Users/Park_Lab/Documents/Merge2.combined_all.rds")

# 7. Run Scissor
phenotype <- disease_patient_input
tag <- c('Wild type', 'Mutated')    # name the phenotype 0 (Wild type) first, then phenotype 1 (Mutated)
GeneMatrix_HGNC_input  <- as.matrix(GeneMatrix_HGNC_sorted)    # convert gene matrix (data.frame) to 'matrix'
# data is too big, the original Scissor() got errors. To fix this error, run the code of Scissor_change.R and as_matrix.R first, then run below steps
infos <- Scissor_change(bulk_dataset=GeneMatrix_HGNC_input, sc_dataset=Merge2.combined, phenotype=phenotype, tag = tag, alpha = 0.000012, cutoff = 0.01, family = "binomial", Save_file = "Merge2_Scissor.RData")
# MSH6 0.00001, 82.336%; 0.000011, 81.302%; 0.000012-0.00005, 0;
# TP53 >0.0000001, 0

Hi, I met the same problem that Scissor turned out to select 0.000%. Could I ask you how you work this out?

@YZJenny
Copy link

YZJenny commented Jul 26, 2024

Hi, I met the same problem that Scissor turned out to select 0.000%. Could I ask you how you work this out?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants