https://github.com/karolinasenkow/python-wrapper.git
cd python-wrapper
python3 wrapper.py -d 1
rm SRR*
python3 wrapper.py -d 0
Human herpesvirus 5 is also known as Human cytomegalovirus and is typically abbreviated as HCMV. From Wikipedia: Although they may be found throughout the body, HCMV infections are frequently associated with the salivary glands. HCMV infection is typically unnoticed in healthy people, but can be life-threatening for the immunocompromised, such as HIV-infected persons, organ transplant recipients, or newborn infants. Congenital cytomegalovirus infection can lead to significant morbidity and even death. After infection, HCMV remains latent within the body throughout life and can be reactivated at any time. Eventually, it may cause mucoepidermoid carcinoma and possibly other malignancies such as prostate cancer. Cheng et al. 2017 (https://www.ncbi.nlm.nih.gov/pubmed/29158406) produced the transcriptomes of HCMV post infection. Write a Python script to automate the following and produce the output file requested named miniProject.log and other output files in a folder named miniProject_FirstName_LastName [where youve indicated your first and last name]. ALL results generated by you or programs called should be written to this folder. The easiest way to guarantee this is to create the folder using an os.system() call and then move into it via an os.chdir() call.
-
We would like to compare HCMV transcriptomes 2- and 6-days post-infection (dpi). First, retrieve the following transcriptomes from two patient donors from SRA and convert to paired-end fastq files. You can use wget (by constructing the path based on the SRR numbers for each of these samples). Donor 1 (2dpi): https://www.ncbi.nlm.nih.gov/sra/SRX2896360 Donor 1 (6dpi): https://www.ncbi.nlm.nih.gov/sra/SRX2896363 Donor 3 (2dpi): https://www.ncbi.nlm.nih.gov/sra/SRX2896374 Donor 3 (6dpi): https://www.ncbi.nlm.nih.gov/sra/SRX2896375
-
We will quantify TPM in each sample using kallisto, but first we need to build a transcriptome index for HCMV (NCBI accession EF999921). Use Biopython to retrieve and generate the appropriate input and then build the index with kallisto. You will need to extract the CDS features from the GenBank format.
-
Quantify the TPM of each CDS in each transcriptome using kallisto and use these results as input to find differentially expressed genes between the two timepoints (2pi and 6dpi) using the R package sleuth. Write the following details for each significant transcript (FDR < 0.05) to your log file, include a header row, and tab-delimit each item: target_id test_stat pval qval
-
It has been proposed that HCMV disease and pathogenesis may be related to the genetic diversity of the virus (Renzette et al. https://www.ncbi.nlm.nih.gov/pubmed/25154343/). Which publicly available strains are most similar to these patient samples? To compare to other strains, we will assemble these transcriptome reads. We dont expect assembly to produce the entire genome, but enough to be useful in BLAST. Virus sequencing experiments often include host DNAs. It is difficult to isolate the RNA of just the virus (as it only transcribes during infection of the host cell). Before assembly, lets make sure our reads map to HCMV. Using Bowtie2, create an index for HCMV (NCBI accession EF999921). Next, save the reads that map to the HCMV index for use in assembly. Write to your log file the number of reads in each transcriptome before and after the Bowtie2 mapping. For instance, if I was looking at the Donor 1 (2dpi) sample, I would write to the log (numbers here are arbitrary): Donor 1 (2dpi) had 230000 read pairs before Bowtie2 filtering and 100000 read pairs after.
-
Using the Bowtie2 output reads, assemble all four transcriptomes together to produce 1 assembly via SPAdes. Write the SPAdes command used to the log file.
-
Write Python code to calculate the number of contigs with a length > 1000 and write the # to the log file as follows (replace # with the correct integer): There are # contigs > 1000 bp in the assembly.
-
Write Python code to calculate the length of the assembly (the total number of bp in all of the contigs > 1000 bp in length) and write this # to the log file as follows (replace # with the correct integer): There are # bp in the assembly.
-
Write Python code to retrieve the longest contig from your SPAdes assembly. Use the longest contig as blast+ input to query the nr nucleotide database limited to members of the Betaherpesvirinae subfamily. You will need to make a local database of just sequences from the Betaherpesvirinae subfamily. Identify the top 10 hits. For the top 10 hits, write the following to your log file: Subject accession, Percent identity, Alignment length, Start of alignment in query, End of alignment in query, Start of alignment in subject, End of alignment in subject, Bit score, E-value, and Subject Title.
Karolina Senkow