This workflow demonstrates how to use tools I created for the identification and analysis of RM target motifs from Oxford Nanopore sequencing reads. The workflow starts with the BED file generated by ONT's ModKit program, and is broken down into 3 modules:
1. Motif Discovery
Identifies target motifs of the RM system(s) harbored in your genome and calculates the percentage of sites within the motif genome wide that are modified.
2. Methylome Profiler
Maps all modified bases to your genome, counts modified bases within all coding and intergenic regions, counts across all genomic elements, sliding window methylation rates
3. Motif Analysis
Statistical testing on whether identified RM target motifs are enriched or avoided in different areas of your genome
To get started, clone the github repo to your local machine:
# clone github repo
! git clone https://github.com/bely12/borrelia-methylomics.git
Inputs: BED file containing modified base call data and the reference genome that was used to map your reads
Output: fasta of all 6mA centered kmers
Use -len argument to specify the size of the kmer, as well as -depth for miniumum sequencing depth to consider (default >10) and -mod_thresh to set a frequency threshold for calling a base modified (default >50). The -modified option is used in this example. Use --help for more output options.
# extract kmers surrounding all modified adenine positions as a multifasta file
! python3 Motif_Discovery/Akmaker.py \
-bed test_files/test.bedmethyl \
-ref test_files/test.fasta \
-modified \
-len 11 \
-out test
Find most commonly modfied short sequences containing your modified adenine
Script: motif_discovery.py
Inputs: BED file, reference genome, set of all 6mA centered kmers (Akmaker.py output)
Output: List of most abundant short sequences, total number of occurences, and modifification frequency genome wide
Specify the 0-indexed position of the 6mA with -mod_pos_index. The -target_lengths argument specifies the motif lengths to consider (separate different numbers with a comma). Use --help to customize other parameters such as miniumum sequencing depths and mod call thresholds.
! python3 Motif_Discovery/motif_discovery.py \
-bed test_files/test.bedmethyl \
-ref test_files/test.fasta \
-candidates test_positive_seqs.fasta \
-mod_pos_index 5 \
-target_lengths 5,6,7 \
-n_targets 5
motif occurences mod percent ATAAA 7 6 0.857 GATAA 15 12 0.8 CGATA 30 30 1.0 CGGTA 15 15 1.0 CGAGA 17 15 0.882 GATAAA 7 6 0.857 CGATAA 13 11 0.846 CGAGAA 10 10 1.0 TCGATA 11 11 1.0 CGATAT 9 9 1.0 GATAAAT 3 2 0.667
Fine tune short sequences to a definitive R-M target motif
Script: mod_calc.py
Inputs: motif or list of motifs and the position of the modified adenine
Output: Total number of occurences, total number modified, and modifification frequency genome wide for each specified motif
Based on results from motif_discovery.py, look up modification frequencies inferred motif(s). Motifs can be entered with -motif with ambiguous nucleotide symbols. Use 1-based index for specifiying the position of the modified adenine in your motif with -mod_pos. Use --help to see options for custom parameters/thresholds. Run multiple times, substituting N's at different positions to hammer down exact specificity of the R-M target and determine a consensus motif.
! python3 Motif_Discovery/mod_calc.py \
-bed test_files/test.bedmethyl \
-ref test_files/test.fasta \
-motif CGRKA \
-mod_pos 5 \
motif occurences mod percent CGAGA 17 15 0.882 CGATA 30 30 1.0 CGGGA 12 12 1.0 CGGTA 15 15 1.0 found in genome = 74 modified in genome = 72 Total percent modfied for all motifs = 97.3
Script: mod_mapper.py
Inputs: BED file, reference genome, modified motif, or file containing a list of multiple modified motifs and the position of the modified adenine
Output: table containing coordinates of all modified target and off-target sites, as well as target sites that are not modified
Use --help to customize other parameters such as miniumum sequencing depth to include and mod call thresholds.
! python3 Methylome_Profiler//mod_mapper.py \
-bed test_files/test.bedmethyl \
-ref test_files/test.fasta \
-motif CGRKA \
-mod_pos 5 \
> test.sites
View sites table
import pandas as pd
sites_file = 'test.sites'
sites=pd.read_table(sites_file, header=None)
sites = sites.rename(columns={0: 'motif', 1: 'plasmid_id', 2: 'position', 3: 'strand', 4: 'status'})
sites.head()
| motif | plasmid_id | position | strand | status | |
|---|---|---|---|---|---|
| 0 | off_target | NC_001849.2 | 564 | - | modified |
| 1 | off_target | NC_001849.2 | 565 | - | modified |
| 2 | CGAGA | NC_001849.2 | 566 | - | modified |
| 3 | CGAGA | NC_001849.2 | 1630 | - | unmodified |
| 4 | CGATA | NC_001849.2 | 2848 | + | modified |
Link modified sites to features in the genome
Script: annotate_methylome.py
Inputs: .gbff file and modified sites table
Output:
3 tsv files named as below following user assigned prefix
mod_annotations.tsv: table of modified base counts and normalized rates within all cds and intergenic regions of genome
annotated_sites.tsv: table of modified sites with column for the gene ID corresponding to the coordinate (or interegenic if not within cds)
summary.tsv: table of modified base counts and normalized rates for each chromosome/plasmid in the genome
! python3 Methylome_Profiler/annotate_methylome.py \
-gbff test_files/B31.gbff \
-sites test.sites \
-out test
View feature counts table
#read file into pandas df
feature_counts_file = 'test_mod_annotations.tsv'
feature_counts=pd.read_table(feature_counts_file)
#filter for element_id's within test files
feature_counts = feature_counts[ ((feature_counts['element_id'] == 'NC_001849.2') | (feature_counts['element_id'] == 'NC_001852.1'))]
feature_counts.head()
| element_id | len | gene_id | protein_id | start | end | feature_len | strand | target_mod | mod_rate | target_no_mod | unmodified_rate | off_target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1457 | NC_001849.2 | 16821 | intergenic | na | 1 | 328 | 328 | na | 0 | 0.00 | 0 | 0.0 | 0 |
| 1458 | NC_001849.2 | 16821 | BB_D01 | WP_010890249.1 | 329 | 803 | 475 | + | 1 | 2.11 | 0 | 0.0 | 2 |
| 1459 | NC_001849.2 | 16821 | intergenic | na | 804 | 869 | 66 | na | 0 | 0.00 | 0 | 0.0 | 0 |
| 1460 | NC_001849.2 | 16821 | BB_RS05615 | WP_012614969.1 | 870 | 1020 | 151 | + | 0 | 0.00 | 0 | 0.0 | 0 |
| 1461 | NC_001849.2 | 16821 | intergenic | na | 1021 | 1408 | 388 | na | 0 | 0.00 | 0 | 0.0 | 0 |
View site annotation table
annot_file = 'test_annotated_sites.tsv'
annot_sites=pd.read_table(annot_file)
annot_sites.head()
| motif | plasmid_id | position | strand | status | annot | |
|---|---|---|---|---|---|---|
| 0 | off_target | NC_001849.2 | 564 | - | modified | BB_D01 |
| 1 | off_target | NC_001849.2 | 565 | - | modified | BB_D01 |
| 2 | CGAGA | NC_001849.2 | 566 | - | modified | BB_D01 |
| 3 | CGAGA | NC_001849.2 | 1630 | - | unmodified | BB_D04 |
| 4 | CGATA | NC_001849.2 | 2848 | + | modified | intergenic |
View methylation summary table
summary_file = 'test_mod_summary.tsv'
summary_table=pd.read_table(summary_file)
summary_table = summary_table[ ((summary_table['id'] == 'NC_001849.2') | (summary_table['id'] == 'NC_001852.1'))]
summary_table.head()
| id | len | total_mods | total_rate | cds_mods | cds_rate | intergenic_mods | intergenic_rate | off_target_mods | off_target_rate | unmodified_targets | unmodified_target_rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | NC_001849.2 | 16821 | 18 | 1.070 | 7 | 0.879 | 11 | 1.240 | 10 | 0.594 | 1 | 0.059 |
| 8 | NC_001852.1 | 29766 | 54 | 1.814 | 53 | 1.932 | 1 | 0.417 | 27 | 0.907 | 1 | 0.034 |