mod_tools_tutorial

This workflow demonstrates how to use tools I created for the identification and analysis of RM target motifs from Oxford Nanopore sequencing reads. The workflow starts with the BED file generated by ONT's ModKit program, and is broken down into 3 modules:

1. Motif Discovery
Identifies target motifs of the RM system(s) harbored in your genome and calculates the percentage of sites within the motif genome wide that are modified.

2. Methylome Profiler
Maps all modified bases to your genome, counts modified bases within all coding and intergenic regions, counts across all genomic elements, sliding window methylation rates

3. Motif Analysis
Statistical testing on whether identified RM target motifs are enriched or avoided in different areas of your genome

To get started, clone the github repo to your local machine:

MODULE 1: MOTIF DISCOVERY

Get sequences surrounding all modified adenines of length k

Script: Akmaker.py

Inputs: BED file containing modified base call data and the reference genome that was used to map your reads

Output: fasta of all 6mA centered kmers

Use -len argument to specify the size of the kmer, as well as -depth for miniumum sequencing depth to consider (default >10) and -mod_thresh to set a frequency threshold for calling a base modified (default >50). The -modified option is used in this example. Use --help for more output options.

In [ ]:

# extract kmers surrounding all modified adenine positions as a multifasta file
! python3 Motif_Discovery/Akmaker.py \
    -bed test_files/test.bedmethyl \
    -ref test_files/test.fasta \
    -modified \
    -len 11 \
    -out test

Find most commonly modfied short sequences containing your modified adenine

Script: motif_discovery.py

Inputs: BED file, reference genome, set of all 6mA centered kmers (Akmaker.py output)

Output: List of most abundant short sequences, total number of occurences, and modifification frequency genome wide

Specify the 0-indexed position of the 6mA with -mod_pos_index. The -target_lengths argument specifies the motif lengths to consider (separate different numbers with a comma). Use --help to customize other parameters such as miniumum sequencing depths and mod call thresholds.

In [ ]:

! python3 Motif_Discovery/motif_discovery.py \
    -bed test_files/test.bedmethyl \
    -ref test_files/test.fasta \
    -candidates test_positive_seqs.fasta \
    -mod_pos_index 5 \
    -target_lengths 5,6,7 \
    -n_targets 5 

motif 	 occurences 	 mod 	 percent
ATAAA 	 7 	 6 	 0.857
GATAA 	 15 	 12 	 0.8
CGATA 	 30 	 30 	 1.0
CGGTA 	 15 	 15 	 1.0
CGAGA 	 17 	 15 	 0.882
GATAAA 	 7 	 6 	 0.857
CGATAA 	 13 	 11 	 0.846
CGAGAA 	 10 	 10 	 1.0
TCGATA 	 11 	 11 	 1.0
CGATAT 	 9 	 9 	 1.0
GATAAAT 	 3 	 2 	 0.667

Fine tune short sequences to a definitive R-M target motif

Script: mod_calc.py

Inputs: motif or list of motifs and the position of the modified adenine

Output: Total number of occurences, total number modified, and modifification frequency genome wide for each specified motif

Based on results from motif_discovery.py, look up modification frequencies inferred motif(s). Motifs can be entered with -motif with ambiguous nucleotide symbols. Use 1-based index for specifiying the position of the modified adenine in your motif with -mod_pos. Use --help to see options for custom parameters/thresholds. Run multiple times, substituting N's at different positions to hammer down exact specificity of the R-M target and determine a consensus motif.

In [ ]:

! python3 Motif_Discovery/mod_calc.py \
    -bed test_files/test.bedmethyl \
    -ref test_files/test.fasta \
    -motif CGRKA \
    -mod_pos 5 \ 

motif 	 occurences 	 mod 	 percent
CGAGA 	 17 	 15 	 0.882
CGATA 	 30 	 30 	 1.0
CGGGA 	 12 	 12 	 1.0
CGGTA 	 15 	 15 	 1.0


found in genome =  74
modified in genome =  72
Total percent modfied for all motifs =  97.3

MODULE 2: METHYLOME PROFILER

Generate a table of all modified base positions genome wide

Script: mod_mapper.py

Inputs: BED file, reference genome, modified motif, or file containing a list of multiple modified motifs and the position of the modified adenine

Output: table containing coordinates of all modified target and off-target sites, as well as target sites that are not modified

Use --help to customize other parameters such as miniumum sequencing depth to include and mod call thresholds.

In [ ]:

! python3 Methylome_Profiler//mod_mapper.py \
    -bed test_files/test.bedmethyl \
    -ref test_files/test.fasta \
    -motif CGRKA \
    -mod_pos 5 \
    > test.sites

In [ ]:

import pandas as pd
sites_file = 'test.sites'
sites=pd.read_table(sites_file, header=None)
sites = sites.rename(columns={0: 'motif', 1: 'plasmid_id', 2: 'position', 3: 'strand', 4: 'status'})
sites.head()

Out[ ]:

	motif	plasmid_id	position	strand	status
0	off_target	NC_001849.2	564	-	modified
1	off_target	NC_001849.2	565	-	modified
2	CGAGA	NC_001849.2	566	-	modified
3	CGAGA	NC_001849.2	1630	-	unmodified
4	CGATA	NC_001849.2	2848	+	modified

Link modified sites to features in the genome

Script: annotate_methylome.py

Inputs: .gbff file and modified sites table

Output:
3 tsv files named as below following user assigned prefix mod_annotations.tsv: table of modified base counts and normalized rates within all cds and intergenic regions of genome
annotated_sites.tsv: table of modified sites with column for the gene ID corresponding to the coordinate (or interegenic if not within cds)
summary.tsv: table of modified base counts and normalized rates for each chromosome/plasmid in the genome

View feature counts table

In [ ]:

#read file into pandas df
feature_counts_file = 'test_mod_annotations.tsv'
feature_counts=pd.read_table(feature_counts_file)
#filter for element_id's within test files 
feature_counts = feature_counts[ ((feature_counts['element_id'] == 'NC_001849.2') | (feature_counts['element_id'] == 'NC_001852.1'))]

feature_counts.head()

Out[ ]:

	element_id	len	gene_id	protein_id	start	end	feature_len	strand	target_mod	mod_rate	off_target
1457	NC_001849.2	16821	intergenic	na	1	328	328	na	0	0.00	0
1458	NC_001849.2	16821	BB_D01	WP_010890249.1	329	803	475	+	1	2.11	2
1459	NC_001849.2	16821	intergenic	na	804	869	66	na	0	0.00	0
1460	NC_001849.2	16821	BB_RS05615	WP_012614969.1	870	1020	151	+	0	0.00	0
1461	NC_001849.2	16821	intergenic	na	1021	1408	388	na	0	0.00	0

View site annotation table

In [ ]:

annot_file = 'test_annotated_sites.tsv'
annot_sites=pd.read_table(annot_file)
annot_sites.head()

Out[ ]:

	motif	plasmid_id	position	strand	status	annot
0	off_target	NC_001849.2	564	-	modified	BB_D01
1	off_target	NC_001849.2	565	-	modified	BB_D01
2	CGAGA	NC_001849.2	566	-	modified	BB_D01
3	CGAGA	NC_001849.2	1630	-	unmodified	BB_D04
4	CGATA	NC_001849.2	2848	+	modified	intergenic

View methylation summary table

In [ ]:

summary_file = 'test_mod_summary.tsv'
summary_table=pd.read_table(summary_file)
summary_table = summary_table[ ((summary_table['id'] == 'NC_001849.2') | (summary_table['id'] == 'NC_001852.1'))]
summary_table.head()

Out[ ]:

	id	len	total_mods	total_rate	cds_mods	cds_rate	intergenic_mods	intergenic_rate	off_target_mods	off_target_rate	unmodified_targets	unmodified_target_rate
3	NC_001849.2	16821	18	1.070	7	0.879	11	1.240	10	0.594	1	0.059
8	NC_001852.1	29766	54	1.814	53	1.932	1	0.417	27	0.907	1	0.034

6mA motif identification and methylome characterization - tutorial

MODULE 1: MOTIF DISCOVERY

Get sequences surrounding all modified adenines of length k

Find most commonly modfied short sequences containing your modified adenine

Fine tune short sequences to a definitive R-M target motif

MODULE 2: METHYLOME PROFILER

Generate a table of all modified base positions genome wide

Link modified sites to features in the genome