kmerExtractor user manual

Below we present examples of how to use the main functions of the software.

Import libraries

[1]:

import pandas as pd
import matplotlib.pyplot as plt

# software to extract the k-mers
import kmerExtractor as kex

Compute k-mers

Specify the corresponding directory of the FASTA files from which you want to extract the k-mers, and the directory where you want to save the output files. Also define the window size (in base pairs) for which you want the frequency of k-mers to be counted.

In this case we will use the genome of arabidospsis, tomato and maize, with a window size of \(100000\) bp for all genomes.

[2]:

filepath_ara = '../input_data/arabidopsis_genome.fasta'
filepath_tom = '../input_data/tomato_genome.fasta'
filepath_mz = '../input_data/maize_genome.fasta'
output_path = '../output_data'
window_size = 100_000

Now use the kmers_by_window_opt function to calculate the k-mers of each genome for the \(k\) of your choice.

In the output file, the first column indicates the chromosome, the second column the window index, the third and fourth columns indicate the start and end position of the window respectively, and the rest of the \(4^k\) columns correspond to the k-mer frequencies. This data is saved in a csv file and stored in a dataframe for downstream analysis within the tutorial.

The resulting dataframe for arabidopsis 2-mers looks like shown below.

[3]:

df_2mers_ara = kex.kmers_by_window_opt(filepath_ara, k=2, window_size=window_size, output_path=output_path)
df_2mers_tom = kex.kmers_by_window_opt(filepath_tom, k=2, window_size=window_size, output_path=output_path)
df_2mers_mz  = kex.kmers_by_window_opt(filepath_mz,  k=2, window_size=window_size, output_path=output_path)
df_2mers_ara

100%|████████████████████████████████████████| 1537509/1537509 [00:45<00:00, 33733.02it/s]
100%|██████████████████████████████████████| 10090326/10090326 [04:45<00:00, 35388.84it/s]
100%|██████████████████████████████████████| 25502007/25502007 [12:31<00:00, 33940.82it/s]

[3]:

	chromosome	window	start	end	AA	AC	AG	AT	CA	CC	CG	CT	GA	GC	GG	GT	TA	TC	TG	TT
0	chr1	0	1	100000	11617	5201	5956	9480	6269	3277	2412	5588	6375	2998	3460	5248	7993	6069	6253	11803
1	chr1	1	100001	200000	10857	5241	5871	8727	6080	3516	2983	5895	6670	3174	3758	5578	7089	6544	6567	11449
2	chr1	2	200001	300000	10623	5144	6129	8658	6350	3575	2655	6386	6617	3378	3616	5325	6964	6869	6536	11174
3	chr1	3	300001	400000	9548	5117	6328	7979	6483	4297	2771	6834	6625	3923	4047	5310	6315	7049	6759	10614
4	chr1	4	400001	500000	11515	5294	5692	9068	6225	3578	2638	5896	6455	3027	3592	5271	7375	6438	6422	11513
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1190	chr5	267	26700001	26800000	11432	5154	5970	9120	6440	3528	2620	5942	6521	3051	3289	5280	7283	6797	6262	11310
1191	chr5	268	26800001	26900000	11549	5299	5782	9116	6113	3457	2948	5881	6405	3068	3254	5222	7680	6575	5965	11685
1192	chr5	269	26900001	26975502	8826	4001	4609	6658	4719	2639	2158	4387	5163	2352	2573	3979	5386	4912	4726	8413
1193	chr5	269	26900001	26975502	8826	4001	4609	6658	4719	2639	2158	4387	5163	2352	2573	3979	5386	4913	4726	8413
1194	chr5	269	26900001	26975502	8826	4001	4609	6658	4719	2639	2158	4387	5163	2352	2573	3979	5386	4914	4726	8413

1195 rows × 20 columns

However, a most natural approach is to use \(k=3\), since the 3-mers represent codons, which encode for amino acids or signal the termination of protein synthesis (stop signals).

The resulting dataframe for arabidopsis 3-mers looks like below.

[4]:

df_3mers_ara = kex.kmers_by_window_opt(filepath_ara,k=3,window_size=window_size,output_path=output_path)
df_3mers_tom = kex.kmers_by_window_opt(filepath_tom,k=3,window_size=window_size,output_path=output_path)
df_3mers_mz = kex.kmers_by_window_opt(filepath_mz,k=3,window_size=window_size,output_path=output_path)
df_3mers_ara

100%|████████████████████████████████████████| 1537509/1537509 [00:54<00:00, 28160.74it/s]
100%|██████████████████████████████████████| 10090326/10090326 [05:59<00:00, 28039.88it/s]
100%|██████████████████████████████████████| 25502007/25502007 [15:53<00:00, 26741.24it/s]

[4]:

	chromosome	window	start	end	AAA	AAC	AAG	AAT	ACA	ACC	...	TCG	TCT	TGA	TGC	TGG	TGT	TTA	TTC	TTG	TTT
0	chr1	0	1	100000	4415	1966	2228	3008	1926	1096	...	776	2200	2016	1007	1349	1881	2682	2169	2422	4530
1	chr1	1	100001	200000	4026	1914	2114	2803	1849	1063	...	959	2339	2102	957	1432	2076	2324	2283	2489	4353
2	chr1	2	200001	300000	3899	1875	2138	2711	1863	1066	...	932	2572	2098	1085	1390	1963	2255	2477	2322	4120
3	chr1	3	300001	400000	3335	1760	2117	2336	1729	1110	...	833	2586	2041	1313	1456	1949	1998	2398	2334	3884
4	chr1	4	400001	500000	4629	1874	2085	2926	1929	1075	...	885	2301	2109	988	1406	1919	2499	2241	2414	4359
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1190	chr5	267	26700001	26800000	4301	2000	2180	2951	1818	1114	...	891	2453	2014	1028	1297	1922	2394	2385	2340	4191
1191	chr5	268	26800001	26900000	4423	1999	2061	3066	1835	1072	...	973	2304	1962	937	1221	1845	2695	2302	2196	4492
1192	chr5	269	26900001	26975502	3366	1522	1803	2135	1467	797	...	704	1734	1585	738	960	1443	1861	1717	1716	3119
1193	chr5	269	26900001	26975502	3366	1522	1803	2135	1467	797	...	704	1734	1585	738	960	1443	1861	1717	1716	3119
1194	chr5	269	26900001	26975502	3366	1522	1803	2135	1467	797	...	704	1734	1585	738	960	1443	1861	1717	1716	3119

1195 rows × 68 columns

Read k-mer files if already computed

Calculating k-mers can take longer for larger genomes. That’s why you also have the option to load previously saved files, so you don’t have to recalculate the k-mers every time you want to plot some figures or do further analysis.

[5]:

df_2mers_ara = pd.read_csv(output_path +'/CGRW_k2_arabidopsis_genome.csv')
df_2mers_tom = pd.read_csv(output_path +'/CGRW_k2_tomato_genome.csv')
df_2mers_mz = pd.read_csv(output_path +'/CGRW_k2_maize_genome.csv')

df_3mers_ara = pd.read_csv(output_path +'/CGRW_k3_arabidopsis_genome.csv')
df_3mers_tom = pd.read_csv(output_path +'/CGRW_k3_tomato_genome.csv')
df_3mers_mz = pd.read_csv(output_path +'/CGRW_k3_maize_genome.csv')

Group all windows within each chromosome

If you don’t want the k-mer frequencies by windows within the chromosomes, you can get the cumulative frequencies for each chromosome as shown in the cell below. For example, the resulting dataframe for arabidopsis would have only 5 rows corresponding to its 5 chromosomes.

[6]:

df_2mers_ara_bychr = df_2mers_ara.groupby('chromosome').agg({k:'sum' for k in df_2mers_ara.columns[4:]})
df_2mers_tom_bychr = df_2mers_tom.groupby('chromosome').agg({k:'sum' for k in df_2mers_tom.columns[4:]})
df_2mers_mz_bychr = df_2mers_mz.groupby('chromosome').agg({k:'sum' for k in df_2mers_mz.columns[4:]})
df_2mers_ara_bychr

[6]:

	AA	AC	AG	AT	CA	CC	CG	CT	GA	GC	GG	GT	TA	TC	TG	TT
chromosome
chr1	3519572	1588202	1794811	2806911	1923811	1016038	697362	1798090	1930160	902322	1008839	1579754	2335983	1928721	1920061	3512205
chr2	2291405	1029105	1160640	1834416	1246754	672290	457569	1166315	1249456	584680	660355	1026228	1527955	1256858	1242154	2289303
chr3	2697289	1228414	1405644	2153303	1498561	804334	559027	1396347	1516501	714668	805341	1226134	1772296	1510857	1492635	2672182
chr4	2143926	980389	1109010	1707158	1184064	637226	439583	1110444	1195069	560350	630128	970513	1417419	1193357	1177332	2125863
chr5	3140913	1410681	1608073	2510382	1712226	907201	634608	1605965	1741333	808346	914520	1422652	2075592	1733770	1729636	3160014

Group all chromosomes

Furthermore, you can pool the k-mer frequencies of all the chromosomes and have the cumulative counts of the entire sequence.

[7]:

df_2mers_all = pd.DataFrame()
df_2mers_all['arabidopsis'] = df_2mers_ara_bychr.sum()
df_2mers_all['tomato'] = df_2mers_tom_bychr.sum()
df_2mers_all['maize'] = df_2mers_mz_bychr.sum()
df_2mers_all.T

[7]:

	AA	AC	AG	AT	CA	CC	CG	CT	GA	GC	GG	GT	TA	TC	TG	TT
arabidopsis	13793105	6236791	7078178	11012170	7565416	4037089	2788149	7077161	7632519	3570366	4019183	6225281	9129245	7623563	7561818	13759567
tomato	85677817	37700236	41002572	75901364	45946447	25138712	11804902	40969026	42920385	18121904	25183170	37739813	65737185	42898271	45974544	85737480
maize	161563751	108257047	128688710	136989307	133796230	120957040	87977339	128781627	129723428	112442579	121098322	108465666	110415392	129855155	133966157	161938131

k-mer figures

Plot k-mers across windows

With the plot_kmers_across_windows function you can plot the frequencies of the k-mers of your choice across the windows of a specified chromosome.

[8]:

a = kex.plot_kmers_across_windows(df_2mers_ara,['TT','CC'],'chr1')

You can also use this function to compare, placing the plot of different chromosomes on different axes within the same figure.

[9]:

fig,ax = plt.subplots(2,1,figsize=(12,10),gridspec_kw={'hspace': 0.3})

kex.plot_kmers_across_windows(df_3mers_ara,['TTT','CCC'],'chr1',ax=ax[0])
kex.plot_kmers_across_windows(df_3mers_ara,['TTT','CCC'],'chr4',ax=ax[1])

plt.show()

Plot k-mer frequencies within each chromosome

The plot_kmers_freq_within_chromosomes function makes a stacked bar graph of the frequency of different k-mers per chromosome. For the same k-mer, each stacked bar represents a different chromosome of the corresponding color.

[10]:

b = kex.plot_kmers_freq_within_chromosomes(df_2mers_ara,figsize=(8,4))

[11]:

c = kex.plot_kmers_freq_within_chromosomes(df_3mers_ara)

Plot FCGR structure

The technique to show the k-mer frequencies as images is known as Frequency Chaos Game Representation (FCGR). It corresponds to a matrix of dimensions \(2^k \times 2^k\); with the upper left corner equal to the k-mer of C, the upper right corner the k-mer of G, the lower left corner the k-mer of A, and the lower right corner the k-mer of T.

The plot_FCGR function plots the FCGR of a sequence as a heatmap that allows to appreciate the relative frequencies of each k-mer. You can plot the FCGR for a specific chromosome and window, the cumulative frequency for an entire chromosome (if no window is specified), or the cumulative frequency for the entire genome (if neither chromosome nor window is specified).

[12]:

d = kex.plot_FCGR(df_3mers_ara,'chr1',window=145)

You can also adjust the size of the figure and the colormap to your preference as shown below

[13]:

e = kex.plot_FCGR(df_2mers_tom,'chr6',figsize=(4,3),colormap='Greys')

Aditionally, is possible compare the FCGR of multiple genomes as shown in the following cell

[14]:

fig,ax = plt.subplots(1,3,figsize=(20,5))

kex.plot_FCGR(df_3mers_ara,ax=ax[0])
kex.plot_FCGR(df_3mers_tom,ax=ax[1])
kex.plot_FCGR(df_3mers_mz,ax=ax[2])

ax[0].set_title('Arabidopsis')
ax[1].set_title('Tomato')
ax[2].set_title('Maize')

plt.show()

Plot CGR

Finally, although it is not the main objective of the software, a function to generate the CGR image of a sequence is included. In the following example, the FASTA file of the Arabidopsis genome is read and the CGR of the first 100 kb is generated.

[15]:

import re

# Read FASTA file with complete genome separated by chromosomes
f_in = open(filepath_ara, 'r')
all_file = f_in.read()
f_in.close()

# join lines
seq_all = "".join(all_file.split("\n"))
del all_file

# Split sequence by chromosomes
seq_bychr = re.split('>chr[1-9]*', seq_all)[1:]
del seq_all

# plot CGR of the first chromosome, first window (first 100 Kb)
f = kex.plot_CGR(seq_bychr[0][:100_000],markersize=0.2)

100%|████████████████████████████████████████| 100000/100000 [00:00<00:00, 1886046.78it/s]