kmerExtractor user manual

Below we present examples of how to use the main functions of the software.

Import libraries

[1]:
import pandas as pd
import matplotlib.pyplot as plt

# software to extract the k-mers
import kmerExtractor as kex

Compute k-mers

Specify the corresponding directory of the FASTA files from which you want to extract the k-mers, and the directory where you want to save the output files. Also define the window size (in base pairs) for which you want the frequency of k-mers to be counted.

In this case we will use the genome of arabidospsis, tomato and maize, with a window size of \(100000\) bp for all genomes.

[2]:
filepath_ara = '../input_data/arabidopsis_genome.fasta'
filepath_tom = '../input_data/tomato_genome.fasta'
filepath_mz = '../input_data/maize_genome.fasta'
output_path = '../output_data'
window_size = 100_000

Now use the kmers_by_window_opt function to calculate the k-mers of each genome for the \(k\) of your choice.

In the output file, the first column indicates the chromosome, the second column the window index, the third and fourth columns indicate the start and end position of the window respectively, and the rest of the \(4^k\) columns correspond to the k-mer frequencies. This data is saved in a csv file and stored in a dataframe for downstream analysis within the tutorial.

The resulting dataframe for arabidopsis 2-mers looks like shown below.

[3]:
df_2mers_ara = kex.kmers_by_window_opt(filepath_ara, k=2, window_size=window_size, output_path=output_path)
df_2mers_tom = kex.kmers_by_window_opt(filepath_tom, k=2, window_size=window_size, output_path=output_path)
df_2mers_mz  = kex.kmers_by_window_opt(filepath_mz,  k=2, window_size=window_size, output_path=output_path)
df_2mers_ara
100%|████████████████████████████████████████| 1537509/1537509 [00:45<00:00, 33733.02it/s]
100%|██████████████████████████████████████| 10090326/10090326 [04:45<00:00, 35388.84it/s]
100%|██████████████████████████████████████| 25502007/25502007 [12:31<00:00, 33940.82it/s]
[3]:
chromosome window start end AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
0 chr1 0 1 100000 11617 5201 5956 9480 6269 3277 2412 5588 6375 2998 3460 5248 7993 6069 6253 11803
1 chr1 1 100001 200000 10857 5241 5871 8727 6080 3516 2983 5895 6670 3174 3758 5578 7089 6544 6567 11449
2 chr1 2 200001 300000 10623 5144 6129 8658 6350 3575 2655 6386 6617 3378 3616 5325 6964 6869 6536 11174
3 chr1 3 300001 400000 9548 5117 6328 7979 6483 4297 2771 6834 6625 3923 4047 5310 6315 7049 6759 10614
4 chr1 4 400001 500000 11515 5294 5692 9068 6225 3578 2638 5896 6455 3027 3592 5271 7375 6438 6422 11513
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1190 chr5 267 26700001 26800000 11432 5154 5970 9120 6440 3528 2620 5942 6521 3051 3289 5280 7283 6797 6262 11310
1191 chr5 268 26800001 26900000 11549 5299 5782 9116 6113 3457 2948 5881 6405 3068 3254 5222 7680 6575 5965 11685
1192 chr5 269 26900001 26975502 8826 4001 4609 6658 4719 2639 2158 4387 5163 2352 2573 3979 5386 4912 4726 8413
1193 chr5 269 26900001 26975502 8826 4001 4609 6658 4719 2639 2158 4387 5163 2352 2573 3979 5386 4913 4726 8413
1194 chr5 269 26900001 26975502 8826 4001 4609 6658 4719 2639 2158 4387 5163 2352 2573 3979 5386 4914 4726 8413

1195 rows × 20 columns

However, a most natural approach is to use \(k=3\), since the 3-mers represent codons, which encode for amino acids or signal the termination of protein synthesis (stop signals).

The resulting dataframe for arabidopsis 3-mers looks like below.

[4]:
df_3mers_ara = kex.kmers_by_window_opt(filepath_ara,k=3,window_size=window_size,output_path=output_path)
df_3mers_tom = kex.kmers_by_window_opt(filepath_tom,k=3,window_size=window_size,output_path=output_path)
df_3mers_mz = kex.kmers_by_window_opt(filepath_mz,k=3,window_size=window_size,output_path=output_path)
df_3mers_ara
100%|████████████████████████████████████████| 1537509/1537509 [00:54<00:00, 28160.74it/s]
100%|██████████████████████████████████████| 10090326/10090326 [05:59<00:00, 28039.88it/s]
100%|██████████████████████████████████████| 25502007/25502007 [15:53<00:00, 26741.24it/s]
[4]:
chromosome window start end AAA AAC AAG AAT ACA ACC ... TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
0 chr1 0 1 100000 4415 1966 2228 3008 1926 1096 ... 776 2200 2016 1007 1349 1881 2682 2169 2422 4530
1 chr1 1 100001 200000 4026 1914 2114 2803 1849 1063 ... 959 2339 2102 957 1432 2076 2324 2283 2489 4353
2 chr1 2 200001 300000 3899 1875 2138 2711 1863 1066 ... 932 2572 2098 1085 1390 1963 2255 2477 2322 4120
3 chr1 3 300001 400000 3335 1760 2117 2336 1729 1110 ... 833 2586 2041 1313 1456 1949 1998 2398 2334 3884
4 chr1 4 400001 500000 4629 1874 2085 2926 1929 1075 ... 885 2301 2109 988 1406 1919 2499 2241 2414 4359
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1190 chr5 267 26700001 26800000 4301 2000 2180 2951 1818 1114 ... 891 2453 2014 1028 1297 1922 2394 2385 2340 4191
1191 chr5 268 26800001 26900000 4423 1999 2061 3066 1835 1072 ... 973 2304 1962 937 1221 1845 2695 2302 2196 4492
1192 chr5 269 26900001 26975502 3366 1522 1803 2135 1467 797 ... 704 1734 1585 738 960 1443 1861 1717 1716 3119
1193 chr5 269 26900001 26975502 3366 1522 1803 2135 1467 797 ... 704 1734 1585 738 960 1443 1861 1717 1716 3119
1194 chr5 269 26900001 26975502 3366 1522 1803 2135 1467 797 ... 704 1734 1585 738 960 1443 1861 1717 1716 3119

1195 rows × 68 columns

Read k-mer files if already computed

Calculating k-mers can take longer for larger genomes. That’s why you also have the option to load previously saved files, so you don’t have to recalculate the k-mers every time you want to plot some figures or do further analysis.

[5]:
df_2mers_ara = pd.read_csv(output_path +'/CGRW_k2_arabidopsis_genome.csv')
df_2mers_tom = pd.read_csv(output_path +'/CGRW_k2_tomato_genome.csv')
df_2mers_mz = pd.read_csv(output_path +'/CGRW_k2_maize_genome.csv')

df_3mers_ara = pd.read_csv(output_path +'/CGRW_k3_arabidopsis_genome.csv')
df_3mers_tom = pd.read_csv(output_path +'/CGRW_k3_tomato_genome.csv')
df_3mers_mz = pd.read_csv(output_path +'/CGRW_k3_maize_genome.csv')

Group all windows within each chromosome

If you don’t want the k-mer frequencies by windows within the chromosomes, you can get the cumulative frequencies for each chromosome as shown in the cell below. For example, the resulting dataframe for arabidopsis would have only 5 rows corresponding to its 5 chromosomes.

[6]:
df_2mers_ara_bychr = df_2mers_ara.groupby('chromosome').agg({k:'sum' for k in df_2mers_ara.columns[4:]})
df_2mers_tom_bychr = df_2mers_tom.groupby('chromosome').agg({k:'sum' for k in df_2mers_tom.columns[4:]})
df_2mers_mz_bychr = df_2mers_mz.groupby('chromosome').agg({k:'sum' for k in df_2mers_mz.columns[4:]})
df_2mers_ara_bychr
[6]:
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
chromosome
chr1 3519572 1588202 1794811 2806911 1923811 1016038 697362 1798090 1930160 902322 1008839 1579754 2335983 1928721 1920061 3512205
chr2 2291405 1029105 1160640 1834416 1246754 672290 457569 1166315 1249456 584680 660355 1026228 1527955 1256858 1242154 2289303
chr3 2697289 1228414 1405644 2153303 1498561 804334 559027 1396347 1516501 714668 805341 1226134 1772296 1510857 1492635 2672182
chr4 2143926 980389 1109010 1707158 1184064 637226 439583 1110444 1195069 560350 630128 970513 1417419 1193357 1177332 2125863
chr5 3140913 1410681 1608073 2510382 1712226 907201 634608 1605965 1741333 808346 914520 1422652 2075592 1733770 1729636 3160014

Group all chromosomes

Furthermore, you can pool the k-mer frequencies of all the chromosomes and have the cumulative counts of the entire sequence.

[7]:
df_2mers_all = pd.DataFrame()
df_2mers_all['arabidopsis'] = df_2mers_ara_bychr.sum()
df_2mers_all['tomato'] = df_2mers_tom_bychr.sum()
df_2mers_all['maize'] = df_2mers_mz_bychr.sum()
df_2mers_all.T
[7]:
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
arabidopsis 13793105 6236791 7078178 11012170 7565416 4037089 2788149 7077161 7632519 3570366 4019183 6225281 9129245 7623563 7561818 13759567
tomato 85677817 37700236 41002572 75901364 45946447 25138712 11804902 40969026 42920385 18121904 25183170 37739813 65737185 42898271 45974544 85737480
maize 161563751 108257047 128688710 136989307 133796230 120957040 87977339 128781627 129723428 112442579 121098322 108465666 110415392 129855155 133966157 161938131

k-mer figures

Plot k-mers across windows

With the plot_kmers_across_windows function you can plot the frequencies of the k-mers of your choice across the windows of a specified chromosome.

[8]:
a = kex.plot_kmers_across_windows(df_2mers_ara,['TT','CC'],'chr1')
_images/kmerExtractorTutorial_22_0.png

You can also use this function to compare, placing the plot of different chromosomes on different axes within the same figure.

[9]:
fig,ax = plt.subplots(2,1,figsize=(12,10),gridspec_kw={'hspace': 0.3})

kex.plot_kmers_across_windows(df_3mers_ara,['TTT','CCC'],'chr1',ax=ax[0])
kex.plot_kmers_across_windows(df_3mers_ara,['TTT','CCC'],'chr4',ax=ax[1])

plt.show()
_images/kmerExtractorTutorial_24_0.png

Plot k-mer frequencies within each chromosome

The plot_kmers_freq_within_chromosomes function makes a stacked bar graph of the frequency of different k-mers per chromosome. For the same k-mer, each stacked bar represents a different chromosome of the corresponding color.

[10]:
b = kex.plot_kmers_freq_within_chromosomes(df_2mers_ara,figsize=(8,4))
_images/kmerExtractorTutorial_27_0.png
[11]:
c = kex.plot_kmers_freq_within_chromosomes(df_3mers_ara)
_images/kmerExtractorTutorial_28_0.png

Plot FCGR structure

The technique to show the k-mer frequencies as images is known as Frequency Chaos Game Representation (FCGR). It corresponds to a matrix of dimensions \(2^k \times 2^k\); with the upper left corner equal to the k-mer of C, the upper right corner the k-mer of G, the lower left corner the k-mer of A, and the lower right corner the k-mer of T.

The plot_FCGR function plots the FCGR of a sequence as a heatmap that allows to appreciate the relative frequencies of each k-mer. You can plot the FCGR for a specific chromosome and window, the cumulative frequency for an entire chromosome (if no window is specified), or the cumulative frequency for the entire genome (if neither chromosome nor window is specified).

[12]:
d = kex.plot_FCGR(df_3mers_ara,'chr1',window=145)
_images/kmerExtractorTutorial_31_0.png

You can also adjust the size of the figure and the colormap to your preference as shown below

[13]:
e = kex.plot_FCGR(df_2mers_tom,'chr6',figsize=(4,3),colormap='Greys')
_images/kmerExtractorTutorial_33_0.png

Aditionally, is possible compare the FCGR of multiple genomes as shown in the following cell

[14]:
fig,ax = plt.subplots(1,3,figsize=(20,5))

kex.plot_FCGR(df_3mers_ara,ax=ax[0])
kex.plot_FCGR(df_3mers_tom,ax=ax[1])
kex.plot_FCGR(df_3mers_mz,ax=ax[2])

ax[0].set_title('Arabidopsis')
ax[1].set_title('Tomato')
ax[2].set_title('Maize')

plt.show()
_images/kmerExtractorTutorial_35_0.png

Plot CGR

Finally, although it is not the main objective of the software, a function to generate the CGR image of a sequence is included. In the following example, the FASTA file of the Arabidopsis genome is read and the CGR of the first 100 kb is generated.

[15]:
import re

# Read FASTA file with complete genome separated by chromosomes
f_in = open(filepath_ara, 'r')
all_file = f_in.read()
f_in.close()

# join lines
seq_all = "".join(all_file.split("\n"))
del all_file

# Split sequence by chromosomes
seq_bychr = re.split('>chr[1-9]*', seq_all)[1:]
del seq_all

# plot CGR of the first chromosome, first window (first 100 Kb)
f = kex.plot_CGR(seq_bychr[0][:100_000],markersize=0.2)
100%|████████████████████████████████████████| 100000/100000 [00:00<00:00, 1886046.78it/s]
_images/kmerExtractorTutorial_38_1.png