kmerExtractor module
- kmerExtractor.CGR(sequence, d=1)
Generate the X and Y coordinates of the Chaos Game Representation (CGR) for the input sequence. Purine nucleotides (A and G) are placed on the minor diagonal, and the pyrimidine nucleotides (C and T) on the main diagonal. The CGR is built in a square of dimensions 2d x 2d with center at the coordinates (0,0). Only the four basic uppercase nucleotides (A, C, G, T) are accepted. Any other character is ignored.
- Parameters
sequence (str) – Nucleotide sequence.
d (int) – Half of the side length of the square. Defaults to 1.
- Returns
The x and y coordinates of the CGR.
- Return type
tuple[list[float],list[float]]
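As an illustration of the description above, here is a minimal sketch of the classic chaos game iteration. It is not the module's implementation: `cgr_sketch` is a hypothetical name, the start point at the center and the midpoint rule are the textbook CGR construction, and the corner assignment (C upper left, G upper right, A lower left, T lower right) is assumed from the FCGR layout documented for this module.

```python
def cgr_sketch(sequence, d=1):
    """Return the x and y coordinates of a CGR walk (illustrative only)."""
    # Assumed corner layout: A and G on the minor diagonal,
    # C and T on the main diagonal of a 2d x 2d square centered at (0, 0).
    corners = {"A": (-d, -d), "C": (-d, d), "G": (d, d), "T": (d, -d)}
    x, y = 0.0, 0.0  # assumed start point: the center of the square
    xs, ys = [], []
    for base in sequence:
        if base not in corners:
            continue  # any character other than A, C, G, T is ignored
        cx, cy = corners[base]
        # Move halfway from the current point toward the nucleotide's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        xs.append(x)
        ys.append(y)
    return xs, ys


xs, ys = cgr_sketch("ANG")  # the "N" is skipped
```

With these assumptions, each accepted nucleotide contributes exactly one point, so the returned lists have the same length as the filtered sequence.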
- kmerExtractor.FCGR(kmers)
Organize the k-mers and their associated values in the usual Frequency Chaos Game Representation (FCGR) layout, i.e., a matrix of dimensions N x N, where N = 4^(k/2) = 2^k. The upper left corner holds the k-mer made only of C, the upper right corner the k-mer made only of G, the lower left corner the k-mer made only of A, and the lower right corner the k-mer made only of T.
- Parameters
kmers (dict) – Dictionary mapping each k-mer to a real value. All k-mers must have the same length.
- Returns
FCGR matrix and the names of the k-mers in the matrix.
- Return type
tuple[numpy.ndarray, numpy.ndarray]
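The layout described above can be sketched as a recursive quadrant subdivision. This is a hedged illustration, not the module's code: `fcgr_sketch` and `QUADRANT` are hypothetical names, and the choice that the last nucleotide of a k-mer selects the coarsest quadrant (as in a CGR walk) is an assumption. Only the four corner positions are taken from the documentation.

```python
import numpy as np

# (row bit, column bit) per nucleotide: C upper left, G upper right,
# A lower left, T lower right.
QUADRANT = {"C": (0, 0), "G": (0, 1), "A": (1, 0), "T": (1, 1)}


def fcgr_sketch(kmers):
    """Arrange k-mer values in a 2^k x 2^k matrix (illustrative only)."""
    lengths = {len(kmer) for kmer in kmers}
    if len(lengths) != 1:
        raise ValueError("all k-mers must have the same length")
    k = lengths.pop()
    n = 2 ** k  # equals 4^(k/2)
    matrix = np.zeros((n, n))
    names = np.full((n, n), None, dtype=object)
    for kmer, value in kmers.items():
        row = col = 0
        # Assumption: the last nucleotide picks the coarsest quadrant.
        for base in reversed(kmer):
            r, c = QUADRANT[base]
            row = 2 * row + r
            col = 2 * col + c
        matrix[row, col] = value
        names[row, col] = kmer
    return matrix, names


matrix, names = fcgr_sketch({"CC": 1.0, "GG": 2.0, "AA": 3.0, "TT": 4.0})
```

Cells whose k-mer is absent from the input stay at 0 in the value matrix and at None in the name matrix in this sketch; the module may handle missing k-mers differently.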
- kmerExtractor.kmers_by_window(filepath, k, window_size=100000, output_path=None)
Calculate the frequency of k-mers by windows within each of the chromosome sequences. First, the entire sequence contained within the FASTA file is read. The sequences of each chromosome are then identified and separated. Subsequently, each chromosome is separated into windows of the specified size. Finally, the frequencies of the k-mers are calculated for each window.
- Parameters
filepath (str) – Location of the FASTA file containing the nucleotide sequence of an organism separated by chromosomes.
k (int) – Length of k-mers.
window_size (int) – Window size, in base pairs, used to count the frequency of k-mers within the chromosome. Must be less than or equal to the length of the smallest chromosome. Defaults to 100000.
output_path (str) – Path to the directory where the output CSV file will be saved. If not specified, the output file is saved in the same directory as the input file. Defaults to None.
- Returns
Dataframe indicating the chromosome, the window index, the start and end positions of the window, and the frequencies of the corresponding k-mers.
- Return type
pandas.DataFrame
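The four steps described above (read, split by chromosome, split into windows, count) can be sketched as follows. Everything here is an assumption made for illustration: the helper names `read_fasta_sketch` and `window_rows_sketch`, the use of raw counts as "frequencies", zero-based half-open window positions, a trailing partial window, and the fact that k-mers spanning a window boundary are not counted. The sketch returns plain dictionaries rather than a DataFrame.

```python
from collections import Counter


def read_fasta_sketch(lines):
    """Read FASTA lines into {chromosome name: sequence} (illustrative only)."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()  # header text is used as the name
            parts[name] = []
        elif name is not None and line:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def window_rows_sketch(chromosomes, k, window_size):
    """Count k-mers per window; one dictionary per window."""
    valid = set("ACGT")
    rows = []
    for name, seq in chromosomes.items():
        for index, start in enumerate(range(0, len(seq), window_size)):
            window = seq[start:start + window_size]
            counts = Counter()
            for i in range(len(window) - k + 1):
                kmer = window[i:i + k]
                if set(kmer) <= valid:  # skip k-mers with other characters
                    counts[kmer] += 1
            rows.append({"chromosome": name, "window": index,
                         "start": start, "end": start + len(window),
                         **counts})
    return rows


fasta = [">chr1", "ACGT", "ACGT", ">chr2", "AAAA"]
rows = window_rows_sketch(read_fasta_sketch(fasta), k=2, window_size=4)
```

A list of dictionaries like `rows` can be passed to `pandas.DataFrame(rows)` to obtain a table with one row per window, which is the general shape of the return value described here.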
- kmerExtractor.kmers_by_window_opt(filepath, k, window_size=100000, output_path=None)
Calculate the frequency of k-mers by windows within each of the chromosome sequences. The k-mers are counted while the FASTA file is being read, which saves memory because the full sequence is never stored.
- Parameters
filepath (str) – Location of the FASTA file containing the nucleotide sequence of an organism separated by chromosomes.
k (int) – Length of k-mers.
window_size (int) – Window size, in base pairs, used to count the frequency of k-mers within the chromosome. Must be greater than the number of nucleotides per line in the input FASTA file. Defaults to 100000.
output_path (str) – Path to the directory where the output CSV file will be saved. If not specified, the output file is saved in the same directory as the input file. Defaults to None.
- Returns
Dataframe containing the k-mer frequencies for all windows of each chromosome.
- Return type
pandas.DataFrame
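One way to count k-mers while reading, without holding a whole chromosome in memory, is a generator that keeps only a small buffer. The sketch below is an assumption about how such streaming could work, not the module's actual code: `stream_windows_sketch` and `_count` are hypothetical names, windows do not overlap, a trailing partial window is emitted, and raw counts stand in for frequencies.

```python
from collections import Counter


def _count(window, k):
    """Count k-mers made only of A, C, G, T in one window."""
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    return dict(counts)


def stream_windows_sketch(lines, k, window_size):
    """Yield (chromosome, window index, counts) while reading FASTA lines."""
    name, buffer, index = None, "", 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None and buffer:
                yield name, index, _count(buffer, k)  # trailing partial window
            name, buffer, index = line[1:].strip(), "", 0
            continue
        if name is None:
            continue  # ignore anything before the first header
        buffer += line
        # The buffer never grows beyond one window plus one line.
        while len(buffer) >= window_size:
            yield name, index, _count(buffer[:window_size], k)
            buffer = buffer[window_size:]
            index += 1
    if name is not None and buffer:
        yield name, index, _count(buffer, k)


fasta = [">chr1", "ACG", "TAC", "GT", ">chr2", "AAAA"]
windows = list(stream_windows_sketch(fasta, k=2, window_size=4))
```

Because `lines` can be any iterable, an open file handle can be passed directly, so only the current buffer is kept in memory at any time.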
- kmerExtractor.kmers_in_sequence(sequence, k)
Calculate the frequency of k-mers in a given sequence. Only k-mers made up of the four basic uppercase nucleotides (A, C, G, T) are accepted. Any substring of length k that contains another character is discarded and not counted as a k-mer.
- Parameters
sequence (str) – Nucleotide sequence.
k (int) – Length of k-mers.
- Returns
Dictionary containing the k-mer frequencies. Keys are the k-mers and values are the corresponding frequencies.
- Return type
dict
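The filtering rule above can be sketched with a sliding window over the sequence. This is an illustration under stated assumptions, not the module's code: `count_kmers_sketch` is a hypothetical name, overlapping k-mers are counted, and "frequency" is taken to mean a raw count rather than a normalized proportion.

```python
from collections import Counter


def count_kmers_sketch(sequence, k):
    """Count overlapping k-mers made only of A, C, G, T (illustrative only)."""
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= valid:  # discard k-mers containing any other character
            counts[kmer] += 1
    return dict(counts)


counts = count_kmers_sketch("ACGTNACG", 2)
# counts == {"AC": 2, "CG": 2, "GT": 1}; "TN" and "NA" are discarded
```

Lowercase input yields an empty dictionary in this sketch, which mirrors the statement that only uppercase nucleotides are accepted.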
- kmerExtractor.plot_CGR(sequence, d=1, markersize=1, ax=None, figsize=(6, 6))
Plot the Chaos Game Representation (CGR) of the input sequence. Purine nucleotides (A and G) are placed on the minor diagonal, and the pyrimidine nucleotides (C and T) on the main diagonal. The CGR is built in a square of dimensions 2d x 2d with center at the coordinates (0,0). Only the four basic uppercase nucleotides (A, C, G, T) are accepted. Any other character is ignored.
- Parameters
sequence (str) – Nucleotide sequence.
d (int) – Half of the side length of the square in which the CGR is built. Defaults to 1.
- Returns
The Axes object containing the plot.
- Return type
matplotlib.axes._axes.Axes
- kmerExtractor.plot_FCGR(df, chromosome=None, window=None, figsize=(7, 6), colormap='bwr', ax=None)
Plot the Frequency Chaos Game Representation (FCGR) of the k-mers in the input dataframe.
- Parameters
df (pandas.DataFrame) – Dataframe with the k-mer frequencies by windows across chromosomes.
chromosome (str) – Name of the chromosome to plot. If None, window must also be None, and the sum of the k-mer frequencies across all chromosomes is plotted. Defaults to None.
window (int) – Number of the window to plot. If None, the k-mer frequencies are summed over all windows. Defaults to None.
figsize (tuple[int,int]) – Size of the figure to create, in inches. Defaults to (7,6).
colormap (str) – Name of the Matplotlib colormap used to map data values to colors. Defaults to 'bwr'.
ax (matplotlib.axes.Axes) – Axes in which to draw the plot. If None, the currently active Axes is used. Defaults to None.
- Returns
The Axes object containing the plot.
- Return type
matplotlib.axes._axes.Axes
- kmerExtractor.plot_kmers_across_windows(df, kmer_names, chromosome, figsize=(12, 4), ax=None)
Plot the frequencies of the given k-mers across the windows of a given chromosome.
- Parameters
df (pandas.DataFrame) – Dataframe with the k-mer frequencies by window.
kmer_names (list) – List of k-mer names to plot.
chromosome (str) – Name of the chromosome to plot.
figsize (tuple[int,int]) – Size of the figure to create, in inches. Defaults to (12,4).
ax (matplotlib.axes._axes.Axes) – Axes in which to draw the plot. If None, a new figure and Axes are created. Defaults to None.
- Returns
The Axes object containing the plot.
- Return type
matplotlib.axes._axes.Axes
- kmerExtractor.plot_kmers_freq_within_chromosomes(df, figsize=(15, 6), ax=None)
Plot the sum of the k-mer frequencies within each chromosome.
- Parameters
df (pandas.DataFrame) – Dataframe with the k-mer frequencies by window.
figsize (tuple[int,int]) – Size of the figure to create, in inches. Defaults to (15,6).
ax (matplotlib.axes.Axes) – Axes in which to draw the plot. If None, the currently active Axes is used. Defaults to None.
- Returns
The Axes object containing the plot.
- Return type
matplotlib.axes._axes.Axes