7  Ka/Ks in function of distance

Published

June 4, 2024

Used R libraries:

library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(stringr)

Are farther pairs of genes coming from a more ancient duplication event ?

The file ./data/Table_Ka_Ks.csv was produced by Séanna Charles, using PAML, on TAIR10:

model Yang & Nielsen for paml(yn00), parameters by default.

clustalw2 -quiet -align -infile= prot_fst -outfile=prot.ali.aln
 ./pal2nal.pl prot.ali.aln cds_fst -output paml > cds.ali.phy
awk -v file=cds.ali.phy '{gsub(\"XXXXX\",file); print $0}' yn00.ctl_master > yn00.ctl
paml-4.10.7-linux-X86_64/paml-4.10.7/bin/yn00

7.1 Quick notions on the Ka/Ks metrics

The Ka/Ks ratio corresponds to _the number of nonsynonymous substitutions per non-synonymous site (\(K_a\)) over to the number of synonymous substitutions per synonymous site (\(K_s\)) during the same period.

A little definition of the terms:

Mutations that alter coding sequences (CDS), but do not alter amino acid sequences are referred to as synonymous mutations. Synonymous sites are then the collection of potential synonymous mutations present in a gene.

[ref. 10.1371/journal.pgen.1003527]

This is correlated with the age of the divergence event of two nucleotide sequences: a higher Ka/Ks ratio corresponds to a more recent event (?).

7.2 Exploratory analysis of the table

kaks_df <- read.csv("./data/Table_Ka_Ks.csv")
head(kaks_df)
    GeneIDA   GeneIDB     Ks     Ka Ratio.of.fixation Standart.Deviation.Ks
1 AT2G29370 AT2G29350 0.4077 0.1381        0.3387295                0.0580
2 AT2G03380 AT2G36980 4.4041 0.7893        0.1792194                8.1816
3 AT4G13240 AT3G24620 1.0498 0.1068        0.1017337                0.1575
4 AT1G19720 AT4G18750 4.6873 0.8070        0.1721673                9.8727
5 AT1G54280 AT5G04930 4.9841 0.7148        0.1434161               12.0254
6 AT4G15320 AT2G32620 0.3993 0.1727        0.4325069                0.0370
  Strandart.Deviation.Ka
1                 0.0168
2                 0.0389
3                 0.0101
4                 0.0331
5                 0.0255
6                 0.0114
dim(kaks_df)
[1] 18198     7

Does all the genes present in the table belongs to duplicate gene families?

families_df <- read.table("exp/mutual/results/bak/2024-06-03/ifb/TAIR10_LêHoang_jeu1.Walktrap.tsv", header=TRUE)
head(families_df)
   geneName family
1 AT1G01010   4801
2 AT1G01020     12
3 AT1G01040   4804
4 AT1G01050   4805
5 AT1G01070   4808
6 AT1G01090     20
dim(families_df)
[1] 17552     2
family_genes <- unique(families_df[,"geneName"])
length(family_genes)
[1] 17552
table_genes <- unique(c(kaks_df[,1], kaks_df[,2]))
length(table_genes)
[1] 4818
orphan_genes_1 <- setdiff(family_genes, table_genes)
length(orphan_genes_1)
[1] 12996

Maybe the genes in the table are from the TAGs only?

tag_df <- read.table("exp/mutual/results/bak/2024-06-03/ifb/TAIR10_LêHoang_jeu1.Walktrap.TAGs.tsv", header=TRUE)
head(tag_df)
     geneName chromosome strand   family tag0 tag1 tag5 tag10
1   AT1G01010       Chr1      1     4801    -    -    -     -
2   AT1G01020       Chr1     -1       12    -    -    -     -
3   AT1G01030       Chr1     -1 spacers0    -    -    -     -
4   AT1G01040       Chr1      1     4804    -    -    -     -
5   AT1G01046       Chr1      1 spacers1    -    -    -     -
6 AT1G01046.1       Chr1      1 spacers2    -    -    -     -
dim(tag_df)
[1] 29026     8
real_tag_df <- tag_df[!str_detect(tag_df[, "family"], "spacers[0-9]+"),]
head(real_tag_df)
   geneName chromosome strand family tag0 tag1 tag5 tag10
1 AT1G01010       Chr1      1   4801    -    -    -     -
2 AT1G01020       Chr1     -1     12    -    -    -     -
4 AT1G01040       Chr1      1   4804    -    -    -     -
7 AT1G01050       Chr1     -1   4805    -    -    -     -
8 AT1G01060       Chr1     -1   3404    -    -    -     -
9 AT1G01070       Chr1     -1   4808    -    -    -     -
tag0_df <- real_tag_df[real_tag_df[,"tag0"] != "-",]
head(tag0_df)
     geneName chromosome strand family tag0 tag1 tag5 tag10
69  AT1G01580       Chr1      1    133    1    1    2     2
70  AT1G01590       Chr1      1    133    1    1    2     2
134 AT1G02190       Chr1      1    189    2    2    3     3
135 AT1G02205       Chr1      1    189    2    2    3     3
137 AT1G02220       Chr1     -1      5    3    3    4     4
138 AT1G02230       Chr1     -1      5    3    3    4     4
tag0_genes <- unique(tag0_df[,"geneName"])
length(tag0_genes)
[1] 2745

7.3 References