2 TAG definition distributions in Arabidopsis thaliana

Author

Samuel Ortion

Published

April 30, 2024

3 Running FTAG-Finder.smk

In exp/20240501_TAG_list_size_distributions, I used FTAG-Finder.smk, git revision 0d22067.

3.1 Deployment

I deployed the workflow with Snakemake's module mechanism:

# | filename="../exp/20240501_TAG_list_size_distributions/Snakefile"

from snakemake.utils import min_version
min_version("6.0")

configfile: "config/config.yaml"

input_pep_fasta = config["input_pep_fasta"]  # proteome fasta file
run_name = config["run_name"]

module ftag_finder:
    snakefile:
        # "https://github.com/samuelortion/FTAG-Finder.smk/raw/v0.1.0/workflow/Snakefile"
        "../../subprojects/FTAG-Finder/branches/snakemake/workflow/Snakefile"
    config:
        config

use rule * from ftag_finder

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("../conf/lamme2024.mplstyle")

3.2 Distribution of the number of genes in a TAG according to the TAG definition (number of spacers)

tag_df = pd.read_csv("../exp/20240501_TAG_list_size_distributions/results/TAIR10.mcl.TAGs.tsv", sep="\t", na_values=["NA", "-"])
tag_definitions = [0,1,5,10]
tag_definitions_columns = [f"tag{i}" for i in tag_definitions]
tag_df[tag_definitions_columns] = tag_df[tag_definitions_columns].astype("Int64")
identifier_columns = ["chromosome", *tag_definitions_columns]
tag_df.head()

	geneName	chromosome	strand	family	tag0	tag1	tag5	tag10
0	AT1G01010	Chr1	1	573	<NA>	<NA>	<NA>	<NA>
1	AT1G01020	Chr1	-1	2487	<NA>	<NA>	<NA>	<NA>
2	AT1G01030	Chr1	-1	844	<NA>	<NA>	<NA>	<NA>
3	AT1G01040	Chr1	1	845	<NA>	<NA>	<NA>	<NA>
4	AT1G01046	Chr1	1	spacers0	<NA>	<NA>	<NA>	<NA>

Count how many genes are members of each TAG, grouping by chromosome and by the TAG identifier under each definition.

tag_df.groupby(identifier_columns).size().reset_index(name="count")

	chromosome	tag0	tag1	tag5	tag10	count
0	Chr1	1	1	2	2	2
1	Chr1	2	2	3	3	2
2	Chr1	3	3	4	4	3
3	Chr1	4	4	5	5	2
4	Chr1	5	5	6	6	2
...	...	...	...	...	...	...
1200	Chr5	268	295	316	333	3
1201	Chr5	269	296	317	334	2
1202	Chr5	270	297	318	335	2
1203	Chr5	271	298	319	336	2
1204	ChrC	1	1	1	1	2

1205 rows × 6 columns
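By default, pandas `groupby` drops every row that has a missing value in any grouping key, so a table grouped on all definition columns at once only counts genes that carry an identifier in every one of them. The sketch below illustrates this on a toy dataframe; the gene names and values are invented for illustration and are not taken from the TAIR10 results.

```python
import pandas as pd

# Toy data (hypothetical): g3 is in a TAG only under the looser definition,
# g4 is in no TAG at all.
toy = pd.DataFrame({
    "geneName": ["g1", "g2", "g3", "g4"],
    "chromosome": ["Chr1", "Chr1", "Chr1", "Chr1"],
    "tag0": pd.array([1, 1, None, None], dtype="Int64"),
    "tag1": pd.array([1, 1, 1, None], dtype="Int64"),
})

# Grouping on every definition at once: rows with a missing key are dropped,
# so g3 and g4 do not contribute.
strict = toy.groupby(["chromosome", "tag0", "tag1"]).size().reset_index(name="count")

# Grouping on a single definition keeps g3.
per_definition = toy.groupby(["chromosome", "tag1"]).size().reset_index(name="count")

print(strict)
print(per_definition)
```

Grouping one definition at a time avoids silently discarding genes that only join a TAG when more spacers are allowed.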

Number of TAGs for each definition in Arabidopsis thaliana.

In the dataframe, each tag column (tag0, tag1, tag5, tag10) contains the identifier of the TAG a gene belongs to under the definition with the corresponding number of spacers.

To get the number of TAGs for each definition, we count the unique values in each column.

tag_count = tag_df[identifier_columns].nunique()
tag_count

chromosome      7
tag0          316
tag1          362
tag5          408
tag10         414
dtype: int64
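In the grouped table above, the identifiers appear to restart on each chromosome (both Chr1 and ChrC have a TAG numbered 1). If so, `nunique` on a single column counts distinct identifier values rather than distinct TAGs. The sketch below, on invented toy data, assumes identifiers are unique only within a chromosome; `count_tags` is a hypothetical helper, not part of FTAG-Finder.

```python
import pandas as pd

def count_tags(df: pd.DataFrame, tag_column: str) -> int:
    """Count distinct (chromosome, TAG identifier) pairs, ignoring genes outside any TAG."""
    in_tag = df.dropna(subset=[tag_column])
    return len(in_tag.drop_duplicates(["chromosome", tag_column]))

# Toy data (hypothetical): two TAGs on Chr1, one on Chr2, one gene in no TAG.
toy = pd.DataFrame({
    "chromosome": ["Chr1", "Chr1", "Chr1", "Chr1", "Chr2", "Chr2", "Chr2"],
    "tag0": pd.array([1, 1, 2, 2, 1, 1, None], dtype="Int64"),
})

naive = toy["tag0"].nunique()          # distinct identifier values
per_chromosome = count_tags(toy, "tag0")  # distinct TAGs

print(naive, per_chromosome)
```

On this toy input the two counts differ because identifier 1 is reused on Chr2. Whether the same applies to the real table depends on how FTAG-Finder numbers its TAGs.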

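fig, ax = plt.subplots()
# Keep only the TAG definition columns: "chromosome" is not a number of spacers
spacer_tag_count = tag_count[tag_definitions_columns]
sns.barplot(x=[str(i) for i in tag_definitions], y=spacer_tag_count.values, ax=ax)
ax.set(xlabel="Number of spacers", ylabel="Number of TAGs")
plt.show()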

Number of genes that belong to a TAG under each definition (non-missing values per column).

tag_gene_count = tag_df[tag_definitions_columns].count()
tag_gene_count

tag0     2953
tag1     3445
tag5     3842
tag10    3995
dtype: int64