3  Reproduce results from Julie Lê Hoang (2017) in Arabidopsis thaliana

Author

Samuel Ortion

Published

May 6, 2024

import os
os.chdir("..")
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

The analysis relies on tandemly arrayed genes (TAGs) detected on the Arabidopsis thaliana proteome.

Parameter configuration: see ../exp/20240501_TAG_list_size_distributions/config/config.yaml

3.1 Number of TAGs and number of genes for each TAG definition

tag_filename = "./tmp/TAIR10_Ensembl.Walktrap.TAGs.tsv" # FTAG-Finder generated, git-revision b192d0b9dd31c65c4c156646236acd263461a5db
tag_df = pd.read_csv(tag_filename, sep="\t", na_values=["-"])
tag_df.head()
               geneName  chromosome  strand    family  tag0  tag1  tag5  tag10
0             AT1G01010           1       1         1   NaN   NaN   NaN    NaN
1             AT1G01020           1      -1         4   NaN   NaN   NaN    NaN
2             AT1G01030           1      -1         6   NaN   NaN   NaN    NaN
3             AT1G01040           1       1  spacers0   NaN   NaN   NaN    NaN
4  transcript:at1g01046           1       1  spacers1   NaN   NaN   NaN    NaN
tag_definitions = [0, 1, 5, 10]
tag_columns = [f"tag{i}" for i in tag_definitions]

How many TAGs are there for each definition?

tag_column = "tag0"
unique_tag = pd.concat([tag_df["chromosome"], tag_df[tag_column]]).unique()
unique_tag.shape[0]
313
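
The cell above pools the chromosome labels and the TAG identifiers into a single series before calling unique(), so the resulting count may include chromosome labels and the NaN placeholder in addition to TAG identifiers. As a point of comparison, here is a minimal sketch that counts distinct non-null TAG identifiers for every definition. It runs on a small made-up table (toy_df, not the real TAIR10 data), and count_tags is a hypothetical helper. Whether TAG identifiers are unique across the genome or restart on each chromosome is an assumption either way, so the helper supports both readings.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tag_df (hypothetical values, NOT the real TAIR10 data).
# TAG identifiers deliberately restart at 1 on chromosome 2.
toy_df = pd.DataFrame({
    "geneName": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "chromosome": [1, 1, 1, 2, 2, 2],
    "tag0": [1.0, 1.0, np.nan, 1.0, 1.0, np.nan],
    "tag1": [1.0, 1.0, 1.0, 1.0, 1.0, np.nan],
})
toy_columns = ["tag0", "tag1"]


def count_tags(df, column, per_chromosome=False):
    """Count distinct TAGs in `column`, ignoring genes outside any TAG (NaN).

    per_chromosome=False assumes identifiers are unique genome-wide.
    per_chromosome=True assumes identifiers restart on each chromosome,
    so a TAG is identified by the (chromosome, identifier) pair.
    """
    tagged = df.dropna(subset=[column])
    if per_chromosome:
        return len(tagged[["chromosome", column]].drop_duplicates())
    return tagged[column].nunique()


for column in toy_columns:
    genome_wide = count_tags(toy_df, column)
    by_chromosome = count_tags(toy_df, column, per_chromosome=True)
    print(f"{column}: {genome_wide} (genome-wide ids), {by_chromosome} (per-chromosome ids)")
```

On the real table, the same helper could be applied to each entry of tag_columns. Which of the two readings is appropriate depends on how FTAG-Finder numbers its TAGs, which this report does not state.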

How many genes are implicated in a TAG for each definition?

for tag_column in tag_columns:
    print(f"Number of genes implicated for {tag_column}: {tag_df[~tag_df[tag_column].isna()].shape[0]}")
Number of genes implicated for tag0: 2788
Number of genes implicated for tag1: 3321
Number of genes implicated for tag5: 3664
Number of genes implicated for tag10: 3799
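
The same per-definition gene counts can be obtained in one vectorized call, and the number of genes per TAG follows from counting identifier occurrences. The sketch below illustrates both on a made-up table (toy_df, with hypothetical values). It assumes that each non-null value in a tag column is a TAG identifier shared by all genes of that TAG.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tag_df (hypothetical values, NOT the real TAIR10 data).
toy_df = pd.DataFrame({
    "geneName": ["g1", "g2", "g3", "g4", "g5", "g6", "g7"],
    "tag0": [1.0, 1.0, np.nan, 2.0, 2.0, np.nan, np.nan],
    "tag1": [1.0, 1.0, 1.0, 2.0, 2.0, np.nan, np.nan],
})
toy_columns = ["tag0", "tag1"]

# Genes belonging to a TAG, per definition: count non-null entries column-wise.
genes_per_definition = toy_df[toy_columns].notna().sum()
print(genes_per_definition)

# TAG sizes for one definition: occurrences of each identifier (NaN is dropped).
tag_sizes = toy_df["tag1"].value_counts()

# Size distribution: how many TAGs have a given number of genes.
size_distribution = tag_sizes.value_counts().sort_index()
print(size_distribution)
```

Counting non-null entries is equivalent to filtering with isna() and taking the number of rows, as done in the loop above, but it returns all definitions at once as a Series that can be plotted or compared directly.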

There are discrepancies between the numbers presented by Julie Lê-Hoang and the results I obtained.