import os
os.chdir("..")

3 Reproduce results from Julie Lê Hoang (2017) in Arabidopsis thaliana
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

The analysis relies on TAG data detected on the Arabidopsis thaliana proteome.
Parameter configuration: see ../exp/20240501_TAG_list_size_distributions/config/config.yaml
3.1 Number of TAG and number of genes for each TAG definition
tag_filename = "./tmp/TAIR10_Ensembl.Walktrap.TAGs.tsv" # FTAG-Finder generated, git-revision b192d0b9dd31c65c4c156646236acd263461a5db
tag_df = pd.read_csv(tag_filename, sep="\t", na_values=["-"])
tag_df.head()

| | geneName | chromosome | strand | family | tag0 | tag1 | tag5 | tag10 |
|---|---|---|---|---|---|---|---|---|
| 0 | AT1G01010 | 1 | 1 | 1 | NaN | NaN | NaN | NaN |
| 1 | AT1G01020 | 1 | -1 | 4 | NaN | NaN | NaN | NaN |
| 2 | AT1G01030 | 1 | -1 | 6 | NaN | NaN | NaN | NaN |
| 3 | AT1G01040 | 1 | 1 | spacers0 | NaN | NaN | NaN | NaN |
| 4 | transcript:at1g01046 | 1 | 1 | spacers1 | NaN | NaN | NaN | NaN |
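
The NaN values in the preview come from the `na_values=["-"]` argument: a cell containing only `-` is read as missing, which presumably marks genes that belong to no TAG under that definition. A minimal sketch of this behaviour, using a hypothetical three-row excerpt (`geneA`, `geneB`, `geneC` and their values are invented for illustration, not taken from the real file):

```python
import io
import pandas as pd

# Hypothetical excerpt in the same tab-separated layout as the TAG file.
toy_tsv = (
    "geneName\tchromosome\tstrand\tfamily\ttag0\n"
    "geneA\t1\t1\t1\t-\n"
    "geneB\t1\t-1\t4\t12\n"
    "geneC\t1\t-1\t4\t12\n"
)

toy_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t", na_values=["-"])

# Only cells equal to "-" become NaN; a strand of -1 is left untouched.
print(toy_df["tag0"].isna().tolist())
print(toy_df["strand"].tolist())
```

Because the match is on the whole cell, the minus sign in a strand value such as `-1` is not affected.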
tag_definitions = [0, 1, 5, 10]
tag_columns = [f"tag{i}" for i in tag_definitions]

How many TAGs are there for each definition?
tag_column = "tag0"
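# Note: pd.concat stacks the two columns row-wise, so the unique values
# include chromosome labels and NaN in addition to TAG identifiers.
unique_tag = pd.concat([tag_df["chromosome"], tag_df[tag_column]]).unique()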
unique_tag.shape[0]

313
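
The cell above handles `tag0` only. A per-definition count could be sketched as follows, under the assumption that TAG identifiers may be reused from one chromosome to the next, so that a TAG is identified by the pair (chromosome, identifier). If identifiers are in fact unique across the genome, `nunique()` on the TAG column alone would be enough. The helper `count_tags` and the toy table are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Hypothetical toy table; the real tag_df is loaded from the TSV file.
toy_df = pd.DataFrame({
    "geneName": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "chromosome": ["1", "1", "1", "2", "2", "2"],
    "tag0": [1.0, 1.0, None, None, None, None],
    "tag1": [1.0, 1.0, 1.0, None, 1.0, 1.0],
})

def count_tags(df, tag_column):
    """Count distinct (chromosome, TAG id) pairs, ignoring genes outside any TAG."""
    in_tag = df.dropna(subset=[tag_column])
    return in_tag[["chromosome", tag_column]].drop_duplicates().shape[0]

for tag_column in ["tag0", "tag1"]:
    print(f"Number of TAGs for {tag_column}: {count_tags(toy_df, tag_column)}")
```

In the toy table, identifier 1 appears on both chromosomes under `tag1`, and the pair-based count treats these as two separate TAGs.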
How many genes are implicated in a TAG for each definition?
for tag_column in tag_columns:
    print(f"Number of genes implicated for {tag_column}: {tag_df[~tag_df[tag_column].isna()].shape[0]}")

Number of genes implicated for tag0: 2788
Number of genes implicated for tag1: 3321
Number of genes implicated for tag5: 3664
Number of genes implicated for tag10: 3799
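
For comparison with published figures, the same gene counts can be collected into a single Series instead of being printed one by one. This is a sketch on a hypothetical toy table, not the real data; on the notebook's data the equivalent call would be `tag_df[tag_columns].notna().sum()`:

```python
import pandas as pd

# Hypothetical toy table with two TAG definitions.
toy_df = pd.DataFrame({
    "tag0": [1.0, 1.0, None, None],
    "tag1": [1.0, 1.0, 1.0, None],
})

# notna() flags genes assigned to a TAG; sum() counts them per column.
genes_per_definition = toy_df[["tag0", "tag1"]].notna().sum()
print(genes_per_definition)
```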
There are discrepancies between the numbers presented by Julie Lê-Hoang and the results I obtained.