Exercise of lesson 3¶
In [ ]:
import matplotlib.pyplot as plt
import pandas as pd
#%pip install openpyxl
The table studied is the result of a differential expression gene (DEG) analysis. It comes from the following GEO data GSE165691.
1. Load dataset¶
In [ ]:
df = pd.read_excel("exercise/data/DEG_result_table.xlsx")
In [ ]:
df.head()
Out[ ]:
ensembl_gene_id | log2FoldChange | pvalue | padj | length | symbol | entrezgene | description | PR661W_Vec | PR661W_TRPM1 | PR661W_TRPM1.norm.count.mean | PR661W_Vec.norm.count.mean | baseMean | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSMUSG00000027750 | 0.844948 | 3.204666e-235 | 4.256116e-231 | 8702 | Postn | 50706 | periostin, osteoblast specific factor [Source:... | 5192.916456 | 9327.516658 | 9327.516658 | 5192.916456 | 7260.216557 |
1 | ENSMUSG00000032332 | 1.012135 | 2.993412e-148 | 1.987775e-144 | 15228 | Col12a1 | 12816 | collagen, type XII, alpha 1 [Source:MGI Symbol... | 2108.777926 | 4253.179793 | 4253.179793 | 2108.777926 | 3180.978860 |
2 | ENSMUSG00000029661 | 0.838711 | 9.096674e-96 | 4.027098e-92 | 12426 | Col1a2 | 12843 | collagen, type I, alpha 2 [Source:MGI Symbol;A... | 2134.124390 | 3816.778886 | 3816.778886 | 2134.124390 | 2975.451638 |
3 | ENSMUSG00000001642 | -0.851705 | 7.978918e-74 | 2.649200e-70 | 8384 | Akr1b3 | 11677 | aldo-keto reductase family 1, member B3 (aldos... | 2283.752215 | 1265.494602 | 1265.494602 | 2283.752215 | 1774.623408 |
4 | ENSMUSG00000029304 | -0.609826 | 2.120973e-60 | 5.633728e-57 | 1648 | Spp1 | 20750 | secreted phosphoprotein 1 [Source:MGI Symbol;A... | 3195.806439 | 2094.134776 | 2094.134776 | 3195.806439 | 2644.970607 |
2. Explore data table¶
In [ ]:
df.shape
Out[ ]:
(13281, 13)
In [ ]:
df.columns
Out[ ]:
Index(['ensembl_gene_id', 'log2FoldChange', 'pvalue', 'padj', 'length', 'symbol', 'entrezgene', 'description', 'PR661W_Vec', 'PR661W_TRPM1', 'PR661W_TRPM1.norm.count.mean', 'PR661W_Vec.norm.count.mean', 'baseMean'], dtype='object')
There are 13281 rows (= genes) and 13 columns.
The columns correspond to:
- gene identifier (ensembl_gene_id)
- descriptions of the gene (length, symbol, entrezgene, description)
- results of the DEG analysis (log2FoldChange, pvalue, padj)
- metrics about the expression of the gene in the samples (PR661W_Vec, PR661W_TRPM1, PR661W_TRPM1.norm.count.mean, PR661W_Vec.norm.count.mean, baseMean)
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 13281 entries, 0 to 13280 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ensembl_gene_id 13281 non-null object 1 log2FoldChange 13281 non-null float64 2 pvalue 13281 non-null float64 3 padj 13281 non-null float64 4 length 13281 non-null int64 5 symbol 13176 non-null object 6 entrezgene 11953 non-null object 7 description 13179 non-null object 8 PR661W_Vec 13281 non-null float64 9 PR661W_TRPM1 13281 non-null float64 10 PR661W_TRPM1.norm.count.mean 13281 non-null float64 11 PR661W_Vec.norm.count.mean 13281 non-null float64 12 baseMean 13281 non-null float64 dtypes: float64(8), int64(1), object(4) memory usage: 1.3+ MB
The columns symbol, entrezgene and description contain null data. They all correspond to columns describing the gene.
In [ ]:
df.describe()
Out[ ]:
log2FoldChange | pvalue | padj | length | PR661W_Vec | PR661W_TRPM1 | PR661W_TRPM1.norm.count.mean | PR661W_Vec.norm.count.mean | baseMean | |
---|---|---|---|---|---|---|---|---|---|
count | 13281.000000 | 1.328100e+04 | 1.328100e+04 | 13281.000000 | 13281.000000 | 13281.000000 | 13281.000000 | 13281.000000 | 13281.000000 |
mean | 0.014733 | 6.215488e-01 | 9.721532e-01 | 5259.189368 | 73.838396 | 76.277698 | 76.277698 | 73.838396 | 75.058047 |
std | 0.367930 | 2.780440e-01 | 1.489298e-01 | 4121.327315 | 183.790011 | 208.393918 | 208.393918 | 183.790011 | 194.522537 |
min | -3.465115 | 3.204666e-235 | 4.256116e-231 | 57.000000 | 1.016250 | 1.020423 | 1.020423 | 1.016250 | 1.018336 |
25% | -0.148969 | 4.372949e-01 | 9.999524e-01 | 2673.000000 | 8.129998 | 8.106694 | 8.106694 | 8.129998 | 8.187393 |
50% | 0.009607 | 6.791172e-01 | 9.999524e-01 | 4346.000000 | 29.291903 | 30.045789 | 30.045789 | 29.291903 | 29.754884 |
75% | 0.173428 | 8.549296e-01 | 9.999524e-01 | 6733.000000 | 76.696963 | 78.345811 | 78.345811 | 76.696963 | 77.624579 |
max | 3.479844 | 9.999524e-01 | 9.999524e-01 | 123179.000000 | 7412.226483 | 9327.516658 | 9327.516658 | 7412.226483 | 7872.613769 |
The minimum value of log2FoldChange is -3.465115, the max is 3.479844, and the mean is 0.014733.
In [ ]:
sum(df.padj < 0.05)
Out[ ]:
185
In [ ]:
(df.padj < 0.05).value_counts()
Out[ ]:
padj False 13096 True 185 Name: count, dtype: int64
There are 185 genes with padj < 0.05.
In [ ]:
((df.padj < 0.05) & (df.log2FoldChange > 1)).value_counts()
Out[ ]:
False 13242 True 39 Name: count, dtype: int64
In [ ]:
((df.padj < 0.05) & (df.log2FoldChange < -1)).value_counts()
Out[ ]:
False 13264 True 17 Name: count, dtype: int64
There are 39 genes up-regulated and 17 genes down-regulated.
3. Expression of genes of interest¶
Here is the expression in both samples of the genes Il31ra, Sox9 and Lbp
In [ ]:
df[df.symbol.isin(['Il31ra', 'Sox9', 'Lbp'])][["symbol", "PR661W_Vec", "PR661W_TRPM1"]]
Out[ ]:
symbol | PR661W_Vec | PR661W_TRPM1 | |
---|---|---|---|
47 | Lbp | 49.019104 | 134.809218 |
871 | Il31ra | 3.168308 | 9.013737 |
1061 | Sox9 | 42.323811 | 59.014464 |
In [ ]:
max_log2FoldChange_gene = df.loc[df['log2FoldChange'].idxmax()]['symbol']
min_log2FoldChange_gene = df.loc[df['log2FoldChange'].idxmin()]['symbol']
print(
f"Gene with max log2FoldChange: {max_log2FoldChange_gene} \n\
Gene with min log2FoldChange: {min_log2FoldChange_gene}"
)
Gene with max log2FoldChange: Robo2 Gene with min log2FoldChange: Bmp3
4. Histogram plots¶
In [ ]:
df.log2FoldChange.plot(kind='hist', title="log2FoldChange repartition")
Out[ ]:
<Axes: title={'center': 'log2FoldChange repartition'}, ylabel='Frequency'>
5. Pie plot¶
In [ ]:
# Calculate the counts of positive and negative log2FoldChange genes
positive_genes_count = sum(df.log2FoldChange > 0)
negative_genes_count = sum(df.log2FoldChange < 0)
# Create a pie plot
plt.pie([positive_genes_count, negative_genes_count], labels=['Positive Log2FoldChange', 'Negative Log2FoldChange'], autopct='%1.1f%%')
plt.show()
6. Volcano plot¶
In [ ]:
df["color"] = "grey"
df["color"] = df["color"].mask(((df.padj < 0.05) & (df.log2FoldChange > 1)), 'red')
df["color"] = df["color"].mask(((df.padj < 0.05) & (df.log2FoldChange < -1)), 'blue')
fig, ax = plt.subplots(1)
df.plot(kind='scatter', x='log2FoldChange', y='padj', color=df.color, ax=ax)
ax.set_yscale('log')
ax.invert_yaxis()
plt.show()
7. Volcano plot with interact¶
In [ ]:
from ipywidgets import interact
from ipywidgets import widgets
def f(tlfc, tpadj):
df["color"] = "grey"
df["color"] = df["color"].mask(((df.padj < tpadj) & (df.log2FoldChange > tlfc)), 'red')
df["color"] = df["color"].mask(((df.padj < tpadj) & (df.log2FoldChange < -tlfc)), 'blue')
fig, ax = plt.subplots(1)
df.plot(kind='scatter', x='log2FoldChange', y='padj', color=df.color, ax=ax)
ax.set_yscale('log')
ax.invert_yaxis()
plt.show()
interact(f,
tlfc=widgets.FloatSlider(value=1, min=0, max=5, step=0.1),
tpadj=widgets.FloatSlider(value=0.05, min=0, max=1, step=0.01))
interactive(children=(FloatSlider(value=1.0, description='tlfc', max=5.0), FloatSlider(value=0.05, description…
Out[ ]:
<function __main__.f(tlfc, tpadj)>
8. Volcano plot with plotly¶
In [ ]:
import plotly.express as px
df["color"] = "Non-significant"
df["color"] = df["color"].mask(((df.padj < 0.05) & (df.log2FoldChange > 1)), 'up-regulated')
df["color"] = df["color"].mask(((df.padj < 0.05) & (df.log2FoldChange < -1)), 'down-regulated')
fig = px.scatter(df, x="log2FoldChange", y="padj", color = "color",
log_y=True, hover_name="symbol",
hover_data=['ensembl_gene_id', 'log2FoldChange', 'padj',
'symbol','PR661W_Vec', 'PR661W_TRPM1',], )
fig.update_yaxes(autorange='reversed')
fig.show()
9. Heatmap¶
In [ ]:
top_10 = df.sort_values('log2FoldChange', key=lambda x: abs(x), ascending=False).head(10).sort_values('log2FoldChange', ascending=False)
In [ ]:
import numpy as np
plt.imshow(np.log(top_10[['PR661W_TRPM1.norm.count.mean', 'PR661W_Vec.norm.count.mean']]), cmap='YlGnBu')
plt.colorbar()
plt.xticks(range(2), ['PR661W_TRPM1', 'PR661W_Vec'], rotation=90)
plt.yticks(range(len(top_10)), top_10['symbol'])
plt.title("Log value of norm.count.mean of each samples")
plt.show()