What is JupyterLab?¶

An interactive Python/R programming environment
Uses the computational power of our HPC
Accesses the HPC data directly

Common usage¶

create a python/R notebook
create a table of contents and cell tags
[navigation] go to my data dir and start analysis
is my notebook running?
some hotkeys

Data analysis using Pandas¶

read_csv(), head(), sample(), shape, columns
df.isnull().any().any()
sort_values()
value_counts()
describe()
groupby()
min(),max(),std(),median()
subset dataframe
to_csv()

Example Data: several clinical measurements of 48 Type 2 diabetes patients and 48 normal individuals¶

[22]:

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/YichaoOU/Data_Science_Tutorials/master/LTC_selected_features.csv",sep=",")

read data¶

[5]:

df.head()

[5]:

	SampleID	Peptide_27	Fasting_plasma_glucose_(mmol/l)	HbA1c	Fasting_plasma_insulin_(pmol/l)	C-Peptide_(nmol/l)	HOMA-IR	Free_fatty_acids_(mmol/l)	Class
0	3	10.42	4.73	6.5	121.4	1.63	3.7	1.03	1
1	5	102.86	7.16	7.6	226.0	1.11	10.4	0.98	1
2	6	9.84	5.06	4.9	39.7	0.75	1.3	0.35	0
3	7	41.03	5.35	6.3	203.3	2.02	7.0	1.17	1
4	8	12.30	5.37	5.6	50.8	1.19	1.7	0.32	0

[6]:

df.sample(n=3)

[6]:

	SampleID	Peptide_27	Fasting_plasma_glucose_(mmol/l)	HbA1c	Fasting_plasma_insulin_(pmol/l)	C-Peptide_(nmol/l)	HOMA-IR	Free_fatty_acids_(mmol/l)	Class
9	13	27.82	5.67	5.8	79.8	1.27	2.9	0.19	0
88	92	11.74	4.90	5.0	135.3	1.04	4.2	0.28	0
60	64	114.61	18.74	10.5	207.0	1.00	24.8	1.16	1

[25]:

df.columns

[25]:

Index(['SampleID', 'Peptide_27', 'Fasting_plasma_glucose_(mmol/l)', 'HbA1c',
       'Fasting_plasma_insulin_(pmol/l)', 'C-Peptide_(nmol/l)', 'HOMA-IR',
       'Free_fatty_acids_(mmol/l)', 'Class'],
      dtype='object')

[7]:

df.isnull().any().any()

[7]:

False
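
The chained call reads from the inside out: isnull() marks every cell, the first any() reduces each column to a single True/False, and the second any() reduces those to one answer for the whole table. A minimal sketch on a made-up three-row table (the name `toy` and its values are illustrative only, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; one HbA1c value is deliberately missing.
toy = pd.DataFrame({
    "SampleID": [1, 2, 3],
    "HbA1c": [5.1, np.nan, 6.8],
})

per_column = toy.isnull().any()       # one True/False per column
has_missing = per_column.any()        # a single True/False for the whole table
missing_counts = toy.isnull().sum()   # number of missing cells per column

print(per_column)
print(has_missing)
print(missing_counts)
```

When the answer is True, the per-column counts are usually the more useful follow-up, because they show where the gaps are.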

get some statistics from data¶

[8]:

df.describe()

[8]:

	SampleID	Peptide_27	Fasting_plasma_glucose_(mmol/l)	HbA1c	Fasting_plasma_insulin_(pmol/l)	C-Peptide_(nmol/l)	HOMA-IR	Free_fatty_acids_(mmol/l)	Class
count	96.000000	96.000000	96.000000	96.000000	96.000000	96.000000	96.000000	96.000000	96.000000
mean	51.541667	33.570208	6.711146	6.248958	155.566667	1.488632	7.240625	0.630937	0.500000
std	27.961596	23.398125	2.734147	1.455515	141.817693	0.778423	7.617384	0.343940	0.502625
min	3.000000	2.630000	3.800000	4.300000	15.000000	0.310000	0.500000	0.150000	0.000000
25%	27.750000	15.597500	5.037500	5.300000	64.550000	0.935000	2.200000	0.340000	0.000000
50%	51.500000	27.280000	5.670000	5.750000	112.600000	1.235000	4.100000	0.550000	0.500000
75%	75.250000	46.885000	7.352500	6.625000	199.775000	1.845000	10.125000	0.940000	1.000000
max	100.000000	114.610000	18.740000	10.900000	783.000000	4.180000	45.300000	1.350000	1.000000

[9]:

df = df.sort_values("HbA1c",ascending=False)

[10]:

df.head()

[10]:

	SampleID	Peptide_27	Fasting_plasma_glucose_(mmol/l)	HbA1c	Fasting_plasma_insulin_(pmol/l)	C-Peptide_(nmol/l)	HOMA-IR	Free_fatty_acids_(mmol/l)	Class
55	59	55.81	9.71	10.9	212.0	2.26	13.2	0.39	1
19	23	53.11	8.24	10.8	242.0	2.42	12.8	0.83	1
62	66	23.47	10.13	10.7	143.1	1.02	9.3	0.92	1
60	64	114.61	18.74	10.5	207.0	1.00	24.8	1.16	1
85	89	56.46	9.89	9.1	178.2	0.94	11.3	1.17	1

[12]:

df.value_counts('Class')

[12]:

Class
0    48
1    48
dtype: int64

[13]:

df.shape

[13]:

(96, 9)

[14]:

df['HbA1c'].min()

[14]:

4.3

[15]:

df['HbA1c'].median()

[15]:

5.75

[42]:

df['HbA1c'].max()

[42]:

10.9

[16]:

df.groupby('Class').mean()

[16]:

	SampleID	Peptide_27	Fasting_plasma_glucose_(mmol/l)	HbA1c	Fasting_plasma_insulin_(pmol/l)	C-Peptide_(nmol/l)	HOMA-IR	Free_fatty_acids_(mmol/l)
Class
0	53.354167	21.965417	5.321667	5.345833	79.208333	1.103513	2.679167	0.391042
1	49.729167	45.175000	8.100625	7.152083	231.925000	1.873750	11.802083	0.870833
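
groupby() is not limited to mean(). As a sketch, agg() can compute several of the statistics listed at the top of this section in one call. The small table below is made up for illustration and is not the tutorial data:

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "Class": [0, 0, 0, 1, 1, 1],
    "HbA1c": [5.0, 5.4, 5.8, 6.6, 7.0, 8.0],
})

# One row per class, one column per statistic.
summary = toy.groupby("Class")["HbA1c"].agg(["min", "median", "max", "std"])
print(summary)
```

Selecting a single column before agg() keeps the result small; without the selection, every numeric column gets its own set of statistics.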

[21]:

df.groupby('Class').head()

[21]:

	SampleID	Peptide_27	Fasting_plasma_glucose_(mmol/l)	HbA1c	Fasting_plasma_insulin_(pmol/l)	C-Peptide_(nmol/l)	HOMA-IR	Free_fatty_acids_(mmol/l)	Class
55	59	55.81	9.71	10.9	212.0	2.26	13.2	0.39	1
19	23	53.11	8.24	10.8	242.0	2.42	12.8	0.83	1
62	66	23.47	10.13	10.7	143.1	1.02	9.3	0.92	1
60	64	114.61	18.74	10.5	207.0	1.00	24.8	1.16	1
85	89	56.46	9.89	9.1	178.2	0.94	11.3	1.17	1
89	93	18.06	7.39	6.5	88.9	1.22	4.2	0.35	0
65	69	43.86	6.26	6.0	69.2	1.20	2.8	0.22	0
43	47	11.36	6.67	5.9	58.2	0.89	2.5	0.36	0
35	39	32.81	5.68	5.9	102.4	1.03	3.7	0.27	0
6	10	25.75	5.67	5.8	67.1	0.94	2.4	0.26	0

[41]:

df.groupby('Class').head().to_csv("My_example.tsv",index=False,sep="\t")

subset data¶

[60]:

df_undiagnosed = df[df['HbA1c']<6.5]

[61]:

df_undiagnosed['Class'].value_counts()

[61]:

0    47
1    18
Name: Class, dtype: int64

[62]:

df_undiagnosed['Class'].value_counts(normalize=True)

[62]:

0    0.723077
1    0.276923
Name: Class, dtype: float64

[26]:

data2 = "https://raw.githubusercontent.com/YichaoOU/Data_Science_Tutorials/master/newly_diagnosed.csv"

Exercise¶

Read this table as Pandas object¶

[36]:

df2 = pd.read_csv(data2,sep=",")

[38]:

df2.head()

[38]:

	Sample ID	Peptide 1	Peptide 2	Peptide 3	Peptide 4	Peptide 5	Peptide 6	Peptide 7	Peptide 8	Peptide 9	...	TSH (mU/l)	fT3 (pmol/l)	fT4 (pmol/l)	Cortisol (nmol/l)	Testosteron (nmol/l)	HOMA-IR	Free fatty acids (mmol/l)	RRsys (mmHg)	RR dia (mmHg)	ssCRP (mg/dl)
0	sample 2	33.58	7.18	9.35	3.57	94.44	14.91	153.05	35.52	9.76	...	0.72	5.23	21.5	NaN	NaN	7.2	0.30	NaN	NaN	0.48
1	sample 3	37.57	8.70	10.79	3.36	94.11	15.99	198.88	39.65	8.62	...	0.97	NaN	NaN	NaN	NaN	9.3	1.04	NaN	NaN	7.15
2	sample 4	27.31	5.42	5.64	2.75	67.01	11.91	148.32	28.90	5.64	...	1.02	5.01	17.6	NaN	NaN	9.1	0.37	NaN	NaN	NaN
3	sample 5	29.09	5.81	4.69	3.61	65.99	12.42	154.55	26.48	6.38	...	1.00	4.61	15.7	NaN	NaN	24.5	0.99	168.0	95.0	0.20
4	sample 6	41.13	8.40	8.85	3.56	109.13	17.60	209.62	44.35	10.43	...	1.30	5.43	19.2	NaN	NaN	11.1	0.54	165.0	84.0	0.30

5 rows × 69 columns

What are the features (columns) in this dataset?¶

[40]:

df2.columns

[40]:

Index(['Sample ID', 'Peptide 1', 'Peptide 2', 'Peptide 3', 'Peptide 4',
       'Peptide 5', 'Peptide 6', 'Peptide 7', 'Peptide 8', 'Peptide 9',
       'Peptide 10', 'Peptide 11', 'Peptide 12', 'Peptide 13', 'Peptide 14',
       'Peptide 15', 'Peptide 16', 'Peptide 17', 'Peptide 18', 'Peptide 21',
       'Peptide 22', 'Peptide 23', 'Peptide 24', 'Peptide 25', 'Peptide 26',
       'Peptide 27', 'Peptide 29', 'Peptide 30', 'Age', 'Diagnosis', 'BMI',
       'HbA1c (%)', 'Gender', 'Height', 'Body weight', 'BMI.1', 'Body fat',
       'Fat free mass', 'Waist', 'Hip', 'WHR', 'Hemoglobin', 'Erythrozyten',
       'Thrombozyten', 'Leukocytes', 'ALAT', 'ASAT', 'gGT',
       'Fasting plasma glucose (mmol/l)', 'Fasting plasma insulin (pmol/l)',
       'C-Peptide (nmol/l)', 'Proinsulin (pmol/l)', 'Creatinin',
       'Triglycerides (mmol/l)', 'Cholesterol total (mmol/l)',
       'HDL-Cholesterol (mmol/l)', 'LDL-Cholesterol (mmol/l)',
       'Proteins total (g/l)', 'Albumin (g/l)', 'TSH (mU/l)', 'fT3 (pmol/l)',
       'fT4 (pmol/l)', 'Cortisol (nmol/l)', 'Testosteron (nmol/l)', 'HOMA-IR',
       'Free fatty acids (mmol/l)', 'RRsys (mmHg)', 'RR dia (mmHg)',
       'ssCRP (mg/dl)'],
      dtype='object')

What are the maximum and minimum values for HbA1c? Are they the same as the first dataset?¶

[45]:

df2['HbA1c (%)'].min()

[45]:

4.3

[46]:

df2['HbA1c (%)'].max()

[46]:

9.7

Does our data have NaN values?¶

[47]:

df2.isnull().any().any()

[47]:

True

[52]:

df2[df2.isnull().any(axis=1)] # any rows containing NaN

[52]:

	Sample ID	Peptide 1	Peptide 2	Peptide 3	Peptide 4	Peptide 5	Peptide 6	Peptide 7	Peptide 8	Peptide 9	...	TSH (mU/l)	fT3 (pmol/l)	fT4 (pmol/l)	Cortisol (nmol/l)	Testosteron (nmol/l)	HOMA-IR	Free fatty acids (mmol/l)	RRsys (mmHg)	RR dia (mmHg)	ssCRP (mg/dl)
0	sample 2	33.58	7.18	9.35	3.57	94.44	14.91	153.05	35.52	9.76	...	0.72	5.23	21.5	NaN	NaN	7.2	0.30	NaN	NaN	0.48
1	sample 3	37.57	8.70	10.79	3.36	94.11	15.99	198.88	39.65	8.62	...	0.97	NaN	NaN	NaN	NaN	9.3	1.04	NaN	NaN	7.15
2	sample 4	27.31	5.42	5.64	2.75	67.01	11.91	148.32	28.90	5.64	...	1.02	5.01	17.6	NaN	NaN	9.1	0.37	NaN	NaN	NaN
3	sample 5	29.09	5.81	4.69	3.61	65.99	12.42	154.55	26.48	6.38	...	1.00	4.61	15.7	NaN	NaN	24.5	0.99	168.0	95.0	0.20
4	sample 6	41.13	8.40	8.85	3.56	109.13	17.60	209.62	44.35	10.43	...	1.30	5.43	19.2	NaN	NaN	11.1	0.54	165.0	84.0	0.30
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
91	Sample 96	26.08	3.54	0.68	0.05	29.46	9.27	142.70	18.61	2.49	...	3.84	NaN	NaN	20.30	117.00	0.8	0.44	130.0	80.0	5.50
92	Sample 97	17.72	2.28	0.64	0.01	19.22	7.37	102.24	13.22	1.61	...	0.63	NaN	NaN	9.73	208.00	5.0	0.36	NaN	NaN	NaN
93	Sample 98	17.70	2.23	0.59	0.01	18.47	5.80	103.94	13.67	1.65	...	1.36	NaN	NaN	9.03	56.00	2.7	0.39	NaN	NaN	NaN
94	Sample 99	26.57	3.33	0.71	0.07	27.69	8.31	137.81	20.68	2.99	...	0.71	NaN	NaN	NaN	1.49	4.6	0.76	NaN	NaN	0.08
95	Sample 100	25.26	3.40	0.64	0.05	27.47	8.72	138.50	19.72	2.54	...	3.34	NaN	NaN	11.50	93.00	4.7	0.35	NaN	NaN	19.50

96 rows × 69 columns

[56]:

df2[df2.columns[df2.isnull().any(axis=0)]] # any columns containing NaN

[56]:

	Body fat	Fat free mass	Waist	Hip	WHR	Hemoglobin	Erythrozyten	Thrombozyten	Leukocytes	ALAT	...	Proteins total (g/l)	Albumin (g/l)	TSH (mU/l)	fT3 (pmol/l)	fT4 (pmol/l)	Cortisol (nmol/l)	Testosteron (nmol/l)	RRsys (mmHg)	RR dia (mmHg)	ssCRP (mg/dl)
0	32.7	73.4	NaN	NaN	NaN	8.3	4.32	216.0	7.4	0.29	...	70.8	NaN	0.72	5.23	21.5	NaN	NaN	NaN	NaN	0.48
1	33.9	NaN	NaN	NaN	NaN	8.9	5.44	180.0	5.9	1.40	...	NaN	NaN	0.97	NaN	NaN	NaN	NaN	NaN	NaN	7.15
2	33.8	NaN	NaN	NaN	NaN	8.9	4.68	169.0	5.4	0.55	...	66.1	NaN	1.02	5.01	17.6	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	8.5	4.83	237.0	6.6	0.57	...	72.1	NaN	1.00	4.61	15.7	NaN	NaN	168.0	95.0	0.20
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.65	...	76.3	NaN	1.30	5.43	19.2	NaN	NaN	165.0	84.0	0.30
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
91	28.0	73.6	NaN	NaN	NaN	10.1	5.28	170.0	8.5	0.65	...	67.9	43.6	3.84	NaN	NaN	20.30	117.00	130.0	80.0	5.50
92	37.9	81.2	NaN	NaN	NaN	7.4	4.18	395.0	10.1	0.30	...	NaN	NaN	0.63	NaN	NaN	9.73	208.00	NaN	NaN	NaN
93	36.8	79.5	NaN	NaN	NaN	9.1	4.76	172.0	7.8	0.43	...	NaN	NaN	1.36	NaN	NaN	9.03	56.00	NaN	NaN	NaN
94	46.9	66.7	135.0	137.0	0.99	8.3	4.84	196.0	5.9	0.43	...	NaN	NaN	0.71	NaN	NaN	NaN	1.49	NaN	NaN	0.08
95	30.6	90.4	NaN	NaN	NaN	8.5	4.61	219.0	3.7	1.16	...	NaN	NaN	3.34	NaN	NaN	11.50	93.00	NaN	NaN	19.50

96 rows × 28 columns
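
To decide which of these columns are still usable, it can help to rank them by the share of missing values. The sketch below uses a made-up table, and the 50% threshold is an arbitrary choice for illustration, not a recommendation from this tutorial:

```python
import numpy as np
import pandas as pd

# Hypothetical table with different amounts of missing data per column.
toy = pd.DataFrame({
    "Waist": [np.nan, np.nan, np.nan, 135.0],
    "Body fat": [32.7, 33.9, np.nan, 28.0],
    "HOMA-IR": [7.2, 9.3, 9.1, 24.5],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction of True.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)

# Keep only columns with at most half of their values missing.
kept = toy.loc[:, toy.isnull().mean() <= 0.5]
print(list(kept.columns))
```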

How many diabetic patients will be undiagnosed using HbA1c < 6.5?¶

[64]:

df2_undiagnosed = df2[df2['HbA1c (%)']<6.5]

[67]:

df2_undiagnosed['Diagnosis'].value_counts()

[67]:

NGT    48
T2D    23
Name: Diagnosis, dtype: int64

[68]:

df2_undiagnosed['Diagnosis'].value_counts(normalize=True)

[68]:

NGT    0.676056
T2D    0.323944
Name: Diagnosis, dtype: float64

Save the undiagnosed table as “happy_learning.csv”¶
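
One possible solution, sketched on a made-up stand-in for df2_undiagnosed so that the block runs on its own (the name `undiagnosed` and its values are illustrative). index=False keeps the row index out of the file, and reading the file back is a quick check that the table survived the round trip:

```python
import pandas as pd

# Hypothetical stand-in for df2_undiagnosed (values are illustrative only).
undiagnosed = pd.DataFrame({
    "Sample ID": ["sample 2", "sample 3"],
    "HbA1c (%)": [5.9, 6.2],
    "Diagnosis": ["T2D", "NGT"],
})

undiagnosed.to_csv("happy_learning.csv", index=False)

# Read the file back to confirm it has the same shape and columns.
check = pd.read_csv("happy_learning.csv")
print(check.shape)
```

In the notebook itself, the same to_csv() call would be made on df2_undiagnosed.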

Data visualization using Seaborn and many other libraries¶

scatter plot
barplot
boxplot
violin plot
beeswarm plot

Scatter plots¶

[69]:

import seaborn as sns

[71]:

df.columns

[71]:

Index(['SampleID', 'Peptide_27', 'Fasting_plasma_glucose_(mmol/l)', 'HbA1c',
       'Fasting_plasma_insulin_(pmol/l)', 'C-Peptide_(nmol/l)', 'HOMA-IR',
       'Free_fatty_acids_(mmol/l)', 'Class'],
      dtype='object')

[72]:

sns.scatterplot(data = df,x="HbA1c",y="Peptide_27")

[72]:

<AxesSubplot:xlabel='HbA1c', ylabel='Peptide_27'>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_53_1.png

[73]:

sns.scatterplot(data = df,x="HbA1c",y="Peptide_27",hue="Class")

[73]:

<AxesSubplot:xlabel='HbA1c', ylabel='Peptide_27'>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_54_1.png

[74]:

import matplotlib.pyplot as plt

[76]:

sns.scatterplot(data = df,x="HbA1c",y="Peptide_27",hue="Class")
plt.axhline(40,color="black")

[76]:

<matplotlib.lines.Line2D at 0x2aad6a85d6d0>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_56_1.png

[78]:

sns.scatterplot(data = df,x="HbA1c",y="Peptide_27",hue="Class")
plt.axhline(40,color="black")
plt.axvline(6,color="grey")

[78]:

<matplotlib.lines.Line2D at 0x2aad6aa23130>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_57_1.png
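
The two lines split the plot into four regions. To put numbers on what the picture shows, the same cutoffs (40 for Peptide_27, 6 for HbA1c) can be applied as a boolean column and tabulated against Class. A sketch on made-up rows, using the hypothetical column name `above_both`:

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "HbA1c": [5.1, 5.6, 6.4, 7.2, 8.0, 5.8],
    "Peptide_27": [10.0, 45.0, 30.0, 60.0, 90.0, 20.0],
    "Class": [0, 0, 1, 1, 1, 0],
})

# True where a sample lies right of the vertical line and above the horizontal one.
toy["above_both"] = (toy["HbA1c"] > 6) & (toy["Peptide_27"] > 40)

# Rows: true class; columns: whether the sample passes both cutoffs.
table = pd.crosstab(toy["Class"], toy["above_both"])
print(table)
```

Reading the table row by row shows how many samples of each class fall on either side of the combined rule.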

[124]:

import plotly.express as px
import plotly.io as pio
pio.renderers.default = "iframe"

[140]:

df.head()

[140]:

	SampleID	Peptide_27	Fasting_plasma_glucose_(mmol/l)	HbA1c	Fasting_plasma_insulin_(pmol/l)	C-Peptide_(nmol/l)	HOMA-IR	Free_fatty_acids_(mmol/l)	Class
0	3	10.42	4.73	6.5	121.4	1.63	3.7	1.03	1
1	5	102.86	7.16	7.6	226.0	1.11	10.4	0.98	1
2	6	9.84	5.06	4.9	39.7	0.75	1.3	0.35	0
3	7	41.03	5.35	6.3	203.3	2.02	7.0	1.17	1
4	8	12.30	5.37	5.6	50.8	1.19	1.7	0.32	0

interactive scatter plot¶

[142]:

px.scatter(data_frame = df,x="HbA1c",y="Peptide_27",color="Class",hover_data=["SampleID"])

[79]:

df.describe()

[79]:

	SampleID	Peptide_27	Fasting_plasma_glucose_(mmol/l)	HbA1c	Fasting_plasma_insulin_(pmol/l)	C-Peptide_(nmol/l)	HOMA-IR	Free_fatty_acids_(mmol/l)	Class
count	96.000000	96.000000	96.000000	96.000000	96.000000	96.000000	96.000000	96.000000	96.000000
mean	51.541667	33.570208	6.711146	6.248958	155.566667	1.488632	7.240625	0.630937	0.500000
std	27.961596	23.398125	2.734147	1.455515	141.817693	0.778423	7.617384	0.343940	0.502625
min	3.000000	2.630000	3.800000	4.300000	15.000000	0.310000	0.500000	0.150000	0.000000
25%	27.750000	15.597500	5.037500	5.300000	64.550000	0.935000	2.200000	0.340000	0.000000
50%	51.500000	27.280000	5.670000	5.750000	112.600000	1.235000	4.100000	0.550000	0.500000
75%	75.250000	46.885000	7.352500	6.625000	199.775000	1.845000	10.125000	0.940000	1.000000
max	100.000000	114.610000	18.740000	10.900000	783.000000	4.180000	45.300000	1.350000	1.000000

barplot¶

[83]:

plot_df = df.melt(id_vars=['Class'],value_vars = ['Fasting_plasma_glucose_(mmol/l)','HbA1c'])
sns.barplot(data=plot_df,x="variable",y="value",hue="Class")

[83]:

<AxesSubplot:xlabel='variable', ylabel='value'>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_64_1.png
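
seaborn draws grouped plots from long-format tables, which is why the cell calls melt() first: every column named in value_vars is stacked into a variable/value pair, while the id_vars columns are repeated on every row. A sketch on a made-up two-row table:

```python
import pandas as pd

# Hypothetical wide table: one row per sample, one column per measurement.
wide = pd.DataFrame({
    "Class": [0, 1],
    "HbA1c": [5.3, 7.1],
    "Glucose": [5.0, 8.2],
})

# Long table: one row per (sample, measurement) pair.
long = wide.melt(id_vars=["Class"], value_vars=["HbA1c", "Glucose"])
print(long)
```

The `variable` column then serves as the plot's category axis and `value` as the measurement axis.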

[166]:

df_norm = df/df.max(axis=0)
df_norm.head()

[166]:

	SampleID	Peptide_27	Fasting_plasma_glucose_(mmol/l)	HbA1c	Fasting_plasma_insulin_(pmol/l)	C-Peptide_(nmol/l)	HOMA-IR	Free_fatty_acids_(mmol/l)	Class
0	0.03	0.090917	0.252401	0.596330	0.155045	0.389952	0.081678	0.762963	1.0
1	0.05	0.897478	0.382070	0.697248	0.288633	0.265550	0.229581	0.725926	1.0
2	0.06	0.085856	0.270011	0.449541	0.050702	0.179426	0.028698	0.259259	0.0
3	0.07	0.357997	0.285486	0.577982	0.259642	0.483254	0.154525	0.866667	1.0
4	0.08	0.107320	0.286553	0.513761	0.064879	0.284689	0.037528	0.237037	0.0
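
Dividing a DataFrame by df.max(axis=0) aligns on column names, so each column is divided by its own maximum and the largest value in every column becomes 1. This puts features with very different units on a comparable scale for plotting. A sketch on made-up values:

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "HbA1c": [5.0, 10.0],
    "Insulin": [100.0, 400.0],
})

col_max = toy.max(axis=0)   # Series: one maximum per column
toy_norm = toy / col_max    # each column divided by its own maximum
print(toy_norm)
```

Note that this scaling also applies to identifier and label columns such as SampleID and Class, which is why they are excluded before plotting.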

[175]:

plot_df = df_norm.melt(id_vars=['Class'],value_vars =list(set(df_norm.columns)-set(['Class','SampleID'])) )
sns.barplot(data=plot_df,y="variable",x="value",hue="Class")

[175]:

<AxesSubplot:xlabel='value', ylabel='variable'>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_66_1.png

boxplot¶

[84]:

sns.boxplot(data=plot_df,x="variable",y="value",hue="Class")

[84]:

<AxesSubplot:xlabel='variable', ylabel='value'>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_68_1.png

violin plot¶

[86]:

sns.violinplot(data=plot_df,x="variable",y="value",hue="Class")

[86]:

<AxesSubplot:xlabel='variable', ylabel='value'>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_70_1.png

beeswarm plot¶

[89]:

sns.swarmplot(data=plot_df,x="variable",y="value",hue="Class")

[89]:

<AxesSubplot:xlabel='variable', ylabel='value'>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_72_1.png

[100]:

sns.violinplot(data=plot_df,x="variable",y="value",hue="Class",inner=None)
sns.swarmplot(data=plot_df,x="variable",y="value",hue="Class",dodge=True,color="black",s=3)

[100]:

<AxesSubplot:xlabel='variable', ylabel='value'>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_73_1.png

[101]:

sns.boxplot(data=plot_df,x="variable",y="value",hue="Class")
sns.swarmplot(data=plot_df,x="variable",y="value",hue="Class",dodge=True,color="black",s=3)

[101]:

<AxesSubplot:xlabel='variable', ylabel='value'>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_74_1.png

[102]:

df.head()

[102]:

	SampleID	Peptide_27	Fasting_plasma_glucose_(mmol/l)	HbA1c	Fasting_plasma_insulin_(pmol/l)	C-Peptide_(nmol/l)	HOMA-IR	Free_fatty_acids_(mmol/l)	Class
0	3	10.42	4.73	6.5	121.4	1.63	3.7	1.03	1
1	5	102.86	7.16	7.6	226.0	1.11	10.4	0.98	1
2	6	9.84	5.06	4.9	39.7	0.75	1.3	0.35	0
3	7	41.03	5.35	6.3	203.3	2.02	7.0	1.17	1
4	8	12.30	5.37	5.6	50.8	1.19	1.7	0.32	0

Exercise¶

Can we still use the same HbA1c and Peptide 27 cutoffs for the second dataset?¶

[160]:

df2.columns

[160]:

Index(['Sample ID', 'Peptide 1', 'Peptide 2', 'Peptide 3', 'Peptide 4',
       'Peptide 5', 'Peptide 6', 'Peptide 7', 'Peptide 8', 'Peptide 9',
       'Peptide 10', 'Peptide 11', 'Peptide 12', 'Peptide 13', 'Peptide 14',
       'Peptide 15', 'Peptide 16', 'Peptide 17', 'Peptide 18', 'Peptide 21',
       'Peptide 22', 'Peptide 23', 'Peptide 24', 'Peptide 25', 'Peptide 26',
       'Peptide 27', 'Peptide 29', 'Peptide 30', 'Age', 'Diagnosis', 'BMI',
       'HbA1c (%)', 'Gender', 'Height', 'Body weight', 'BMI.1', 'Body fat',
       'Fat free mass', 'Waist', 'Hip', 'WHR', 'Hemoglobin', 'Erythrozyten',
       'Thrombozyten', 'Leukocytes', 'ALAT', 'ASAT', 'gGT',
       'Fasting plasma glucose (mmol/l)', 'Fasting plasma insulin (pmol/l)',
       'C-Peptide (nmol/l)', 'Proinsulin (pmol/l)', 'Creatinin',
       'Triglycerides (mmol/l)', 'Cholesterol total (mmol/l)',
       'HDL-Cholesterol (mmol/l)', 'LDL-Cholesterol (mmol/l)',
       'Proteins total (g/l)', 'Albumin (g/l)', 'TSH (mU/l)', 'fT3 (pmol/l)',
       'fT4 (pmol/l)', 'Cortisol (nmol/l)', 'Testosteron (nmol/l)', 'HOMA-IR',
       'Free fatty acids (mmol/l)', 'RRsys (mmHg)', 'RR dia (mmHg)',
       'ssCRP (mg/dl)'],
      dtype='object')

[161]:

sns.scatterplot(data = df2,x="HbA1c (%)",y="Peptide 27",hue="Diagnosis")
plt.axhline(40,color="black")
plt.axvline(6,color="grey")

[161]:

<matplotlib.lines.Line2D at 0x2aab879ffd90>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_79_1.png

What is the data distribution for each feature?¶

[179]:

df2.head()

[179]:

	Sample ID	Peptide 1	Peptide 2	Peptide 3	Peptide 4	Peptide 5	Peptide 6	Peptide 7	Peptide 8	Peptide 9	...	TSH (mU/l)	fT3 (pmol/l)	fT4 (pmol/l)	Cortisol (nmol/l)	Testosteron (nmol/l)	HOMA-IR	Free fatty acids (mmol/l)	RRsys (mmHg)	RR dia (mmHg)	ssCRP (mg/dl)
0	sample 2	33.58	7.18	9.35	3.57	94.44	14.91	153.05	35.52	9.76	...	0.72	5.23	21.5	NaN	NaN	7.2	0.30	NaN	NaN	0.48
1	sample 3	37.57	8.70	10.79	3.36	94.11	15.99	198.88	39.65	8.62	...	0.97	NaN	NaN	NaN	NaN	9.3	1.04	NaN	NaN	7.15
2	sample 4	27.31	5.42	5.64	2.75	67.01	11.91	148.32	28.90	5.64	...	1.02	5.01	17.6	NaN	NaN	9.1	0.37	NaN	NaN	NaN
3	sample 5	29.09	5.81	4.69	3.61	65.99	12.42	154.55	26.48	6.38	...	1.00	4.61	15.7	NaN	NaN	24.5	0.99	168.0	95.0	0.20
4	sample 6	41.13	8.40	8.85	3.56	109.13	17.60	209.62	44.35	10.43	...	1.30	5.43	19.2	NaN	NaN	11.1	0.54	165.0	84.0	0.30

5 rows × 69 columns

[200]:

for c in df2.columns:
    plt.figure(figsize=(4,2))
    try:
        tmp = df2.melt(id_vars=['Diagnosis'],value_vars=c)
        sns.boxplot(data=tmp,y="variable",x="value",hue="Diagnosis")
    except Exception:
        pass  # columns that cannot be drawn as boxplots (e.g. text) leave an empty figure
    plt.show()   # render this figure now
    plt.close()  # release it so open figures do not accumulate

<Figure size 288x144 with 0 Axes>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_2.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_3.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_4.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_5.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_6.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_7.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_8.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_9.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_10.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_11.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_12.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_13.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_14.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_15.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_16.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_17.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_18.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_19.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_20.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_21.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_22.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_23.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_24.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_25.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_26.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_27.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_28.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_29.png

<Figure size 288x144 with 0 Axes>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_31.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_32.png

<Figure size 288x144 with 0 Axes>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_34.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_35.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_36.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_37.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_38.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_39.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_40.png

<Figure size 288x144 with 0 Axes>

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_42.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_43.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_44.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_45.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_46.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_47.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_48.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_49.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_50.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_51.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_52.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_53.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_54.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_55.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_56.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_57.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_58.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_59.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_60.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_61.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_62.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_63.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_64.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_65.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_66.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_67.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_68.png

../../_images/content_Bioinformatics_Core_Competencies_Introduction_6_21_2021_v2_82_69.png
