To drive the point home, let's get started straight away with the hypothetical dataset below of smoking status across three Indian cities:
| City | Smoker |
| --- | --- |
| kolkata | y |
| delhi | y |
| mumbai | n |
| kolkata | y |
| delhi | n |
| mumbai | y |
| kolkata | n |
| delhi | y |
| mumbai | y |
| kolkata | y |
| delhi | n |
| mumbai | n |
| kolkata | n |
| delhi | n |
| mumbai | y |
| kolkata | n |
| delhi | n |
| mumbai | n |
| kolkata | y |
| delhi | y |
| mumbai | y |
| kolkata | y |
| delhi | y |
| mumbai | n |
| mumbai | y |
| delhi | n |
| mumbai | n |
| delhi | y |
| delhi | n |
| mumbai | n |
| kolkata | y |
| mumbai | y |
| mumbai | n |
| mumbai | y |
First, let’s convert it to a contingency table:
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Load the raw data
city_smoker_df = pd.read_excel('city_smoker_data.xlsx', index_col=0)
city_smoker_df = city_smoker_df.reset_index()

# One-hot encode the Smoker column ('n' sorts before 'y', so the
# encoded columns come out in the order non-smoker, smoker)
enc = OneHotEncoder(handle_unknown='ignore')
csd_ohe = enc.fit_transform(city_smoker_df['Smoker'].values.reshape(-1, 1))

# Stitch the City column back onto the encoded indicator columns
ohe_col_val = np.concatenate(
    (city_smoker_df.loc[:, 'City'].values.reshape(-1, 1), csd_ohe.toarray()),
    axis=1)
ohe_col_names1 = ['City', 'non-smoker', 'smoker']
ohe_col_val_df = pd.DataFrame(ohe_col_val, columns=ohe_col_names1)

# Sum the indicator columns per city to get the contingency table
cont_table = ohe_col_val_df.groupby(['City']).sum()
cont_table['non-smoker'] = cont_table['non-smoker'].astype(int)
cont_table['smoker'] = cont_table['smoker'].astype(int)
cont_table['total'] = cont_table['non-smoker'] + cont_table['smoker']
cont_table = cont_table.reset_index()

# Append a grand-total row
cont_table.loc[len(cont_table.index)] = ['total',
                                         cont_table['non-smoker'].sum(),
                                         cont_table['smoker'].sum(),
                                         cont_table['total'].sum()]
cont_table
```
| City | non-smoker | smoker | total |
| --- | --- | --- | --- |
| delhi | 6 | 5 | 11 |
| kolkata | 3 | 6 | 9 |
| mumbai | 7 | 7 | 14 |
| total | 16 | 18 | 34 |
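As an aside, pandas can build the same contingency table in a single call with `pd.crosstab`. The sketch below re-creates the 34 observations inline (in the same order as the raw table above) instead of reading `city_smoker_data.xlsx`, so it runs standalone:

```python
import pandas as pd

# The 34 observations from the raw table above, in row order
cities = (['kolkata', 'delhi', 'mumbai'] * 8
          + ['mumbai', 'delhi', 'mumbai', 'delhi', 'delhi',
             'mumbai', 'kolkata', 'mumbai', 'mumbai', 'mumbai'])
smokers = list('yynynynyyynnnnynnnyyyyynynnynnyyny')
df = pd.DataFrame({'City': cities, 'Smoker': smokers})

# crosstab counts City/Smoker pairs; margins=True adds the total row/column
ct = pd.crosstab(df['City'], df['Smoker'], margins=True, margins_name='total')
ct.columns = ['non-smoker', 'smoker', 'total']  # 'n' sorts before 'y'
print(ct)
```

This reproduces the table above (delhi 6/5/11, kolkata 3/6/9, mumbai 7/7/14, total 16/18/34) without the one-hot-encoding detour.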
Now, the joint probability of delhi AND non-smoker = P(delhi ∩ non-smoker) = 6/34 ≈ 0.18.
Similarly, the joint probabilities for all the other combinations can be calculated as:
```python
# Use .copy() so the divisions below don't also overwrite cont_table
joint_proba = cont_table.copy()

# Divide every cell by the grand total (the bottom-right cell, 34)
grand_total = joint_proba.iloc[-1]['total']
for col in ['non-smoker', 'smoker', 'total']:
    joint_proba[col] = (joint_proba[col] / grand_total).round(2)
joint_proba
```
| City | non-smoker | smoker | total |
| --- | --- | --- | --- |
| delhi | 0.18 | 0.15 | 0.32 |
| kolkata | 0.09 | 0.18 | 0.26 |
| mumbai | 0.21 | 0.21 | 0.41 |
| total | 0.47 | 0.53 | 1.0 |
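The same joint-probability table drops out of `pd.crosstab` with `normalize='all'`, which divides every cell (margins included) by the grand total. As before, the rows are re-created inline so the snippet is self-contained:

```python
import pandas as pd

# Same 34 observations as the raw table above
cities = (['kolkata', 'delhi', 'mumbai'] * 8
          + ['mumbai', 'delhi', 'mumbai', 'delhi', 'delhi',
             'mumbai', 'kolkata', 'mumbai', 'mumbai', 'mumbai'])
smokers = list('yynynynyyynnnnynnnyyyyynynnynnyyny')
df = pd.DataFrame({'City': cities, 'Smoker': smokers})

# normalize='all' divides every count by the grand total (34)
jp = pd.crosstab(df['City'], df['Smoker'], margins=True,
                 margins_name='total', normalize='all').round(2)
print(jp)
```

Here the columns keep their raw labels `n` and `y`; `jp.loc['delhi', 'n']` is 6/34 ≈ 0.18, matching the table above.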
Marginal probabilities are the probabilities that lie in the margins of the above table. For example, the marginal probability that a randomly selected person is from delhi is 0.32.
The conditional probability that a randomly selected non-smoker is from delhi is:
P(delhi | non-smoker) = 6/16 ≈ 0.38
Similarly, the conditional probabilities for the other combinations can be calculated as:
```python
# Drop the 'total' row and 'total' column; .copy() avoids modifying cont_table
# (this assumes cont_table still holds the raw counts, not probabilities)
cont_table_con_proba = cont_table.iloc[:len(cont_table.index) - 1, :3].copy()

# Divide each count by its column total: P(city | smoking status)
cont_table_con_proba['non-smoker'] = (
    cont_table_con_proba['non-smoker'] / cont_table.iloc[-1]['non-smoker']).round(2)
cont_table_con_proba['smoker'] = (
    cont_table_con_proba['smoker'] / cont_table.iloc[-1]['smoker']).round(2)
cont_table_con_proba
```
| index | City | non-smoker | smoker |
| --- | --- | --- | --- |
| 0 | delhi | 0.38 | 0.28 |
| 1 | kolkata | 0.19 | 0.33 |
| 2 | mumbai | 0.44 | 0.39 |
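These per-column conditional probabilities also fall out of `pd.crosstab` directly, via `normalize='columns'`, which divides each count by its column total so that each column sums to 1:

```python
import pandas as pd

# Same 34 observations as the raw table above
cities = (['kolkata', 'delhi', 'mumbai'] * 8
          + ['mumbai', 'delhi', 'mumbai', 'delhi', 'delhi',
             'mumbai', 'kolkata', 'mumbai', 'mumbai', 'mumbai'])
smokers = list('yynynynyyynnnnynnnyyyyynynnynnyyny')
df = pd.DataFrame({'City': cities, 'Smoker': smokers})

# normalize='columns' gives P(city | smoking status)
cp = pd.crosstab(df['City'], df['Smoker'], normalize='columns').round(2)
print(cp)
```

For instance `cp.loc['delhi', 'n']` is 6/16 ≈ 0.38 and `cp.loc['mumbai', 'n']` is 7/16 ≈ 0.44.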
The concepts of conditional, marginal, and joint probability are important for testing whether variables are dependent. How? Let's keep that for another day.

Hi, I am Shibashis, a blogger by passion and an engineer by profession. I have written most of the articles for mechGuru.com. For more than a decade I have been closely associated with engineering design and manufacturing simulation technologies. I am a self-taught coding hobbyist, presently in love with Python (OpenCV / ML / Data Science / AWS; 3000+ lines, 400+ hrs).