Exploratory Data Analysis (EDA) is a critical phase in the journey of understanding and extracting insights from any dataset. In this context, we embark on an exploratory journey into the fascinating realm of fish market data. The dataset at our disposal encapsulates a wealth of information about various fish species, offering a unique opportunity to uncover patterns, trends, and relationships that may lie beneath the surface.
As we delve into the intricacies of this dataset, our goal is to gain a comprehensive understanding of the different attributes associated with each fish entry. From species variations to size measurements, weight, and potentially more, EDA empowers us to visually and statistically dissect the data. Through this process, we aim to uncover valuable insights that can inform decision-making, facilitate predictive modeling, or simply enhance our knowledge of the characteristics that distinguish one fish species from another.
The journey begins with loading the data, exploring its structure, and gradually peeling back layers to reveal what lies beneath. Using statistical measures, visualizations, and hypothesis testing, EDA not only helps us comprehend the nuances of the fish market data but also guides us in making informed decisions and formulating meaningful questions for further analysis.
Import Python Libraries and Data
Let’s start by importing the necessary Python libraries such as Pandas, NumPy, Matplotlib, and Seaborn.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Load the Dataset
df = pd.read_csv('fish.csv')
Explore the Data
# Display the first few rows
df.head()
# Get information about the dataset
df.info()
The dataset contains 7 attributes describing 159 fish available in the market. The columns are described below:
- Species: species name of fish
- Weight: weight of fish in g
- Length1: vertical length in cm
- Length2: diagonal length in cm
- Length3: cross length in cm
- Height: height in cm
- Width: diagonal width in cm
It’s worth noting that there are no missing values in the dataset, and among the columns, only Species is categorical; the rest are all numerical.
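We can verify this directly with a standard Pandas check:
# Count missing values per column (all zeros here)
df.isnull().sum()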
# Summary statistics
df.describe()
Upon a quick review of the summary statistics for the numerical variables, it’s evident that there are no apparent inconsistencies: the values fall within plausible ranges for each feature.
Categorical Features: Species
Now, let’s conduct a comprehensive examination of each variable in the dataset. Our primary attention will be directed towards the categorical features, with Species standing as the sole variable in this category.
# Distribution of Species
# Name the counts column explicitly (robust across Pandas versions,
# since the column name returned by value_counts changed in Pandas 2.0)
species_distribution = df['Species'].value_counts().rename('Number of Observations').to_frame()
species_distribution
# Plot a pie chart
plt.pie(x=species_distribution['Number of Observations'], labels=species_distribution.index, autopct='%.0f%%')
Out of the seven distinct fish species, Perch is the most prevalent, representing over one-third of the dataset, followed by Bream at 22%, and Roach at 13%. Conversely, Whitefish is the least encountered fish in the dataset, comprising only 4%.
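The exact shares can also be computed directly rather than read off the chart:
# Species shares as a percentage of all observations
(df['Species'].value_counts(normalize=True) * 100).round(1)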
Numerical Features: Weight, Height, Width, Length1, Length2, and Length3
For the numerical features, we create box plots for each fish type to visually represent the distribution of each metric.
num_cols = df.drop(columns='Species').columns
n_row = 2
n_col = 3
fig, axs = plt.subplots(n_row, n_col, figsize=(20, 10))
for i in range(n_row):
    for j in range(n_col):
        # Map the (row, column) grid position to a metric index.
        idx = i * n_col + j
        sns.boxplot(ax=axs[i, j], data=df, x='Species', y=num_cols[idx])
Analyzing the plots reveals that Smelt exhibits the least variability in the range of values across all metrics. Roach and Parkki also show relatively small variation and low values, following a pattern similar to Smelt. Conversely, Pike and Perch exhibit the widest variability across all metrics: Perch has the smallest minimum values, while Pike reaches the highest maximums. Bream and Whitefish have similar distributions.
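The same comparison can be made numerically with grouped summary statistics; Weight is used here, but any of the other metrics can be substituted:
# Per-species summary statistics for one metric
df.groupby('Species')['Weight'].describe()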
Correlation
Having examined the features individually, we now explore their interrelationships. We assess the correlation between features by examining scatter plots colored by fish type, as well as a correlation heatmap for each species. These visualizations provide insights into the associations between pairs of variables.
# Plot pairwise relationships in a dataset.
sns.pairplot(df, hue='Species')
Strong relationships are evident among the numerical features. The three length measurements track one another almost linearly, while Weight grows faster than linearly with the size metrics, consistent with an approximately power-law (roughly cubic) length-weight relationship. Because the predictors are so strongly correlated with one another, multicollinearity is present.
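One way to probe this curvature is a log-log scatter: if Weight scales as a power of length, the transformed points should fall near a straight line. Below is a minimal sketch, using Length3 as a representative length metric and dropping any rows with non-positive weight, where the logarithm is undefined:
# Log-log view of Weight against one length metric
positive = df[df['Weight'] > 0].copy()
positive['log_weight'] = np.log(positive['Weight'])
positive['log_length'] = np.log(positive['Length3'])
sns.scatterplot(data=positive, x='log_length', y='log_weight', hue='Species')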
In the context of multiple regression, multicollinearity indicates that the predictors are not providing independent information. Potential problems caused by multicollinearity include:
- Unreliable coefficient estimates: High multicollinearity inflates the standard errors of the coefficients, leading to less precise estimates. This makes it difficult to determine the true effect of each predictor variable.
- Unstable coefficients: Small changes in the data can cause large changes in the coefficient estimates, leading to a lack of confidence in the results.
- Difficulty in determining the importance of variables: Multicollinearity makes it hard to discern which variables are actually influencing the dependent variable since they are correlated with each other.
- Increased variance of estimates: When variables are highly correlated, the variance of the coefficient estimates increases, which can lead to wider confidence intervals and less reliable hypothesis tests.
- Model overfitting: In some cases, multicollinearity can cause the model to overfit the data, capturing the noise rather than the true underlying pattern.
There are several ways to address multicollinearity:
- Remove highly correlated predictors: If two or more predictors are highly correlated, consider removing one of them from the model, especially if it contributes little unique information.
- Combine predictors: If predictors are highly correlated, you can combine them into a single predictor, for example, by using their average or creating a composite index.
- Principal Component Analysis (PCA): PCA can be used to transform the correlated predictors into a set of uncorrelated components, which can then be used as predictors in the regression model.
- Ridge regression: Ridge regression adds a penalty to the size of coefficients to reduce multicollinearity. This shrinks the coefficients and helps to mitigate the effects of multicollinearity.
- Increase sample size: In some cases, increasing the sample size can reduce the standard errors of the coefficients, thereby mitigating the effects of multicollinearity.
- Centering the variables: Subtracting the mean from each predictor can sometimes help reduce multicollinearity.
Addressing multicollinearity depends on the context. In our upcoming post, we’ll explore how to tackle this issue when building a regression model.
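In the meantime, a common way to quantify multicollinearity is the variance inflation factor (VIF), which measures how much each coefficient’s variance is inflated by its correlation with the other predictors; values above roughly 5–10 are usually taken as a warning sign. Below is a minimal sketch, assuming the statsmodels package is available:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Numerical predictors plus an intercept column named 'const';
# the VIF reported for 'const' itself can be ignored.
X = add_constant(df.drop(columns='Species'))
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns, name='VIF')
vif
Returning to the pairwise correlations, the heatmaps below break them down by species.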
fish_types = df['Species'].unique()
n_row = 2
n_col = 4
fig, axs = plt.subplots(n_row, n_col, figsize=(30, 10))
for i in range(n_row):
    for j in range(n_col):
        # Map the (row, column) grid position to a species index.
        idx = i * n_col + j
        if idx < len(fish_types):
            # Correlate only the numerical columns for this species.
            corr = df[df['Species'] == fish_types[idx]].drop(columns='Species').corr()
            sns.heatmap(ax=axs[i, j], data=corr, annot=True, fmt='.1f')
            axs[i, j].set_title('Correlation heatmap for {}'.format(fish_types[idx]))
        else:
            axs[i, j].axis('off')  # hide the unused eighth subplot
The robust connections between the features are further affirmed by the correlation heatmaps: for every species, the pairwise correlation coefficients between the metrics exceed 0.9.
In conclusion, the Exploratory Data Analysis (EDA) conducted on the fish market dataset has provided valuable insights into the characteristics of various fish species. Notably, the dataset contains no missing values and no obvious outliers, which streamlined the analysis and required minimal data-handling interventions. The numerical features exhibit strong associations, with Weight growing non-linearly with the other size metrics across all fish types. The correlation heatmaps further confirm these robust connections, with pairwise correlations consistently exceeding 0.9. This comprehensive exploration sets a solid foundation for subsequent analyses, ensuring a thorough understanding of the dataset’s structure and paving the way for informed decision-making in future investigations.