Data Cleaning and Preprocessing Techniques in Python


Data cleaning and preprocessing are crucial steps in the data analysis and machine learning pipeline. Here are some common techniques for data cleaning and preprocessing:

Handling Missing Values:

Dropping: Remove rows or columns with missing values using dropna() method.

Imputation: Fill missing values with a specific value (mean, median, mode) using fillna() method or tools like SimpleImputer from scikit-learn.

Python Code

import pandas as pd
from sklearn.impute import SimpleImputer

# Dropping missing values
clean_data = data.dropna()

# Imputation
imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data)
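The fillna() method mentioned above can also impute directly on a DataFrame. A minimal sketch, assuming a small hypothetical `age` column:

```python
import pandas as pd

# Hypothetical DataFrame with a missing value
df = pd.DataFrame({"age": [25, None, 30]})

# Fill missing values with the column mean
df_filled = df.fillna(df["age"].mean())
```

Here the missing entry becomes the mean of the observed values (27.5); `strategy='median'` or the column mode work the same way.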

Handling Outliers:

Detection: Identify outliers using statistical methods or visualization techniques.

Removal: Remove outliers or replace them with a boundary value.

Python Code

# Example using z-score
import numpy as np
from scipy import stats

# Keep only rows where every feature has |z| < 3
z_scores = stats.zscore(data)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
cleaned_data = data[filtered_entries]
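For the second option, replacing outliers with a boundary value, one common approach (an illustrative choice here, not prescribed above) is capping at the 1.5 × IQR fences with pandas' clip():

```python
import pandas as pd

# Hypothetical numeric column with one extreme value
s = pd.Series([10, 12, 11, 13, 100])

# Compute the interquartile range and its fences
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values at the boundaries instead of dropping the rows
capped = s.clip(lower=lower, upper=upper)
```

Unlike the z-score filter above, capping keeps every row, which matters when the dataset is small.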

Feature Scaling:

Normalization: Scale features to a range (e.g., between 0 and 1) using MinMaxScaler from scikit-learn.

Standardization: Scale features to have a mean of 0 and a standard deviation of 1 using StandardScaler from scikit-learn.

Python Code

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

# Standardization
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
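As a quick sanity check on what the two scalers do, a minimal sketch on a hypothetical single-feature array: MinMaxScaler maps the minimum to 0 and the maximum to 1, while StandardScaler centers the values at 0 with unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data
X = np.array([[1.0], [2.0], [3.0]])

# Normalization: (x - min) / (max - min)
mm = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std
st = StandardScaler().fit_transform(X)
```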

Encoding Categorical Variables:

Label Encoding: Convert categorical variables into numerical form.

One-Hot Encoding: Create binary columns for each category using get_dummies() from pandas or OneHotEncoder from scikit-learn.

Python Code

# Label Encoding
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

# One-Hot Encoding
encoded_data = pd.get_dummies(data)

Text Cleaning:

Tokenization: Split text into words or tokens.

Removing Stopwords: Eliminate common words like 'is', 'and', 'the', etc.

Stemming/Lemmatization: Reduce words to their root form.

Python Code

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]
    return ' '.join(lemmatized_tokens)

These are just some of the basic techniques for data cleaning and preprocessing. Depending on the dataset and the specific problem you're working on, you may need to employ additional techniques.

 

