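Data Preprocessing 🧹

Real-world data is rarely clean. Data preprocessing is the essential step of cleaning data into a form that can be analyzed.

What Is Data Preprocessing?

Data preprocessing is the process of transforming raw data into a form suitable for analysis or machine learning.

Main Tasks

Handling missing values: dealing with absent data
Removing duplicates: dropping repeated records
Converting data types: casting columns to appropriate types
Handling outliers: dealing with abnormal values
Integrating data: combining multiple data sources

Info

A commonly cited rule of thumb is that data scientists spend as much as 80% of their working time on data preprocessing.

Handling Missing Values

Checking for Missing Values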

import pandas as pd
import numpy as np

# Sample data (with missing values)
data = {
    'name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'age': [25, None, 35, 28, 32],
    'salary': [50000, 60000, None, 55000, None],
    'department': ['IT', 'HR', 'IT', None, 'HR']
}
df = pd.DataFrame(data)

print("=== Original data ===")
print(df)

# Count missing values per column
print("\n=== Missing value counts ===")
print(df.isnull().sum())

# Percentage of missing values per column
print("\n=== Missing value percentages ===")
print((df.isnull().sum() / len(df) * 100).round(2))

# Rows that contain at least one missing value
print("\n=== Rows with missing values ===")
print(df[df.isnull().any(axis=1)])
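The count and percentage checks above can be combined into one report. The following is a minimal sketch; the helper name `missing_summary` and the small `sample` frame are illustrative assumptions, not part of the original tutorial.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Hypothetical sample data for illustration
sample = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Eve"],
    "age": [25, None, None, 32],
    "city": ["Seoul", "Busan", "Seoul", "Incheon"],
})
report = missing_summary(sample)
print(report)
```

Sorting by the number of missing values puts the columns that need the most attention at the top of the report.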

Dropping Missing Values

# Drop every row that contains a missing value
df_dropped_rows = df.dropna()
print("After dropping rows:", df_dropped_rows.shape)

# Drop every column that contains a missing value
df_dropped_cols = df.dropna(axis=1)
print("After dropping columns:", df_dropped_cols.shape)

# Drop rows only when specific columns are missing
df_dropped_subset = df.dropna(subset=['name', 'age'])
print("After dropping rows missing name or age:", df_dropped_subset.shape)

# Drop only rows in which every value is missing
df_dropped_all = df.dropna(how='all')

# Keep only rows with at least N non-missing values
df_dropped_thresh = df.dropna(thresh=3)  # at least 3 values
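The difference between the default behavior, `how='all'`, and `thresh` is easiest to see on a tiny frame. This is a sketch with made-up data; the frame `rows` and the result names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Row 0 has 3 values, row 1 has 2, row 2 has 1, row 3 has none
rows = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan, np.nan],
    "c": [3.0, 3.0, 3.0, np.nan],
})

any_missing = rows.dropna()            # default how='any': keep only complete rows
all_missing = rows.dropna(how="all")   # drop only the fully empty row
at_least_two = rows.dropna(thresh=2)   # keep rows with 2 or more non-missing values

print(len(any_missing), len(all_missing), len(at_least_two))
```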

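Filling Missing Values

# Fill with a single constant (the two lines are independent alternatives)
df_filled = df.fillna(0)
df_filled = df.fillna('Unknown')

# Fill each column with a different value
df_filled = df.fillna({
    'age': df['age'].mean(),  # mean
    'salary': df['salary'].median(),  # median
    'department': 'Unknown'
})

# Fill from the previous/next value (useful for time series)
# fillna(method=...) is deprecated in recent pandas; use ffill()/bfill()
df_ffill = df.ffill()  # forward fill
df_bfill = df.bfill()  # backward fill

# Interpolation
df['age'] = df['age'].interpolate()  # linear interpolation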
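To compare the fill strategies above side by side, the sketch below applies each of them to the same short series. The series `s` and the result names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])

forward = s.ffill()            # repeat the last known value
backward = s.bfill()           # pull the next known value backwards
linear = s.interpolate()       # straight line between known points
by_mean = s.fillna(s.mean())   # mean of the known values (25.0)

comparison = pd.DataFrame({
    "original": s, "ffill": forward, "bfill": backward,
    "interpolate": linear, "mean": by_mean,
})
print(comparison)
```

Forward and backward fill copy existing observations, while interpolation and mean filling create values that never appeared in the data, which matters when the filled column is later used for analysis.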

Advanced Missing-Value Handling

import pandas as pd
import numpy as np

# Sample data, indexed by date so that time-based interpolation works
data = {
    'date': pd.date_range('2024-01-01', periods=10),
    'temperature': [22, 23, None, 25, 26, None, 28, 29, None, 31],
    'humidity': [60, 62, 65, None, 68, 70, None, 75, 77, 80]
}
df = pd.DataFrame(data).set_index('date')
df['category'] = ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'B']

# Option 1: fill with the mean of each group
temp_group_mean = df.groupby('category')['temperature'].transform(
    lambda x: x.fillna(x.mean())
)

# Option 2: time-based interpolation (requires a DatetimeIndex)
temp_time = df['temperature'].interpolate(method='time')

# Option 3: KNN-based imputation (uses scikit-learn)
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
df[['temperature', 'humidity']] = imputer.fit_transform(
    df[['temperature', 'humidity']]
)

Handling Duplicate Data

Checking for Duplicates

import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'age': [25, 30, 25, 35, 30],
    'city': ['Seoul', 'Busan', 'Seoul', 'Incheon', 'Busan']
}
df = pd.DataFrame(data)

print("=== Original data ===")
print(df)

# Flag duplicated rows
print("\n=== Duplicate flags ===")
print(df.duplicated())

# Number of duplicated rows
print(f"\nNumber of duplicate rows: {df.duplicated().sum()}")

# Show every row that has a duplicate
print("\n=== Duplicated rows ===")
print(df[df.duplicated(keep=False)])

Removing Duplicates

# Remove duplicates (keep the first occurrence)
df_unique = df.drop_duplicates()
print("Rows after removing duplicates:", len(df_unique))

# Keep the last occurrence
df_unique_last = df.drop_duplicates(keep='last')

# Drop every row that has a duplicate
df_no_duplicates = df.drop_duplicates(keep=False)

# Remove duplicates based on specific columns
df_unique_name = df.drop_duplicates(subset=['name'])
df_unique_multi = df.drop_duplicates(subset=['name', 'age'])

# Modify the original DataFrame in place
df.drop_duplicates(inplace=True)
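`duplicated()` compares values exactly, so records that differ only in case or surrounding whitespace are not flagged. One possible workaround, sketched here with a hypothetical `people` frame, is to detect duplicates on normalized keys while keeping the original values.

```python
import pandas as pd

# Hypothetical data: the same people typed inconsistently
people = pd.DataFrame({
    "name": ["Alice", " alice ", "BOB", "Bob", "Charlie"],
    "city": ["Seoul", "Seoul", "Busan", "Busan", "Incheon"],
})

# Exact comparison finds nothing
exact_dupes = people.duplicated().sum()

# Build normalized keys (trimmed, lower case) only for the comparison
keys = pd.DataFrame({
    "name": people["name"].str.strip().str.lower(),
    "city": people["city"].str.strip().str.lower(),
})
normalized_dupes = keys.duplicated()

# Keep the first occurrence of each normalized record
deduped = people[~normalized_dupes]
print(deduped)
```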

Converting Data Types

Basic Type Conversion

import pandas as pd

data = {
    'id': ['1', '2', '3', '4', '5'],
    'price': ['1000', '1500', '2000', '2500', '3000'],
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
    'is_active': ['True', 'False', 'True', 'True', 'False']
}
df = pd.DataFrame(data)

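print("=== Original dtypes ===")
print(df.dtypes)

# Convert to numbers
df['id'] = df['id'].astype(int)
df['price'] = df['price'].astype(float)

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Convert to boolean
df['is_active'] = df['is_active'].map({'True': True, 'False': False})

print("\n=== Dtypes after conversion ===")
print(df.dtypes)

Safe Conversion

# Convert without raising; values that fail become NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Datetime conversion with error handling; values that fail become NaT
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Category dtype (saves memory for repeated labels)
# 'category' is an illustrative column; the sample data above does not include one
df['category'] = ['A', 'B', 'A', 'C', 'B']
df['category'] = df['category'].astype('category')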
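Because `errors='coerce'` silently turns unparseable values into NaN, it can help to check which inputs were lost. A minimal sketch, assuming a hypothetical `raw` series of price strings:

```python
import pandas as pd

raw = pd.Series(["1000", "1,500", "n/a", "2000", None])
converted = pd.to_numeric(raw, errors="coerce")

# Values that were present in the input but could not be converted
failed = raw[converted.isna() & raw.notna()]
print(failed)
```

Inspecting `failed` before filling or dropping the NaN values shows whether the inputs need extra cleaning first, such as removing thousands separators.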

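Complex Conversions

# Extract numbers from formatted strings (one value per row of df)
df['price_str'] = ['$1,000', '$1,500', '$2,000', '$2,500', '$3,000']
df['price'] = df['price_str'].str.replace('[$,]', '', regex=True).astype(float)

# Parse dates with an explicit format
slash_dates = pd.Series(['2024/01/01', '2024/01/02'])
parsed = pd.to_datetime(slash_dates, format='%Y/%m/%d')

# Parse a column that mixes several formats
# (format='mixed' requires pandas 2.0+; infer_datetime_format is deprecated)
mixed_dates = pd.Series(['2024-01-01', '2024/01/02', 'Jan 3, 2024'])
parsed_mixed = pd.to_datetime(mixed_dates, format='mixed')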

String Processing

Basic String Operations

import pandas as pd

data = {
    'name': ['  Alice  ', 'BOB', 'charlie', 'David'],
    'email': ['alice@example.com', 'BOB@EXAMPLE.COM', 'charlie@test.com', 'david@test.com']
}
df = pd.DataFrame(data)

# Strip surrounding whitespace
df['name'] = df['name'].str.strip()

# Change case
df['name_upper'] = df['name'].str.upper()
df['name_lower'] = df['name'].str.lower()
df['name_title'] = df['name'].str.title()

# Extract the email domain
df['domain'] = df['email'].str.split('@').str[1]

# Check whether a substring is present
df['has_test'] = df['email'].str.contains('test')

# String length
df['name_length'] = df['name'].str.len()

print(df)

Using Regular Expressions

import pandas as pd

data = {
    'phone': ['010-1234-5678', '02-987-6543', '031-555-1234', '010.9999.8888'],
    'text': ['Price: 10,000 won', 'Discount price 5,000 won', 'Regular price 20000 won', 'Special offer 15,000']
}
df = pd.DataFrame(data)

# Normalize phone numbers by removing separators
df['phone_clean'] = df['phone'].str.replace('[.-]', '', regex=True)

# Extract the numeric part
df['price'] = df['text'].str.extract(r'(\d+,?\d+)')[0]
df['price'] = df['price'].str.replace(',', '').astype(int)

# Look for specific patterns
df['has_discount'] = df['text'].str.contains('Discount|Special')

print(df)

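Splitting and Joining Text

# Split a string into columns
df = pd.DataFrame({'full_name': ['John Doe', 'Jane Smith', 'Bob Johnson']})
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)

# Concatenate strings
df['full_name_reversed'] = df['last_name'] + ', ' + df['first_name']

# Concatenate several columns (illustrative address data; all columns must be strings)
addr = pd.DataFrame({'city': ['Austin', 'Boston'], 'state': ['TX', 'MA'], 'zip': ['73301', '02108']})
addr['address'] = addr['city'] + ', ' + addr['state'] + ' ' + addr['zip']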

Merging Data

merge (similar to SQL JOIN)

import pandas as pd

# Employee data
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'dept_id': [10, 20, 10, 30]
})

# Department data
departments = pd.DataFrame({
    'dept_id': [10, 20, 30, 40],
    'dept_name': ['IT', 'HR', 'Sales', 'Marketing']
})

# Inner join (intersection)
inner = pd.merge(employees, departments, on='dept_id', how='inner')
print("=== Inner Join ===")
print(inner)

# Left join (all rows from the left table)
left = pd.merge(employees, departments, on='dept_id', how='left')
print("\n=== Left Join ===")
print(left)

# Right join (all rows from the right table)
right = pd.merge(employees, departments, on='dept_id', how='right')

# Outer join (union)
outer = pd.merge(employees, departments, on='dept_id', how='outer')

# Join on columns with different names
df1 = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
df2 = pd.DataFrame({'key': [1, 2], 'data': [100, 200]})
merged = pd.merge(df1, df2, left_on='id', right_on='key')
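When a join drops or adds rows unexpectedly, the `indicator` and `validate` parameters of `pd.merge` can help audit it. The sketch below uses small hypothetical tables in which one employee points at a department that does not exist.

```python
import pandas as pd

# Hypothetical tables: dept_id 50 has no department, dept_id 30 has no employee
employees = pd.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "dept_id": [10, 20, 10, 50],
})
departments = pd.DataFrame({
    "dept_id": [10, 20, 30],
    "dept_name": ["IT", "HR", "Sales"],
})

# indicator=True adds a _merge column telling where each row came from;
# validate raises if dept_id is not unique on the right-hand side
audit = pd.merge(
    employees, departments, on="dept_id", how="outer",
    indicator=True, validate="many_to_one",
)
match_counts = audit["_merge"].value_counts()
unmatched_employees = audit.loc[audit["_merge"] == "left_only", "emp_id"].tolist()

print(match_counts)
print(unmatched_employees)
```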

concat (simple concatenation)

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Vertical concatenation (append rows)
vertical = pd.concat([df1, df2], ignore_index=True)
print("=== Vertical concatenation ===")
print(vertical)

# Horizontal concatenation (append columns)
df3 = pd.DataFrame({'C': [9, 10], 'D': [11, 12]})
horizontal = pd.concat([df1, df3], axis=1)
print("\n=== Horizontal concatenation ===")
print(horizontal)

join (index-based)

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'B': [4, 5, 6]}, index=['a', 'b', 'd'])

# Join on the index
joined = df1.join(df2, how='outer')
print("=== Join ===")
print(joined)

Grouping (GroupBy)

Basic Grouping

import pandas as pd
import numpy as np

# Sales data
data = {
    'date': pd.date_range('2024-01-01', periods=20),
    'product': np.random.choice(['A', 'B', 'C'], 20),
    'region': np.random.choice(['Seoul', 'Busan', 'Daegu'], 20),
    'sales': np.random.randint(100, 1000, 20)
}
df = pd.DataFrame(data)

# Total sales per product
product_sales = df.groupby('product')['sales'].sum()
print("=== Sales by product ===")
print(product_sales)

# Several aggregation functions at once
product_stats = df.groupby('product')['sales'].agg(['sum', 'mean', 'count'])
print("\n=== Statistics by product ===")
print(product_stats)

# Group by several columns
region_product = df.groupby(['region', 'product'])['sales'].sum()
print("\n=== Product sales by region ===")
print(region_product)

Advanced Grouping

# Apply different functions to different columns
agg_dict = {
    'sales': ['sum', 'mean', 'max'],
    'product': 'count'
}
result = df.groupby('region').agg(agg_dict)
print(result)

# Custom aggregation function
def range_func(x):
    return x.max() - x.min()

custom_agg = df.groupby('product')['sales'].agg([
    ('total', 'sum'),
    ('mean', 'mean'),
    ('range', range_func)
])
print(custom_agg)

# Group-wise transformation (result keeps the original row count)
df['sales_mean_by_product'] = df.groupby('product')['sales'].transform('mean')
df['sales_vs_avg'] = df['sales'] - df['sales_mean_by_product']

# Rank within each group
df['rank_in_product'] = df.groupby('product')['sales'].rank(ascending=False)
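The difference between `agg` and `transform` is the shape of the result: `agg` returns one row per group, while `transform` returns one value per original row. A small sketch with a hypothetical `sales_demo` frame:

```python
import pandas as pd

sales_demo = pd.DataFrame({
    "product": ["A", "A", "B", "B", "B"],
    "sales": [100, 300, 50, 150, 100],
})
grouped = sales_demo.groupby("product")["sales"]

per_group = grouped.agg("mean")        # 2 rows, indexed by product
per_row = grouped.transform("mean")    # 5 rows, aligned with sales_demo

# Because transform is aligned with the original rows, it can be used directly
# in arithmetic, here each sale's share of its product total
sales_demo["share_of_product"] = sales_demo["sales"] / grouped.transform("sum")
print(sales_demo)
```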

Pivot Tables

# Create a pivot table
pivot = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0
)
print("=== Pivot table ===")
print(pivot)

# Multiple aggregations
pivot_multi = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc=['sum', 'mean'],
    fill_value=0
)

Practical Examples

Example 1: Cleaning Customer Data

import pandas as pd
import numpy as np

# Messy customer data
data = {
    'customer_id': [1, 2, 3, 4, 5, 5, 6, None, 8],
    'name': ['  Alice  ', 'BOB', 'Charlie', None, 'Eve', 'Eve', 'Frank', 'Grace', 'Henry'],
    'email': ['alice@test.com', 'bob@TEST.COM', 'charlie@test', 'dave@test.com',
              'eve@test.com', 'eve@test.com', None, 'grace@test.com', 'henry@test.com'],
    'age': ['25', '30', 'thirty', '28', '32', '32', '27', '29', None],
    'purchase_amount': ['1000', '1,500', '2000', '500', '3000', '3000', '1200', '800', '1500'],
    'join_date': ['2024-01-01', '2024/01/15', '2024.02.01', None, '2024-03-01',
                  '2024-03-01', '2024-03-15', '2024-04-01', '2024-04-15']
}
df = pd.DataFrame(data)

print("=== Original data (many problems) ===")
print(df)
print(f"\nData shape: {df.shape}")

# Step 1: Remove duplicates
print("\n[Step 1] Remove duplicates")
df = df.drop_duplicates(subset=['customer_id', 'email'], keep='first')
print(f"After removing duplicates: {df.shape}")

# Step 2: Handle missing values
print("\n[Step 2] Handle missing values")
print(f"Missing value counts:\n{df.isnull().sum()}")

# Drop rows without a customer_id (required field)
df = df.dropna(subset=['customer_id'])

# Fill missing names
df['name'] = df['name'].fillna('Unknown')

# Fill missing emails with a placeholder
df['email'] = df['email'].fillna('no-email@unknown.com')

# Step 3: Convert data types
print("\n[Step 3] Convert data types")

# customer_id to integer
df['customer_id'] = df['customer_id'].astype(int)

# age: coerce non-numeric values to NaN, then fill with the median
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['age'] = df['age'].fillna(df['age'].median())

# purchase_amount: remove thousands separators, then convert to a number
df['purchase_amount'] = df['purchase_amount'].str.replace(',', '').astype(float)

# join_date: normalize '/' and '.' separators to '-' before parsing; otherwise
# pandas 2.x infers the format from the first value and coerces the rest to NaT
df['join_date'] = df['join_date'].str.replace(r'[./]', '-', regex=True)
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
df['join_date'] = df['join_date'].fillna(pd.Timestamp('2024-01-01'))

# Step 4: Clean strings
print("\n[Step 4] Clean strings")

# name: strip whitespace and capitalize the first letter
df['name'] = df['name'].str.strip().str.title()

# email: lower-case everything
df['email'] = df['email'].str.lower()

# Step 5: Validate emails
print("\n[Step 5] Validate emails")
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
df['email_valid'] = df['email'].str.match(email_pattern)
print(f"Invalid emails: {(~df['email_valid']).sum()}")

# Step 6: Create derived variables
print("\n[Step 6] Create derived variables")
df['customer_value'] = pd.cut(
    df['purchase_amount'],
    bins=[0, 1000, 2000, float('inf')],
    labels=['Low', 'Medium', 'High']
)

df['days_since_join'] = (pd.Timestamp.now() - df['join_date']).dt.days

print("\n=== Cleaned data ===")
print(df)
print(f"\nFinal data shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")

# Cleaning summary
print("\n=== Cleaning summary ===")
print(f"- Rows removed (duplicates and missing IDs): {len(data['customer_id']) - len(df)}")
print("- Missing values handled")
print("- Data types unified")
print("- Strings normalized")
print("- Emails validated")

Example 2: Integrating and Analyzing Sales Data

import pandas as pd
import numpy as np

# Online sales data
online_sales = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004],
    'customer_id': [101, 102, 101, 103],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'amount': [1500000, 30000, 80000, 400000],
    'date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18'],
    'channel': ['Online'] * 4
})

# Offline sales data
offline_sales = pd.DataFrame({
    'order_id': [2001, 2002, 2003],
    'customer_id': [101, 104, 102],
    'product': ['Mouse', 'Laptop', 'Keyboard'],
    'amount': [35000, 1600000, 75000],
    'date': ['2024-01-16', '2024-01-17', '2024-01-18'],
    'channel': ['Offline'] * 3
})

# Customer data
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'name': ['김철수', '이영희', '박민수', '정지은', '최호진'],
    'grade': ['VIP', 'Regular', 'Regular', 'VIP', 'Regular'],
    'region': ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon']
})

print("=== 1. Data integration ===")
# Combine online and offline sales
all_sales = pd.concat([online_sales, offline_sales], ignore_index=True)
print(f"Total number of sales: {len(all_sales)}")

# Convert the date column to datetime
all_sales['date'] = pd.to_datetime(all_sales['date'])

# Merge with customer data
sales_with_customer = pd.merge(
    all_sales,
    customers,
    on='customer_id',
    how='left'
)

print("\n=== 2. Integrated data ===")
print(sales_with_customer)

print("\n=== 3. Analysis by channel ===")
channel_stats = sales_with_customer.groupby('channel').agg({
    'amount': ['sum', 'mean', 'count'],
    'customer_id': 'nunique'
})
channel_stats.columns = ['total_sales', 'avg_sale', 'order_count', 'customer_count']
print(channel_stats)

print("\n=== 4. Analysis by customer ===")
customer_stats = sales_with_customer.groupby(['customer_id', 'name', 'grade']).agg({
    'amount': 'sum',
    'order_id': 'count'
})
customer_stats.columns = ['total_spent', 'order_count']
customer_stats = customer_stats.sort_values('total_spent', ascending=False)
print(customer_stats)

print("\n=== 5. Analysis by product ===")
product_stats = sales_with_customer.groupby('product').agg({
    'amount': ['sum', 'count'],
    'customer_id': 'nunique'
})
product_stats.columns = ['total_sales', 'order_count', 'unique_customers']
product_stats = product_stats.sort_values('total_sales', ascending=False)
print(product_stats)

print("\n=== 6. VIP vs. regular customers ===")
grade_comparison = sales_with_customer.groupby('grade').agg({
    'amount': ['sum', 'mean'],
    'order_id': 'count'
})
grade_comparison.columns = ['total_sales', 'avg_purchase', 'order_count']
print(grade_comparison)

print("\n=== 7. Analysis by region ===")
region_sales = sales_with_customer.groupby('region')['amount'].sum().sort_values(ascending=False)
print(region_sales)

# Pivot table: region × channel
print("\n=== 8. Sales by region and channel ===")
pivot = pd.pivot_table(
    sales_with_customer,
    values='amount',
    index='region',
    columns='channel',
    aggfunc='sum',
    fill_value=0
)
print(pivot)

Example 3: Preprocessing Time Series Data

import pandas as pd
import numpy as np

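# Messy hourly time series data (sensor readings)
dates = pd.date_range('2024-01-01', periods=100, freq='h')  # 'H' is deprecated in pandas 2.2+
np.random.seed(42)

# Base readings; missing values and outliers are injected afterwards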
data = {
    'timestamp': dates,
    'temperature': np.random.normal(20, 5, 100),
    'humidity': np.random.normal(60, 10, 100)
}
df = pd.DataFrame(data)

# Inject missing values and outliers
df.loc[10:15, 'temperature'] = np.nan
df.loc[30, 'temperature'] = 100  # outlier
df.loc[50:52, 'humidity'] = np.nan
df.loc[70, 'humidity'] = -50  # outlier

print("=== Original data ===")
print(f"Data shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}")

# Step 1: Use the timestamp as the index
df.set_index('timestamp', inplace=True)

# Step 2: Detect and handle outliers (IQR method)
print("\n[Outlier handling]")
for col in ['temperature', 'humidity']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = (df[col] < lower_bound) | (df[col] > upper_bound)
    print(f"{col} outliers: {outliers.sum()}")

    # Replace outliers with NaN so they are interpolated in the next step
    df.loc[outliers, col] = np.nan

# Step 3: Interpolate missing values
print("\n[Interpolation]")
df['temperature'] = df['temperature'].interpolate(method='time')
df['humidity'] = df['humidity'].interpolate(method='time')
print(f"Missing values after interpolation: {df.isnull().sum().sum()}")

# Step 4: Smooth noise with a moving average
# min_periods=1 avoids introducing NaN in the first rows of each window
df['temp_smooth'] = df['temperature'].rolling(window=3, min_periods=1).mean()
df['humid_smooth'] = df['humidity'].rolling(window=3, min_periods=1).mean()

# Step 5: Statistics by hour of day
df['hour'] = df.index.hour
hourly_stats = df.groupby('hour').agg({
    'temperature': ['mean', 'std'],
    'humidity': ['mean', 'std']
})

print("\n=== Statistics by hour of day ===")
print(hourly_stats.head(10))

# Step 6: Daily aggregation
daily = df.resample('D').agg({
    'temperature': ['mean', 'min', 'max'],
    'humidity': ['mean', 'min', 'max']
})

print("\n=== Daily aggregation ===")
print(daily)

print("\n=== Preprocessing complete ===")
print(f"Final data shape: {df.shape}")
print(f"Missing values: {df.isnull().sum().sum()}")
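The IQR rule used in the outlier step of this example can be sketched as a small reusable helper. The function name `iqr_outlier_mask`, the multiplier parameter `k`, and the `readings` series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical sensor readings with one spike and one gap
readings = pd.Series([20.0, 21.0, 19.5, 20.5, 100.0, 20.2, np.nan, 19.8])
mask = iqr_outlier_mask(readings)

# Turn flagged values into NaN so that they can be interpolated later
cleaned = readings.mask(mask)
print(cleaned)
```

Missing values are never flagged, because comparisons with NaN evaluate to False; they stay NaN and are handled by the interpolation step.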

Data Preprocessing Pipeline

import pandas as pd

class DataCleaningPipeline:
    """Data preprocessing pipeline with chainable steps"""

    def __init__(self, df):
        self.df = df.copy()
        self.logs = []

    def log(self, message):
        """Record a log message"""
        self.logs.append(message)
        print(f"✓ {message}")

    def remove_duplicates(self, subset=None):
        """Remove duplicate rows"""
        before = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset)
        removed = before - len(self.df)
        self.log(f"Removed {removed} duplicate rows")
        return self

    def handle_missing(self, strategy='drop', fill_value=None):
        """Handle missing values ('drop', 'fill', or 'mean')"""
        missing_before = self.df.isnull().sum().sum()

        if strategy == 'drop':
            self.df = self.df.dropna()
        elif strategy == 'fill':
            self.df = self.df.fillna(fill_value)
        elif strategy == 'mean':
            numeric_cols = self.df.select_dtypes(include='number').columns
            self.df[numeric_cols] = self.df[numeric_cols].fillna(
                self.df[numeric_cols].mean()
            )
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

        missing_after = self.df.isnull().sum().sum()
        self.log(f"Missing values: {missing_before} → {missing_after}")
        return self

    def convert_types(self, type_dict):
        """Convert column data types"""
        for col, dtype in type_dict.items():
            if col in self.df.columns:
                self.df[col] = self.df[col].astype(dtype)
                self.log(f"Converted {col} → {dtype}")
        return self

    def clean_strings(self, columns):
        """Strip whitespace and lower-case string columns"""
        for col in columns:
            if col in self.df.columns:
                self.df[col] = (self.df[col]
                    .str.strip()
                    .str.lower()
                )
                self.log(f"Cleaned strings in {col}")
        return self

    def remove_outliers(self, columns, method='iqr'):
        """Remove rows with outliers (only the IQR method is implemented)"""
        if method != 'iqr':
            raise ValueError(f"Unknown method: {method}")
        for col in columns:
            Q1 = self.df[col].quantile(0.25)
            Q3 = self.df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR

            before = len(self.df)
            self.df = self.df[(self.df[col] >= lower) & (self.df[col] <= upper)]
            removed = before - len(self.df)
            self.log(f"Removed {removed} outlier rows from {col}")
        return self

    def get_result(self):
        """Print the log and return the cleaned DataFrame"""
        print("\n=== Preprocessing complete ===")
        for log in self.logs:
            print(f"  {log}")
        return self.df

# Usage example
data = {
    'name': ['  Alice  ', 'BOB', 'Charlie', 'Alice', None, 'Dave', 'Eve'],
    'age': [25, 30, 200, 25, 28, 27, 32],  # 200 is an outlier
    'email': ['alice@test.com', 'BOB@TEST.COM', 'charlie@test.com', 'alice@test.com', None,
              'dave@test.com', 'eve@test.com']
}
df = pd.DataFrame(data)

# Run the pipeline. Strings are cleaned first so that '  Alice  ' and 'Alice'
# are recognized as the same record when duplicates are removed.
cleaned_df = (DataCleaningPipeline(df)
    .clean_strings(['name', 'email'])
    .remove_duplicates(subset=['name', 'email'])
    .handle_missing(strategy='drop')
    .remove_outliers(['age'])
    .get_result()
)

print("\nCleaned data:")
print(cleaned_df)

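Frequently Asked Questions

When should missing values be dropped, and when should they be filled?

As a rule of thumb, dropping is reasonable when:

Less than about 5% of the values are missing
The values are missing at random
The column is not important

Filling is usually the better choice when:

A large share of the values is missing (20% or more)
The data is a time series
The feature is important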
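The thresholds above can be turned into a quick first-pass check. This is only a sketch of the rule of thumb, not a definitive policy: the function name `suggest_missing_strategy`, its parameters, and the "review manually" band between 5% and 20% are assumptions, and the check ignores the other criteria (randomness, importance, time series).

```python
import numpy as np
import pandas as pd

def suggest_missing_strategy(column, drop_below=0.05, fill_above=0.20):
    """Suggest a strategy from the share of missing values in a column."""
    ratio = column.isnull().mean()
    if ratio == 0:
        return "nothing to do"
    if ratio < drop_below:
        return "drop rows"
    if ratio >= fill_above:
        return "fill"
    return "review manually"

# Hypothetical frame with 0%, 2%, 10%, and 30% missing values
frame = pd.DataFrame({
    "complete": np.ones(100),
    "few_missing": [np.nan] * 2 + [1.0] * 98,
    "some_missing": [np.nan] * 10 + [1.0] * 90,
    "many_missing": [np.nan] * 30 + [1.0] * 70,
})
suggestions = {name: suggest_missing_strategy(frame[name]) for name in frame.columns}
print(suggestions)
```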

What is the difference between merge and join?

# merge: the key columns are specified explicitly
pd.merge(df1, df2, on='key')

# join: matches on the index (shorter to write)
df1.join(df2)
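To make the relationship concrete, the sketch below shows that a default `join` gives the same result as an index-based left `merge`. The frames `left` and `right` are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"B": [4, 5, 6]}, index=["a", "b", "d"])

# join performs a left join on the index by default
via_join = left.join(right)

# The same operation written with merge
via_merge = pd.merge(left, right, left_index=True, right_index=True, how="left")

print(via_join)
print(via_join.equals(via_merge))
```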

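groupby is slow. What can I do?

# Slow: apply calls a Python function once per group
df.groupby('category').apply(complex_function)

# Fast: built-in aggregations run in optimized code
df.groupby('category').agg({'column': 'sum'})

# Often faster still on large frames: a categorical key, without sorting the result.
# (Looping over unique values and filtering with NumPy rescans the frame once per
# group and is usually slower than a built-in aggregation.)
df['category'] = df['category'].astype('category')
df.groupby('category', observed=True, sort=False)['column'].sum()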
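Before replacing an `apply` call with a built-in aggregation, it is worth confirming that both give the same answer. A minimal sketch on randomly generated data (the frame and column names are assumptions); the built-in version is typically faster, although the actual difference depends on the data and the pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "category": rng.choice(["a", "b", "c"], size=1_000),
    "column": rng.integers(0, 100, size=1_000),
})

# Python-level function called once per group
slow = frame.groupby("category")["column"].apply(lambda s: s.sum())

# Built-in aggregation
fast = frame.groupby("category")["column"].sum()

print((slow == fast).all())
```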

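Next Steps

Once you are comfortable with data preprocessing:

Advanced analysis: statistical analysis and hypothesis testing
Feature engineering: creating variables for machine learning
Data visualization: visualizing the results of preprocessing
Automation: building preprocessing pipelines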
