
Data Preprocessing 🧹

Real data is not clean. Data preprocessing is the essential step of refining raw data into a form that can be analyzed.

What is Data Preprocessing?

Data preprocessing is the process of transforming raw data into a form suitable for analysis or machine learning.

Main Tasks

  • Handling missing values: dealing with gaps in the data
  • Removing duplicates: dropping repeated records
  • Data type conversion: casting columns to appropriate types
  • Outlier handling: dealing with abnormal values
  • Data integration: combining multiple data sources
Info

Data scientists are said to spend about 80% of their working time on data preprocessing!
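
The tasks above chain together naturally. As a minimal sketch (with made-up data), a first cleaning pass might look like this; each step is covered in detail in the sections below.

import pandas as pd

# Tiny made-up example: a duplicate row, a missing name, and a string-typed number
raw = pd.DataFrame({
    'name': ['Alice', 'Alice', 'Bob', None],
    'age': ['25', '25', '30', '28']
})

clean = (raw
         .drop_duplicates()                            # remove duplicates
         .dropna(subset=['name'])                      # handle missing values
         .assign(age=lambda d: d['age'].astype(int)))  # convert the data type
print(clean)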

Handling Missing Values

Checking Missing Values

import pandas as pd
import numpy as np

# Sample data (with missing values)
data = {
    'name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'age': [25, None, 35, 28, 32],
    'salary': [50000, 60000, None, 55000, None],
    'department': ['IT', 'HR', 'IT', None, 'HR']
}
df = pd.DataFrame(data)

print("=== Original data ===")
print(df)

# Count missing values per column
print("\n=== Missing value counts ===")
print(df.isnull().sum())

# Missing value ratio per column
print("\n=== Missing value ratio (%) ===")
print((df.isnull().sum() / len(df) * 100).round(2))

# Rows containing any missing value
print("\n=== Rows with missing values ===")
print(df[df.isnull().any(axis=1)])

Removing Missing Values

# Drop every row that has any missing value
df_dropped_rows = df.dropna()
print("After dropping rows:", df_dropped_rows.shape)

# Drop every column that has any missing value
df_dropped_cols = df.dropna(axis=1)
print("After dropping columns:", df_dropped_cols.shape)

# Drop rows with missing values only in specific columns
df_dropped_subset = df.dropna(subset=['name', 'age'])
print("After dropping rows missing name or age:", df_dropped_subset.shape)

# Drop only rows where every value is missing
df_dropped_all = df.dropna(how='all')

# Keep only rows with at least N non-missing values
df_dropped_thresh = df.dropna(thresh=3)  # at least 3 values

Filling Missing Values

# Fill with a specific value
df_filled = df.fillna(0)
df_filled = df.fillna('Unknown')

# Fill each column with a different value
df_filled = df.fillna({
    'age': df['age'].mean(),          # mean
    'salary': df['salary'].median(),  # median
    'department': 'Unknown'
})

# Forward/backward fill (time series data)
# fillna(method=...) is deprecated in pandas 2.x; use ffill()/bfill()
df_ffill = df.ffill()  # propagate the previous value
df_bfill = df.bfill()  # propagate the next value

# Interpolation
df['age'] = df['age'].interpolate()  # linear interpolation

Advanced Missing Value Handling

import pandas as pd
import numpy as np

# Sample data
data = {
    'date': pd.date_range('2024-01-01', periods=10),
    'temperature': [22, 23, None, 25, 26, None, 28, 29, None, 31],
    'humidity': [60, 62, 65, None, 68, 70, None, 75, 77, 80]
}
df = pd.DataFrame(data)

# Fill with the group mean
df['category'] = ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
df['temperature'] = df.groupby('category')['temperature'].transform(
    lambda x: x.fillna(x.mean())
)

# Time-based interpolation (requires a DatetimeIndex)
df = df.set_index('date')
df['temperature'] = df['temperature'].interpolate(method='time')

# KNN-based imputation (using scikit-learn)
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
df[['temperature', 'humidity']] = imputer.fit_transform(
    df[['temperature', 'humidity']]
)

Handling Duplicate Data

Checking Duplicates

import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'age': [25, 30, 25, 35, 30],
    'city': ['Seoul', 'Busan', 'Seoul', 'Incheon', 'Busan']
}
df = pd.DataFrame(data)

print("=== Original data ===")
print(df)

# Check for duplicates
print("\n=== Duplicate flags ===")
print(df.duplicated())

# Number of duplicates
print(f"\nDuplicate row count: {df.duplicated().sum()}")

# View the duplicated rows
print("\n=== Duplicated rows ===")
print(df[df.duplicated(keep=False)])

Removing Duplicates

# Remove duplicates (keep the first occurrence)
df_unique = df.drop_duplicates()
print("After removing duplicates:", len(df_unique))

# Keep the last occurrence
df_unique_last = df.drop_duplicates(keep='last')

# Drop all duplicated rows
df_no_duplicates = df.drop_duplicates(keep=False)

# Remove duplicates based on specific columns
df_unique_name = df.drop_duplicates(subset=['name'])
df_unique_multi = df.drop_duplicates(subset=['name', 'age'])

# Modify in place
df.drop_duplicates(inplace=True)

Data Type Conversion

Basic Type Conversion

import pandas as pd

data = {
    'id': ['1', '2', '3', '4', '5'],
    'price': ['1000', '1500', '2000', '2500', '3000'],
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
    'is_active': ['True', 'False', 'True', 'True', 'False']
}
df = pd.DataFrame(data)

print("=== Original dtypes ===")
print(df.dtypes)

# Convert to numbers
df['id'] = df['id'].astype(int)
df['price'] = df['price'].astype(float)

# Convert to dates
df['date'] = pd.to_datetime(df['date'])

# Convert to booleans
df['is_active'] = df['is_active'].map({'True': True, 'False': False})

print("\n=== Dtypes after conversion ===")
print(df.dtypes)

Safe Conversion

# Coerce errors during conversion
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # NaN on failure

# Date conversion (with error handling)
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Category dtype (saves memory; assumes a low-cardinality 'category' column exists)
df['category'] = df['category'].astype('category')

Complex Conversion

# Extract numbers from strings
df = pd.DataFrame({'price_str': ['$1,000', '$1,500', '$2,000'],
                   'date': ['2024/01/01', '2024/01/02', '2024/01/03']})
df['price'] = df['price_str'].str.replace('[$,]', '', regex=True).astype(float)

# Specify an explicit date format
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')

# Parse mixed formats (pandas >= 2.0; infer_datetime_format is deprecated)
df['date2'] = pd.to_datetime(['2024-01-01', '2024/01/02'], format='mixed')

String Processing

Basic String Operations

import pandas as pd

data = {
    'name': [' Alice ', 'BOB', 'charlie', 'David'],
    'email': ['alice@example.com', 'BOB@EXAMPLE.COM', 'charlie@test.com', 'david@test.com']
}
df = pd.DataFrame(data)

# Strip whitespace
df['name'] = df['name'].str.strip()

# Case conversion
df['name_upper'] = df['name'].str.upper()
df['name_lower'] = df['name'].str.lower()
df['name_title'] = df['name'].str.title()

# Extract the email domain
df['domain'] = df['email'].str.split('@').str[1]

# Check whether a substring is present
df['has_test'] = df['email'].str.contains('test')

# String length
df['name_length'] = df['name'].str.len()

print(df)

Using Regular Expressions

import pandas as pd

data = {
    'phone': ['010-1234-5678', '02-987-6543', '031-555-1234', '010.9999.8888'],
    'text': ['Price: 10,000 won', 'Sale price 5,000 won', 'List price 20000 won', 'Special offer 15,000']
}
df = pd.DataFrame(data)

# Normalize the phone number format
df['phone_clean'] = df['phone'].str.replace('[.-]', '', regex=True)

# Extract only the numbers
df['price'] = df['text'].str.extract(r'(\d+,?\d+)')[0]
df['price'] = df['price'].str.replace(',', '').astype(int)

# Find specific patterns
df['has_discount'] = df['text'].str.contains('Sale|Special')

print(df)

Splitting and Joining Text

# Split strings
df = pd.DataFrame({'full_name': ['John Doe', 'Jane Smith', 'Bob Johnson']})
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)

# Join strings
df['full_name_reversed'] = df['last_name'] + ', ' + df['first_name']

# Combine several columns (assumes city, state, and zip columns exist)
df['address'] = df['city'] + ', ' + df['state'] + ' ' + df['zip']

Data Merging

merge (similar to SQL JOIN)

import pandas as pd

# Employee information
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'dept_id': [10, 20, 10, 30]
})

# Department information
departments = pd.DataFrame({
    'dept_id': [10, 20, 30, 40],
    'dept_name': ['IT', 'HR', 'Sales', 'Marketing']
})

# Inner join (intersection)
inner = pd.merge(employees, departments, on='dept_id', how='inner')
print("=== Inner Join ===")
print(inner)

# Left join (keep every left row)
left = pd.merge(employees, departments, on='dept_id', how='left')
print("\n=== Left Join ===")
print(left)

# Right join (keep every right row)
right = pd.merge(employees, departments, on='dept_id', how='right')

# Outer join (union)
outer = pd.merge(employees, departments, on='dept_id', how='outer')

# Join on differently named key columns
df1 = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
df2 = pd.DataFrame({'key': [1, 2], 'data': [100, 200]})
merged = pd.merge(df1, df2, left_on='id', right_on='key')

concat (simple combination)

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Vertical concatenation (append rows)
vertical = pd.concat([df1, df2], ignore_index=True)
print("=== Vertical concat ===")
print(vertical)

# Horizontal concatenation (add columns)
df3 = pd.DataFrame({'C': [9, 10], 'D': [11, 12]})
horizontal = pd.concat([df1, df3], axis=1)
print("\n=== Horizontal concat ===")
print(horizontal)

join (index-based)

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'B': [4, 5, 6]}, index=['a', 'b', 'd'])

# Join on the index
joined = df1.join(df2, how='outer')
print("=== Join ===")
print(joined)

Grouping (GroupBy)

Basic Grouping

import pandas as pd
import numpy as np

np.random.seed(42)  # reproducible sample data

# Sales data
data = {
    'date': pd.date_range('2024-01-01', periods=20),
    'product': np.random.choice(['A', 'B', 'C'], 20),
    'region': np.random.choice(['Seoul', 'Busan', 'Daegu'], 20),
    'sales': np.random.randint(100, 1000, 20)
}
df = pd.DataFrame(data)

# Total sales per product
product_sales = df.groupby('product')['sales'].sum()
print("=== Sales by product ===")
print(product_sales)

# Multiple aggregation functions
product_stats = df.groupby('product')['sales'].agg(['sum', 'mean', 'count'])
print("\n=== Stats by product ===")
print(product_stats)

# Group by multiple columns
region_product = df.groupby(['region', 'product'])['sales'].sum()
print("\n=== Sales by region and product ===")
print(region_product)

Advanced Grouping

# Apply different functions to different columns
agg_dict = {
    'sales': ['sum', 'mean', 'max'],
    'product': 'count'
}
result = df.groupby('region').agg(agg_dict)
print(result)

# Custom aggregation function
def range_func(x):
    return x.max() - x.min()

custom_agg = df.groupby('product')['sales'].agg([
    ('total', 'sum'),
    ('average', 'mean'),
    ('range', range_func)
])
print(custom_agg)

# Per-group transform
df['sales_mean_by_product'] = df.groupby('product')['sales'].transform('mean')
df['sales_vs_avg'] = df['sales'] - df['sales_mean_by_product']

# Per-group ranking
df['rank_in_product'] = df.groupby('product')['sales'].rank(ascending=False)

Pivot Tables

# Create a pivot table
pivot = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0
)
print("=== Pivot table ===")
print(pivot)

# Multiple aggregations
pivot_multi = pd.pivot_table(
    df,
    values='sales',
    index='region',
    columns='product',
    aggfunc=['sum', 'mean'],
    fill_value=0
)

Practical Examples

Example 1: Cleaning Customer Data

import pandas as pd
import numpy as np

# Messy customer data
data = {
    'customer_id': [1, 2, 3, 4, 5, 5, 6, None, 8],
    'name': [' Alice ', 'BOB', 'Charlie', None, 'Eve', 'Eve', 'Frank', 'Grace', 'Henry'],
    'email': ['alice@test.com', 'bob@TEST.COM', 'charlie@test', 'dave@test.com',
              'eve@test.com', 'eve@test.com', None, 'grace@test.com', 'henry@test.com'],
    'age': ['25', '30', 'thirty', '28', '32', '32', '27', '29', None],
    'purchase_amount': ['1000', '1,500', '2000', '500', '3000', '3000', '1200', '800', '1500'],
    'join_date': ['2024-01-01', '2024/01/15', '2024.02.01', None, '2024-03-01',
                  '2024-03-01', '2024-03-15', '2024-04-01', '2024-04-15']
}
df = pd.DataFrame(data)

print("=== Original data (many problems) ===")
print(df)
print(f"\nShape: {df.shape}")

# Step 1: remove duplicates
print("\n[Step 1] Removing duplicates")
df = df.drop_duplicates(subset=['customer_id', 'email'], keep='first')
print(f"After removing duplicates: {df.shape}")

# Step 2: handle missing values
print("\n[Step 2] Handling missing values")
print(f"Missing value counts:\n{df.isnull().sum()}")

# Drop rows with a missing customer_id (required field)
df = df.dropna(subset=['customer_id'])

# Fill missing names
df['name'] = df['name'].fillna('Unknown')

# Fill missing emails
df['email'] = df['email'].fillna('no-email@unknown.com')

# Step 3: convert data types
print("\n[Step 3] Converting data types")

# customer_id to integer
df['customer_id'] = df['customer_id'].astype(int)

# age: handle non-numeric values
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['age'] = df['age'].fillna(df['age'].median())

# purchase_amount: strip commas, then convert to numbers
df['purchase_amount'] = df['purchase_amount'].str.replace(',', '').astype(float)

# join_date: normalize the mixed separators, then parse
df['join_date'] = pd.to_datetime(
    df['join_date'].str.replace(r'[/.]', '-', regex=True), errors='coerce'
)
df['join_date'] = df['join_date'].fillna(pd.Timestamp('2024-01-01'))

# Step 4: clean strings
print("\n[Step 4] Cleaning strings")

# name: strip whitespace, title-case
df['name'] = df['name'].str.strip().str.title()

# email: normalize to lowercase
df['email'] = df['email'].str.lower()

# Step 5: validate emails
print("\n[Step 5] Validating emails")
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
df['email_valid'] = df['email'].str.match(email_pattern)
print(f"Invalid emails: {(~df['email_valid']).sum()}")

# Step 6: create derived variables
print("\n[Step 6] Creating derived variables")
df['customer_value'] = pd.cut(
    df['purchase_amount'],
    bins=[0, 1000, 2000, float('inf')],
    labels=['Low', 'Medium', 'High']
)

df['days_since_join'] = (pd.Timestamp.now() - df['join_date']).dt.days

print("\n=== Cleaned data ===")
print(df)
print(f"\nFinal shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")

# Cleaning summary
print("\n=== Cleaning summary ===")
print(f"- Duplicates removed: {len(data['customer_id']) - len(df)} rows")
print("- Missing values handled")
print("- Data types unified")
print("- Strings normalized")
print("- Emails validated")

Example 2: Integrating and Analyzing Sales Data

import pandas as pd
import numpy as np

# Online sales data
online_sales = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004],
    'customer_id': [101, 102, 101, 103],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'amount': [1500000, 30000, 80000, 400000],
    'date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18'],
    'channel': ['online'] * 4
})

# Offline sales data
offline_sales = pd.DataFrame({
    'order_id': [2001, 2002, 2003],
    'customer_id': [101, 104, 102],
    'product': ['Mouse', 'Laptop', 'Keyboard'],
    'amount': [35000, 1600000, 75000],
    'date': ['2024-01-16', '2024-01-17', '2024-01-18'],
    'channel': ['offline'] * 3
})

# Customer information
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'name': ['Kim Cheolsu', 'Lee Younghee', 'Park Minsu', 'Jung Jieun', 'Choi Hojin'],
    'grade': ['VIP', 'Regular', 'Regular', 'VIP', 'Regular'],
    'region': ['Seoul', 'Busan', 'Daegu', 'Seoul', 'Incheon']
})

print("=== 1. ๋ฐ์ดํ„ฐ ํ†ตํ•ฉ ===")
# ์˜จ/์˜คํ”„๋ผ์ธ ๋ฐ์ดํ„ฐ ํ†ตํ•ฉ
all_sales = pd.concat([online_sales, offline_sales], ignore_index=True)
print(f"์ „์ฒด ํŒ๋งค ๊ฑด์ˆ˜: {len(all_sales)}")

# ๋‚ ์งœ ํƒ€์ž… ๋ณ€ํ™˜
all_sales['date'] = pd.to_datetime(all_sales['date'])

# ๊ณ ๊ฐ ์ •๋ณด์™€ ๋ณ‘ํ•ฉ
sales_with_customer = pd.merge(
all_sales,
customers,
on='customer_id',
how='left'
)

print("\n=== 2. ํ†ตํ•ฉ ๋ฐ์ดํ„ฐ ===")
print(sales_with_customer)

print("\n=== 3. ์ฑ„๋„๋ณ„ ๋ถ„์„ ===")
channel_stats = sales_with_customer.groupby('channel').agg({
'amount': ['sum', 'mean', 'count'],
'customer_id': 'nunique'
})
channel_stats.columns = ['์ด๋งค์ถœ', 'ํ‰๊ท ๋งค์ถœ', '๊ฑฐ๋ž˜๊ฑด์ˆ˜', '๊ณ ๊ฐ์ˆ˜']
print(channel_stats)

print("\n=== 4. ๊ณ ๊ฐ๋ณ„ ๋ถ„์„ ===")
customer_stats = sales_with_customer.groupby(['customer_id', 'name', 'grade']).agg({
'amount': 'sum',
'order_id': 'count'
})
customer_stats.columns = ['์ด๊ตฌ๋งค์•ก', '๊ตฌ๋งคํšŸ์ˆ˜']
customer_stats = customer_stats.sort_values('์ด๊ตฌ๋งค์•ก', ascending=False)
print(customer_stats)

print("\n=== 5. ์ œํ’ˆ๋ณ„ ๋ถ„์„ ===")
product_stats = sales_with_customer.groupby('product').agg({
'amount': ['sum', 'count'],
'customer_id': 'nunique'
})
product_stats.columns = ['์ด๋งค์ถœ', 'ํŒ๋งค๊ฑด์ˆ˜', '๊ตฌ๋งค๊ณ ๊ฐ์ˆ˜']
product_stats = product_stats.sort_values('์ด๋งค์ถœ', ascending=False)
print(product_stats)

print("\n=== 6. VIP vs ์ผ๋ฐ˜ ๊ณ ๊ฐ ===")
grade_comparison = sales_with_customer.groupby('grade').agg({
'amount': ['sum', 'mean'],
'order_id': 'count'
})
grade_comparison.columns = ['์ด๋งค์ถœ', 'ํ‰๊ท ๊ตฌ๋งค์•ก', '๊ตฌ๋งคํšŸ์ˆ˜']
print(grade_comparison)

print("\n=== 7. ์ง€์—ญ๋ณ„ ๋ถ„์„ ===")
region_sales = sales_with_customer.groupby('region')['amount'].sum().sort_values(ascending=False)
print(region_sales)

# ํ”ผ๋ฒ— ํ…Œ์ด๋ธ”: ์ง€์—ญ๋ณ„ ร— ์ฑ„๋„๋ณ„
print("\n=== 8. ์ง€์—ญ๋ณ„ ์ฑ„๋„๋ณ„ ๋งค์ถœ ===")
pivot = pd.pivot_table(
sales_with_customer,
values='amount',
index='region',
columns='channel',
aggfunc='sum',
fill_value=0
)
print(pivot)

Example 3: Time Series Preprocessing

import pandas as pd
import numpy as np

# Irregular time series data (sensor readings)
dates = pd.date_range('2024-01-01', periods=100, freq='h')  # hourly
np.random.seed(42)

# Base signal; gaps and outliers are injected below
data = {
    'timestamp': dates,
    'temperature': np.random.normal(20, 5, 100),
    'humidity': np.random.normal(60, 10, 100)
}
df = pd.DataFrame(data)

# Inject missing values and outliers
df.loc[10:15, 'temperature'] = np.nan
df.loc[30, 'temperature'] = 100   # outlier
df.loc[50:52, 'humidity'] = np.nan
df.loc[70, 'humidity'] = -50      # outlier

print("=== Original data ===")
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}")

# Step 1: use the timestamp as the index
df.set_index('timestamp', inplace=True)

# Step 2: detect and handle outliers (IQR method)
print("\n[Outlier handling]")
for col in ['temperature', 'humidity']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = (df[col] < lower_bound) | (df[col] > upper_bound)
    print(f"{col} outliers: {outliers.sum()}")

    # Replace outliers with NaN
    df.loc[outliers, col] = np.nan

# Step 3: interpolate missing values
print("\n[Interpolating missing values]")
df['temperature'] = df['temperature'].interpolate(method='time')
df['humidity'] = df['humidity'].interpolate(method='time')
print(f"Missing values after interpolation: {df.isnull().sum().sum()}")

# Step 4: smooth out noise with a moving average
df['temp_smooth'] = df['temperature'].rolling(window=3).mean()
df['humid_smooth'] = df['humidity'].rolling(window=3).mean()

# Step 5: statistics by hour of day
df['hour'] = df.index.hour
hourly_stats = df.groupby('hour').agg({
    'temperature': ['mean', 'std'],
    'humidity': ['mean', 'std']
})

print("\n=== Hourly statistics ===")
print(hourly_stats.head(10))

# Step 6: daily aggregation
daily = df.resample('D').agg({
    'temperature': ['mean', 'min', 'max'],
    'humidity': ['mean', 'min', 'max']
})

print("\n=== Daily aggregation ===")
print(daily)

print("\n=== Preprocessing complete ===")
print(f"Final shape: {df.shape}")
print(f"Missing values: {df.isnull().sum().sum()}")

Data Preprocessing Pipeline

import pandas as pd

class DataCleaningPipeline:
    """Data preprocessing pipeline"""

    def __init__(self, df):
        self.df = df.copy()
        self.logs = []

    def log(self, message):
        """Record a log entry"""
        self.logs.append(message)
        print(f"✓ {message}")

    def remove_duplicates(self, subset=None):
        """Remove duplicates"""
        before = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset)
        removed = before - len(self.df)
        self.log(f"Removed {removed} duplicate rows")
        return self

    def handle_missing(self, strategy='drop', fill_value=None):
        """Handle missing values"""
        missing_before = self.df.isnull().sum().sum()

        if strategy == 'drop':
            self.df = self.df.dropna()
        elif strategy == 'fill':
            self.df = self.df.fillna(fill_value)
        elif strategy == 'mean':
            numeric_cols = self.df.select_dtypes(include='number').columns
            self.df[numeric_cols] = self.df[numeric_cols].fillna(
                self.df[numeric_cols].mean()
            )

        missing_after = self.df.isnull().sum().sum()
        self.log(f"Missing values: {missing_before} → {missing_after}")
        return self

    def convert_types(self, type_dict):
        """Convert data types"""
        for col, dtype in type_dict.items():
            if col in self.df.columns:
                self.df[col] = self.df[col].astype(dtype)
                self.log(f"Converted {col} → {dtype}")
        return self

    def clean_strings(self, columns):
        """Clean string columns"""
        for col in columns:
            if col in self.df.columns:
                self.df[col] = (self.df[col]
                                .str.strip()
                                .str.lower())
                self.log(f"Cleaned strings in {col}")
        return self

    def remove_outliers(self, columns, method='iqr'):
        """Remove outliers"""
        for col in columns:
            if method == 'iqr':
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower = Q1 - 1.5 * IQR
                upper = Q3 + 1.5 * IQR

                before = len(self.df)
                self.df = self.df[(self.df[col] >= lower) & (self.df[col] <= upper)]
                removed = before - len(self.df)
                self.log(f"Removed {removed} outlier rows from {col}")
        return self

    def get_result(self):
        """Return the cleaned DataFrame"""
        print("\n=== Preprocessing complete ===")
        for log in self.logs:
            print(f"  {log}")
        return self.df

# Usage example
data = {
    'name': [' Alice ', 'BOB', 'Charlie', 'Alice', None],
    'age': [25, 30, 200, 25, 28],  # 200 is an outlier
    'email': ['alice@test.com', 'BOB@TEST.COM', 'charlie@test.com', 'alice@test.com', None]
}
df = pd.DataFrame(data)

# Run the pipeline
cleaned_df = (DataCleaningPipeline(df)
              .remove_duplicates(subset=['name', 'email'])
              .handle_missing(strategy='drop')
              .clean_strings(['name', 'email'])
              .remove_outliers(['age'])
              .get_result())

print("\nCleaned data:")
print(cleaned_df)

Frequently Asked Questions

When should missing values be dropped, and when should they be filled?

Drop them when:

  • less than about 5% of the values are missing
  • the values are missing at random
  • the column is not important

Fill them when:

  • a large share is missing (20% or more)
  • the data is a time series
  • the feature is important (see the sketch just after this list)
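
As a rough sketch of this rule of thumb, the decision can be automated per column. The helper and thresholds below are illustrative assumptions, not fixed rules:

# Hypothetical helper: drop rows where a column has few missing values,
# impute numeric columns that have many (thresholds are assumptions)
def handle_missing_by_ratio(df, drop_below=0.05, fill_above=0.20):
    for col in df.columns:
        ratio = df[col].isnull().mean()
        if ratio == 0:
            continue
        if ratio < drop_below:
            df = df.dropna(subset=[col])                # few gaps: drop those rows
        elif ratio >= fill_above and df[col].dtype.kind in 'if':
            df[col] = df[col].fillna(df[col].median())  # many gaps: impute
    return df

Columns that fall between the two thresholds are left for case-by-case judgment.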

What is the difference between merge and join?

# merge: join keys are specified explicitly
pd.merge(df1, df2, on='key')

# join: index-based (more concise)
df1.join(df2)
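
Under the hood, join is essentially a merge on the index. A small sketch for intuition (the sample frames are made up):

df1 = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['a', 'b'])

# These two calls produce the same result
joined = df1.join(df2)
merged = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
print(joined.equals(merged))  # True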

groupby is slow. What can I do?

# Slow: apply runs a Python function once per group
df.groupby('category').apply(complex_function)

# Fast: built-in aggregations ('sum', 'mean', ...) are vectorized
df.groupby('category').agg({'column': 'sum'})

# If you only need a single group, filter first instead of grouping everything
import numpy as np
subset = df[df['category'] == 'A']
result = np.sum(subset['column'])

Next Steps

Once you are comfortable with data preprocessing:

  1. Advanced analysis: statistical analysis and hypothesis testing
  2. Feature engineering: creating variables for machine learning
  3. Data visualization: visualizing preprocessing results
  4. Automation: building preprocessing pipelines
