Getting Started with Pandas 🐼
Pandas is one of the most widely used Python libraries for data analysis. It makes tabular data easy to work with.
What is Pandas?
The name Pandas is derived from "panel data". It is a library for working efficiently with structured data.
Key Features
- DataFrame: a tabular data structure
- Support for many file formats: CSV, Excel, JSON, SQL, and more
- Powerful data manipulation: filtering, grouping, merging
- Missing-value handling: easy ways to deal with missing data
Installation
pip install pandas
import pandas as pd
# Check the version
print(pd.__version__)  # e.g. 2.0.3
Core Data Structures
Series (1-dimensional)
A Series is a one-dimensional labeled array. Each column of a DataFrame is a Series.
import pandas as pd
# Create from a list
s = pd.Series([10, 20, 30, 40, 50])
print(s)
# 0    10
# 1    20
# 2    30
# 3    40
# 4    50
# dtype: int64
# Specify an index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)
# a    10
# b    20
# c    30
# dtype: int64
# Create from a dictionary
data = {'Seoul': 9800000, 'Busan': 3400000, 'Incheon': 2900000}
s = pd.Series(data)
print(s)
# Seoul      9800000
# Busan      3400000
# Incheon    2900000
# dtype: int64
# Access
print(s['Seoul'])             # 9800000 (by label)
print(s.iloc[0])              # 9800000 (by position; s[0] is deprecated on a labeled index)
print(s[['Seoul', 'Busan']])  # select multiple items
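To make the difference between label-based and position-based access concrete, here is a small sketch that reuses the city populations from the example above. The variable names `populations`, `in_millions`, and `large_cities` are illustrative only.

```python
import pandas as pd

# Same illustrative city populations as in the Series example above
populations = pd.Series({'Seoul': 9800000, 'Busan': 3400000, 'Incheon': 2900000})

# Label-based access with .loc, position-based access with .iloc
by_label = populations.loc['Busan']   # 3400000
by_position = populations.iloc[0]     # 9800000 (first element)

# Arithmetic is applied element-wise and keeps the index labels
in_millions = populations / 1_000_000
print(in_millions)

# A boolean mask selects only the matching elements
large_cities = populations[populations > 3_000_000]
print(large_cities.index.tolist())    # ['Seoul', 'Busan']
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether a value such as `0` is meant as a label or as a position.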
DataFrame (2-dimensional)
A DataFrame is a two-dimensional table made of rows and labeled columns.
import pandas as pd
# Create from a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 28],
    'city': ['Seoul', 'Busan', 'Incheon', 'Seoul']
}
df = pd.DataFrame(data)
print(df)
#       name  age     city
# 0    Alice   25    Seoul
# 1      Bob   30    Busan
# 2  Charlie   35  Incheon
# 3    David   28    Seoul
# Create from a list of rows
data = [
    ['Alice', 25, 'Seoul'],
    ['Bob', 30, 'Busan'],
    ['Charlie', 35, 'Incheon']
]
df = pd.DataFrame(data, columns=['name', 'age', 'city'])
print(df)
# Create from a CSV file (covered in detail later)
# df = pd.read_csv('data.csv')
DataFrame Basic Information
import pandas as pd
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000, 60000, 75000, 55000, 65000],
    'department': ['IT', 'HR', 'IT', 'Sales', 'HR']
}
df = pd.DataFrame(data)
# Basic information
print(df.shape)    # (5, 4): number of rows, number of columns
print(df.columns)  # Index(['name', 'age', 'salary', 'department'])
print(df.index)    # RangeIndex(start=0, stop=5, step=1)
print(df.dtypes)   # data type of each column
# View the first/last few rows
print(df.head(3))  # first 3 rows
print(df.tail(2))  # last 2 rows
# Summary information
df.info()             # overall summary; info() prints directly and returns None
print(df.describe())  # statistics for the numeric columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   name        5 non-null      object
 1   age         5 non-null      int64
 2   salary      5 non-null      int64
 3   department  5 non-null      object
dtypes: int64(2), object(2)
Data Selection
Column Selection
# A single column (Series)
print(df['name'])
# Or
print(df.name)
# Multiple columns (DataFrame)
print(df[['name', 'age']])
# Add a column
df['bonus'] = df['salary'] * 0.1
print(df)
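Row Selection
# Select by integer position (iloc)
print(df.iloc[0])          # first row
print(df.iloc[0:3])        # rows 0 through 2
print(df.iloc[[0, 2, 4]])  # rows 0, 2, and 4
# Select by label (loc). The result is kept in a new variable so that
# df still has its 'name' column for the filtering examples that follow.
df_by_name = df.set_index('name')
print(df_by_name.loc['Alice'])           # the Alice row
print(df_by_name.loc[['Alice', 'Bob']])  # multiple rows
# Specific rows and columns
print(df_by_name.iloc[0:2, 0:2])         # first 2 rows, first 2 columns
print(df_by_name.loc['Alice', 'age'])    # Alice's age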
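One detail that often surprises newcomers is that slices behave differently under `loc` and `iloc`. The sketch below uses a small hypothetical frame (`people`) that mirrors the columns above.

```python
import pandas as pd

# Hypothetical frame indexed by name, mirroring the columns used above
people = pd.DataFrame(
    {'age': [25, 30, 35, 28], 'city': ['Seoul', 'Busan', 'Incheon', 'Seoul']},
    index=['Alice', 'Bob', 'Charlie', 'David'],
)

# loc slices by label and includes BOTH endpoints
by_label = people.loc['Alice':'Charlie']

# iloc slices by position and excludes the stop position, like a Python list
by_position = people.iloc[0:2]

print(by_label.index.tolist())     # ['Alice', 'Bob', 'Charlie']
print(by_position.index.tolist())  # ['Alice', 'Bob']
```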
Conditional Selection (Filtering)
# Single condition
print(df[df['age'] > 30])
print(df[df['department'] == 'IT'])
# Multiple conditions (wrap each condition in parentheses)
print(df[(df['age'] > 25) & (df['salary'] > 55000)])
print(df[(df['department'] == 'IT') | (df['department'] == 'HR')])
# Using isin()
print(df[df['department'].isin(['IT', 'Sales'])])
# String conditions
print(df[df['name'].str.startswith('A')])
print(df[df['name'].str.contains('a')])
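As a complement to the filters above, the following sketch shows negation with `~` and the string-based `query()` method. The `staff` frame is a reduced, illustrative copy of the employee data.

```python
import pandas as pd

staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'department': ['IT', 'HR', 'IT', 'Sales', 'HR'],
})

# ~ negates a boolean mask: everyone NOT in IT
not_it = staff[~(staff['department'] == 'IT')]

# Each condition needs its own parentheses because & and | bind
# more tightly than comparison operators such as > and ==
older_hr = staff[(staff['department'] == 'HR') & (staff['age'] > 30)]

# query() expresses the same filter as a string
older_hr_query = staff.query("department == 'HR' and age > 30")

print(not_it['name'].tolist())    # ['Bob', 'David', 'Eve']
print(older_hr['name'].tolist())  # ['Eve']
```

Whether to use a mask or `query()` is largely a matter of taste; `query()` tends to read better once several conditions are combined.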
Reading/Writing Data
CSV Files
import pandas as pd
# Read a CSV file
df = pd.read_csv('employees.csv')
# Specify options
df = pd.read_csv(
    'employees.csv',
    encoding='utf-8',               # file encoding
    index_col='id',                 # column to use as the index
    usecols=['id', 'name', 'age'],  # read only these columns (must include the index column)
    na_values=['NA', 'N/A']         # values to treat as missing
)
# Write a CSV file
df.to_csv('output.csv', index=False, encoding='utf-8-sig')
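To try these options without creating a file, the sketch below reads from an in-memory string instead. The CSV text is made-up data standing in for `employees.csv`.

```python
import io
import pandas as pd

# Inline CSV text standing in for a file such as employees.csv (made-up data)
csv_text = """id,name,age
1,Alice,25
2,Bob,N/A
3,Charlie,35
"""

employees = pd.read_csv(
    io.StringIO(csv_text),
    index_col='id',
    na_values=['NA', 'N/A'],  # treat these strings as missing values
)

print(employees['age'].isna().sum())  # 1 missing value
print(employees['age'].dtype)         # float64, because NaN is a float

# Writing to a buffer works the same way as writing to a file path
buffer = io.StringIO()
employees.to_csv(buffer)
print(buffer.getvalue())
```

Note how a single missing value turns the integer `age` column into floats; this is worth checking with `dtypes` after loading real data.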
Excel Files
pip install openpyxl
# Read an Excel file
df = pd.read_excel('employees.xlsx', sheet_name='Sheet1')
# Read every sheet: sheet_name=None returns a dict of {sheet name: DataFrame}
dfs = pd.read_excel('employees.xlsx', sheet_name=None)
for sheet_name, sheet_df in dfs.items():
    print(f"Sheet: {sheet_name}")
    print(sheet_df.head())
# Write an Excel file
df.to_excel('output.xlsx', sheet_name='Data', index=False)
# Write to multiple sheets (df1 and df2 stand for any two DataFrames)
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)
JSON Files
# Read a JSON file
df = pd.read_json('data.json')
# Write a JSON file
df.to_json('output.json', orient='records', indent=2)
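The `orient='records'` option used above writes one JSON object per row. A minimal round trip through an in-memory string, using an illustrative `products` frame, might look like this:

```python
import io
import json
import pandas as pd

products = pd.DataFrame({'product': ['A', 'B'], 'price': [1000, 1500]})

# orient='records' produces a list with one JSON object per row
json_text = products.to_json(orient='records')
print(json_text)

# The text parses with the standard json module...
rows = json.loads(json_text)
print(rows[0])  # {'product': 'A', 'price': 1000}

# ...and read_json rebuilds the rows and columns from it
restored = pd.read_json(io.StringIO(json_text))
print(restored)
```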
Basic Statistics
import pandas as pd
import numpy as np
data = {
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [1000, 1500, 2000, 1200, 1800],
    'quantity': [100, 150, 80, 120, 90]
}
df = pd.DataFrame(data)
# Basic statistics
print(df['price'].mean())    # mean
print(df['price'].median())  # median
print(df['price'].std())     # standard deviation
print(df['price'].min())     # minimum
print(df['price'].max())     # maximum
print(df['price'].sum())     # sum
# Several statistics at once
print(df[['price', 'quantity']].describe())
# Correlation
print(df[['price', 'quantity']].corr())
# Count of each value
print(df['product'].value_counts())
Sorting
# Here df is the employee DataFrame (name, age, salary, department) from the earlier sections
# Sort by values
sorted_df = df.sort_values('age')
sorted_df = df.sort_values('age', ascending=False)  # descending
# Sort by multiple columns
sorted_df = df.sort_values(['department', 'salary'], ascending=[True, False])
# Sort by index
sorted_df = df.sort_index()
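The following self-contained sketch shows what the multi-column sort produces on a small illustrative `staff` frame, and that sorting returns a new DataFrame rather than changing the original.

```python
import pandas as pd

staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'salary': [50000, 60000, 75000, 55000, 65000],
    'department': ['IT', 'HR', 'IT', 'Sales', 'HR'],
})

# Department A-Z, then highest salary first within each department
ranked = staff.sort_values(['department', 'salary'], ascending=[True, False])
print(ranked['name'].tolist())  # ['Eve', 'Bob', 'Charlie', 'Alice', 'David']

# sort_values returns a new DataFrame; the original order is unchanged
print(staff['name'].tolist())   # ['Alice', 'Bob', 'Charlie', 'David', 'Eve']

# sort_index restores the original row order from the index labels
restored = ranked.sort_index()
```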
Practical Examples
Example 1: Sales Data Analysis
import pandas as pd
import numpy as np
# Generate sample sales data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=100)
data = {
    'date': dates,
    'product': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'quantity': np.random.randint(1, 20, 100),
    'price': np.random.choice([1000, 1500, 2000, 2500], 100)
}
df = pd.DataFrame(data)
# Compute revenue
df['revenue'] = df['quantity'] * df['price']
print("=== Sales Data Summary ===")
print(df.head())
# Total revenue
total_revenue = df['revenue'].sum()
print(f"\nTotal revenue: {total_revenue:,} KRW")
# Quantity sold per product
product_sales = df.groupby('product')['quantity'].sum().sort_values(ascending=False)
print("\nQuantity sold per product:")
print(product_sales)
# Average unit price per product
avg_price = df.groupby('product')['price'].mean()
print("\nAverage unit price per product:")
print(avg_price)
# Date with the highest revenue (group once, then reuse the result)
daily_revenue = df.groupby('date')['revenue'].sum()
best_day = daily_revenue.idxmax()
best_revenue = daily_revenue.max()
print(f"\nBest revenue date: {best_day.date()} ({best_revenue:,} KRW)")
# Revenue ranking by product
product_revenue = df.groupby('product')['revenue'].sum().sort_values(ascending=False)
print("\nRevenue ranking by product:")
for i, (product, revenue) in enumerate(product_revenue.items(), 1):
    print(f"{i}. Product {product}: {revenue:,} KRW")
Example 2: Student Grade Management
import pandas as pd
# Student grade data
data = {
    'student_id': [1001, 1002, 1003, 1004, 1005],
    'name': ['김철수', '이영희', '박민수', '정지수', '최호진'],
    'math': [85, 92, 78, 95, 88],
    'english': [90, 88, 85, 92, 86],
    'science': [78, 95, 80, 88, 92]
}
df = pd.DataFrame(data)
print("=== Student Report Card ===")
print(df)
# Compute total and average scores
df['total'] = df[['math', 'english', 'science']].sum(axis=1)
df['average'] = df[['math', 'english', 'science']].mean(axis=1)
# Compute rank (ties share the lowest rank number)
df['rank'] = df['total'].rank(ascending=False, method='min')
# Sort by rank
df = df.sort_values('rank')
print("\n=== Results ===")
print(df[['name', 'total', 'average', 'rank']])
# Statistics per subject
print("\n=== Statistics per Subject ===")
subjects = ['math', 'english', 'science']
stats = df[subjects].agg(['mean', 'max', 'min', 'std'])
print(stats.round(2))
# Top students (average of 90 or higher)
excellent = df[df['average'] >= 90]
print(f"\n=== Top Students ({len(excellent)}) ===")
print(excellent[['name', 'average']])
# Top scorer per subject
print("\n=== Top Scorer per Subject ===")
for subject in subjects:
    top_student = df.loc[df[subject].idxmax()]
    print(f"{subject}: {top_student['name']} ({top_student[subject]} points)")
Example 3: Monthly Expense Analysis
import pandas as pd
# Expense data
data = {
    'date': pd.to_datetime([
        '2024-01-05', '2024-01-12', '2024-01-20',
        '2024-02-03', '2024-02-15', '2024-02-28',
        '2024-03-08', '2024-03-18', '2024-03-25'
    ]),
    'category': ['Food', 'Transport', 'Shopping', 'Food', 'Culture', 'Food', 'Transport', 'Shopping', 'Food'],
    'amount': [45000, 50000, 120000, 38000, 25000, 52000, 48000, 95000, 41000]
}
df = pd.DataFrame(data)
# Add month information
df['month'] = df['date'].dt.month
df['month_name'] = df['date'].dt.strftime('%Y-%m')
print("=== Expense Records ===")
print(df)
# Total expenses
total = df['amount'].sum()
print(f"\nTotal expenses: {total:,} KRW")
# Expenses by category
category_expense = df.groupby('category')['amount'].agg(['sum', 'count', 'mean'])
category_expense.columns = ['total', 'count', 'mean']
category_expense = category_expense.sort_values('total', ascending=False)
print("\n=== Expenses by Category ===")
print(category_expense)
# Expenses by month
monthly_expense = df.groupby('month_name')['amount'].sum()
print("\n=== Expenses by Month ===")
for month, amount in monthly_expense.items():
    print(f"{month}: {amount:,} KRW")
# Largest single expense
max_expense = df.loc[df['amount'].idxmax()]
print("\n=== Largest Expense ===")
print(f"{max_expense['date'].date()} - {max_expense['category']}: {max_expense['amount']:,} KRW")
# Budget analysis (monthly budget of 150,000 KRW)
budget = 150000
monthly_total = df.groupby('month_name')['amount'].sum()
print("\n=== Budget Analysis (monthly budget: 150,000 KRW) ===")
for month, amount in monthly_total.items():
    diff = budget - amount
    status = "within budget" if diff >= 0 else "over budget"
    print(f"{month}: {amount:,} KRW ({status}, {abs(diff):,} KRW)")
Example 4: Employee Data Analysis
import pandas as pd
import numpy as np
# Employee data (random, so the numbers differ on every run)
data = {
    'employee_id': range(1001, 1021),
    'name': [f'Employee {i}' for i in range(1, 21)],
    'department': np.random.choice(['IT', 'HR', 'Sales', 'Marketing'], 20),
    'position': np.random.choice(['Staff', 'Assistant Manager', 'Manager', 'Deputy General Manager', 'General Manager'], 20),
    'salary': np.random.randint(3000, 8000, 20) * 1000,
    'years': np.random.randint(1, 15, 20)
}
df = pd.DataFrame(data)
print("=== Employee Overview ===")
print(df.head(10))
# Headcount per department
dept_count = df['department'].value_counts()
print("\n=== Headcount per Department ===")
print(dept_count)
# Salary statistics per department
dept_salary = df.groupby('department')['salary'].agg(['mean', 'min', 'max'])
dept_salary.columns = ['mean', 'min', 'max']
print("\n=== Salary per Department ===")
print(dept_salary.round())
# Statistics per position
position_stats = df.groupby('position').agg({
    'salary': 'mean',
    'years': 'mean',
    'employee_id': 'count'
})
position_stats.columns = ['avg_salary', 'avg_years', 'headcount']
print("\n=== Statistics per Position ===")
print(position_stats.round())
# Analysis by experience band: (0, 3], (3, 7], (7, 15] years
df['experience_level'] = pd.cut(
    df['years'],
    bins=[0, 3, 7, 15],
    labels=['Junior', 'Mid-level', 'Senior']
)
# observed=False keeps empty bands in the result and avoids a FutureWarning on recent pandas
exp_salary = df.groupby('experience_level', observed=False)['salary'].mean()
print("\n=== Average Salary by Experience ===")
print(exp_salary.round())
# Top 5 earners
top5 = df.nlargest(5, 'salary')[['name', 'department', 'position', 'salary']]
print("\n=== Top 5 Salaries ===")
print(top5)
Useful Tips
1. Method Chaining
# Chain several operations together
result = (
    df
    .query('age > 25')
    .sort_values('salary', ascending=False)
    .head(10)
    .reset_index(drop=True)
)
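A runnable version of the chain above, on a small illustrative `staff` frame, with each step annotated:

```python
import pandas as pd

staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000, 60000, 75000, 55000, 65000],
})

top_earners = (
    staff
    .query('age > 25')                       # drop Alice (age 25)
    .sort_values('salary', ascending=False)  # highest salary first
    .head(2)                                 # keep the top two rows
    .reset_index(drop=True)                  # renumber rows from 0
)
print(top_earners['name'].tolist())  # ['Charlie', 'Eve']
```

Each method returns a new DataFrame, which is what makes the steps composable; `staff` itself is left untouched.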
2. The apply Function
# Apply a function to each value
df['salary_category'] = df['salary'].apply(
    lambda x: 'High' if x > 60000 else 'Normal'
)
# Use several columns (axis=1 passes each row to the function)
df['bonus'] = df.apply(
    lambda row: row['salary'] * 0.2 if row['department'] == 'Sales' else row['salary'] * 0.1,
    axis=1
)
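Row-wise `apply` calls a Python function once per row, which can be slow on large frames. As a sketch of an alternative, and not something the example above requires, the same bonus rule can be written with NumPy's `np.where`. The `staff` frame here is illustrative.

```python
import numpy as np
import pandas as pd

staff = pd.DataFrame({
    'salary': [50000, 60000, 75000, 55000],
    'department': ['IT', 'HR', 'IT', 'Sales'],
})

# Row-wise apply, as in the example above
bonus_apply = staff.apply(
    lambda row: row['salary'] * 0.2 if row['department'] == 'Sales' else row['salary'] * 0.1,
    axis=1,
)

# Vectorized equivalent: pick the rate per row, then multiply whole columns
rate = np.where(staff['department'] == 'Sales', 0.2, 0.1)
bonus_vectorized = staff['salary'] * rate

# Both approaches give the same values
print(np.allclose(bonus_apply, bonus_vectorized))
```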
3. Date Handling
df['date'] = pd.to_datetime('2024-01-01')  # every row gets the same date here
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_name'] = df['date'].dt.day_name()
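In practice, dates usually arrive as a column of strings. The sketch below converts such a column and extracts the same components; the `orders` frame and its dates are made up for illustration.

```python
import pandas as pd

# Hypothetical order dates stored as strings
orders = pd.DataFrame({'date': ['2024-01-01', '2024-02-14', '2024-03-31']})

# Convert the whole column to datetime in one call
orders['date'] = pd.to_datetime(orders['date'])

# The .dt accessor exposes date components for every row
orders['year'] = orders['date'].dt.year
orders['month'] = orders['date'].dt.month
orders['day_name'] = orders['date'].dt.day_name()

print(orders['day_name'].tolist())  # ['Monday', 'Wednesday', 'Sunday']
```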
Frequently Asked Questions
What is the difference between a DataFrame and Excel?
Pandas DataFrame:
- Controlled programmatically
- Can handle large datasets
- Easy to automate
- Supports complex operations
Excel:
- GUI-based
- Suited to small datasets
- Easy visual editing
- Intuitive formula entry
What is the difference between loc and iloc?
df = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
# loc: label-based
print(df.loc['a'])   # row with index label 'a'
# iloc: integer position-based
print(df.iloc[0])    # first row
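The distinction matters most when the index holds integers that do not match the row positions. A minimal sketch, using a hypothetical `scores` frame:

```python
import pandas as pd

# Integer labels that do not match the row positions
scores = pd.DataFrame({'A': [1, 2, 3]}, index=[10, 20, 30])

print(scores.loc[10, 'A'])  # 1 -> row whose label is 10
print(scores.iloc[0, 0])    # 1 -> row at position 0, column at position 0

# Label 0 does not exist, so loc raises KeyError while iloc[0] works
try:
    scores.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False
print(label_zero_found)     # False
```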
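Copy vs. view?
# Ambiguous: the filtered result may be a view or a copy of df,
# so the assignment may not reach the original
subset = df[df['age'] > 30]
subset['age'] = 99  # SettingWithCopyWarning
# Explicit copy (safe): changes never affect the original
subset_copy = df[df['age'] > 30].copy()
subset_copy['age'] = 99  # OK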
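When the goal is to change the original DataFrame, a common pattern is a single `.loc` assignment with the mask and the column together. The sketch below contrasts that with working on an explicit copy; `staff` and `young` are illustrative names.

```python
import pandas as pd

staff = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]})

# To change the original, assign through .loc with the mask and column together
staff.loc[staff['age'] > 30, 'age'] = 99
print(staff['age'].tolist())  # [25, 30, 99]

# To work on an independent subset, take an explicit copy first
young = staff[staff['age'] < 99].copy()
young['age'] = 0
print(staff['age'].tolist())  # still [25, 30, 99]
```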
How do I handle large datasets?
# Read in chunks (process is a placeholder for your own function)
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    process(chunk)
# Read only specific columns
df = pd.read_csv('large_file.csv', usecols=['col1', 'col2'])
# Specify data types
df = pd.read_csv('large_file.csv', dtype={'col1': 'int32'})
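As one possible shape for the `process` step, the sketch below accumulates a running total across chunks so that the whole file never has to be in memory at once. The in-memory CSV text is made-up data standing in for `large_file.csv`.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for large_file.csv (made-up data)
csv_text = "col1,col2\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
row_count = 0
# chunksize=4 yields DataFrames of at most 4 rows each: 4 + 4 + 2
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['col1'].sum()
    row_count += len(chunk)

print(total, row_count)  # 45 10
```

This works for statistics that can be combined across chunks, such as sums and counts; a median, for example, would need a different approach.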
Next Steps
Once you are comfortable with the Pandas basics, try learning the following:
- Data preprocessing: missing values, duplicates, transformations
- Data merging: merge, join, concat
- Grouping: advanced use of groupby
- Time series: handling date/time data
- Visualization: using Pandas together with Matplotlib