Getting Started with Pandas 🐼

Pandas is one of the most widely used Python libraries for data analysis. It makes tabular data easy to work with.

What is Pandas?

The name Pandas is derived from "panel data", an econometrics term for structured, multi-dimensional datasets. It is a library for handling structured data efficiently.

Key Features

  • DataFrame: a tabular data structure
  • Support for many file formats: CSV, Excel, JSON, SQL, and more
  • Powerful data manipulation: filtering, grouping, merging
  • Missing-value handling: tools for dealing with incomplete data
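
As a rough illustration of the features listed above, the sketch below builds a small DataFrame with one missing value, fills it, and then filters and groups the result. The column names (`team`, `score`) and the data are made up for this example, and filling with the column mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; np.nan marks a missing score.
df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue"],
    "score": [10.0, np.nan, 30.0, 20.0],
})

# Missing-value handling: count the gaps, then fill them with the column mean.
n_missing = int(df["score"].isna().sum())
filled = df.assign(score=df["score"].fillna(df["score"].mean()))

# Filtering: keep rows whose score is at least 20.
high = filled[filled["score"] >= 20]

# Grouping: total score per team.
team_totals = filled.groupby("team")["score"].sum()

print(n_missing)
print(high)
print(team_totals)
```

The original `df` is left unchanged because `assign` returns a new DataFrame.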

Installation

pip install pandas

import pandas as pd

# Check the installed version
print(pd.__version__)  # e.g. 2.0.3

Core Data Structures

Series (1-dimensional)

A Series is a one-dimensional labeled array. Each column of a DataFrame is a Series.

import pandas as pd

# Create from a list
s = pd.Series([10, 20, 30, 40, 50])
print(s)
# 0    10
# 1    20
# 2    30
# 3    40
# 4    50
# dtype: int64

# Specify the index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)
# a    10
# b    20
# c    30
# dtype: int64

# Create from a dictionary
data = {'Seoul': 9800000, 'Busan': 3400000, 'Incheon': 2900000}
s = pd.Series(data)
print(s)
# Seoul      9800000
# Busan      3400000
# Incheon    2900000
# dtype: int64

# Access values
print(s['Seoul'])             # 9800000 (by label)
print(s.iloc[0])              # 9800000 (by position; s[0] is deprecated on a labeled index)
print(s[['Seoul', 'Busan']])  # select several values
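
Beyond element access, a Series supports element-wise arithmetic and aligns on labels when two Series are combined. The following is a minimal sketch of that behavior; the `growth` figures are invented for illustration.

```python
import pandas as pd

population = pd.Series({"Seoul": 9800000, "Busan": 3400000, "Incheon": 2900000})

# Arithmetic applies to every element at once.
in_millions = population / 1_000_000

# A boolean mask keeps only the matching labels.
large = population[population > 3_000_000]

# Operations between two Series align on index labels, not on position.
# A label that appears in only one operand produces NaN.
growth = pd.Series({"Busan": 100000, "Seoul": 200000})  # hypothetical figures
updated = population + growth

print(in_millions)
print(large.index.tolist())
print(updated)
```

Note that `updated` becomes a float Series, because NaN cannot be stored in an integer column.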

DataFrame (2-dimensional)

A DataFrame is a two-dimensional table with labeled rows and columns.

import pandas as pd

# Create from a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 28],
    'city': ['Seoul', 'Busan', 'Incheon', 'Seoul']
}
df = pd.DataFrame(data)
print(df)
#       name  age     city
# 0    Alice   25    Seoul
# 1      Bob   30    Busan
# 2  Charlie   35  Incheon
# 3    David   28    Seoul

# Create from a list of rows
data = [
    ['Alice', 25, 'Seoul'],
    ['Bob', 30, 'Busan'],
    ['Charlie', 35, 'Incheon']
]
df = pd.DataFrame(data, columns=['name', 'age', 'city'])
print(df)

# Create from a CSV file (covered in more detail later)
# df = pd.read_csv('data.csv')
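
Another common input shape is a list of dictionaries, one per row, such as records parsed from an API response. This sketch assumes made-up records; a key missing from a record becomes NaN in that row.

```python
import pandas as pd

# Hypothetical records; the second one has no "city" key.
records = [
    {"name": "Alice", "age": 25, "city": "Seoul"},
    {"name": "Bob", "age": 30},
]
df = pd.DataFrame(records)

print(df)
print(df.shape)
```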

DataFrame Basic Information

import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000, 60000, 75000, 55000, 65000],
    'department': ['IT', 'HR', 'IT', 'Sales', 'HR']
}
df = pd.DataFrame(data)

# Basic information
print(df.shape)    # (5, 4) - number of rows and columns
print(df.columns)  # Index(['name', 'age', 'salary', 'department'])
print(df.index)    # RangeIndex(start=0, stop=5, step=1)
print(df.dtypes)   # data type of each column

# View the first/last few rows
print(df.head(3))  # first 3 rows
print(df.tail(2))  # last 2 rows

# Summary information
df.info()             # column overview; info() prints directly and returns None
print(df.describe())  # statistics for the numeric columns

Output of df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   name        5 non-null      object
 1   age         5 non-null      int64
 2   salary      5 non-null      int64
 3   department  5 non-null      object
dtypes: int64(2), object(2)

Data Selection

Column Selection

# A single column (returns a Series)
print(df['name'])
# or, via attribute access
print(df.name)

# Several columns (returns a DataFrame)
print(df[['name', 'age']])

# Add a column
df['bonus'] = df['salary'] * 0.1
print(df)

Row Selection

# Select by integer position (iloc)
print(df.iloc[0])          # first row
print(df.iloc[0:3])        # rows 0 through 2
print(df.iloc[[0, 2, 4]])  # rows 0, 2, and 4

# Select by label (loc)
# Keep the name-indexed frame in its own variable so df still has a 'name' column
df_by_name = df.set_index('name')
print(df_by_name.loc['Alice'])           # the Alice row
print(df_by_name.loc[['Alice', 'Bob']])  # several rows

# Specific rows and columns
print(df.iloc[0:2, 0:2])               # first 2 rows, first 2 columns
print(df_by_name.loc['Alice', 'age'])  # Alice's age
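
One detail that often surprises newcomers is that slices behave differently in the two indexers: `iloc` excludes the stop position, while `loc` includes the stop label. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 30, 35, 28], "salary": [50000, 60000, 75000, 55000]},
    index=["Alice", "Bob", "Charlie", "David"],
)

# Positional slice: positions 0 and 1, the stop (2) is excluded.
by_position = df.iloc[0:2]

# Label slice: Alice through Charlie, the stop label is included.
by_label = df.loc["Alice":"Charlie"]

print(by_position.index.tolist())
print(by_label.index.tolist())
```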

Conditional Selection (Filtering)

# Single condition
print(df[df['age'] > 30])
print(df[df['department'] == 'IT'])

# Multiple conditions (wrap each condition in parentheses; use & for AND, | for OR)
print(df[(df['age'] > 25) & (df['salary'] > 55000)])
print(df[(df['department'] == 'IT') | (df['department'] == 'HR')])

# Using isin()
print(df[df['department'].isin(['IT', 'Sales'])])

# String conditions
print(df[df['name'].str.startswith('A')])
print(df[df['name'].str.contains('a')])
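
A few more filtering idioms are worth knowing: `~` negates a condition, `between()` tests an inclusive range, and `query()` expresses the same filter as a string. The sketch below reuses a trimmed-down version of the employee data.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, 35, 28, 32],
    "department": ["IT", "HR", "IT", "Sales", "HR"],
})

# Negate a condition with ~ (everyone who is not in IT).
not_it = df[~(df["department"] == "IT")]

# between() is inclusive on both ends by default.
late_twenties = df[df["age"].between(25, 30)]

# The same range filter written as a query string.
same_with_query = df.query("age >= 25 and age <= 30")

print(not_it["name"].tolist())
print(late_twenties["name"].tolist())
```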

Reading/Writing Data

CSV Files

import pandas as pd

# Read a CSV file
df = pd.read_csv('employees.csv')

# Specify options
df = pd.read_csv(
    'employees.csv',
    encoding='utf-8',               # file encoding
    index_col='id',                 # column to use as the index
    usecols=['id', 'name', 'age'],  # read only these columns (must include index_col)
    na_values=['NA', 'N/A']         # strings to treat as missing values
)

# Write a CSV file (utf-8-sig adds a BOM so Excel detects the encoding)
df.to_csv('output.csv', index=False, encoding='utf-8-sig')
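
To try these options without a file on disk, the CSV text can be fed through `io.StringIO`. This is a self-contained sketch with invented data, not the `employees.csv` file referenced above.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = """id,name,age,city
1,Alice,25,Seoul
2,Bob,NA,Busan
3,Charlie,N/A,Incheon
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="id",
    usecols=["id", "name", "age"],  # "city" is skipped
    na_values=["NA", "N/A"],
)
n_missing = int(df["age"].isna().sum())
print(df)

# Write back to an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer)
print(buffer.getvalue())
```

Because the `age` column contains missing values, pandas reads it as float rather than int.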

Excel Files

pip install openpyxl

# Read an Excel file
df = pd.read_excel('employees.xlsx', sheet_name='Sheet1')

# Read all sheets (sheet_name=None returns a dict of DataFrames)
dfs = pd.read_excel('employees.xlsx', sheet_name=None)
for sheet_name, sheet_df in dfs.items():
    print(f"Sheet: {sheet_name}")
    print(sheet_df.head())

# Write an Excel file
df.to_excel('output.xlsx', sheet_name='Data', index=False)

# Write to several sheets (df1 and df2 stand for any two existing DataFrames)
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)

JSON Files

# Read a JSON file
df = pd.read_json('data.json')

# Write a JSON file (orient='records' writes a list of row objects)
df.to_json('output.json', orient='records', indent=2)
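
The round trip can also be checked in memory: `to_json()` returns a string when no path is given, and `read_json()` accepts a file-like object. A minimal sketch with made-up data:

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# With orient="records", each row becomes one JSON object in a list.
json_text = df.to_json(orient="records")
print(json_text)

# Read the JSON text back into a DataFrame.
restored = pd.read_json(io.StringIO(json_text))
print(restored)
```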

Basic Statistics

import pandas as pd

data = {
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [1000, 1500, 2000, 1200, 1800],
    'quantity': [100, 150, 80, 120, 90]
}
df = pd.DataFrame(data)

# Basic statistics
print(df['price'].mean())    # mean
print(df['price'].median())  # median
print(df['price'].std())     # standard deviation
print(df['price'].min())     # minimum
print(df['price'].max())     # maximum
print(df['price'].sum())     # sum

# Several statistics at once
print(df[['price', 'quantity']].describe())

# Correlation
print(df[['price', 'quantity']].corr())

# Count of each distinct value
print(df['product'].value_counts())

Sorting

# df here is the employee DataFrame from "DataFrame Basic Information"

# Sort by values
sorted_df = df.sort_values('age')
sorted_df = df.sort_values('age', ascending=False)  # descending order

# Sort by several columns
sorted_df = df.sort_values(['department', 'salary'], ascending=[True, False])

# Sort by index
sorted_df = df.sort_index()
Practical Examples

Example 1: Sales Data Analysis

import pandas as pd
import numpy as np

# Generate sample sales data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=100)
data = {
    'date': dates,
    'product': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'quantity': np.random.randint(1, 20, 100),
    'price': np.random.choice([1000, 1500, 2000, 2500], 100)
}
df = pd.DataFrame(data)

# Compute revenue
df['revenue'] = df['quantity'] * df['price']

print("=== Sales Data Summary ===")
print(df.head())

# Total revenue
total_revenue = df['revenue'].sum()
print(f"\nTotal revenue: {total_revenue:,} KRW")

# Units sold per product
product_sales = df.groupby('product')['quantity'].sum().sort_values(ascending=False)
print("\nUnits sold per product:")
print(product_sales)

# Average unit price per product
avg_price = df.groupby('product')['price'].mean()
print("\nAverage unit price per product:")
print(avg_price)

# Day with the highest revenue
daily_revenue = df.groupby('date')['revenue'].sum()
best_day = daily_revenue.idxmax()
best_revenue = daily_revenue.max()
print(f"\nBest revenue day: {best_day.date()} ({best_revenue:,} KRW)")

# Revenue ranking by product
product_revenue = df.groupby('product')['revenue'].sum().sort_values(ascending=False)
print("\nRevenue ranking by product:")
for i, (product, revenue) in enumerate(product_revenue.items(), 1):
    print(f"{i}. Product {product}: {revenue:,} KRW")

Example 2: Student Grade Management

import pandas as pd

# Student grade data
data = {
    'student_id': [1001, 1002, 1003, 1004, 1005],
    'name': ['Kim Cheolsu', 'Lee Younghee', 'Park Minsu', 'Jung Jieun', 'Choi Hojin'],
    'math': [85, 92, 78, 95, 88],
    'english': [90, 88, 85, 92, 86],
    'science': [78, 95, 80, 88, 92]
}
df = pd.DataFrame(data)

print("=== Student Report Card ===")
print(df)

# Compute total and average scores
df['total'] = df[['math', 'english', 'science']].sum(axis=1)
df['average'] = df[['math', 'english', 'science']].mean(axis=1)

# Compute rank (ties share the lowest rank; rank() returns floats, so cast to int)
df['rank'] = df['total'].rank(ascending=False, method='min').astype(int)

# Sort by rank
df = df.sort_values('rank')

print("\n=== Results ===")
print(df[['name', 'total', 'average', 'rank']])

# Per-subject statistics
print("\n=== Statistics by Subject ===")
subjects = ['math', 'english', 'science']
stats = df[subjects].agg(['mean', 'max', 'min', 'std'])
print(stats.round(2))

# Top students (average of 90 or higher)
excellent = df[df['average'] >= 90]
print(f"\n=== Top Students ({len(excellent)}) ===")
print(excellent[['name', 'average']])

# Best student in each subject
print("\n=== Top Score by Subject ===")
for subject in subjects:
    top_student = df.loc[df[subject].idxmax()]
    print(f"{subject}: {top_student['name']} ({top_student[subject]} points)")

Example 3: Monthly Expense Analysis

import pandas as pd

# Expense data
data = {
    'date': pd.to_datetime([
        '2024-01-05', '2024-01-12', '2024-01-20',
        '2024-02-03', '2024-02-15', '2024-02-28',
        '2024-03-08', '2024-03-18', '2024-03-25'
    ]),
    'category': ['food', 'transport', 'shopping', 'food', 'culture', 'food', 'transport', 'shopping', 'food'],
    'amount': [45000, 50000, 120000, 38000, 25000, 52000, 48000, 95000, 41000]
}
df = pd.DataFrame(data)

# Add month information
df['month'] = df['date'].dt.month
df['month_name'] = df['date'].dt.strftime('%Y-%m')

print("=== Expense Records ===")
print(df)

# Total spending
total = df['amount'].sum()
print(f"\nTotal spending: {total:,} KRW")

# Spending by category
category_expense = df.groupby('category')['amount'].agg(['sum', 'count', 'mean'])
category_expense.columns = ['total', 'count', 'mean']
category_expense = category_expense.sort_values('total', ascending=False)
print("\n=== Spending by Category ===")
print(category_expense)

# Spending by month
monthly_expense = df.groupby('month_name')['amount'].sum()
print("\n=== Spending by Month ===")
for month, amount in monthly_expense.items():
    print(f"{month}: {amount:,} KRW")

# Largest single expense
max_expense = df.loc[df['amount'].idxmax()]
print("\n=== Largest Expense ===")
print(f"{max_expense['date'].date()} - {max_expense['category']}: {max_expense['amount']:,} KRW")

# Budget analysis (monthly budget of 150,000 KRW)
budget = 150000
monthly_total = df.groupby('month_name')['amount'].sum()
print("\n=== Budget Analysis (monthly budget: 150,000 KRW) ===")
for month, amount in monthly_total.items():
    diff = budget - amount
    status = "within budget" if diff >= 0 else "over budget"
    print(f"{month}: {amount:,} KRW ({status}, {abs(diff):,} KRW)")

Example 4: Employee Data Analysis

import pandas as pd
import numpy as np

# Employee data (seeded so the random sample is reproducible)
np.random.seed(42)
data = {
    'employee_id': range(1001, 1021),
    'name': [f'Employee{i}' for i in range(1, 21)],
    'department': np.random.choice(['IT', 'HR', 'Sales', 'Marketing'], 20),
    'position': np.random.choice(['Staff', 'Assistant Manager', 'Manager', 'Deputy General Manager', 'General Manager'], 20),
    'salary': np.random.randint(3000, 8000, 20) * 1000,
    'years': np.random.randint(1, 15, 20)
}
df = pd.DataFrame(data)

print("=== Employee Overview ===")
print(df.head(10))

# Headcount by department
dept_count = df['department'].value_counts()
print("\n=== Headcount by Department ===")
print(dept_count)

# Salary statistics by department
dept_salary = df.groupby('department')['salary'].agg(['mean', 'min', 'max'])
dept_salary.columns = ['mean', 'min', 'max']
print("\n=== Salary by Department ===")
print(dept_salary.round())

# Statistics by position
position_stats = df.groupby('position').agg({
    'salary': 'mean',
    'years': 'mean',
    'employee_id': 'count'
})
position_stats.columns = ['avg_salary', 'avg_years', 'headcount']
print("\n=== Statistics by Position ===")
print(position_stats.round())

# Analysis by experience band: (0, 3], (3, 7], (7, 15] years
df['experience_level'] = pd.cut(
    df['years'],
    bins=[0, 3, 7, 15],
    labels=['junior', 'mid', 'senior']
)
# observed=False keeps empty bands and avoids a FutureWarning on categorical keys
exp_salary = df.groupby('experience_level', observed=False)['salary'].mean()
print("\n=== Average Salary by Experience ===")
print(exp_salary.round())

# Five highest-paid employees
top5 = df.nlargest(5, 'salary')[['name', 'department', 'position', 'salary']]
print("\n=== Salary Top 5 ===")
print(top5)

Useful Tips

1. Method Chaining

# Chain several operations together
# (df is the employee DataFrame from "DataFrame Basic Information")
result = (
    df
    .query('age > 25')
    .sort_values('salary', ascending=False)
    .head(10)
    .reset_index(drop=True)
)

2. The apply Function

# Apply a function to each value in a column
df['salary_category'] = df['salary'].apply(
    lambda x: 'High' if x > 60000 else 'Normal'
)

# Use several columns (axis=1 passes one row at a time)
df['bonus'] = df.apply(
    lambda row: row['salary'] * 0.2 if row['department'] == 'Sales' else row['salary'] * 0.1,
    axis=1
)
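
Row-wise `apply` calls a Python function once per row, which can become slow on large tables. For simple two-way conditions like the bonus rule above, a vectorized alternative is `numpy.where`. The sketch below uses a small made-up frame and computes the bonus both ways for comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "department": ["IT", "Sales", "HR"],
    "salary": [50000, 60000, 70000],
})

# Row-wise version, as in the apply example.
bonus_apply = df.apply(
    lambda row: row["salary"] * 0.2 if row["department"] == "Sales" else row["salary"] * 0.1,
    axis=1,
)

# Vectorized version: choose the rate per row, then multiply whole columns.
rate = np.where(df["department"] == "Sales", 0.2, 0.1)
df["bonus"] = df["salary"] * rate

print(df)
```

Both versions give the same values; the vectorized one avoids the per-row Python call.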

3. Date Handling

# A single date is broadcast to every row; the column gets a datetime dtype
df['date'] = pd.to_datetime('2024-01-01')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_name'] = df['date'].dt.day_name()
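
In practice, dates usually arrive as strings in a column. The following sketch, using invented data, converts such a column with `pd.to_datetime` and then uses the `.dt` accessor to group by calendar month.

```python
import pandas as pd

# Hypothetical records with dates stored as text.
df = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-20", "2024-02-03"],
    "amount": [100, 200, 300],
})

# Convert the text column to datetime values.
df["date"] = pd.to_datetime(df["date"])
df["day_name"] = df["date"].dt.day_name()

# Group by calendar month using monthly periods.
monthly = df.groupby(df["date"].dt.to_period("M"))["amount"].sum()

print(df)
print(monthly)
```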

Frequently Asked Questions

What is the difference between a DataFrame and Excel?

Pandas DataFrame:

  • Controlled through code
  • Can handle large datasets
  • Easy to automate
  • Supports complex operations

Excel:

  • GUI-based
  • Suited to small datasets
  • Easy visual editing
  • Intuitive formula entry

What is the difference between loc and iloc?

df = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])

# loc: label-based
print(df.loc['a'])  # the row with index label 'a'

# iloc: integer-position-based
print(df.iloc[0])  # the first row
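
The distinction matters most when the index itself holds integers that do not match the row positions. A short sketch, assuming an index of 10, 20, 30:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=[10, 20, 30])

value_by_label = df.loc[10, "A"]     # the row labeled 10
value_by_position = df.iloc[0, 0]    # the row at position 0 (the same row here)

# There is no row labeled 0, so loc raises KeyError.
try:
    df.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(value_by_label, value_by_position, label_zero_exists)
```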

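Copy vs. view?

# Assigning to a filtered result (ambiguous)
# Boolean filtering returns a new object, so df itself is not modified,
# and pandas may emit SettingWithCopyWarning
subset = df[df['age'] > 30]
subset['age'] = 99  # SettingWithCopyWarning

# Explicit copy (safe)
safe = df[df['age'] > 30].copy()
safe['age'] = 99  # OK

# To modify the original, assign through .loc
df.loc[df['age'] > 30, 'age'] = 99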
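
The two intents, editing a separate copy and editing the original, can be made explicit. This sketch uses a small made-up frame: `.copy()` protects the original, and a single `.loc` assignment changes it in place.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "age": [25, 35, 40]})

# Working on an explicit copy leaves the original untouched.
older = df[df["age"] > 30].copy()
older["age"] = 99
ages_after_copy_edit = df["age"].tolist()

# To change the original, select rows and column in one .loc call.
df.loc[df["age"] > 30, "age"] = 99

print(ages_after_copy_edit)
print(df["age"].tolist())
```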

How do I handle large datasets?

# Read in chunks (process is a placeholder for your own function)
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    process(chunk)

# Read only specific columns
df = pd.read_csv('large_file.csv', usecols=['col1', 'col2'])

# Specify data types to reduce memory use
df = pd.read_csv('large_file.csv', dtype={'col1': 'int32'})
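
As a concrete stand-in for `process(chunk)`, the sketch below accumulates a running total across chunks, so only one chunk is in memory at a time. The CSV text is generated in memory and the `value` column is hypothetical; a real workload would read from a file path instead.

```python
import io
import pandas as pd

# Hypothetical data: a single "value" column holding 1 through 10.
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 11))

# Aggregate chunk by chunk instead of loading everything at once.
total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())
    n_chunks += 1

# A narrower dtype reduces memory use per value.
small = pd.read_csv(io.StringIO(csv_text), dtype={"value": "int32"})

print(total, n_chunks)
print(small["value"].dtype)
```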

Next Steps

Once you are comfortable with the Pandas basics, continue with the following topics:

  1. Data preprocessing: missing values, duplicates, transformations
  2. Combining data: merge, join, concat
  3. Grouping: advanced use of groupby
  4. Time series: working with date/time data
  5. Visualization: using Pandas together with Matplotlib
