10 minutes to pandas - Handling Missing Data

Data Analysis/Pandas 2022. 3. 20. 16:43

Missing Data

pandas primarily uses the value np.nan to represent missing data.
It is by default not included in computations.
Reindexing allows you to change/add/delete the index on a specified axis.

The most important point: NA values are left out of computations.

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])

In [56]: df1.loc[dates[0] : dates[1], "E"] = 1

In [57]: df1
Out[57]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  NaN  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  NaN
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  NaN
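The reindex step above can be reproduced with a small self-contained sketch. The frame below uses placeholder values (the original df holds random numbers and an extra F column), so only the shape of the result matches the output shown above.

```python
import numpy as np
import pandas as pd

# Placeholder data: the values are illustrative, not the ones printed in the post
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# reindex returns a new object; the new column "E" starts out entirely NaN
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])

# Label-based slicing with .loc includes both endpoints, so two rows are set
df1.loc[dates[0]:dates[1], "E"] = 1

n_missing_e = int(df1["E"].isna().sum())  # the two rows left untouched
```

Because reindex returns a copy, the original df keeps its four columns.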

Checking for NA
To get the boolean mask where values are nan: each element is marked True or False depending on whether it is nan.

pd.isna(df1)
pd.notna(df1)
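A boolean mask is mostly useful as input to something else. As a minimal sketch on a hypothetical toy frame (not the df1 above), it can count missing values per column or select the rows that contain any:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame used only for illustration
toy = pd.DataFrame({"F": [np.nan, 1.0, 2.0], "E": [1.0, 1.0, np.nan]})

mask = pd.isna(toy)                    # True where a value is missing
na_per_column = mask.sum()             # True counts as 1, so this counts NaNs per column
rows_with_na = toy[mask.any(axis=1)]   # rows holding at least one NaN
```

pd.notna(toy) is simply the element-wise negation of this mask.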

Filling NA values: fillna

Filling missing data: replace missing values with a specific value.
Typically a default (baseline) value or the column mean is used.

df1.fillna(value=5)

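# Fill with a specific value
df.fillna(0)
df2["one"].fillna("missing")

# Fill with the previous value (ffill); use bfill() for the next value
df.ffill()           # fillna(method="ffill") is deprecated since pandas 2.1
df.ffill(limit=1)    # fill at most 1 consecutive NaN

# Fill with a computed pandas object (often combined with groupby)
dff.fillna(dff.mean())
dff.fillna(dff.mean()["B":"C"])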
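The fill strategies above behave differently on the same data. Here is a small sketch on hypothetical frames (toy and scores are made-up names), including the groupby variant mentioned in the comment:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 4.0, 8.0]})

filled_const = toy.fillna(0)           # every NaN becomes 0
filled_mean = toy.fillna(toy.mean())   # each column uses its own mean (A: 2.0, B: 6.0)
filled_ffill = toy.ffill()             # a leading NaN has nothing before it and stays NaN

# Group-wise fill: each missing score takes the mean of its own team
scores = pd.DataFrame({"team": ["a", "a", "b", "b"],
                       "score": [1.0, np.nan, 10.0, np.nan]})
team_mean = scores.groupby("team")["score"].transform("mean")
scores["score_filled"] = scores["score"].fillna(team_mean)
```

transform("mean") returns a Series aligned with the original rows, which is what lets fillna pick the right group mean for each row.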

Dropping NA: Dropping axis labels with missing data: dropna
You may wish to simply exclude labels from a data set which refer to missing data. To do this, use dropna():

To drop any rows that have missing data: a row is dropped if it contains even one missing value.

df1.dropna(how="any")

# Drop rows
df.dropna(axis=0)

# Drop columns
df.dropna(axis=1)

# dropna options
dff.dropna(axis=0, subset=["A"])  # drop rows only where column A is NA
dff.dropna(axis=0, thresh=2)      # keep rows with at least 2 non-NA values (how and thresh cannot be combined)
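The dropna options are easy to mix up, so here is a minimal sketch on a hypothetical frame comparing how, thresh, and subset. Note that thresh counts the non-NA values a row needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 2 is entirely missing
toy = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 4.0],
    "B": [1.0, 2.0, np.nan, np.nan],
    "C": [1.0, 2.0, np.nan, np.nan],
})

any_na = toy.dropna(how="any")        # drop rows with at least one NaN
all_na = toy.dropna(how="all")        # drop rows only if every value is NaN
at_least_2 = toy.dropna(thresh=2)     # keep rows with 2 or more non-NA values
subset_a = toy.dropna(subset=["A"])   # only look at column A when deciding
```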

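🔥 Key points about missing values.

1. np.nan cannot be compared for equality: np.nan == np.nan is False.
2. pandas reduction methods such as sum() and mean() skip missing values by default.
3. groupby automatically excludes rows whose group key is missing (pass dropna=False to keep them).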

The kinds of missing values are as follows.
- In a numeric container the missing value is NaN. dtype: float64 (default)
- For nullable extension dtypes (for example Int64) the missing value is <NA> (pd.NA). dtype: Int64
- For timestamps the missing value is NaT. dtype: datetime64[ns]
- In an object container the missing value is kept as entered: None or NaN. dtype: object
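The list above can be checked directly. This is a small sketch with made-up Series names, showing which sentinel each dtype ends up holding:

```python
import pandas as pd

float_s = pd.Series([1.0, None])                            # None is converted to NaN
int_s = pd.Series([1, None], dtype="Int64")                 # nullable integer keeps pd.NA
time_s = pd.Series(pd.to_datetime(["2013-01-01", None]))    # missing timestamp is NaT
obj_s = pd.Series(["a", None], dtype=object)                # object dtype keeps None as given

# isna() treats all of these sentinels as missing
flags = [s.isna().tolist() for s in (float_s, int_s, time_s, obj_s)]
```

Whatever the sentinel, isna() reports the second element of every Series as missing, so it is the safest way to test for missing values across dtypes.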

Be careful: missing values can be written in many different ways depending on the dataset.

'na', 'N#A', 'none', 'NaN', '<NA>', '<NAT>'
'unknown', 'not available', 'missing', 'inf', '-inf'
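One way to handle such markers at load time is the na_values parameter of read_csv. The sketch below uses a made-up in-memory CSV; which strings count as missing is an assumption you have to make per dataset.

```python
import io
import pandas as pd

# Made-up CSV text; "unknown" and "missing" are dataset-specific markers
csv_text = "id,score\n1,10\n2,unknown\n3,missing\n4,NaN\n"

# By default only well-known markers such as "NaN" are parsed as missing
raw = pd.read_csv(io.StringIO(csv_text))

# Extra markers can be declared explicitly
clean = pd.read_csv(io.StringIO(csv_text), na_values=["unknown", "missing"])

raw_missing = int(raw["score"].isna().sum())      # only the literal NaN
clean_missing = int(clean["score"].isna().sum())  # NaN plus the two custom markers
```

With the markers declared, the column is parsed as numeric instead of staying as text.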

Note that np.nan cannot be compared for equality.

None == None    # True
np.nan == np.nan    # False
np.nan == None    # False
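Because equality never matches NaN, a filter written with == silently finds nothing. A minimal sketch of the pitfall and the usual alternative:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

found_with_eq = int((s == np.nan).sum())   # comparison with NaN is always False
found_with_isna = int(s.isna().sum())      # isna() detects the missing value

# pd.isna also works on scalars, including None
scalar_checks = (pd.isna(np.nan), pd.isna(None), pd.isna(0.0))
```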

The descriptive statistics and computational methods discussed in the data structure overview are all written to account for missing data.

pandas reduction methods such as sum() and mean() skip missing values by default.
- skipna=True is the default: missing values are skipped when the statistic is computed.
- With skipna=False, a single missing value makes the final result NaN.
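A short sketch of the two skipna settings on a hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

total_default = s.sum()              # NaN is skipped: 1.0 + 2.0
mean_default = s.mean()              # mean over the 2 valid values only
total_strict = s.sum(skipna=False)   # any NaN propagates to the result
```

Note that the default mean divides by the number of valid values, not by the length of the Series.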

NA groups in GroupBy are automatically excluded.
When you group data whose key column contains missing values, those rows are left out of the groups by default.
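A minimal sketch of this behavior on a hypothetical frame. The dropna parameter of groupby (available since pandas 1.1) keeps the NA group when set to False.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the third row has a missing group key
toy = pd.DataFrame({"key": ["a", "b", np.nan, "a"], "val": [1, 2, 3, 4]})

default_sum = toy.groupby("key")["val"].sum()                # NaN key is excluded
with_na = toy.groupby("key", dropna=False)["val"].sum()      # NaN key forms its own group

rows_lost = int(toy["val"].sum() - default_sum.sum())        # value carried by the dropped row
```

Comparing totals before and after grouping, as rows_lost does, is a quick check for rows that were silently excluded.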

+a. Filling missing values: beyond fillna

Filling by Interpolation

Both Series and DataFrame objects have interpolate() that, by default, performs linear interpolation at missing data points. Fill NaN values using an interpolation method. Please note that only method='linear' is supported for DataFrame/Series with a MultiIndex.

This method estimates the missing values and fills them in.
The other options are hard, so let's memorize just three!

# Default: linear between the surrounding values; trailing NaNs take the last valid value
s.interpolate()
ser.interpolate(limit=1)

# Same, but filling in both directions (leading NaNs are filled too)
ser.interpolate(limit=1, limit_direction="both")
ser.interpolate(limit_direction="both")

# Fill with the previous value (ffill), at most 2 consecutive NaNs
s.ffill(limit=2)  # interpolate(method='pad') is deprecated since pandas 2.1

# Linear (first-degree)
df.interpolate(method='linear', limit_direction='forward', axis=0)

# Polynomial (requires SciPy)
s.interpolate(method='polynomial', order=1)  # first-degree (linear)
s.interpolate(method='polynomial', order=2)  # second-degree (quadratic)

s = pd.Series([1.0, np.nan, 9.0, 16.0])
s.interpolate(method='polynomial', order=1)
'''
0     1.0
1     5.0
2     9.0
3    16.0
dtype: float64
'''

s.interpolate(method='polynomial', order=2)
'''
0     1.0
1     4.0
2     9.0
3    16.0
dtype: float64
'''
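The effect of limit and limit_direction is easiest to see side by side. This sketch uses a hypothetical Series and only the default linear method, so it does not need SciPy:

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward only: the leading NaN stays, the trailing NaN repeats the last value
default = ser.interpolate()

# At most one consecutive NaN is filled in each gap
one_step = ser.interpolate(limit=1)

# Both directions: the leading NaN is filled from the first valid value
both = ser.interpolate(limit_direction="both")
```

Here default gives NaN, 1, 2, 3, 4, 4; one_step leaves position 3 as NaN; both fills every position.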

Replacing generic values

Often we want to replace arbitrary values with other values. replace() in Series and replace() in DataFrame provide an efficient yet flexible way to perform such replacements.

ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
df = pd.DataFrame({"a": [0, 1, 2, 3, 4], "b": [5, 6, 7, 8, 9]})


# basic
ser.replace(0, 5)
ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
df.replace(r"\s*\.\s*", np.nan, regex=True)  # strings containing "." become np.nan
df.replace(["a", "."], ["b", np.nan])  # "a" becomes "b", "." becomes np.nan


# mapping
ser.replace({0: 10, 1: 100})  # 0 becomes 10, 1 becomes 100

# advanced
df.replace({"b": "."}, {"b": np.nan})  # in column b, "." becomes np.nan
df.replace({"b": {"b": r""}}, regex=True)  # in column b, matches of the pattern "b" become ""
df.replace({"a": 0, "b": 5}, 100)  # 0 in column a and 5 in column b become 100


# interpolation
ser.mask(ser.isin([1, 2, 3])).ffill()  # stands in for ser.replace([1, 2, 3], method="pad"), deprecated since pandas 2.1
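The patterns above, run on concrete data. The toy frame is hypothetical: column b holds strings so that the regex case has something to match.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
# Hypothetical frame for illustration
toy = pd.DataFrame({"a": [0, 1, 2, 3, 4], "b": ["5", ".", "7", " . ", "9"]})

single = ser.replace(0, 5)               # one value for another
mapped = ser.replace({0: 10, 1: 100})    # dict maps old values to new ones

# Cells that are just a dot (with optional whitespace) become NaN
dots_to_nan = toy.replace(r"^\s*\.\s*$", np.nan, regex=True)

# Dict keys name columns: only the 0 in column a is replaced
per_column = toy.replace({"a": 0}, 100)

# Turn selected values into NaN, then carry the previous value forward
padded = ser.mask(ser.isin([1, 2, 3])).ffill()
```

Converting placeholder strings to NaN first means fillna, dropna, and interpolate can then treat them like any other missing value.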
