pandas.DataFrame.replace, where, mask

Data Analysis/Pandas 2022. 5. 24. 08:56

To apply a clustering technique, we attach features for each index entry (result: cluster_df).
The replace function is needed for this.

number_of_order_per_CID = order_df.drop_duplicates(
	subset=['CustomerID', 'InvoiceNo']
)['CustomerID'].value_counts()
# Series indexed by CustomerID (value = number of distinct invoices)

cluster_df['주문횟수'] = cluster_df['CustomerID'].replace(
	number_of_order_per_CID.to_dict())
# Replace each CustomerID with that customer's order count ('주문횟수' = order count)
# If nothing matches, the CustomerID value is kept as is

cluster_df.loc[cluster_df['CustomerID'] == cluster_df['주문횟수'], '주문횟수'] = 0
# For the CustomerID values that are still left, set the order count to 0
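The two-step approach above relies on comparing the replaced column with CustomerID, which would misfire if a customer's order count happened to equal its ID. A minimal sketch of an alternative, assuming the same column names and using small made-up stand-ins for order_df and cluster_df: Series.map returns NaN for unmatched IDs, so the zero can be filled in directly.

```python
import pandas as pd

# Hypothetical toy data standing in for order_df and cluster_df.
order_df = pd.DataFrame({
    "CustomerID": [101, 101, 101, 102],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
})
cluster_df = pd.DataFrame({"CustomerID": [101, 102, 103]})

# Count distinct invoices per customer (same logic as above).
number_of_order_per_CID = order_df.drop_duplicates(
    subset=["CustomerID", "InvoiceNo"]
)["CustomerID"].value_counts()

# map() looks each CustomerID up in the Series index and yields NaN
# when there is no match, so unmatched customers can be set to 0.
cluster_df["주문횟수"] = (
    cluster_df["CustomerID"].map(number_of_order_per_CID).fillna(0).astype(int)
)
print(cluster_df)
```

Customer 103 has no orders in the toy data, so its count comes out as 0 without a second pass.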

replace

Replaces the values given in to_replace with value.

pandas.Series.replace(to_replace, value, inplace, limit, regex, method)
pandas.DataFrame.replace(to_replace, value, inplace, limit, regex, method)

to_replace: the values to be replaced
value: the replacement value

to_replace (existing values)	value (replacement values)	Notes
numeric, str, regex	numeric, str, regex	
list of numeric, str, regex	numeric, str, regex, or list of numeric, str, regex	All values in the list can be replaced with a single value. If value is also a list, it must have the same length as to_replace, because the entries are paired by position.
dict {'a': 'b', 'y': 'z'}	-	The keys are the values to replace and the dict values are the replacements.
dict {'a': {'b': np.nan}}	-	Nested dict: within column 'a', the inner keys are the values to replace and the inner values are the replacements.
dict {'a': 1, 'b': 'z'}	{'a': 99, 'b': 'a'}	Replaces 1 with 99 in column 'a' and 'z' with 'a' in column 'b'. The keys of to_replace and value must match.
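The rows of the table can be tried out one by one. The following is a small sketch on a made-up two-column DataFrame (the data and the names r1 to r6 are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1], "b": ["z", "y", "z"]})

# scalar -> scalar
r1 = df.replace(1, 99)
# list -> single value
r2 = df.replace([1, 2], 0)
# list -> list of the same length (paired by position)
r3 = df.replace([1, 2], [10, 20])
# flat dict: {old: new}
r4 = df.replace({"z": "a", "y": "b"})
# nested dict: {column: {old: new}}, only column 'b' is touched
r5 = df.replace({"b": {"z": "zz"}})
# per-column dicts: 1 -> 99 in column 'a', 'z' -> 'a' in column 'b'
r6 = df.replace({"a": 1, "b": "z"}, {"a": 99, "b": "a"})

print(r6)
```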

If regex=True, to_replace and/or value are interpreted as regular expressions.
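A short sketch of the regex form, using made-up strings and anchored patterns so that only whole-cell matches are replaced:

```python
import pandas as pd

s = pd.Series(["foo", "bar", "baz"])

# Any three-letter string starting with "ba" becomes "new".
by_flag = s.replace(to_replace=r"^ba.$", value="new", regex=True)

# The same idea with a dict of {pattern: replacement} passed via regex=.
by_dict = s.replace(regex={r"^ba.$": "new", "foo": "xyz"})

print(by_flag.tolist())  # ['foo', 'new', 'new']
print(by_dict.tolist())  # ['xyz', 'new', 'new']
```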

limit: the maximum size of the gap to fill when values are forward- or backward-filled

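method: the fill method used when value is not given and to_replace is a scalar, list, or tuple; the replacement is taken from neighboring rows (pad, ffill, bfill, None).
In that case method defaults to pad, which fills in the preceding value. Note that limit and method are deprecated as of pandas 2.1.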
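Because the fill-based form depends on the pandas version, the same effect can be spelled out explicitly. A minimal sketch, assuming the goal is "replace every 2 with the value from the row above" on a made-up Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 2, 5])

# Blank out the matching values, then forward-fill from the previous row.
filled = s.mask(s == 2).ffill()

print(filled.tolist())  # [1.0, 1.0, 3.0, 3.0, 5.0]
```

The result is float because the intermediate step introduces NaN; add .astype(int) if integers are needed.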

where / mask

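where: replaces the values that do not satisfy the condition (cond)

- Where the condition is True, the original value is kept.
- Where the condition is False, the value is replaced with the one given in other.

mask: replaces the values that satisfy the condition (cond)

- Where the condition is True, the value is replaced with the one given in other.
- Where the condition is False, the original value is kept.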
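Put differently, mask is where with the condition inverted. A small sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
cond = s > 2

kept = s.where(cond, 0)    # keep values > 2, replace the rest with 0
masked = s.mask(cond, 0)   # replace values > 2 with 0, keep the rest

print(kept.tolist())    # [0, 0, 3, 4]
print(masked.tolist())  # [1, 2, 0, 0]

# where(cond) gives the same result as mask(~cond)
same = kept.equals(s.mask(~cond, 0))
print(same)  # True
```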

pandas.Series.where(cond, other, inplace, axis, level, errors, try_cast)
pandas.DataFrame.where(cond, other, inplace, axis, level, errors, try_cast)

pandas.Series.mask(cond, other, inplace, axis, level, errors, try_cast)
pandas.DataFrame.mask(cond, other, inplace, axis, level, errors, try_cast)

cond can also be a function that returns a boolean Series/DataFrame.
other can also be a function that returns a scalar or a Series/DataFrame.
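A minimal sketch of passing functions for both arguments; each function receives the Series (or DataFrame) itself. The data is made up for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Keep even values; replace odd values with ten times themselves.
result = s.where(lambda x: x % 2 == 0, lambda x: x * 10)

print(result.tolist())  # [10, 2, 30, 4]
```

This form is handy in method chains, where the intermediate object has no name to refer to.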

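inplace: whether to modify the original object instead of returning a new one
axis: the axis along which other is aligned, if alignment is needed
level: the index level used for alignment, if alignment is needed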
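The axis argument matters when other is a Series that should be lined up with the rows rather than the columns. A small sketch with a made-up DataFrame and a hypothetical row_fill Series:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])
row_fill = pd.Series([-1, -2, -3])  # one replacement value per row

# axis=0 aligns row_fill with the row index, so each row gets its own value.
out = df.where(df % 2 == 0, row_fill, axis=0)

print(out)
#    A  B
# 0  0 -1
# 1  2 -2
# 2  4 -3
```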

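errors: {'raise', 'ignore'}; whether exceptions are raised or suppressed
try_cast: whether to try casting the result back to the input dtype
Both errors and try_cast were removed in pandas 2.0.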

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df

/*
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
*/



m = df % 3 == 0
df.where(m, -df)

/*
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
*/

df.where(
[[False, False], [False, True], [True, False], [True, True], [False, False]],
[100, 200])

/*
     A    B
0  100  200
1  100    3
2    4  200
3    6    7
4  100  200
*/

// Array conditional must be same shape as self: cond must have the same shape as the DataFrame.
// operands could not be broadcast together with shapes (5,2) (5,2) (3,)
// The replacement value (other) must be broadcastable to the DataFrame's shape.
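The first error can be reproduced with a condition whose shape does not match. A minimal sketch (the variable shape_error is only for illustration, and the exact message may differ between pandas versions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=["A", "B"])

shape_error = None
try:
    # cond has shape (1, 2) but df has shape (5, 2)
    df.where([[True, False]], 0)
except ValueError as exc:
    shape_error = exc

print(shape_error)
```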

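Note: this works differently from np.where. np.where takes both outcomes explicitly, whereas DataFrame.where keeps the original value where the condition is True.

np.where(condition, value when True, value when False)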

df = pd.DataFrame(
    {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df
'''
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
'''

df["logic"] = np.where(df["AAA"] > 5, "high", "low")
df

'''
   AAA  BBB  CCC logic
0    4   10  100   low
1    5   20   50   low
2    6   30  -30  high
3    7   40  -50  high
'''
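For comparison, the same labels can be built with Series.where, though it takes a bit more setup: it needs a starting Series holding the "True" value. A sketch under that assumption, reusing only the AAA column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"AAA": [4, 5, 6, 7]})
cond = df["AAA"] > 5

# np.where: both branches are given explicitly; returns a NumPy array.
via_numpy = np.where(cond, "high", "low")

# Series.where: start from "high" everywhere, replace with "low" where cond is False.
via_pandas = pd.Series("high", index=df.index).where(cond, "low")

print(via_numpy.tolist())   # ['low', 'low', 'high', 'high']
print(via_pandas.tolist())  # ['low', 'low', 'high', 'high']
```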
