10 minutes to pandas - Merge & Join

Data Analysis/Pandas 2022. 3. 27. 15:10

Concat

pd.concat(
    objs,
    axis=0,
    join="outer",
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    copy=True,
)

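Main parameters

objs: the objects to concatenate
axis: the axis to concatenate along
join: how to combine the labels on the other axis, as a union ("outer") or an intersection ("inner")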

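Additional parameters

ignore_index: whether to discard the labels along the concatenation axis and renumber them 0 to n-1
verify_integrity: raise an error if the concatenation axis ends up with duplicate labels
keys: build a MultiIndex whose outer level identifies the source of each piece
names: names for the levels of the resulting MultiIndex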
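The additional parameters are easier to compare side by side. The following is a minimal sketch; the frames df_a and df_b and their values are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical sample frames; names and values are for illustration only.
df_a = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df_b = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# Default: the original row labels are kept, so 0 and 1 each appear twice.
stacked = pd.concat([df_a, df_b])

# ignore_index=True discards the old labels and renumbers 0..n-1.
renumbered = pd.concat([df_a, df_b], ignore_index=True)

# keys adds an outer index level recording the source of each row;
# names labels the levels of the resulting MultiIndex.
labeled = pd.concat([df_a, df_b], keys=["a", "b"], names=["source", "row"])

# verify_integrity=True raises ValueError when the result has duplicate labels.
try:
    pd.concat([df_a, df_b], verify_integrity=True)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc

print(labeled)
print("verify_integrity error:", duplicate_error)
```

With keys set, labeled.loc["b"] returns only the rows that came from df_b, which is the main reason to record the source this way.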

How it works

result = pd.concat([df1, df4], axis=1, join="inner")
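The call above uses df1 and df4 without defining them. A runnable sketch, assuming small frames whose row labels only partly overlap (the data here is an assumption chosen to show the effect), could look like this:

```python
import pandas as pd

# Assumed sample data: df1 has row labels 0-3, df4 has row labels 2, 3, 6, 7.
df1 = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "A3"], "B": ["B0", "B1", "B2", "B3"]},
    index=[0, 1, 2, 3],
)
df4 = pd.DataFrame(
    {"B": ["B2", "B3", "B6", "B7"], "F": ["F2", "F3", "F6", "F7"]},
    index=[2, 3, 6, 7],
)

# axis=1 places the frames side by side; join="inner" keeps only the
# row labels present in both frames (2 and 3).
inner = pd.concat([df1, df4], axis=1, join="inner")

# join="outer" (the default) keeps every row label and fills gaps with NaN.
outer = pd.concat([df1, df4], axis=1)

print(inner)
print(outer)
```

Note that concat does not rename overlapping columns, so both results contain two columns named B.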

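Append using concat

How do you add a single row to a DataFrame (append)?
The index of the data being added must match the column names of the target DataFrame, otherwise the row will not line up with the existing columns.

How it works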

s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])
result = pd.concat([df1, s2.to_frame().T], ignore_index=True)
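To see why the labels have to match, the sketch below appends one Series whose index matches the columns and one whose index does not. The frame df1 and the Series s_bad are assumptions made for illustration.

```python
import pandas as pd

# Assumed target frame with columns A-D.
df1 = pd.DataFrame(
    {"A": ["A0", "A1"], "B": ["B0", "B1"], "C": ["C0", "C1"], "D": ["D0", "D1"]}
)

# The Series index matches df1's column names, so to_frame().T yields a
# one-row frame whose columns line up with df1.
s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])
appended = pd.concat([df1, s2.to_frame().T], ignore_index=True)

# A Series with unrelated labels (hypothetical) creates new columns
# instead of filling the existing ones.
s_bad = pd.Series(["X0", "X1"], index=["P", "Q"])
misaligned = pd.concat([df1, s_bad.to_frame().T], ignore_index=True)

print(appended)
print(misaligned)
```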

MERGE (DB-style JOIN)

pd.merge(
    left,
    right,
    how="inner",
    on=None,
    left_on=None,
    right_on=None,
    left_index=False,
    right_index=False,
    sort=False,
    suffixes=("_x", "_y"),
    copy=True,
    indicator=False,
    validate=None,
)

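Main parameters

left: the left dataset
right: the right dataset
how: the join type (left, right, outer, inner, cross)
on: the join key(s); must be present in both the left and right datasets
left_on: column or index level names in the left dataset to use as join keys
right_on: column or index level names in the right dataset to use as join keys
left_index: whether to use the row index of the left dataset as its join key
right_index: whether to use the row index of the right dataset as its join key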

Join types (how)

Merge method	SQL Join Name	Description
left	LEFT OUTER JOIN	Use keys from left frame only
right	RIGHT OUTER JOIN	Use keys from right frame only
outer	FULL OUTER JOIN	Use union of keys from both frames
inner	INNER JOIN	Use intersection of keys from both frames
cross	CROSS JOIN	Create the cartesian product of rows of both frames

https://pandas.pydata.org/docs/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra

Additional parameters

sort: sort the result by the join keys
suffixes: suffixes appended to overlapping column names (default: _x, _y)
indicator: add a _merge column recording where each row came from (left_only, right_only, both)
validate: check that the merge keys have the expected relationship (one_to_one, one_to_many, many_to_one, many_to_many)
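As a small illustration of suffixes and indicator, the sketch below merges two frames that share a key column and a non-key column. The frames and the column names k and v are hypothetical.

```python
import pandas as pd

# Hypothetical frames sharing a key column "k" and a non-key column "v".
left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})
right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})

result = pd.merge(
    left,
    right,
    on="k",
    how="outer",
    suffixes=("_left", "_right"),  # rename the clashing "v" columns
    indicator=True,                # add a "_merge" column
)

# Count how many rows matched on both sides or on one side only.
counts = result["_merge"].value_counts()

print(result)
print(counts)
```

Filtering on the _merge column (for example result[result["_merge"] == "left_only"]) is a common way to find rows that have no partner in the other frame.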

How it works

result = pd.merge(left, right, how="left", on=["key1", "key2"])

The key is the combination (key1, key2), and every key found in left is kept.

result = pd.merge(left, right, how="outer", on=["key1", "key2"])

The key is the combination (key1, key2); outer keeps the union of the keys, inner keeps the intersection.

result = pd.merge(left, right, how="cross")
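The difference between the join types shows up clearly in the number of rows each one returns. The sketch below runs all of them on the same pair of frames; the sample data is an assumption, and how="cross" needs pandas 1.2 or later.

```python
import pandas as pd

# Assumed sample data with a two-column key (key1, key2).
left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
    }
)
right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
    }
)

# Row counts for each key-based join type.
row_counts = {
    how: len(pd.merge(left, right, how=how, on=["key1", "key2"]))
    for how in ["inner", "left", "right", "outer"]
}

# A cross join takes no key: every left row is paired with every right row,
# and the overlapping key columns receive the _x / _y suffixes.
cross = pd.merge(left, right, how="cross")

print(row_counts)
print(cross.shape)
```

Here (K1, K0) appears once in left and twice in right, so it contributes two rows to every key-based join that includes it.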

Checking for duplicate keys

You can check for duplicate keys before the merge runs.
In the example below, the merge key B contains duplicate values in right.
validate="one_to_one" therefore raises an error, while validate="one_to_many" succeeds.

left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

# result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")
# MergeError: Merge keys are not unique in right dataset; not a one-to-one merge

pd.merge(left, right, on="B", how="outer", validate="one_to_many")
'''
   A_x  B  A_y
0    1  1  NaN
1    2  2  4.0
2    2  2  5.0
3    2  2  6.0
'''
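The failing call is commented out above. One way to exercise every validate mode without stopping the script is to catch pandas.errors.MergeError. This is a sketch; the helper safe_merge is a hypothetical name and not part of pandas.

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})


def safe_merge(validate):
    """Hypothetical helper: return the merged frame, or None if validation fails."""
    try:
        return pd.merge(left, right, on="B", how="outer", validate=validate)
    except MergeError as exc:
        print(f"{validate}: {exc}")
        return None


one_to_one = safe_merge("one_to_one")    # right has duplicate B values, so this fails
one_to_many = safe_merge("one_to_many")  # left keys are unique, so this succeeds
many_to_one = safe_merge("many_to_one")  # requires unique right keys, so this fails
```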

Merging on the index

result = pd.merge(left, right, left_index=True, right_index=True, how="outer")
result = left.join(right, how="outer")

result = pd.merge(left, right, left_on="key", right_index=True, how="left", sort=False)
result = left.join(right, on="key", how="left", sort=False)

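DataFrame.join defaults to a left join and behaves like Excel's VLOOKUP. (It is therefore mostly used for many_to_one lookups.)
If on is omitted, join matches on the index of both frames.
If on is given, join matches the named key of the calling frame against the index of the other frame. (The key may be a column name or an index level name.)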
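The VLOOKUP comparison can be sketched with a table of order lines and a price list indexed by product code. Both frames and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical order lines (many rows per product) and a lookup table
# indexed by product code (one row per product).
orders = pd.DataFrame({"product": ["P1", "P2", "P1", "P9"], "qty": [3, 1, 2, 5]})
prices = pd.DataFrame({"price": [10.0, 20.0]}, index=["P1", "P2"])

# on="product" matches the orders column against the index of prices.
# The default how="left" keeps every order, even the unmatched P9.
looked_up = orders.join(prices, on="product")

# The equivalent merge call.
same = pd.merge(orders, prices, left_on="product", right_index=True, how="left")

print(looked_up)
```

Products missing from the lookup table get NaN in the price column instead of being dropped, which mirrors how a failed VLOOKUP leaves the row in place.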

Joining on a mix of columns and index levels

left_index = pd.Index(["K0", "K0", "K1", "K2"], name="key1")
left = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "key2": ["K0", "K1", "K0", "K1"],
    },
    index=left_index,
)

right_index = pd.Index(["K0", "K1", "K2", "K2"], name="key1")
right = pd.DataFrame(
    {
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
        "key2": ["K0", "K0", "K0", "K1"],
    },
    index=right_index,
)

result = left.merge(right, on=["key1", "key2"])
result = pd.merge(left, right, on=["key1", "key2"])
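In this example key1 is the name of the index while key2 is an ordinary column, and on accepts both. The sketch below rebuilds the same frames so it runs on its own and inspects the result of the default inner join.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "key2": ["K0", "K1", "K0", "K1"],
    },
    index=pd.Index(["K0", "K0", "K1", "K2"], name="key1"),
)
right = pd.DataFrame(
    {
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
        "key2": ["K0", "K0", "K0", "K1"],
    },
    index=pd.Index(["K0", "K1", "K2", "K2"], name="key1"),
)

# "key1" resolves to the index level and "key2" to the column.
result = left.merge(right, on=["key1", "key2"])

print(result)
print(result.index.name)
```

Only the (key1, key2) pairs present in both frames survive: (K0, K0), (K1, K0) and (K2, K1). The key1 level stays in the index of the result.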
