10 minutes to pandas - Creation & Selection/Modification

haloaround 2022. 3. 19. 22:09

Object Creation

Series
One-dimensional ndarray with axis labels (including time series).
The object supports both integer- and label-based indexing.

DataFrame
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns).
Can be thought of as a dict-like container for Series objects.
Think of a DataFrame as Series objects bound together column by column!
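To make the "columns are Series" picture concrete, here is a minimal sketch. The frame `people`, its labels, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two Series that share the same labels
age = pd.Series([31, 25], index=["kim", "lee"])
city = pd.Series(["Seoul", "Busan"], index=["kim", "lee"])

# A dict of Series becomes a DataFrame: one column per Series
people = pd.DataFrame({"age": age, "city": city})

# Pulling a column back out gives a Series again
age_col = people["age"]
print(type(age_col).__name__)  # Series

# A Series supports both label-based and position-based access
by_label = age.loc["lee"]
by_position = age.iloc[1]
print(by_label == by_position)  # True
```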

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

df.info()

'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       4 non-null      float64       
 1   B       4 non-null      datetime64[ns]
 2   C       4 non-null      float32       
 3   D       4 non-null      int32         
 4   E       4 non-null      category      
 5   F       4 non-null      object        
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 288.0+ bytes
'''

The 3 components of a DataFrame

data
Dict can contain Series, arrays, constants, dataclass or list-like objects.
Think of it as a bundle of Series, each with its own dtype.

index
There are 4 rows in total: 4 entries (0 to 3).
Sorting or filtering does not relabel the rows: each row keeps its index label unless you reset the index.

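If no index is specified, it is a RangeIndex.
If a column explicitly specifies an index, the DataFrame follows that index type.
In the example above, column C specifies an index, so the index is pandas.core.indexes.numeric.Int64Index (pandas 2.0 and later removed Int64Index and report a plain Index with dtype int64 instead).
The index can also have a name.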
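The points about the index can be checked with a small sketch. The frame `scores` and the index name `student_id` are hypothetical.

```python
import pandas as pd

scores = pd.DataFrame({"score": [70, 95, 82]})  # default RangeIndex 0..2
scores.index.name = "student_id"                # hypothetical index name

# Sorting and filtering move or drop rows, but each row keeps its label
by_score = scores.sort_values("score", ascending=False)
high = scores[scores["score"] > 80]

print(list(by_score.index))  # [1, 2, 0]
print(list(high.index))      # [1, 2]

# Labels are renumbered only when you ask for it
renumbered = by_score.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, 2]
```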

columns
There are 6 columns in total. Each column has a dtype.

The columns are also an Index, along axis=1.
If no column labels are specified, it is a RangeIndex.
Here, however, the labels A~F were specified, so it is a pandas.core.indexes.base.Index.
That is, Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object').
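A quick way to see both cases for the columns axis, using two made-up frames (`unnamed` and `named` are illustrative names only):

```python
import pandas as pd

# No column labels given: the columns default to a RangeIndex
unnamed = pd.DataFrame([[1, 2], [3, 4]])
print(type(unnamed.columns).__name__)  # RangeIndex

# Column labels given: the columns become a regular Index of labels
named = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
print(list(named.columns))  # ['A', 'B']
```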

Selection

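Caution. Do not rely on whether a selection returns a view or a copy: it depends on the data layout and the pandas version. In particular, chained indexing such as df['A'][0] = 1 may write to a temporary copy, whereas df.loc[0, 'A'] = 1 writes to df itself.
If the result is a view, modifying it can also modify the original.
If the result is a copy, modifying it leaves the original unchanged.

>> It makes no difference when you only read values.
>> When you modify values, however, use an indexer (loc / iloc)!!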
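A minimal sketch of the recommendation to modify values through an indexer. The frame and the name `snapshot` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# One indexer call on the original object: the write reaches df
df.loc[df["A"] > 1, "B"] = 0

# An explicit copy is independent: changing it leaves df untouched
snapshot = df.copy()
snapshot.loc[0, "A"] = 100

print(df["B"].tolist())  # [4, 0, 0]
print(df.loc[0, "A"])    # 1
```

Calling .copy() explicitly is also the simplest way to state your intent when you do want a separate object.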

Getting / Setting
With [], a scalar or a list is interpreted as columns, and a slice is interpreted as rows.

# Get the values of all rows for specific columns (scalar/list, by label name)
df.col1
df['col1']
df[['col1', 'col2']]

# Get the values of all columns for specific rows (slicing, by position)
df[:]
df[0:2]
df[2:]
df[:2]

# Get the values at specific rows and specific columns
df[:]['col1'] 
df['col1'][:]

df[0:2]['col1']
df['col1'][0:2]

df[0:2][['col1', 'col2']]
df[['col1', 'col2']][0:2]
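The patterns above use placeholder names, so here is a runnable sketch on a small made-up frame. The values in `col1` and `col2` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30], "col2": [1, 2, 3]})

one_col = df["col1"]             # scalar label -> Series
two_cols = df[["col1", "col2"]]  # list of labels -> DataFrame
first_rows = df[0:2]             # slice -> rows at positions 0 and 1

print(type(one_col).__name__)   # Series
print(type(two_cols).__name__)  # DataFrame
print(first_rows.shape)         # (2, 2)

# Chaining is fine for reading values
print(df[0:2]["col1"].tolist())  # [10, 20]
```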

Selection By Label / Position
loc is label-based and iloc is position-based.
With loc and iloc, both rows and columns can be accessed by scalar, slice (:), list ([]), or boolean filter.

#loc (select by label name, slicing include/include)
df.loc[0, 'col1']
df.loc[0, ['col1', 'col2']]
df.loc[0, 'col1':'col2']
df.loc[0:2, 'col1']
df.loc[[0,1,2], 'col1']
df.loc[df['col1']>0, 'col1']


#iloc (select by position, slicing include/exclude)
df.iloc[0, 0]
df.iloc[0, [0,1]]
df.iloc[0, 0:2]
df.iloc[[0,1], 0]
df.iloc[0:2, 0]
df.iloc[(df['col1']>0).to_numpy(), 0]  # iloc needs a plain boolean array, not a boolean Series

Note that loc slicing includes both ends, whereas iloc (position) slicing includes the start and excludes the stop.

df = pd.DataFrame(
    {
        'A': [1,2,3],
        'B': [4,5,6],
        'C': [7,8,9]     
    }
)

df.loc[0:1, 'B':'C']
'''
	B	C
0	4	7
1	5	8
'''

df.iloc[0:1, 1:2]
'''
	B
0	4
'''
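With the default index 0, 1, 2 the labels and positions coincide, which can hide the difference. A sketch with a hypothetical index of 10, 20, 30 separates the two:

```python
import pandas as pd

# Hypothetical frame whose index labels do not match the positions
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[10, 20, 30])

by_label = df.loc[10:20, "A"]  # labels 10 through 20, both included
by_position = df.iloc[0:2, 0]  # positions 0 and 1, stop excluded

print(by_label.tolist())     # [1, 2]
print(by_position.tolist())  # [1, 2]

# Boolean masks: loc takes the Series directly, iloc needs a plain array
mask = df["A"] > 1
print(df.loc[mask, "A"].tolist())            # [2, 3]
print(df.iloc[mask.to_numpy(), 0].tolist())  # [2, 3]
```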

Boolean Indexing
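Selects the items that satisfy a condition.
Typically, when the values of a single column satisfy a condition (a comparison such as >, or a membership test with isin), the entire matching rows are returned.

Tip. When combining conditions with logical operators such as & (AND) and | (OR), always wrap each condition in ().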

df = pd.DataFrame({
    'A': [-1, 0, 1],
    'B': [-2, 1, 2],
    'C': [-3, 0, 3]
})

'''
	A	B	C
0	-1	-2	-3
1	0	1	0
2	1	2	3
'''


# Using a single column's values to select data
df[df['A']>0]

'''
	A	B	C
2	1	2	3
'''


# Using the isin() method for filtering
df[df['A'].isin([0, 1])]

'''
	A	B	C
1	0	1	0
2	1	2	3
'''

# Using two or more conditions for filtering (Intersection)
df[(df['A'].isin([0,1]))&(df['B']==1)]

'''
	A	B	C
1	0	1	0
'''

# Using two or more conditions for filtering (Union)
df[(df['A']==0)|(df['A']==1)] 
'''
	A	B	C
1	0	1	0
2	1	2	3
'''
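Why the parentheses in the tip matter: in Python, & and | bind more tightly than comparison operators. A small sketch, assuming a two-column frame made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [-1, 0, 1], "B": [-2, 1, 2]})

# With parentheses: each comparison is evaluated first, then combined
both = df[(df["A"] >= 0) & (df["B"] == 1)]
print(both.index.tolist())  # [1]

# Without parentheses Python parses this as
#   df["A"] >= (0 & df["B"]) == 1
# which is a chained comparison and needs bool(Series), so it raises
try:
    df[df["A"] >= 0 & df["B"] == 1]
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```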

If no specific column is given, the condition is applied to every column. Cells that fail the condition become NaN.

# Selecting values from a DataFrame where a boolean condition is met
df[df>0]

'''
	A	B	C
0	NaN	NaN	NaN
1	NaN	1.0	NaN
2	1.0	2.0	3.0
'''
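The NaN cells explain why the integers turn into 1.0, 2.0, 3.0 in the output: the columns are converted to float so they can hold NaN. A short sketch of this, which also shows that the same result can be obtained with DataFrame.where:

```python
import pandas as pd

df = pd.DataFrame({"A": [-1, 0, 1], "B": [-2, 1, 2], "C": [-3, 0, 3]})

masked = df[df > 0]

# Same result as the explicit where() call
same = masked.equals(df.where(df > 0))
print(same)  # True

# Every column is now float64 because of the NaN cells
print([str(t) for t in masked.dtypes])  # ['float64', 'float64', 'float64']

# Number of cells that survived the condition
kept = int(masked.notna().sum().sum())
print(kept)  # 4
```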