Stratified train/test split in scikit-learn

tipmemo 2023. 8. 11. 21:51

I need to split my data into a training set (75%) and a test set (25%). I currently do that with the code below:

X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)

However, I'd like to stratify my training dataset. How do I do that? I've been looking into StratifiedKFold, but it doesn't let me specify the 75%/25% split, and it only stratifies the training dataset.

[update for 0.17]

See the documentation for train_test_split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.25)

[/update for 0.17]
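To see what stratify=y buys you, here is a minimal sketch on made-up data (the arrays X and y below are assumptions, not the asker's data). It compares the share of the minority class in the full data, the training set, and the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 samples of class 0, 20 of class 1 (assumed data).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Share of class 1 in the full data, the training set, and the test set.
# With stratify=y the three values should agree.
print(y.mean(), y_train.mean(), y_test.mean())
```

Without stratify=y the split is purely random, so the minority share in the test set can drift away from the share in the full data, especially on small datasets.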

There is a pull request for this. But you can simply do train, test = next(iter(StratifiedKFold(...))) and use the train and test indices if you want.
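With the current sklearn.model_selection API, the same trick might look like the sketch below. The data is made up, and n_splits=4 is chosen here because each fold then holds out one quarter of the samples, which gives the 75%/25% split from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed toy data: 28 samples of class 0 and 12 of class 1.
X = np.arange(80).reshape(40, 2)
y = np.array([0] * 28 + [1] * 12)

# Four folds: every fold keeps 3/4 for training and holds out 1/4.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Take only the first fold and ignore the rest.
train_idx, test_idx = next(skf.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(len(train_idx), len(test_idx))
```

This only works for test fractions of the form 1/k; for anything else, train_test_split with stratify is the simpler route.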

You can do this simply with the train_test_split() function available in scikit-learn:

from sklearn.model_selection import train_test_split 
train, test = train_test_split(X, test_size=0.25, stratify=X['YOUR_COLUMN_LABEL'])

I've also prepared a short GitHub Gist that shows how the stratify option works:

https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9

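TL;DR: Use StratifiedShuffleSplit with test_size=0.25

Scikit-learn provides two modules for stratified splitting:

StratifiedKFold: This module is useful as a direct k-fold cross-validation operator. It sets up n_folds (n_splits in newer releases) training/testing sets such that the classes are equally balanced in both.

Here's some code (directly from the documentation above; it uses the old sklearn.cross_validation API):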

>>> skf = cross_validation.StratifiedKFold(y, n_folds=2) #2-fold cross validation
>>> len(skf)
2
>>> for train_index, test_index in skf:
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
...    #fit and predict with X_train/test. Use accuracy metrics to check validation performance
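The sklearn.cross_validation module was deprecated in 0.18 and removed in 0.20, so the snippet above no longer runs on current releases. A rough equivalent with sklearn.model_selection could look like this; the small X and y arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed toy data: two samples per class.
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# The splitter no longer takes y in its constructor; pass it to split() instead.
skf = StratifiedKFold(n_splits=2)
n_folds = skf.get_n_splits(X, y)  # replaces len(skf)

for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Fit and evaluate a model on this fold here.
```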

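StratifiedShuffleSplit: This module creates a single training/testing set with equally balanced (stratified) classes. Essentially this is what you want, with n_iter=1 (named n_splits in scikit-learn 0.18 and later). You can specify the test size here the same way as in train_test_split.

Code: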

>>> sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
>>> len(sss)
1
>>> for train_index, test_index in sss:
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
>>> # fit and predict with your classifier using the above X/y train/test

Here's an example for continuous/regression data (until this issue on GitHub is resolved):

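import numpy as np
from sklearn.model_selection import train_test_split

# Avoid shadowing the built-ins min() and max()
y_min = np.amin(y)
y_max = np.amax(y)

# 5 edges (4 bins) may be too few for larger datasets.
bins = np.linspace(start=y_min, stop=y_max, num=5)
# Skip the lowest edge; otherwise the minimum of y sits alone in bin 0
# and the stratified split fails on that one-sample class.
y_binned = np.digitize(y, bins[1:], right=True)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y_binned
)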

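Here start is the minimum and stop is the maximum of your continuous target.
If you don't set right=True, the maximum value more or less ends up in a bin of its own, and the split will always fail because that extra bin contains too few samples.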
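Equal-width bins can still leave a bin nearly empty when the target is skewed. One alternative, which is not part of the answer above, is to bin on quantiles so that every bin holds roughly the same number of samples. The sketch below uses made-up data; X, y, and the choice of quartiles are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed data: a skewed continuous target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.exponential(size=200)

# Interior quartile edges (25%, 50%, 75%). Using only interior edges means
# neither the minimum nor the maximum ends up in a bin of its own.
edges = np.quantile(y, [0.25, 0.5, 0.75])
y_binned = np.digitize(y, edges)  # bin labels 0..3

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y_binned, test_size=0.25, random_state=0
)
print(np.bincount(y_binned))
```

More quantiles give a finer stratification, but each bin still needs at least two samples for the split to succeed.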

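In addition to the accepted answer by @Andreas Mueller, I just want to add to what @tangy mentioned above:

StratifiedShuffleSplit most closely resembles train_test_split(stratify=y), with these added features:

It stratifies by default.
By specifying n_splits, it splits the data repeatedly.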
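As a small illustration of the second point, the sketch below asks for three splits of a made-up dataset and records the minority-class share of each test set. The data and the variable name test_fractions are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Assumed toy data: 80 samples of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# n_splits=3 yields three independent shuffled, stratified splits.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

test_fractions = []
for train_index, test_index in sss.split(X, y):
    # Share of class 1 in this split's test set.
    test_fractions.append(y[test_index].mean())

print(test_fractions)
```

Unlike the folds of StratifiedKFold, the test sets from different iterations may overlap, because each split is drawn from a fresh shuffle.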

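StratifiedShuffleSplit is applied after you choose the column that should be evenly represented in all of the smaller datasets you are about to generate. "The folds are made by preserving the percentage of samples for each class."

Suppose you have a dataset 'data' with a column 'season', and you want 'season' to be evenly represented. Then it looks like this: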

from sklearn.model_selection import StratifiedShuffleSplit
sss=StratifiedShuffleSplit(n_splits=1,test_size=0.25,random_state=0)

for train_index, test_index in sss.split(data, data["season"]):
    sss_train = data.iloc[train_index]
    sss_test = data.iloc[test_index]
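To check the result, you could compare the relative frequency of each season before and after the split. The DataFrame below is a hypothetical stand-in for 'data'; only the 'season' column name comes from the answer.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical stand-in for 'data': 100 rows with an uneven season mix.
data = pd.DataFrame({
    "season": ["spring"] * 40 + ["summer"] * 20 + ["autumn"] * 20 + ["winter"] * 20,
    "value": range(100),
})

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_index, test_index = next(sss.split(data, data["season"]))
sss_train, sss_test = data.iloc[train_index], data.iloc[test_index]

# Relative frequency of each season in the full data and in the test set.
print(data["season"].value_counts(normalize=True))
print(sss_test["season"].value_counts(normalize=True))
```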

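So it is desirable to split the dataset into training and test sets in a way that preserves the same proportion of examples in each class as observed in the original dataset.

This is called a stratified train-test split.

We can achieve this by setting the "stratify" argument to the y component of the original dataset. The train_test_split() function uses it to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided "y" array.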

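from sklearn.model_selection import train_test_split

# Fractions of the full dataset; train_size is 1 - tst_size - vld_size
tst_size = 0.15
vld_size = 0.15

# df is assumed to be a DataFrame whose target column is named "y"
X, y = df.drop(columns="y"), df["y"]

# First carve off the validation set, stratified on the target
X_train_test, X_valid, y_train_test, y_valid = train_test_split(
    X, y, test_size=vld_size, stratify=y, random_state=13903)

# Then split the remainder; rescale test_size so it stays a fraction of the full data
X_train, X_test, y_train, y_test = train_test_split(
    X_train_test, y_train_test, test_size=tst_size / (1 - vld_size),
    stratify=y_train_test, random_state=13903)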

Updating @tangy's answer from above for a newer version of scikit-learn, 0.23.2 (see the StratifiedShuffleSplit documentation):

from sklearn.model_selection import StratifiedShuffleSplit

n_splits = 1  # We only want a single split in this case
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Source URL: https://stackoverflow.com/questions/29438265/stratified-train-test-split-in-scikit-learn
