Multilabel classification sklearn

1.12. Multiclass and multioutput algorithms¶

This section of the user guide covers functionality related to multi-learning problems, including multiclass, multilabel, and multioutput classification and regression.

The modules in this section implement meta-estimators, which require a base estimator to be provided in their constructor. Meta-estimators extend the functionality of the base estimator to support multi-learning problems, which is accomplished by transforming the multi-learning problem into a set of simpler problems, then fitting one estimator per problem.

This section covers two modules: sklearn.multiclass and sklearn.multioutput. The chart below demonstrates the problem types that each module is responsible for, and the corresponding meta-estimators that each module provides.

[Figure: organization chart mapping each problem type to the meta-estimators provided by sklearn.multiclass and sklearn.multioutput]

The table below provides a quick reference on the differences between problem types. More detailed explanations can be found in subsequent sections of this guide.

| Problem type                           | Number of targets | Target cardinality | Valid type_of_target     |
|----------------------------------------|-------------------|--------------------|--------------------------|
| Multiclass classification              | 1                 | >2                 | ‘multiclass’             |
| Multilabel classification              | >1                | 2 (0 or 1)         | ‘multilabel-indicator’   |
| Multiclass-multioutput classification  | >1                | >2                 | ‘multiclass-multioutput’ |
| Multioutput regression                 | >1                | Continuous         | ‘continuous-multioutput’ |

Below is a summary of scikit-learn estimators that have multi-learning support built-in, grouped by strategy. You don’t need the meta-estimators provided by this section if you’re using one of these estimators. However, meta-estimators can provide additional strategies beyond what is built-in. (The per-strategy lists appear in the older version of this page reproduced later in this document.)

1.12.1. Multiclass classification¶

Warning

All classifiers in scikit-learn do multiclass classification out-of-the-box. You don’t need to use the sklearn.multiclass module unless you want to experiment with different multiclass strategies.

Multiclass classification is a classification task with more than two classes. Each sample can only be labeled as one class.

For example, classification using features extracted from a set of images of fruit, where each image may either be of an orange, an apple, or a pear. Each image is one sample and is labeled as one of the 3 possible classes. Multiclass classification makes the assumption that each sample is assigned to one and only one label - one sample cannot, for example, be both a pear and an apple.

While all scikit-learn classifiers are capable of multiclass classification, the meta-estimators offered by sklearn.multiclass permit changing the way they handle more than two classes because this may have an effect on classifier performance (either in terms of generalization error or required computational resources).

1.12.1.1. Target format¶

Valid multiclass representations for type_of_target(y) are:

  • 1d or column vector containing more than two discrete values. An example of a vector y for 4 samples:

    >>> import numpy as np
    >>> y = np.array(['apple', 'pear', 'apple', 'orange'])
    >>> print(y)
    ['apple' 'pear' 'apple' 'orange']
  • Dense or sparse binary matrix of shape (n_samples, n_classes) with a single sample per row, where each column represents one class. An example of both a dense and sparse binary matrix for 4 samples, where the columns, in order, are apple, orange, and pear:

    >>> import numpy as np
    >>> from sklearn.preprocessing import LabelBinarizer
    >>> y = np.array(['apple', 'pear', 'apple', 'orange'])
    >>> y_dense = LabelBinarizer().fit_transform(y)
    >>> print(y_dense)
    [[1 0 0]
     [0 0 1]
     [1 0 0]
     [0 1 0]]
    >>> from scipy import sparse
    >>> y_sparse = sparse.csr_matrix(y_dense)
    >>> print(y_sparse)
      (0, 0)    1
      (1, 2)    1
      (2, 0)    1
      (3, 1)    1

For more information about LabelBinarizer, refer to Transforming the prediction target (y).

1.12.1.2. OneVsRestClassifier¶

The one-vs-rest strategy, also known as one-vs-all, is implemented in OneVsRestClassifier. The strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy and is a fair default choice.

Below is an example of multiclass learning using OvR:

>>> from sklearn import datasets
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.svm import LinearSVC
>>> X, y = datasets.load_iris(return_X_y=True)
>>> OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y).predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

OneVsRestClassifier also supports multilabel classification. To use this feature, feed the classifier an indicator matrix, in which cell [i, j] indicates the presence of label j in sample i.
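As a quick illustration of this usage (this snippet is not part of the original scikit-learn page; the tiny dataset and label layout are invented for the sketch):

# Hedged sketch: multilabel learning with OneVsRestClassifier and an indicator matrix.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X = np.array([[0.1, 0.4], [1.2, 0.9], [0.3, 1.5], [1.1, 1.4]])
# Row i of y marks which of the three labels apply to sample i.
y = np.array([[1, 0, 0],
              [0, 1, 1],
              [1, 0, 1],
              [0, 1, 0]])

clf = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)
print(clf.predict(X))  # predictions come back as an (n_samples, n_labels) indicator array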

[Figure: multilabel classification example plot from the scikit-learn gallery]

1.12.1.3. OneVsOneClassifier¶

OneVsOneClassifier constructs one classifier per pair of classes. At prediction time, the class which received the most votes is selected. In the event of a tie (among two classes with an equal number of votes), it selects the class with the highest aggregate classification confidence by summing over the pair-wise classification confidence levels computed by the underlying binary classifiers.

Since it requires fitting n_classes * (n_classes - 1) / 2 classifiers, this method is usually slower than one-vs-the-rest, due to its O(n_classes^2) complexity. However, this method may be advantageous for algorithms such as kernel algorithms which don’t scale well with n_samples. This is because each individual learning problem only involves a small subset of the data whereas, with one-vs-the-rest, the complete dataset is used n_classes times. The decision function is the result of a monotonic transformation of the one-versus-one classification.

Below is an example of multiclass learning using OvO:

>>> from sklearn import datasets
>>> from sklearn.multiclass import OneVsOneClassifier
>>> from sklearn.svm import LinearSVC
>>> X, y = datasets.load_iris(return_X_y=True)
>>> OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, y).predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

References:

  • “Pattern Recognition and Machine Learning. Springer”, Christopher M. Bishop, page 183, (First Edition)

1.12.1.4. OutputCodeClassifier¶

Error-Correcting Output Code-based strategies are fairly different from one-vs-the-rest and one-vs-one. With these strategies, each class is represented in a Euclidean space, where each dimension can only be 0 or 1. Another way to put it is that each class is represented by a binary code (an array of 0 and 1). The matrix which keeps track of the location/code of each class is called the code book. The code size is the dimensionality of the aforementioned space. Intuitively, each class should be represented by a code as unique as possible and a good code book should be designed to optimize classification accuracy. In this implementation, we simply use a randomly-generated code book, as advocated in the references below, although more elaborate methods may be added in the future.

At fitting time, one binary classifier per bit in the code book is fitted. At prediction time, the classifiers are used to project new points in the class space and the class closest to the points is chosen.

In OutputCodeClassifier, the code_size attribute allows the user to control the number of classifiers which will be used. It is a percentage of the total number of classes.

A number between 0 and 1 will require fewer classifiers than one-vs-the-rest. In theory, log2(n_classes) / n_classes is sufficient to represent each class unambiguously. However, in practice, it may not lead to good accuracy since log2(n_classes) is much smaller than n_classes.

A number greater than 1 will require more classifiers than one-vs-the-rest. In this case, some classifiers will in theory correct for the mistakes made by other classifiers, hence the name “error-correcting”. In practice, however, this may not happen as classifier mistakes will typically be correlated. The error-correcting output codes have a similar effect to bagging.

Below is an example of multiclass learning using Output-Codes:

>>> from sklearn import datasets
>>> from sklearn.multiclass import OutputCodeClassifier
>>> from sklearn.svm import LinearSVC
>>> X, y = datasets.load_iris(return_X_y=True)
>>> clf = OutputCodeClassifier(LinearSVC(random_state=0),
...                            code_size=2, random_state=0)
>>> clf.fit(X, y).predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

References:

  • “Solving multiclass learning problems via error-correcting output codes”, Dietterich T., Bakiri G., Journal of Artificial Intelligence Research 2, 1995.

  • “The Elements of Statistical Learning”, Hastie T., Tibshirani R., Friedman J., page 606 (second-edition) 2008.

1.12.2. Multilabel classification¶

Multilabel classification (closely related to multioutput classification) is a classification task labeling each sample with m labels from n_classes possible classes, where m can be 0 to n_classes inclusive. This can be thought of as predicting properties of a sample that are not mutually exclusive. Formally, a binary output is assigned to each class, for every sample. Positive classes are indicated with 1 and negative classes with 0 or -1. It is thus comparable to running n_classes binary classification tasks, for example with MultiOutputClassifier. This approach treats each label independently whereas multilabel classifiers may treat the multiple classes simultaneously, accounting for correlated behavior among them.

For example, prediction of the topics relevant to a text document or video. The document or video may be about one of ‘religion’, ‘politics’, ‘finance’ or ‘education’, several of the topic classes or all of the topic classes.

1.12.2.1. Target format¶

A valid representation of multilabel y is an either dense or sparse binary matrix of shape (n_samples, n_classes). Each column represents a class. The 1’s in each row denote the positive classes a sample has been labeled with. An example of a dense matrix y for 3 samples:

>>> y = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [0, 0, 0, 0]])
>>> print(y)
[[1 0 0 1]
 [0 0 1 1]
 [0 0 0 0]]

Dense binary matrices can also be created using MultiLabelBinarizer. For more information, refer to Transforming the prediction target (y).

An example of the same in sparse matrix form:

>>> y_sparse = sparse.csr_matrix(y)
>>> print(y_sparse)
  (0, 0)    1
  (0, 3)    1
  (1, 2)    1
  (1, 3)    1
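As a brief aside (my own sketch, not from the original page), the MultiLabelBinarizer mentioned above turns a collection of label sets into exactly this kind of indicator matrix:

# Hedged sketch: building a multilabel indicator matrix from lists of labels.
from sklearn.preprocessing import MultiLabelBinarizer

labels_per_sample = [['politics', 'finance'], ['education'], []]
mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(labels_per_sample)
print(mlb.classes_)   # column order of the indicator matrix
print(y_indicator)    # dense binary matrix of shape (n_samples, n_classes)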

1.12.2.2. MultiOutputClassifier¶

Multilabel classification support can be added to any classifier with MultiOutputClassifier. This strategy consists of fitting one classifier per target. This allows multiple target variable classifications. The purpose of this class is to extend estimators to be able to estimate a series of target functions (f1, f2, f3, ..., fn) that are trained on a single X predictor matrix to predict a series of responses (y1, y2, y3, ..., yn).

Below is an example of multilabel classification:

>>> from sklearn.datasets import make_classification
>>> from sklearn.multioutput import MultiOutputClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.utils import shuffle
>>> import numpy as np
>>> X, y1 = make_classification(n_samples=10, n_features=100,
...                             n_informative=30, n_classes=3,
...                             random_state=1)
>>> y2 = shuffle(y1, random_state=1)
>>> y3 = shuffle(y1, random_state=2)
>>> Y = np.vstack((y1, y2, y3)).T
>>> n_samples, n_features = X.shape  # 10, 100
>>> n_outputs = Y.shape[1]  # 3
>>> n_classes = 3
>>> forest = RandomForestClassifier(random_state=1)
>>> multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
>>> multi_target_forest.fit(X, Y).predict(X)
array([[2, 2, 0],
       [1, 2, 1],
       [2, 1, 0],
       [0, 0, 2],
       [0, 2, 1],
       [0, 0, 2],
       [1, 1, 0],
       [1, 1, 1],
       [0, 0, 2],
       [2, 0, 0]])

1.12.2.3. ClassifierChain¶

Classifier chains (see ClassifierChain) are a way of combining a number of binary classifiers into a single multi-label model that is capable of exploiting correlations among targets.

For a multi-label classification problem with N classes, N binary classifiers are assigned an integer between 0 and N-1. These integers define the order of models in the chain. Each classifier is then fit on the available training data plus the true labels of the classes whose models were assigned a lower number.

When predicting, the true labels will not be available. Instead the predictions of each model are passed on to the subsequent models in the chain to be used as features.

Clearly the order of the chain is important. The first model in the chain has no information about the other labels while the last model in the chain has features indicating the presence of all of the other labels. In general one does not know the optimal ordering of the models in the chain so typically many randomly ordered chains are fit and their predictions are averaged together.
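To make the chaining mechanism concrete, here is a minimal sketch (it assumes a synthetic dataset from make_multilabel_classification and logistic regression as the base estimator; it is not the example from the scikit-learn documentation):

# Hedged sketch: a randomly ordered classifier chain on a synthetic multilabel problem.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=100, n_classes=4, random_state=0)

# order='random' shuffles the label order of the chain; order=None keeps 0..N-1.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order='random', random_state=42)
chain.fit(X, Y)
print(chain.predict(X[:5]))  # indicator predictions for the first five samples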

References:

Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank, “Classifier Chains for Multi-label Classification”, 2009.

1.12.3. Multiclass-multioutput classification¶

Multiclass-multioutput classification (also known as multitask classification) is a classification task which labels each sample with a set of non-binary properties. Both the number of properties and the number of classes per property is greater than 2. A single estimator thus handles several joint classification tasks. This is both a generalization of the multilabel classification task, which only considers binary attributes, as well as a generalization of the multiclass classification task, where only one property is considered.

For example, classification of the properties “type of fruit” and “colour” for a set of images of fruit. The property “type of fruit” has the possible classes: “apple”, “pear” and “orange”. The property “colour” has the possible classes: “green”, “red”, “yellow” and “orange”. Each sample is an image of a fruit, a label is output for both properties and each label is one of the possible classes of the corresponding property.

Note that all classifiers handling multiclass-multioutput (also known as multitask classification) tasks, support the multilabel classification task as a special case. Multitask classification is similar to the multioutput classification task with different model formulations. For more information, see the relevant estimator documentation.

Warning

At present, no metric in sklearn.metrics supports the multiclass-multioutput classification task.

1.12.3.1. Target format¶

A valid representation of multioutput y is a dense matrix of shape (n_samples, n_outputs) of class labels, i.e. a column-wise concatenation of 1d multiclass variables. An example of y for 3 samples:

>>> y = np.array([['apple', 'green'], ['orange', 'orange'], ['pear', 'green']])
>>> print(y)
[['apple' 'green']
 ['orange' 'orange']
 ['pear' 'green']]
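As a small sketch of how such a 2d target can be fit directly (my own example with invented data; it uses RandomForestClassifier, which supports multiclass-multioutput targets natively):

# Hedged sketch: fitting a multiclass-multioutput target with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1.0, 0.2], [0.8, 0.9], [0.1, 0.7]])
y = np.array([['apple', 'green'],
              ['orange', 'orange'],
              ['pear', 'green']])

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict(X))  # one predicted class per property, per sample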

1.12.4. Multioutput regression¶

Multioutput regression predicts multiple numerical properties for each sample. Each property is a numerical variable and the number of properties to be predicted for each sample is greater than or equal to 2. Some estimators that support multioutput regression are faster than just fitting one estimator per output.

For example, prediction of both wind speed and wind direction, in degrees, using data obtained at a certain location. Each sample would be data obtained at one location and both wind speed and direction would be output for each sample.

1.12.4.1. Target format¶

A valid representation of multioutput y is a dense matrix of shape (n_samples, n_outputs) of floats, i.e. a column-wise concatenation of continuous variables. An example of y for 3 samples:

>>> y = np.array([[31.4, 94], [40.5, 109], [25.0, 30]])
>>> print(y)
[[ 31.4  94. ]
 [ 40.5 109. ]
 [ 25.   30. ]]

1.12.4.2. MultiOutputRegressor¶

Multioutput regression support can be added to any regressor with MultiOutputRegressor. This strategy consists of fitting one regressor per target. Since each target is represented by exactly one regressor it is possible to gain knowledge about the target by inspecting its corresponding regressor. As MultiOutputRegressor fits one regressor per target it can not take advantage of correlations between targets.

Below is an example of multioutput regression:

>>> from sklearn.datasets import make_regression
>>> from sklearn.multioutput import MultiOutputRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> X, y = make_regression(n_samples=10, n_targets=3, random_state=1)
>>> MultiOutputRegressor(GradientBoostingRegressor(random_state=0)).fit(X, y).predict(X)
array([[-154.75474165, -147.03498585,  -50.03812219],
       [   7.12165031,    5.12914884,  -81.46081961],
       [-187.8948621 , -100.44373091,   13.88978285],
       [-141.62745778,   95.02891072, -191.48204257],
       [  97.03260883,  165.34867495,  139.52003279],
       [ 123.92529176,   21.25719016,   -7.84253   ],
       [-122.25193977,  -85.16443186, -107.12274212],
       [ -30.170388  ,  -94.80956739,   12.16979946],
       [ 140.72667194,  176.50941682,  -17.50447799],
       [ 149.37967282,  -81.15699552,   -5.72850319]])

1.12.4.3. RegressorChain¶

Regressor chains (see RegressorChain) are analogous to ClassifierChain as a way of combining a number of regressions into a single multi-target model that is capable of exploiting correlations among targets.
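A minimal sketch of the idea (my own illustration; the dataset, base regressor and chain order are assumptions, not part of the original page):

# Hedged sketch: a regressor chain where each regressor sees earlier targets as extra features.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.multioutput import RegressorChain

X, y = make_regression(n_samples=50, n_targets=3, random_state=0)

chain = RegressorChain(Ridge(), order=[0, 1, 2])  # explicit chain order over the 3 targets
chain.fit(X, y)
print(chain.predict(X[:3]))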

Sours: http://scikit-learn.org/stable/modules/multiclass.html

Multi Label Text Classification with Scikit-Learn

Multi-class classification means a classification task with more than two classes, where the classes are mutually exclusive. The classification makes the assumption that each sample is assigned to one and only one label.

On the other hand, multi-label classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data-point that are not mutually exclusive, such as a Tim Hortons location being categorized as both a bakery and a coffee shop. Multi-label text classification has many real world applications such as categorizing businesses on Yelp or classifying movies into one or more genre(s).

Anyone who has been the target of abuse or harassment online will know that it doesn’t go away when you log off or switch off your phone. Researchers at Google are working on tools to study toxic comments online. In this post, we will build a multi-label model that’s capable of detecting different types of toxicity like severe toxic, threats, obscenity, insults, and so on. We will be using supervised classifiers and text representations. A toxic comment might be about any of toxic, severe toxic, obscene, threat, insult or identity hate at the same time or none of the above. The data set can be found at Kaggle.

(Disclaimer from the data source: the dataset contains text that may be considered profane, vulgar, or offensive.)

%matplotlib inline
import re
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import seaborn as sns

df = pd.read_csv("train 2.csv", encoding = "ISO-8859-1")
df.head()

Number of comments in each category

df_toxic = df.drop(['id', 'comment_text'], axis=1)
counts = []
categories = list(df_toxic.columns.values)
for i in categories:
    counts.append((i, df_toxic[i].sum()))
df_stats = pd.DataFrame(counts, columns=['category', 'number_of_comments'])
df_stats
df_stats.plot(x='category', y='number_of_comments', kind='bar', legend=False, grid=True, figsize=(8, 5))
plt.title("Number of comments per category")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('category', fontsize=12)

How many comments have multi labels?

rowsums = df.iloc[:,2:].sum(axis=1)
x = rowsums.value_counts()

# plot
plt.figure(figsize=(8,5))
ax = sns.barplot(x.index, x.values)
plt.title("Multiple categories per comment")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('# of categories', fontsize=12)

Vast majority of the comment text are not labeled.

print('Percentage of comments that are not labelled:')
print(len(df[(df['toxic']==0) & (df['severe_toxic']==0) & (df['obscene']==0) & (df['threat']== 0) & (df['insult']==0) & (df['identity_hate']==0)]) / len(df))

Percentage of comments that are not labelled:
0.8983211235124177

The distribution of the number of words in comment texts.

lens = df.comment_text.str.len()
lens.hist(bins = np.arange(0,5000,50))

Most of the comment text length are within 500 characters, with some outliers up to 5,000 characters long.

There is no missing comment in comment text column.

print('Number of missing comments in comment text:')
df['comment_text'].isnull().sum()

Number of missing comments in comment text:

0

Have a peek at the first comment; the text needs to be cleaned.

df['comment_text'][0]

Explanation\rWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren’t vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don’t remove the template from the talk page since I’m retired now.89.205.38.27

Create a function to clean the text

def clean_text(text):
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub('\W', ' ', text)
    text = re.sub('\s+', ' ', text)
    text = text.strip(' ')
    return text

Clean up comment_text column:

df['comment_text'] = df['comment_text'].map(lambda com : clean_text(com))

df['comment_text'][0]

explanation why the edits made under my username hardcore metallica fan were reverted they were not vandalisms just closure on some gas after i voted at new york dolls fac and please do not remove the template from the talk page since i am retired now 89 205 38 27

Much better!

Split the data to train and test sets:

categories = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train, test = train_test_split(df, random_state=42, test_size=0.33, shuffle=True)

X_train = train.comment_text
X_test = test.comment_text
print(X_train.shape)
print(X_test.shape)

(106912,)
(52659,)

Pipeline

Scikit-learn provides a pipeline utility to help automate machine learning workflows. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply. So we will utilize pipeline to train every classifier.

OneVsRest multi-label strategy

The Multi-label algorithm accepts a binary mask over multiple labels. The result for each prediction will be an array of 0s and 1s marking which class labels apply to each row input sample.

Naive Bayes

OneVsRest strategy can be used for multi-label learning, where a classifier is used to predict multiple labels for instance. Naive Bayes supports multi-class, but we are in a multi-label scenario, therefore, we wrap Naive Bayes in the OneVsRestClassifier.

# Define a pipeline combining a text feature extractor with a multi label classifier
NB_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),
])

for category in categories:
    print('... Processing {}'.format(category))
    # train the model using X_dtm & y
    NB_pipeline.fit(X_train, train[category])
    # compute the testing accuracy
    prediction = NB_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

… Processing toxic
Test accuracy is 0.9191401279933155
… Processing severe_toxic
Test accuracy is 0.9900112041626312
… Processing obscene
Test accuracy is 0.9514802787747584
… Processing threat
Test accuracy is 0.9971135038644866
… Processing insult
Test accuracy is 0.9517271501547694
… Processing identity_hate
Test accuracy is 0.9910556600011394

LinearSVC

SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])

for category in categories:
    print('... Processing {}'.format(category))
    # train the model using X_dtm & y
    SVC_pipeline.fit(X_train, train[category])
    # compute the testing accuracy
    prediction = SVC_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

… Processing toxic
Test accuracy is 0.9599498661197516
… Processing severe_toxic
Test accuracy is 0.9906948479842003
… Processing obscene
Test accuracy is 0.9789019920621356
… Processing threat
Test accuracy is 0.9974173455629617
… Processing insult
Test accuracy is 0.9712299891756395
… Processing identity_hate
Test accuracy is 0.9919861752027194

Logistic Regression

LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),
])

for category in categories:
    print('... Processing {}'.format(category))
    # train the model using X_dtm & y
    LogReg_pipeline.fit(X_train, train[category])
    # compute the testing accuracy
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

… Processing toxic
Test accuracy is 0.9548415275641391
… Processing severe_toxic
Test accuracy is 0.9910556600011394
… Processing obscene
Test accuracy is 0.9761104464573956
… Processing threat
Test accuracy is 0.9973793653506523
… Processing insult
Test accuracy is 0.9687612753755294
… Processing identity_hate
Test accuracy is 0.991758293928863

The three classifiers produced similar results. We have created a strong baseline for the toxic comment multi-label text classification problem.

The full code for this post can be found on Github. I look forward to hearing any feedback or comment.

Sours: https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5

News

0.2.0 (released 2018-12-10)

A new feature release:

  • first python implementation of multi-label SVM (MLTSVM)
  • a general multi-label embedding framework with several embedders supported (LNEMLC, CLEMS)
  • balanced k-means clusterer from HOMER implemented
  • wrapper for Keras model use in scikit-multilearn

0.1.0 [stable] (released 2018-09-04)

Fixes a lot of bugs and generally improves stability, cross-platform functionality, standards compliance and unit test coverage. This release has been tested with a large set of unit tests that also work on Windows. Also, new features:

  • multi-label stratification algorithm and stratification quality measures
  • a robust reorganization of label space division, alongside a working stochastic blockmodel approach and a new underlying layer - graph builders that allow using graph models for dividing the label space based not just on label co-occurrence but on any kind of network relationships between labels you can come up with
  • meka wrapper works fully cross-platform now, including windows 10
  • multi-label data set downloading and load/save functionality brought in, like sklearn's dataset
  • kNN models support sparse input
  • MLARAM models support sparse input
  • BSD-compatible label space partitioning via NetworkX
  • dependence on GPL libraries made optional
  • working predict_proba added for label space partitioning methods
  • MLARAM moved from neurofuzzy to adapt
  • test coverage increased to 94%
  • Classifier Chains allow specifying the chain order
  • lots of documentation updates
Sours: http://scikit.ml/

Multiclass and multilabel algorithms

Warning All classifiers in scikit-learn do multiclass classification out-of-the-box. You don’t need to use the sklearn.multiclass module unless you want to experiment with different multiclass strategies.

The sklearn.multiclass module implements meta-estimators to solve multiclass and multilabel classification problems by decomposing such problems into binary classification problems. Multitarget regression is also supported.

  • Multiclass classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
  • Multilabel classification assigns to each sample a set of target labels. This can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.
  • Multioutput regression assigns each sample a set of target values. This can be thought of as predicting several properties for each data-point, such as wind direction and magnitude at a certain location.
  • Multioutput-multiclass classification and multi-task classification means that a single estimator has to handle several joint classification tasks. This is both a generalization of the multi-label classification task, which only considers binary classification, as well as a generalization of the multi-class classification task. The output format is a 2d numpy array or sparse matrix. The set of labels can be different for each output variable. For instance, a sample could be assigned “pear” for an output variable that takes possible values in a finite set of species such as “pear”, “apple”; and “blue” or “green” for a second output variable that takes possible values in a finite set of colors such as “green”, “red”, “blue”, “yellow”… This means that any classifiers handling multi-output multiclass or multi-task classification tasks, support the multi-label classification task as a special case. Multi-task classification is similar to the multi-output classification task with different model formulations. For more information, see the relevant estimator documentation.

All scikit-learn classifiers are capable of multiclass classification, but the meta-estimators offered by sklearn.multiclass permit changing the way they handle more than two classes because this may have an effect on classifier performance (either in terms of generalization error or required computational resources).

Below is a summary of the classifiers supported by scikit-learn grouped by strategy; you don’t need the meta-estimators in this class if you’re using one of these, unless you want custom multiclass behavior:

Inherently multiclass:

  • sklearn.naive_bayes.BernoulliNB
  • sklearn.tree.DecisionTreeClassifier
  • sklearn.tree.ExtraTreeClassifier
  • sklearn.ensemble.ExtraTreesClassifier
  • sklearn.naive_bayes.GaussianNB
  • sklearn.neighbors.KNeighborsClassifier
  • sklearn.semi_supervised.LabelPropagation
  • sklearn.semi_supervised.LabelSpreading
  • sklearn.discriminant_analysis.LinearDiscriminantAnalysis
  • sklearn.svm.LinearSVC (setting multi_class=”crammer_singer”)
  • sklearn.linear_model.LogisticRegression (setting multi_class=”multinomial”)
  • sklearn.linear_model.LogisticRegressionCV (setting multi_class=”multinomial”)
  • sklearn.neural_network.MLPClassifier
  • sklearn.neighbors.NearestCentroid
  • sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis
  • sklearn.neighbors.RadiusNeighborsClassifier
  • sklearn.ensemble.RandomForestClassifier
  • sklearn.linear_model.RidgeClassifier
  • sklearn.linear_model.RidgeClassifierCV

Multiclass as One-Vs-One:

  • sklearn.svm.NuSVC
  • sklearn.svm.SVC.
  • sklearn.gaussian_process.GaussianProcessClassifier (setting multi_class = “one_vs_one”)

Multiclass as One-Vs-All:

  • sklearn.ensemble.GradientBoostingClassifier
  • sklearn.gaussian_process.GaussianProcessClassifier (setting multi_class = “one_vs_rest”)
  • sklearn.svm.LinearSVC (setting multi_class=”ovr”)
  • sklearn.linear_model.LogisticRegression (setting multi_class=”ovr”)
  • sklearn.linear_model.LogisticRegressionCV (setting multi_class=”ovr”)
  • sklearn.linear_model.SGDClassifier
  • sklearn.linear_model.Perceptron
  • sklearn.linear_model.PassiveAggressiveClassifier

Support multilabel:

  • sklearn.tree.DecisionTreeClassifier
  • sklearn.tree.ExtraTreeClassifier
  • sklearn.ensemble.ExtraTreesClassifier
  • sklearn.neighbors.KNeighborsClassifier
  • sklearn.neural_network.MLPClassifier
  • sklearn.neighbors.RadiusNeighborsClassifier
  • sklearn.ensemble.RandomForestClassifier
  • sklearn.linear_model.RidgeClassifierCV

Support multiclass-multioutput:

  • sklearn.tree.DecisionTreeClassifier
  • sklearn.tree.ExtraTreeClassifier
  • sklearn.ensemble.ExtraTreesClassifier
  • sklearn.neighbors.KNeighborsClassifier
  • sklearn.neighbors.RadiusNeighborsClassifier
  • sklearn.ensemble.RandomForestClassifier

Warning: At present, no metric in sklearn.metrics supports the multioutput-multiclass classification task.

Multilabel classification format

In multilabel learning, the joint set of binary classification tasks is expressed with a label binary indicator array: each sample is one row of a 2d array of shape (n_samples, n_classes) with binary values, where the ones, i.e. the non-zero elements, correspond to the subset of labels for that sample. An array such as np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]]) represents label 0 in the first sample, labels 1 and 2 in the second sample, and no labels in the third sample.

Producing multilabel data as a list of sets of labels may be more intuitive. The MultiLabelBinarizer transformer can be used to convert between a collection of collections of labels and the indicator format.

One-Vs-The-Rest

This strategy, also known as one-vs-all, is implemented in OneVsRestClassifier. The strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy and is a fair default choice.

Multiclass learning

An example of multiclass learning using OvR is shown in the first part of this document.

Multilabel learning

OneVsRestClassifier also supports multilabel classification. To use this feature, feed the classifier an indicator matrix, in which cell [i, j] indicates the presence of label j in sample i.

One-Vs-One

OneVsOneClassifier constructs one classifier per pair of classes. At prediction time, the class which received the most votes is selected. In the event of a tie (among two classes with an equal number of votes), it selects the class with the highest aggregate classification confidence by summing over the pair-wise classification confidence levels computed by the underlying binary classifiers.

Since it requires fitting n_classes * (n_classes - 1) / 2 classifiers, this method is usually slower than one-vs-the-rest, due to its O(n_classes^2) complexity. However, this method may be advantageous for algorithms such as kernel algorithms which don’t scale well with n_samples. This is because each individual learning problem only involves a small subset of the data whereas, with one-vs-the-rest, the complete dataset is used n_classes times.

Multiclass learning

An example of multiclass learning using OvO is shown in the first part of this document.

Error-Correcting Output-Codes

Output-code based strategies are fairly different from one-vs-the-rest and one-vs-one. With these strategies, each class is represented in a Euclidean space, where each dimension can only be 0 or 1. Another way to put it is that each class is represented by a binary code (an array of 0 and 1). The matrix which keeps track of the location/code of each class is called the code book. The code size is the dimensionality of the aforementioned space. Intuitively, each class should be represented by a code as unique as possible and a good code book should be designed to optimize classification accuracy. In this implementation, we simply use a randomly-generated code book, as advocated in the references listed in the first part of this document, although more elaborate methods may be added in the future.

At fitting time, one binary classifier per bit in the code book is fitted. At prediction time, the classifiers are used to project new points in the class space and the class closest to the points is chosen.

In OutputCodeClassifier, the code_size attribute allows the user to control the number of classifiers which will be used. It is a percentage of the total number of classes.

A number between 0 and 1 will require fewer classifiers than one-vs-the-rest. In theory, log2(n_classes) / n_classes is sufficient to represent each class unambiguously. However, in practice, it may not lead to good accuracy since log2(n_classes) is much smaller than n_classes.

A number greater than 1 will require more classifiers than one-vs-the-rest. In this case, some classifiers will in theory correct for the mistakes made by other classifiers, hence the name “error-correcting”. In practice, however, this may not happen as classifier mistakes will typically be correlated. The error-correcting output codes have a similar effect to bagging.

Multiclass learning

An example of multiclass learning using Output-Codes is shown in the first part of this document.

Multioutput regression

Multioutput regression support can be added to any regressor with MultiOutputRegressor. This strategy consists of fitting one regressor per target. Since each target is represented by exactly one regressor it is possible to gain knowledge about the target by inspecting its corresponding regressor. As MultiOutputRegressor fits one regressor per target it can not take advantage of correlations between targets.

An example of multioutput regression is shown in the first part of this document.

Multioutput classification

Multioutput classification support can be added to any classifier with MultiOutputClassifier. This strategy consists of fitting one classifier per target. This allows multiple target variable classifications. The purpose of this class is to extend estimators to be able to estimate a series of target functions (f1,f2,f3…,fn) that are trained on a single X predictor matrix to predict a series of responses (y1,y2,y3…,yn).

An example of multioutput classification is shown in the first part of this document.

Classifier Chain

Classifier chains (see ClassifierChain) are a way of combining a number of binary classifiers into a single multi-label model that is capable of exploiting correlations among targets.

For a multi-label classification problem with N classes, N binary classifiers are assigned an integer between 0 and N-1. These integers define the order of models in the chain. Each classifier is then fit on the available training data plus the true labels of the classes whose models were assigned a lower number.

When predicting, the true labels will not be available. Instead the predictions of each model are passed on to the subsequent models in the chain to be used as features.

Clearly the order of the chain is important. The first model in the chain has no information about the other labels while the last model in the chain has features indicating the presence of all of the other labels. In general one does not know the optimal ordering of the models in the chain so typically many randomly ordered chains are fit and their predictions are averaged together.

Regressor Chain

Regressor chains are analogous to ClassifierChain as a way of combining a number of regressions into a single multi-target model that is capable of exploiting correlations among targets.
 

Sours: http://semantic-portal.net/

How to create a Multilabel SVM classifier with Scikit-learn

Last Updated on 12 November 2020

Classification comes in many flavors. For example, if you need to categorize your input samples into one out of two classes, you are dealing with a binary classification problem. If the number of classes is greater than 2, the problem is a multiclass one. But now, what if you don’t classify your input sample into one out of many classes, but rather into some of the many classes?

That would be a multilabel classification problem and we’re going to cover it from a Support Vector Machine perspective in this article.

Support Vector Machines can be used for building classifiers. They are natively equipped to perform binary classification tasks. However, they cannot perform multiclass and multilabel classification natively. Fortunately, there are techniques out there with which this becomes possible. How the latter – multilabel classification – can work with an SVM is what you will see in this article. It is structured as follows.

Firstly, we’ll take a look at multilabel classification in general. What is it? What can it be used for? And how is it different from multiclass classification? This is followed by looking at multilabel classification with Support Vector Machines. In particular, we will look at why multilabel classification is not possible natively. Fortunately, the Scikit-learn library for machine learning provides a module, with which it is possible to create a multilabel SVM! We cover implementing one with Scikit-learn and Python step by step in the final part of this article.

Let’s take a look! 😎



What is multilabel classification?

Imagine that you’re an employee working in a factory. Your task is to monitor a conveyor belt which is forwarding two types of objects: a yellow rotated-and-square-shaped block and a blue, circular one. When an object is near the end of the conveyor belt, you must label it with two types of labels: its color and its shape.

In other words, the labels yellow and square are attached to the yellow squares, while blue and circular end up with the blue circles.

This is a human-powered multilabel classifier. Human beings inspect objects, attach \(N\) labels to them (here \(N = 2\)), and pass them on – possibly into a bucket or onto another conveyor belt for packaging. So far, so good.

Human beings can however be quite a bottleneck in such a process. Because it is so repetitive, it can become boring, and if there is one thing humans don’t like, it’s being bored at work. In addition, the work is very continuous and hence tiring, increasing the odds of human error. In other words, wouldn’t it be a good idea to replace the human being with a machine here? The result would be a reduction in error rates while humans might be happier, doing more creative work.

That’s where Machine Learning comes into play. If we can learn to distinguish the yellow objects from the blue ones, we can build an automated system that attaches the labels for us. Since machines never get tired and work with what they have learnt from observations, they could potentially be a good replacement in our conveyor belt scenario.

There are many algorithms with which multilabel classification can be implemented. Neural Networks also belong to that category and are very popular these days. However, another class of algorithms with which a multilabel classifier can be created is that of Support Vector Machines. Let’s now take a look at what SVMs are, how they work, and how we can create a multilabel classifier with them.


Multilabel classification with Support Vector Machines

If we want to build a multilabel classifier with Support Vector Machines, we must first know how they work. For this reason, we will now take a brief look at what SVMs are conceptually and how they work. In addition, we’ll provide some brief insight into why a SVM cannot be used for multilabel classification natively. This provides the necessary context for understanding how we can make it work regardless, and you will understand the technique and the need for it better.

Let’s now cut to the chase.

A Support Vector Machine is a class of Machine Learning algorithms which uses kernel functions to learn a decision boundary between two classes (or learn a function for regression, should you be doing that). This decision boundary is of maximum margin between the two classes, meaning that it is equidistant from classes one and two. In the figure below, that would be the class of black items and the class of white ones. In addition, determining the boundary (which is called a hyperplane) is performed by means of support vectors.

All right, that’s quite a lot of complexity, so let’s break it apart into plainer English.

In the figure below, you can see three decision boundaries \(H_1\), \(H_2\) and \(H_3\). These decision boundaries are also called hyperplanes because they have one dimension fewer than the feature space itself. In other words, in the figure below, we have a two-dimensional feature space (axes \(X_1\) and \(X_2\)) and have three one-dimensional lines (i.e. hyperplanes) that serve as candidate decision boundaries: indeed, \(H_1\), \(H_2\) and \(H_3\).

\(H_1\) is actually no decision boundary at all, because it cannot distinguish between the classes. The other two are decision boundaries, because they can successfully be used to separate the classes from each other. But which is best? Obviously, that’s \(H_3\), even intuitively. But why is that the case? Let’s look at the decision boundary in more detail.

If you look at the line more closely, you can see that it is precisely in the middle of the area between the samples from each class that are closest to each other. These samples are called the support vectors, and hence the name Support Vector Machine. They effectively support the algorithm in learning the decision boundary. Now, recall that the line is precisely in the middle of the area in between those support vectors. This means that the line is equidistant to the two classes, meaning that on both ends the distance is the same. This in return means that our decision boundary is of maximum margin – it has the highest margin between the classes and is hence (one of the two) best decision boundaries that can be found.

Why SVMs can’t perform multiclass and multilabel classification natively

An unfortunate consequence of the way that SVMs learn their decision boundary is that they cannot be used for multilabel or multiclass classification. The reason why is simple: for a decision boundary to be a decision boundary in a SVM, the hyperplane (in our two-dimensional feature space that’s a line) must be equidistant from the classes in order to ensure maximum margin.

We can see that if we would add another class, generating a multiclass classification scenario, this would no longer be the case: at maximum, we can only guarantee equidistance between two of the classes – discarding this property with all other classes. The way an SVM works thus means that it cannot be used for multiclass classification, but fortunately there are many approaches (such as One-vs-One/One-vs-Rest) which can be used. Error-Correcting Output Codes are another means for generating a multiclass SVM classifier.

The other case would be multilabel classification. Here, we don’t assign one out of multiple classes to the input sample, but rather, we assign multiple classes to the input sample. Here, the number of classes assigned can in theory be equal to the absolute number of classes available, but often this is not the case. Now let’s take a look at assigning multiple labels to a SVM. The SVM is really rigid, a.k.a. relatively high bias, in terms of the function that is learned: one line separating two classes from each other. There is simply no way that multiple classes can be learned. This is why, next to multiclass classification, multilabel classification cannot be performed natively with SVMs.

Using a trick for creating a multilabel SVM classifier

As usual, people have found workarounds for creating a multilabel classifier with SVMs. The answer lies in the fact that the classification problem, which effectively involves assigning multiple labels to an instance, can be converted into many classification problems. While this increases the computational complexity of your Machine Learning problem, it is thus possible to create a multilabel SVM based classifier.

Since manually splitting the problem into many classification problems would be a bit cumbersome, we will now take a look at how we can implement multilabel classification with Scikit-learn.


Implementing a MultiOutputClassifier SVM with Scikit-learn

Scikit-learn provides the MultiOutputClassifier functionality, which implements a multilabel classifier for any regular classifier. For this reason, it will also work with an SVM. Let’s first generate two blobs of data which represent the first label, the object ‘type’ from the assembly line scenario above:

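The code listing from the original article did not survive the copy, so the snippet below is a reconstruction under the assumptions described in the text (two blob centers for the ‘type’ label and a randomly generated ‘color’ label); the variable names and exact settings are mine:

# Reconstructed sketch (assumed settings): two blobs for the 'type' label,
# plus a randomly generated 'color' label of the same shape.
import numpy as np
from sklearn.datasets import make_blobs

num_samples = 10000
cluster_centers = [(5, 5), (3, 3)]
num_classes = len(cluster_centers)

X, y_type = make_blobs(n_samples=num_samples, centers=cluster_centers,
                       n_features=2, cluster_std=0.30, random_state=42)
y_color = np.random.randint(num_classes, size=num_samples)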

When plotted, this shows two blobs of data that each belong to one class. Do note that we also create a second target array of the same shape as the first one. It is filled randomly for the sake of simplicity. This array contains the second label (color) that we will be using in this multilabel classification setting.

We can now use Scikit-learn to generate a multilabel SVM classifier. Here, we assume that our data is linearly separable. For the first (type) label, we will see that this is the case. For the second (color) label, this is not necessarily true since we generate it randomly. For this reason, you might wish to look for a particular kernel function that provides the linear decision boundary if you would use this code in a production setting. Always ensure that your data is or can become linearly separable before using SVMs!

  • First of all, we ensure that all our dependencies are imported. We import the pyplot API from Matplotlib for visualizing our results. Numpy is used for some numbers processing, and we import a number of Scikit-learn dependencies as well. More specifically, we use make_blobs for data generation, MultiOutputClassifier for the multilabel classifier, LinearSVC for the (linear!) SVM, train_test_split for splitting the data into a training and testing set, and finally the confusion matrix utilities from sklearn.metrics for generating and visualizing a confusion matrix.
  • We then specify some configuration options, such as the number of samples to generate, the cluster centers, and the number of classes. We can see here that we define two centers, and hence have two classes for the first label.
  • We then generate the data with the spec we provided in the previous bullet point. In addition, we create an array of the same shape for the second label, the color. We initialize it randomly for the sake of simplicity. While linearity is guaranteed for the first label, we might not find it for the second due to this reason!
  • We then combine the training labels into one array so that we can generate a split between training and testing data. This is what we do directly afterwards.
  • Then, we initialize the SVM classifier and turn it into a multilabel one. The n_jobs=-1 attribute indicates that all available processor functionality can be used for learning the classifiers.
  • We then fit the data to the classifier, meaning that we start the training process. After fitting is complete, the trained classifier can be used to call predict and generate predictions for our testing data.
  • Comparing the actual ground truth labels and the predicted labels can be done by means of a confusion matrix (shown directly after the code segment). We can create a confusion matrix for each label and then plot it with Matplotlib.

That’s it – we have now created a multilabel Support Vector Machine! Now, ensure that scikit-learn, numpy and matplotlib are installed onto your system / into your environment, and run the code.

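The full listing is also missing here; what follows is a reconstruction based on the step-by-step description above. The helper used for plotting the confusion matrices and the exact parameter values are my assumptions rather than the author’s original code:

# Reconstructed sketch (assumed details): a multilabel SVM via MultiOutputClassifier.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Configuration (assumed values)
num_samples = 10000
cluster_centers = [(5, 5), (3, 3)]
num_classes = len(cluster_centers)

# First label (type): linearly separable blobs; second label (color): random
X, y_type = make_blobs(n_samples=num_samples, centers=cluster_centers,
                       n_features=2, cluster_std=0.30, random_state=42)
y_color = np.random.randint(num_classes, size=num_samples)

# Combine both labels into one (n_samples, 2) target array and split the data
y = np.vstack((y_type, y_color)).T
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Wrap a linear SVM so that one classifier is fitted per label
multilabel_svm = MultiOutputClassifier(LinearSVC(random_state=42), n_jobs=-1)
multilabel_svm.fit(X_train, y_train)
y_pred = multilabel_svm.predict(X_test)

# One confusion matrix per label
for label_index, label_name in enumerate(['type', 'color']):
    cm = confusion_matrix(y_test[:, label_index], y_pred[:, label_index])
    ConfusionMatrixDisplay(cm).plot()
    plt.title('Confusion matrix for the {} label'.format(label_name))
    plt.show()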

You’ll then get two popups with confusion matrices:

We can clearly see that our initial estimations with regards to the dataset were true. For the linearly separable label (i.e. the type label), our confusion matrix illustrates perfect behavior – with no wrong predictions. For the color label (which was generated randomly) we see worse performance: this label is predicted right in only 50% of the cases. Now, this is of course due to the fact that this label was generated randomly. If, say, we added colors based on the class, we would also see good performance here.

A next step a ML engineer would undertake now is finding out how to make the data for the second label linearly separable by means of a kernel function. That’s however outside the scope of this article. We did manage to create a multilabel SVM though! 🙂


Summary

In this article, we looked at creating a multilabel Support Vector Machine with Scikit-learn. Firstly, we looked at what multilabel classification is and how it is different than multiclass and binary classification. More specifically, a multilabel classifier assigns multiple labels to an input sample, e.g. the labels color and type if we are looking at an assembly line scenario. This is contrary to the multiclass and binary classifiers which assign just one class to an input sample.

Then, we looked at how Support Vector Machines work in particular and why their internals are at odds with how multilabel classification works. Fortunately, people have sought to fix this, and we thus continued with making it work. More specifically, we used Scikit-learn’s MultiOutputClassifier for wrapping the SVM into a situation where multiple classifiers are generated that together predict the labels. By means of a confusion matrix, we then inspected the performance of our model, and provided insight in what to do when a confusion matrix does not show adequate performance.

I hope that you have learned something from this article! If you did, I would be happy to hear from you, so please feel free to leave a comment in the comments section below 💬 If you have other remarks or suggestions, please leave a message as well. I’d love to hear from you! Anyway, thank you for reading MachineCurve today and happy engineering! 😎


References

Wikipedia. (2005, February 21). Equidistant. Wikipedia, the free encyclopedia. Retrieved November 11, 2020, from https://en.wikipedia.org/wiki/Equidistant

Scikit-learn. (n.d.). 1.12. Multiclass and multilabel algorithms — scikit-learn 0.23.2 documentation. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 12, 2020, from https://scikit-learn.org/stable/modules/multiclass.html#multioutput-classification

Scikit-learn. (n.d.). Sklearn.multioutput.MultiOutputClassifier — scikit-learn 0.23.2 documentation. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 12, 2020, from https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html

Sours: https://www.machinecurve.com/index.php/2020/11/12/how-to-create-a-multilabel-svm-classifier-with-scikit-learn/