machine-learning
Classificationの紹介：Wekaを使ったいくつかのモデルの生成

algorithm apache-spark C++ Java Language MATLAB Language opencv pandas Python Language R Language tensorflow

前書き

このチュートリアルでは、WekaをJavaコードで使用する方法、データファイルを読み込む方法、分類器を学習する方法、機械学習の背後にある重要な概念について説明します。

Wekaは機械学習のためのツールキットです。これは、機械学習と視覚化技術のライブラリを含み、ユーザーフレンドリーなGUIを備えています。

このチュートリアルには、JAVAで書かれた例が含まれており、GUIで生成されたビジュアルが含まれています。 GUIを使って構造化された実験のためのデータとJAVAコードを調べることをお勧めします。

はじめに：ファイルからデータセットを読み込む

Irisの花のデータセットは、実証目的で広く使用されているデータセットです。我々はそれを読み込み、それを検査し、後の使用のためにそれを少し修正する。

import java.io.File;
import java.net.URL;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.RenameAttribute;
import weka.classifiers.evaluation.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.classifiers.meta.AdaBoostM1;

public class IrisExperiments {
    public static void main(String args[]) throws Exception
     {
        //First we open stream to a data set as provided on http://archive.ics.uci.edu
         CSVLoader loader = new CSVLoader();
         loader.setSource(new URL("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data").openStream());
         Instances data = loader.getDataSet();
         
         //This file has 149 examples with 5 attributes
         //In order:
         //  sepal length in cm
         //  sepal width in cm
         //  petal length in cm 
         //  petal width in cm
         //  class ( Iris Setosa , Iris Versicolour, Iris Virginica)

         //Let's briefly inspect the data
         System.out.println("This file has " + data.numInstances()+" examples.");
         System.out.println("The first example looks like this: ");
         for(int i = 0; i < data.instance(0).numAttributes();i++ ){
             System.out.println(data.instance(0).attribute(i));
         } 
         
         // NOTE that the last attribute is Nominal
         // It is convention to have a nominal variable at the last index as target variable
         
         // Let's tidy up the data a little bit
         // Nothing too serious just to show how we can manipulate the data with filters
         RenameAttribute renamer = new RenameAttribute();
         renamer.setOptions(weka.core.Utils.splitOptions("-R last -replace Iris-type"));
         renamer.setInputFormat(data);                          
         data = Filter.useFilter(data, renamer); 
         
         System.out.println("We changed the name of the target class.");
         System.out.println("And now it looks like this:");
         System.out.println(data.instance(0).attribute(4));
         
         //Now we do this for all the attributes
         renamer.setOptions(weka.core.Utils.splitOptions("-R 1 -replace sepal-length"));
         renamer.setInputFormat(data);                          
         data = Filter.useFilter(data, renamer); 
         
         renamer.setOptions(weka.core.Utils.splitOptions("-R 2 -replace sepal-width"));
         renamer.setInputFormat(data);                          
         data = Filter.useFilter(data, renamer); 
         
         renamer.setOptions(weka.core.Utils.splitOptions("-R 3 -replace petal-length"));
         renamer.setInputFormat(data);                          
         data = Filter.useFilter(data, renamer); 
         
         renamer.setOptions(weka.core.Utils.splitOptions("-R 4 -replace petal-width"));
         renamer.setInputFormat(data);                          
         data = Filter.useFilter(data, renamer); 
         
         //Lastly we save our newly created file to disk
         ArffSaver saver = new ArffSaver();
         saver.setInstances(data);
         saver.setFile(new File("IrisSet.arff"));
         saver.writeBatch();
  }
}

最初のクラシファイアをトレーニングする：ZeroRでベースラインを設定する

ZeroRは単純な分類子です。インスタンスごとに動作するのではなく、クラスの一般的な配布で動作します。これは、事前確率が最大のクラスを選択する。それは候補者の情報を使用しないという意味では良い分類子ではありませんが、しばしばベースラインとして使用されます。 注：その他のベースラインは、次のように使用できます。業界標準の分類器または手作りのルール

 // First we tell our data that it's class is hidden in the last attribute
 data.setClassIndex(data.numAttributes() -1);
 // Then we split the data in to two sets
 // randomize first because we don't want unequal distributions
 data.randomize(new java.util.Random(0));
 Instances testset = new Instances(data, 0, 50);
 Instances trainset = new Instances(data, 50, 99);
 
 // Now we build a classifier
 // Train it with the trainset
 ZeroR classifier1 = new ZeroR();
 classifier1.buildClassifier(trainset);
 // Next we test it against the testset
 Evaluation Test = new Evaluation(trainset);
 Test.evaluateModel(classifier1, testset);
 System.out.println(Test.toSummaryString());

セットの中で最大のクラスは、34％の正確なレートを提供します。（149のうち50）

注：ZeroRは約30％を実行します。これは、列車とテストセットにランダムに分割したためです。列車セットの中で最大のセットは、テストセットの中で最も小さくなるでしょう。良いテスト/電車のセットを作ることはあなたの価値がある

データの感触を得る。 Naive BayesとkNNのトレーニング

良い分類子を構築するために、我々はしばしば、データが特徴空間内でどのように構造化されているかを知る必要がある。 Wekaは、役立つ可視化モジュールを提供しています。

いくつかのディメンションは既にクラスをかなりよく分離しています。花びら幅は、例えば、花びら幅と比較して、非常にきちんと概念を発注します。

単純な分類子を訓練することで、データの構造についてもかなりの部分を明らかにすることができます。私は通常、その目的のためにNearest NeighborとNaive Bayesを使用するのが好きです。 Naive Bayesは独立性を前提としています。 k-Nearest-Neighborは、特徴空間内のk個の最も近い（既知の）インスタンスのクラスを割り当てることによって機能する。地理的な地理的依存関係を調べるためによく使われます。私たちはそれを使って、私たちの概念が地物空間でローカルに定義されているかどうかを調べます。

 //Now we build a Naive Bayes classifier
 NaiveBayes classifier2 = new NaiveBayes();
 classifier2.buildClassifier(trainset);
 // Next we test it against the testset
 Test = new Evaluation(trainset);
 Test.evaluateModel(classifier2, testset);
 System.out.println(Test.toSummaryString());
 
//Now we build a kNN classifier
IBk classifier3 = new IBk();
// We tell the classifier to use the first nearest neighbor as example 
classifier3.setOptions(weka.core.Utils.splitOptions("-K 1"));
classifier3.buildClassifier(trainset);
// Next we test it against the testset
Test = new Evaluation(trainset);
Test.evaluateModel(classifier3, testset);
System.out.println(Test.toSummaryString());

Naive Bayesは、新しく確立されたベースラインよりもはるかに優れたパフォーマンスを示し、独立した機能が情報を保持していることを示します（ペタル幅を覚えていますか？）。

1NNもうまくいく（この場合は実際には少し良い）、私たちの情報の一部はローカルであることを示しています。より良い性能は、いくつかの二次効果が情報も保持することを示すことができる（クラスzよりxおよびyの場合） 。

それをまとめる：木を訓練する

ツリーは、独立したフィーチャと2次効果で動作するモデルを構築できます。だから彼らはこのドメインの良い候補者かもしれません。ツリーは連鎖しているルールで、ルールの下にあるルールに到達し、ルールの下のルールに渡すインスタンスを分割します。

ツリー学習者はルールを生成し、それらを連鎖させ、ルールがあまりにも具体的であると感じるときにツリーを構築するのをやめ、オーバーフィットを回避します。オーバーフィットとは、私たちが探しているコンセプトにとって複雑すぎるモデルを構築することを意味します。余裕をもたせたモデルは電車のデータではうまくいくが、新しいデータではうまく機能しない

私たちはJavaの実装であるJ48を、一般的なアルゴリズムであるC4.5を使用しています。

//We train a tree using J48
//J48 is a JAVA implementation of the C4.5 algorithm
J48 classifier4 = new J48();
//We set it's confidence level to 0.1
//The confidence level tell J48 how specific a rule can be before it gets pruned
classifier4.setOptions(weka.core.Utils.splitOptions("-C 0.1"));
classifier4.buildClassifier(trainset);
// Next we test it against the testset
Test = new Evaluation(trainset);
Test.evaluateModel(classifier4, testset);
System.out.println(Test.toSummaryString());

System.out.print(classifier4.toString());

//We set it's confidence level to 0.5
//Allowing the tree to maintain more complex rules
classifier4.setOptions(weka.core.Utils.splitOptions("-C 0.5"));
classifier4.buildClassifier(trainset);
// Next we test it against the testset
Test = new Evaluation(trainset);
Test.evaluateModel(classifier4, testset);
System.out.println(Test.toSummaryString());
     
System.out.print(classifier4.toString());

最も高い信頼度で訓練されたツリー学習者は、最も具体的なルールを生成し、テストセットで最高のパフォーマンスを示します。明らかにその特異性が保証されています。

注：両方の学習者は、花びらの幅に関するルールから始めます。ビジュアライゼーションでこの次元に気付いたことを覚えていますか？

Modified text is an extract of the original Stack Overflow Documentation

ライセンスを受けた CC BY-SA 3.0

所属していない Stack Overflow