weka.filters.supervised.instance
Class DistributionBasedBalance

java.lang.Object
  extended by weka.filters.Filter
      extended by weka.filters.supervised.instance.DistributionBasedBalance
All Implemented Interfaces:
java.io.Serializable, weka.core.CapabilitiesHandler, weka.core.OptionHandler, weka.core.RevisionHandler, weka.core.TechnicalInformationHandler, weka.filters.SupervisedFilter

public class DistributionBasedBalance
extends weka.filters.Filter
implements weka.filters.SupervisedFilter, weka.core.OptionHandler, weka.core.TechnicalInformationHandler

Re-samples a dataset for each class label selected with option -L. Instances are re-sampled from a selected distribution, which is learned for each (attribute, class label) pair. For more information, see:

Pablo Bermejo et al. Improving the performance of Naive Bayes Multinomial in e-mail foldering by introducing distribution-based balance of datasets. Expert Systems With Applications. Volume 38, Issue 3, pages 2072-2080. March 2011.

BibTeX:

 @article{BermejoDBB,
    author = {Pablo Bermejo and Jose A. Gamez and Jose M. Puerta},
    journal = {Expert Systems With Applications},
    pages = {2072--2080},
    title = {Improving the performance of Naive Bayes Multinomial in e-mail foldering by introducing distribution-based balance of datasets},
    volume = {38-3},
    year = {2011}
 }
 

Valid options are:

 -D 
 Specifies the distribution to learn from the training set.
 (default 0: GAUSSIAN_BALANCE)
 
 -L <col1,col2-col4,...>
 Specifies the indexes of the class labels to balance. 'first', 'last' and 'all' are allowed.
 (default all)
 
 -I (true|false)
  Specifies whether the selection of class label indexes is inverted.
  (default false)
 
 -N (true|false)
  Specifies whether sampled attribute values are allowed to be negative.
  (default false)
 
 -P 
  Specifies the number of instances to sample per class label.
  (default 30)
 
 -S 
   Seed for generating values from the learned distributions.
 
 -X 
   Specifies whether fast approximate sampling of Poisson values is allowed.
  (default true)
 

Version:
$Revision: 1.0 $
Author:
Pablo Bermejo (Pablo.Bermejo@uclm.es)
See Also:
Serialized Form

Field Summary
static int GAUSSIAN_BALANCE
          learn and sample Gaussian Distribution
private  boolean m_allowNegativeValues
          Indicates whether negative values may be sampled. If false, sampled values have a minimum of 0.
private  boolean m_allowPoissonApproximation
          Indicates whether sampling from a Poisson distribution may use an approximate method, which reduces sampling time and yields similar sampled values.
private  int m_balanceType
          distribution to learn in order to re-sample new instances
private  weka.core.Range m_labelsRange
          range of selected label indexes to balance
private  int m_P
          number P of instances to re-sample per class label
private  double m_samplingTime_ms
          stores the time spent (milliseconds) in sampling the new instances
private  int m_seed
          seed used for random number generation
private  double m_statisticsTime_ms
          stores the time spent (milliseconds) in learning the distribution
private  int[] m_totalInstancesPerClass
          used when MULTINOMIAL_BALANCE is selected
static int MULTINOMIAL_BALANCE
          learn and sample Multinomial Distribution
static int POISSON_BALANCE
          learn and sample Poisson Distribution
(package private) static long serialVersionUID
           
static weka.core.Tag[] TAGS_BALANCE
           
static int UNIFORM_BALANCE
          learn and sample Uniform Distribution
 
Fields inherited from class weka.filters.Filter
m_FirstBatchDone, m_InputRelAtts, m_InputStringAtts, m_NewBatch, m_OutputRelAtts, m_OutputStringAtts
 
Constructor Summary
DistributionBasedBalance()
          The constructor.
 
Method Summary
 java.lang.String allowNegativeValuesTipText()
          Returns the tip text for this property.
 java.lang.String allowPoissonApproximationTipText()
          Returns the tip text for this property.
 java.lang.String balanceTypeTipText()
          Returns the tip text for this property.
 boolean batchFinished()
          Signify that this batch of input to the filter is finished.
private  float[][][] computeStats()
          If m_balanceType==UNIFORM_BALANCE, computes the max (values[0]) and min (values[1]) values for each selected label and all attributes. If m_balanceType==GAUSSIAN_BALANCE, computes the mean (values[0]) and variance (values[1]) for each selected label and all attributes. If m_balanceType==POISSON_BALANCE, computes the mean (values[0]) for each selected label and all attributes.
private  void dbBalance()
          Calls the method that performs the distribution-based balance of the training data for the selected distribution.
private  int drawAttribute(float[] probs, java.util.Random r)
          Searches for the first index i such that probs[i] >= r.nextFloat(). A binary search is used because this is the bottleneck in multinomial resampling.
 boolean getAllowNegativeValues()
           
 boolean getAllowPoissonApproximation()
           
 weka.core.SelectedTag getBalanceType()
           
 weka.core.Capabilities getCapabilities()
           
 double getFilteringTime_ms()
           
 boolean getInvertSelection()
           
 java.lang.String getLabelsRange()
           
 java.lang.String[] getOptions()
           
 int getP()
           
 java.lang.String getRevision()
          Returns the revision string.
 double getSamplingTime_ms()
           
 int getSeed()
           
 int[] getSelectedClassLabels()
           
 double getStatisticsTime_ms()
           
 weka.core.TechnicalInformation getTechnicalInformation()
          Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
 java.lang.String globalInfo()
           
 boolean input(weka.core.Instance instance)
          Input an instance for filtering.
 java.lang.String invertSelectionTipText()
          Returns the tip text for this property.
 java.lang.String labelsRangeTipText()
          Returns the tip text for this property.
 java.util.Enumeration<weka.core.Option> listOptions()
          Returns an enumeration describing the available options.
protected  double lnFactorial(int n)
          Fast computation of ln(n!)
static void main(java.lang.String[] args)
          Main method for running this filter.
private  int nextPoisson(float lambda, java.util.Random random)
           
private  int nextPoissonApproximated(float lambda, java.util.Random random)
          Approximation of the Poisson distribution that makes sampling faster. See 'method PA' in The Computer Generation of Poisson Random Variables by A. C. Atkinson.
 java.lang.String pTipText()
          Returns the tip text for this property.
private  void resampleFromDistribution(float[][][] values)
          pushes new instances on the filter output stack sampled from the selected distribution
private  void resampleFromMultinomialDistribution(float[][][] values)
          Pushes new instances sampled from a multinomial distribution onto the filter output stack. Appropriate for textual databases.
 void resetOptions()
          Resets all options to the state they had when this filter was created.
 java.lang.String seedTipText()
          Returns the tip text for this property.
 void setAllowNegativeValues(boolean m_allowNegativeValues)
          set if it is desired to allow negative values when sampling from a Gaussian or Uniform distribution.
 void setAllowPoissonApproximation(boolean m_allowPoissonApproximation)
          Specify if fast approximate poisson sampling is to be done.
 void setBalanceType(weka.core.SelectedTag newType)
           
 boolean setInputFormat(weka.core.Instances instanceInfo)
          Sets the format of the input instances.
 void setInvertSelection(boolean m_invertSelection)
          Specifies if class labels are selected in an inverted manner.
 void setLabelsRange(java.lang.String rangeList)
          Sets which class labels are to be balanced. 'first', 'last' and 'all' are allowed.
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setP(int m_P)
          Sets the number of instances to sample for each selected class label.
 void setSeed(int seed)
          Sets the seed to use for random number generators
 
Methods inherited from class weka.filters.Filter
batchFilterFile, bufferInput, copyValues, copyValues, filterFile, flushInput, getCapabilities, getInputFormat, getOutputFormat, initInputLocators, initOutputLocators, inputFormatPeek, isFirstBatchDone, isNewBatch, isOutputFormatDefined, makeCopies, makeCopy, numPendingOutput, output, outputFormatPeek, outputPeek, push, resetQueue, runFilter, setOutputFormat, testInputFormat, toString, useFilter, wekaStaticWrapper
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

m_balanceType

private int m_balanceType
distribution to learn in order to re-sample new instances


m_P

private int m_P
number P of instances to re-sample per class label


m_allowNegativeValues

private boolean m_allowNegativeValues
Indicates whether negative values may be sampled. If false, sampled values have a minimum of 0.


m_seed

private int m_seed
seed used for random number generation


m_allowPoissonApproximation

private boolean m_allowPoissonApproximation
Indicates whether sampling from a Poisson distribution may use an approximate method, which reduces sampling time and yields similar sampled values.


m_labelsRange

private weka.core.Range m_labelsRange
range of selected label indexes to balance


m_statisticsTime_ms

private double m_statisticsTime_ms
stores the time spent (milliseconds) in learning the distribution


m_samplingTime_ms

private double m_samplingTime_ms
stores the time spent (milliseconds) in sampling the new instances


m_totalInstancesPerClass

private int[] m_totalInstancesPerClass
used when MULTINOMIAL_BALANCE is selected


UNIFORM_BALANCE

public static final int UNIFORM_BALANCE
learn and sample Uniform Distribution

See Also:
Constant Field Values

GAUSSIAN_BALANCE

public static final int GAUSSIAN_BALANCE
learn and sample Gaussian Distribution

See Also:
Constant Field Values

POISSON_BALANCE

public static final int POISSON_BALANCE
learn and sample Poisson Distribution

See Also:
Constant Field Values

MULTINOMIAL_BALANCE

public static final int MULTINOMIAL_BALANCE
learn and sample Multinomial Distribution

See Also:
Constant Field Values

TAGS_BALANCE

public static final weka.core.Tag[] TAGS_BALANCE

serialVersionUID

static final long serialVersionUID
See Also:
Constant Field Values
Constructor Detail

DistributionBasedBalance

public DistributionBasedBalance()
The constructor.

Method Detail

globalInfo

public java.lang.String globalInfo()
Returns:
a description of the filter suitable for displaying in the explorer/experimenter gui

getCapabilities

public weka.core.Capabilities getCapabilities()
Specified by:
getCapabilities in interface weka.core.CapabilitiesHandler
Overrides:
getCapabilities in class weka.filters.Filter
Returns:
Capabilities of this filter

getTechnicalInformation

public weka.core.TechnicalInformation getTechnicalInformation()
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.

Specified by:
getTechnicalInformation in interface weka.core.TechnicalInformationHandler
Returns:
the technical information about this class

resampleFromDistribution

private void resampleFromDistribution(float[][][] values)
                               throws java.lang.Exception
pushes new instances on the filter output stack sampled from the selected distribution

Parameters:
values - results from calling computeStats(). its meaning depends on m_balanceType
Throws:
java.lang.Exception - if something goes wrong
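
As an illustration of the sampling step, the sketch below draws a single attribute value from learned Gaussian or Uniform statistics and applies the zero floor described for m_allowNegativeValues. It is not the filter's actual code: the class and method names are hypothetical, and double is used here instead of the float values returned by computeStats().

```java
import java.util.Random;

// Hypothetical helper for illustration; not part of DistributionBasedBalance.
final class ValueSamplingSketch {

    /** Draws from N(mean, variance); negative draws become 0 unless negatives are allowed. */
    static double sampleGaussian(double mean, double variance, boolean allowNegative, Random r) {
        double value = mean + Math.sqrt(variance) * r.nextGaussian();
        return (!allowNegative && value < 0.0) ? 0.0 : value;
    }

    /** Draws uniformly between min and max; negative draws become 0 unless negatives are allowed. */
    static double sampleUniform(double min, double max, boolean allowNegative, Random r) {
        double value = min + (max - min) * r.nextDouble();
        return (!allowNegative && value < 0.0) ? 0.0 : value;
    }

    public static void main(String[] args) {
        Random r = new Random(42); // fixed seed, playing the role of option -S
        System.out.println(sampleGaussian(1.0, 4.0, false, r));
        System.out.println(sampleUniform(-2.0, 3.0, false, r));
    }
}
```

Passing the same seed reproduces the same sequence of sampled values, which is presumably the purpose of the seed option.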

resampleFromMultinomialDistribution

private void resampleFromMultinomialDistribution(float[][][] values)
                                          throws java.lang.Exception
Pushes new instances sampled from a multinomial distribution onto the filter output stack. Appropriate for textual databases.

Parameters:
values - results from calling computeStats(): values[0] is probOfAttGivenClass[][], and values[1] is counts[][1] of attribute values along each class
Throws:
java.lang.Exception - if something goes wrong

drawAttribute

private int drawAttribute(float[] probs,
                          java.util.Random r)
Searches for the first index i such that probs[i] >= r.nextFloat(). A binary search is used because this is the bottleneck in multinomial resampling.

Parameters:
probs - array of ordered values from 0 to 1
r - Random object
Returns:
the minimum index i in probs[] such that probs[i] >= r.nextFloat()
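
The search described above can be sketched as a lower-bound binary search over an array of cumulative probabilities. This is a minimal illustration, not the filter's private implementation: the class name and the firstIndexAtLeast helper are hypothetical, and the array is assumed to be non-empty, sorted in ascending order, and to end at 1.

```java
import java.util.Random;

// Hypothetical helper for illustration; not part of DistributionBasedBalance.
final class CumulativeDrawSketch {

    /** Returns the smallest index i with cumulative[i] >= u, or the last index if none qualifies. */
    static int firstIndexAtLeast(float[] cumulative, float u) {
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] >= u) {
                hi = mid;      // mid qualifies; an earlier index may qualify too
            } else {
                lo = mid + 1;  // mid is too small; the answer lies to the right
            }
        }
        return lo;
    }

    /** Draws one index with probability proportional to the step at that index. */
    static int drawAttribute(float[] cumulative, Random r) {
        return firstIndexAtLeast(cumulative, r.nextFloat());
    }

    public static void main(String[] args) {
        float[] cumulative = {0.1f, 0.4f, 0.4f, 1.0f};
        System.out.println(firstIndexAtLeast(cumulative, 0.25f)); // prints 1
        System.out.println(drawAttribute(cumulative, new Random(1)));
    }
}
```

Each draw costs O(log n) comparisons instead of the O(n) of a linear scan, which matters when many attribute occurrences are drawn per re-sampled instance.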

nextPoisson

private int nextPoisson(float lambda,
                        java.util.Random random)
Parameters:
lambda - mean of distribution
random - Random object
Returns:
an int drawn from a Poisson distribution with mean lambda
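
This page does not say which exact algorithm nextPoisson uses. As one common possibility, the sketch below shows Knuth's multiplication method for exact Poisson sampling, which is practical for small means. The class name is hypothetical, and double is used for lambda where the filter's signature takes a float.

```java
import java.util.Random;

// Hypothetical helper for illustration; the filter's own algorithm may differ.
final class PoissonKnuthSketch {

    /** Exact Poisson draw: multiply uniforms until the product falls to exp(-lambda) or below. */
    static int nextPoisson(double lambda, Random random) {
        double limit = Math.exp(-lambda);
        double product = 1.0;
        int k = 0;
        do {
            k++;
            product *= random.nextDouble();
        } while (product > limit);
        return k - 1;
    }

    public static void main(String[] args) {
        Random random = new Random(7);
        for (int i = 0; i < 5; i++) {
            System.out.println(nextPoisson(4.0, random));
        }
    }
}
```

The expected number of uniform draws per sample grows linearly with lambda, and exp(-lambda) underflows to 0 for very large means, which is why a faster approximate sampler is attractive.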

nextPoissonApproximated

private int nextPoissonApproximated(float lambda,
                                    java.util.Random random)
Approximation of the Poisson distribution that makes sampling faster. See 'method PA' in The Computer Generation of Poisson Random Variables by A. C. Atkinson, Journal of the Royal Statistical Society Series C (Applied Statistics), Vol. 28, No. 1 (1979), pages 29-35.

Parameters:
lambda - mean of distribution
random - Random object
Returns:
an int drawn from a Poisson distribution with mean lambda
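
For orientation, the following is a sketch of Atkinson's method PA as it is commonly presented: rejection sampling with a logistic envelope. It is an illustration under assumptions, not a copy of the filter's private method. The class name is hypothetical, ln(n!) is computed by plain summation rather than with a cache, and the sketch rejects means too small for the envelope constant to be positive.

```java
import java.util.Random;

// Hypothetical helper for illustration; not the filter's own implementation.
final class PoissonPaSketch {

    /** ln(n!) by direct summation; adequate for a sketch. */
    static double lnFactorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    /** Draws a Poisson-distributed int by rejection from a logistic envelope. */
    static int nextPoissonApproximated(double lambda, Random random) {
        double c = 0.767 - 3.36 / lambda;
        if (c <= 0.0) {
            throw new IllegalArgumentException("lambda too small for this method: " + lambda);
        }
        double beta = Math.PI / Math.sqrt(3.0 * lambda);
        double alpha = beta * lambda;
        double k = Math.log(c) - lambda - Math.log(beta);
        while (true) {
            double u = random.nextDouble();
            double x = (alpha - Math.log((1.0 - u) / u)) / beta; // logistic candidate
            double rounded = Math.floor(x + 0.5);
            if (rounded < 0.0) {
                continue; // candidates below zero are never valid counts
            }
            int n = (int) rounded;
            double v = random.nextDouble();
            double y = alpha - beta * x;
            double onePlusExp = 1.0 + Math.exp(y);
            double lhs = y + Math.log(v / (onePlusExp * onePlusExp));
            double rhs = k + n * Math.log(lambda) - lnFactorial(n);
            if (lhs <= rhs) {
                return n; // accepted
            }
        }
    }

    public static void main(String[] args) {
        Random random = new Random(7);
        for (int i = 0; i < 5; i++) {
            System.out.println(nextPoissonApproximated(50.0, random));
        }
    }
}
```

Unlike the multiplication method, the cost per accepted sample does not grow with lambda, apart from the ln(n!) term, which a cached lookup can make cheap.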

lnFactorial

protected double lnFactorial(int n)
Fast computation of ln(n!) for non-negative ints. Negative ints are passed on to the general gamma-function-based version in weka.core.SpecialFunctions. If the current n value is higher than any previous one, the cache is extended and filled to cover it. The common case is reduced to a simple array lookup.

Parameters:
n - the integer
Returns:
ln(n!)
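
The caching strategy described for lnFactorial can be sketched as follows. This is a minimal illustration with a hypothetical class name. One deliberate difference is labeled in the code: to stay free of Weka dependencies, the sketch rejects negative arguments instead of delegating them to weka.core.SpecialFunctions.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of DistributionBasedBalance.
final class LnFactorialCacheSketch {

    // cache[n] holds ln(n!); it starts with ln(0!) = 0.
    private double[] cache = {0.0};

    double lnFactorial(int n) {
        if (n < 0) {
            // Assumption of this sketch: the documented method delegates this case instead.
            throw new IllegalArgumentException("n must be non-negative: " + n);
        }
        if (n >= cache.length) {
            // Extend the cache so that it covers every value up to n.
            double[] grown = Arrays.copyOf(cache, n + 1);
            for (int i = cache.length; i <= n; i++) {
                grown[i] = grown[i - 1] + Math.log(i);
            }
            cache = grown;
        }
        return cache[n]; // common case: a plain array lookup
    }

    public static void main(String[] args) {
        LnFactorialCacheSketch f = new LnFactorialCacheSketch();
        System.out.println(f.lnFactorial(5)); // approximately ln(120)
        System.out.println(f.lnFactorial(3)); // served from the cache
    }
}
```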

computeStats

private float[][][] computeStats()
                          throws java.lang.Exception
If m_balanceType==UNIFORM_BALANCE, computes the max (values[0]) and min (values[1]) values for each selected label and all attributes. If m_balanceType==GAUSSIAN_BALANCE, computes the mean (values[0]) and variance (values[1]) for each selected label and all attributes. If m_balanceType==POISSON_BALANCE, computes the mean (values[0]) for each selected label and all attributes.

Returns:
float[][][] with stats for each selected label and all attributes
Throws:
java.lang.Exception
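
To make the Gaussian case concrete, the sketch below computes a per-class, per-attribute mean and variance from a plain array of rows. It is an illustration under assumptions: the class name, the [statistic][class][attribute] array layout, the use of double, and the choice of population variance (dividing by n) are all choices of this sketch and are not confirmed by this page.

```java
// Hypothetical helper for illustration; not part of DistributionBasedBalance.
final class ClassConditionalStatsSketch {

    /**
     * Returns stats[0][c][a] = mean and stats[1][c][a] = variance of attribute a
     * over the rows labelled c. Classes without rows keep 0 for both statistics.
     */
    static double[][][] meanAndVariance(double[][] rows, int[] labels, int numClasses) {
        int numAtts = rows.length == 0 ? 0 : rows[0].length;
        double[][][] stats = new double[2][numClasses][numAtts];
        int[] counts = new int[numClasses];

        // First pass: per-class sums, then means.
        for (int i = 0; i < rows.length; i++) {
            counts[labels[i]]++;
            for (int a = 0; a < numAtts; a++) {
                stats[0][labels[i]][a] += rows[i][a];
            }
        }
        for (int c = 0; c < numClasses; c++) {
            for (int a = 0; a < numAtts && counts[c] > 0; a++) {
                stats[0][c][a] /= counts[c];
            }
        }

        // Second pass: per-class sums of squared deviations, then variances.
        for (int i = 0; i < rows.length; i++) {
            for (int a = 0; a < numAtts; a++) {
                double d = rows[i][a] - stats[0][labels[i]][a];
                stats[1][labels[i]][a] += d * d;
            }
        }
        for (int c = 0; c < numClasses; c++) {
            for (int a = 0; a < numAtts && counts[c] > 0; a++) {
                stats[1][c][a] /= counts[c]; // population variance (an assumption)
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 10}, {3, 20}, {5, 60}};
        int[] labels = {0, 0, 1};
        double[][][] stats = meanAndVariance(rows, labels, 2);
        System.out.println(stats[0][0][0] + " " + stats[1][0][0]); // prints 2.0 1.0
    }
}
```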

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Should be called after setInputFormat().

Valid options are:

 -D 
 Specifies the distribution to learn from the training set.
 (default 0: GAUSSIAN_BALANCE)
 
 -L <col1,col2-col4,...>
 Specifies the indexes of the class labels to balance. 'first', 'last' and 'all' are allowed.
 (default all)
 
 -I (true|false)
  Specifies whether the selection of class label indexes is inverted.
  (default false)
 
 -N (true|false)
  Specifies whether sampled attribute values are allowed to be negative.
  (default false)
 
 -P 
  Specifies the number of instances to sample per class label.
  (default 30)
 
 -S 
   Seed for generating values from the learned distributions.
 
 -X 
   Specifies whether fast approximate sampling of Poisson values is allowed.
  (default true)
 

Specified by:
setOptions in interface weka.core.OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getInvertSelection

public boolean getInvertSelection()
Returns:
true if indexes of class labels are selected in an inverted manner; otherwise false.

setInvertSelection

public void setInvertSelection(boolean m_invertSelection)
                        throws java.lang.Exception
Specifies if class labels are selected in an inverted manner.

Parameters:
m_invertSelection -
Throws:
java.lang.Exception

getP

public int getP()
Returns:
the number of instances to sample for each selected class label

setP

public void setP(int m_P)
Sets the number of instances to sample for each selected class label.

Parameters:
m_P -

getSeed

public int getSeed()
Returns:
int seed for random number generators

setSeed

public void setSeed(int seed)
Sets the seed to use for random number generators

Parameters:
seed -

getAllowNegativeValues

public boolean getAllowNegativeValues()
Returns:
true if negative values are allowed when sampling from a Gaussian or Uniform distribution; otherwise false.

setAllowNegativeValues

public void setAllowNegativeValues(boolean m_allowNegativeValues)
Sets whether negative values are allowed when sampling from a Gaussian or Uniform distribution.

Parameters:
m_allowNegativeValues -

balanceTypeTipText

public java.lang.String balanceTypeTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

allowNegativeValuesTipText

public java.lang.String allowNegativeValuesTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

seedTipText

public java.lang.String seedTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

allowPoissonApproximationTipText

public java.lang.String allowPoissonApproximationTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

pTipText

public java.lang.String pTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

labelsRangeTipText

public java.lang.String labelsRangeTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

invertSelectionTipText

public java.lang.String invertSelectionTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

resetOptions

public void resetOptions()
Resets all options to the state they had when this filter was created.


getOptions

public java.lang.String[] getOptions()
Specified by:
getOptions in interface weka.core.OptionHandler
Returns:
String[] describing the options

setInputFormat

public boolean setInputFormat(weka.core.Instances instanceInfo)
                       throws java.lang.Exception
Sets the format of the input instances.

Overrides:
setInputFormat in class weka.filters.Filter
Parameters:
instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
Returns:
true if the outputFormat may be collected immediately
Throws:
java.lang.Exception - if the input format can't be set successfully

input

public boolean input(weka.core.Instance instance)
Input an instance for filtering. The filter requires all training instances to be read before producing output.

Overrides:
input in class weka.filters.Filter
Parameters:
instance - the input instance
Returns:
true if the filtered instance may now be collected with output().
Throws:
java.lang.IllegalStateException - if no input structure has been defined

batchFinished

public boolean batchFinished()
                      throws java.lang.Exception
Signify that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.

Overrides:
batchFinished in class weka.filters.Filter
Returns:
true if there are instances pending output
Throws:
java.lang.IllegalStateException - if no input structure has been defined
java.lang.Exception - if provided options cannot be executed on input instances

getRevision

public java.lang.String getRevision()
Returns the revision string.

Specified by:
getRevision in interface weka.core.RevisionHandler
Overrides:
getRevision in class weka.filters.Filter
Returns:
the revision

listOptions

public java.util.Enumeration<weka.core.Option> listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface weka.core.OptionHandler
Returns:
an enumeration of all the available options.

getAllowPoissonApproximation

public boolean getAllowPoissonApproximation()
Returns:
true if fast approximate Poisson sampling is to be done

setAllowPoissonApproximation

public void setAllowPoissonApproximation(boolean m_allowPoissonApproximation)
Specifies whether fast approximate Poisson sampling is to be done. Sampled values are similar to those obtained by exact Poisson sampling. This setting takes effect when m_balanceType is POISSON_BALANCE or MULTINOMIAL_BALANCE.

Parameters:
m_allowPoissonApproximation - true or false

getFilteringTime_ms

public double getFilteringTime_ms()
Returns:
double indicating the time spent (milliseconds) in the whole filtering (balancing) process

getSamplingTime_ms

public double getSamplingTime_ms()
Returns:
double indicating the time spent (milliseconds) when sampling new instances from a given distribution

getStatisticsTime_ms

public double getStatisticsTime_ms()
Returns:
double indicating the time spent (milliseconds) when learning the necessary statistics for a given distribution

getBalanceType

public weka.core.SelectedTag getBalanceType()
Returns:
SelectedTag indicating the type of balance set

setBalanceType

public void setBalanceType(weka.core.SelectedTag newType)
                    throws java.lang.Exception
Parameters:
newType - the type of distribution-based balancing desired
Throws:
java.lang.Exception

getSelectedClassLabels

public int[] getSelectedClassLabels()
Returns:
int[] class labels which will be balanced

setLabelsRange

public void setLabelsRange(java.lang.String rangeList)
                    throws java.lang.Exception
Sets which class labels are to be balanced. 'first', 'last' and 'all' are allowed.

Parameters:
rangeList - a string representing the list of label indexes. Labels are indexed from 1 for users, e.g. first-2,4,6-last
Throws:
java.lang.Exception
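
To show how a range string such as first-2,4,6-last maps onto label indexes, the sketch below expands a 1-based range list into sorted 0-based indexes and optionally inverts the selection. It is a simplified stand-in written for illustration, not the weka.core.Range implementation: the class and method names are hypothetical, and only the forms mentioned on this page (numbers, first, last, all, and dash ranges) are handled.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; the filter itself relies on weka.core.Range.
final class LabelRangeSketch {

    /** Expands a 1-based range list into sorted 0-based indexes, optionally inverted. */
    static int[] select(String rangeList, int numLabels, boolean invert) {
        boolean[] chosen = new boolean[numLabels];
        for (String part : rangeList.split(",")) {
            String token = part.trim();
            if (token.equals("all")) {
                Arrays.fill(chosen, true);
                continue;
            }
            int dash = token.indexOf('-');
            int from = toIndex(dash < 0 ? token : token.substring(0, dash), numLabels);
            int to = dash < 0 ? from : toIndex(token.substring(dash + 1), numLabels);
            for (int i = from; i <= to; i++) {
                chosen[i] = true;
            }
        }
        int count = 0;
        for (boolean c : chosen) {
            if (c != invert) count++;
        }
        int[] result = new int[count];
        int next = 0;
        for (int i = 0; i < numLabels; i++) {
            if (chosen[i] != invert) result[next++] = i;
        }
        return result;
    }

    /** Converts "first", "last" or a 1-based number into a 0-based index. */
    private static int toIndex(String text, int numLabels) {
        String s = text.trim();
        if (s.equals("first")) return 0;
        if (s.equals("last")) return numLabels - 1;
        int oneBased = Integer.parseInt(s);
        if (oneBased < 1 || oneBased > numLabels) {
            throw new IllegalArgumentException("label index out of range: " + s);
        }
        return oneBased - 1;
    }

    public static void main(String[] args) {
        // With 7 class labels: prints [0, 1, 3, 5, 6], then [2, 4]
        System.out.println(Arrays.toString(select("first-2,4,6-last", 7, false)));
        System.out.println(Arrays.toString(select("first-2,4,6-last", 7, true)));
    }
}
```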

getLabelsRange

public java.lang.String getLabelsRange()

dbBalance

private void dbBalance()
                throws java.lang.Exception
Calls the method that performs the distribution-based balance of the training data for the selected distribution.

Throws:
java.lang.Exception - if something goes wrong

main

public static void main(java.lang.String[] args)
Main method for running this filter.

Parameters:
args - should contain arguments to the filter: