. For more information,
see
Pablo Bermejo et al. Improving the performance of Naive Bayes
Multinomial in e-mail foldering by introducing distribution-based balance of
datasets. Expert Systems With Applications. Volume 38 Issue 3, pages 2072-2080. March 2011.
BibTeX:
@article{BermejoDBB,
author = {Pablo Bermejo and Jose A. Gamez and Jose M. Puerta},
journal = {Expert Systems With Applications},
pages = {2072--2080},
title = {Improving the performance of Naive Bayes Multinomial in e-mail foldering by introducing distribution-based balance of datasets},
volume = {38},
number = {3},
year = {2011}
}
Valid options are:
-D
Specifies the distribution to learn from the training set.
(default 0: GAUSSIAN_BALANCE)
-L <col1,col2-col4,...>
Specifies the indexes of the class labels to balance; first, last and all are allowed.
(default all)
-I (true|false)
Specifies if the class label indexes are to be inverted.
(default false)
-N (true|false)
Specifies if sampled values for attributes are allowed to be negative
(default false)
-P
Specifies the number of instances to sample per class label.
(default 30)
-S
Seed for values generation from learned distributions.
-X
Specifies if fast approximate sampling of Poisson values is allowed
(default true)
- Version:
- $Revision: 1.0 $
- Author:
- Pablo Bermejo (Pablo.Bermejo@uclm.es)
- See Also:
- Serialized Form
Field Summary |
static int |
GAUSSIAN_BALANCE
learn and sample Gaussian Distribution |
private boolean |
m_allowNegativeValues
indicates if negative values may be sampled. If false, sampled values
have a minimum of 0 |
private boolean |
m_allowPoissonApproximation
indicates if sampling from a Poisson distribution may use an
approximate method, which reduces the sampling time and yields similar sampled
values |
private int |
m_balanceType
distribution to learn in order to re-sample new instances |
private weka.core.Range |
m_labelsRange
range of selected label indexes to balance |
private int |
m_P
number P of instances to re-sample per class label |
private double |
m_samplingTime_ms
stores the time spent (milliseconds) in sampling the new instances |
private int |
m_seed
seed used for random number generation |
private double |
m_statisticsTime_ms
stores the time spent (milliseconds) in learning the distribution |
private int[] |
m_totalInstancesPerClass
used when MULTINOMIAL_BALANCE is selected |
static int |
MULTINOMIAL_BALANCE
learn and sample Multinomial Distribution |
static int |
POISSON_BALANCE
learn and sample Poisson Distribution |
(package private) static long |
serialVersionUID
|
static weka.core.Tag[] |
TAGS_BALANCE
|
static int |
UNIFORM_BALANCE
learn and sample Uniform Distribution |
Fields inherited from class weka.filters.Filter |
m_FirstBatchDone, m_InputRelAtts, m_InputStringAtts, m_NewBatch, m_OutputRelAtts, m_OutputStringAtts |
Method Summary |
java.lang.String |
allowNegativeValuesTipText()
Returns the tip text for this property. |
java.lang.String |
allowPoissonApproximationTipText()
Returns the tip text for this property. |
java.lang.String |
balanceTypeTipText()
Returns the tip text for this property. |
boolean |
batchFinished()
Signify that this batch of input to the filter is finished. |
private float[][][] |
computeStats()
Computes statistics for each selected label and all attributes, depending on
m_balanceType. UNIFORM_BALANCE: the max (values[0]) and min (values[1]).
GAUSSIAN_BALANCE: the mean (values[0]) and variance (values[1]).
POISSON_BALANCE: the mean (values[0]). |
private void |
dbBalance()
method to call the corresponding method to perform a distribution-based
balance of training data |
private int |
drawAttribute(float[] probs,
java.util.Random r)
Searches for the first index i such that probs[i] >= r.nextFloat(). A binary
search is used since this is the bottleneck in multinomial resampling. |
boolean |
getAllowNegativeValues()
|
boolean |
getAllowPoissonApproximation()
|
weka.core.SelectedTag |
getBalanceType()
|
weka.core.Capabilities |
getCapabilities()
|
double |
getFilteringTime_ms()
|
boolean |
getInvertSelection()
|
java.lang.String |
getLabelsRange()
|
java.lang.String[] |
getOptions()
|
int |
getP()
|
java.lang.String |
getRevision()
Returns the revision string. |
double |
getSamplingTime_ms()
|
int |
getSeed()
|
int[] |
getSelectedClassLabels()
|
double |
getStatisticsTime_ms()
|
weka.core.TechnicalInformation |
getTechnicalInformation()
Returns an instance of a TechnicalInformation object, containing detailed
information about the technical background of this class, e.g., paper
reference or book this class is based on. |
java.lang.String |
globalInfo()
|
boolean |
input(weka.core.Instance instance)
Input an instance for filtering. |
java.lang.String |
invertSelectionTipText()
Returns the tip text for this property. |
java.lang.String |
labelsRangeTipText()
Returns the tip text for this property. |
java.util.Enumeration<weka.core.Option> |
listOptions()
Returns an enumeration describing the available options. |
protected double |
lnFactorial(int n)
Fast computation of ln(n!) |
static void |
main(java.lang.String[] args)
Main method for running this filter. |
private int |
nextPoisson(float lambda,
java.util.Random random)
|
private int |
nextPoissonApproximated(float lambda,
java.util.Random random)
Approximation of the Poisson distribution so that sampling is faster. See
'method PA' in The Computer Generation of Poisson Random Variables by A. C. Atkinson. |
java.lang.String |
pTipText()
Returns the tip text for this property. |
private void |
resampleFromDistribution(float[][][] values)
pushes new instances on the filter output stack sampled from the selected
distribution |
private void |
resampleFromMultinomialDistribution(float[][][] values)
Pushes new instances on the filter output stack sampled from a
multinomial distribution. Appropriate for textual databases. |
void |
resetOptions()
resets all options to the state they had when this filter was created |
java.lang.String |
seedTipText()
Returns the tip text for this property. |
void |
setAllowNegativeValues(boolean m_allowNegativeValues)
Sets whether negative values are allowed when sampling from a
Gaussian or Uniform distribution. |
void |
setAllowPoissonApproximation(boolean m_allowPoissonApproximation)
Specify if fast approximate poisson sampling is to be done. |
void |
setBalanceType(weka.core.SelectedTag newType)
|
boolean |
setInputFormat(weka.core.Instances instanceInfo)
Sets the format of the input instances. |
void |
setInvertSelection(boolean m_invertSelection)
Specifies if class labels are selected in an inverted manner. |
void |
setLabelsRange(java.lang.String rangeList)
Sets which class labels are to be balanced; first, last and all are allowed. |
void |
setOptions(java.lang.String[] options)
Parses a given list of options. |
void |
setP(int m_P)
Sets the number of instances to sample per each class label selected |
void |
setSeed(int seed)
Sets the seed to use for random number generators |
Methods inherited from class weka.filters.Filter |
batchFilterFile, bufferInput, copyValues, copyValues, filterFile, flushInput, getCapabilities, getInputFormat, getOutputFormat, initInputLocators, initOutputLocators, inputFormatPeek, isFirstBatchDone, isNewBatch, isOutputFormatDefined, makeCopies, makeCopy, numPendingOutput, output, outputFormatPeek, outputPeek, push, resetQueue, runFilter, setOutputFormat, testInputFormat, toString, useFilter, wekaStaticWrapper |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
m_balanceType
private int m_balanceType
- distribution to learn in order to re-sample new instances
m_P
private int m_P
- number P of instances to re-sample per class label
m_allowNegativeValues
private boolean m_allowNegativeValues
- indicates if negative values may be sampled. If false, sampled values
have a minimum of 0
m_seed
private int m_seed
- seed used for random number generation
m_allowPoissonApproximation
private boolean m_allowPoissonApproximation
- indicates if sampling from a Poisson distribution may use an
approximate method, which reduces the sampling time and yields similar sampled
values
m_labelsRange
private weka.core.Range m_labelsRange
- range of selected label indexes to balance
m_statisticsTime_ms
private double m_statisticsTime_ms
- stores the time spent (milliseconds) in learning the distribution
m_samplingTime_ms
private double m_samplingTime_ms
- stores the time spent (milliseconds) in sampling the new instances
m_totalInstancesPerClass
private int[] m_totalInstancesPerClass
- used when MULTINOMIAL_BALANCE is selected
UNIFORM_BALANCE
public static final int UNIFORM_BALANCE
- learn and sample Uniform Distribution
- See Also:
- Constant Field Values
GAUSSIAN_BALANCE
public static final int GAUSSIAN_BALANCE
- learn and sample Gaussian Distribution
- See Also:
- Constant Field Values
POISSON_BALANCE
public static final int POISSON_BALANCE
- learn and sample Poisson Distribution
- See Also:
- Constant Field Values
MULTINOMIAL_BALANCE
public static final int MULTINOMIAL_BALANCE
- learn and sample Multinomial Distribution
- See Also:
- Constant Field Values
TAGS_BALANCE
public static final weka.core.Tag[] TAGS_BALANCE
serialVersionUID
static final long serialVersionUID
- See Also:
- Constant Field Values
DistributionBasedBalance
public DistributionBasedBalance()
- The constructor.
globalInfo
public java.lang.String globalInfo()
- Returns:
- a description of the filter suitable for displaying in the
explorer/experimenter gui
getCapabilities
public weka.core.Capabilities getCapabilities()
- Specified by:
getCapabilities
in interface weka.core.CapabilitiesHandler
- Overrides:
getCapabilities
in class weka.filters.Filter
- Returns:
- Capabilities of this filter
getTechnicalInformation
public weka.core.TechnicalInformation getTechnicalInformation()
- Returns an instance of a TechnicalInformation object, containing detailed
information about the technical background of this class, e.g., paper
reference or book this class is based on.
- Specified by:
getTechnicalInformation
in interface weka.core.TechnicalInformationHandler
- Returns:
- the technical information about this class
resampleFromDistribution
private void resampleFromDistribution(float[][][] values)
throws java.lang.Exception
- pushes new instances on the filter output stack sampled from the selected
distribution
- Parameters:
values
- results from calling computeStats(). Its meaning depends on
m_balanceType
- Throws:
java.lang.Exception
- if something goes wrong
resampleFromMultinomialDistribution
private void resampleFromMultinomialDistribution(float[][][] values)
throws java.lang.Exception
- Pushes new instances on the filter output stack sampled from a
multinomial distribution. Appropriate for textual databases.
- Parameters:
values
- results from calling computeStats(). values[0] is
probOfAttGivenClass[][]; values[1] is counts[][1] of attribute
values along each class
- Throws:
java.lang.Exception
- if something goes wrong
drawAttribute
private int drawAttribute(float[] probs,
java.util.Random r)
- Searches for the first index i such that probs[i] >= r.nextFloat(). A binary
search is used since this is the bottleneck in multinomial resampling.
- Parameters:
probs
- array of ordered values from 0 to 1
r
- Random object
- Returns:
- min index in probs[] such that probs[i]>= r.nextFloat()
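The lookup described above can be illustrated with a minimal standalone sketch. It assumes probs holds non-decreasing cumulative probabilities ending at 1 and is non-empty; the class and method names (DrawAttributeSketch, firstIndexAtLeast, draw) are hypothetical and are not part of this filter.

```java
import java.util.Random;

// Hypothetical helper, not the filter's own code.
class DrawAttributeSketch {

    /** Returns the smallest index i with cumulative[i] >= u (last index if none). */
    static int firstIndexAtLeast(float[] cumulative, float u) {
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] >= u) {
                hi = mid;        // mid satisfies the condition, keep searching left
            } else {
                lo = mid + 1;    // the answer lies strictly to the right of mid
            }
        }
        return lo;
    }

    /** Draws one index using a uniform random number in [0, 1). */
    static int draw(float[] cumulative, Random r) {
        return firstIndexAtLeast(cumulative, r.nextFloat());
    }

    public static void main(String[] args) {
        float[] cumulative = {0.2f, 0.5f, 1.0f};
        System.out.println(draw(cumulative, new Random(1)));
    }
}
```

Each draw costs O(log n) comparisons instead of the O(n) of a linear scan, which matters when many attribute draws are needed per sampled instance.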
nextPoisson
private int nextPoisson(float lambda,
java.util.Random random)
- Parameters:
lambda
- mean of distribution
random
- Random object
- Returns:
- int generated from a poisson distribution with mean==lambda
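This page does not say which exact algorithm nextPoisson uses. As an illustration only, the sketch below shows Knuth's classic multiplication method, one common way to draw an exact Poisson value for small means. The class name PoissonKnuthSketch is hypothetical.

```java
import java.util.Random;

// Illustrative only: Knuth's method, which may differ from the filter's own code.
class PoissonKnuthSketch {

    /** Draws a Poisson-distributed int with the given mean (lambda >= 0). */
    static int nextPoisson(double lambda, Random random) {
        double limit = Math.exp(-lambda);   // stop once the running product drops to e^-lambda
        double product = 1.0;
        int k = 0;
        do {
            k++;
            product *= random.nextDouble(); // multiply in one more uniform value
        } while (product > limit);
        return k - 1;
    }

    public static void main(String[] args) {
        System.out.println(nextPoisson(4.0, new Random(42)));
    }
}
```

The expected number of uniform draws grows linearly with lambda, which is why a faster approximate sampler is attractive for large means.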
nextPoissonApproximated
private int nextPoissonApproximated(float lambda,
java.util.Random random)
- Approximation of the Poisson distribution so that sampling is faster. See
'method PA' in The Computer Generation of Poisson Random Variables by A.
C. Atkinson, Journal of the Royal Statistical Society Series C (Applied
Statistics) Vol. 28, No. 1. (1979), pages 29-35.
- Parameters:
lambda
- mean of distribution
random
- Random object
- Returns:
- int generated from a poisson distribution with mean==lambda
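For orientation, the following is a sketch of the PA rejection scheme as it is commonly stated (a logistic envelope around the Poisson mass function). It is not taken from this filter's source, the constants 0.767 and 3.36 are the ones usually quoted for the method, and the scheme is intended for large means. The names PoissonPaSketch and nextPoissonPA are hypothetical.

```java
import java.util.Random;

// Illustrative sketch of a logistic-envelope rejection sampler ("PA" style).
class PoissonPaSketch {

    /** Plain ln(n!) by summation; a cached version would be faster. */
    static double lnFactorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    static int nextPoissonPA(double lambda, Random random) {
        double c = 0.767 - 3.36 / lambda;
        if (c <= 0.0) {
            // Assumption of this sketch: reject means too small for the envelope constant.
            throw new IllegalArgumentException("lambda too small for this method");
        }
        double beta = Math.PI / Math.sqrt(3.0 * lambda);
        double alpha = beta * lambda;
        double k = Math.log(c) - lambda - Math.log(beta);
        while (true) {
            double u = random.nextDouble();
            double x = (alpha - Math.log((1.0 - u) / u)) / beta; // logistic variate
            int n = (int) Math.floor(x + 0.5);                   // round to nearest int
            if (n < 0) {
                continue;                                        // outside the support
            }
            double v = random.nextDouble();
            double y = alpha - beta * x;
            double t = 1.0 + Math.exp(y);
            double lhs = y + Math.log(v / (t * t));
            double rhs = k + n * Math.log(lambda) - lnFactorial(n);
            if (lhs <= rhs) {
                return n;                                        // candidate accepted
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(nextPoissonPA(50.0, new Random(42)));
    }
}
```

Unlike the multiplication method, the work per sample here does not grow with the number of uniform draws needed for a large mean, apart from the ln(n!) term, which is why a fast ln(n!) routine is useful alongside it.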
lnFactorial
protected double lnFactorial(int n)
- Fast computation of ln(n!) for non-negative ints.
Negative ints are passed on to the general gamma-function based version
in weka.core.SpecialFunctions.
If the current n value is higher than any previous one, the cache is
extended and filled to cover it.
The common case is reduced to a simple array lookup.
- Parameters:
n
- the integer
- Returns:
- ln(n!)
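The caching idea described for lnFactorial can be sketched as follows. This is a minimal standalone version: it rejects negative inputs instead of delegating to weka.core.SpecialFunctions, and the class name LnFactorialCacheSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical standalone cache; the filter itself delegates negative n elsewhere.
class LnFactorialCacheSketch {

    // cache[n] holds ln(n!); ln(0!) = 0.
    private double[] cache = {0.0};

    double lnFactorial(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative in this sketch");
        }
        if (n >= cache.length) {
            // Extend the cache up to n, reusing the values already computed.
            double[] grown = Arrays.copyOf(cache, n + 1);
            for (int i = cache.length; i <= n; i++) {
                grown[i] = grown[i - 1] + Math.log(i);
            }
            cache = grown;
        }
        return cache[n]; // common case: a plain array lookup
    }

    public static void main(String[] args) {
        LnFactorialCacheSketch f = new LnFactorialCacheSketch();
        System.out.println(f.lnFactorial(5)); // close to ln(120)
    }
}
```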
computeStats
private float[][][] computeStats()
throws java.lang.Exception
- Computes statistics for each selected label and all attributes, depending on
m_balanceType. UNIFORM_BALANCE: the max (values[0]) and min (values[1]).
GAUSSIAN_BALANCE: the mean (values[0]) and variance (values[1]).
POISSON_BALANCE: the mean (values[0]).
- Returns:
- float[][][] with stats for each selected label and all atts
- Throws:
java.lang.Exception
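As a rough illustration of the GAUSSIAN_BALANCE case, the sketch below computes a per-label mean and variance for every attribute from plain arrays. The array layout stats[0][label][att] / stats[1][label][att], the use of the population variance (dividing by n), and the name ClassStatsSketch are all assumptions of this example, not details confirmed by this page.

```java
// Hypothetical standalone example working on plain arrays, not on weka.core.Instances.
class ClassStatsSketch {

    /** Returns stats[0][label][att] = mean and stats[1][label][att] = variance. */
    static float[][][] meanAndVariance(double[][] rows, int[] labels, int numLabels) {
        int numAtts = rows.length == 0 ? 0 : rows[0].length;
        double[][] sum = new double[numLabels][numAtts];
        double[][] sumSq = new double[numLabels][numAtts];
        int[] count = new int[numLabels];

        // Single pass: accumulate sums and sums of squares per label.
        for (int i = 0; i < rows.length; i++) {
            int label = labels[i];
            count[label]++;
            for (int a = 0; a < numAtts; a++) {
                sum[label][a] += rows[i][a];
                sumSq[label][a] += rows[i][a] * rows[i][a];
            }
        }

        float[][][] stats = new float[2][numLabels][numAtts];
        for (int label = 0; label < numLabels; label++) {
            if (count[label] == 0) {
                continue; // no instances for this label, leave zeros
            }
            for (int a = 0; a < numAtts; a++) {
                double mean = sum[label][a] / count[label];
                double variance = sumSq[label][a] / count[label] - mean * mean;
                stats[0][label][a] = (float) mean;
                stats[1][label][a] = (float) Math.max(0.0, variance); // guard against rounding
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 10}, {3, 10}, {5, 0}};
        int[] labels = {0, 0, 1};
        float[][][] stats = meanAndVariance(rows, labels, 2);
        System.out.println(stats[0][0][0] + " " + stats[1][0][0]);
    }
}
```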
setOptions
public void setOptions(java.lang.String[] options)
throws java.lang.Exception
- Parses a given list of options. Should be called after setInputFormat()
Valid options are:
-D
Specifies the distribution to learn from the training set.
(default 0: GAUSSIAN_BALANCE)
-L <col1,col2-col4,...>
Specifies the indexes of the class labels to balance.
(default all)
-I (true|false)
Specifies if the class label indexes are to be inverted.
(default false)
-N (true|false)
Specifies if sampled values for attributes are allowed to be negative
(default false)
-P
Specifies the number of instances to sample per class label.
(default 30)
-S
Seed for values generation from learned distributions.
-X
Specifies if fast approximate sampling of Poisson values is allowed
(default true)
- Specified by:
setOptions
in interface weka.core.OptionHandler
- Parameters:
options
- the list of options as an array of strings
- Throws:
java.lang.Exception
- if an option is not supported
getInvertSelection
public boolean getInvertSelection()
- Returns:
- true if indexes of class labels are selected in an inverted
manner; false otherwise.
setInvertSelection
public void setInvertSelection(boolean m_invertSelection)
throws java.lang.Exception
- Specifies if class labels are selected in an inverted manner.
- Parameters:
m_invertSelection
-
- Throws:
java.lang.Exception
getP
public int getP()
- Returns:
- int number of instances to sample per each class label selected
setP
public void setP(int m_P)
- Sets the number of instances to sample per each class label selected
- Parameters:
m_P
-
getSeed
public int getSeed()
- Returns:
- int seed for random number generators
setSeed
public void setSeed(int seed)
- Sets the seed to use for random number generators
- Parameters:
seed
-
getAllowNegativeValues
public boolean getAllowNegativeValues()
- Returns:
- true if negative values are allowed when sampling from
a Gaussian or Uniform distribution; false otherwise.
setAllowNegativeValues
public void setAllowNegativeValues(boolean m_allowNegativeValues)
- Sets whether negative values are allowed when sampling from a
Gaussian or Uniform distribution.
- Parameters:
m_allowNegativeValues
-
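The effect of this setting on a Gaussian draw can be sketched as below, following the field description that sampled values are 0 as a minimum when negatives are not allowed. The class and method names are hypothetical, and the way the filter itself applies the limit may differ.

```java
import java.util.Random;

// Hypothetical illustration of clamping sampled values at zero.
class GaussianClampSketch {

    /** Draws from N(mean, variance); values below 0 become 0 unless allowNegative is true. */
    static double sample(double mean, double variance, boolean allowNegative, Random random) {
        double value = mean + Math.sqrt(variance) * random.nextGaussian();
        if (!allowNegative && value < 0.0) {
            return 0.0;
        }
        return value;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        System.out.println(sample(0.5, 4.0, false, random));
    }
}
```

Clamping is a natural choice for count-like attributes such as word frequencies, where a negative value has no meaning.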
balanceTypeTipText
public java.lang.String balanceTypeTipText()
- Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the
explorer/experimenter gui
allowNegativeValuesTipText
public java.lang.String allowNegativeValuesTipText()
- Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the
explorer/experimenter gui
seedTipText
public java.lang.String seedTipText()
- Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the
explorer/experimenter gui
allowPoissonApproximationTipText
public java.lang.String allowPoissonApproximationTipText()
- Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the
explorer/experimenter gui
pTipText
public java.lang.String pTipText()
- Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the
explorer/experimenter gui
labelsRangeTipText
public java.lang.String labelsRangeTipText()
- Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the
explorer/experimenter gui
invertSelectionTipText
public java.lang.String invertSelectionTipText()
- Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the
explorer/experimenter gui
resetOptions
public void resetOptions()
- resets all options to the state they had when this filter was created
getOptions
public java.lang.String[] getOptions()
- Specified by:
getOptions
in interface weka.core.OptionHandler
- Returns:
- String[] describing the options
setInputFormat
public boolean setInputFormat(weka.core.Instances instanceInfo)
throws java.lang.Exception
- Sets the format of the input instances.
- Overrides:
setInputFormat
in class weka.filters.Filter
- Parameters:
instanceInfo
- an Instances object containing the input instance structure
(any instances contained in the object are ignored - only the
structure is required).
- Returns:
- true if the outputFormat may be collected immediately
- Throws:
java.lang.Exception
- if the input format can't be set successfully
input
public boolean input(weka.core.Instance instance)
- Input an instance for filtering. Filter requires all training instances
be read before producing output.
- Overrides:
input
in class weka.filters.Filter
- Parameters:
instance
- the input instance
- Returns:
- true if the filtered instance may now be collected with output().
- Throws:
java.lang.IllegalStateException
- if no input structure has been defined
batchFinished
public boolean batchFinished()
throws java.lang.Exception
- Signify that this batch of input to the filter is finished. If the filter
requires all instances prior to filtering, output() may now be called to
retrieve the filtered instances.
- Overrides:
batchFinished
in class weka.filters.Filter
- Returns:
- true if there are instances pending output
- Throws:
java.lang.IllegalStateException
- if no input structure has been defined
java.lang.Exception
- if provided options cannot be executed on input instances
getRevision
public java.lang.String getRevision()
- Returns the revision string.
- Specified by:
getRevision
in interface weka.core.RevisionHandler
- Overrides:
getRevision
in class weka.filters.Filter
- Returns:
- the revision
listOptions
public java.util.Enumeration<weka.core.Option> listOptions()
- Returns an enumeration describing the available options.
- Specified by:
listOptions
in interface weka.core.OptionHandler
- Returns:
- an enumeration of all the available options.
getAllowPoissonApproximation
public boolean getAllowPoissonApproximation()
- Returns:
- true if fast approximate Poisson sampling is to be done
setAllowPoissonApproximation
public void setAllowPoissonApproximation(boolean m_allowPoissonApproximation)
- Specify if fast approximate poisson sampling is to be done. Sampled
values are similar to those obtained by exact poisson sampling. This
affects when m_balanceType is POISSON_BALANCE or MULTINOMIAL_BALANCE
- Parameters:
m_allowPoissonApproximation
- true or false
getFilteringTime_ms
public double getFilteringTime_ms()
- Returns:
- double indicating the time spent (milliseconds) in the whole
filtering (balancing) process
getSamplingTime_ms
public double getSamplingTime_ms()
- Returns:
- double indicating the time spent (milliseconds) when sampling new
instances from a given distribution
getStatisticsTime_ms
public double getStatisticsTime_ms()
- Returns:
- double indicating the time spent (milliseconds) when learning the
necessary statistics for a given distribution
getBalanceType
public weka.core.SelectedTag getBalanceType()
- Returns:
- SelectedTag indicating the type of balance set
setBalanceType
public void setBalanceType(weka.core.SelectedTag newType)
throws java.lang.Exception
- Parameters:
newType
- the type of distribution-based balancing desired
- Throws:
java.lang.Exception
getSelectedClassLabels
public int[] getSelectedClassLabels()
- Returns:
- int[] class labels which will be balanced
setLabelsRange
public void setLabelsRange(java.lang.String rangeList)
throws java.lang.Exception
- Sets which class labels are to be balanced; first, last and all are allowed.
- Parameters:
rangeList
- a string representing the list of label indexes. Labels are
indexed from 1 for users. eg: first-2,4,6-last
- Throws:
java.lang.Exception
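To make the range syntax concrete, here is a minimal standalone sketch that expands a string such as first-2,4,6-last into 0-based label indexes, with optional inversion. In the filter this job belongs to weka.core.Range; the class LabelRangeSketch and its methods are hypothetical and only mimic the user-facing 1-based convention described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for illustration; the filter relies on weka.core.Range instead.
class LabelRangeSketch {

    /** Converts one token (first, last or a 1-based number) to a 0-based index. */
    static int index(String token, int numLabels) {
        String t = token.trim();
        if (t.equals("first")) return 0;
        if (t.equals("last")) return numLabels - 1;
        int i = Integer.parseInt(t) - 1;
        if (i < 0 || i >= numLabels) {
            throw new IllegalArgumentException("index out of range: " + t);
        }
        return i;
    }

    /** Expands the range list to sorted 0-based indexes; invert selects the complement. */
    static int[] select(String rangeList, int numLabels, boolean invert) {
        boolean[] chosen = new boolean[numLabels];
        for (String part : rangeList.split(",")) {
            String p = part.trim();
            if (p.equals("all")) {
                Arrays.fill(chosen, true);
                continue;
            }
            int dash = p.indexOf('-');
            int from = index(dash < 0 ? p : p.substring(0, dash), numLabels);
            int to = index(dash < 0 ? p : p.substring(dash + 1), numLabels);
            for (int i = Math.min(from, to); i <= Math.max(from, to); i++) {
                chosen[i] = true;
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < numLabels; i++) {
            if (chosen[i] != invert) {
                result.add(i);
            }
        }
        return result.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(select("first-2,4,6-last", 8, false)));
    }
}
```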
getLabelsRange
public java.lang.String getLabelsRange()
dbBalance
private void dbBalance()
throws java.lang.Exception
- method to call the corresponding method to perform a distribution-based
balance of training data
- Throws:
java.lang.Exception
- if something goes wrong
main
public static void main(java.lang.String[] args)
- Main method for running this filter.
- Parameters:
args
- should contain arguments to the filter: