Class

org.apache.spark.mllib.clustering

LDA


class LDA extends Logging

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

Terminology:

References:

Annotations
@Since( "1.3.0" )
See also

Latent Dirichlet allocation (Wikipedia)

Linear Supertypes
Logging, AnyRef, Any

Instance Constructors

  1. new LDA()

    Constructs an LDA instance with default parameters.

    Annotations
    @Since( "1.3.0" )
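The defaults mentioned on this page (k = 10, maxIterations = 20, checkpointInterval = 10, and -1 meaning "automatic" for both concentration parameters) can be read back through the getters. The following is a minimal sketch, assuming Spark MLlib is on the classpath; the object name `LdaDefaultsSketch` and its `defaults` helper are hypothetical and exist only for illustration.

```scala
import org.apache.spark.mllib.clustering.LDA

object LdaDefaultsSketch {
  // Hypothetical helper: builds an estimator with the no-argument constructor.
  // No SparkContext is involved until run() is called.
  def defaults(): LDA = new LDA()

  def main(args: Array[String]): Unit = {
    val lda = defaults()
    println(s"k                  = ${lda.getK}")
    println(s"maxIterations      = ${lda.getMaxIterations}")
    println(s"checkpointInterval = ${lda.getCheckpointInterval}")
    // -1 is the sentinel for "choose automatically" for both priors.
    println(s"docConcentration   = ${lda.getDocConcentration}")
    println(s"topicConcentration = ${lda.getTopicConcentration}")
  }
}
```

Because every setter returns `LDA.this.type`, overrides can be chained directly onto the constructor call, for example `new LDA().setK(5).setMaxIterations(50)`.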

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  6. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  7. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  8. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  9. def getAlpha: Double

    Alias for getDocConcentration

    Annotations
    @Since( "1.3.0" )
  10. def getAsymmetricAlpha: Vector

    Alias for getAsymmetricDocConcentration

    Annotations
    @Since( "1.5.0" )
  11. def getAsymmetricDocConcentration: Vector

    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

    This is the parameter to a Dirichlet distribution.

    Annotations
    @Since( "1.5.0" )
  12. def getBeta: Double

    Alias for getTopicConcentration

    Annotations
    @Since( "1.3.0" )
  13. def getCheckpointInterval: Int

    Period (in iterations) between checkpoints.

    Annotations
    @Since( "1.3.0" )
  14. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  15. def getDocConcentration: Double

    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

    This method assumes the Dirichlet distribution is symmetric and can be described by a single Double parameter. It should fail if docConcentration is asymmetric.

    Annotations
    @Since( "1.3.0" )
  16. def getK: Int

    Number of topics to infer, i.e., the number of soft cluster centers.

    Annotations
    @Since( "1.3.0" )
  17. def getMaxIterations: Int

    Maximum number of iterations allowed.

    Annotations
    @Since( "1.3.0" )
  18. def getOptimizer: LDAOptimizer

    :: DeveloperApi ::

    LDAOptimizer used to perform the actual calculation

    Annotations
    @Since( "1.4.0" ) @DeveloperApi()
  19. def getSeed: Long

    Random seed for cluster initialization.

    Annotations
    @Since( "1.3.0" )
  20. def getTopicConcentration: Double

    Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

    This is the parameter to a symmetric Dirichlet distribution.

    Annotations
    @Since( "1.3.0" )
    Note

    The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.

  21. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  22. def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean = false): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  23. def initializeLogIfNecessary(isInterpreter: Boolean): Unit

    Attributes
    protected
    Definition Classes
    Logging
  24. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  25. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  26. def log: Logger

    Attributes
    protected
    Definition Classes
    Logging
  27. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  28. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  29. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  30. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  31. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  32. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  33. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  34. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  35. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  36. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  37. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  38. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  39. final def notify(): Unit

    Definition Classes
    AnyRef
  40. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  41. def run(documents: JavaPairRDD[Long, Vector]): LDAModel

    Java-friendly version of run()

    Annotations
    @Since( "1.3.0" )
  42. def run(documents: RDD[(Long, Vector)]): LDAModel

    Learn an LDA model using the given dataset.

    documents

    RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and greater than or equal to 0.

    returns

    Inferred LDA model

    Annotations
    @Since( "1.3.0" )
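The input format described above (term-count vectors paired with unique, non-negative Long IDs) can be sketched as follows. This is an illustrative example rather than a reference workflow: the toy corpus, the object name `LdaRunSketch`, and the local master setting are all assumptions made for the sketch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{LDA, LDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object LdaRunSketch {
  // Hypothetical toy corpus: 4 documents over a 5-term vocabulary.
  // Each vector holds the count of every vocabulary term in one document.
  val termCounts: Seq[Vector] = Seq(
    Vectors.dense(2.0, 1.0, 0.0, 0.0, 0.0),
    Vectors.dense(3.0, 2.0, 1.0, 0.0, 0.0),
    Vectors.dense(0.0, 0.0, 1.0, 2.0, 3.0),
    Vectors.dense(0.0, 0.0, 0.0, 1.0, 2.0)
  )

  def train(sc: SparkContext, k: Int): LDAModel = {
    // zipWithIndex assigns unique IDs 0, 1, 2, ...; swap to (id, vector).
    val corpus: RDD[(Long, Vector)] =
      sc.parallelize(termCounts).zipWithIndex().map { case (v, id) => (id, v) }.cache()
    new LDA().setK(k).setMaxIterations(10).setSeed(42L).run(corpus)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LdaRunSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val model = train(sc, k = 2)
      // Print the three highest-weighted term indices for each topic.
      model.describeTopics(maxTermsPerTopic = 3).zipWithIndex.foreach {
        case ((terms, weights), topic) =>
          val summary = terms.zip(weights).map { case (t, w) => f"$t:$w%.3f" }
          println(s"topic $topic: ${summary.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The vocabulary size is taken from the vector length, so every document vector must have the same length.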
  43. def setAlpha(alpha: Double): LDA.this.type

    Alias for setDocConcentration()

    Annotations
    @Since( "1.3.0" )
  44. def setAlpha(alpha: Vector): LDA.this.type

    Alias for setDocConcentration()

    Annotations
    @Since( "1.5.0" )
  45. def setBeta(beta: Double): LDA.this.type

    Alias for setTopicConcentration()

    Annotations
    @Since( "1.3.0" )
  46. def setCheckpointInterval(checkpointInterval: Int): LDA.this.type

    Sets the checkpoint interval (greater than or equal to 1) or disables checkpointing (-1). For example, 10 means that the cache will get checkpointed every 10 iterations. Checkpointing helps with recovery when nodes fail. It also helps eliminate temporary shuffle files on disk, which can be important when LDA is run for many iterations. If the checkpoint directory is not set in org.apache.spark.SparkContext, this setting is ignored. (default = 10)

    Annotations
    @Since( "1.3.0" )
    See also

    org.apache.spark.SparkContext#setCheckpointDir
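Since the interval is ignored unless a checkpoint directory is registered, the two settings are usually configured together. A minimal sketch, assuming a local SparkContext and a temporary directory; the object name `LdaCheckpointSketch` and its `configure` helper are hypothetical.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA

object LdaCheckpointSketch {
  // Hypothetical helper: registers a checkpoint directory on the SparkContext
  // and returns an LDA estimator that checkpoints every `interval` iterations.
  def configure(sc: SparkContext, checkpointDir: String, interval: Int): LDA = {
    sc.setCheckpointDir(checkpointDir)
    new LDA().setCheckpointInterval(interval)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LdaCheckpointSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // A temporary local directory stands in for a real checkpoint location.
      val dir = Files.createTempDirectory("lda-checkpoints").toString
      val lda = configure(sc, dir, interval = 5)
      println(s"checkpoint dir = ${sc.getCheckpointDir}")
      println(s"interval       = ${lda.getCheckpointInterval}")
    } finally {
      sc.stop()
    }
  }
}
```

On a cluster, the directory would normally live on a shared filesystem such as HDFS rather than a local temporary path.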

  47. def setDocConcentration(docConcentration: Double): LDA.this.type

    Replicates a Double docConcentration to create a symmetric prior.

    Annotations
    @Since( "1.3.0" )
  48. def setDocConcentration(docConcentration: Vector): LDA.this.type

    Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

    This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).

    If set to a singleton vector Vector(-1), then docConcentration is set automatically. If set to singleton vector Vector(t) where t != -1, then t is replicated to a vector of length k during LDAOptimizer.initialize(). Otherwise, the docConcentration vector must be length k. (default = Vector(-1) = automatic)

    Optimizer-specific parameter settings:

    • EM
      • Currently only supports symmetric distributions, so all values in the vector should be the same.
      • Values should be greater than 1.0
      • default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
    • Online
      • Values should be greater than or equal to 0
      • default = uniformly (1.0 / k), following the implementation from here.
    Annotations
    @Since( "1.5.0" )
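The replication and default rules above can be restated as a small pure-Scala function. This is a sketch of the documented behaviour, not Spark's internal code: the object `DocConcentrationSketch` and its `resolve` method are hypothetical, and the optimizer is identified by the names "em" and "online" purely for illustration.

```scala
object DocConcentrationSketch {
  // Hypothetical helper mirroring the rules documented for setDocConcentration:
  //   Vector(-1)       -> optimizer-specific default, replicated k times
  //   Vector(t), t!=-1 -> t replicated k times
  //   anything else    -> must already have length k
  def resolve(alpha: Seq[Double], k: Int, optimizer: String): Seq[Double] = {
    require(k > 0, s"k must be positive, got $k")
    alpha match {
      case Seq(-1.0) =>
        val default = optimizer match {
          case "em"     => 50.0 / k + 1.0 // 50/k plus the +1 adjustment for EM
          case "online" => 1.0 / k
          case other    => throw new IllegalArgumentException(s"unknown optimizer: $other")
        }
        Seq.fill(k)(default)
      case Seq(t) =>
        Seq.fill(k)(t)
      case full =>
        require(full.length == k, s"expected length $k, got ${full.length}")
        full
    }
  }

  def main(args: Array[String]): Unit = {
    println(resolve(Seq(-1.0), 10, "em"))     // ten copies of 6.0
    println(resolve(Seq(-1.0), 4, "online"))  // four copies of 0.25
    println(resolve(Seq(2.0), 3, "em"))       // three copies of 2.0
  }
}
```

With k = 10, for instance, the automatic EM default works out to 50/10 + 1 = 6.0 for every topic.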
  49. def setK(k: Int): LDA.this.type

    Set the number of topics to infer, i.e., the number of soft cluster centers. (default = 10)

    Annotations
    @Since( "1.3.0" )
  50. def setMaxIterations(maxIterations: Int): LDA.this.type

    Set the maximum number of iterations allowed. (default = 20)

    Annotations
    @Since( "1.3.0" )
  51. def setOptimizer(optimizerName: String): LDA.this.type

    Set the LDAOptimizer used to perform the actual calculation by algorithm name. Currently "em", "online" are supported.

    Annotations
    @Since( "1.4.0" )
  52. def setOptimizer(optimizer: LDAOptimizer): LDA.this.type

    :: DeveloperApi ::

    LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)

    Annotations
    @Since( "1.4.0" ) @DeveloperApi()
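The two setOptimizer overloads can be compared side by side. The sketch below assumes that `EMLDAOptimizer` and `OnlineLDAOptimizer` are available in `org.apache.spark.mllib.clustering`; `OnlineLDAOptimizer` and its `setMiniBatchFraction` setter are not described on this page, so check them against your Spark version. The object name `LdaOptimizerSketch` and its helper methods are hypothetical.

```scala
import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA, OnlineLDAOptimizer}

object LdaOptimizerSketch {
  // Selects the optimizer by algorithm name ("em" or "online").
  def byName(name: String): LDA = new LDA().setOptimizer(name)

  // Passes a pre-configured optimizer instance instead of a name, which
  // allows optimizer-specific settings to be adjusted first.
  def onlineWithMiniBatch(fraction: Double): LDA =
    new LDA().setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(fraction))

  def main(args: Array[String]): Unit = {
    val em = byName("em")
    val online = onlineWithMiniBatch(0.1)
    println(s"em uses EMLDAOptimizer:         ${em.getOptimizer.isInstanceOf[EMLDAOptimizer]}")
    println(s"online uses OnlineLDAOptimizer: ${online.getOptimizer.isInstanceOf[OnlineLDAOptimizer]}")
  }
}
```

The name-based overload is the simpler choice when the optimizer's own defaults are acceptable; the instance-based overload is marked as a developer API.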
  53. def setSeed(seed: Long): LDA.this.type

    Set the random seed for cluster initialization.

    Annotations
    @Since( "1.3.0" )
  54. def setTopicConcentration(topicConcentration: Double): LDA.this.type

    Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

    This is the parameter to a symmetric Dirichlet distribution.

    Annotations
    @Since( "1.3.0" )
    Note

    The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.

    If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)

    Optimizer-specific parameter settings:

    • EM
      • Value should be greater than 1.0
      • default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
    • Online
      • Value should be greater than or equal to 0
      • default = (1.0 / k), following the implementation from here.
  55. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  56. def toString(): String

    Definition Classes
    AnyRef → Any
  57. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  58. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  59. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
