Class LlamaService

java.lang.Object
chat.octet.model.LlamaService

public class LlamaService extends Object
Llama.cpp API

C++ source: llamajava.h, llamajava.cpp

Since:
b1497
Author:
William
  • Constructor Summary

    Constructors
    Constructor
    Description
    LlamaService()
  • Method Summary

    Modifier and Type
    Method
    Description
    static int
    batchDecode(int sequenceId, int[] tokens, int inputLength, int pastTokenSize)
    Batch decoding.
    static void
    clearCache(int sequenceId)
    Clear cache in K-V sequences.
    static void
    clearCache(int sequenceId, int posStart, int posEnd)
    Clear cache in K-V sequences.
    static void
    createNewContextWithModel(LlamaContextParams params)
    Create new context with model.
    static int
    getContextSize()
    Get model context size.
    static float[]
    getEmbedding()
    Get the embedding.
    static LlamaContextParams
    getLlamaContextDefaultParams()
    Get llama context default params.
    static LlamaModelParams
    getLlamaModelDefaultParams()
    Get llama model default params.
    static LlamaTokenType
    getLlamaTokenType(int token)
    Get the token type definition.
    static float[]
    getLogits(int index)
    Get logits at the given index; the default index is 0.
    static Metrics
    getSamplingMetrics(boolean reset)
    Get sampling metrics.
    static String
    getSystemInfo()
    Get system parameter information.
    static int
    getTokenBOS()
    Get special BOS token.
    static int
    getTokenEOS()
    Get special EOS token.
    static int
    getTokenType(int token)
    Get token type code.
    static int
    getVocabSize()
    Get model vocab size.
    static void
    initNative()
    Initialize the JNI context.
    static boolean
    isMlockSupported()
    Check whether MLOCK is supported.
    static boolean
    isMmapSupported()
    Check whether MMAP is supported.
    static void
    llamaBackendFree()
    Call once at the end of the program.
    static void
    llamaBackendInit(boolean numa)
    Initialize the llama + ggml backend; if numa is true, use NUMA optimizations. Call once at the start of the program.
    static boolean
    loadLlamaGrammar(String grammarRules)
    Load llama grammar by rules.
    static void
    loadLlamaModelFromFile(String modelPath, LlamaModelParams params)
    Load Llama model from file.
    static int
    loadLoraModelFromFile(String loraPath, float loraScale, String baseModelPath, int threads)
    Apply a LoRA adapter to a loaded model.
    static void
    release()
    Close model and release all resources.
    static int
    sampling(float[] logits, int[] lastTokens, int lastTokensSize, float penalty, float alphaFrequency, float alphaPresence, boolean penalizeNL, int mirostatMode, float mirostatTAU, float mirostatETA, float temperature, int topK, float topP, float tsf, float typical, float minP, int sequenceId, int pastTokenSize)
    Sample the next token from the inference output.
    static int
    tokenize(byte[] buf, int bufferLength, int[] tokens, int maxTokens, boolean addBos, boolean specialTokens)
    Convert the provided text into tokens.
    static int[]
    tokenize(String text, boolean addBos, boolean specialTokens)
    Convert the provided text into tokens.
    static int
    tokenToPiece(int token, byte[] buf, int bufferLength)
    Convert the token id to a text piece.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • LlamaService

      public LlamaService()
  • Method Details

    • initNative

      public static void initNative()
      Initialize the JNI context.
    • getLlamaModelDefaultParams

      public static LlamaModelParams getLlamaModelDefaultParams()
      Get llama model default params.
      Returns:
      LlamaModelParams
    • getLlamaContextDefaultParams

      public static LlamaContextParams getLlamaContextDefaultParams()
      Get llama context default params.
      Returns:
      LlamaContextParams
    • llamaBackendInit

      public static void llamaBackendInit(boolean numa)
      Initialize the llama + ggml backend; if numa is true, use NUMA optimizations. Call once at the start of the program.
      Parameters:
      numa - Use NUMA optimizations.
    • llamaBackendFree

      public static void llamaBackendFree()
      Call once at the end of the program. NOTE: currently only used for MPI.
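The backend and model lifecycle calls above follow a fixed order: initNative, llamaBackendInit, model load, context creation, then release and llamaBackendFree at shutdown. A minimal lifecycle sketch, assuming the llama-java native library is available and a local GGUF model (the file path and the import paths of the params classes are illustrative, not confirmed by this page):

```java
import chat.octet.model.LlamaService;

public class LifecycleExample {
    public static void main(String[] args) throws Exception {
        LlamaService.initNative();            // load the JNI bindings once
        LlamaService.llamaBackendInit(false); // false = no NUMA optimizations

        // Use the library defaults as a starting point, then load the model.
        LlamaModelParams modelParams = LlamaService.getLlamaModelDefaultParams();
        LlamaService.loadLlamaModelFromFile("/path/to/model.gguf", modelParams);

        LlamaContextParams ctxParams = LlamaService.getLlamaContextDefaultParams();
        LlamaService.createNewContextWithModel(ctxParams);

        System.out.println(LlamaService.getSystemInfo());

        // ... run inference ...

        LlamaService.release();          // close the model, free all resources
        LlamaService.llamaBackendFree(); // call once at the end of the program
    }
}
```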
    • loadLlamaModelFromFile

      public static void loadLlamaModelFromFile(String modelPath, LlamaModelParams params) throws ModelException
      Load Llama model from file.
      Parameters:
      modelPath - Llama model file path.
      params - Llama model params.
      Throws:
      ModelException
    • createNewContextWithModel

      public static void createNewContextWithModel(LlamaContextParams params) throws ModelException
      Create new context with model.
      Parameters:
      params - Llama context params.
      Throws:
      ModelException
    • release

      public static void release()
      Close model and release all resources.
    • isMmapSupported

      public static boolean isMmapSupported()
      Check whether MMAP is supported.
      Returns:
      boolean
    • isMlockSupported

      public static boolean isMlockSupported()
      Check whether MLOCK is supported.
      Returns:
      boolean
    • getVocabSize

      public static int getVocabSize()
      Get model vocab size.
      Returns:
      int
    • getContextSize

      public static int getContextSize()
      Get model context size.
      Returns:
      int
    • loadLoraModelFromFile

      public static int loadLoraModelFromFile(String loraPath, float loraScale, String baseModelPath, int threads) throws ModelException
      Apply a LoRA adapter to a loaded model. baseModelPath is the path to a higher-quality model to use as a base for the layers modified by the adapter; it can be NULL to use the currently loaded model. The model needs to be reloaded before applying a new adapter, otherwise the adapter will be applied on top of the previous one.
      Parameters:
      loraPath - LoRA adapter file path.
      loraScale - LoRA scale.
      baseModelPath - Base model file path.
      threads - Thread number.
      Returns:
      int, returns 0 on success, non-zero on failure.
      Throws:
      ModelException
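A sketch of the reload-then-apply rule described above (the adapter path and thread count are hypothetical):

```java
// Applies to a freshly loaded model; reload the model before applying a
// second adapter, or the new adapter stacks on top of the previous one.
int status = LlamaService.loadLoraModelFromFile(
        "/path/to/adapter.gguf", // LoRA adapter file path
        1.0f,                    // LoRA scale
        null,                    // null = use the currently loaded model as base
        4);                      // worker threads
if (status != 0) {
    throw new RuntimeException("Failed to apply LoRA adapter, code " + status);
}
```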
    • getLogits

      public static float[] getLogits(int index)
      Get logits at the given index; the default index is 0.
      Parameters:
      index - index
      Returns:
      float[], Returns one-dimensional float array.
    • getEmbedding

      public static float[] getEmbedding()
      Get the embedding.
      Returns:
      float[], Returns embedding float array.
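Embeddings are read back after the input has been decoded. A sketch, assuming the context was created with embedding output enabled in LlamaContextParams (the exact flag name is not specified on this page):

```java
// Tokenize a prompt, decode it in sequence 0, then read the embedding vector.
int[] tokens = LlamaService.tokenize("Hello world", true, false);
LlamaService.batchDecode(0, tokens, tokens.length, 0);
float[] embedding = LlamaService.getEmbedding();
```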
    • getTokenType

      public static int getTokenType(int token)
      Get token type code.
      Parameters:
      token - Token id.
      Returns:
      int
    • getTokenBOS

      public static int getTokenBOS()
      Get special BOS token.
      Returns:
      int, Returns token id.
    • getTokenEOS

      public static int getTokenEOS()
      Get special EOS token.
      Returns:
      int, Returns token id.
    • tokenize

      public static int tokenize(byte[] buf, int bufferLength, int[] tokens, int maxTokens, boolean addBos, boolean specialTokens)
      Convert the provided text into tokens. The tokens array must be large enough to hold the resulting tokens. Returns the number of tokens on success, no more than maxTokens.
      Parameters:
      buf - Text byte buffer.
      bufferLength - Text byte buffer length.
      tokens - Empty token array used to receive the returned tokens.
      maxTokens - Maximum number of tokens; defaults to the context size.
      addBos - Add special BOS token.
      specialTokens - Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Does not insert a leading space.
      Returns:
      int, Returns a negative number on failure, else the number of tokens that would have been returned.
    • tokenToPiece

      public static int tokenToPiece(int token, byte[] buf, int bufferLength)
      Convert the token id to text piece.
      Parameters:
      token - Token id.
      buf - Input byte buffer.
      bufferLength - Input byte buffer length.
      Returns:
      int, Returns byte buffer length of the piece.
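The two calls above invert each other: tokenize turns text into token ids and tokenToPiece turns each id back into UTF-8 bytes. A round-trip sketch (the buffer size is an arbitrary choice; a single piece is far smaller than 64 bytes in practice):

```java
import java.nio.charset.StandardCharsets;

// text -> tokens -> text pieces
int[] tokens = LlamaService.tokenize("Hello world", true, false);
StringBuilder sb = new StringBuilder();
byte[] buf = new byte[64];
for (int token : tokens) {
    int len = LlamaService.tokenToPiece(token, buf, buf.length);
    sb.append(new String(buf, 0, len, StandardCharsets.UTF_8));
}
// sb now approximates the original text (modulo BOS and leading-space handling)
```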
    • getSamplingMetrics

      public static Metrics getSamplingMetrics(boolean reset)
      Get sampling metrics.
      Parameters:
      reset - Reset the counter when finished.
      Returns:
      Metrics
    • getSystemInfo

      public static String getSystemInfo()
      Get system parameter information.
      Returns:
      String
    • sampling

      public static int sampling(float[] logits, int[] lastTokens, int lastTokensSize, float penalty, float alphaFrequency, float alphaPresence, boolean penalizeNL, int mirostatMode, float mirostatTAU, float mirostatETA, float temperature, int topK, float topP, float tsf, float typical, float minP, int sequenceId, int pastTokenSize) throws DecodeException
      Sample the next token from the inference output.
      Parameters:
      logits - User-defined logits; adjustments can be made via LogitsProcessor.
      lastTokens - Last token array.
      lastTokensSize - Last token array size.
      penalty - Control the repetition of token sequences in the generated text.
      alphaFrequency - Repeat alpha frequency penalty.
      alphaPresence - Repeat alpha presence penalty.
      penalizeNL - Disable penalization for newline tokens when applying the repeat penalty.
      mirostatMode - Mirostat sampling mode; Mirostat sampling controls perplexity during text generation.
      mirostatTAU - Mirostat target entropy.
      mirostatETA - Mirostat learning rate.
      temperature - Adjust the randomness of the generated text.
      topK - Top-K sampling: limit the next token selection to the K most probable tokens.
      topP - Top-P sampling: limit the next token selection to a subset of tokens with a cumulative probability above the threshold P.
      tsf - Tail-free sampling (TFS) with parameter z.
      typical - Typical sampling with parameter p.
      minP - Min-P sampling: set a minimum base probability threshold for token selection.
      sequenceId - Generation sequence id.
      pastTokenSize - Past token size.
      sequenceId - Generation sequence id.
      pastTokenSize - Past token size.
      Returns:
      int, Returns the sampled token id.
      Throws:
      DecodeException
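These parameters mirror llama.cpp's sampling settings. A generation-loop sketch; the parameter values are common llama.cpp defaults, not prescriptions, and it assumes (as the sequenceId/pastTokenSize parameters suggest) that sampling also decodes the sampled token into the sequence:

```java
int seqId = 0;
int[] prompt = LlamaService.tokenize("Once upon a time", true, false);
LlamaService.batchDecode(seqId, prompt, prompt.length, 0); // prefill the prompt

int past = prompt.length;
int[] generated = new int[256];
int count = 0;
while (count < generated.length) {
    float[] logits = LlamaService.getLogits(0);
    int token = LlamaService.sampling(
            logits, generated, count,
            1.1f,   // penalty
            0.0f,   // alphaFrequency
            0.0f,   // alphaPresence
            false,  // penalizeNL
            0,      // mirostatMode (0 = disabled)
            5.0f,   // mirostatTAU
            0.1f,   // mirostatETA
            0.8f,   // temperature
            40,     // topK
            0.95f,  // topP
            1.0f,   // tsf (1.0 = disabled)
            1.0f,   // typical (1.0 = disabled)
            0.05f,  // minP
            seqId, past);
    if (token == LlamaService.getTokenEOS()) break; // stop at end-of-sequence
    generated[count++] = token;
    past++;
}
LlamaService.clearCache(seqId); // release the K-V cache for this sequence
```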
    • loadLlamaGrammar

      public static boolean loadLlamaGrammar(String grammarRules)
      Load llama grammar by rules.
      Parameters:
      grammarRules - Grammar rules.
      Returns:
      boolean, returns true on success, false on failure.
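The grammar rules are passed as a string; llama.cpp grammars use the GBNF format. A sketch with an illustrative rule that constrains output to "yes" or "no":

```java
// Load a GBNF grammar that restricts generation to two literal answers.
String rules = "root ::= \"yes\" | \"no\"";
boolean ok = LlamaService.loadLlamaGrammar(rules);
if (!ok) {
    throw new IllegalArgumentException("Grammar rules failed to parse");
}
```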
    • batchDecode

      public static int batchDecode(int sequenceId, int[] tokens, int inputLength, int pastTokenSize)
      Batch decoding.
      Parameters:
      sequenceId - Specify a unique generation sequence id.
      tokens - Array of tokens to be decoded.
      inputLength - Input context length.
      pastTokenSize - Past token size.
      Returns:
      int, returns 0 on success, non-zero on failure.
    • clearCache

      public static void clearCache(int sequenceId, int posStart, int posEnd)
      Clear cache in K-V sequences.
      Parameters:
      sequenceId - Generation sequence id.
      posStart - Start position.
      posEnd - End position.
    • clearCache

      public static void clearCache(int sequenceId)
      Clear cache in K-V sequences.
      Parameters:
      sequenceId - Generation sequence id.
    • tokenize

      public static int[] tokenize(String text, boolean addBos, boolean specialTokens)
      Convert the provided text into tokens.
      Parameters:
      text - Input text.
      addBos - Add special BOS token.
      specialTokens - Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Does not insert a leading space.
      Returns:
      int[], Returns the resulting token array.
    • getLlamaTokenType

      public static LlamaTokenType getLlamaTokenType(int token)
      Get the token type definition.
      Parameters:
      token - Token id.
      Returns:
      LlamaTokenType