Package chat.octet.model
Class LlamaService
java.lang.Object
    chat.octet.model.LlamaService
Llama.cpp API
C++ source: llamajava.h, llamajava.cpp
Since:
    b1865
Author:
    William
Constructor Summary

Constructors:
    LlamaService()

Method Summary

static int batchDecode(int sequenceId, int[] tokens, int inputLength, int pastTokenSize)
    Batch decoding.
static void clearCache(int sequenceId)
    Clear cache in K-V sequences.
static void clearCache(int sequenceId, int posStart, int posEnd)
    Clear cache in K-V sequences.
static void createNewContextWithModel(LlamaContextParams params)
    Create new context with model.
static int getContextSize()
    Get model context size.
static float[] getEmbedding()
    Get embedding.
static LlamaContextParams getLlamaContextDefaultParams()
    Get llama context default params.
static LlamaModelParams getLlamaModelDefaultParams()
    Get llama model default params.
static LlamaModelQuantizeParams getLlamaModelQuantizeDefaultParams()
    Get llama model quantize default params.
static LlamaTokenType getLlamaTokenType(int token)
    Get the token type definition.
static float[] getLogits(int index)
    Get logits by index; the default index is 0.
static Metrics getSamplingMetrics(boolean reset)
    Get sampling metrics.
static String getSystemInfo()
    Get system parameter information.
static int getTokenBOS()
    Get special BOS token.
static int getTokenEOS()
    Get special EOS token.
static int getTokenType(int token)
    Get token type code.
static int getVocabSize()
    Get model vocab size.
static void initNative()
    Initialize the JNI context.
static boolean isMlockSupported()
    Check whether MLOCK is supported.
static boolean isMmapSupported()
    Check whether MMAP is supported.
static void llamaBackendFree()
    Call once at the end of the program.
static void llamaBackendInit(boolean numa)
    Initialize the llama + ggml backend; if numa is true, use NUMA optimizations. Call once at the start of the program.
static int llamaModelQuantize(String sourceModelFilePath, String outputModelFilePath, LlamaModelQuantizeParams params)
    Quantize the model.
static int llamaModelQuantize(String sourceModelFilePath, String outputModelFilePath, ModelFileType modelFileType)
    Quantize the model.
static boolean loadLlamaGrammar(String grammarRules)
    Load llama grammar by rules.
static void loadLlamaModelFromFile(String modelPath, LlamaModelParams params)
    Load Llama model from file.
static int loadLoraModelFromFile(String loraPath, float loraScale, String baseModelPath, int threads)
    Apply a LoRA adapter to a loaded model.
static void release()
    Close model and release all resources.
static int sampling(float[] logits, int[] lastTokens, int lastTokensSize, float penalty, float alphaFrequency, float alphaPresence, boolean penalizeNL, int mirostatMode, float mirostatTAU, float mirostatETA, float temperature, int topK, float topP, float tsf, float typical, float minP, int sequenceId, int pastTokenSize)
    Inference sampling of the next token.
static int tokenize(byte[] buf, int bufferLength, int[] tokens, int maxTokens, boolean addBos, boolean specialTokens)
    Convert the provided text into tokens.
static int[] tokenize(String text, boolean addBos, boolean specialTokens)
    Convert the provided text into tokens.
static int tokenToPiece(int token, byte[] buf, int bufferLength)
    Convert the token id to a text piece.
-
Constructor Details
-
LlamaService
public LlamaService()
-
Method Details
-
initNative
public static void initNative()
Initialize the JNI context.
-
getLlamaModelDefaultParams
public static LlamaModelParams getLlamaModelDefaultParams()
Get llama model default params.
Returns:
    LlamaModelParams
-
getLlamaContextDefaultParams
public static LlamaContextParams getLlamaContextDefaultParams()
Get llama context default params.
Returns:
    LlamaContextParams
-
getLlamaModelQuantizeDefaultParams
public static LlamaModelQuantizeParams getLlamaModelQuantizeDefaultParams()
Get llama model quantize default params.
Returns:
    LlamaModelQuantizeParams
-
llamaBackendInit
public static void llamaBackendInit(boolean numa)
Initialize the llama + ggml backend; if numa is true, use NUMA optimizations. Call once at the start of the program.
Parameters:
    numa - Use NUMA optimizations.
-
llamaBackendFree
public static void llamaBackendFree()
Call once at the end of the program. NOTE: currently only used for MPI.
-
loadLlamaModelFromFile
public static void loadLlamaModelFromFile(String modelPath, LlamaModelParams params) throws ModelException
Load Llama model from file.
Parameters:
    modelPath - Llama model file path.
    params - Llama model params.
Throws:
    ModelException
See Also:
-
createNewContextWithModel
public static void createNewContextWithModel(LlamaContextParams params) throws ModelException
Create new context with model.
Parameters:
    params - Llama context params.
Throws:
    ModelException
See Also:
-
release
public static void release()
Close model and release all resources.
-
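Taken together, the backend, model, context, and release methods above imply a fixed lifecycle. The fragment below is a minimal sketch of one plausible ordering against the documented signatures; it is not a compiled example, the model path is a placeholder, and the imports for the parameter beans (LlamaModelParams, LlamaContextParams) are omitted because their packages are not shown on this page.

```java
import chat.octet.model.LlamaService;

public class LifecycleSketch {
    public static void main(String[] args) throws Exception {
        // One-time backend setup (numa = false here).
        LlamaService.llamaBackendInit(false);

        // Start from the documented defaults, then load the model
        // and create its context.
        LlamaModelParams modelParams = LlamaService.getLlamaModelDefaultParams();
        LlamaService.loadLlamaModelFromFile("/path/to/model.gguf", modelParams);

        LlamaContextParams ctxParams = LlamaService.getLlamaContextDefaultParams();
        LlamaService.createNewContextWithModel(ctxParams);

        // ... tokenize / batchDecode / sampling happen here ...

        // Tear down in reverse order.
        LlamaService.release();
        LlamaService.llamaBackendFree();
    }
}
```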
isMmapSupported
public static boolean isMmapSupported()
Check whether MMAP is supported.
Returns:
    boolean
-
isMlockSupported
public static boolean isMlockSupported()
Check whether MLOCK is supported.
Returns:
    boolean
-
getVocabSize
public static int getVocabSize()
Get model vocab size.
Returns:
    int
-
getContextSize
public static int getContextSize()
Get model context size.
Returns:
    int
-
loadLoraModelFromFile
public static int loadLoraModelFromFile(String loraPath, float loraScale, String baseModelPath, int threads) throws ModelException
Apply a LoRA adapter to a loaded model. baseModelPath is the path to a higher-quality model to use as a base for the layers modified by the adapter; it can be NULL to use the currently loaded model. The model needs to be reloaded before applying a new adapter, otherwise the adapter will be applied on top of the previous one.
Parameters:
    loraPath - LoRA adapter file path.
    loraScale - LoRA scale.
    baseModelPath - Base model file path.
    threads - Thread number.
Returns:
    int - Returns 0 on success, non-zero on failure.
Throws:
    ModelException
-
getLogits
public static float[] getLogits(int index)
Get logits by index; the default index is 0.
Parameters:
    index - Index.
Returns:
    float[] - Returns a one-dimensional float array.
-
getEmbedding
public static float[] getEmbedding()
Get embedding.
Returns:
    float[] - Returns the embedding float array.
-
getTokenType
public static int getTokenType(int token)
Get token type code.
Parameters:
    token - Token id.
Returns:
    int
-
getTokenBOS
public static int getTokenBOS()
Get special BOS token.
Returns:
    int - Returns the token id.
-
getTokenEOS
public static int getTokenEOS()
Get special EOS token.
Returns:
    int - Returns the token id.
-
tokenize
public static int tokenize(byte[] buf, int bufferLength, int[] tokens, int maxTokens, boolean addBos, boolean specialTokens)
Convert the provided text into tokens. The tokens array must be large enough to hold the resulting tokens. Returns the number of tokens on success, no more than maxTokens.
Parameters:
    buf - Text byte buffer.
    bufferLength - Text byte buffer length.
    tokens - Empty token array, used to receive the returned tokens.
    maxTokens - Max token count; by default the context size.
    addBos - Add special BOS token.
    specialTokens - Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Does not insert a leading space.
Returns:
    int - Returns a negative number on failure, else the number of tokens returned.
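Because this overload writes into a caller-supplied array, the caller must size `tokens` up front and trim afterwards. A minimal calling pattern, assuming UTF-8 input and using `getContextSize()` as the documented default cap (a sketch against the documented signatures, not a compiled example):

```java
byte[] buf = "Hello, llama".getBytes(java.nio.charset.StandardCharsets.UTF_8);
int[] tokens = new int[LlamaService.getContextSize()]; // receive buffer, capped at context size
int count = LlamaService.tokenize(buf, buf.length, tokens, tokens.length,
        true,   // addBos: prepend the special BOS token
        false); // specialTokens: treat special/control tokens as plaintext
if (count < 0) {
    throw new IllegalStateException("tokenization failed: " + count);
}
// Trim the receive buffer down to the actual token count.
int[] result = java.util.Arrays.copyOf(tokens, count);
```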
-
tokenToPiece
public static int tokenToPiece(int token, byte[] buf, int bufferLength)
Convert the token id to a text piece.
Parameters:
    token - Token id.
    buf - Byte buffer that receives the piece.
    bufferLength - Byte buffer length.
Returns:
    int - Returns the byte length of the piece.
-
getSamplingMetrics
public static Metrics getSamplingMetrics(boolean reset)
Get sampling metrics.
Parameters:
    reset - Reset the counter when finished.
Returns:
    Metrics
See Also:
-
getSystemInfo
public static String getSystemInfo()
Get system parameter information.
Returns:
    String
-
sampling
public static int sampling(float[] logits, int[] lastTokens, int lastTokensSize, float penalty, float alphaFrequency, float alphaPresence, boolean penalizeNL, int mirostatMode, float mirostatTAU, float mirostatETA, float temperature, int topK, float topP, float tsf, float typical, float minP, int sequenceId, int pastTokenSize) throws DecodeException
Inference sampling of the next token.
Parameters:
    logits - User-defined logits; adjustments can be made via LogitsProcessor.
    lastTokens - Last token array.
    lastTokensSize - Last token array size.
    penalty - Control the repetition of token sequences in the generated text.
    alphaFrequency - Repeat alpha frequency penalty.
    alphaPresence - Repeat alpha presence penalty.
    penalizeNL - Disable penalization for newline tokens when applying the repeat penalty.
    mirostatMode - Mirostat sampling: use Mirostat sampling, controlling perplexity during text generation.
    mirostatTAU - Mirostat sampling: set the Mirostat target entropy.
    mirostatETA - Mirostat sampling: set the Mirostat learning rate.
    temperature - Adjust the randomness of the generated text.
    topK - TOP-K sampling: limit the next token selection to the K most probable tokens.
    topP - TOP-P sampling: limit the next token selection to a subset of tokens with a cumulative probability above a threshold P.
    tsf - Tail Free Sampling (TFS): enable tail free sampling with parameter z.
    typical - Typical sampling: enable typical sampling with parameter p.
    minP - MIN-P sampling: set a minimum base probability threshold for token selection.
    sequenceId - Generation sequence id.
    pastTokenSize - Past token size.
Returns:
    int - Returns the sampled token id.
Throws:
    DecodeException
See Also:
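batchDecode and sampling are designed to alternate: decode the prompt once, then sample token by token. The fragment below is a hedged sketch of that loop using the documented signatures; every numeric sampling value (repeat penalty 1.1, temperature 0.8, top-k 40, top-p 0.9, Mirostat off, TFS/typical neutral, min-p 0.05) is illustrative rather than taken from this page, the code is assumed to run inside a method declared `throws Exception`, and whether sampling() itself advances the sequence is an assumption marked in a comment. For brevity, lastTokens is just the prompt; a real loop would append each sampled token to the history.

```java
int sequenceId = 1; // arbitrary unique id for this generation
int[] prompt = LlamaService.tokenize("Once upon a time", true, false);
int pastTokenSize = 0;

// Decode the whole prompt in one batch.
if (LlamaService.batchDecode(sequenceId, prompt, prompt.length, pastTokenSize) != 0) {
    throw new IllegalStateException("prompt decode failed");
}
pastTokenSize += prompt.length;

int eos = LlamaService.getTokenEOS();
StringBuilder out = new StringBuilder();
byte[] piece = new byte[64];

for (int i = 0; i < 128; i++) {
    float[] logits = LlamaService.getLogits(0);
    int token = LlamaService.sampling(logits, prompt, prompt.length,
            1.1f, 0.0f, 0.0f, false, // penalty, alphaFrequency, alphaPresence, penalizeNL
            0, 5.0f, 0.1f,           // mirostatMode (off), mirostatTAU, mirostatETA
            0.8f, 40, 0.9f,          // temperature, topK, topP
            1.0f, 1.0f, 0.05f,       // tsf, typical, minP
            sequenceId, pastTokenSize);
    if (token == eos) break;
    int len = LlamaService.tokenToPiece(token, piece, piece.length);
    out.append(new String(piece, 0, len, java.nio.charset.StandardCharsets.UTF_8));
    pastTokenSize++; // assumes sampling() decodes the sampled token into the sequence
}

// Free the K-V cache for this sequence when done.
LlamaService.clearCache(sequenceId);
```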
-
loadLlamaGrammar
public static boolean loadLlamaGrammar(String grammarRules)
Load llama grammar by rules.
Parameters:
    grammarRules - Grammar rules.
Returns:
    boolean - Returns true on success, false on failure.
-
batchDecode
public static int batchDecode(int sequenceId, int[] tokens, int inputLength, int pastTokenSize)
Batch decoding.
Parameters:
    sequenceId - Specify a unique generation sequence id.
    tokens - Array of tokens to be decoded.
    inputLength - Input context length.
    pastTokenSize - Past token size.
Returns:
    int - Returns 0 on success, non-zero on failure.
-
clearCache
public static void clearCache(int sequenceId, int posStart, int posEnd)
Clear cache in K-V sequences.
Parameters:
    sequenceId - Generation sequence id.
    posStart - Start position.
    posEnd - End position.
-
clearCache
public static void clearCache(int sequenceId)
Clear cache in K-V sequences.
Parameters:
    sequenceId - Generation sequence id.
-
llamaModelQuantize
public static int llamaModelQuantize(String sourceModelFilePath, String outputModelFilePath, LlamaModelQuantizeParams params)
Quantize the model.
Parameters:
    sourceModelFilePath - Source model file path.
    outputModelFilePath - Output model file path.
    params - Quantize parameters.
Returns:
    int - Returns 0 on success, non-zero on failure.
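Quantization is a one-shot call that reads one model file and writes another. A minimal sketch against the documented signature, assuming placeholder file paths and starting from the documented default parameters:

```java
LlamaModelQuantizeParams params = LlamaService.getLlamaModelQuantizeDefaultParams();
int status = LlamaService.llamaModelQuantize(
        "/path/to/model-f16.gguf", // source model (placeholder path)
        "/path/to/model-q4.gguf",  // quantized output (placeholder path)
        params);
if (status != 0) {
    throw new IllegalStateException("quantization failed: " + status);
}
```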
-
llamaModelQuantize
public static int llamaModelQuantize(String sourceModelFilePath, String outputModelFilePath, ModelFileType modelFileType)
Quantize the model.
Parameters:
    sourceModelFilePath - Source model file path.
    outputModelFilePath - Output model file path.
    modelFileType - Model file type.
Returns:
    int - Returns 0 on success, non-zero on failure.
See Also:
-
tokenize
public static int[] tokenize(String text, boolean addBos, boolean specialTokens)
Convert the provided text into tokens.
Parameters:
    text - Input text.
    addBos - Add special BOS token.
    specialTokens - Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Does not insert a leading space.
Returns:
    int[] - Returns the resulting token array.
-
getLlamaTokenType
public static LlamaTokenType getLlamaTokenType(int token)
Get the token type definition.
Parameters:
    token - Token id.
Returns:
    LlamaTokenType
See Also:
-