AidVoice C++ API Documentation
Important things to know when developing with the AidVoice SDK for C++:
- Include the header file at compile time: /usr/local/include/aidlux/aidvoice/aidvoice_speech.hpp
- Link the library file at build time: /usr/local/lib/libaidvoice_speech.so
Feature Type.enum FeatureType
FeatureType is used to specify the core feature module when initializing the AidVoice SDK. Since the SDK includes multiple speech features, you must use this enum to clearly define which feature you want when creating a feature instance (object). The SDK currently supports automatic speech recognition (ASR) and text-to-speech (TTS). More speech features will be added in future releases.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_DEFAULT | uint8_t | 0 | Invalid data type |
| TYPE_ASR | uint8_t | 1 | Automatic speech recognition |
| TYPE_TTS | uint8_t | 2 | Text to speech |
Audio Type.enum AudioType
When running an ASR task, you need to define the encoding format and sampling properties of the input or output audio. By setting this enum, the SDK can correctly parse audio stream data or generate audio files in the required format. This enum is mainly used to identify the audio type for ASR input. The TTS output format is currently fixed by the engine.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_DEFAULT | uint8_t | 0 | Invalid data type |
| TYPE_WAV | uint8_t | 1 | WAV audio |
| TYPE_PCM | uint8_t | 2 | PCM audio |
Important
To ensure ASR accuracy and system stability, the raw audio stream sent to the SDK must strictly follow these requirements:
- Sample rate: fixed at 16 kHz.
- Channel configuration: mono only.
- Bit depth: 16-bit signed.
The TTS module currently outputs WAV audio only. The audio enum values here are mainly used by the ASR module to identify different input audio types.
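As a rough illustration of these requirements, the sketch below computes the duration of a raw PCM buffer and converts 16-bit samples to float. The helper names `pcm_bytes_to_ms` and `pcm16_to_float` are hypothetical and are not part of the AidVoice SDK. Scaling by 1/32768 is a common convention that is assumed here, not something this document confirms for the float input interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Constants taken from the audio requirements listed above.
constexpr std::uint32_t kSampleRateHz = 16000;  // 16 kHz
constexpr std::uint32_t kChannels = 1;          // mono
constexpr std::uint32_t kBytesPerSample = 2;    // 16-bit signed

// Duration in milliseconds of a raw PCM buffer of byte_len bytes.
inline std::uint64_t pcm_bytes_to_ms(std::size_t byte_len)
{
    const std::uint64_t bytes_per_second =
        static_cast<std::uint64_t>(kSampleRateHz) * kChannels * kBytesPerSample;  // 32000
    return static_cast<std::uint64_t>(byte_len) * 1000 / bytes_per_second;
}

// Convert 16-bit signed samples to floats in [-1.0, 1.0).
// Assumption: division by 32768 is the expected normalization.
inline std::vector<float> pcm16_to_float(const std::vector<std::int16_t> &pcm)
{
    std::vector<float> out;
    out.reserve(pcm.size());
    for (std::int16_t s : pcm)
        out.push_back(static_cast<float>(s) / 32768.0f);
    return out;
}
```

One second of audio in this format is 32000 bytes, which is a convenient figure when sizing buffers.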
Log Level.enum LogLevel
AidVoice SDK provides logging-related interfaces, which are introduced later in this document. If you need to specify the current logging level used by AidVoice SDK, use this enum.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_INFO | uint8_t | 0 | Information |
| TYPE_WARNING | uint8_t | 1 | Warning |
| TYPE_ERROR | uint8_t | 2 | Error |
| TYPE_FATAL | uint8_t | 3 | Fatal error |
| TYPE_DEBUG | uint8_t | 4 | Debug |
| TYPE_OFF | uint8_t | 5 | Disabled |
Return Status.enum ResultStatus
ResultStatus defines the return status for all SDK operations. By checking this enum value, you can tell whether the current operation completed successfully. If the return value indicates failure, you can troubleshoot the issue based on the detailed error information.
| Member Name | Type | Value | Description |
|---|---|---|---|
| AV_OK | uint8_t | 0 | Success |
| AV_ERR_INVALID_ARG | uint8_t | 1 | Invalid argument |
| AV_ERR_LOAD_FILE | uint8_t | 2 | Failed to load file |
| AV_ERR_RUN_FAIL | uint8_t | 3 | Runtime error |
| AV_ERR_UNSUPPORTED | uint8_t | 4 | Operation not supported |
| AV_ERR_GENERATE_OBJECT | uint8_t | 5 | Failed to create object |
| AV_OTHER | uint8_t | 6 | Other |
Device Information.struct DeviceInfo
The DeviceInfo struct describes the NPU information of the current device.
Member List
| Member | Type | Default | Description |
|---|---|---|---|
| version | uint32_t | 0X00020301 | Version of the current device information |
| id | uint32_t | 0 | ID of the current device |
| type | uint32_t | 0 | Type of the current device |
| cores_num | uint32_t | 1 | Number of cores in the current device information (NPU cores) |
| cores_id | std::vector<uint32_t> | No default value | IDs of all cores in the current device information |
Important
- The current SDK device information class is only used to query NPU information on the device. For CPU, GPU, and other hardware information, please use other methods.
Global Configuration.class FeatureConfig
The FeatureConfig structure stores all configuration information required to build a specific feature object. Before initializing an SDK instance, you need to create this structure and set the feature type, model selection, and logging level based on your business needs.
Member List
FeatureConfig includes the following parameters:
| Member | Type | Default | Description |
|---|---|---|---|
| feature_type | FeatureType | No default value | Specifies the SDK feature mode, such as speech recognition or speech synthesis |
| model_path | std::string | Empty string | Specifies the path of the selected model |
| log_type | LogLevel | TYPE_OFF | Specifies the log level |
| custom_device_info | std::vector<DeviceInfo> | Empty array | List of NPU device information |
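Because feature_type has no default and model_path defaults to an empty string, it can help to check a configuration before handing it to the SDK. The sketch below uses local stand-in types that mirror the tables in this document; the real definitions come from aidvoice_speech.hpp and may differ in detail. The helper `config_problem` and the model path used in the usage note are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-ins mirroring the documented enums and structs (assumed scoped enums).
enum class FeatureType : std::uint8_t { TYPE_DEFAULT = 0, TYPE_ASR = 1, TYPE_TTS = 2 };
enum class LogLevel : std::uint8_t
{
    TYPE_INFO = 0, TYPE_WARNING = 1, TYPE_ERROR = 2,
    TYPE_FATAL = 3, TYPE_DEBUG = 4, TYPE_OFF = 5
};

struct DeviceInfo
{
    std::uint32_t version = 0x00020301;
    std::uint32_t id = 0;
    std::uint32_t type = 0;
    std::uint32_t cores_num = 1;
    std::vector<std::uint32_t> cores_id;
};

struct FeatureConfig
{
    // The document lists no default; the stand-in starts at the invalid value.
    FeatureType feature_type = FeatureType::TYPE_DEFAULT;
    std::string model_path;
    LogLevel log_type = LogLevel::TYPE_OFF;
    std::vector<DeviceInfo> custom_device_info;
};

// Hypothetical helper: returns an empty string if the configuration looks
// usable, otherwise a short description of the first problem found.
inline std::string config_problem(const FeatureConfig &cfg)
{
    if (cfg.feature_type == FeatureType::TYPE_DEFAULT)
        return "feature_type is TYPE_DEFAULT (invalid)";
    if (cfg.model_path.empty())
        return "model_path is empty";
    return std::string();
}
```

A typical use would be to set `cfg.feature_type = FeatureType::TYPE_ASR` and `cfg.model_path = "models/example"` (a placeholder path), then call `config_problem(cfg)` and abort early if the returned string is not empty.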
ASR Interfaces
This section describes the core API interfaces in AidVoice SDK for automatic speech recognition (ASR). With these interfaces, you can complete the full ASR workflow, from creating an ASR instance and sending audio data to retrieving recognition results.
Feature overview: The ASR module converts 16 kHz, mono, 16-bit raw audio streams into text in real time or offline. It currently supports the following inference models: senseVoice_small, whisper_tiny, whisper_base, whisper_base_en, whisper_small, and whisper_medium.
ASR Mode.enum ASRMode
ASRMode defines how ASR results are returned. Depending on your real-time requirements, you can choose streaming mode for incremental feedback or non-streaming mode for complete sentence output.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_STREAM | uint8_t | 0 | Streaming output: can return intermediate transcription results |
| TYPE_NOSTREAM | uint8_t | 1 | Non-streaming output: returns the final transcription result for each processing step |
Important
Notes:
- Streaming: temporary transcription results are generated in real time during audio processing. As more audio is fed in, the SDK continues to revise and update the intermediate text. This mode can return data before the full audio buffer is processed, which reduces first-token latency and improves the interaction experience.
- Non-streaming: transcription is performed based on a fixed audio duration. If speech end is detected early, such as when the user stops input, or if the speech duration is shorter than expected, the system immediately returns the final transcription result.
Important
Different inference models have strict limits on the audio length for a single processing task. When configuring non-streaming transcription or preparing audio segments, keep the following limits in mind:
- Whisper models: the maximum audio length for a single input is 24 seconds.
- SenseVoice models: the maximum audio length for a single input is 15 seconds.
- If the audio sent in one request exceeds these limits, the SDK truncates the input audio. The remaining audio after truncation automatically starts a new transcription task.
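If you would rather choose the cut points yourself than rely on automatic truncation, you can split long audio before sending it. The sketch below splits a buffer of 16 kHz samples into pieces no longer than a given number of seconds (24 for Whisper, 15 for SenseVoice). The function name `split_by_max_seconds` is hypothetical, and the sketch cuts at fixed sample counts without looking for pauses in speech.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kAsrSampleRateHz = 16000;  // required ASR input rate

// Split samples into consecutive chunks of at most max_seconds each.
inline std::vector<std::vector<float>> split_by_max_seconds(
    const std::vector<float> &samples, std::size_t max_seconds)
{
    std::vector<std::vector<float>> chunks;
    if (max_seconds == 0)
        return chunks;  // nothing sensible to do with a zero limit

    const std::size_t max_samples = max_seconds * kAsrSampleRateHz;
    for (std::size_t pos = 0; pos < samples.size(); pos += max_samples)
    {
        const std::size_t end = std::min(pos + max_samples, samples.size());
        chunks.emplace_back(samples.begin() + static_cast<std::ptrdiff_t>(pos),
                            samples.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return chunks;
}
```

For example, 40 seconds of audio with a 24-second limit yields two chunks of 384000 and 256000 samples.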
Speech Transcription Status.enum AsrStatus
AsrStatus indicates the state of the ASR text result in the current recognition round. By checking this status, you can tell whether the current text is an intermediate transcription that may still change or the final confirmed result.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_PARTIAL | uint8_t | 0 | Intermediate transcription, recognition not finished |
| TYPE_FINAL | uint8_t | 1 | Final transcription result, recognition finished |
Important
Notes:
- Streaming mode (TYPE_STREAM): when the result status is PARTIAL, it means the SDK is returning an intermediate result before the current audio buffer has been fully processed. Only when the status is FINAL does it mean processing of the current buffer is complete.
- Non-streaming mode (TYPE_NOSTREAM): in non-streaming mode, the status of each transcription result is always FINAL.
// Streaming mode:
I am. TYPE_PARTIAL
I am a boy. TYPE_PARTIAL
I am a boy. I like Aplux. TYPE_FINAL
// Non-streaming mode:
I am a boy. I like Aplux. TYPE_FINAL
ASR Result.class AsrResult
The AsrResult structure carries the ASR transcription result and its status. When the SDK finishes processing a segment of audio, it returns the recognized text together with its status, either intermediate or final, in this structure.
Member List
AsrResult includes the following parameters:
| Member | Type | Default | Description |
|---|---|---|---|
| status | AsrStatus | No default value | Transcription status of the current result |
| text | std::string | No default value | Current transcription text |
| id | int | 0 | Result ID |
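One way to consume these results is to keep the latest PARTIAL text as a provisional line and commit text only when a FINAL result arrives. The sketch below follows that idea. `AsrStatus` and `AsrResult` are local stand-ins mirroring the tables in this document, and `TranscriptAccumulator` is a hypothetical application-side class, not part of the SDK. It assumes each PARTIAL result replaces the previous one, as in the streaming example shown earlier.

```cpp
#include <cstdint>
#include <string>

// Stand-ins mirroring the documented types.
enum class AsrStatus : std::uint8_t { TYPE_PARTIAL = 0, TYPE_FINAL = 1 };
struct AsrResult
{
    AsrStatus status;
    std::string text;
    int id;
};

// Hypothetical helper that tracks committed and provisional text.
class TranscriptAccumulator
{
public:
    void on_result(const AsrResult &r)
    {
        if (r.status == AsrStatus::TYPE_FINAL)
        {
            // Final text is appended and the provisional line is cleared.
            if (!committed_.empty() && !r.text.empty())
                committed_ += ' ';
            committed_ += r.text;
            pending_.clear();
        }
        else
        {
            // A newer partial result supersedes the previous one.
            pending_ = r.text;
        }
    }

    // Text to show right now: committed text plus any provisional tail.
    std::string display() const
    {
        if (pending_.empty())
            return committed_;
        if (committed_.empty())
            return pending_;
        return committed_ + ' ' + pending_;
    }

    const std::string &committed() const { return committed_; }

private:
    std::string committed_;
    std::string pending_;
};
```

Calling `on_result` from your onResult implementation keeps the display logic out of the callback itself.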
ASR Error.class AsrError
The AsrError structure carries ASR error information. When an interface returns a non-success status, you can use this structure to get the specific error code and a detailed message.
Member List
AsrError includes the following parameters:
| Member | Type | Default | Description |
|---|---|---|---|
| status | ResultStatus | No default value | Error status |
| error_code | int | No default value | Error code |
| message | std::string | No default value | Error message |
ASR Callback Interface.class ASRCallbacks
ASRCallbacks is a base class with virtual functions that defines the listener interface used by the SDK to push data to the application layer. You need to inherit from this class and implement its virtual functions so that you can asynchronously receive recognition results or error information.
Get Transcription Result.onResult()
This callback is triggered automatically when the ASR engine finishes processing a segment of audio and generates text.
| API | onResult |
| Description | Speech recognition result callback |
| Parameters | result: transcription result object that contains the current recognized text and result status |
| Returns | void |
| API | onError |
| Description | Error callback. Used to receive and handle different exceptions that occur while ASR is running |
| Parameters | error: error information object that contains the current error message, error code, and related details |
| Returns | void |
| API | onStop |
| Description | Stop callback |
| Parameters | result: after stop() is called, the current task stops accepting new input and returns any remaining result that is still being processed through this callback |
| Returns | void |
class ASRCallbacksImpl : public ASRCallbacks
{
public:
    // Called when the engine has produced text for a segment of audio.
    void onResult(const AsrResult &result) override
    {
        std::string asrResult = result.text;
        int sid = result.id;
        AsrStatus status = result.status;
        printf("============callback result ===============\n");
        std::cout << "sid: " << sid << std::endl;
        std::cout << "asrResult: \n"
                  << asrResult << std::endl;
        std::cout << "status: " << (int)status << std::endl;
        printf("===========================================\n\n");
        // total_echo is a variable defined elsewhere in the sample application.
        total_echo = sid;
    }
    // Called when an error occurs while ASR is running.
    void onError(const AsrError &error) override
    {
        int errCode = error.error_code;
        int errStatus = (int)error.status;
        std::string errMsg = error.message;
        printf("============error callback=================\n");
        std::cout << "errCode: " << errCode << std::endl;
        std::cout << "errStatus: " << errStatus << std::endl;
        std::cout << "errMsg: " << errMsg << std::endl;
        printf("===========================================\n\n");
    }
    // Called after stop() with the result that was still being processed.
    void onStop(const AsrResult &result) override
    {
        std::string asrResult = result.text;
        int sid = result.id;
        AsrStatus status = result.status;
        printf("============stop result ===============\n");
        std::cout << "sid: " << sid << std::endl;
        std::cout << "asrResult: \n"
                  << asrResult << std::endl;
        std::cout << "status: " << (int)status << std::endl;
        printf("===========================================\n");
    }
    ~ASRCallbacksImpl() = default;
};
ASR Core Class.class AidVoiceASR
AidVoiceASR is the main ASR feature class in the SDK. It manages the full lifecycle of speech recognition. You use the interfaces provided by this class to load models, push audio data, and stop recognition tasks. This class must be initialized together with FeatureConfig.
Create Instance.create_asr()
This is the first interface you call when using the SDK. It creates and initializes a specific ASR object in memory based on the global configuration, such as feature type and model type.
| API | create_asr |
| Description | Builds a specific ASR instance from the configuration object |
| Parameters | cfg: global configuration used to specify the model type and log level |
| Returns | Returns an AidVoiceASR instance pointer on success. Returns nullptr on failure |
Important
Notes:
- Before calling this interface, make sure cfg.feature_type is set to TYPE_ASR.
Set Mode.set_mode()
This interface sets the working mode of ASR. Depending on your application needs, such as real-time speech interaction or offline long-form transcription, you can set it to streaming mode or non-streaming mode.
| API | set_mode |
| Description | Sets the ASR recognition mode |
| Parameters | mode: the recognition mode to use |
| Returns | void |
Set Callback.set_callback()
This interface registers a user-implemented callback listener instance with ASR. Once registered, the SDK uses this instance to asynchronously return transcription results through onResult and error information through onError.
| API | set_callback |
| Description | Registers a callback listener object used to receive asynchronous recognition results |
| Parameters | cb: pointer to an instance of a user-defined ASRCallbacks implementation |
| Returns | void |
// The callback object must be allocated on the heap.
// It will be released by AidVoice internally.
ASRCallbacksImpl *mASRCallbacks = new ASRCallbacksImpl();
// After registration, ownership of the object is transferred to AidVoice.
asr->set_callback(mASRCallbacks);
Important
Notes:
- The callback instance must be allocated on the heap with new. Once set_callback is called, the pointer lifetime is managed by AidVoice internally. The SDK automatically deletes it when the ASR instance is destroyed.
Enable Special Token Output.set_special_tokens()
This interface controls whether special tokens are included in the output, such as the model start token, end token, and other tokens with special meanings. If enabled, the callback also returns these special tokens.
| API | set_special_tokens |
| Description | Controls whether the model's special token characters are returned |
| Parameters | is_add: whether to include them. The default value is false |
| Returns | void |
Set Maximum Audio Processing Duration.set_echo_ms()
This interface sets the maximum audio length, in milliseconds, that a single ASR inference task can accept.
| API | set_echo_ms |
| Description | Sets the audio duration threshold for a single ASR inference task |
| Parameters | echo_ms: duration threshold for a single processing task |
| Returns | void |
Important
When setting this value, you must follow the per-model processing limits:
- Whisper models: echo_ms must not exceed 24000 (24 s).
- SenseVoice models: echo_ms must not exceed 15000 (15 s).
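A small guard can keep a requested value within these limits before it is passed to set_echo_ms(). In the sketch below, `ModelFamily`, `max_echo_ms`, and `clamp_echo_ms` are hypothetical names introduced for illustration, and the use of an unsigned 32-bit integer for the duration is an assumption, since this document does not state the parameter type.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical tag for the two model families described above.
enum class ModelFamily { Whisper, SenseVoice };

// Upper bound for echo_ms, taken from the limits listed above.
inline std::uint32_t max_echo_ms(ModelFamily family)
{
    return family == ModelFamily::Whisper ? 24000u : 15000u;
}

// Returns the requested value, reduced to the model limit if necessary.
inline std::uint32_t clamp_echo_ms(ModelFamily family, std::uint32_t requested_ms)
{
    return std::min(requested_ms, max_echo_ms(family));
}
```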
Set Streaming Feedback Interval.set_step_ms()
This interface is designed for streaming mode (TYPE_STREAM). It sets how often ASR returns intermediate transcription results with PARTIAL status.
| API | set_step_ms |
| Description | Sets the callback interval for streaming transcription results in milliseconds. A callback is triggered each time the specified amount of audio has been processed |
| Parameters | step_ms: time step for result feedback |
| Returns | void |
Important
Notes:
- This setting only takes effect in streaming mode. In non-streaming mode, the system ignores this setting and returns the final result only after recognition is complete.
- A smaller step_ms gives better real-time feedback. In live microphone input scenarios, setting step_ms too small may make the output less continuous. To preserve real-time responsiveness, the SDK uses an overwrite-based buffer strategy. If the model cannot process audio as fast as it is being fed in, older unprocessed data may be overwritten by newly received audio, which can lead to broken recognition results.
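To make the overwrite behavior concrete, the sketch below shows a generic fixed-capacity buffer that discards its oldest samples when the producer outpaces the consumer. It illustrates the general technique only and is not the SDK's internal implementation; the class name `OverwriteRing` and its methods are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Fixed-capacity buffer: when full, a new sample replaces the oldest one.
class OverwriteRing
{
public:
    explicit OverwriteRing(std::size_t capacity)
        : buf_(capacity), head_(0), size_(0), dropped_(0) {}

    void push(float sample)
    {
        if (buf_.empty())
        {
            ++dropped_;  // zero capacity: every sample is lost
            return;
        }
        if (size_ == buf_.size())
        {
            buf_[head_] = sample;               // overwrite the oldest sample
            head_ = (head_ + 1) % buf_.size();  // next-oldest becomes the head
            ++dropped_;
        }
        else
        {
            buf_[(head_ + size_) % buf_.size()] = sample;
            ++size_;
        }
    }

    // Remove and return everything currently stored, oldest first.
    std::vector<float> drain()
    {
        std::vector<float> out;
        out.reserve(size_);
        for (std::size_t i = 0; i < size_; ++i)
            out.push_back(buf_[(head_ + i) % buf_.size()]);
        head_ = 0;
        size_ = 0;
        return out;
    }

    std::size_t dropped() const { return dropped_; }

private:
    std::vector<float> buf_;
    std::size_t head_;
    std::size_t size_;
    std::size_t dropped_;
};
```

With a capacity of 3, pushing five samples before draining leaves only the last three, which is the kind of gap that shows up as broken recognition output.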
Save Input Audio.set_save_audio()
This interface is mainly used for live microphone input. When enabled, the SDK automatically captures the raw audio stream received from the microphone and saves it locally in WAV format.
| API | set_save_audio |
| Description | Controls whether raw microphone input audio is saved locally |
| Parameters | save_audio: boolean value. true enables saving. false disables it, which is the default |
| Returns | void |
Get Device NPU Information.get_device_info()
This interface is used to query NPU information on the current device.
| API | get_device_info |
| Description | Queries NPU information on the current device |
| Parameters | device_info_list |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
Important
Notes:
- You must call this interface after initialization has completed to get the current device NPU information.
auto asr = AidLux::AidVoice::create_asr(cfg);
asr->init();
std::vector<DeviceInfo> device_info;
asr->get_device_info(device_info);
Bind an NPU Device for Execution.set_device_info()
This interface configures the NPU binding information used by the SDK. By passing the target NPU device settings, you can bind SDK inference tasks to the selected NPU so that the model runs on the intended hardware.
| API | set_device_info |
| Description | Sets the NPU device used by the SDK |
| Parameters | device_info |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
Important
Notes:
- You must set the NPU device information before SDK initialization is completed. Otherwise, the setting may not take effect after initialization.
auto asr = AidLux::AidVoice::create_asr(cfg);
DeviceInfo device01;
device01.id = 0;
device01.type = 0;
device01.cores_num = 1;
device01.cores_id = {0};
asr->set_device_info(device01);
asr->init();
Initialize.init()
After the ASR object is created, you need to run initialization steps such as environment checks and resource setup.
| API | init |
| Description | Completes the initialization work required for inference |
| Parameters | void |
| Returns | Returns 0 on successful initialization. Any non-zero value means the operation failed |
// Initialize ASR. Any non-zero return value indicates an error.
int ret = asr->init();
if (ret != EXIT_SUCCESS)
{
    printf("asr->init() failure!\n");
    return EXIT_FAILURE;
}
Data Input.write()
After init() succeeds, you can use the write() interface to send audio data for recognition. The SDK supports multiple input sources to cover different scenarios, such as file transcription and streaming audio capture.
Use an Audio File as Input
This interface directly reads a local audio file for recognition.
| API | write |
| Description | Passes the path of a 16 kHz WAV audio file. The SDK parses the file automatically and performs recognition |
| Parameters | wav_16k_file: absolute or relative path to the local audio file |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
// Use an audio file as input. Any non-zero return value indicates an error.
std::string wave_path = "audio.wav";
int ret = asr->write(wave_path);
if (ret != EXIT_SUCCESS)
{
    printf("asr->write() failure!\n");
    return EXIT_FAILURE;
}
Important
Notes:
The audio file must be mono, 16-bit WAV or PCM audio with a sample rate of 16000 Hz.
Use a Raw Byte Stream as Input
This interface accepts raw audio bytes stored in memory.
| API | write |
| Description | Pushes a raw audio byte stream to ASR |
| Parameters | data: pointer to the audio data buffer; len: length of the buffer data in bytes |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
// Use a raw byte stream as input. Any non-zero return value indicates an error.
// data_size is the number of audio bytes to send.
char *data = new char[data_size];
// ... Fill the audio data here ...
int ret = asr->write(data, data_size);
if (ret != EXIT_SUCCESS)
{
    printf("asr->write() failure!\n");
    return EXIT_FAILURE;
}
Use a float Array as Input
This interface accepts floating-point audio sample data. It is suitable for audio streams that have already been preprocessed and converted to the standard float format.
| API | write |
| Description | Pushes a float array of audio samples to ASR |
| Parameters | audio_data: float array containing audio sample points |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
// Use a float array as input. Any non-zero return value indicates an error.
std::vector<float> audio_;
// ... Fill the audio data here ...
int ret = asr->write(audio_);
if (ret != EXIT_SUCCESS)
{
    printf("asr->write() failure!\n");
    return EXIT_FAILURE;
}
Real-Time Microphone Input.audio_microphone()
This interface uses the configured microphone ID to capture audio through the microphone driver and sends the captured data directly to the ASR engine.
| API | audio_microphone |
| Description | Starts the microphone device with the specified ID and begins real-time speech recognition |
| Parameters | id: hardware device ID of the microphone. The default device is 0 |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
Important
Notes:
- This interface supports streaming mode only (TYPE_STREAM). Before calling it, you must first call set_mode(TYPE_STREAM).
- In a terminal environment, you can stop the input stream safely by capturing Ctrl + C. If the microphone device is disconnected during capture, ASR also stops input automatically.
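One common way to capture Ctrl + C in a terminal program is a SIGINT handler that sets a flag, which the application then checks before calling stop(). The sketch below shows only that pattern using the standard library. The names `install_sigint_handler` and `stop_requested` are hypothetical, and how the SDK itself reacts to SIGINT is not specified in this document.

```cpp
#include <csignal>

namespace
{
// Set by the handler; sig_atomic_t is safe to write from a signal handler.
volatile std::sig_atomic_t g_stop_requested = 0;

void handle_sigint(int)
{
    g_stop_requested = 1;
}
}  // namespace

// Install the handler. Returns false if the handler could not be registered.
inline bool install_sigint_handler()
{
    return std::signal(SIGINT, handle_sigint) != SIG_ERR;
}

// True once Ctrl + C (SIGINT) has been received.
inline bool stop_requested()
{
    return g_stop_requested != 0;
}
```

An application thread could poll `stop_requested()` and call `asr->stop()` once it returns true. The handler itself only sets a flag and does not call into the SDK.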
// After this call, the microphone device with ID 1 starts and receives speech in real time.
asr->audio_microphone(1);
Stop ASR Input.stop()
This interface notifies the ASR engine that the audio stream has ended. After it is called, ASR immediately cuts off the audio input. Only the result still being processed is returned through the onStop callback, and any remaining unprocessed data is discarded.
| API | stop |
| Description | Stops audio input |
| Parameters | void |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
Important
Notes:
- This interface is intended for interrupted output scenarios. After it is called, the internal buffer is cleared immediately and input is cut off. Only the remaining data currently being processed is returned through onStop. It is suitable for fast interruption in streaming mode.
Destroy ASR Object.asr_destroy()
When all speech recognition tasks are finished and the ASR feature is no longer needed, you must call this interface. It fully releases all related resources.
| API | asr_destroy |
| Description | Completely destroys the ASR instance and releases all related resources |
| Parameters | void |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
TTS Interfaces
This section describes the core API interfaces in AidVoice SDK for text-to-speech (TTS). With these interfaces, you can complete the full TTS workflow, from creating a TTS instance and submitting text to retrieving the generated audio.
Feature overview: The TTS module converts input text into audio files. It currently supports two mainstream inference models: melotts_chinese and melotts_english.
TTS Mode.enum TTSMode
TTSMode is used to configure how synthesized audio is returned. Based on your real-time requirements, you can choose whole-output mode or fragment-output mode.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_WHOLE | uint8_t | 0 | Whole output: the complete audio is returned in one callback after the full sentence is synthesized |
| TYPE_FRAGMENT | uint8_t | 1 | Fragment output: the text is split by punctuation or semantic pauses, and short sentences are returned as soon as they are synthesized |
Important
Notes:
- Whole output: the full input text is processed as one task. The result callback is triggered only after all audio data has been synthesized, which ensures the audio is returned as one complete output.
- Fragment output: long text is intelligently split into multiple short sentences based on punctuation and semantic pauses. Each sentence is returned as soon as its audio has been synthesized.
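To give a feel for fragment-style splitting, the sketch below cuts text after sentence-ending ASCII punctuation. It is a deliberately simplified illustration and not the SDK's splitting algorithm: it ignores semantic pauses, does not handle full-width punctuation such as that used in Chinese text, and will also split at a decimal point. The function name `split_sentences` is hypothetical.

```cpp
#include <string>
#include <vector>

// Trim leading and trailing whitespace; returns an empty string if blank.
inline std::string trim_copy(const std::string &s)
{
    const char *ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos)
        return std::string();
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Split text after '.', '!', '?' or ';' and drop empty pieces.
inline std::vector<std::string> split_sentences(const std::string &text)
{
    std::vector<std::string> out;
    std::string current;
    for (char c : text)
    {
        current += c;
        if (c == '.' || c == '!' || c == '?' || c == ';')
        {
            const std::string piece = trim_copy(current);
            if (!piece.empty())
                out.push_back(piece);
            current.clear();
        }
    }
    const std::string tail = trim_copy(current);  // text after the last mark
    if (!tail.empty())
        out.push_back(tail);
    return out;
}
```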
Synthesized Audio Status.enum TTSStatus
TTSStatus indicates the current state of the audio result returned by the TTS engine. By checking this status, you can tell whether the returned audio is a partial segment produced during the task or the final result.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_PARTIAL | uint8_t | 0 | Partial synthesized audio |
| TYPE_FINAL | uint8_t | 1 | Complete synthesized audio |
Important
Notes:
- Fragment output (TYPE_FRAGMENT): when the result status is PARTIAL, the current audio is only one intermediate segment of the full input text, such as a short sentence. Only when the status is FINAL does it mean the synthesis task for the current sentence has fully finished.
- Whole output (TYPE_WHOLE): in whole mode, the status of each synthesized audio result is always FINAL.
TTS Result.class TTSResult
The TTSResult structure carries synthesized audio data and its status. When the SDK finishes synthesizing a segment of audio, it returns the related audio information and audio status, either partial or final, in this structure.
Member List
TTSResult includes the following parameters:
| Member | Type | Default | Description |
|---|---|---|---|
| status | TTSStatus | No default value | Status of the current returned audio |
| audio_name | std::string | Empty string | File name of the current output audio |
| audio_data | std::vector<float> | No default value | Generated raw audio data. Output format: float, mono, 44100 Hz sample rate |
| audio_time | double | 0 | Duration of the current output audio in seconds |
| seq | int | 1 | Indicates which segment of the synthesis sequence this returned audio block belongs to |
| id | int | 0 | Result ID |
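If you need to store or transmit audio_data yourself, one option is to pack the float samples into a 16-bit PCM WAV byte buffer. The sketch below builds a standard 44-byte RIFF/WAVE header for mono audio at 44100 Hz, matching the output format in the table above. The function name `float_to_wav16` is hypothetical, and the sketch assumes the samples lie in [-1.0, 1.0], which this document does not state; values outside that range are clipped.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Append the low `bytes` bytes of v in little-endian order.
inline void put_le(std::vector<std::uint8_t> &out, std::uint32_t v, int bytes)
{
    for (int i = 0; i < bytes; ++i)
        out.push_back(static_cast<std::uint8_t>((v >> (8 * i)) & 0xFF));
}

// Append a four-character chunk tag such as "RIFF".
inline void put_tag(std::vector<std::uint8_t> &out, const char *tag)
{
    out.insert(out.end(), tag, tag + 4);
}

// Pack mono float samples into an in-memory 16-bit PCM WAV file.
inline std::vector<std::uint8_t> float_to_wav16(const std::vector<float> &samples,
                                                std::uint32_t sample_rate = 44100)
{
    const std::uint32_t data_bytes = static_cast<std::uint32_t>(samples.size() * 2);
    std::vector<std::uint8_t> out;
    out.reserve(44 + data_bytes);

    put_tag(out, "RIFF"); put_le(out, 36 + data_bytes, 4); put_tag(out, "WAVE");
    put_tag(out, "fmt "); put_le(out, 16, 4);   // size of the fmt chunk
    put_le(out, 1, 2);                          // format tag 1 = PCM
    put_le(out, 1, 2);                          // channels = mono
    put_le(out, sample_rate, 4);
    put_le(out, sample_rate * 2, 4);            // byte rate = rate * 2 bytes
    put_le(out, 2, 2);                          // block align
    put_le(out, 16, 2);                         // bits per sample
    put_tag(out, "data"); put_le(out, data_bytes, 4);

    for (float s : samples)
    {
        const float clipped = std::max(-1.0f, std::min(1.0f, s));
        const std::int16_t v = static_cast<std::int16_t>(std::lround(clipped * 32767.0f));
        put_le(out, static_cast<std::uint16_t>(v), 2);
    }
    return out;
}
```

The resulting bytes can be written to disk with any binary file API. The duration in seconds should equal `samples.size() / 44100.0`, which you can compare against audio_time.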
TTS Error.class TTSError
The TTSError structure carries TTS error information. When an interface returns a non-success status, you can use this structure to get the specific error code and a detailed message.
Member List
TTSError includes the following parameters:
| Member | Type | Default | Description |
|---|---|---|---|
| status | ResultStatus | ResultStatus::AV_OTHER | Error status |
| error_code | int | -1 | Error code |
| message | std::string | No default value | Error message |
TTS Callback Interface.class TTSCallbacks
TTSCallbacks is a base class with virtual functions that defines the listener interface used by the SDK to push data to the application layer. You need to inherit from this class and implement its virtual functions so that you can asynchronously receive synthesis results or error information.
Get Synthesis Result.onResult()
This callback is triggered automatically when the TTS engine finishes processing a segment of text and generates audio.
| API | onResult |
| Description | Speech synthesis result callback |
| Parameters | result: synthesis result object that contains the current synthesized audio information and status |
| Returns | void |
| API | onError |
| Description | Error callback. Used to receive and handle different exceptions that occur while TTS is running |
| Parameters | error: error information object that contains the current error message, error code, and related details |
| Returns | void |
| API | onStop |
| Description | Stop callback |
| Parameters | result: after stop() is called, the current task stops accepting new input and returns any remaining result that is still being processed through this callback |
| Returns | void |
class TTSCallbacksImpl : public TTSCallbacks
{
public:
    // Called when the engine has synthesized audio for a segment of text.
    void onResult(const TTSResult &result) override
    {
        std::string audio_name = result.audio_name;
        std::vector<float> audio_data = result.audio_data;
        double audio_time = result.audio_time;
        int seq = result.seq;
        int sid = result.id;
        TTSStatus status = result.status;
        printf("============callback result ===============\n");
        std::cout << "sid: " << sid << std::endl;
        std::cout << "audio_name:" << audio_name << std::endl;
        std::cout << "audio_data size:" << audio_data.size() << std::endl;
        std::cout << "audio_time: " << audio_time << std::endl;
        std::cout << "seq: " << seq << std::endl;
        std::cout << "status: " << (int)status << std::endl;
        printf("===========================================\n\n");
    }
    // Called when an error occurs while TTS is running.
    void onError(const TTSError &error) override
    {
        int errCode = error.error_code;
        int errStatus = (int)error.status;
        std::string errMsg = error.message;
        printf("============error callback=================\n");
        std::cout << "errCode: " << errCode << std::endl;
        std::cout << "errStatus: " << errStatus << std::endl;
        std::cout << "errMsg: " << errMsg << std::endl;
        printf("===========================================\n\n");
    }
    // Called after stop() with the result that was still being processed.
    void onStop(const TTSResult &result) override
    {
        std::string audio_name = result.audio_name;
        std::vector<float> audio_data = result.audio_data;
        double audio_time = result.audio_time;
        int seq = result.seq;
        int sid = result.id;
        TTSStatus status = result.status;
        printf("============stop result ===============\n");
        std::cout << "sid: " << sid << std::endl;
        std::cout << "audio_name:" << audio_name << std::endl;
        std::cout << "audio_data size:" << audio_data.size() << std::endl;
        std::cout << "audio_time: " << audio_time << std::endl;
        std::cout << "seq: " << seq << std::endl;
        std::cout << "status: " << (int)status << std::endl;
        printf("===========================================\n\n");
    }
    ~TTSCallbacksImpl() = default;
};
TTS Core Class.class AidVoiceTTS
AidVoiceTTS is the main TTS feature class in the SDK. It manages the full lifecycle of speech synthesis. You use the interfaces provided by this class to load models, submit text, and stop synthesis tasks. This class must be initialized together with FeatureConfig.
Create Instance.create_tts()
This is the first interface you call when using the SDK. It creates and initializes a specific TTS object in memory based on the global configuration, such as feature type and model type.
| API | create_tts |
| Description | Builds a specific TTS instance from the configuration object |
| Parameters | cfg: global configuration used to specify the model type and log level |
| Returns | Returns an AidVoiceTTS instance pointer on success. Returns nullptr on failure |
Important
Notes:
- Before calling this interface, make sure cfg.feature_type is set to TYPE_TTS.
Set Mode.set_mode()
This interface sets the working mode of TTS. Depending on your business needs, you can set it to whole-output mode or fragment-output mode. By default, TTS runs in whole mode (TYPE_WHOLE).
| API | set_mode |
| Description | Sets the working mode of TTS |
| Parameters | mode: the target working mode |
| Returns | void |
Set Callback.set_callback()
This interface registers a user-implemented callback listener instance with TTS. Once registered, the SDK uses this instance to asynchronously return synthesized audio information through onResult and error information through onError.
| API | set_callback |
| Description | Registers a callback listener object used to receive asynchronously returned synthesized audio information |
| Parameters | cb: pointer to an instance of a user-defined TTSCallbacks implementation |
| Returns | void |
// The callback object must be allocated on the heap.
// It will be released by AidVoice internally.
TTSCallbacksImpl *mTTSCallbacks = new TTSCallbacksImpl();
// After registration, ownership of the object is transferred to AidVoice.
tts->set_callback(mTTSCallbacks);
Important
Notes:
- The callback instance must be allocated on the heap with new. Once set_callback is called, the pointer lifetime is managed by AidVoice internally. The SDK automatically deletes it when the TTS instance is destroyed.
Enable Speaker Playback for Synthesized Audio.set_play_audio()
This interface is used to play the speech generated by TTS. You must specify the speaker device ID. If you do not know the correct speaker device ID, see audio device lookup.
| API | set_play_audio |
| Description | Plays the synthesized speech |
| Parameters | dev_id: audio device ID |
| Returns | void |
Important
Notes:
- Audio playback is disabled by default. To enable it, pass an audio device ID greater than 0.
Set Output Audio File Path.set_out_audio_path()
This interface sets the save path for the synthesized audio. It can be either a relative path or an absolute path. By default, no output audio path is set, so no audio file is generated.
| API | set_out_audio_path |
| Description | Save path for synthesized audio |
| Parameters | path: audio save path |
| Returns | 0: path set successfully; otherwise, setting failed |
Get Device NPU Information.get_device_info()
This interface is used to query NPU information on the current device.
| API | get_device_info |
| Description | Queries NPU information on the current device |
| Parameters | device_info_list: output list (std::vector<DeviceInfo>) that receives the NPU information |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
Important
Notes:
- You must call this interface after initialization has completed to get the current device NPU information.
auto tts = AidLux::AidVoice::create_tts(cfg);
tts->init();
std::vector<DeviceInfo> device_info;
tts->get_device_info(device_info);
Bind an NPU Device for Execution.set_device_info()
This interface configures the NPU binding information used by the SDK. By passing the target NPU device settings, you can bind SDK inference tasks to the selected NPU so that the model runs on the intended hardware.
| API | set_device_info |
| Description | Sets the NPU device used by the SDK |
| Parameters | device_info: DeviceInfo structure describing the NPU to bind |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
Important
Notes:
- You must set the NPU device information before calling init(). Settings applied after initialization may not take effect.
auto tts = AidLux::AidVoice::create_tts(cfg);
DeviceInfo device01;
device01.id = 0;
device01.type = 0;
device01.cores_num = 1;
device01.cores_id = {0};
tts->set_device_info(device01);
tts->init();
Initialize.init()
After the TTS object is created, you need to run initialization steps such as environment checks and resource setup.
| API | init |
| Description | Completes the initialization work required for inference |
| Parameters | void |
| Returns | Returns 0 on successful initialization. Any non-zero value means the operation failed |
// Initialize TTS. Any non-zero return value indicates an error.
int ret = tts->init();
if (ret != EXIT_SUCCESS)
{
printf("tts->init() failure!\n");
return EXIT_FAILURE;
}
Data Input.write()
After the TTS instance is initialized successfully (init() returns 0), you can use write() to submit text for synthesis. This interface accepts a std::vector<std::string>, so you can submit multiple independent text items in one call. The synthesized audio is returned asynchronously.
| API | write |
| Description | Submits text data to TTS. The interface accepts a std::vector<std::string> and supports submitting multiple independent text segments in one call |
| Parameters | Array of text to synthesize. Each element in the array is treated as an independent synthesis task |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
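Because each element of the input vector is treated as an independent synthesis task, an application may want to break a longer text into sentences before submitting it. The following is a minimal sketch of such a helper. split_sentences is a hypothetical function, not part of the SDK, and it only handles ASCII '.', '!' and '?', so abbreviations or decimal numbers would be split incorrectly.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits text after '.', '!' or '?' and trims surrounding whitespace.
// ASCII punctuation only; a deliberate simplification.
std::vector<std::string> split_sentences(const std::string &text) {
    std::vector<std::string> out;
    std::string current;
    auto flush = [&]() {
        const auto first = current.find_first_not_of(" \t\n");
        if (first != std::string::npos) {
            const auto last = current.find_last_not_of(" \t\n");
            out.push_back(current.substr(first, last - first + 1));
        }
        current.clear();
    };
    for (char c : text) {
        current.push_back(c);
        if (c == '.' || c == '!' || c == '?') flush();
    }
    flush();  // keep trailing text that has no terminal punctuation
    return out;
}
```

The returned vector has the same type that write() accepts, so it could be passed to write() directly. Whether sentence-level splitting improves latency or prosody depends on the model and is not something this document specifies.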
// Use a string array as input. Any non-zero return value indicates an error.
std::vector<std::string> str_vec = {"I am a boy.", "I like Aplux."};
int ret = tts->write(str_vec);
if (ret != EXIT_SUCCESS)
{
printf("tts->write() failure!\n");
return EXIT_FAILURE;
}
Important
Notes:
After the text is submitted, the system returns the synthesized audio stream asynchronously through the callback interface. The output audio strictly follows these specifications:
- File container: standard WAV format
- Sample rate: 44100 Hz
- Channels: mono
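The sample program in this document receives the audio as a std::vector<float> (result.audio_data). If an application wants to store or forward that buffer itself, one possible approach is to encode it as a 16-bit PCM mono WAV image at the sample rate listed above. The sketch below shows this under the assumption that the samples are normalized to [-1, 1], which this document does not confirm. encode_wav_mono16, put_le, and duration_seconds are hypothetical helper names, not SDK functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends the low `bytes` bytes of `value` to `out` in little-endian order.
static void put_le(std::vector<std::uint8_t> &out, std::uint32_t value, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        out.push_back(static_cast<std::uint8_t>((value >> (8 * i)) & 0xFFu));
    }
}

// Encodes float samples (assumed normalized to [-1, 1]) as a 16-bit PCM mono WAV image.
std::vector<std::uint8_t> encode_wav_mono16(const std::vector<float> &samples,
                                            std::uint32_t sample_rate = 44100) {
    const std::uint16_t channels = 1, bits = 16;
    const std::uint32_t block_align = channels * bits / 8;
    const std::uint32_t data_bytes =
        static_cast<std::uint32_t>(samples.size()) * block_align;

    std::vector<std::uint8_t> out;
    out.reserve(44 + data_bytes);
    const char riff[] = "RIFF", wave_fmt[] = "WAVEfmt ", data[] = "data";
    out.insert(out.end(), riff, riff + 4);
    put_le(out, 36 + data_bytes, 4);            // RIFF chunk size
    out.insert(out.end(), wave_fmt, wave_fmt + 8);
    put_le(out, 16, 4);                         // fmt chunk size for PCM
    put_le(out, 1, 2);                          // audio format 1 = PCM
    put_le(out, channels, 2);
    put_le(out, sample_rate, 4);
    put_le(out, sample_rate * block_align, 4);  // byte rate
    put_le(out, block_align, 2);
    put_le(out, bits, 2);
    out.insert(out.end(), data, data + 4);
    put_le(out, data_bytes, 4);

    for (float s : samples) {
        const float clamped = std::max(-1.0f, std::min(1.0f, s));  // avoid integer overflow
        const auto v = static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
        put_le(out, static_cast<std::uint16_t>(v), 2);
    }
    return out;
}

// Duration in seconds of a mono buffer at the given sample rate.
double duration_seconds(std::size_t sample_count, std::uint32_t sample_rate = 44100) {
    return static_cast<double>(sample_count) / sample_rate;
}
```

If set_out_audio_path() is configured, the SDK writes the file itself and a helper like this is unnecessary. It is mainly useful when the audio has to be sent over a network or post-processed in memory.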
Stop TTS Input.stop()
This interface notifies the TTS engine that the input stream has ended. After it is called, TTS immediately cuts off the input data. Only the result still being processed is returned through the onStop callback, and any remaining unprocessed data is discarded.
| API | stop |
| Description | Formally closes the text input stream |
| Parameters | void |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
Important
Notes:
- This interface is intended for interrupting output, for example in streaming-style scenarios. It clears the internal buffer and cuts off input immediately. Only the item currently being processed is returned through onStop.
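Since stop() discards text that has not been processed yet, an application that needs every result may prefer to wait until the expected number of onResult callbacks has arrived before calling stop(). One way to do that, sketched here with only the C++ standard library, is a small counting latch. ResultLatch is a hypothetical helper, not an SDK class, and the sketch assumes that callbacks may run on a thread other than the caller's.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Counts down once per received result and lets another thread wait for zero.
class ResultLatch {
public:
    explicit ResultLatch(std::size_t expected) : remaining_(expected) {}

    // Call from the result callback (possibly on an SDK worker thread).
    void count_down() {
        std::lock_guard<std::mutex> lock(mu_);
        if (remaining_ > 0 && --remaining_ == 0) cv_.notify_all();
    }

    // Blocks until all expected results have arrived or the timeout expires.
    // Returns true if every expected result was seen.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mu_);
        return cv_.wait_for(lock, timeout, [this] { return remaining_ == 0; });
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::size_t remaining_;
};
```

A typical use would be to construct the latch with the number of strings passed to write(), call count_down() inside onResult, and call wait_for() with a generous timeout on the main thread before stop() and tts_destroy(). The timeout keeps the application from hanging if an error is reported through onError instead.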
Destroy TTS Object.tts_destroy()
When all audio synthesis tasks have finished and the application no longer needs the TTS feature during its lifecycle, you must call this interface. It fully releases the system resources used by the SDK.
| API | tts_destroy |
| Description | Completely destroys the TTS instance and releases all related resources |
| Parameters | void |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
Other Methods
In addition to the inference-related interfaces described above, the AidVoice SDK also provides the following helper interfaces.
Get Microphone List.show_microphone_dev()
Before calling audio_microphone(), it is recommended to call this interface first to list the available audio input devices on the current system, so you can get the correct device ID.
| API | show_microphone_dev |
| Description | Lists all available microphone hardware devices in the system. This interface prints the device name and its corresponding ID to standard output or the logging system |
| Parameters | void |
| Returns | No return value |
Get Current AidVoice SDK Version.get_library_version()
Gets version information for the current AidVoice SDK.
| API | get_library_version |
| Description | Gets version information for the current AidVoice SDK |
| Parameters | void |
| Returns | string: version information |
Get Current Log Level.get_log_level()
| API | get_log_level |
| Description | Gets the current log level |
| Parameters | void |
| Returns | LogLevel: log level |
Set Log Level.set_log_level()
| API | set_log_level |
| Description | Sets the log level |
| Parameters | LogLevel: log level |
| Returns | Returns 0 by default |
Output Logs to the Console.log_to_console()
| API | log_to_console |
| Description | Sends log output to the standard error console |
| Parameters | void |
| Returns | Returns 0 by default |
Output Logs to a Text File.log_to_file()
| API | log_to_file |
| Description | Sends log output to the specified text file |
| Parameters | path_and_prefix: path and filename prefix for log files; also_to_console: whether to also output logs to stderr (default: false) |
| Returns | Returns 0 on success. Any non-zero value means the operation failed |
AidVoice C++ Sample Programs
AidVoice ASR Audio Recognition Sample
Using audio transcription as an example, a typical C++ sample for ASR includes the following parts:
// Global configuration
AidLux::AidVoice::FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_ASR;
cfg.model_path = "model_path";
// Build the ASR object
auto asr = AidLux::AidVoice::create_asr(cfg);
if (!asr)
{
printf("create_asr failure!\n");
return EXIT_FAILURE;
}
// Inherit the callback interface
class ASRCallbacksImpl : public ASRCallbacks
{
public:
void onResult(const AsrResult &result) override
{
std::string asrResult = result.text;
int sid = result.id;
AsrStatus status = result.status;
printf("============callback result ===============\n");
std::cout << "sid: " << sid << std::endl;
std::cout << "asrResult: \n"
<< asrResult << std::endl;
std::cout << "status: " << (int)status << std::endl;
printf("===========================================\n\n");
total_echo = sid; // total_echo is assumed to be a counter declared elsewhere in the application
}
void onError(const AsrError &error) override
{
int errCode = error.error_code;
int errStatus = (int)error.status;
std::string errMsg = error.message;
printf("============error callback=================\n");
std::cout << "errMsg: " << errMsg << std::endl;
printf("===========================================\n\n");
}
void onStop(const AsrResult &result) override
{
std::string asrResult = result.text;
int sid = result.id;
AsrStatus status = result.status;
printf("============stop result ===============\n");
std::cout << "sid: " << sid << std::endl;
std::cout << "asrResult: \n"
<< asrResult << std::endl;
std::cout << "status: " << (int)status << std::endl;
printf("===========================================\n\n");
total_echo = sid; // total_echo is assumed to be a counter declared elsewhere in the application
}
~ASRCallbacksImpl() = default;
};
// Create the callback object and register it
ASRCallbacksImpl *mASRCallbacks = new ASRCallbacksImpl();
asr->set_callback(mASRCallbacks);
// Initialize the ASR object
int ret = asr->init();
if (ret != EXIT_SUCCESS)
{
printf("asr->init() failure!\n");
return EXIT_FAILURE;
}
// Send audio file data
ret = asr->write(wave_path);
if (ret != EXIT_SUCCESS)
{
printf("asr->write() failure!\n");
return EXIT_FAILURE;
}
// Stop input
ret = asr->stop();
if (ret != EXIT_SUCCESS)
{
printf("asr->stop() failure!\n");
return EXIT_FAILURE;
}
// Destroy the object
ret = asr->asr_destroy();
if (ret != EXIT_SUCCESS)
{
printf("asr->asr_destroy() failure!\n");
return EXIT_FAILURE;
}
AidVoice TTS Audio Synthesis Sample
A typical C++ sample for audio synthesis includes the following parts:
// Global configuration
AidLux::AidVoice::FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_TTS;
cfg.model_path = "model_path"; // You can set the model path here or pass it through command-line arguments
// Build the TTS object
auto tts = AidLux::AidVoice::create_tts(cfg);
if (!tts)
{
printf("create tts failure!\n");
return EXIT_FAILURE;
}
// Set the TTS working mode
tts->set_mode(TTSMode::TYPE_WHOLE);
// Inherit the callback interface
class TTSCallbacksImpl : public TTSCallbacks
{
public:
void onResult(const TTSResult &result) override
{
std::string audio_name = result.audio_name;
std::vector<float> audio_data = result.audio_data;
double audio_len = result.audio_len;
int seq = result.seq;
int sid = result.id;
TTSStatus status = result.status;
printf("============callback result ===============\n");
std::cout << "sid: " << sid << std::endl;
std::cout << "audio_name:" << audio_name << std::endl;
std::cout << "audio_data size:" << audio_data.size() << std::endl;
std::cout << "audio_len: " << (double)audio_len << std::endl;
std::cout << "seq: " << (int)seq << std::endl;
std::cout << "status: " << (int)status << std::endl;
printf("===========================================\n\n");
}
void onError(const TTSError &error) override
{
int errCode = error.error_code;
int errStatus = (int)error.status;
std::string errMsg = error.message;
printf("============error callback=================\n");
std::cout << "errMsg: " << errMsg << std::endl;
printf("===========================================\n\n");
}
void onStop(const TTSResult &result) override
{
std::string audio_name = result.audio_name;
std::vector<float> audio_data = result.audio_data;
double audio_len = result.audio_len;
int seq = result.seq;
int sid = result.id;
TTSStatus status = result.status;
printf("============stop result ===============\n");
std::cout << "sid: " << sid << std::endl;
std::cout << "audio_name:" << audio_name << std::endl;
std::cout << "audio_data size:" << audio_data.size() << std::endl;
std::cout << "audio_len: " << (double)audio_len << std::endl;
std::cout << "seq: " << (int)seq << std::endl;
std::cout << "status: " << (int)status << std::endl;
printf("===========================================\n\n");
}
~TTSCallbacksImpl() = default;
};
// Create the callback object and register it
TTSCallbacksImpl *mTTSCallbacks = new TTSCallbacksImpl();
tts->set_callback(mTTSCallbacks);
// Initialize the TTS object
int ret = tts->init();
if (ret != EXIT_SUCCESS)
{
printf("tts->init() failure!\n");
return EXIT_FAILURE;
}
// Send text for synthesis
std::vector<std::string> str_vec = {"This is an example of text to speech using Melo for English. How does it sound?"};
ret = tts->write(str_vec);
if (ret != EXIT_SUCCESS)
{
printf("tts->write() failure!\n");
return EXIT_FAILURE;
}
// Stop input
ret = tts->stop();
if (ret != EXIT_SUCCESS)
{
printf("tts->stop() failure!\n");
return EXIT_FAILURE;
}
// Destroy the object
ret = tts->tts_destroy();
if (ret != EXIT_SUCCESS)
{
printf("tts->tts_destroy() failure!\n");
return EXIT_FAILURE;
}
Important
More usage examples are available in the following locations:
- C++ ASR sample path: /usr/local/share/aidvoice/examples/asr/cpp/
- C++ TTS sample path: /usr/local/share/aidvoice/examples/tts/cpp/
This completes the full list of interfaces provided by the AidVoice SDK.