AidVoice C++ API Documentation

Important things to know when developing with the AidVoice SDK for C++:

  • Include the header file at compile time: /usr/local/include/aidlux/aidvoice/aidvoice_speech.hpp
  • Link the library file at build time: /usr/local/lib/libaidvoice_speech.so

Feature Type.enum FeatureType

FeatureType is used to specify the core feature module when initializing the AidVoice SDK. Since the SDK includes multiple speech features, you must use this enum to clearly define which feature you want when creating a feature instance (object). The SDK currently supports automatic speech recognition (ASR) and text-to-speech (TTS). More speech features will be added in future releases.

Member Name | Type | Value | Description
TYPE_DEFAULT | uint8_t | 0 | Invalid data type
TYPE_ASR | uint8_t | 1 | Automatic speech recognition
TYPE_TTS | uint8_t | 2 | Text to speech

Audio Type.enum AudioType

When running an ASR task, you need to define the encoding format and sampling properties of the input or output audio. By setting this enum, the SDK can correctly parse audio stream data or generate audio files in the required format. This enum is mainly used to identify the audio type for ASR input. The TTS output format is currently fixed by the engine.

Member Name | Type | Value | Description
TYPE_DEFAULT | uint8_t | 0 | Invalid data type
TYPE_WAV | uint8_t | 1 | WAV audio
TYPE_PCM | uint8_t | 2 | PCM audio

Important

To ensure ASR accuracy and system stability, the raw audio stream sent to the SDK must strictly follow these requirements:

  • Sample rate: fixed at 16 kHz.
  • Channel configuration: mono only.
  • Bit depth: 16-bit signed.

The TTS module currently outputs WAV audio only. The audio enum values here are mainly used by the ASR module to identify different input audio types.
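
Because the SDK expects 16 kHz, mono, 16-bit input, it can help to check a WAV file before sending it. The following is a minimal sketch, not part of the SDK: `is_16k_mono_s16_wav` and `read_le` are hypothetical helpers, and the check assumes a canonical 44-byte PCM header in which the "fmt " chunk directly follows the RIFF/WAVE tag.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Reads a little-endian unsigned integer of n bytes (n <= 4) starting at p.
static std::uint32_t read_le(const unsigned char *p, int n) {
    std::uint32_t v = 0;
    for (int i = 0; i < n; ++i) v |= static_cast<std::uint32_t>(p[i]) << (8 * i);
    return v;
}

// Returns true when `bytes` starts with a canonical 44-byte PCM WAV header
// that describes 16 kHz, mono, 16-bit audio.
// Assumption: the "fmt " chunk sits at offset 12. Files that carry extra
// chunks before "fmt " would need a proper chunk walker instead.
bool is_16k_mono_s16_wav(const std::vector<unsigned char> &bytes) {
    if (bytes.size() < 44) return false;
    const unsigned char *h = bytes.data();
    if (std::memcmp(h, "RIFF", 4) != 0) return false;
    if (std::memcmp(h + 8, "WAVE", 4) != 0) return false;
    if (std::memcmp(h + 12, "fmt ", 4) != 0) return false;
    const std::uint32_t audio_format = read_le(h + 20, 2);    // 1 = integer PCM
    const std::uint32_t channels = read_le(h + 22, 2);
    const std::uint32_t sample_rate = read_le(h + 24, 4);
    const std::uint32_t bits_per_sample = read_le(h + 34, 2);
    return audio_format == 1 && channels == 1 &&
           sample_rate == 16000 && bits_per_sample == 16;
}
```

A file that fails this check would typically need resampling or downmixing with an external tool before it is passed to the ASR module.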

Log Level.enum LogLevel

AidVoice SDK provides logging-related interfaces, which are introduced later in this document. If you need to specify the current logging level used by AidVoice SDK, use this enum.

Member Name | Type | Value | Description
TYPE_INFO | uint8_t | 0 | Information
TYPE_WARNING | uint8_t | 1 | Warning
TYPE_ERROR | uint8_t | 2 | Error
TYPE_FATAL | uint8_t | 3 | Fatal error
TYPE_DEBUG | uint8_t | 4 | Debug
TYPE_OFF | uint8_t | 5 | Disabled

Return Status.enum ResultStatus

ResultStatus defines the return status for all SDK operations. By checking this enum value, you can tell whether the current operation completed successfully. If the return value indicates failure, you can troubleshoot the issue based on the detailed error information.

Member Name | Type | Value | Description
AV_OK | uint8_t | 0 | Success
AV_ERR_INVALID_ARG | uint8_t | 1 | Invalid argument
AV_ERR_LOAD_FILE | uint8_t | 2 | Failed to load file
AV_ERR_RUN_FAIL | uint8_t | 3 | Runtime error
AV_ERR_UNSUPPORTED | uint8_t | 4 | Operation not supported
AV_ERR_GENERATE_OBJECT | uint8_t | 5 | Failed to create object
AV_OTHER | uint8_t | 6 | Other
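
When logging failures, a readable description is often more useful than the raw number. The sketch below uses a local mirror of the documented values so that it compiles on its own. In a real project the enum comes from the SDK header, and `describe` is a hypothetical helper rather than an SDK function. The mirror is written as a scoped enum, which is an assumption based on the `ResultStatus::AV_OTHER` spelling used elsewhere in this document.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Local mirror of the documented ResultStatus values (assumed to be a scoped enum).
enum class ResultStatus : std::uint8_t {
    AV_OK = 0,
    AV_ERR_INVALID_ARG = 1,
    AV_ERR_LOAD_FILE = 2,
    AV_ERR_RUN_FAIL = 3,
    AV_ERR_UNSUPPORTED = 4,
    AV_ERR_GENERATE_OBJECT = 5,
    AV_OTHER = 6
};

// Hypothetical helper: maps a status to the description listed in the table.
const char *describe(ResultStatus s) {
    switch (s) {
        case ResultStatus::AV_OK: return "Success";
        case ResultStatus::AV_ERR_INVALID_ARG: return "Invalid argument";
        case ResultStatus::AV_ERR_LOAD_FILE: return "Failed to load file";
        case ResultStatus::AV_ERR_RUN_FAIL: return "Runtime error";
        case ResultStatus::AV_ERR_UNSUPPORTED: return "Operation not supported";
        case ResultStatus::AV_ERR_GENERATE_OBJECT: return "Failed to create object";
        case ResultStatus::AV_OTHER: return "Other";
    }
    return "Unknown status";  // value outside the documented range
}
```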

Device Information.struct DeviceInfo

The DeviceInfo struct describes NPU information on the current device.

Member List

Member version
Type uint32_t
Default 0X00020301
Description Version of the current device information
Member id
Type uint32_t
Default 0
Description ID of the current device
Member type
Type uint32_t
Default 0
Description Type of the current device
Member cores_num
Type uint32_t
Default 1
Description Number of NPU cores on the current device
Member cores_id
Type std::vector<uint32_t>
Default No default value
Description IDs of all NPU cores on the current device

Important

  • The current SDK device information class is only used to query NPU information on the device. For CPU, GPU, and other hardware information, please use other methods.
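
To make the field layout concrete, here is a minimal sketch that prints one device entry. `DeviceInfoSketch` is a local stand-in that copies the documented members and defaults so the example compiles without the SDK, and `summarize` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Local stand-in with the documented members and default values.
struct DeviceInfoSketch {
    std::uint32_t version = 0x00020301;
    std::uint32_t id = 0;
    std::uint32_t type = 0;
    std::uint32_t cores_num = 1;
    std::vector<std::uint32_t> cores_id;  // no documented default
};

// Hypothetical helper: one-line summary of an NPU entry for logging.
std::string summarize(const DeviceInfoSketch &d) {
    std::ostringstream out;
    out << "npu id=" << d.id << " type=" << d.type
        << " cores=" << d.cores_num << " [";
    for (std::size_t i = 0; i < d.cores_id.size(); ++i) {
        if (i != 0) out << ",";
        out << d.cores_id[i];
    }
    out << "]";
    return out.str();
}
```

With the defaults and `cores_id = {0}`, `summarize` returns `npu id=0 type=0 cores=1 [0]`.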

Global Configuration.class FeatureConfig

The FeatureConfig structure stores all configuration information required to build a specific feature object. Before initializing an SDK instance, you need to create this structure and set the feature type, model selection, and logging level based on your business needs.

Member List

FeatureConfig includes the following parameters:

Member feature_type
Type FeatureType
Default No default value
Description Specifies the SDK feature mode, such as speech recognition or speech synthesis
Member model_path
Type string
Default Empty string
Description Specifies the path of the selected model
Member log_type
Type LogLevel
Default TYPE_OFF
Description Specifies the log level
Member custom_device_info
Type std::vector<DeviceInfo>
Default Empty array
Description List of NPU device information
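
A quick sanity check before creating an instance can catch the most common configuration mistakes. The sketch below is only an illustration: `FeatureConfigSketch` is a local stand-in with a subset of the documented members (custom_device_info is omitted), it initializes feature_type to TYPE_DEFAULT even though the real member has no documented default, and `check_config` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Local mirrors of the documented enums (assumed to be scoped enums).
enum class FeatureType : std::uint8_t { TYPE_DEFAULT = 0, TYPE_ASR = 1, TYPE_TTS = 2 };
enum class LogLevel : std::uint8_t {
    TYPE_INFO = 0, TYPE_WARNING = 1, TYPE_ERROR = 2,
    TYPE_FATAL = 3, TYPE_DEBUG = 4, TYPE_OFF = 5
};

// Stand-in for FeatureConfig; in real code use the SDK type instead.
struct FeatureConfigSketch {
    FeatureType feature_type = FeatureType::TYPE_DEFAULT;  // sketch-only default
    std::string model_path;                                // documented default: empty
    LogLevel log_type = LogLevel::TYPE_OFF;                // documented default
};

// Hypothetical helper: returns an empty string when the configuration looks
// usable, otherwise a short description of the first problem found.
std::string check_config(const FeatureConfigSketch &cfg) {
    if (cfg.feature_type == FeatureType::TYPE_DEFAULT)
        return "feature_type is invalid; set TYPE_ASR or TYPE_TTS";
    if (cfg.model_path.empty())
        return "model_path is empty";
    return "";
}
```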

ASR Interfaces

This section describes the core API interfaces in AidVoice SDK for automatic speech recognition (ASR). With these interfaces, you can complete the full ASR workflow, from creating an ASR instance and sending audio data to retrieving recognition results.

Feature overview: The ASR module converts 16 kHz, mono, 16-bit raw audio streams into text in real time or offline. It currently supports the following inference models: senseVoice_small, whisper_tiny, whisper_base, whisper_base_en, whisper_small, and whisper_medium.

ASR Mode.enum ASRMode

ASRMode defines how ASR results are returned. Depending on your real-time requirements, you can choose streaming mode for incremental feedback or non-streaming mode for complete sentence output.

Member Name | Type | Value | Description
TYPE_STREAM | uint8_t | 0 | Streaming output: can return intermediate transcription results
TYPE_NOSTREAM | uint8_t | 1 | Non-streaming output: returns the final transcription result for each processing step

Important

Notes:

  • Streaming: temporary transcription results are generated in real time during audio processing. As more audio is fed in, the SDK continues to revise and update the intermediate text. This mode can return data before the full audio buffer is processed, which reduces first-token latency and improves the interaction experience.
  • Non-streaming: transcription is performed based on a fixed audio duration. If speech end is detected early, such as when the user stops input, or if the speech duration is shorter than expected, the system immediately returns the final transcription result.

Important

Different inference models have strict limits on the audio length for a single processing task. When configuring non-streaming transcription or preparing audio segments, keep the following limits in mind:

  • Whisper models: the maximum audio length for a single input is 24 seconds.
  • SenseVoice models: the maximum audio length for a single input is 15 seconds.
  • If the audio sent in one request exceeds these limits, the SDK truncates the input audio. The remaining audio after truncation automatically starts a new transcription task.
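
The effect of these limits on a long recording can be estimated with simple arithmetic: at 16 kHz, a 24-second limit corresponds to 384000 samples and a 15-second limit to 240000 samples. The sketch below only illustrates that arithmetic. `split_by_limit` is a hypothetical helper, and the SDK's internal segmentation may differ, for example when speech end is detected early.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kAsrSampleRateHz = 16000;  // required ASR input rate

// Splits `total_samples` of 16 kHz audio into consecutive segment lengths,
// each at most `max_seconds` long. Returns the length of every segment in samples.
std::vector<std::size_t> split_by_limit(std::size_t total_samples,
                                        std::size_t max_seconds) {
    std::vector<std::size_t> segments;
    const std::size_t max_samples = max_seconds * kAsrSampleRateHz;
    if (max_samples == 0) return segments;  // no valid limit, nothing to split
    while (total_samples > 0) {
        const std::size_t n = total_samples < max_samples ? total_samples : max_samples;
        segments.push_back(n);
        total_samples -= n;
    }
    return segments;
}
```

For a 40-second recording, this yields two segments under the Whisper limit (24 s and 16 s) and three under the SenseVoice limit (15 s, 15 s, and 10 s).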

Speech Transcription Status.enum AsrStatus

AsrStatus indicates the state of the ASR text result in the current recognition round. By checking this status, you can tell whether the current text is an intermediate transcription that may still change or the final confirmed result.

Member Name | Type | Value | Description
TYPE_PARTIAL | uint8_t | 0 | Intermediate transcription, recognition not finished
TYPE_FINAL | uint8_t | 1 | Final transcription result, recognition finished

Important

Notes:

  • Streaming mode (TYPE_STREAM): when the result status is PARTIAL, it means the SDK is returning an intermediate result before the current audio buffer has been fully processed. Only when the status is FINAL does it mean processing of the current buffer is complete.
  • Non-streaming mode (TYPE_NOSTREAM): in non-streaming mode, the status of each transcription result is always FINAL.
cpp
// Streaming mode:
I am.                                TYPE_PARTIAL
I am a boy.                          TYPE_PARTIAL
I am a boy. I like Aplux.            TYPE_FINAL
// Non-streaming mode:
I am a boy. I like Aplux.            TYPE_FINAL
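
One way an application might consume these statuses is to overwrite the displayed text on every PARTIAL result and commit it only on FINAL. The following is a minimal sketch under that assumption, which matches the example output above, where each partial result repeats and extends the previous one. `Transcript` is a hypothetical class, and `AsrStatus` is mirrored locally so the code compiles without the SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Local mirror of the documented AsrStatus values.
enum class AsrStatus : std::uint8_t { TYPE_PARTIAL = 0, TYPE_FINAL = 1 };

// Keeps committed (FINAL) text separate from the latest PARTIAL text.
class Transcript {
public:
    // Call this with the status and text of every callback result.
    void update(AsrStatus status, const std::string &text) {
        if (status == AsrStatus::TYPE_FINAL) {
            committed_.push_back(text);
            pending_.clear();
        } else {
            pending_ = text;  // a newer partial replaces the previous one
        }
    }

    // Committed text followed by the current partial, joined with spaces.
    std::string display() const {
        std::string out;
        for (const std::string &s : committed_) {
            if (!out.empty()) out += ' ';
            out += s;
        }
        if (!pending_.empty()) {
            if (!out.empty()) out += ' ';
            out += pending_;
        }
        return out;
    }

    std::size_t committed_count() const { return committed_.size(); }

private:
    std::vector<std::string> committed_;
    std::string pending_;
};
```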

ASR Result.class AsrResult

The AsrResult structure carries the ASR transcription result and its status. When the SDK finishes processing a segment of audio, it returns the recognized text together with its status, either intermediate or final, in this structure.

Member List

AsrResult includes the following parameters:

Member status
Type AsrStatus
Default No default value
Description Transcription status of the current result
Member text
Type std::string
Default No default value
Description Current transcription text
Member id
Type int
Default 0
Description Result ID

ASR Error.class AsrError

The AsrError structure carries ASR error information. When an interface returns a non-success status, you can use this structure to get the specific error code and a detailed message.

Member List

AsrError includes the following parameters:

Member status
Type ResultStatus
Default No default value
Description Error status
Member error_code
Type int
Default No default value
Description Error code
Member message
Type std::string
Default No default value
Description Error message

ASR Callback Interface.class ASRCallbacks

ASRCallbacks is an abstract base class that defines the listener interface used by the SDK to push data to the application layer. You need to inherit from this class and implement its virtual functions so that you can asynchronously receive recognition results or error information.

Get Transcription Result.onResult()

This callback is triggered automatically when the ASR engine finishes processing a segment of audio and generates text.

API onResult
Description Speech recognition result callback
Parameters result: transcription result object that contains the current recognized text and result status
Returns void
API onError
Description Error callback. Used to receive and handle different exceptions that occur while ASR is running
Parameters error: error information object that contains the current error message, error code, and related details
Returns void
API onStop
Description Stop callback
Parameters result: after stop() is called, the current task stops accepting new input and returns any remaining result that is still being processed through this callback
Returns void
cpp
class ASRCallbacksImpl : public ASRCallbacks
{
public:
	void onResult(const AsrResult &result) override
	{
		std::string asrResult = result.text;
		int sid = result.id;
		AsrStatus status = result.status;
		printf("============callback result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "asrResult: \n"
				  << asrResult << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
		total_echo = sid; // total_echo is a variable defined elsewhere in the sample program (not shown here)
	}

	void onError(const AsrError &error) override
	{
		int errCode = error.error_code;
		int errStatus = (int)error.status;
		std::string errMsg = error.message;
		printf("============error callback=================\n");
		std::cout << "errCode: " << errCode << ", errStatus: " << errStatus << std::endl;
		std::cout << "errMsg: " << errMsg << std::endl;
		printf("===========================================\n\n");
	}

	void onStop(const AsrResult &result) override
	{
		std::string asrResult = result.text;
		int sid = result.id;
		AsrStatus status = result.status;
		printf("============stop result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "asrResult: \n"
				  << asrResult << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n");
	}
	~ASRCallbacksImpl() = default;
};

ASR Core Class.class AidVoiceASR

AidVoiceASR is the main ASR feature class in the SDK. It manages the full lifecycle of speech recognition. You use the interfaces provided by this class to load models, push audio data, and stop recognition tasks. This class must be initialized together with FeatureConfig.

Create Instance.create_asr()

This is the first interface you call when using the SDK. It creates and initializes a specific ASR object in memory based on the global configuration, such as feature type and model type.

API create_asr
Description Builds a specific ASR instance from the configuration object
Parameters cfg: global configuration used to specify the model type and log level
Returns An AidVoiceASR instance pointer on success; nullptr on failure

Important

Notes:

  • Before calling this interface, make sure cfg.feature_type is set to TYPE_ASR.

Set Mode.set_mode()

This interface sets the working mode of ASR. Depending on your application needs, such as real-time speech interaction or offline long-form transcription, you can set it to streaming mode or non-streaming mode.

API set_mode
Description Sets the ASR recognition mode
Parameters mode: the recognition mode to use
Returns void

Set Callback.set_callback()

This interface registers a user-implemented callback listener instance with ASR. Once registered, the SDK uses this instance to asynchronously return transcription results through onResult and error information through onError.

API set_callback
Description Registers a callback listener object used to receive asynchronous recognition results
Parameters cb: pointer to an instance of a user-defined ASRCallbacks implementation
Returns void
cpp
// The callback object must be allocated on the heap.
// It will be released by AidVoice internally.
ASRCallbacksImpl *mASRCallbacks = new ASRCallbacksImpl();

// After registration, ownership of the object is transferred to AidVoice.
asr->set_callback(mASRCallbacks);

Important

Notes:

  • The callback instance must be allocated on the heap with new. Once set_callback is called, the pointer lifetime is managed by AidVoice internally. The SDK automatically deletes it when the ASR instance is destroyed.

Enable Special Token Output.set_special_tokens()

This interface controls whether special tokens are included in the output, such as the model start token, end token, and other tokens with special meanings. If enabled, the callback also returns these special tokens.

API set_special_tokens
Description Controls whether the model's special token characters are returned
Parameters is_add: whether to include them. The default value is false
Returns void

Set Maximum Audio Processing Duration.set_echo_ms()

This interface sets the maximum audio length, in milliseconds, that a single ASR inference task can accept.

API set_echo_ms
Description Sets the audio duration threshold for a single ASR inference task
Parameters echo_ms: duration threshold for a single processing task
Returns void

Important

When setting this value, you must follow the per-model processing limits:

  • Whisper models: echo_ms must not exceed 24000 (24 s).
  • SenseVoice models: echo_ms must not exceed 15000 (15 s).
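
A small guard can reject out-of-range values before they reach set_echo_ms(). This is a sketch only: `ModelFamily`, `max_echo_ms`, and `is_valid_echo_ms` are hypothetical names, not SDK types or functions, and treating non-positive values as invalid is an assumption.

```cpp
#include <cassert>

// Hypothetical grouping of the supported models by their documented limit.
enum class ModelFamily { Whisper, SenseVoice };

// Documented upper bound for echo_ms, in milliseconds.
constexpr int max_echo_ms(ModelFamily m) {
    return m == ModelFamily::Whisper ? 24000 : 15000;
}

// Returns true when `echo_ms` is positive and within the model's limit.
bool is_valid_echo_ms(ModelFamily m, int echo_ms) {
    return echo_ms > 0 && echo_ms <= max_echo_ms(m);
}
```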

Set Streaming Feedback Interval.set_step_ms()

This interface is designed for streaming mode (TYPE_STREAM). It sets how often ASR returns intermediate transcription results with PARTIAL status.

API set_step_ms
Description Sets the callback interval for streaming transcription results in milliseconds. A callback is triggered each time the specified amount of audio has been processed
Parameters step_ms: time step for result feedback
Returns void

Important

Notes:

  • This setting only takes effect in streaming mode. In non-streaming mode, the system ignores this setting and returns the final result only after recognition is complete.
  • A smaller step_ms gives better real-time feedback. In live microphone input scenarios, setting step_ms too small may make the output less continuous. To preserve real-time responsiveness, the SDK uses an overwrite-based buffer strategy. If the model cannot process audio as fast as it is being fed in, older unprocessed data may be overwritten by newly received audio, which can lead to broken recognition results.
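
To see why an overwrite-based buffer can break recognition results, consider a fixed-capacity buffer that drops its oldest samples when the producer outruns the consumer. The class below is a generic illustration of that strategy. It is not the SDK's actual implementation, which the source does not describe, and `OverwriteBuffer` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-capacity sample buffer that overwrites its oldest data when full.
class OverwriteBuffer {
public:
    explicit OverwriteBuffer(std::size_t capacity) : data_(capacity) {}

    // Appends samples and returns how many unread samples were overwritten.
    std::size_t push(const std::int16_t *samples, std::size_t count) {
        if (data_.empty()) return count;  // zero capacity: everything is lost
        std::size_t dropped = 0;
        for (std::size_t i = 0; i < count; ++i) {
            if (size_ == data_.size()) {  // full: discard the oldest sample
                head_ = (head_ + 1) % data_.size();
                --size_;
                ++dropped;
            }
            data_[(head_ + size_) % data_.size()] = samples[i];
            ++size_;
        }
        return dropped;
    }

    // Removes and returns up to `max_count` of the oldest samples.
    std::vector<std::int16_t> pop(std::size_t max_count) {
        std::vector<std::int16_t> out;
        while (size_ > 0 && out.size() < max_count) {
            out.push_back(data_[head_]);
            head_ = (head_ + 1) % data_.size();
            --size_;
        }
        return out;
    }

    std::size_t size() const { return size_; }

private:
    std::vector<std::int16_t> data_;
    std::size_t head_ = 0;  // index of the oldest unread sample
    std::size_t size_ = 0;  // number of unread samples
};
```

Pushing six samples into a buffer of capacity four drops the first two, so the consumer sees a gap in the audio. In the ASR case, such a gap would show up as missing or broken text.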

Save Input Audio.set_save_audio()

This interface is mainly used for live microphone input. When enabled, the SDK automatically captures the raw audio stream received from the microphone and saves it locally in WAV format.

API set_save_audio
Description Controls whether raw microphone input audio is saved locally
Parameters save_audio: boolean value. true enables saving; false (the default) disables it
Returns void

Get Device NPU Information.get_device_info()

This interface is used to query NPU information on the current device.

API get_device_info
Description Queries NPU information on the current device
Parameters device_info_list
Returns Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

  • You must call this interface after initialization has completed to get the current device NPU information.
cpp
	auto asr = AidLux::AidVoice::create_asr(cfg);
	asr->init();
	std::vector<DeviceInfo> device_info;
	asr->get_device_info(device_info);

Bind an NPU Device for Execution.set_device_info()

This interface configures the NPU binding information used by the SDK. By passing the target NPU device settings, you can bind SDK inference tasks to the selected NPU so that the model runs on the intended hardware.

API set_device_info
Description Sets the NPU device used by the SDK
Parameters device_info
Returns Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

  • You must set the NPU device information before SDK initialization is completed. Otherwise, the setting may not take effect after initialization.
cpp
	auto asr = AidLux::AidVoice::create_asr(cfg);
	DeviceInfo device01;
	device01.id = 0;
	device01.type = 0;
	device01.cores_num = 1;
	device01.cores_id = {0};
	asr->set_device_info(device01);
	asr->init();

Initialize.init()

After the ASR object is created, you need to run initialization steps such as environment checks and resource setup.

API init
Description Completes the initialization work required for inference
Parameters void
Returns Returns 0 on successful initialization. Any non-zero value means the operation failed
cpp
// Initialize ASR. Any non-zero return value indicates an error.
int ret = asr->init();
if (ret != EXIT_SUCCESS)
{
	printf("asr->init() failure!\n");
	return EXIT_FAILURE;
}

Data Input.write()

After init() succeeds, you can use the write() interface to send audio data for recognition. The SDK supports multiple input sources to cover different scenarios, such as file transcription and streaming audio capture.

Use an Audio File as Input

This interface directly reads a local audio file for recognition.

API write
Description Passes the path of a 16 kHz WAV audio file. The SDK parses the file automatically and performs recognition
Parameters wav_16k_file: absolute or relative path to the local audio file
Returns Returns 0 on success. Any non-zero value means the operation failed
cpp
// Use an audio file as input. Any non-zero return value indicates an error.
std::string wave_path = "audio.wav";
int ret = asr->write(wave_path);
if (ret != EXIT_SUCCESS)
{
	printf("asr->write() failure!\n");
	return EXIT_FAILURE;
}

Important

Notes:

  • The audio file must be mono, 16-bit WAV or PCM audio with a sample rate of 16000 Hz.

Use a Raw Byte Stream as Input

This interface accepts raw audio bytes stored in memory.

API write
Description Pushes a raw audio byte stream to ASR
Parameters data: pointer to the audio data buffer; len: byte length of the buffer data
Returns Returns 0 on success. Any non-zero value means the operation failed
cpp
// Use a raw byte stream as input. Any non-zero return value indicates an error.
// data_size is the number of audio bytes to send (for example, the length of the source file).
char *data = new char[data_size];
// ... Fill the audio data here ...
int ret = asr->write(data, data_size);
if (ret != EXIT_SUCCESS)
{
	printf("asr->write() failure!\n");
	return EXIT_FAILURE;
}
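
The buffer in the example above has to be filled from somewhere. One common source is a raw PCM file on disk. The helper below is a minimal sketch using only the standard library, and `read_all_bytes` is a hypothetical name, not an SDK function.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Reads an entire file into memory.
// Returns an empty vector if the file cannot be opened (or is empty).
std::vector<char> read_all_bytes(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}
```

Assuming the byte-stream overload takes a char pointer and a byte length as in the example above, the result could then be passed as `asr->write(bytes.data(), bytes.size())`. The exact parameter types should be checked against the SDK header.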

Use a float Array as Input

This interface accepts floating-point audio sample data. It is suitable for audio streams that have already been preprocessed and converted to the standard float format.

API write
Description Pushes a float array of audio samples to ASR
Parameters audio_data: float array containing audio sample points
Returns Returns 0 on success. Any non-zero value means the operation failed
cpp
// Use a float array as input. Any non-zero return value indicates an error.
std::vector<float> audio_;
// ... Fill the audio data here ..
int ret = asr->write(audio_);
if (ret != EXIT_SUCCESS)
{
	printf("asr->write() failure!\n");
	return EXIT_FAILURE;
}
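
If the source audio is 16-bit PCM, it has to be converted to float before it can be passed to this overload. The sketch below shows one common convention, dividing by 32768 so that samples fall in [-1.0, 1.0). The source does not state which float range the SDK expects, so this normalization is an assumption to verify, and `pcm16_to_float` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Converts 16-bit signed PCM samples to floats in [-1.0, 1.0).
// Assumption: the consumer expects samples normalized by 32768.
std::vector<float> pcm16_to_float(const std::vector<std::int16_t> &pcm) {
    std::vector<float> out;
    out.reserve(pcm.size());
    for (std::int16_t s : pcm) {
        out.push_back(static_cast<float>(s) / 32768.0f);
    }
    return out;
}
```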

Real-Time Microphone Input.audio_microphone()

This interface uses the configured microphone ID to capture audio through the microphone driver and sends the captured data directly to the ASR engine.

API audio_microphone
Description Starts the microphone device with the specified ID and begins real-time speech recognition
Parameters id: hardware device ID of the microphone. The default device is 0
Returns Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

  • This interface supports streaming mode only (TYPE_STREAM). Before calling it, you must first call set_mode(TYPE_STREAM).
  • In a terminal environment, you can stop the input stream safely by capturing Ctrl + C. If the microphone device is disconnected during capture, ASR also stops input automatically.
cpp
// After this call, the microphone device with ID 1 starts and receives speech in real time.
asr->audio_microphone(1);
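
Capturing Ctrl + C in a terminal program is usually done with a signal handler that only sets a flag. The following is a minimal sketch using the C++ standard library, and the names `g_stop`, `on_sigint`, `install_sigint_handler`, and `stop_requested` are all hypothetical. The source does not say whether audio_microphone() blocks, so where the flag gets polled (main loop or a separate thread) is left open.

```cpp
#include <cassert>
#include <csignal>

// Set by the signal handler and polled by the application.
// sig_atomic_t is the type the standard guarantees to be safe to write here.
static volatile std::sig_atomic_t g_stop = 0;

// Signal handler: only records that Ctrl + C was pressed.
void on_sigint(int) { g_stop = 1; }

// Registers the handler for SIGINT (Ctrl + C in a terminal).
void install_sigint_handler() { std::signal(SIGINT, on_sigint); }

// Returns true once SIGINT has been received.
bool stop_requested() { return g_stop != 0; }
```

Once `stop_requested()` returns true, the application would call `asr->stop()` and then release the instance.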

Stop ASR Input.stop()

This interface notifies the ASR engine that the audio stream has ended. After it is called, ASR immediately cuts off the audio input. Only the result still being processed is returned through the onStop callback, and any remaining unprocessed data is discarded.

API stop
Description Stops audio input
Parameters void
Returns Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

  • This interface is intended for interrupted output scenarios. After it is called, the internal buffer is cleared immediately and input is cut off. Only the remaining data currently being processed is returned through onStop. It is suitable for fast interruption in streaming mode.

Destroy ASR Object.asr_destroy()

When all speech recognition tasks are finished and the ASR feature is no longer needed, you must call this interface. It fully releases all related resources.

API asr_destroy
Description Completely destroys the ASR instance and releases all related resources
Parameters void
Returns Returns 0 on success. Any non-zero value means the operation failed

TTS Interfaces

This section describes the core API interfaces in AidVoice SDK for text-to-speech (TTS). With these interfaces, you can complete the full TTS workflow, from creating a TTS instance and submitting text to retrieving the generated audio.

Feature overview: The TTS module converts input text into audio files. It currently supports two mainstream inference models: melotts_chinese and melotts_english.

TTS Mode.enum TTSMode

TTSMode is used to configure how synthesized audio is returned. Based on your real-time requirements, you can choose whole-output mode or fragment-output mode.

Member Name | Type | Value | Description
TYPE_WHOLE | uint8_t | 0 | Whole output: the complete audio is returned in one callback after the full sentence is synthesized
TYPE_FRAGMENT | uint8_t | 1 | Fragment output: the text is split by punctuation or semantic pauses, and short sentences are returned as soon as they are synthesized

Important

Notes:

  • Whole output: the full input text is processed as one task. The result callback is triggered only after all audio data has been synthesized, which ensures the audio is returned as one complete output.
  • Fragment output: long text is intelligently split into multiple short sentences based on punctuation and semantic pauses. Each sentence is returned as soon as its audio has been synthesized.

Synthesized Audio Status.enum TTSStatus

TTSStatus indicates the current state of the audio result returned by the TTS engine. By checking this status, you can tell whether the returned audio is a partial segment produced during the task or the final result.

Member Name | Type | Value | Description
TYPE_PARTIAL | uint8_t | 0 | Partial synthesized audio
TYPE_FINAL | uint8_t | 1 | Complete synthesized audio

Important

Notes:

  • Fragment output (TYPE_FRAGMENT): when the result status is PARTIAL, the current audio is only one intermediate segment of the full input text, such as a short sentence. Only when the status is FINAL does it mean the synthesis task for the current sentence has fully finished.
  • Whole output (TYPE_WHOLE): in whole mode, the status of each synthesized audio result is always FINAL.
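
In fragment mode, an application that wants one continuous audio buffer has to reassemble the pieces. The sketch below shows one possible approach, under two assumptions that the source does not confirm: each result carries only its own fragment, and the seq field of TTSResult (described in the TTS Result section) increases in playback order. `Fragment` and `FragmentCollector` are hypothetical types, and `TTSStatus` is mirrored locally.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Local mirror of the documented TTSStatus values.
enum class TTSStatus : std::uint8_t { TYPE_PARTIAL = 0, TYPE_FINAL = 1 };

// Subset of the documented TTSResult fields needed for reassembly.
struct Fragment {
    TTSStatus status;
    int seq;
    std::vector<float> audio_data;
};

// Collects fragments and joins them in ascending seq order.
class FragmentCollector {
public:
    // Stores one fragment. Returns true once a FINAL fragment has been seen.
    bool add(const Fragment &f) {
        by_seq_[f.seq] = f.audio_data;
        if (f.status == TTSStatus::TYPE_FINAL) done_ = true;
        return done_;
    }

    // Concatenates all stored fragments, ordered by seq.
    std::vector<float> joined() const {
        std::vector<float> out;
        for (const auto &entry : by_seq_) {
            out.insert(out.end(), entry.second.begin(), entry.second.end());
        }
        return out;
    }

private:
    std::map<int, std::vector<float>> by_seq_;  // std::map keeps keys sorted
    bool done_ = false;
};
```

Keying by seq keeps the output ordered even if fragments were stored out of order. If the FINAL result turns out to contain the complete audio rather than the last fragment, this collector would not be needed.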

TTS Result.class TTSResult

The TTSResult structure carries synthesized audio data and its status. When the SDK finishes synthesizing a segment of audio, it returns the related audio information and audio status, either partial or final, in this structure.

Member List

TTSResult includes the following parameters:

Member status
Type TTSStatus
Default No default value
Description Status of the current returned audio
Member audio_name
Type std::string
Default Empty string
Description File name of the current output audio
Member audio_data
Type std::vector<float>
Default No default value
Description Generated raw audio data. Output format: float, mono, 44100 Hz sample rate
Member audio_time
Type double
Default 0
Description Duration of the current output audio in seconds
Member seq
Type int
Default 1
Description Indicates which segment of the synthesis sequence this returned audio block belongs to
Member id
Type int
Default 0
Description Result ID
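
Since audio_data is mono float audio at 44100 Hz, its duration and a 16-bit PCM version can be derived with a few lines of arithmetic. This is a sketch under one assumption: the float samples are normalized to [-1.0, 1.0], which the source does not state explicitly. `duration_seconds` and `float_to_pcm16` are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr double kTtsSampleRateHz = 44100.0;  // documented TTS output rate

// Duration in seconds of a mono buffer with `sample_count` samples.
double duration_seconds(std::size_t sample_count) {
    return static_cast<double>(sample_count) / kTtsSampleRateHz;
}

// Converts float samples to 16-bit signed PCM.
// Values outside [-1.0, 1.0] are clamped before scaling by 32767.
std::vector<std::int16_t> float_to_pcm16(const std::vector<float> &audio) {
    std::vector<std::int16_t> out;
    out.reserve(audio.size());
    for (float s : audio) {
        const float clamped = std::max(-1.0f, std::min(1.0f, s));
        out.push_back(static_cast<std::int16_t>(std::lround(clamped * 32767.0f)));
    }
    return out;
}
```

The computed duration can be compared with the audio_time field as a consistency check on a received result.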

TTS Error.class TTSError

The TTSError structure carries TTS error information. When an interface returns a non-success status, you can use this structure to get the specific error code and a detailed message.

Member List

TTSError includes the following parameters:

Member status
Type ResultStatus
Default ResultStatus::AV_OTHER
Description Error status
Member error_code
Type int
Default -1
Description Error code
Member message
Type std::string
Default No default value
Description Error message

TTS Callback Interface.class TTSCallbacks

TTSCallbacks is an abstract base class that defines the listener interface used by the SDK to push data to the application layer. You need to inherit from this class and implement its virtual functions so that you can asynchronously receive synthesis results or error information.

Get Synthesis Result.onResult()

This callback is triggered automatically when the TTS engine finishes processing a segment of text and generates audio.

API onResult
Description Speech synthesis result callback
Parameters result: synthesis result object that contains the current synthesized audio information and status
Returns void
API onError
Description Error callback. Used to receive and handle different exceptions that occur while TTS is running
Parameters error: error information object that contains the current error message, error code, and related details
Returns void
API onStop
Description Stop callback
Parameters result: after stop() is called, the current task stops accepting new input and returns any remaining result that is still being processed through this callback
Returns void
cpp
class TTSCallbacksImpl : public TTSCallbacks
{
public:
	void onResult(const TTSResult &result) override
	{
		std::string audio_name = result.audio_name;
		std::vector<float> audio_data = result.audio_data;
		double audio_time = result.audio_time;
		int seq = result.seq;
		int sid = result.id;
		TTSStatus status = result.status;
		printf("============callback result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "audio_name:" << audio_name << std::endl;
		std::cout << "audio_data size:" << audio_data.size() << std::endl;
		std::cout << "audio_time: " << (double)audio_time << std::endl;
		std::cout << "seq: " << (int)seq << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
	}

	void onError(const TTSError &error) override
	{
		int errCode = error.error_code;
		int errStatus = (int)error.status;
		std::string errMsg = error.message;
		printf("============error callback=================\n");
		std::cout << "errCode: " << errCode << ", errStatus: " << errStatus << std::endl;
		std::cout << "errMsg: " << errMsg << std::endl;
		printf("===========================================\n\n");
	}

	void onStop(const TTSResult &result) override
	{
		std::string audio_name = result.audio_name;
		std::vector<float> audio_data = result.audio_data;
		double audio_time = result.audio_time;
		int seq = result.seq;
		int sid = result.id;
		TTSStatus status = result.status;
		printf("============stop result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "audio_name:" << audio_name << std::endl;
		std::cout << "audio_data size:" << audio_data.size() << std::endl;
		std::cout << "audio_time: " << (double)audio_time << std::endl;
		std::cout << "seq: " << (int)seq << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
	}
	~TTSCallbacksImpl() = default;
};

TTS Core Class.class AidVoiceTTS

AidVoiceTTS is the main TTS feature class in the SDK. It manages the full lifecycle of speech synthesis. You use the interfaces provided by this class to load models, submit text, and stop synthesis tasks. This class must be initialized together with FeatureConfig.

Create Instance.create_tts()

This is the first interface you call when using the SDK. It creates and initializes a specific TTS object in memory based on the global configuration, such as feature type and model type.

API create_tts
Description Builds a specific TTS instance from the configuration object
Parameters cfg: global configuration used to specify the model type and log level
Returns An AidVoiceTTS instance pointer on success; nullptr on failure

Important

Notes:

  • Before calling this interface, make sure cfg.feature_type is set to TYPE_TTS.

Set Mode.set_mode()

This interface sets the working mode of TTS. Depending on your business needs, you can set it to whole-output mode or fragment-output mode. By default, TTS runs in whole mode (TYPE_WHOLE).

API set_mode
Description Sets the working mode of TTS
Parameters mode: the target working mode
Returns void

Set Callback.set_callback()

This interface registers a user-implemented callback listener instance with TTS. Once registered, the SDK uses this instance to asynchronously return synthesized audio information through onResult and error information through onError.

API set_callback
Description Registers a callback listener object used to receive asynchronously returned synthesized audio information
Parameters cb: pointer to an instance of a user-defined TTSCallbacks implementation
Returns void
cpp
// The callback object must be allocated on the heap.
// It will be released by AidVoice internally.
TTSCallbacksImpl *mTTSCallbacks = new TTSCallbacksImpl();

// After registration, ownership of the object is transferred to AidVoice.
tts->set_callback(mTTSCallbacks);

Important

Notes:

  • The callback instance must be allocated on the heap with new. Once set_callback is called, the pointer lifetime is managed by AidVoice internally. The SDK automatically deletes it when the TTS instance is destroyed.

Enable Speaker Playback for Synthesized Audio.set_play_audio()

This interface is used to play the speech generated by TTS. You must specify the speaker device ID. If you do not know the correct speaker device ID, see audio device lookup.

API set_play_audio
Description Plays the synthesized speech
Parameters dev_id: audio device ID
Returns void

Important

Notes:

  • Audio playback is disabled by default. You need to set an audio device ID greater than 0 to enable the audio device.

Set Output Audio File Path.set_out_audio_path()

This interface sets the save path for the synthesized audio. It can be either a relative path or an absolute path. By default, no output audio path is set, so no audio file is generated.

API set_out_audio_path
Description Save path for synthesized audio
Parameters path: audio save path
Returns 0: path set successfully; otherwise, setting failed

Get Device NPU Information.get_device_info()

This interface is used to query NPU information on the current device.

API get_device_info
Description Queries NPU information on the current device
Parameters device_info_list
Returns Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

  • You must call this interface after initialization has completed to get the current device NPU information.
cpp
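	auto tts = AidLux::AidVoice::create_tts(cfg);
	tts->init();
	// Query NPU information only after init() has completed.
	std::vector<DeviceInfo> device_info;
	int ret = tts->get_device_info(device_info);
	if (ret != EXIT_SUCCESS)
		printf("tts->get_device_info() failure!\n");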

Bind an NPU Device for Execution.set_device_info()

This interface configures the NPU binding information used by the SDK. By passing the target NPU device settings, you can bind SDK inference tasks to the selected NPU so that the model runs on the intended hardware.

API set_device_info
Description Sets the NPU device used by the SDK
Parameters device_info: DeviceInfo describing the NPU to bind
Returns Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

  • Call this interface before init(). NPU device settings applied after initialization may not take effect.
cpp
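	auto tts = AidLux::AidVoice::create_tts(cfg);
	DeviceInfo device01;
	device01.id = 0;
	device01.type = 0;
	device01.cores_num = 1;
	device01.cores_id = {0};
	// Bind the NPU before calling init().
	int ret = tts->set_device_info(device01);
	if (ret != EXIT_SUCCESS)
		printf("tts->set_device_info() failure!\n");
	tts->init();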

Initialize.init()

After the TTS object is created, you need to run initialization steps such as environment checks and resource setup.

API init
Description Completes the initialization work required for inference
Parameters void
Returns Returns 0 on successful initialization. Any non-zero value means the operation failed
cpp
// Initialize TTS. Any non-zero return value indicates an error.
int ret = tts->init();
if (ret != EXIT_SUCCESS)
{
	printf("tts->init() failure!\n");
	return EXIT_FAILURE;
}

Data Input.write()

After the TTS instance is initialized successfully (init() returns 0), use write() to submit text for synthesis. This interface accepts a string array, std::vector<std::string>, so you can submit multiple independent text items in one call. The synthesized audio is returned asynchronously.

API write
Description Submits text data to TTS. The interface accepts a string array, std::vector<std::string>, and supports submitting multiple independent text segments in one call
Parameters Array of text to synthesize. Each element in the array is treated as an independent synthesis task
Returns Returns 0 on success. Any non-zero value means the operation failed
cpp
// Use a string array as input. Any non-zero return value indicates an error.
std::vector<std::string> str_vec = {"I am a boy.", "I like Aplux."};
int ret = tts->write(str_vec);
if (ret != EXIT_SUCCESS)
{
	printf("tts->write() failure!\n");
	return EXIT_FAILURE;
}
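
Because results arrive asynchronously, the calling thread usually needs a way to wait for them before moving on. The sketch below shows one common pattern using a mutex and a condition variable. `ResultWaiter` is a hypothetical helper, not part of the AidVoice API; the idea is that a callback such as onResult would call `notify_result()`, while the thread that called write() blocks in `wait_for_results()`.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical helper (not part of the AidVoice API): counts results reported
// from a callback thread and lets the caller block until enough have arrived.
class ResultWaiter {
public:
    // Call from the callback (for example inside onResult) for each finished item.
    void notify_result() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++received_;
        }
        cv_.notify_all();
    }

    // Blocks until at least `expected` results have arrived or `timeout` has elapsed.
    // Returns true if the expected count was reached.
    bool wait_for_results(std::size_t expected, std::chrono::milliseconds timeout) {
        assert(timeout.count() >= 0);
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [&] { return received_ >= expected; });
    }

    // Number of results reported so far.
    std::size_t received() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return received_;
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t received_ = 0;
};
```

Whether onResult fires exactly once per submitted text element depends on the TTS mode, so the expected count passed to `wait_for_results()` is an assumption to verify against the behavior you observe.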

Important

Notes:

After the text is submitted, the system returns the synthesized audio stream asynchronously through the callback interface. The output audio strictly follows these specifications:

  • File container: standard WAV format
  • Sample rate: 44100 Hz
  • Channels: mono
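
If you want to store the samples delivered to the callback yourself rather than rely on set_out_audio_path(), the specification above is enough to build a WAV image by hand. The following sketch assumes the samples are floats normalized to [-1.0, 1.0] and encodes them as 16-bit PCM; both the normalization and the 16-bit depth are assumptions, since the source only states the container, sample rate, and channel count. `encode_wav_mono16` and `duration_seconds` are illustrative names, not SDK functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends the low `bytes` bytes of `value` in little-endian order.
static void put_le(std::vector<std::uint8_t> &out, std::uint32_t value, int bytes) {
    for (int i = 0; i < bytes; ++i)
        out.push_back(static_cast<std::uint8_t>((value >> (8 * i)) & 0xFFu));
}

// Appends a four-character chunk tag such as "RIFF".
static void put_tag(std::vector<std::uint8_t> &out, const char *tag) {
    out.insert(out.end(), tag, tag + 4);
}

// Encodes mono float samples as a 16-bit PCM WAV image.
// Assumption: samples are normalized to [-1.0, 1.0]; values outside are clamped.
std::vector<std::uint8_t> encode_wav_mono16(const std::vector<float> &samples,
                                            std::uint32_t sample_rate = 44100) {
    assert(sample_rate > 0);
    const std::uint32_t data_bytes = static_cast<std::uint32_t>(samples.size() * 2);
    std::vector<std::uint8_t> out;
    out.reserve(44 + data_bytes);

    put_tag(out, "RIFF");
    put_le(out, 36 + data_bytes, 4);  // RIFF chunk size
    put_tag(out, "WAVE");
    put_tag(out, "fmt ");
    put_le(out, 16, 4);               // fmt chunk size for PCM
    put_le(out, 1, 2);                // audio format 1 = PCM
    put_le(out, 1, 2);                // channels: mono
    put_le(out, sample_rate, 4);
    put_le(out, sample_rate * 2, 4);  // byte rate = rate * channels * 2 bytes
    put_le(out, 2, 2);                // block align = channels * 2 bytes
    put_le(out, 16, 2);               // bits per sample
    put_tag(out, "data");
    put_le(out, data_bytes, 4);

    for (float s : samples) {
        const float clamped = std::max(-1.0f, std::min(1.0f, s));
        const std::int16_t pcm = static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
        put_le(out, static_cast<std::uint16_t>(pcm), 2);
    }
    return out;
}

// Duration in seconds of a mono buffer at the given sample rate.
double duration_seconds(std::size_t sample_count, std::uint32_t sample_rate = 44100) {
    assert(sample_rate > 0);
    return static_cast<double>(sample_count) / sample_rate;
}
```

The returned byte vector can be written to disk with a binary std::ofstream. At 44100 Hz mono, one second of audio is 44100 samples, which gives a quick sanity check on the size of the data a callback receives.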

Stop TTS Input.stop()

This interface notifies the TTS engine that the input stream has ended. After it is called, TTS immediately cuts off the input data. Only the result still being processed is returned through the onStop callback, and any remaining unprocessed data is discarded.

API stop
Description Formally closes the text input stream
Parameters void
Returns Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

  • This interface is intended for interrupting output quickly, for example in streaming-like output scenarios. Calling it cuts off input and clears the internal buffer immediately. Only the data currently being processed is returned through onStop.
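
The interruption semantics can be pictured with a small toy model. This is a sketch of the behavior as described above, not the SDK's implementation: `StopModel` is a hypothetical class, and the rule that write() is rejected after stop() is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Toy model only: a hypothetical stand-in for the documented stop() behavior.
class StopModel {
public:
    // Queues a text item. Assumption: input is rejected once stop() has been called.
    bool write(std::string text) {
        if (stopped_) return false;
        pending_.push_back(std::move(text));
        return true;
    }

    // Moves the next pending item into the "being processed" slot.
    void begin_next() {
        assert(!has_in_flight_);
        if (!pending_.empty()) {
            in_flight_ = std::move(pending_.front());
            pending_.pop_front();
            has_in_flight_ = true;
        }
    }

    // Cuts off input and discards everything still pending. If an item was
    // being processed, stores it in `last` (what onStop would report) and
    // returns true; otherwise returns false.
    bool stop(std::string &last) {
        stopped_ = true;
        pending_.clear();
        if (!has_in_flight_) return false;
        last = std::move(in_flight_);
        has_in_flight_ = false;
        return true;
    }

    // Number of items waiting that have not started processing.
    std::size_t pending() const { return pending_.size(); }

private:
    std::deque<std::string> pending_;
    std::string in_flight_;
    bool has_in_flight_ = false;
    bool stopped_ = false;
};
```

In this model, submitting three items, starting the first, and then calling stop() reports only the first item and drops the other two, which mirrors why stop() should not be called until the results you need have been delivered.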

Destroy TTS Object.tts_destroy()

When all audio synthesis tasks have finished and the application no longer needs the TTS feature during its lifecycle, you must call this interface. It fully releases the system resources used by the SDK.

API tts_destroy
Description Completely destroys the TTS instance and releases all related resources
Parameters void
Returns Returns 0 on success. Any non-zero value means the operation failed

Other Methods

In addition to the inference-related interfaces described above, the AidVoice SDK also provides the following helper interfaces.

Get Microphone List.show_microphone_dev()

Before calling audio_microphone(), it is recommended to call this interface first to list the available audio input devices on the current system, so you can get the correct device ID.

API show_microphone_dev
Description Lists all available microphone hardware devices in the system. This interface prints the device name and its corresponding ID to standard output or the logging system
Parameters void
Returns No return value

Get Current AidVoice SDK Version.get_library_version()

Gets version information for the current AidVoice SDK.

API get_library_version
Description Gets version information for the current AidVoice SDK
Parameters void
Returns string: version information

Get Current Log Level.get_log_level()

API get_log_level
Description Gets the current log level
Parameters void
Returns LogLevel: log level

Set Log Level.set_log_level()

API set_log_level
Description Sets the log level
Parameters LogLevel: log level
Returns Returns 0 by default

Output Logs to the Console.log_to_console()

API log_to_console
Description Sends log output to the standard error console
Parameters void
Returns Returns 0 by default

Output Logs to a Text File.log_to_file()

API log_to_file
Description Sends log output to the specified text file
Parameters path_and_prefix: path and filename prefix for log files
also_to_console: whether to also output logs to stderr. The default value is false
Returns Returns 0 on success. Any non-zero value means the operation failed

AidVoice C++ Sample Programs

AidVoice ASR Audio Recognition Sample

Using audio transcription as an example, a typical C++ sample for ASR includes the following parts:

cpp
// Global configuration
AidLux::AidVoice::FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_ASR;
cfg.model_path = "model_path";

// Build the ASR object
auto asr = AidLux::AidVoice::create_asr(cfg);
if (!asr)
{
	printf("create_asr failure!\n");
	return EXIT_FAILURE;
}

// Inherit the callback interface
class ASRCallbacksImpl : public ASRCallbacks
{
public:
	void onResult(const AsrResult &result) override
	{
		std::string asrResult = result.text;
		int sid = result.id;
		AsrStatus status = result.status;
		printf("============callback result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "asrResult: \n"
				  << asrResult << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
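		total_echo = sid; // total_echo is not defined in this excerpt; it is assumed to be declared elsewhere in the full sample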
	}

	void onError(const AsrError &error) override
	{
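		int errCode = error.error_code;
		int errStatus = (int)error.status;
		std::string errMsg = error.message;
		printf("============error callback=================\n");
		std::cout << "errCode: " << errCode << std::endl;
		std::cout << "errStatus: " << errStatus << std::endl;
		std::cout << "errMsg: " << errMsg << std::endl;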
		printf("===========================================\n\n");
	}

	void onStop(const AsrResult &result) override
	{
		std::string asrResult = result.text;
		int sid = result.id;
		AsrStatus status = result.status;
		printf("============stop result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "asrResult: \n"
				  << asrResult << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
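		total_echo = sid; // total_echo is not defined in this excerpt; it is assumed to be declared elsewhere in the full sample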
	}
	~ASRCallbacksImpl() = default;
};

// Create the callback object and register it
ASRCallbacksImpl *mASRCallbacks = new ASRCallbacksImpl();
asr->set_callback(mASRCallbacks);

// Initialize the ASR object
int ret = asr->init();
if (ret != EXIT_SUCCESS)
{
	printf("asr->init() failure!\n");
	return EXIT_FAILURE;
}

// Send audio file data (wave_path: path to a 16 kHz, mono, 16-bit audio file)
ret = asr->write(wave_path);
if (ret != EXIT_SUCCESS)
{
	printf("asr->write() failure!\n");
	return EXIT_FAILURE;
}

// Stop input
ret = asr->stop();
if (ret != EXIT_SUCCESS)
{
	printf("asr->stop() failure!\n");
	return EXIT_FAILURE;
}

// Destroy the object
ret = asr->asr_destroy();
if (ret != EXIT_SUCCESS)
{
	printf("asr->asr_destroy() failure!\n");
	return EXIT_FAILURE;
}

AidVoice TTS Audio Synthesis Sample

A typical C++ sample for audio synthesis includes the following parts:

cpp
// Global configuration
AidLux::AidVoice::FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_TTS;
cfg.model_path = "model_path"; // You can set the model path here or pass it through command-line arguments

// Build the TTS object
auto tts = AidLux::AidVoice::create_tts(cfg);
if (!tts)
{
	printf("create tts failure!\n");
	return EXIT_FAILURE;
}

// Set the TTS working mode
tts->set_mode(TTSMode::TYPE_WHOLE);

// Inherit the callback interface
class TTSCallbacksImpl : public TTSCallbacks
{
public:
	void onResult(const TTSResult &result) override
	{
		std::string audio_name = result.audio_name;
		std::vector<float> audio_data = result.audio_data;
		double audio_len = result.audio_len;
		int seq = result.seq;
		int sid = result.id;
		TTSStatus status = result.status;
		printf("============callback result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "audio_name:" << audio_name << std::endl;
		std::cout << "audio_data size:" << audio_data.size() << std::endl;
		std::cout << "audio_len: " << (double)audio_len << std::endl;
		std::cout << "seq: " << (int)seq << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
	}

	void onError(const TTSError &error) override
	{
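		int errCode = error.error_code;
		int errStatus = (int)error.status;
		std::string errMsg = error.message;
		printf("============error callback=================\n");
		std::cout << "errCode: " << errCode << std::endl;
		std::cout << "errStatus: " << errStatus << std::endl;
		std::cout << "errMsg: " << errMsg << std::endl;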
		printf("===========================================\n\n");
	}

	void onStop(const TTSResult &result) override
	{
		std::string audio_name = result.audio_name;
		std::vector<float> audio_data = result.audio_data;
		double audio_len = result.audio_len;
		int seq = result.seq;
		int sid = result.id;
		TTSStatus status = result.status;
		printf("============stop result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "audio_name:" << audio_name << std::endl;
		std::cout << "audio_data size:" << audio_data.size() << std::endl;
		std::cout << "audio_len: " << (double)audio_len << std::endl;
		std::cout << "seq: " << (int)seq << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
	}
	~TTSCallbacksImpl() = default;
};

// Create the callback object and register it
TTSCallbacksImpl *mTTSCallbacks = new TTSCallbacksImpl();
tts->set_callback(mTTSCallbacks);

// Initialize the TTS object
int ret = tts->init();
if (ret != EXIT_SUCCESS)
{
	printf("tts->init() failure!\n");
	return EXIT_FAILURE;
}

// Send text for synthesis
std::vector<std::string> str_vec = {"This is an example of text to speech using Melo for English. How does it sound?"};
ret = tts->write(str_vec);
if (ret != EXIT_SUCCESS)
{
	printf("tts->write() failure!\n");
	return EXIT_FAILURE;
}

// Stop input
ret = tts->stop();
if (ret != EXIT_SUCCESS)
{
	printf("tts->stop() failure!\n");
	return EXIT_FAILURE;
}

// Destroy the object
ret = tts->tts_destroy();
if (ret != EXIT_SUCCESS)
{
	printf("tts->tts_destroy() failure!\n");
	return EXIT_FAILURE;
}

Important

More usage examples are available in the following locations:

  • C++ ASR sample path: /usr/local/share/aidvoice/examples/asr/cpp/
  • C++ TTS sample path: /usr/local/share/aidvoice/examples/tts/cpp/

This completes the full list of interfaces provided by the AidVoice SDK.