Get started with Gemma 3 LLM on Android now!

This article gets you started with the new Gemma 3 model for on-device inference, giving you the simplest steps to start building with AI on Android.
Gemma 3 1B is a new model in the Gemma family of open-weight models. When deploying a small language model (SLM) in a production setup, the model must be small enough to download quickly, run fast enough to hold the user's attention, and support a wide range of end-user devices.
Gemma 3 1B is only 529MB in size and, running through Google AI Edge's LLM inference, can reach up to 2,585 tokens per second on prefill, which makes it possible to process a page of content in under a second.
By including Gemma 3 1B in your app, you can use natural language to drive the app or generate content from in-app data or context, all fully customizable and tunable.
On-device inference is a superpower for your app: you get to leverage AI without any data leaving the device and without a network connection.
This article shows you the steps needed to get Gemma 3 into an application and up and running. Think of this post as a stepping stone in your AI journey: you need something to get started with, and once you have this prototype built you can keep iterating on the idea and take the next steps to really get the benefit of AI.
Let’s get started.
Using any LLM on Android requires several steps, and although this tutorial uses Gemma 3, the steps are much the same whichever model you choose.
- Choose your model. We will use Gemma 3 1B IT.
- "1B" refers to 1 billion parameters. In machine learning, parameters are the variables a model learns from its training data. Generally, the more parameters a model has, the more complex the patterns it can learn, but more parameters also means a larger model file, so it's a trade-off.
- "IT" refers to "instruction tuned", meaning the model has been fine-tuned (on top of pretraining) to follow instructions and hold conversations, while "PT" means pretrained only, i.e. the initial training without that fine-tuning. (You can find both variants in the model community on Hugging Face.)
- The model is too big to bundle inside the APK, so you download it after installation. (Seeing a 700MB+ app in the store listing is not a good look.) The model is packaged as a *.task file.
- After downloading, use MediaPipe to run inference on the model. (MediaPipe makes it easy to integrate pre-trained machine learning models, including the components used in an LLM pipeline, into your application.)
OK, let’s do it!
…Well, we've already chosen Gemma 3. But seriously, Gemma 3 is a powerful on-device LLM and is a great fit for running on mobile devices.
Download Gemma 3
In a production application you may or may not have customized the LLM, but either way it is recommended to serve the model file from your own server. You may also want to compress it before serving it, which saves around 100MB.
In this example, we will download the Gemma 3 model directly from Hugging Face. You can do it with a URL like this: https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/gemma3-1b-it-int4.task. If you just click the URL, you'll notice it fails: first, you need an API token!
Register with Hugging Face and go to your access token settings to create a token. Give it permission to “read access to contents of all public gated repos you can access”.
Once you have the token, add the key/value pair to your project's local.properties file. (An alternative is to use an environment variable for the build.)
huggingface.token = hf_rR12312MADEUP3423423rsdfsdf
We can then load this token from local.properties into our BuildConfig. Edit your application's build.gradle to add the following in the defaultConfig block (code here):
val localProperties = Properties().apply {
val localPropertiesFile = rootProject.file("local.properties")
load(localPropertiesFile.inputStream())
}
val huggingFaceDownloadApiKey = localProperties.getProperty("huggingface.token")
buildConfigField("String", "HUGGINGFACE_DOWNLOAD_API_KEY", "\"$huggingFaceDownloadApiKey\"")
And if you haven't already, don't forget to enable BuildConfig:
buildFeatures {
buildConfig = true
}
While we're in this file, we may as well add the required dependencies (use a version catalog toml file in your production application) (code here):
implementation("com.google.mediapipe:tasks-genai:0.10.22")
implementation("com.squareup.okhttp3:okhttp:4.12.0")
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.9.0")
implementation("androidx.lifecycle:lifecycle-viewmodel-compose:2.8.7")
- MediaPipe is used to run inference with the Gemma 3 model. Note that the official documentation still points to 0.10.14, which is out of date and does not support Gemma 3.
- OkHttp & coroutines are used to download the model.
- ViewModel-Compose lets us observe state from our ViewModel.
One last piece of setup. With MediaPipe and Android 12, you need to declare the native libraries your app may load, so add these inside the application node of your manifest: (here)
android:name="libOpenCL.so"
android:required="false"
/>android:name="libOpenCL-car.so"
android:required="false"
/>android:name="libOpenCL-pixel.so"
android:required="false"
/>
And don't forget the INTERNET permission! (A classic mistake.)
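If you need the reminder, that's this line in the manifest:

<uses-permission android:name="android.permission.INTERNET" />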
OK, that’s all the setup required, now we just need to download the model file and run some LLM requests.
The download code
This tutorial isn't about best practices; it uses bare-bones Android code so that you can get started with an LLM. With that in mind, here is the code to download the model. It is downloaded to the app's private storage, and the download is skipped if the file already exists, so it should only happen once: (code here)
/**
* Ends up at: /data/user/0/com.blundell.tut.gemma3/files/gemma3-1b-it-int4.task
*/
class GemmaDownload(
private val huggingFaceToken: String,
private val okHttpClient: OkHttpClient,
) {
fun downloadGemmaModel(directory: File): Flow<DownloadResult> = flow {
val url = "https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/gemma3-1b-it-int4.task?download=true"
val fileName = "gemma3-1b-it-int4.task"
val file = File(directory, fileName)
if (file.exists()) {
Log.d("TUT", "File already exists, skipping download.")
emit(DownloadResult.Success(file))
return@flow // Skip the download
}Log.d("TUT", "Download starting!")
try {
val response = okHttpClient
.newCall(
Request.Builder()
.url(url)
.header("Authorization", "Bearer $huggingFaceToken")
.build()
)
.execute()
Log.d("TUT", "Download ended!")
if (!response.isSuccessful) {
Log.e("TUT", "Download Not successful.")
emit(DownloadResult.Error("Download failed: ${response.code}"))
return@flow
}
val source = response.body?.source()
if (source == null) {
emit(DownloadResult.Error("Empty response body"))
} else {
file.sink().buffer().use { sink ->
source.readAll(sink)
}
Log.d("TUT", "Success!")
emit(DownloadResult.Success(file))
}
} catch (e: IOException) {
Log.e("TUT", "Download IO Error", e)
emit(DownloadResult.Error("Network error: ${e.message}"))
} catch (e: Exception) {
Log.e("TUT", "Download General Error", e)
emit(DownloadResult.Error("An unexpected error occurred: ${e.message}"))
}
}
}
sealed class DownloadResult {
data class Success(val file: File) : DownloadResult()
data class Error(val message: String) : DownloadResult()
}
The code above creates an OkHttp request to the Hugging Face URL for Gemma 3 1B (int4 is the precision of the weights, which helps keep the model file small), using your Hugging Face API token as the Authorization bearer. Whether it fails or succeeds, the Flow emits the result.
In our example, we download the model when the ViewModel is created, i.e. when the user first enters the screen: (code here)
class MainViewModel(
gemmaDownload: GemmaDownload,
application: Application,
) : ViewModel() {
private val _mainState: MutableStateFlow<MainState> = MutableStateFlow(MainState.Idle)
val mainState: StateFlow<MainState> = _mainState
init {
viewModelScope.launch {
_mainState.value = MainState.LoadingModel
gemmaDownload
.downloadGemmaModel(application.filesDir)
.flowOn(Dispatchers.IO)
.collect { result ->
when (result) {
is DownloadResult.Error -> {
Log.e("TUT", "Model download error ${result.message}.")
_mainState.value = MainState.Error
}
is DownloadResult.Success -> {
Log.d("TUT", "Model downloaded successfully to ${result.file}.")
// TODO we have a model now, lets use it!
}
}
}
}
}
Next, we observe these changes and update our app state accordingly.
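The MainState type itself isn't shown in this post; here is a minimal sketch of what it could look like, inferred from how it is used in the ViewModel and UI below (treat it as an assumption, not the exact code from the repo):

// Assumed shape of MainState, inferred from how it is used in this post.
sealed interface MainState {
    object Idle : MainState
    object LoadingModel : MainState
    object Error : MainState
    data class LoadedModel(
        val llmSession: LlmInferenceSession,
        val latestResponse: String,
        val responding: Boolean,
    ) : MainState
}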
Now, in the DownloadResult.Success branch, we know we have a model downloaded locally and we need to load it into MediaPipe. Once MediaPipe is configured, we can update the state: (here)
Log.d("TUT", "Model downloaded successfully to ${result.file}.")
// Set the configuration options for the LLM Inference task
val inferenceOptions = LlmInference.LlmInferenceOptions.builder()
.setModelPath(result.file.path)
.setMaxTokens(1000) // This could be up to 32,768 with Gemma 3 1B
.setPreferredBackend(LlmInference.Backend.CPU) // To work on the emulator
.build()
// Create an instance of the LLM Inference task
val llmInference = LlmInference.createFromOptions(application, inferenceOptions)
val sessionOptions = LlmInferenceSession.LlmInferenceSessionOptions.builder()
.setTemperature(0.8f) // Temperature controls how creative the answers can be
.setTopK(40) // Select from the top 40 possible next tokens
.setTopP(0.95f) // Also helps with creativity, consider the most probable tokens whose combined probability adds up to 95%
.build()
val llmInferenceSession = LlmInferenceSession.createFromOptions(llmInference, sessionOptions)
_mainState.emit(
MainState.LoadedModel(
llmSession = llmInferenceSession,
latestResponse = "",
responding = false,
)
)
We set up the LlmInferenceOptions, then create an LlmInference, then create the LlmInferenceSessionOptions, and finally create an LlmInferenceSession from all of the previous configuration. We keep hold of the session because that is what we use to send inputs (and get outputs back).
Once LoadedModel has been emitted, we can show a UI with a TextField and a send button, allowing us to send queries. When the button is pressed, we use the LlmInferenceSession to send whatever the user entered: (code here)
fun sendQuery(inputPrompt: String) {
val state = _mainState.value
if (state !is MainState.LoadedModel) {
throw IllegalStateException("Cannot send query without a loaded model. Handle this better in a 'real' app.")
}
// Clear the previous answer
_mainState.value = state.copy(
latestResponse = "",
responding = true,
)
val llmInferenceSession = state.llmSession
llmInferenceSession.addQueryChunk(inputPrompt)
llmInferenceSession.generateResponseAsync { partialResult, done ->
val currentState = _mainState.value
if (currentState !is MainState.LoadedModel) {
throw IllegalStateException("Cannot send query without a loaded model. Handle this better in a 'real' app.")
}
val response = currentState.latestResponse + partialResult
if (done) {
Log.d("TUT", "Full response: $response")
_mainState.value = currentState.copy(
latestResponse = response,
responding = false,
)
} else {
_mainState.value = currentState.copy(
latestResponse = response,
)
}
}
}
Calling session.addQueryChunk(inputPrompt) lets us send the query to the LLM, and the callback passed to generateResponseAsync {} lets us observe the response and act on it. Note that the response arrives in chunks; you could wait until all the chunks have arrived before showing anything, but this example takes each chunk as it arrives and updates the UI as we go.
Finally, your UI needs to observe the state changes from the ViewModel and update accordingly. This tutorial does this very simply in a single composable; don't take it as best practice 🙂 (code here):
@Composable
internal fun MainScreen(viewModel: MainViewModel) {
val mainState by viewModel.mainState.collectAsStateWithLifecycle()
when (val state = mainState) {
is MainState.Error -> {
Text("Something went wrong, check LogCat.")
}
is MainState.Idle -> {
Text("Hello World")
}
is MainState.LoadedModel -> {
val scrollableState = rememberScrollState()
Column(
modifier = Modifier
.verticalScroll(scrollableState)
.padding(8.dp)
.fillMaxSize()
) {
val latestResponse = state.latestResponse
if (latestResponse.isNotEmpty()) {
Text("Latest response: ")
Text(latestResponse)
}
var text by remember { mutableStateOf("") }
Spacer(
modifier = Modifier
.weight(1f)
)
Spacer(
modifier = Modifier
.padding(8.dp)
)
Text("Enter a query")
TextField(
value = text,
onValueChange = { newText -> text = newText },
label = { Text("Enter text") },
modifier = Modifier
.fillMaxWidth()
)
Spacer(
modifier = Modifier
.padding(4.dp)
)
Button(
onClick = { viewModel.sendQuery(text) },
enabled = !state.responding,
modifier = Modifier
.fillMaxWidth()
) {
Text("Send")
}
}
}
is MainState.LoadingModel -> {
Column(
verticalArrangement = Arrangement.Center,
horizontalAlignment = Alignment.CenterHorizontally,
modifier = Modifier
.padding(8.dp)
.fillMaxSize()
) {
Text("Loading!")
CircularProgressIndicator()
}
}
}
}
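For completeness, here is a rough sketch of how an Activity could glue all of this together. This wiring isn't shown in the post, so treat it as an assumption; creating the ViewModel directly like this is fine for a demo, but it won't survive configuration changes (use a ViewModelProvider.Factory or dependency injection in a real app).

// Demo-only wiring sketch; not taken from the tutorial's repository.
class MainActivity : ComponentActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // Build the downloader with the Hugging Face token from BuildConfig.
        val gemmaDownload = GemmaDownload(
            huggingFaceToken = BuildConfig.HUGGINGFACE_DOWNLOAD_API_KEY,
            okHttpClient = OkHttpClient(),
        )
        // Created by hand for simplicity rather than via a ViewModel factory.
        val viewModel = MainViewModel(gemmaDownload, application)
        setContent {
            MainScreen(viewModel)
        }
    }
}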
Conclusion
That's it! You have downloaded the Gemma 3 1B model to your device, and you can now run a local, offline, private LLM in your app!
All the code is available on GitHub. If you want to see the diff that matches this blog post, look at the last commit: /commit/c6f07262309e8d9cc7bfa5635b4967823b9d3ca7
One last thing I should say! This tutorial shows you how to load and interact with Gemma 3 as an LLM. Just because the demo looks like a chatbot doesn't mean you should build a chatbot: use the model for text classification, information extraction, closed question answering, summarization and anything else you can think of! It doesn't have to be a chatbot. Enjoy!
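As a tiny parting example, here is a hedged sketch of reusing the same LlmInferenceSession for summarisation rather than chat (the helper and prompt are made up for illustration, using the addQueryChunk and generateResponseAsync calls shown earlier):

// Illustrative only: a hypothetical helper that summarises in-app text.
fun summarise(session: LlmInferenceSession, articleText: String, onResult: (String) -> Unit) {
    val response = StringBuilder()
    session.addQueryChunk("Summarise the following text in two sentences:\n$articleText")
    session.generateResponseAsync { partialResult, done ->
        // Accumulate each chunk; deliver the full text once generation is done.
        response.append(partialResult)
        if (done) onResult(response.toString())
    }
}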
