love this. So many layers deep, we just had a good laugh.
deepdarkforest 19 hours ago [-]
> However, both are platform-specific and only support specific models from the company
This is not true, as you are for sure aware. Google AI Edge supports a lot of models, including any LiteRT model from Hugging Face, PyTorch ones, etc. [0]. Additionally, it's not even platform-specific; it works on iOS [1].
Why lie? I understand that your framework does more stuff like MCP, but I'm sure that's coming for Google's as well. I guess if the UX is really better it can work, but I would also say Ollama's use cases are quite different because on desktop there's a big community of hobbyists that cook up their own little pipelines/just chat to LLMs with local models (apart from the desktop app devs). But on phones, imo that segment is much smaller. App devs are more likely to use the 1st party frameworks, rather than 3rd party. I wouldn't even be surprised if Apple locks down some APIs at some point for safety/security reasons.
Thanks for the feedback. You're right to point out that Google AI Edge is cross-platform and more flexible than our phrasing suggested.
The core distinction is in the ecosystem: Google AI Edge runs tflite models, whereas Cactus is built for GGUF. This is a critical difference for developers who want to use the latest open-source models.
One major outcome of this is model availability. New open source models are released in GGUF format almost immediately. Finding or reliably converting them to tflite is often a pain. With Cactus, you can run new GGUF models on the day they drop on Huggingface.
Quantization level also plays a role. GGUF has mature support for quantization far below 8-bit. This is effectively essential for mobile. Sub-8-bit support in TFLite is still highly experimental and not broadly applicable.
Last, Cactus excels at CPU inference. While tflite is great, its peak performance often relies on specific hardware accelerators (GPUs, DSPs). GGUF is designed for exceptional performance on standard CPUs, offering a more consistent baseline across the wide variety of devices that app developers have to support.
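To make the quantization point concrete, here is a rough back-of-the-envelope sketch of weight memory at different bit widths. It assumes size ≈ parameters × bits per weight / 8; real GGUF files carry extra metadata and per-block scales, and inference needs additional RAM for the KV cache, so treat the numbers as lower bounds. The function names are illustrative and not part of any library.

```typescript
// Rough weight-memory estimate for a quantized model.
// Assumption: size ≈ parameters × bits-per-weight / 8. Block-quantized formats
// store scales alongside the weights, so real files are somewhat larger.
function approxWeightBytes(paramCount: number, bitsPerWeight: number): number {
  return (paramCount * bitsPerWeight) / 8;
}

function toGiB(bytes: number): number {
  return bytes / (1024 * 1024 * 1024);
}

const sevenB = 7e9; // a nominal "7B" model
const fp16GiB = toGiB(approxWeightBytes(sevenB, 16)); // about 13.0 GiB
const int8GiB = toGiB(approxWeightBytes(sevenB, 8)); // about 6.5 GiB
const q4GiB = toGiB(approxWeightBytes(sevenB, 4)); // about 3.3 GiB
```

Under these assumptions, only the 4-bit variant of a 7B model comes close to fitting alongside the OS and other apps on a phone with 8 GB of RAM.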
deepdarkforest 18 hours ago [-]
No worries.
GGUF is more suitable for the latest open-source models, I agree there. Q2/Q4 quantization will probably be critical as well, if we don't see a jump in RAM. But then again, I wonder when/if MediaPipe will support GGUF as well.
PS, I see you are in the latest YC batch? (below you mentioned BF). Good luck and have fun!
blks 6 hours ago [-]
First paragraph reads like chat gpt response.
poly2it 4 hours ago [-]
Not just the first paragraph, the whole response reads like LLM output.
DarmokJalad1701 18 hours ago [-]
I would say that while Google's MediaPipe can technically run any tflite model, it turned out to be a lot more difficult to do in practice with third-party models compared to the "officially supported" models like Gemma-3n. I was trying to set up a VLM inference pipeline using a SmolVLM model. Even after converting it to a tflite-compatible binary, I struggled to get it working, and once it did work, it was super slow and was obviously missing some hardware acceleration.
I have not looked at OP's work yet, but if it makes the task easier, I would opt for that instead of Google's "MediaPipe" API.
pj_mukh 11 hours ago [-]
Does Google AI Edge have React Native support? Doesn't seem like it? Cactus does though.
awaseem 39 minutes ago [-]
This is actually crazy. The API is so simple! I tried to do this on Swift using LLM.swift and it went okay, excited to try this on RN
azinman2 10 hours ago [-]
“Is available in Flutter, React-Native & Kotlin Multi-platform for cross-platform developers, since most apps are built with these today.”
Is this really true? Where are these stats coming from?
pzo 7 hours ago [-]
Probably they mean new apps. Since Kotlin Multiplatform on Android is just native Android, and Android's share is like 70% of devices, that is already at least 50% market share of mobile apps. If you add Flutter and React Native there is not much left: only games built on engines like Unity and Unreal. I see far fewer iOS jobs these days.
anupj 5 hours ago [-]
Running LLMs, VLMs, and TTS models locally on smartphones is quietly redefining what 'edge AI' means: suddenly, the edge is in your pocket, not just at the network boundary. The next wave of apps will be built by those who treat mobile as the new AI server.
neurostimulant 6 hours ago [-]
This is great!
It would be great if the local LLM had access to local tools you can enable/disable as needed (e.g. via customizable profiles). Simple tools such as fetch URL, file access, messaging, calendar, etc. would be very useful, though I'm not sure if the input token limit is large enough to allow this. Even better if it can somehow do web search, but I understand it would be hard to do for free.
Also, how cool would it be if you could expose an OpenAI-compatible API that can be accessed from other devices on your local network? Imagine turning your old phones into local LLM servers. That would be very cool.
By the way, I can't figure out how to clear previous chat data. Is it hidden somewhere?
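For readers wondering what the "OpenAI-compatible API on the local network" idea would look like from the client side, here is a minimal sketch that only builds the request. The `/v1/chat/completions` path and the `model`/`messages` body follow the OpenAI chat completions format; the phone's address, port, and model name are made up, and nothing in this thread says Cactus exposes such a server.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  method: "POST";
  headers: { [name: string]: string };
  body: string;
}

// Builds (but does not send) a chat-completions request for a server that
// speaks the OpenAI wire format. baseUrl is a hypothetical phone on the LAN.
function buildChatRequest(baseUrl: string, model: string, messages: ChatMessage[]): ChatRequest {
  const trimmed = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: trimmed + "/v1/chat/completions",
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: model, messages: messages }),
  };
}

// Hypothetical address and model name, for illustration only.
const chatReq = buildChatRequest("http://192.168.1.50:8080/", "local-model", [
  { role: "user", content: "Hello" },
]);
```

Any existing OpenAI client library could then be pointed at that base URL, which is the main appeal of the suggestion.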
throw777373 18 hours ago [-]
Ollama runs on Android just fine via Termux. I use it with 5GB models. They even recently added an ollama package, so there is no longer any need to compile it from source.
v5v3 6 hours ago [-]
Didn't know that. Thanks
rshemet 18 hours ago [-]
True - but Cactus is not just an app.
We are a dev toolkit to run LLMs cross-platform locally in any app you like.
jadbox 18 hours ago [-]
How does it work? How does one model on the device get shared with many apps? Does each app have its own inference SDK running, or is there one inference engine shared by many apps (like Ollama does)? If it's the latter, what's the communication protocol to the inference engine?
rshemet 17 hours ago [-]
Great question. Currently, each app is sandboxed - so each model file is downloaded inside each app's sandbox. We're working on enabling file sharing across multiple apps so you don't have to redownload the model.
With respect to the inference SDK, yes you'll need to install the (react native/flutter) framework inside each app you're building.
The SDK is very lightweight (our own iOS app is <30MB which includes the inference SDK and a ton of other stuff)
pogue 10 hours ago [-]
I would like to see it as an app, tbh! If I could run it as an APK with a nice GUI interface for picking different models to run, that would be a killer feature.
pzo 7 hours ago [-]
Is this using only llama.cpp as the inference engine? How is NPU and GPU support there these days? Not sure if LLMs can run on an NPU, but many models like STT, TTS and vision can often run much faster on the Apple NPU.
Right now we have a desktop version with ollama support, but we want to build a mobile chromium fork with local LLM support. Will check out cactus!
rshemet 19 hours ago [-]
great stuff. (good timing for a post given all the comet news too :) )
DM me on BF - let's talk!
smcleod 16 hours ago [-]
FYI I see you have SmolLM2, this was replaced with SmolLM 3 this week!
Would be great to have a few larger models to choose from too, Qwen 3 4b, 8b etc
v5v3 6 hours ago [-]
Do the community tools in Ollama work in Cactus? (Just python scripts I think).
pj_mukh 17 hours ago [-]
Amazing, this is so so useful.
Thank you especially for the phone model vs tok/s breakdown. Do you have such tables for more models? For models even leaner than Gemma3 1B. How low can you go? Say if I wanted to tweak out 45toks/s on an iPhone 13?
P.S: Also, I'm assuming the speeds stay consistent with react-native vs. flutter etc?
rshemet 17 hours ago [-]
thank you! We're continuing to add performance metrics as more data comes in.
A Qwen 2.5 500M will get you to ≈45 tok/sec on an iPhone 13. Inference speed is roughly inversely proportional to model size.
Yes, speeds are consistent across frameworks, although (and don't quote me on this), I believe React Native is slightly slower because it interfaces with the C++ engine through a set of bridges.
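As a rough illustration of that rule of thumb, the sketch below scales the single data point quoted above (about 45 tok/s for a ~500M model on an iPhone 13) to other model sizes. It assumes throughput is exactly inversely proportional to parameter count, which the reply itself only claims approximately; quantization, memory bandwidth, and thermal throttling will all move real numbers.

```typescript
// Assumption: tok/s ∝ 1 / parameter count, calibrated on one reference point.
function estimateTokPerSec(refParams: number, refTokPerSec: number, targetParams: number): number {
  if (refParams <= 0 || targetParams <= 0) {
    throw new Error("parameter counts must be positive");
  }
  return refTokPerSec * (refParams / targetParams);
}

// Reference point from the thread: ~45 tok/s at ~0.5B parameters on iPhone 13.
const est1B = estimateTokPerSec(0.5e9, 45, 1e9); // 22.5
const est4B = estimateTokPerSec(0.5e9, 45, 4e9); // 5.625
```

These are estimates to sanity-check expectations, not measurements.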
pickettd 16 hours ago [-]
I also want to add on that I really appreciate the benchmarks.
When I was working with RAG llama.cpp through RN early last year I had pretty acceptable tok/sec results up through 7-8b quantized models (on phones like the S24+ and iPhone 15pro). MLC was definitely higher tok/sec but it is really tough to beat the community support and availability in the gguf ecosystem.
Reebz 16 hours ago [-]
Looking at the current benchmarks table, I was curious: what do you think is wrong with Samsung S25 Ultra?
Most of the standard mobile CPU benchmarks (GeekBench, AnTuTu, et al) show a 20-40% performance gain over S23/S24 Ultra. Also, this bucks the trend where most other devices are ranked appropriately (i.e. newer devices perform better).
Thanks for sharing your project.
rshemet 16 hours ago [-]
great observation - this data is not from a controlled environment; these are metrics from our Cactus Chat use (we only collect tok/sec telemetry).
S25 is an outlier that surprised us too.
I got $10 on S25 climbing back up to the top of the rankings as more data comes in :)
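Since the table comes from uncontrolled in-app telemetry, one common way to keep a few unusual sessions from skewing a ranking is to rank by median and ignore devices with too few samples. The sketch below shows that idea; the data shape, device names, and numbers are invented for illustration and say nothing about how Cactus actually aggregates its metrics.

```typescript
// Hypothetical telemetry shape: one tok/s reading per chat session.
interface Sample { device: string; tokPerSec: number; }
interface DeviceStat { device: string; median: number; count: number; }

function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = values.slice().sort(function (a, b) { return a - b; });
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Groups samples by device, drops devices with fewer than minSamples readings,
// and sorts the rest by median tok/s, fastest first.
function rankDevices(samples: Sample[], minSamples: number): DeviceStat[] {
  const byDevice: { [device: string]: number[] } = {};
  for (const s of samples) {
    if (!Object.prototype.hasOwnProperty.call(byDevice, s.device)) byDevice[s.device] = [];
    byDevice[s.device].push(s.tokPerSec);
  }
  const stats: DeviceStat[] = [];
  for (const device of Object.keys(byDevice)) {
    const values = byDevice[device];
    if (values.length >= minSamples) {
      stats.push({ device: device, median: median(values), count: values.length });
    }
  }
  return stats.sort(function (a, b) { return b.median - a.median; });
}

// Made-up readings: phone-c has a single very fast sample and is excluded.
const ranking = rankDevices([
  { device: "phone-a", tokPerSec: 20 },
  { device: "phone-a", tokPerSec: 22 },
  { device: "phone-a", tokPerSec: 60 },
  { device: "phone-b", tokPerSec: 30 },
  { device: "phone-b", tokPerSec: 31 },
  { device: "phone-c", tokPerSec: 90 },
], 2);
```

With these made-up readings, phone-b (median 30.5) ranks above phone-a (median 22), even though phone-a has the single fastest sample among the ranked devices.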
ttouch 20 hours ago [-]
very good project!
can you tell us more about the use cases that you have in mind? I saw that you're able to run 1-4B models (which is impressive!)
rshemet 20 hours ago [-]
Thank you! it goes without saying that the field is rapidly developing, so the use cases range from private AI assistant/companion apps to internet connectivity-independent copilots to powering private wearables, etc.
We're currently working with a few projects in the space.
For other applications, join the discord and stay tuned! :)
politelemon 20 hours ago [-]
Very nice, good work. I think you should add the chat app links on the readme, so that visitors get a good idea of what the framework is capable of.
The performance is quite good, even on CPU.
However I'm now trying it on a pixel, and it's not using GPU if I enable it.
I do like this idea as I've been running models in termux until now.
Is the plan to make this app something similar to lmstudio for phones?
rshemet 19 hours ago [-]
appreciate the feedback! Made the demo links more prominent on the README.
Some Android models won't support GPU hardware; we'll be addressing that as we move to our own kernels.
The app itself is just a demonstration of Cactus performance. The underlying framework gives you the tools to build any local mobile AI experience you'd like.
matthewolfe 19 hours ago [-]
For argument's sake, suppose we live in a world where many high-quality models can be run on-device. Is there any concern from companies/model developers about exposing their proprietary weights to the end user? It's generally not difficult to intercept traffic (weights) sent to an app, or just reverse-engineer the app itself.
rshemet 19 hours ago [-]
So far, our focus is on supporting models with fully open-sourced weights. Providers who are sensitive about their weights typically lock those weights up in their cloud and don't run their models locally on consumer devices anyway.
I believe there are some frameworks pioneering model encryption, but i think we're a few steps away from wide adoption.
bangaladore 18 hours ago [-]
Simple answer is they won't send the model to the end user if they don't want it used outside their app.
This isn't really anything novel to LLMs or AI models. Part of the reason many previously desktop applications are now cloud-based or require cloud access is to keep their sensitive IP off the end user's device.
It is fantastic. Compared to another program I had installed a year ago, the speed of processing and answering is really good and accurate. Was able to ask mathematical questions, basic translation between different languages and even trivia about movies released almost 30 years ago.
Things to improve: 1) sometimes the question would get stuck on the last phrase and keep repeating it without end. 2) The chat does not scroll the window to follow the answer and we have to scroll manually.
In either case, excellent start. It is without doubt the fastest offline LLM that I've seen working on this phone.
ipsum2 15 hours ago [-]
GGUF is easy to implement, but you'd probably find better performance with tflite on mobile for their custom XNNPACK kernels. Performance is pretty critical on low-power devices.
HenryNdubuaku 11 hours ago [-]
We are writing our own backend, but tflite (now called LiteRT) was not faster than GGML when we tested and GGML is already well supported. But we are moving away completely anyway.
Very cool. Looks like it might be practical to run 7b models at Q4 on my phone.
That would make it truly useful!
20 hours ago [-]
khalel 19 hours ago [-]
What do you think about security? I mean, a model with full (or partial) access to the smartphone and internet. Even if it runs locally, isn't there still a risk that these models could gain full access to the internet and the device?
rshemet 19 hours ago [-]
The models themselves live in an isolated sandbox. On top of that, each mobile app has its own sandbox - isolated from the phone's data or tools.
Both the model and the app only have access to the tools or data that you choose to give it. If you choose to give the model access to web search - sure, it'll have (read-only) access to internet data.
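The "only the tools you choose to give it" model can be sketched as an allowlist in front of tool dispatch: a tool call from the model is executed only if the active profile enables it. The registry, profile names, and tool names below are hypothetical and are not Cactus's API.

```typescript
type ToolFn = (args: string) => string;
interface ToolRegistry { [name: string]: ToolFn; }

// Returns a dispatcher that refuses any tool not on the enabled list.
function makeDispatcher(registry: ToolRegistry, enabled: string[]): (name: string, args: string) => string {
  return function (name: string, args: string): string {
    if (enabled.indexOf(name) === -1) {
      throw new Error("tool not enabled: " + name);
    }
    if (!Object.prototype.hasOwnProperty.call(registry, name)) {
      throw new Error("unknown tool: " + name);
    }
    return registry[name](args);
  };
}

// Hypothetical tools; fetch_url is a stub that performs no network access.
const tools: ToolRegistry = {
  echo: function (args) { return args; },
  fetch_url: function (args) { return "stub response for " + args; },
};

// Hypothetical profiles the user could switch between.
const profiles: { [profile: string]: string[] } = {
  offline: ["echo"],
  online: ["echo", "fetch_url"],
};

const offlineDispatch = makeDispatcher(tools, profiles["offline"]);
const onlineDispatch = makeDispatcher(tools, profiles["online"]);
```

The same structure covers the per-profile enable/disable switch requested elsewhere in the thread: changing profile only changes the allowlist, not the tools themselves.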
ekianjo 4 hours ago [-]
Would appreciate it if you could provide an APK that does not require Google Play Services to run...
tderflinger 11 hours ago [-]
Great project! I will try it out. :)
refulgentis 18 hours ago [-]
Beware of this, it's a two-week-old project.
Idk who these people are and I am sure they have good intentions, but they're wrapping llama.cpp.
That's what "like Ollama" means when you're writing code. That's also why there's a ton of comments asking if it's a server or app or what (it's a framework that an app would be built to use, you can't have an app with a localhost server like ollama on Android & iOS)
There are plenty of projects much further ahead, and I don't appreciate the number of times I've seen this project come up in conversation in the past 24 hours, due to misleading assertions that looked LLM-written, and a rush to make marketing claims that are just stuff llama.cpp does for you.
HenryNdubuaku 14 hours ago [-]
Thanks for the comment, but:
1) The commit history goes back to April.
2) The llama.cpp licence is included in the repo where necessary, as Ollama does, until it is deprecated.
3) Flutter isolates behave like servers, and Cactus code uses that.
refulgentis 14 hours ago [-]
What does #3 mean?
Flutter isolates are like threads, and servers may use multithreading to handle requests, and Ollama is like a server in that it provides an API, and since we've shown both are servers, it's like Ollama?
Please do educate me on this, I'm fascinated.
When you're done there, let's say Flutter having isolates does mean you have a React Native and Flutter local LLM server.
What's your plan for your Android & iOS-only framework being a system server? Or alternatively, available at localhost for all apps to contact?
HenryNdubuaku 13 hours ago [-]
We are following Ollama's design, but not verbatim due to apps being sandboxed.
Phones are resource-constrained: we saw significant battery overhead with in-process HTTP listeners, so we stuck with simple stateful isolates in Flutter and are exploring a standalone server app that others can talk to for React.
For model sharing with the current setup:
iOS - We are working towards writing the model into an App Group container; it's tricky, but we are working around it.
Android - We are working towards prompting the user once for a SAF directory (e.g., /Download/llm_models), saving the model there, then publishing a ContentProvider URI for zero-copy reads.
We are already writing more mobile-friendly kernels and tensors. GGML/GGUF is widely supported, so porting it was an easy way to get started and collect feedback, but we will completely move away from it in < 2 months.
Anything else you would like to know?
refulgentis 13 hours ago [-]
How does writing a model into an App Group container enable your framework to enable an app to enable a local LLM server that 3rd party apps can make calls to on iOS?[^1]
How does writing a model into a shared directory on Android enable a local LLM server that 3rd party apps can make calls to?[^2]
How does writing your own kernels get you off GGUF in 2 months? GGUF is a storage format. You use kernels to do things with the numbers you get from it.
I thought GGUF was an advantage? Now it's something you're basically done using?
I don't think you should continue this conversation. As easy as it is to get your work out there, it's just as easy to build a record of stretching the truth over and over again.
Best of luck, and I mean it. Just, memento mori: be honest and humble along the way. This is something you will look back on in a year and grimace.
[^1] App group containers only work between apps signed from the same Apple developer account. Additionally, that is shared storage, not a way to provide APIs to other apps.
[^2] SAF = Storage Access Framework, that is shared storage, not a way to provide APIs to other apps.
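The point that GGUF is a storage format, separate from the kernels that compute on its contents, can be seen in how little it takes to read the file header. The sketch below assumes the GGUF v2+ layout (4-byte magic "GGUF", little-endian uint32 version, then uint64 tensor count and uint64 metadata key-value count) and parses a synthetic 24-byte header built in memory; the counts in the demo are made up.

```typescript
interface GgufHeader { version: number; tensorCount: number; metadataKvCount: number; }

// Reads a little-endian uint64 as a JS number (exact only below 2^53).
function readU64(view: DataView, offset: number): number {
  const lo = view.getUint32(offset, true);
  const hi = view.getUint32(offset + 4, true);
  return hi * 4294967296 + lo;
}

// Assumes the GGUF v2+ header layout described above.
function parseGgufHeader(buf: ArrayBuffer): GgufHeader {
  if (buf.byteLength < 24) throw new Error("buffer too short for a GGUF header");
  const view = new DataView(buf);
  const magic = [0x47, 0x47, 0x55, 0x46]; // ASCII "GGUF"
  for (let i = 0; i < magic.length; i++) {
    if (view.getUint8(i) !== magic[i]) throw new Error("not a GGUF file");
  }
  return {
    version: view.getUint32(4, true),
    tensorCount: readU64(view, 8),
    metadataKvCount: readU64(view, 16),
  };
}

// Synthetic header for illustration: version 3, 291 tensors, 24 metadata entries.
const demoBuf = new ArrayBuffer(24);
const demoView = new DataView(demoBuf);
[0x47, 0x47, 0x55, 0x46].forEach(function (byte, i) { demoView.setUint8(i, byte); });
demoView.setUint32(4, 3, true);
demoView.setUint32(8, 291, true);
demoView.setUint32(16, 24, true);
const ggufHeader = parseGgufHeader(demoBuf);
```

Nothing here touches matrix multiplication: a runtime with its own kernels can still read the same files, which is why swapping kernels and dropping the format are separate decisions.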
HenryNdubuaku 10 hours ago [-]
I was merely replying to questions, thinking you were genuinely asking. On reviewing this thread, I feel you are angry for some reason and taking my responses out of context. I clearly explained that running a server came with battery overhead, so we ditched it in favour of shared app group for model weight sharing. Anyway, Thanks and have a nice day
jeffhuys 7 hours ago [-]
The best way to go about this is realizing that there are more people reading this thread that make their own assumptions.
Not staying professional and just answering the questions, and just doing "aight im outta here" when it gets a little bit harder is not a good look; it seems like you can't defend your own project.
Just FYI.
refulgentis 10 hours ago [-]
I'm not sure having wrong answers then making up reasons not to answer is the same thing as answering.
Good luck!
rshemet 18 hours ago [-]
reminds me of
- "You are, undoubtedly, the worst pirate i have ever heard of"
- "Ah, but you have heard of me"
Yes, we are indeed a young project. Not two weeks, but a couple of months. Welcome to AI, most projects are young :)
Yes, we are wrapping llama.cpp. For now. Ollama too began by wrapping llama.cpp. That is the mission of open-source software - to enable the community to build on each other's progress.
We're enabling the first cross-platform in-app inference experience for GGUF models and we're soon shipping our own inference kernels fully optimized for mobile to speed up the performance. Stay tuned.
PS - we're up to good (source: trust us)
Scene_Cast2 15 hours ago [-]
Does this support openrouter?
rshemet 15 hours ago [-]
hot off the press in our latest feature release :)
we support cloud fallback as an add-on feature. This lets us support vision and audio in addition to text.
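One way to picture "local first, cloud as fallback" is a small routing decision: serve a request on-device when the local model supports the modality, and go to the cloud only if the user has opted in. This is a sketch of the general pattern under those assumptions; the types and policy fields are hypothetical and not the Cactus API.

```typescript
type Modality = "text" | "vision" | "audio";
type Backend = "local" | "cloud";

interface RoutingPolicy {
  localModalities: Modality[]; // what the on-device model can handle
  allowCloud: boolean; // whether the user opted in to cloud fallback
}

function chooseBackend(modality: Modality, policy: RoutingPolicy): Backend {
  if (policy.localModalities.indexOf(modality) !== -1) return "local";
  if (policy.allowCloud) return "cloud";
  throw new Error("no backend available for " + modality);
}

// Hypothetical policies: a text-only local model, with and without fallback.
const withFallback: RoutingPolicy = { localModalities: ["text"], allowCloud: true };
const localOnly: RoutingPolicy = { localModalities: ["text"], allowCloud: false };

const textRoute = chooseBackend("text", withFallback); // "local"
const visionRoute = chooseBackend("vision", withFallback); // "cloud"
```

Keeping the decision in one function makes it easy to surface to the user when a request would leave the device.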
max-privatevoid 20 hours ago [-]
They literally vendored llama.cpp and they STILL called it "Ollama for *". Georgi cannot be vindicated hard enough.
rshemet 20 hours ago [-]
didn't Ollama vendor llama.cpp too?
Most projects typically start with llama.cpp and then move away to proprietary kernels
yrcyrc 19 hours ago [-]
how do i add RAG / personal assistant features on iOS?
rshemet 19 hours ago [-]
you can plug in a vector DB and run Cactus embeddings for retrieval. Assuming you're using React Native, here's an example:
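The example the reply refers to is not reproduced in this thread. As a generic sketch of the retrieval step it describes (and not the Cactus API), the code below ranks stored documents by cosine similarity to a query embedding. In a real app the vectors would come from an embedding model and live in a vector DB; here they are tiny hand-written vectors.

```typescript
interface Doc { id: string; embedding: number[]; }
interface Hit { id: string; score: number; }

function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / Math.sqrt(normA * normB);
}

// Brute-force top-k search; a vector DB would replace this for large corpora.
function topK(query: number[], docs: Doc[], k: number): Hit[] {
  const hits = docs.map(function (d): Hit {
    return { id: d.id, score: cosine(query, d.embedding) };
  });
  hits.sort(function (x, y) { return y.score - x.score; });
  return hits.slice(0, k);
}

// Toy 2-D embeddings, for illustration only.
const notes: Doc[] = [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0, 1] },
  { id: "c", embedding: [1, 1] },
];
const retrieved = topK([1, 0], notes, 2); // ids "a" then "c"
```

The retrieved documents' text would then be prepended to the prompt before running local inference.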
Does this download models at runtime? I would have expected a different API for that. I understand that you don’t want to include a multi-gig model in your app. But the mobile flow is usually to block functionality with a progress bar on first run. Downloading inline doesn’t integrate well into that.
You’d want an API for downloading OR pulling from a cache. Return an identifier from that and plug it into the inference API.
rshemet 18 hours ago [-]
Very good point - we've heard this before.
We're restructuring the model initialization API to point to a local file & exposing a separate abstracted download function that takes in a URL.
wrt downloading post-install: based on our feedback, this is indeed a preferred pattern (as opposed to bundling in large files).
We'll update the download API, thanks again.
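The split being discussed (a download-or-cache step that returns an identifier, and an inference API that only takes that identifier) might look roughly like the sketch below. It only models the decision, not the download itself; the cache shape, function names, directory, and URL are all hypothetical.

```typescript
// Hypothetical cache index: source URL -> local file path.
interface ModelCache { [url: string]: string; }

type Resolution =
  | { kind: "cached"; path: string }
  | { kind: "download"; url: string; dest: string };

function fileNameFromUrl(url: string): string {
  const withoutQuery = url.split("?")[0];
  const parts = withoutQuery.split("/");
  return parts[parts.length - 1] || "model.gguf";
}

// Decides whether a model is already on disk or still needs to be fetched.
// The app can show a first-run progress bar for the "download" case and then
// hand the resulting local path to the inference API.
function resolveModel(url: string, cache: ModelCache, modelsDir: string): Resolution {
  if (Object.prototype.hasOwnProperty.call(cache, url)) {
    return { kind: "cached", path: cache[url] };
  }
  return { kind: "download", url: url, dest: modelsDir + "/" + fileNameFromUrl(url) };
}

// Hypothetical URL and paths, for illustration only.
const modelUrl = "https://example.com/models/tiny.gguf?download=true";
const firstRun = resolveModel(modelUrl, {}, "/data/models");
const cacheAfterDownload: ModelCache = {};
cacheAfterDownload[modelUrl] = "/data/models/tiny.gguf";
const laterRun = resolveModel(modelUrl, cacheAfterDownload, "/data/models");
```

Keeping inference ignorant of URLs means the same call works for bundled, downloaded, or shared model files.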
teaearlgraycold 17 hours ago [-]
Sounds good!
xnx 20 hours ago [-]
Is there an .apk for Android?
rshemet 20 hours ago [-]
Cactus is a framework - not the app itself. If you're looking for an Android demo, you can go to
[0] https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inf...
[1] https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inf...
We are working on an agentic browser (also launched today https://news.ycombinator.com/item?id=44523409 :))
For a demo of a familiar chat interface, download https://apps.apple.com/gb/app/cactus-chat/id6744444212 or https://play.google.com/store/apps/details?id=com.rshemetsub...
https://github.com/cactus-compute/cactus/tree/main/react#emb...
(Flutter works the same way)
What are you building?
https://play.google.com/store/apps/details?id=com.rshemetsub...
Otherwise, it's easy to build any of the example apps from the repo:
cd react/example && yarn && npx expo run:android
or
cd flutter/example && flutter pub get && flutter run