I'm no GPU programmer, but this seems easy to use even for someone like me. I pulled together a quick demo of using the GPU vs the CPU, based on what I could find (https://gist.github.com/victorb/452a55dbcf59b3cbf84efd8c3097...), which gave these results (after downloading 2.6GB of dependencies, of course):
Creating 100 random matrices of size 5000x5000 on CPU...
Adding matrices using CPU...
CPU matrix addition completed in 0.6541 seconds
CPU result matrix shape: (5000, 5000)
Creating 100 random matrices of size 5000x5000 on GPU...
Adding matrices using GPU...
GPU matrix addition completed in 0.1480 seconds
GPU result matrix shape: (5000, 5000)
Definitely worth digging into more, as the API is really simple to use, at least for basic things like these. CUDA programming seems like a big chore without something higher level like this.
ashvardanian 203 days ago [-]
CuPy has been available for years and has always worked great. The article is about the next wave of Python-oriented JIT toolchains, that will allow writing actual GPU kernels in a Pythonic-style instead of calling an existing precompiled GEMM implementation in CuPy (like in that snippet) or even JIT-ing CUDA C++ kernels from a Python source, that has also been available for years: https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-k...
almostgotcaught 203 days ago [-]
it's funny - people around here really do not have a clue about the GPU ecosystem even though everyone is always talking about AI:
> The article is about the next wave of Python-oriented JIT toolchains
the article is content marketing (for whatever) but the actual product literally has nothing to do with kernels or jitting or anything
The mistake you seem to be making is confusing the existing product (which has been available for many years) with the upcoming new features for that product just announced at GTC, which are not addressed at all on the page for the existing product, but are addressed in the article about the GTC announcement.
almostgotcaught 203 days ago [-]
> The mistake you seem to be making is confusing the existing product
i'm not making any such mistake - i'm just able to actually read and comprehend what i'm reading rather than perform hype:
> Over the last year, NVIDIA made CUDA Core, which Jones said is a “Pythonic reimagining of the CUDA runtime to be naturally and natively Python.”
so the article is about cuda-core, not whatever you think it's about - so i'm responding directly to what the article is about.
> CUDA Core has the execution flow of Python, which is fully in process and leans heavily into JIT compilation.
this is bullshit/hype about Python's new JIT which womp womp womp isn't all that great (yet). this has absolutely nothing to do with any other JIT e.g., the cutile kernel driver JIT (which also has absolutely nothing to do with what you think it does).
dragonwriter 203 days ago [-]
> i'm just able to actually read and comprehend what i'm reading rather than perform hype:
The evidence of that is lacking.
> so the article is about cuda-core, not whatever you think it's about
cuda.core (a relatively new, rapidly developing, library whose entire API is experimental) is one of several things (NVMath is another) mentioned in the article, but the newer and as yet unreleased piece mentioned in the article and the GTC announcement, and a key part of the “Native Python” in the headline, is the CuTile model [0]:
“The new programming model, called CuTile interface, is being developed first for Pythonic CUDA with an extension for C++ CUDA coming later.”
> this is bullshit/hype about Python's new JIT
No, as is fairly explicit in the next line after the one you quote, it is about the Nvidia CUDA Python toolchain using in-process compilation rather than relying on shelling out to out-of-process command-line compilers for CUDA code.
[0] The article only has fairly vague qualitative description of what CuTile is, but (without having to watch the whole talk from GTC), one could look at this tweet for a preview of what the Python code using the model is expected to look like when it is released: https://x.com/blelbach/status/1902113767066103949?t=uihk0M8V...
almostgotcaught 203 days ago [-]
> No, as is fairly explicit in the next line after the one you quote, it is about the Nvidia CUDA Python toolchain using in-process compilation rather than relying on shelling out to out-of-process command-line compilers for CUDA code.
my guy what i am able to read, which you are not, is the source and release notes. i do not need to read tweets and press releases because i know what these things actually are. here are the release notes
> Support Python 3.13
> Add bindings for nvJitLink (requires nvJitLink from CUDA 12.3 or above)
> Add optional dependencies on CUDA NVRTC and nvJitLink wheels
do you understand what "bindings" and "optional dependencies on..." means? it means there's nothing happening in this library and these are... just bindings to existing libraries. specifically that means you cannot jit python using this thing (except via the python 3.13 jit interpreter) and can only do what you've always already been able to do with eg cupy (compile and run C/C++ CUDA code).
EDIT: y'all realize that
1. calling a compiler for your entire source file
2. loading and running that compiled code
is not at all a JIT? y'all understand that right?
squeaky-clean 203 days ago [-]
> my guy what i am able to read, which you are not, is the source and release notes. i do not need to read tweets and press releases because i know what these things actually are. here are the release notes
Those aren't the release notes for the native python thing being announced. CuTile has not been publicly released yet. Based on what the devs are saying on Twitter it probably won't be released before the SciPy 2025 conference in July.
musicale 202 days ago [-]
JIT as an adjective means just-in-time, as opposed to AOT, ahead-of-time. What Nvidia discussed at GTC was a software stack that will enable you to generate new CUDA kernels dynamically at runtime using Python API calls. It is a just-in-time (runtime, dynamic) compiler system rather than an ahead-of-time (pre-runtime, static) compiler.
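To make the JIT vs AOT distinction concrete, here is a toy sketch in plain Python. It has nothing to do with CUDA or with any Nvidia API; make_scaler is a hypothetical helper that generates source code for a value only known at runtime, compiles it on first use, and caches the result. That is the general shape of "generate new kernels dynamically at runtime", under those assumptions.

```python
# Toy illustration of just-in-time specialization in plain Python.
# Nothing here touches CUDA; make_scaler is a hypothetical helper.
from functools import lru_cache


@lru_cache(maxsize=None)
def make_scaler(factor: int):
    """Generate, compile, and return a function specialized for `factor`."""
    source = f"def scale(xs):\n    return [x * {factor} for x in xs]\n"
    namespace = {}
    # Compilation happens here, at runtime, the first time a factor is requested.
    exec(compile(source, f"<generated scale_{factor}>", "exec"), namespace)
    return namespace["scale"]


triple = make_scaler(3)           # compiled now, on first use
print(triple([1, 2, 3]))          # [3, 6, 9]
assert make_scaler(3) is triple   # the second request is served from the cache
```

An ahead-of-time toolchain would instead have to know the factor (or compile a generic version) before the program starts.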
saagarjha 203 days ago [-]
cuTile is basically Nvidia’s Triton (no, not that Triton, OpenAI’s Triton) competitor. It takes your Python code and generates kernels at runtime. CUTLASS has a new Python interface that does the same thing.
wahnfrieden 203 days ago [-]
[flagged]
squeaky-clean 203 days ago [-]
Isn't the main announcement of the article CuTile? Which has not been released yet.
Also the cuda-core JIT stuff has nothing to do with Python's new JIT, it's referring to integrating nvJitLink with python, which you can see an example of in cuda_core/examples/jit_lto_fractal.py
ashvardanian 203 days ago [-]
In case someone is looking for some performance examples & testimonials, even on RTX 3090 vs a 64-core AMD Epyc/Threadripper, even a couple of years ago, CuPy was a blast. I have a couple of recorded sessions with roughly identical slides/numbers:
- San Francisco Python meetup in 2023: https://youtu.be/L9ELuU3GeNc?si=TOp8lARr7rP4cYaw
- Yerevan PyData meetup in 2022: https://youtu.be/OxAKSVuW2Yk?si=5s_G0hm7FvFHXx0u
Of the more remarkable results:
- 1000x sorting speedup switching from NumPy to CuPy.
- 50x performance improvements switching from Pandas to CuDF on the New York Taxi Rides queries.
- 20x GEMM speedup switching from NumPy to CuPy.
CuGraph is also definitely worth checking out. At that time, Intel wasn't in as bad of a position as they are now and was trying to push Modin, but the difference in performance and quality of implementation was mind-boggling.
ladberg 203 days ago [-]
The main release highlighted by the article is cuTile which is certainly about jitting kernels from Python code
almostgotcaught 203 days ago [-]
> main release
there is no release of cutile (yet). so the only substantive thing that the article can be describing is cuda-core - which it does describe and is a recent/new addition to the ecosystem.
man i can't fathom glazing a random blog this hard just because it's tangentially related to some other thing (NV GPUs) that clearly people only vaguely understand.
throwaway314155 202 days ago [-]
christ man lighten the fuck up. there's zero need to be _so_ god damn patronizing and disrespectful.
yieldcrv 203 days ago [-]
I just want to see benchmarks. is this new one faster than CuPy or not
moffkalast 203 days ago [-]
Only 4x speed seems rather low for GPU acceleration, does numpy already use AVX2 or anything SIMD?
For comparison, doing something similar with torch on CPU and torch on GPU will get you like 100x speed difference.
diggan 203 days ago [-]
It's a microbenchmark (if even that), take it with a grain of salt. You'd probably see a bigger difference with bigger/more/more complicated tasks.
wiredfool 203 days ago [-]
Curious what the timing would be if it included the memory transfer time, e.g.
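import time, numpy as np, cupy as cp
matrices = [np.random.rand(5000, 5000) for _ in range(100)]  # sizes taken from the demo output above
time_start = time.time()
cp_matrices = [cp.array(m) for m in matrices]  # host-to-device copies, now inside the timed region
add_matrices(cp_matrices)  # the demo's addition routine
cp.cuda.get_current_stream().synchronize()  # wait for the GPU before stopping the clock
time_end = time.time()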
nickysielicki 203 days ago [-]
I don’t mean to call you or your pseudocode out specifically, but I see this sort of thing all the time, and I just want to put it out there:
PSA: if you ever see code trying to measure timing and it’s not using the CUDA event APIs, it’s fundamentally wrong and is lying to you. The simplest way to be sure you’re not measuring noise is to just ban the usage of any other timing source. Definitely don’t add unnecessary syncs just so that you can add a timing tap.
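For anyone who has not seen the event APIs, a minimal sketch of the pattern follows. It assumes CuPy, whose cp.cuda.Event and cp.cuda.get_elapsed_time wrap the CUDA event calls; time_with_events, HostEvent and host_elapsed_ms are hypothetical names made up for illustration. The CPU stand-in exists only so the helper can be tried on a machine without a GPU, and it is not a substitute for real events.

```python
import time


def time_with_events(fn, event_cls=None, elapsed_ms=None):
    """Time fn() between two recorded events and return milliseconds.

    With only fn given, CuPy's CUDA events are used (assumes CuPy and a
    working CUDA device are available).
    """
    if event_cls is None:
        import cupy as cp  # assumption: CuPy with a CUDA device
        event_cls, elapsed_ms = cp.cuda.Event, cp.cuda.get_elapsed_time
    start, end = event_cls(), event_cls()
    start.record()      # enqueued on the current stream, before the work
    fn()                # GPU launches are asynchronous; this returns early
    end.record()        # enqueued after the work
    end.synchronize()   # block only until the end event has completed
    return elapsed_ms(start, end)


class HostEvent:
    """Hypothetical CPU stand-in with the same three methods."""

    def __init__(self):
        self.t = None

    def record(self):
        self.t = time.perf_counter()

    def synchronize(self):
        pass


def host_elapsed_ms(start, end):
    return (end.t - start.t) * 1000.0


# Runs anywhere: times a CPU loop using the stand-in events.
ms = time_with_events(lambda: sum(range(100_000)), HostEvent, host_elapsed_ms)
print(f"{ms:.3f} ms")
```

On a GPU machine the call would look like time_with_events(lambda: a + b), with a and b being CuPy arrays. The timestamps are then taken on the device in stream order, so the measurement excludes host-side noise and needs no extra full-device sync.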
If I have a mostly CPU code and I want to time the scenario: “I have just a couple subroutines that I am willing to offload to the GPU,” what’s wrong with sprinkling my code with normal old python timing calls?
I don’t care what part of the CUDA ecosystem is taking time (from my point of view it is a black box that does GEMMs), so why not measure “time until my normal code is running again”?
nickysielicki 203 days ago [-]
If you care enough to time it, you should care enough to time it correctly.
bee_rider 203 days ago [-]
I described the correct way to time it when using the card as a black-box accelerator.
nickysielicki 203 days ago [-]
You can create metrics for whatever you want! Go ahead!
But cuda is not a black box math accelerator. You can stupidly treat it as such, but that doesn’t make it that. It’s an entire ecosystem with drivers and contexts and lifecycles. If everything you’re doing is synchronous and/or you don’t mind if your metrics include totally unrelated costs, then time.time() is fine, sure. But if that’s the case, you’ve got bigger problems.
bee_rider 203 days ago [-]
Sure, it’s easy to say “there are bigger problems.” There are always bigger problems.
But, there are like 50 years worth of Fortran numerical codes out there, lots of them just use RCIs… if I want to try CUDA in some existing library, I guess I will need the vector back before I can go back into the RCI.
doctorpangloss 203 days ago [-]
You're arguing with people who have no idea what they're talking about on a forum that is a circular "increase in acceleration" of a personality trait that gets co-opted into arguing incorrectly about everything - a trait that everyone else knows is defective.
gavinray 203 days ago [-]
One of the wisest things I've read all week.
I authored one of the primary tools for GraphQL server benchmarks.
I learned about the Coordinated Omission problem and formats like HDR Histograms during the implementation.
My takeaway from that project is that not only is benchmarking anything correctly difficult, but all benchmarks ought to come with a disclaimer:
"These are the results obtained on X machine, running at Y time, with Z resources."
jms55 203 days ago [-]
Never used CUDA, but I'm guessing these map to the same underlying stuff as timestamp queries in graphics APIs, yes?
saagarjha 203 days ago [-]
I mean you can definitely use it in a pinch if you know what you’re doing. But yes the event APIs are better.
hnuser123456 203 days ago [-]
I think it does?: (the comment is in the original source)
print("Adding matrices using GPU...")
start_time = time.time()
gpu_result = add_matrices(gpu_matrices)
cp.cuda.get_current_stream().synchronize() # Not 100% sure what this does
elapsed_time = time.time() - start_time
I was going to ask, any CUDA professionals who want to give a crash course on what us python guys will need to know?
apbytes 203 days ago [-]
When you call a cuda method, it is launched asynchronously. That is, the function queues it up for execution on the gpu and returns.
So if you need to wait for an op to finish, you need to `synchronize` as shown above.
`get_current_stream` because the queue mentioned above is actually called stream in cuda.
If you want to run many independent ops concurrently, you can use several streams.
Benchmarking is one use case for synchronize. Another would be if you let's say run two independent ops in different streams and need to combine their results.
Btw, if you work with pytorch, when ops are run on gpu, they are launched in background. If you want to bench torch models on gpu, they also provide a sync api.
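A rough way to picture this without a GPU: the sketch below models a stream as an in-order work queue served by a background thread. ToyStream is a hypothetical stand-in written for this comment, not CuPy's or CUDA's API. It only shows why timing without a synchronize measures the cost of queuing rather than the work itself.

```python
import queue
import threading
import time


class ToyStream:
    """Hypothetical stand-in for a CUDA stream: an in-order work queue."""

    def __init__(self):
        self._q = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            fn = self._q.get()
            fn()
            self._q.task_done()

    def launch(self, fn):
        self._q.put(fn)   # returns immediately, like a kernel launch

    def synchronize(self):
        self._q.join()    # block until everything queued has finished


def slow_op(out):
    time.sleep(0.05)      # pretend this is the GPU doing work
    out.append("done")


stream = ToyStream()
results = []
t0 = time.perf_counter()
stream.launch(lambda: slow_op(results))
launch_time = time.perf_counter() - t0   # only the cost of queuing
stream.synchronize()
total_time = time.perf_counter() - t0    # includes the actual work
print(f"launch returned after {launch_time:.4f}s, work finished after {total_time:.4f}s")
```

Stopping the clock before synchronize() reports launch_time, which is why the demo's synchronize call matters for the benchmark.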
claytonjy 203 days ago [-]
I’ve always thought it was weird GPU stuff in python doesn’t use asyncio, and mostly assumed it was because python-on-GPU predates asyncio. I was hoping a new lib like this might right that wrong, but it doesn’t. Maybe for interop reasons?
Do other languages surface the asynchronous nature of GPUs in language-level async, avoiding silly stuff like synchronize?
ImprobableTruth 203 days ago [-]
The reason is that the usage is completely different from coroutine based async. With GPUs you want to queue _as many async operations as possible_ and only then synchronize. That is, you would have a program like this (pseudocode):
b = foo(a)
c = bar(b)
d = baz(c)
synchronize()
With coroutines/async await, something like this
b = await foo(a)
c = await bar(b)
d = await baz(c)
would synchronize after every step, being much more inefficient.
hackernudes 203 days ago [-]
Pretty sure you want it to do it the first way in all cases (not just with GPUs)!
halter73 203 days ago [-]
It really depends on if you're dealing with an async stream or a single async result as the input to the next function. If a is an access token needed to access resource b, you cannot access a and b at the same time. You have to serialize your operations.
alanfranz 203 days ago [-]
Well you can and should create multiple coroutine/tasks and then gather them. If you replace cuda with network calls, it’s exactly the same problem. Nothing to do with asyncio.
ImprobableTruth 203 days ago [-]
No, that's a different scenario. In the one I gave there's explicitly a dependency between requests. If you use gather, the network requests would be executed in parallel. If you have dependencies they're sequential by nature because later ones depend on values of former ones.
The 'trick' for CUDA is that you declare all this using buffers as inputs/outputs rather than values and that there's automatic ordering enforcement through CUDA's stream mechanism. Marrying that with the coroutine mechanism just doesn't really make sense.
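The two asyncio cases being contrasted in this subthread can be sketched with the standard library alone. fetch is a hypothetical stand-in for a network call; the point is only that gather helps when calls are independent, while a dependency chain still has to be awaited step by step.

```python
import asyncio


async def fetch(name, delay=0.01):
    await asyncio.sleep(delay)   # stand-in for a network call
    return name


async def dependent_chain():
    # Each call needs the previous result, so the awaits are sequential.
    token = await fetch("token")
    resource = await fetch(f"resource-for-{token}")
    return resource


async def independent_batch():
    # No call needs another's result, so gather can run them concurrently.
    return await asyncio.gather(fetch("a"), fetch("b"), fetch("c"))


chain_result = asyncio.run(dependent_chain())
batch_result = asyncio.run(independent_batch())
print(chain_result)   # resource-for-token
print(batch_result)   # ['a', 'b', 'c']
```

In the coroutine version the host has to receive each value before it can issue the next call. With CUDA streams the host can queue the whole dependent chain up front, because the ops refer to buffers and the stream enforces the order.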
apbytes 203 days ago [-]
Might have to look at specific lib implementations, but I'd guess that mostly gpu calls from python are actually happening in c++ land. And internally a lib might be using synchronize calls where needed.
hnuser123456 203 days ago [-]
Thank you kindly!
rahimnathwani 203 days ago [-]
Thank you. I scrolled up and down the article hoping they included a code sample.
diggan 203 days ago [-]
Yeah, I figured I wasn't alone in doing just that :)
rahimnathwani 203 days ago [-]
EDIT: Just realized the code doesn't seem to be using the GPU for the addition.
aixpert 203 days ago [-]
Thank God PyTorch gained so much momentum before this came out. Now we have a true platform-independent semi-standard for parallel computations. We are not stuck with NVIDIA specifics.
It's great that parts of PyTorch which concern the NVIDIA backend can now be implemented in Python directly. The important part is that it doesn't really matter, or shouldn't matter, for end users / developers.
that being said, maybe this new platform will extend the whole concept of on GPU computation via Python to even more domains like maybe games.
Imagine running rust the Game performantly mainly on the GPU via Python
disgruntledphd2 203 days ago [-]
This just makes it much, much easier for people to build numeric stuff on GPU, which is great.
I'm totally with you that it's better that this took so long, so we have things like PyTorch abstracting most of this away, but I'm looking forward to (in my non-existent free time :/ ) playing with this.
wafngar 203 days ago [-]
Why not use torch.compile()?
ashvardanian 203 days ago [-]
CuTile, in many ways, feels like a successor to OpenAI's Triton... And not only are we getting tile/block-level primitives and TileIR, but also a proper SIMT programming model in CuPy, which I don't think enough people noticed even at this year's GTC. Very cool stuff!
That said, there were almost no announcements or talks related to CPUs, despite the Grace CPUs being announced quite some time ago. It doesn't feel like we're going to see generalizable abstractions that work seamlessly across Nvidia CPUs and GPUs anytime soon. For someone working on parallel algorithms daily, this is an issue: debugging with NSight and CUDA-GDB still isn't the same as raw GDB, and it's much easier to design algorithms on CPUs first and then port them to GPUs.
Of all the teams in the compiler space, Modular seems to be among the few that aren't entirely consumed by the LLM craze, actively building abstractions and languages spanning multiple platforms. Given the landscape, that's increasingly valuable. I'd love to see more people experimenting with Mojo — perhaps it can finally bridge the CPU-GPU gap that many of us face daily!
jms55 203 days ago [-]
> And not only are we getting tile/block-level primitives and TileIR
As someone working on graphics programming, it always frustrates me to see so much investment in GPU APIs _for AI_, but almost nothing for GPU APIs for rendering.
Block level primitives would be great for graphics! PyTorch-like JIT kernels programmed from the CPU would be great for graphics! ...But there's no money to be made, so no one works on it.
And for some reason, GPU APIs for AI are treated like an entirely separate thing, rather than having one API used for AI and rendering.
saagarjha 203 days ago [-]
I mean, it doesn’t really make sense to unify them. CPUs and GPUs have very different performance characteristics and you design for them differently depending on what they let you do. There’s obviously a common ground where you can design mostly good interfaces to do things ok (I’ll argue PyTorch is that) but it’s not really reasonable to write an algorithm that is hobbled on CPUs for no reason because it assumes that synchronizing between execution contexts is super expensive.
gymbeaux 203 days ago [-]
This is huge. Anyone who was considering AMD + ROCm as an alternative to NVIDIA in the AI space isn’t anymore.
I’m one of those people who can’t (won’t) learn C++ to the extent required to effectively write code for GPU execution…. But to have a direct pipeline to the GPU via Python. Wow.
The efficiency implications are huge, not just for Python libraries like PyTorch, but also anything we write that runs on an NVIDIA GPU.
I love seeing anything that improves efficiency because we are constantly hearing about how many nuclear power plants OpenAI and Google are going to need to power all their GPUs.
ferguess_k 203 days ago [-]
Just curious why can't AMD do the same thing?
bigyabai 203 days ago [-]
It can be argued that they already did. AMD and Apple worked with Khronos to build OpenCL as a general competitor. The industry didn't come together to support it though, and eventually major stakeholders abandoned it altogether. Those ~10 wasted years were spent on Nvidia's side refining their software offerings and redesigning their GPU architecture to prioritize AI performance over raster optimization. Meanwhile Apple and AMD were pulling the rope in the opposite direction, trying to optimize raster performance at all costs.
This means that Nvidia is selling a relatively unique architecture with a fully-developed SDK, industry buy-in and relevant market demand. Getting AMD up to the same spot would force them to reevaluate their priorities and demand a clean-slate architecture to-boot.
pjmlp 203 days ago [-]
Maybe because Apple got pissed at how Khronos took over OpenCL, and AMD and Intel never offered tooling on par with CUDA in terms of IDE integration, graphical debuggers and library ecosystem.
Khronos also never saw the need to support a polyglot ecosystem with C++, Fortran and anything else that the industry could feel like using on a GPU.
When Khronos finally remembered to at least add C++ support and SPIR, again Intel and AMD failed to deliver, and OpenCL 3.0 is basically OpenCL 1.0 rebranded.
Followed by SYCL efforts, which only Intel seems to care about, with their own extensions on top via DPC++, nowadays oneAPI. And only after acquiring Codeplay, which was actually the first company to deliver on SYCL tooling.
However contrary to AMD, at least Intel does get that unless everyone gets to play with their software stack, no one will bother to actually learn it.
bigyabai 203 days ago [-]
Well, Apple has done nothing to replace the common standard they abandoned. They failed to develop their proprietary alternatives into a competitive position and now can't even use their own TSMC dies (imported at great expense) for training: https://www.eteknix.com/apple-set-to-invest-1-billion-in-nvi...
However you want to paint the picture today, you can't say the industry didn't try to resist CUDA. The stakeholders shot each other in a 4-way Mexican standoff, and Nvidia whistled showtunes all the way to the bank. If OpenCL was treated with the same importance Vulkan was, we might see a very different market today.
pjmlp 203 days ago [-]
Yes they did, it is called Metal Compute, and everyone using Apple devices has to use it.
Vulkan you say?
It is only relevant on GNU/Linux and Android, because Google is pushing it, and still most folks keep using OpenGL ES. No one else cares about it, and it has already turned into the same spaghetti mess as OpenGL, to the point that there was a roadmap talk at Vulkanised 2025 on how to sort things out.
NVidia and AMD keep designing their cards with Microsoft for DirectX first, and Vulkan, eventually.
jms55 203 days ago [-]
> NVidia and AMD keep designing their cards with Microsoft for DirectX first, and Vulkan, eventually.
Not really. For instance NVIDIA released day 1 Vulkan extensions for their new raytracing and neural net tech (VK_NV_cluster_acceleration_structure, VK_NV_partitioned_tlas, VK_NV_cooperative_vector), as well as equivalent NVAPI extensions for DirectX12. Equal support, although DirectX12 is technically worse as you need to use NVAPI and rely on a prerelease version of DXC, as unlike Vulkan and SPIR-V, DirectX12 has no mechanism for vendor-specific extensions (for good or bad).
Meanwhile the APIs, both at a surface level and how the driver implements them under the hood, are basically identical. So identical in fact, that NVIDIA has the nvrhi project which provides a thin wrapper over Vulkan/DirectX12 so that you can run on multiple platforms via one API.
pjmlp 203 days ago [-]
An exception that doesn't change the rule, where are the Vulkan extensions for DirectX neural shaders, and RTX kit?
As a more recent example, not feeling like enumerating all of them since DirectX 8 shader model introduction, and collaboration with NVidia where Cg became HLSL foundation.
Exactly, proprietary APIs don't have extension spaghetti like Khronos APIs, that always end up out of control, hence Vulkan 2025 roadmap plans.
Khronos got lucky that Google and Samsung decided to embrace Vulkan as the API to be on Android, Valve for their Steam Deck, and IoT displays, basically.
Everywhere else it is middleware engines that support all major 3D APIs, with WebGPU becoming also middleware outside of the browser due to the ways of Vulkan.
jms55 203 days ago [-]
> An exception that doesn't change the rule, where are the Vulkan extensions for DirectX neural shaders, and RTX kit?
DirectX "neural shaders" is literally the VK_NV_cooperative_vector extension I mentioned previously, which is actually easier to use in Vulkan at the moment since you don't need a custom prerelease version of DXC. Same for all the RTX kit stuff, e.g. https://github.com/NVIDIA-RTX/RTXGI has both VK and DX12 support.
pjmlp 203 days ago [-]
And how does that prove that NVidia has not designed that together with Microsoft first in DirectX prototype?
Additionally, naturally Intel and AMD will come up with their extensions, if ever, followed by a Khronos common one. Not counting mobile units into this extension frenzy.
So then we will have the pleasure to choose between four extensions for a feature, depending on the card's vendor, with possibly incompatible semantics, as has happened so many times.
bigyabai 203 days ago [-]
> it is called Metal Compute, and everyone using Apple devices has to use it.
Sounds like a submarket absolutely teeming with competition. Like, you have Metal Compute, and Apple Accelerate Framework and MLX all sitting there in the same spot! Apple is really outdoing themselves, albeit in a fairly literal sense.
> It is only relevant on GNU/Linux and Android
Hmm... someone ought to remind me of the first stage of grief, I've forgotten it suddenly.
saagarjha 203 days ago [-]
They did; they are active contributors to OpenAI’s Triton compiler which has a very similar execution model.
dismalaf 203 days ago [-]
> But to have a direct pipeline to the GPU via Python
Have you ever used a GPU API (CUDA, OpenCL, OpenGL, Vulkan, etc...) with a scripting language?
It's cool that Nvidia made a bit of an ecosystem around it but it won't replace C++ or Fortran and you can't simply drop in "normal" Python code and have it run on the GPU. CUDA is still fundamentally its own thing.
There's also been CUDA bindings to scripting languages for at least 15 years... Most people will probably still use Torch or higher level things built on top of it.
Also, here's Nvidia's own advertisement and some instructions for Python on their GPUs:
Reality is kind of boring, and the article posted here is just clickbait.
dragonwriter 203 days ago [-]
> It's cool that Nvidia made a bit of an ecosystem around it but it won't replace C++ or Fortran and you can't simply drop in "normal" Python code and have it run on the GPU.
While it's not exactly normal Python code, there are Python libraries that allow writing GPU kernels in internal DSLs that are normal-ish Python (e.g., Numba for CUDA specifically via the @cuda.jit decorator; or Taichi which has multiple backends supporting the same application code: Vulkan, Metal, CUDA, OpenGL, OpenGL ES, and CPU.)
Apparently, nVidia is now doing this first party in CUDA Python, including adding a new paradigm for CUDA code (CuTile) that is going to be in Python before C++; possibly trying to get ahead of things like Taichi (which, because it is cross-platform, commoditizes the underlying GPU).
> Also, here's Nvidia's own advertisement for Python on their GPUs
That (and the documentation linked there) does not address the new upcoming native functionality announced at GTC; existing CUDA Python has kernels written in C++ in inline strings.
freeone3000 203 days ago [-]
OpenCL and OpenGL are basically already scripting languages that you happen to type into a C compiler. The CUDA advantage was actually having meaningful types and compilation errors, without the intense boilerplate of Vulkan. But this is 100% a python-for-CUDA-C replacement on the GPU, for people who prefer a slightly different bracketing syntax.
dismalaf 203 days ago [-]
> But this is 100% a python-for-CUDA-C replacement on the GPU
Ish. It's a Python maths library made by Nvidia, an eDSL and a collection of curated libraries. It's not significantly different than stuff like Numpy, Triton, etc..., apart from being made by Nvidia and bundled with their tools.
gymbeaux 201 days ago [-]
I’m mainly interested in the performance implications. The less shit between me and the hardware, theoretically the better the performance. In a world where these companies want to build nuclear power plants just to power NVIDIA GPU data centers, I feel like we need to be optimizing the code where possible.
pjmlp 203 days ago [-]
Yes, shading languages which are more productive without the gotchas from those languages, as they were designed from the ground up for compute devices.
The polyglot nature of CUDA is one of the plus points versus the original "we do only C99 dialect around here" from OpenCL, until it was too late.
ErrorNoBrain 203 days ago [-]
They are, if they can't find an nvidia card
pjmlp 203 days ago [-]
NVidia cards are everywhere, the biggest difference to AMD is that even my lousy laptop GeForce cards can be used for CUDA.
No need for a RTX for learning and getting into CUDA programming.
gymbeaux 201 days ago [-]
True, although I believe Maxwell is the oldest supported architecture for the current CUDA 12.x. Maxwell (e.g., GTX 980) came out in 2014. 10+ years of support is not bad at all considering ROCm supports only like 3 consumer AMD GPUs.
So your lousy laptop GTX 750Ti ehhh probably can’t practically be used for CUDA. But your lousy 1050Ti Max-Q? Sure.
crazygringo 203 days ago [-]
Very curious how this compares to JAX [1].
JAX lets you write Python code that executes on Nvidia, but also GPUs of other brands (support varies). It similarly has drop-in replacements for NumPy functions.
This only supports Nvidia. But can it do things JAX can't? Is it easier to use? Is it less fixed-size-array-oriented? Is it worth locking yourself into one brand of GPU?
Well, the idea is that you’d be writing low level CUDA kernels that implement operations not already implemented by JAX/CUDA and integrate them into existing projects. Numba[1] is probably the closest thing I can think of that currently exists. (In fact, looking at it right now, it seems this effort from Nvidia is actually based on Numba)
Rust support next? RN I am manually [de]serializing my data structures as byte arrays to/from the kernels. It would be nice to have truly shared data structures like CUDA gives you in C++!
KeplerBoy 203 days ago [-]
Isn't Rust still very seldom used in the areas where CUDA shines (e.g. number crunching of any kind, be it simulations or linear algebra)? Imo C++ or even Fortran are perfectly fine choices for those things, since the memory allocation patterns aren't that complicated.
IshKebab 203 days ago [-]
Mainly because number crunching code tends to be very long-lived (hence why FORTRAN is still in use).
nine_k 203 days ago [-]
Not only that. Fortran is very good for writing number-crunching code. Modern Fortran is a pretty ergonomic language, it gives you a really easy way to auto-parallelize things in many ways, and new Fortran code is being produced unironically. Of course it normally uses the treasure trove of existing numerical Fortran code. (Source: a friend who worked at CERN.)
pjmlp 203 days ago [-]
Yes, and the new kid in town, slang has more chances of adoption.
KeplerBoy 203 days ago [-]
sorry, could you link to the project? Seems there are quite a few languages called slang.
_0ffh 203 days ago [-]
I guess he might mean this one https://shader-slang.org/ though at first glance at least it looks more graphics than GPGPU oriented.
Yes that is the one, and all shader languages also support compute as well, not only graphics.
_0ffh 203 days ago [-]
Thanks yes. Though I did not mean the bare possibility, but intended use case, which may lead to different design choices.
chasely 203 days ago [-]
The Rust-CUDA project just recently started up again [0], I've started digging into it a little bit and am hoping to contribute to it since the summers are a little slower for me.
Still broken though! Has been for years. In a recent GH issue regarding desires for the reboot, I asked: "Try it on a few different machines (OS, GPUs, CUDA versions etc), make it work on modern RustC and CUDA versions without errors." The response was "That will be quite some work." Meanwhile, Cudarc works...
edit: I'm still showing the latest release as from 2022, which I've already verified doesn't work.
chasely 203 days ago [-]
Totally, it's going to take a minute to get it all working. On a positive note, they recently got some sponsorship from Modal [0], who is supplying GPUs for CI/CD so they should be able to expand their hardware coverage.
What do you think of the Burn framework? (Honest question, I have no clue what I’m talking about)
airstrike 203 days ago [-]
I used it to train my own mini-GPT and I liked it quite a bit. I tend to favor a different style of Rust with fewer generics but maybe that just can't be avoided given the goals of that project.
The crate seems to have a lot of momentum, with many new features, releases, active communities on GH and Discord. I expect it to continue to get better.
the__alchemist 203 days ago [-]
Have not heard of it. Looked it up. Seems orthogonal?
I am using Cudarc.
taminka 203 days ago [-]
even putting aside how rust ownership semantics map poorly onto gpu programming, ml researchers will never learn rust, this will never ever happen...
pjmlp 203 days ago [-]
While I agree in principle, CUDA is more than only AI, as people keep forgetting.
taminka 203 days ago [-]
everyone else who uses cuda isn't going to learn rust either
pjmlp 203 days ago [-]
First Rust needs to have tier 1 support for CUDA, in a way that doesn't feel like yak shaving when coding for CUDA.
int_19h 203 days ago [-]
The ML researchers are relying on libraries written by someone else. Today, those libraries are mostly C++, and they would benefit from Rust same as most other C++ codebases.
malcolmgreaves 203 days ago [-]
ML researchers don’t write code, they ask ChatGPT to make a horribly inefficient, non-portable notebook that has to be rewritten from scratch :)
staunton 203 days ago [-]
It's made easier by that notebook only having to work just once, to produce some plots for the paper/press release/demo.
saagarjha 203 days ago [-]
I don’t think this is true. It seems to me more that nobody has put in a serious effort to make a nice interface built using Rust.
the__alchemist 203 days ago [-]
GPGPU programming != ML.
chrisrodrigue 203 days ago [-]
Python is really shaping up to be the lingua franca of programming languages. Its adoption is soaring in this FOSS renaissance and I think it's the closest thing to a golden hammer that we've ever had.
The PEP model is a good vehicle for self-improvement and standardization. Packaging and deployment will soon be solved problems thanks to projects such as uv and BeeWare, and I'm confident that we're going to see continued performance improvements year over year.
silisili 203 days ago [-]
> Packaging and deployment will soon be solved problems
I really hope you're right. I love Python as a language, but for any sufficiently large project, those items become an absolute nightmare without something like Docker. And even with, there seems to be multiple ways people solve it. I wish they'd put something in at the language level or bless an 'official' one. Go has spoiled me there.
horsawlarway 203 days ago [-]
Honestly, I'm still incredibly shocked at just how bad Python is on this front.
I'm plenty familiar with packaging solutions that are painful to work with, but the state of python was shocking when I hopped back in because of the available ML tooling.
UV seems to be at least somewhat better, but damn - watching pip literally download 20+ 800MB torch wheels over and over trying to resolve deps only to waste 25GB of bandwidth and finally completely fail after taking nearly an hour was absolutely staggering.
SJC_Hacker 203 days ago [-]
Python was not taken seriously as something you actually shipped to non-devs. The solution was normally "install the correct version of Python on the host system". In the Linux world, this could be handled through Docker, pyenv. For Windows users, this meant installing a several GB distro and hoping it didn't conflict with what was already on the system.
int_19h 203 days ago [-]
AI-generated code is going to be a major influence going forward. Regardless of how you feel about its quality (I'm a pessimist myself), it's happening anyway, and it's going to cement the dominant position of those languages which LLMs understand / can write the best. Which correlates strongly to their amount in the training set, which means that Python and JavaScript in particular are here to stay now, and will likely be increasingly shoved into more and more niches - even those they aren't well-suited to - solely because LLMs can write them.
gymbeaux 201 days ago [-]
I haven’t run into anything where Python either couldn’t be used or shouldn’t be used, except for the browser of course.
I think software engineers with any significant amount of experience recognize you can build an application that does X in just about any language. To me, the largest difference, the greatest factor in which language to choose, is the existing packages. Simple example- there are several packages in Python for extracting text from PDFs (using tesseract or not). C# has maybe one tesseract wrapper? I recall working with PDFs in .NET being a nightmare. I think we had to buy a license to some software because there wasn’t a free offering. Python has several.
This is VERY important because we as software engineers, even if we wanted to reinvent the wheel sometimes, have very limited time. It takes an obscene number of man hours to develop a SalesForce or a Facebook or even something smaller like a Linux distro.
ergonaught 203 days ago [-]
> Packaging and deployment will soon be solved problems ...
I hope so. Every time I engage in a "Why I began using Go aeons ago" conversation, half of the motivation was this. The reason I stopped engaging in them is because most of the participants apparently cannot see that this is even a problem. Performance was always the second problem (with Python); this was always the first.
pjmlp 203 days ago [-]
Is the new BASIC, Pascal and Lisp.
Now if only CPython also got a world class JIT, V8 style.
screye 203 days ago [-]
Never heard of Beeware, but Astral's products have transformed my python workflow (uv, ruff).
Is Beeware that transformational ? What does Beeware do and what is its maturity level?
whycome 203 days ago [-]
Would you say Python is a good language to learn as a beginner?
airstrike 203 days ago [-]
As someone who spent nearly a decade with Python, I'd say 90% of people will answer "yes", so I'd like to offer a different perspective.
IMHO if you want to pick it up for a couple toy projects just to get a feel of what coding is like, then by all means try it out. But eventually you'll benefit tremendously from exploring other languages.
Python will teach you a lot of bad habits. You will feel like you know what you're doing, but only because you don't know all of the ways in which it is handwaving a lot of complexity that is inherent to writing code which you should be very much aware of.
Knowing what I know now, I wish Rust existed when I started out so that it could have been my first language. I'm never giving up the borrow checker and the type system that come with it.
But you don't have to do Rust. It's fine to work on a couple of projects in Python, then maybe something small in C (though the tooling can feel arcane and frustrating), then maybe switch it up and go with some more functional programming (FP) flavored like Lisp or F#.
I know Rust has a lot of zealots and a lot of haters, but I'm not pushing an agenda. I just think it strikes that perfect balance between being extremely expressive, clear to read (after maybe a month of writing it daily), strong type system, lots of FP elements, no OOP clutter but super powerful traits, the borrow checker which you'll invariably learn to love, and more...
This will give you a strong foundation upon which you'll be able to continuously build knowledge. And even if you start with Rust, you should definitely explore Python, C, Lisp and F# later (or maybe Haskell instead of F#)
system2 203 days ago [-]
With the help of GPT, I think the bad habit part is non-existent anymore. Learning it from GPT really helps people nowadays. Ask ChatGPT 4.0 some questions, and you will be shocked by how well it describes the code.
Just don't ask to fix indentations because it will do it line by line for hours. But it finds mistakes quickly and points you in the right direction.
And of course, it comes up with random non-existent modules once in a while which is cute to me.
airstrike 203 days ago [-]
The bad habits I was thinking about were more in the line of not understanding how memory is being used (even something as simple as stack vs. heap allocation), not having a type system that forces you to think about the types of data structure you have in your system, and overall just being forced to design before coding
system2 203 days ago [-]
I ask for help from ChatGPT before I go all in, and it creates these old-fashioned ASCII graphics to show how the flow will be. I think newcomers will not have the bad habits we have/had.
airstrike 202 days ago [-]
That's really not enough or the same, trust me. You can't outsource your entire understanding to ChatGPT, and I say this as someone who's keen on getting help from AI assistants to write boilerplate code, rubberduck bugs or debate design decisions
throwaway314155 202 days ago [-]
Respectfully, many of those things aren't a concern from Python's point of view. And why should they be? If your program runs imperceptibly slower, or using an insignificant amount of extra memory, any attempts to fix this are considered a premature optimization that gets in the way of what is more important to pythonistas - developer experience and high level abstractions.
Frankly, the comparison with Rust doesn't even really make sense. They are different tools for very different problems.
airstrike 194 days ago [-]
Except it's not imperceptibly slower, it's orders of magnitude slower.
Barrin92 203 days ago [-]
>Knowing what I know now, I wish Rust existed when I started out so that it could have been my first language
No offense but I don't think this makes any sense (or only if you take the first part of that sentence literally). It's like jumping into Calculus 3 to introduce a kid to maths. From a teaching standpoint, if you're a beginner, you can't even understand what problem Rust solves. Someone who doesn't know what manual memory management, a heap and a stack is should not be handed a borrow checker.
You can either start from the top, the old school way, teach a lisp or python as a more modern alternative and teach people symbolic computing, or you can start with C and teach people from the bottom up how computers work, but frankly throwing you into a language that basically exists to solve problems professional C++ developers have in large projects is kind of wild
airstrike 202 days ago [-]
People were learning C and malloc long before Python came along. You don't need to start with a high level language.
Rust does way more than "solve problems professional C++ developers have". That's not a fair or accurate read of the language. I think you're misinformed.
Barrin92 202 days ago [-]
>You don't need to start with a high level language.
I didn't say that. I said you can start with a high or low level language. I did literally mention C in my own post as a decent starting point.
Rust however is not a beginner friendly language because again, the thing that sets it apart is that it aims to solve a particular domain specific problem of programming, which is memory management, in a unique way that means nothing to a person who has never been exposed to the problem in the first place.
chrisrodrigue 203 days ago [-]
Yeah, definitely. It's basically executable pseudocode and it's really simple for a beginner to pick up and hit the ground running for a variety of use cases.
Some people will tell you to start with C or C++ to get a better intuition for what's actually happening under the hood in Python, but that's not really necessary for most use cases unless you're doing something niche. Some of the most popular use cases for Python are webapps, data analysis, or general automation. For the 1% of use cases that Python isn't the right fit for, you can still use it to prototype or glue things together.
There are a lot of great resources out there for learning Python, but they won't necessarily teach you how to make great software. You can't go wrong with the official tutorial. https://learn.scientific-python.org/development/ is pretty terse and incorporates a lot of best practices.
somethingsome 203 days ago [-]
I was teaching python long ago to absolute beginners in programming.
Honestly, the language became kinda harsh for newcomers. What we see as developers is 'it's like pseudocode that runs'.
But a beginner is often left behind by the billions of methods in each class. He is not used to documentation, and spends a huge amount of time learning by heart stupid things like 'len()' in this case, '.len()' here, '.length' there, etc., for many, many methods that all have their idiosyncrasies.
At least in c/(easy)c++, you need to build yourself most of it, helping the understanding.
I'm not completely against python as a first language, but it needs to be taught well, and that could include working with a very minimal set of functions on every object. Then you can expand and incorporate more and more methods that make life easier.
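On the 'len()' point specifically, a small sketch of how the built-in works may help a beginner: sized built-in containers answer to the same len() call, and a user-defined class can join in by defining __len__. The Playlist class here is a made-up example.

```python
class Playlist:
    """Hypothetical container that opts in to len() by defining __len__."""

    def __init__(self, tracks):
        self._tracks = list(tracks)

    def __len__(self):
        return len(self._tracks)


# One spelling works across list, str, dict, tuple, and the custom class.
things = [[1, 2, 3], "abc", {"a": 1}, (1, 2), Playlist(["x", "y"])]
sizes = [len(thing) for thing in things]

# Built-in strings and lists have no .length attribute or .len() method.
has_length_attr = hasattr("abc", "length")
has_len_method = hasattr([1, 2, 3], "len")

print(sizes, has_length_attr, has_len_method)
```

Third-party libraries can still pick their own names (for example a .size or .shape attribute), so the memorization complaint is not entirely unfounded once you leave the standard library.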
silisili 203 days ago [-]
I go back and forth on this. A lot of people make good points.
In the end, my final answer is - yes. I say that because I believe it's the easiest programming language to get something working in. And getting something working is what motivates people to keep going.
If you sit them down and say 'well before you learn python you need to learn how a computer really works, here's an ASM x86 book', they're gonna probably read 10 pages, say this is boring, then go do something else. I think that because I went through that as a kid - I started reading a C++ book with no knowledge and gave up. It wasn't until I found qbasic and VB, by all marks a terrible language, that I really got motivated to learn and keep going because progress was so easy.
Python will teach you the basics - control flow, loops, variables, functions, libraries, etc. Those apply to almost every language. Then when you move to a different language, you at least know the basics and can focus on what's different or added that you didn't have or know before.
SJC_Hacker 203 days ago [-]
Yeah, Python or Javascript should be first languages for most people.
People like flashy things, and Python and Javascript are just 10x easier to get that working. Console I/O doesn't really cut it anymore.
Later on you can deal with memory allocation, bit-twiddling, 2's complement arithmetic, lower level OS details etc.
dpkirchner 203 days ago [-]
Not the person you replied to but I'd say definitely not. It'd be easy to pick up bad habits from python (untyped variables) and try to carry them over to other languages. It's also the king of runtime errors, which will frustrate newbies.
I think a compiled language is a better choice for people just getting started. Java is good, IMO, because it is verbose. Eventually the beginner may get tired of the verbosity and move on to something else, but at least they'll understand the value of explicit types and compile-time errors.
SJC_Hacker 203 days ago [-]
Huh? Python variables definitely have a type. It's just determined at runtime.
The only untyped language I know, at least modern ones is assembler.
Well and C, if you make everything void*.
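Both sides of this exchange can be seen in a few lines. A minimal sketch (the names describe and add are made up for illustration): values carry a type at runtime and mixing incompatible ones is rejected, but a name can be rebound to any type, and type hints are not enforced by the interpreter itself.

```python
def describe(value):
    """Return the name of the runtime type of a value."""
    return type(value).__name__


x = 42
kinds = [describe(x)]
x = "forty-two"  # rebinding the same name to a str is allowed
kinds.append(describe(x))


def add(a: int, b: int) -> int:
    """The hints are documentation; the interpreter does not check them."""
    return a + b


hinted_result = add("a", "b")  # runs fine and concatenates the strings

try:
    "1" + 1  # values are still strongly typed: str + int is rejected
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(kinds, repr(hinted_result), mixing_allowed)
```

A separate static checker such as mypy would flag the add("a", "b") call before running it, which is the closest Python gets to the compile-time errors mentioned above.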
IshKebab 203 days ago [-]
I would personally recommend Javascript/Typescript over Python, but Python is a reasonable option. Especially now that we have uv so you don't have to crawl through the bush of thorns that Python's terrible tooling (pip, venv etc) surrounds you with.
I would just encourage you to move on from Python fairly quickly. It's like... a balance bike. Easy to learn and teaches you how to balance, but you don't want to actually use it to get around.
airstrike 203 days ago [-]
Python is too high level, slow and duck-typed to even be considered for a huge number of projects.
There is no one-size-fits-all programming language.
jmward01 203 days ago [-]
This will probably lead to what, I think, python has led to in general: A lot more things tried quicker and targeted things that stay in a faster language. All in all this is a great move. I am looking forward to playing with it for sure.
math_dandy 203 days ago [-]
Is NVIDIA's JIT-based approach here similar to JAX's, except targeting CUDA directly rather than XLA? Would like to know how these different JIT compilers relate to one another.
lunarboy 203 days ago [-]
Doesn't this mean AMD could make the same python API that targets their hardware, and now Nvidia GPUs aren't as sticky?
KeplerBoy 203 days ago [-]
Nothing really changed. AMD already has a c++ (HIP) dialect very similar to CUDA, even with some automatic porting efforts (hipify).
AMD is held back by the combination of a lot of things. They have a counterpart to almost everything that exists on the other side.
The things on the AMD side are just less mature with worse documentation and not as easily testable on consumer hardware.
pjmlp 203 days ago [-]
They could, AMD's problem is that they keep failing on delivery.
DeathArrow 203 days ago [-]
>In 2024, Python became the most popular programming language in the world — overtaking JavaScript — according to GitHub’s 2024 open source survey.
I wonder why Python take over the world? Of course, it's easy to learn, it might be easy to read and understand. But it also has a few downsides: low performance, single threaded, lack of static typing.
lenerdenator 203 days ago [-]
I do backend web server development using FastAPI/Starlette and Django.
If I were a Ruby developer, I'd be using Rails, and I'd also be describing 90% of Ruby development.
However, I do Python. What I'm describing is a tiny fraction of Python development.
If you want to do something with computer code - data analysis, ML, web development, duct-taping together parts of a *NIX system, even some game development - you can do it reasonably well, if not better, in Python. The paths that you can take are limitless, and that gets people interested.
EnergyAmy 203 days ago [-]
There's the pithy saying that "Python is the second-best language for anything", and that's kind of its superpower.
greenavocado 203 days ago [-]
If you are trying to do Ruby but fast you're supposed to use Crystal
hbn 203 days ago [-]
It's "easy to learn" and you get all the downsides that come with that.
At work right now we're integrating with scoring models hosted in Amazon SageMaker written by a "modelling team" and as far as I can tell they follow absolutely no basic coding practices. They give us the API and are asking us to send English strings of text for names of things instead of any real keys, and they're just comparing against plain strings and magic numbers everywhere so if they're asked to make any change like renaming something it's a herculean task that breaks a bunch of other things. Something will break when a field is null and then they'll tell us instead of sending null if we have no data to send -9999999. One time something broke and it turned out to be because we sent them "MB" (Manitoba) as someone's province, and whoever wrote it was just plain-text checking against a list of province codes as strings and didn't remember to include Manitoba.
I know this is still mainly a business/management issue that they're allowing people who don't know how to code to write code, but I'm sure this is happening at other companies, and I think Python's level of accessibility at the beginner level has been a real blight to software quality.
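The Manitoba failure and the -9999999 sentinel are the sort of thing an enum and an explicit None tend to catch early. A minimal sketch, assuming nothing about the actual SageMaker code: Province and parse_province are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Province(Enum):
    """Canadian provinces and territories, keyed by two-letter code."""

    AB = "Alberta"
    BC = "British Columbia"
    MB = "Manitoba"
    NB = "New Brunswick"
    NL = "Newfoundland and Labrador"
    NS = "Nova Scotia"
    NT = "Northwest Territories"
    NU = "Nunavut"
    ON = "Ontario"
    PE = "Prince Edward Island"
    QC = "Quebec"
    SK = "Saskatchewan"
    YT = "Yukon"


def parse_province(code: Optional[str]) -> Optional[Province]:
    """Map a code to a Province; None means 'no data', not a magic number."""
    if code is None:
        return None
    try:
        return Province[code.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown province code: {code!r}") from None


print(parse_province("MB"), parse_province(None))
```

With one definition of the valid codes, a missing entry fails loudly at the boundary instead of silently falling through a chain of string comparisons.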
timschmidt 203 days ago [-]
Universities seem to have settled on it for CSE 101 courses in the post-Java academic programming era.
diggan 203 days ago [-]
> I wonder why Python take over the world?
Not sure what "most popular programming language in the world" even means, in terms of existing projects? In terms of developers who consider it their main language? In terms of existing actually active projects? According to new projects created on GitHub that are also public?
My guess is that it's the last one, which probably isn't what one would expect when hearing "the most popular language in the world", so worth keeping in mind.
But considering that AI/ML is the hype today, and everyone wants to get their piece of the pie, it makes sense that there are more public Python projects created on GitHub today compared to other languages, as most AI/ML is Python.
georgeecollins 203 days ago [-]
Less typing-- I mean keystrokes.
All the things that are not great about it make it easier to learn. No static typing, no control of memory, no threads.
When I started there was a language like BASIC or Visual BASIC that was easy to learn (or also quick to use) and C or C++ that was performant. If the world now is Python and Rust or Go, I think that it is just a better world for programmers. I say that as someone comfortable with C / C++ / Java. They had their time and will still be with us, but the improvement is real.
PeterStuer 203 days ago [-]
It's the ecosystem, specifically the huge amount of packages available for everything under the sun.
nhumrich 203 days ago [-]
Perhaps performance, multi threading, and static typing are not the #1 things that make a language great.
My guess: it's the community.
chupasaurus 203 days ago [-]
All 3 are achieved in Python with a simple import ctypes /sarcasm
owebmaster 203 days ago [-]
In this case, it's because JS ecosystem is now divided between JavaScript and TypeScript
timschmidt 203 days ago [-]
As soon as WASM has native bindings for the DOM, I think you're going to see a lot of the energy in the JS ecosystem drain back into other languages.
lenkite 203 days ago [-]
> I wonder why Python take over the world
Because data-science/ML/LLM's have taken over the world now and no other language offers best-in-breed libraries and frameworks.
Other languages need to get off their ass and start offering options soon or be relegated to niche domains.
TechDebtDevin 203 days ago [-]
I don't know. It absolutely annoys me. Go is more readable, easier to learn, more efficient, more fun to write but doesn't have all the math/ml packages people want. I'd like to get involved in catching Go up to Python in the ML space but Go is so behind.
leosanchez 203 days ago [-]
> more fun to write
Go is definitely not fun to write. The rest I agree.
CapsAdmin 203 days ago [-]
Slightly related, I had a go at doing llama 3 inference in luajit using cuda as one compute backend for just doing matrix multiplication
Technically speaking, all of this exists (including the existing library integration and whatnot) through CuPy and Numba already, but the fact that it’s getting official support is cool.
ryao 203 days ago [-]
CUDA was born from C and C++
It would be nice if they actually implemented a C variant of CUDA instead of extending C++ and calling it CUDA C.
pjmlp 203 days ago [-]
First of all they extend C, and with CUDA 3.0, initial support was added for C++, afterwards they bought PGI and added Fortran into the mix.
Alongside for the ride, they fostered an ecosystem from compiled language backends targeting CUDA.
Additionally modern CUDA supports standard C++ as well, with frameworks that hide the original extensions.
Most critics don't really get the CUDA ecosystem.
ryao 203 days ago [-]
They replaced C with C++. For example, try passing a function pointer as a void pointer argument without a cast. C says this should work. C++ says it should not. There are plenty of other differences that make it C++ and not C, if you know to look for them. The fact that C++ symbol names are used for one, which means you need to specify extern “C” if you want to reference them from the CUDA driver API. Then there is the fact that it will happily compile C++ classes where a pure C compiler will not. There is no stdc option for the compiler.
pjmlp 203 days ago [-]
They replaced most of the documentation with C++ examples, given the benefits the language has over C, that was already obvious to me in 1993.
As for the language extensions required by CUDA C, it is kind of interesting that clang and GCC extensions are praised and people keep referring to them as C, while everyone else's extensions are never C or C++ under the same measure.
With OpenACC directives, an HPC industry standard, you can make use of plain old C11 with traditional #pragmas.
I used to like C++. Then it caused me headaches one too many times because of things it implemented that C did not have. Now I prefer to use C whenever I can, since it avoids entire classes of headaches that only exist in C++.
swyx 203 days ago [-]
why is that impt to you? just trying to understand the problem you couldnt solve without a C-like
ryao 203 days ago [-]
I want to write C code, not C++ code. Even if I try to write C style C++, it is more verbose and less readable, because of various C++isms. For example, having to specify extern “C” to get sane ABI names for the Nvidia CUDA driver API:
Not to mention that C++ does not support neat features like variable sized arrays on the stack.
pjmlp 203 days ago [-]
A neat feature that is so neat Google paid to get it eradicated from the Linux kernel, and it became optional in C11.
ryao 200 days ago [-]
I think you replied to the wrong person.
kevmo314 203 days ago [-]
A strict C variant would indeed be quite nice. I've wanted to write CUDA kernels in Go apps before so the Go app can handle the concurrency on the CPU side. Right now, I have to write a C wrapper and more often than not, I end up writing more code in C++ instead.
But then I end up finding myself juggling mutexes and wishing I had some newer language features.
no_wizard 203 days ago [-]
What makes Python such a target for these kind of things?
I've noticed a lot of projects add Python support like this. Does the Python codebase allow for it to compile down to different targets easier than others?
saagarjha 203 days ago [-]
There’s a lot of existing Python code in this space and many ML researchers are comfortable in Python.
It's a holistic approach to all levels of the stack, from high-level frameworks to low-level bindings, some of which is highlighting existing libraries, and some of which are completely newly announced.
One of the big things seems to be a brand new Tile IR, at the level of PTX and supported with a driver level JIT compiler, and designed for Python-first semantics via a new cuTile library.
Really exciting stuff, though with the new IR it further widens the gap that projects like https://github.com/vosen/ZLUDA and AMD's own tooling are trying to bridge. But vendor lock-in isn't something we can complain about when it arises from the vendor continuing to push the boundaries of developer experience.
skavi 203 days ago [-]
i’m curious what advantage is derived from this existing independently of the PTX stack? i.e. why doesn’t cuTile produce PTX via a bundled compiler like Triton or (iirc) Warp?
Even if there is some impedance mismatch, could PTX itself not have been updated?
cavisne 203 days ago [-]
In the presentation they said eventually kernels can share SIMT (PTX) and TileIR but not at launch. It seems pretty mysterious why they don't just emit PTX, I would guess they are either taking the opportunity to clean things up for ML tensorcore workloads or there are some HW specific features coming that they only want to enable through TileIR.
skavi 201 days ago [-]
if i were to lean into cynicism, i might suggest this choice was meant to increase the effort required to reimplement cuda for other cards.
> The article is about the next wave of Python-oriented JIT toolchains
the article is content marketing (for whatever) but the actual product has literally nothing to do with kernels or jitting or anything
https://github.com/NVIDIA/cuda-python
literally just cython bindings to CUDA runtime and CUB.
for once CUDA is aping ROCm:
https://github.com/ROCm/hip-python
i'm not making any such mistake - i'm just able to actually read and comprehend what i'm reading rather than perform hype:
> Over the last year, NVIDIA made CUDA Core, which Jones said is a “Pythonic reimagining of the CUDA runtime to be naturally and natively Python.”
so the article is about cuda-core, not whatever you think it's about - so i'm responding directly to what the article is about.
> CUDA Core has the execution flow of Python, which is fully in process and leans heavily into JIT compilation.
this is bullshit/hype about Python's new JIT which womp womp womp isn't all that great (yet). this has absolutely nothing to do with any other JIT e.g., the cutile kernel driver JIT (which also has absolutely nothing to do with what you think it does).
The evidence of that is lacking.
> so the article is about cuda-core, not whatever you think it's about
cuda.core (a relatively new, rapidly developing, library whose entire API is experimental) is one of several things (NVMath is another) mentioned in the article, but the newer and as yet unreleased piece mentioned in the article and the GTC announcement, and a key part of the “Native Python” in the headline, is the CuTile model [0]:
“The new programming model, called CuTile interface, is being developed first for Pythonic CUDA with an extension for C++ CUDA coming later.”
> this is bullshit/hype about Python's new JIT
No, as is fairly explicit in the next line after the one you quote, it is about the Nvidia CUDA Python toolchain using in-process compilation rather than relying on shelling out to out-of-process command-line compilers for CUDA code.
[0] The article only has fairly vague qualitative description of what CuTile is, but (without having to watch the whole talk from GTC), one could look at this tweet for a preview of what the Python code using the model is expected to look like when it is released: https://x.com/blelbach/status/1902113767066103949?t=uihk0M8V...
my guy what i am able to read, which you are not, is the source and release notes. i do not need to read tweets and press releases because i know what these things actually are. here are the release notes
> Support Python 3.13
> Add bindings for nvJitLink (requires nvJitLink from CUDA 12.3 or above)
> Add optional dependencies on CUDA NVRTC and nvJitLink wheels
https://nvidia.github.io/cuda-python/latest/release/12.8.0-n...
do you understand what "bindings" and "optional dependencies on..." means? it means there's nothing happening in this library and these are... just bindings to existing libraries. specifically that means you cannot jit python using this thing (except via the python 3.13 jit interpreter) and can only do what you've always already been able to do with eg cupy (compile and run C/C++ CUDA code).
EDIT: y'all realize that
1. calling a compiler for your entire source file
2. loading and running that compiled code
is not at all a JIT? y'all understand that right?
Those aren't the release notes for the native python thing being announced. CuTile has not been publicly released yet. Based on what the devs are saying on Twitter it probably won't be released before the SciPy 2025 conference in July.
Also the cuda-core JIT stuff has nothing to do with Python's new JIT, it's referring to integrating nvJitLink with python, which you can see an example of in cuda_core/examples/jit_lto_fractal.py
there is no release of cutile (yet). so the only substantive thing that the article can be describing is cuda-core - which it does describe and is a recent/new addition to the ecosystem.
man i can't fathom glazing a random blog this hard just because it's tangentially related to some other thing (NV GPUs) that clearly people only vaguely understand.
For comparison, doing something similar with torch on CPU and torch on GPU will get you like 100x speed difference.
PSA: if you ever see code trying to measure timing and it’s not using the CUDA event APIs, it’s fundamentally wrong and is lying to you. The simplest way to be sure you’re not measuring noise is to just ban the usage of any other timing source. Definitely don’t add unnecessary syncs just so that you can add a timing tap.
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART_...
If I don’t care what part of the CUDA ecosystem is taking time (from my point of view it is a black box that does GEMMs), why not measure “time until my normal code is running again?”
But cuda is not a black box math accelerator. You can stupidly treat it as such, but that doesn’t make it that. It’s an entire ecosystem with drivers and contexts and lifecycles. If everything you’re doing is synchronous and/or you don’t mind if your metrics include totally unrelated costs, then time.time() is fine, sure. But if that’s the case, you’ve got bigger problems.
But, there are like 50 years worth of Fortran numerical codes out there, lots of them just use RCIs… if I want to try CUDA in some existing library, I guess I will need the vector back before I can go back into the RCI.
I authored one of the primary tools for GraphQL server benchmarks.
I learned about the Coordinated Omission problem and formats like HDR Histograms during the implementation.
My takeaway from that project is that not only is benchmarking anything correctly difficult, but they all ought to come with disclaimers of:
"These are the results obtained on X machine, running at Y time, with Z resources."
So if you need to wait for an op to finish, you need to `synchronize` as shown above.
`get_current_stream` because the queue mentioned above is actually called stream in cuda.
If you want to run many independent ops concurrently, you can use several streams.
Benchmarking is one use case for synchronize. Another would be if you let's say run two independent ops in different streams and need to combine their results.
Btw, if you work with pytorch, when ops are run on gpu, they are launched in background. If you want to bench torch models on gpu, they also provide a sync api.
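To make the launch-then-synchronize point concrete, here is a minimal CPU-only analogy using nothing but the Python standard library. It is not CUDA code: a single-worker `ThreadPoolExecutor` stands in for a stream, `fake_kernel` is a made-up placeholder for a GPU op, and `future.result()` plays the role of `synchronize`. All of those names are assumptions for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(duration_s):
    """Hypothetical stand-in for a GPU op; it just sleeps for duration_s seconds."""
    time.sleep(duration_s)
    return duration_s


# One worker thread plays the role of one stream: submitted work runs in
# order, in the background, while the caller keeps going.
stream = ThreadPoolExecutor(max_workers=1)

start = time.perf_counter()
future = stream.submit(fake_kernel, 0.2)   # "launch": returns right away
launch_time = time.perf_counter() - start  # measures only the enqueue

future.result()                            # "synchronize": wait for the op
total_time = time.perf_counter() - start   # now includes the actual work

print(f"time to launch:     {launch_time:.4f} s")
print(f"time after syncing: {total_time:.4f} s")
stream.shutdown()

# Two independent ops in two "streams": both must be waited on before
# their results can be combined.
stream_a = ThreadPoolExecutor(max_workers=1)
stream_b = ThreadPoolExecutor(max_workers=1)
future_a = stream_a.submit(fake_kernel, 0.2)
future_b = stream_b.submit(fake_kernel, 0.2)
combined = future_a.result() + future_b.result()
print(f"combined result:    {combined}")
stream_a.shutdown()
stream_b.shutdown()
```

Stopping the clock right after the submit call only measures how long it took to enqueue the work, which is why a GPU benchmark without a sync looks unrealistically fast. In PyTorch the equivalent wait is `torch.cuda.synchronize()`.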
Do other languages surface the asynchronous nature of GPUs in language-level async, avoiding silly stuff like synchronize?
The 'trick' for CUDA is that you declare all this using buffers as inputs/outputs rather than values and that there's automatic ordering enforcement through CUDA's stream mechanism. Marrying that with the coroutine mechanism just doesn't really make sense.
It's great that the parts of PyTorch that concern the NVIDIA backend can now be implemented in Python directly. The important part is that it doesn't really matter, or shouldn't matter, for end users / developers.
that being said, maybe this new platform will extend the whole concept of on GPU computation via Python to even more domains like maybe games.
Imagine running rust the Game performantly mainly on the GPU via Python
I'm totally with you that it's better that this took so long, so we have things like PyTorch abstracting most of this away, but I'm looking forward to (in my non-existent free time :/ ) playing with this.
That said, there were almost no announcements or talks related to CPUs, despite the Grace CPUs being announced quite some time ago. It doesn't feel like we're going to see generalizable abstractions that work seamlessly across Nvidia CPUs and GPUs anytime soon. For someone working on parallel algorithms daily, this is an issue: debugging with NSight and CUDA-GDB still isn't the same as raw GDB, and it's much easier to design algorithms on CPUs first and then port them to GPUs.
Of all the teams in the compiler space, Modular seems to be among the few that aren't entirely consumed by the LLM craze, actively building abstractions and languages spanning multiple platforms. Given the landscape, that's increasingly valuable. I'd love to see more people experimenting with Mojo — perhaps it can finally bridge the CPU-GPU gap that many of us face daily!
As someone working on graphics programming, it always frustrates me to see so much investment in GPU APIs _for AI_, but almost nothing for GPU APIs for rendering.
Block level primitives would be great for graphics! PyTorch-like JIT kernels programmed from the CPU would be great for graphics! ...But there's no money to be made, so no one works on it.
And for some reason, GPU APIs for AI are treated like an entirely separate thing, rather than having one API used for AI and rendering.
I’m one of those people who can’t (won’t) learn C++ to the extent required to effectively write code for GPU execution…. But to have a direct pipeline to the GPU via Python. Wow.
The efficiency implications are huge, not just for Python libraries like PyTorch, but also anything we write that runs on an NVIDIA GPU.
I love seeing anything that improves efficiency because we are constantly hearing about how many nuclear power plants OpenAI and Google are going to need to power all their GPUs.
This means that Nvidia is selling a relatively unique architecture with a fully-developed SDK, industry buy-in and relevant market demand. Getting AMD up to the same spot would force them to reevaluate their priorities and demand a clean-slate architecture to-boot.
Khronos also never saw the need to support a polyglot ecosystem with C++, Fortran and anything else that the industry could feel like using on a GPU.
When Khronos finally remembered to at least add C++ support and SPIR, Intel and AMD again failed to deliver, and OpenCL 3.0 is basically OpenCL 1.0 rebranded.
Followed by SYCL efforts, which only Intel seems to care about, with their own extensions on top via DPC++, nowadays oneAPI. And only after acquiring Codeplay, which was actually the first company to deliver on SYCL tooling.
However contrary to AMD, at least Intel does get that unless everyone gets to play with their software stack, no one will bother to actually learn it.
However you want to paint the picture today, you can't say the industry didn't try to resist CUDA. The stakeholders shot each other in a 4-way Mexican standoff, and Nvidia whistled showtunes all the way to the bank. If OpenCL was treated with the same importance Vulkan was, we might see a very different market today.
Vulkan you say?
It is only relevant on GNU/Linux and Android, because Google is pushing it, and still most folks keep using OpenGL ES. No one else cares about it, and it has already turned into the same spaghetti mess as OpenGL, to the point that there was a roadmap talk at Vulkanised 2025 on how to sort things out.
NVidia and AMD keep designing their cards with Microsoft for DirectX first, and Vulkan, eventually.
Not really. For instance NVIDIA released day 1 Vulkan extensions for their new raytracing and neural net tech (VK_NV_cluster_acceleration_structure, VK_NV_partitioned_tlas, VK_NV_cooperative_vector), as well as equivalent NVAPI extensions for DirectX12. Equal support, although DirectX12 is technically worse as you need to use NVAPI and rely on a prerelease version of DXC, as unlike Vulkan and SPIR-V, DirectX12 has no mechanism for vendor-specific extensions (for good or bad).
Meanwhile the APIs, both at a surface level and how the driver implements them under the hood, are basically identical. So identical in fact, that NVIDIA has the nvrhi project which provides a thin wrapper over Vulkan/DirectX12 so that you can run on multiple platforms via one API.
As a more recent example, not feeling like enumerating all of them since DirectX 8 shader model introduction, and collaboration with NVidia where Cg became HLSL foundation.
Exactly, proprietary APIs don't have extension spaghetti like Khronos APIs, that always end up out of control, hence Vulkan 2025 roadmap plans.
Khronos got lucky that Google and Samsung decided to embrace Vulkan as the API to be on Android, Valve for their Steam Deck, and IoT displays, basically.
Everywhere else it is middleware engines that support all major 3D APIs, with WebGPU becoming also middleware outside of the browser due to the ways of Vulkan.
DirectX "neural shaders" is literally the VK_NV_cooperative_vector extension I mentioned previously, which is actually easier to use in Vulkan at the moment since you don't need a custom prerelease version of DXC. Same for all the RTX kit stuff, e.g. https://github.com/NVIDIA-RTX/RTXGI has both VK and DX12 support.
Additionally, naturally Intel and AMD will come up with their extensions, if ever, followed by a Khronos common one. Not counting mobile units into this extension frenzy.
So then we will have the pleasure to choose between four extensions for a feature, depending on the card's vendor, with possible incompatible semantics, as it has happened so many times.
Sounds like a submarket absolutely teeming with competition. Like, you have Metal Compute, and Apple Accelerate Framework and MLX all sitting there in the same spot! Apple is really outdoing themselves, albeit in a fairly literal sense.
> It is only relevant on GNU/Linux and Android
Hmm... someone ought to remind me of the first stage of grief, I've forgotten it suddenly.
Have you ever used a GPU API (CUDA, OpenCL, OpenGL, Vulkan, etc...) with a scripting language?
It's cool that Nvidia made a bit of an ecosystem around it but it won't replace C++ or Fortran and you can't simply drop in "normal" Python code and have it run on the GPU. CUDA is still fundamentally its own thing.
There's also been CUDA bindings to scripting languages for at least 15 years... Most people will probably still use Torch or higher level things built on top of it.
Also, here's Nvidia's own advertisement and some instructions for Python on their GPUs:
- https://developer.nvidia.com/cuda-python
- https://developer.nvidia.com/how-to-cuda-python
Reality is kind of boring, and the article posted here is just clickbait.
While it's not exactly normal Python code, there are Python libraries that allow writing GPU kernels in internal DSLs that are normal-ish Python (e.g., Numba for CUDA specifically via the @cuda.jit decorator; or Taichi which has multiple backends supporting the same application code: Vulkan, Metal, CUDA, OpenGL, OpenGL ES, and CPU.)
Apparently, nVidia is now doing this first party in CUDA Python, including adding a new paradigm for CUDA code (CuTile) that is going to be in Python before C++; possibly trying to get ahead of things like Taichi (which, because it is cross-platform, commoditizes the underlying GPU).
> Also, here's Nvidia's own advertisement for Python on their GPUs
That (and the documentation linked there) does not address the new upcoming native functionality announced at GTC; existing CUDA Python has kernels written in C++ in inline strings.
Ish. It's a Python maths library made by Nvidia, an eDSL and a collection of curated libraries. It's not significantly different than stuff like Numpy, Triton, etc..., apart from being made by Nvidia and bundled with their tools.
The polyglot nature of CUDA is one of the plus points versus the original "we do only C99 dialect around here" from OpenCL, until it was too late.
No need for a RTX for learning and getting into CUDA programming.
So your lousy laptop GTX 750Ti ehhh probably can’t practically be used for CUDA. But your lousy 1050Ti Max-Q? Sure.
JAX lets you write Python code that executes on Nvidia, but also GPUs of other brands (support varies). It similarly has drop-in replacements for NumPy functions.
This only supports Nvidia. But can it do things JAX can't? It is easier to use? Is it less fixed-size-array-oriented? Is it worth locking yourself into one brand of GPU?
[1] https://github.com/jax-ml/jax
[1]: https://numba.readthedocs.io/en/stable/cuda/overview.html
Edit: Hmm, this part of the same project looks general purpose-y and apparently integrates with PyTorch https://slangpy.shader-slang.org/en/latest/
[0] https://github.com/rust-gpu/rust-cuda
edit: I'm still showing the latest release as from 2022, which I've already verified doesn't work.
The crate seems to have a lot of momentum, with many new features, releases, active communities on GH and Discord. I expect it to continue to get better.
I am using Cudarc.
The PEP model is a good vehicle for self-improvement and standardization. Packaging and deployment will soon be solved problems thanks to projects such as uv and BeeWare, and I'm confident that we're going to see continued performance improvements year over year.
I really hope you're right. I love Python as a language, but for any sufficiently large project, those items become an absolute nightmare without something like Docker. And even with, there seems to be multiple ways people solve it. I wish they'd put something in at the language level or bless an 'official' one. Go has spoiled me there.
I'm plenty familiar with packaging solutions that are painful to work with, but the state of python was shocking when I hopped back in because of the available ML tooling.
UV seems to be at least somewhat better, but damn - watching pip literally download 20+ 800MB torch wheels over and over trying to resolve deps only to waste 25GB of bandwidth and finally completely fail after taking nearly an hour was absolutely staggering.
I think software engineers with any significant amount of experience recognize you can build an application that does X in just about any language. To me, the largest difference, the greatest factor in which language to choose, is the existing packages. Simple example- there are several packages in Python for extracting text from PDFs (using tesseract or not). C# has maybe one tesseract wrapper? I recall working with PDFs in .NET being a nightmare. I think we had to buy a license to some software because there wasn’t a free offering. Python has several.
This is VERY important because we as software engineers, even if we wanted to reinvent the wheel sometimes, have very limited time. It takes an obscene number of man hours to develop a SalesForce or a Facebook or even something smaller like a Linux distro.
I hope so. Every time I engage in a "Why I began using Go aeons ago" conversation, half of the motivation was this. The reason I stopped engaging in them is because most of the participants apparently cannot see that this is even a problem. Performance was always the second problem (with Python); this was always the first.
Now if only CPython also got a world class JIT, V8 style.
Is Beeware that transformational ? What does Beeware do and what is its maturity level?
IMHO if you want to pick it up for a couple toy projects just to get a feel of what coding is like, then by all means try it out. But eventually you'll benefit tremendously from exploring other languages.
Python will teach you a lot of bad habits. You will feel like you know what you're doing, but only because you don't know all of the ways in which it is handwaving a lot of complexity that is inherent to writing code which you should be very much aware of.
Knowing what I know now, I wish Rust existed when I started out so that it could have been my first language. I'm never giving up the borrow checker and the type system that come with it.
But you don't have to do Rust. It's fine to work on a couple of projects in Python, then maybe something small in C (though the tooling can feel arcane and frustrating), then maybe switch it up and go with some more functional programming (FP) flavored like Lisp or F#.
I know Rust has a lot of zealots and a lot of haters, but I'm not pushing an agenda. I just think it strikes that perfect balance between being extremely expressive, clear to read (after maybe a month of writing it daily), strong type system, lots of FP elements, no OOP clutter but super powerful traits, the borrow checker which you'll invariably learn to love, and more...
This will give you a strong foundation upon which you'll be able to continuously build knowledge. And even if you start with Rust, you should definitely explore Python, C, Lisp and F# later (or maybe Haskell instead of F#)
Just don't ask to fix indentations because it will do it line by line for hours. But it finds mistakes quickly and points you in the right direction.
And of course, it comes up with random non-existent modules once in a while which is cute to me.
Frankly, the comparison with Rust doesn't even really make sense. They are different tools for very different problems.
No offense but I don't think this makes any sense (or only if you take the first part of that sentence literally). It's like jumping into Calculus 3 to introduce a kid to maths. From a teaching standpoint, if you're a beginner, you can't even understand what problem Rust solves. Someone who doesn't know what manual memory management, a heap and a stack is should not be handed a borrow checker.
You can either start from the top, the old school way, teach a lisp or python as a more modern alternative and teach people symbolic computing, or you can start with C and teach people from the bottom up how computers work, but frankly throwing you into a language that basically exists to solve problems professional C++ developers have in large projects is kind of wild
Rust does way more than "solve problems professional C++ developers have". That's not a fair or accurate read of the language. I think you're misinformed.
I didn't say that. I said you can start with a high or low level language. I did literally mention C in my own post as a decent starting point.
Rust however is not a beginner friendly language because again, the thing that sets it apart is that it aims to solve a particular domain specific problem of programming, which is memory management, in a unique way that means nothing to a person who has never been exposed to the problem in the first place.
Some people will tell you to start with C or C++ to get a better intuition for what's actually happening under the hood in Python, but that's not really necessary for most use cases unless you're doing something niche. Some of the most popular use cases for Python are webapps, data analysis, or general automation. For the 1% of use cases that Python isn't the right fit for, you can still use it to prototype or glue things together.
There are a lot of great resources out there for learning Python, but they won't necessarily teach you how to make great software. You can't go wrong with the official tutorial. https://learn.scientific-python.org/development/ is pretty terse and incorporates a lot of best practices.
Honestly, the language became kinda harsh for newcomers; what we see as developers is 'it's like pseudocode that runs'.
But a beginner is often left behind by the billions of methods in each class. He is not used to documentation, and spends quite a huge amount of time learning stupid things by heart, like 'len()' in this case, '.len()' here, '.length' there, etc., for many, many methods that all have their idiosyncrasies.
At least in C/(easy) C++, you need to build most of it yourself, which helps understanding.
I'm not completely against Python as a first language, but it needs to be taught well, and that could include working with a very minimal set of functions on every object. Then you can expand and incorporate more and more methods that make life easier.
In the end, my final answer is - yes. I say that because I believe it's the easiest programming language to get something working in. And getting something working is what motivates people to keep going.
If you sit them down and say 'well before you learn python you need to learn how a computer really works, here's an ASM x86 book', they're gonna probably read 10 pages, say this is boring, then go do something else. I think that because I went through that as a kid - I started reading a C++ book with no knowledge and gave up. It wasn't until I found qbasic and VB, by all marks a terrible language, that I really got motivated to learn and keep going because progress was so easy.
Python will teach you the basics - control flow, loops, variables, functions, libraries, etc. Those apply to almost every language. Then when you move to a different language, you at least know the basics and can focus on what's different or added that you didn't have or know before.
People like flashy things, and Python and Javascript are just 10x easier to get that working. Console I/O doesn't really cut it anymore.
Later on you can deal with memory allocation, bit-twiddling, 2's complement arithmetic, lower level OS details etc.
I think a compiled language is a better choice for people just getting started. Java is good, IMO, because it is verbose. Eventually the beginner may get tired of the verbosity and move on to something else, but at least they'll understand the value of explicit types and compile-time errors.
The only untyped language I know, at least modern ones is assembler.
Well and C, if you make everything void*.
I would just encourage you to move on from Python fairly quickly. It's like... a balance bike. Easy to learn and teach you how to balance but you don't want to actually use it to get around.
There is no one-size-fits-all programming language.
AMD is held back by the combination of a lot of things. They have a counterpart to almost everything that exists on the other side. The things on the AMD side are just less mature with worse documentation and not as easily testable on consumer hardware.
I wonder why Python took over the world. Of course, it's easy to learn, and it might be easy to read and understand. But it also has a few downsides: low performance, single threaded, lack of static typing.
If I were a Ruby developer, I'd be using Rails, and I'd also be describing 90% of Ruby development.
However, I do Python. What I'm describing is a tiny fraction of Python development.
If you want to do something with computer code - data analysis, ML, web development, duct-taping together parts of a *NIX system, even some game development - you can do it reasonably well, if not better, in Python. The paths that you can take are limitless, and that gets people interested.
At work right now we're integrating with scoring models hosted in Amazon SageMaker written by a "modelling team" and as far as I can tell they follow absolutely no basic coding practices. They give us the API and are asking us to send English strings of text for names of things instead of any real keys, and they're just comparing against plain strings and magic numbers everywhere so if they're asked to make any change like renaming something it's a herculean task that breaks a bunch of other things. Something will break when a field is null and then they'll tell us instead of sending null if we have no data to send -9999999. One time something broke and it turned out to be because we sent them "MB" (Manitoba) as someone's province, and whoever wrote it was just plain-text checking against a list of province codes as strings and didn't remember to include Manitoba.
I know this is still mainly a business/management issue that they're allowing people who don't know how to code to write code, but I'm sure this is happening at other companies, and I think Python's level of accessibility at the beginner level has been a real blight to software quality.
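For what it's worth, the failure modes in that story (free-text keys, a -9999999 sentinel for missing data, a hand-typed list that forgot Manitoba) are the kind of thing a few lines of Python can guard against. A minimal sketch, assuming two-letter postal codes as the key; the `Province` enum and `parse_province` helper are hypothetical names made up for this example, not anything from the modelling team's actual code.

```python
from enum import Enum
from typing import Optional


class Province(Enum):
    """Canadian provinces and territories, keyed by two-letter postal code."""
    AB = "Alberta"
    BC = "British Columbia"
    MB = "Manitoba"
    NB = "New Brunswick"
    NL = "Newfoundland and Labrador"
    NS = "Nova Scotia"
    NT = "Northwest Territories"
    NU = "Nunavut"
    ON = "Ontario"
    PE = "Prince Edward Island"
    QC = "Quebec"
    SK = "Saskatchewan"
    YT = "Yukon"


def parse_province(code: Optional[str]) -> Optional[Province]:
    """Return the matching Province, None for missing data, or raise on bad input."""
    if code is None:
        return None  # missing data stays None instead of a magic number
    try:
        # Look up by enum member name, ignoring case and surrounding spaces.
        return Province[code.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown province code: {code!r}") from None


print(parse_province("MB"))
print(parse_province(None))
```

Because the set of valid codes lives in one place, a forgotten entry shows up as a loud `ValueError` at the boundary rather than as a silently wrong score further down.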
Not sure what "most popular programming language in the world" even means, in terms of existing projects? In terms of developers who consider it their main language? In terms of existing actually active projects? According to new projects created on GitHub that are also public?
My guess is that it's the last one, which probably isn't what one would expect when hearing "the most popular language in the world", so worth keeping in mind.
But considering that AI/ML is the hype today, and everyone wants to get their piece of the pie, it makes sense that there are more public Python projects created on GitHub today compared to other languages, as most AI/ML is Python.
All the things that are not great about it make it easier to learn. No static typing, no control of memory, no threads.
When I started there was a language like BASIC or Visual BASIC that was easy to learn (or also quick to use) and C or C++ that was performant. If the world now is Python and Rust or Go, I think that it is just a better world for programmers. I say that as someone comfortable with C / C++ / Java. They had their time and will still be with us, but the improvement is real.
My guess: it's the community.
Because data-science/ML/LLM's have taken over the world now and no other language offers best-in-breed libraries and frameworks.
Other languages need to get off their ass and start offering options soon or be relegated to niche domains.
Go is definitely not fun to write. The rest I agree.
https://github.com/CapsAdmin/luajit-llama3/blob/main/compute...
While obviously not complete, it was less than I thought was needed.
It was a bit annoying trying to figure out which version of the function (_v2 suffix) I have to use for which driver I was running.
Also sometimes a bit annoying is the stateful nature of the API. Very similar to OpenGL. Hard to debug at times as to why something refuses to compile.
1. https://docs.modular.com/mojo/stdlib/gpu/
Alongside for the ride, they fostered an ecosystem from compiled language backends targeting CUDA.
Additionally modern CUDA supports standard C++ as well, with frameworks that hide the original extensions.
Most critics don't really get the CUDA ecosystem.
As for the language extensions required by CUDA C, it is kind of interesting that clang and GCC extensions are praised and people keep referring to them as C, while everyone else's extensions are never C or C++ under the same measure.
With OpenACC directives, an HPC industry standard, you can make use of plain old C11 with traditional #pragmas:
https://developer.nvidia.com/openacc
https://docs.nvidia.com/cuda/cuda-driver-api/index.html
Not to mention that C++ does not support neat features like variable sized arrays on the stack.
But then I end up finding myself juggling mutexes and wishing I had some newer language features.
I've noticed a lot of projects add Python support like this. Does the Python codebase allow for it to compile down to different targets easier than others?
Greater Processing Unit
Giant Processing Unit
Galloping Processing Unit
Grape Processing Unit
Gorge Processing Unit
Gaggle Processing Unit
Grand Processing Unit
Giraffe Processing Unit
Gaping Processing Unit
it's only the beginning, there is no need to create new programming languages anymore
There will be new shiny things, but of course, my choice is Python too.
https://nvidia.github.io/cuda-python/cuda-core/latest/ https://developer.nvidia.com/nvmath-python
https://developer.nvidia.com/how-to-cuda-python
https://cupy.dev/
And
"Zero to Hero: Programming Nvidia Hopper Tensor Core with MLIR's NVGPU Dialect" from 2024 EuroLLVM.
https://www.youtube.com/watch?v=V3Q9IjsgXvA
Reverse-engineered Python-only GPU API, works not only with CUDA but also with AMD's ROCm
Other runtimes: https://docs.tinygrad.org/runtime/#runtimes
It's a holistic approach to all levels of the stack, from high-level frameworks to low-level bindings, some of which is highlighting existing libraries, and some of which are completely newly announced.
One of the big things seems to be a brand new Tile IR, at the level of PTX and supported with a driver level JIT compiler, and designed for Python-first semantics via a new cuTile library.
https://x.com/JokerEph/status/1902758983116657112 (without login: https://xcancel.com/JokerEph/status/1902758983116657112 )
Example of proposed syntax: https://pbs.twimg.com/media/GmWqYiXa8AAdrl3?format=jpg&name=...
Really exciting stuff, though with the new IR it further widens the gap that projects like https://github.com/vosen/ZLUDA and AMD's own tooling are trying to bridge. But vendor lock-in isn't something we can complain about when it arises from the vendor continuing to push the boundaries of developer experience.
Even if there is some impedance mismatch, could PTX itself not have been updated?