The compile-time lineage part is the most interesting bit to me. A lot of “data lineage” tools feel like archaeology after the fact: parse logs, reconstruct what probably happened, then hope it matches reality.
Having the compiler know “this column flows into these downstream models” before execution changes the workflow quite a bit. It makes refactors and masking policies much less scary.
Do you expose any kind of “lineage diff” between branches? For example: this PR changes the downstream impact of `customer.email` from A/B/C to A/B/D. That would be useful in code review.
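To make the ask concrete, here's a sketch in Rust of what I mean by a lineage diff, with made-up types (none of this is Rocky's actual API): per source column, compare the downstream sets on the two branches and report what was added and removed.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical: downstream impact per column, keyed by "model.column".
type Impact = BTreeMap<String, BTreeSet<String>>;

// For each column, report (added, removed) downstream models between branches.
fn lineage_diff(main: &Impact, branch: &Impact) -> BTreeMap<String, (Vec<String>, Vec<String>)> {
    let mut out = BTreeMap::new();
    let empty = BTreeSet::new();
    for col in main.keys().chain(branch.keys()) {
        let a = main.get(col).unwrap_or(&empty);
        let b = branch.get(col).unwrap_or(&empty);
        let added: Vec<String> = b.difference(a).cloned().collect();
        let removed: Vec<String> = a.difference(b).cloned().collect();
        if !added.is_empty() || !removed.is_empty() {
            out.insert(col.clone(), (added, removed));
        }
    }
    out
}

fn main() {
    let mut main_lineage = Impact::new();
    main_lineage.insert(
        "customer.email".to_string(),
        ["A", "B", "C"].iter().map(|s| s.to_string()).collect(),
    );
    let mut branch_lineage = Impact::new();
    branch_lineage.insert(
        "customer.email".to_string(),
        ["A", "B", "D"].iter().map(|s| s.to_string()).collect(),
    );
    let diff = lineage_diff(&main_lineage, &branch_lineage);
    // The PR review view would show: customer.email now reaches D, no longer C.
    assert_eq!(
        diff["customer.email"],
        (vec!["D".to_string()], vec!["C".to_string()])
    );
}
```

A summary like that per changed column in the PR description would be the payoff.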
data_ders 12 minutes ago [-]
hiya, anders from dbt here. cool project -- I especially love the branching and budgeting options you've built in. both are things that I'd love for the dbt standard to include one day. was it dbt's lack of those features that inspired you to start this project? It also seems you have an aversion to Jinja, which, believe me, I get!
FYI dbt-fusion [1] is going GA next week (though GA for Databricks will come later). Most of it is source-available and ELv2-licensed, but there are a number of crates that are Apache 2.0, namely: dbt-xdbc, dbt-adapter, dbt-auth, dbt-jinja, dbt-agate. We also have plans to OSS more as time goes on (stay tuned).
I just wanted to call out the OSS crates in case you'd rather focus on "making your beer taste better" than have to re-build foundations. I'd love to hear if any of those crates come in handy for you (even more so if they don't work for you).
Feel free to reach out on LinkedIn or dbt community Slack if you ever want to chat more!
If your introduction message already includes a bunch of uncurated claims and LLM smells, then what does that say about the code I'm about to run?
hugocorreia90 3 hours ago [-]
Yeah, fair pushback, and yes the intro was AI-assisted. Marketing is not my strength, nor am I a native English speaker. I built this in about a month with heavy LLM tooling, and the seed comment is part of that. I'm not going to pretend otherwise.
The code is what it is. `cargo test --workspace` runs across 19 crates. CI on 5 platforms (macOS ARM/Intel, Linux x86/ARM, Windows). JSON output schemas are codegen-checked in CI so docs can't drift from the binary.
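For concreteness, the codegen check boils down to something like this (a minimal sketch with a hypothetical path and function name, not the actual CI code): regenerate the schema from the binary's types and fail if it differs from the file the docs are built from.

```rust
use std::fs;

// Hypothetical drift check: compare the schema the binary emits right now
// against the schema file committed in the repo. CI fails on any mismatch,
// so the docs can't silently drift from the binary's JSON output.
fn schema_in_sync(generated: &str, committed_path: &str) -> bool {
    fs::read_to_string(committed_path)
        .map(|committed| committed.trim() == generated.trim())
        .unwrap_or(false) // missing file counts as drift
}

fn main() {
    let path = std::env::temp_dir().join("rocky_schema_demo.json");
    fs::write(&path, "{\"run\":{\"bytes_scanned\":0}}").unwrap();
    assert!(schema_in_sync(
        "{\"run\":{\"bytes_scanned\":0}}\n",
        path.to_str().unwrap()
    ));
}
```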
If you want to skip the marketing copy and look at engine reasoning instead: PR #240 (audit trail), #241 (column classification + masking), #270 (failed-source surfacing in discover).
I'd rather hear "the code is bad" than "the post sounds AI-written".
DoctorOW 29 minutes ago [-]
> I'd rather hear "the code is bad" than "the post sounds AI-written".
Of course you would. Reading through and judging the quality of AI output is where the bulk of the effort now lies, in a world where you can get everything else by prompting. Please internalize this: if you want to be respected, you will have to put in effort yourself. There is no way around this.
austinthetaco 56 minutes ago [-]
This comment itself is likely written by AI by the sounds of it. It may be worth your time writing it out in your own words in your native language and then finding a competent translation tool to translate your words.
FrustratedMonky 1 hour ago [-]
Not sure why you are downvoted here.
A lot of side projects, hobby projects, etc. are all using AI tools now. Same for marketing: every sales/marketing firm is using AI. So why criticize this guy in particular?
AI is pervasive, the train has left the station. So that is not a reason to criticize this project. There might be other reasons, I'm not sure, but not that an AI was used.
ModernMech 49 minutes ago [-]
Because "Yeah, fair pushback" is AI smell. Either everything this person does is passed through an AI from code to blogs to even their HN comments and submissions; or they use AI so much they're starting to talk like it colloquially. Either way no one has time for that.
FrustratedMonky 44 minutes ago [-]
"Yeah, fair pushback"
Really hard to tell. Because that used to be a common phrase that real people would use.
So now I have to change my own language in order to not appear like I'm an AI? We are getting into a weird place where humans have to act and sound increasingly 'odd', to appear not 'perfect' like an AI.
ModernMech 32 minutes ago [-]
It's really not hard to tell. It's the "How do you do fellow kids" of AI-isms. The presence of "fair pushback" and a single em dash reads as 99% AI generated as far as I am concerned.
Yes, if you don't want to sound like you're cargo culting AI, you do have to change the way you talk because people aren't going to care otherwise. At the very least just because it's boring. That's always been the nature of slang and lingo.
cmrdporcupine 56 minutes ago [-]
It's really a weird world now.
I do think the author is doing a disservice to themselves by writing the post and comments using LLM, even if the code is mostly agent built. People can tell right away, all the LLM shibboleths are there... it feels cheap. Just write naturally and then Google translate, don't let the LLM speak on your behalf.
What's going to distinguish projects that are built this way is the ability to explain, document, support, and maintain said projects over the long term. That will be the crucible. Gone are the days of "build it and they will come", and I feel a bit sad about that.
It's so easy to let the code grow under you beyond what you have the capacity to do the above for.
I've got the same thing going on. Eschewing paid work and grinding 16, 17 hours a day boiling the sea to build the whole universe from scratch (also a database, but of a different sort than this project) integrating all my favourite DB research papers and ideas that I've accumulated over the last 30 years. Outperforms postgres 2-4x or more, has a battery of correctness tests, Lean proofs, benchmarks, etc. etc.
But frankly I'd be nervous to share. Especially here. I don't even know where it ends up. Not least because if I'm doing it, so are 50 other people, probably.
hugocorreia90 8 minutes ago [-]
I totally acknowledge that. The only reason for passing my replies through AI was just because it's my first time posting here and opening a side-project of mine publicly.
All the engine architecture decisions are mine though, and this project came about to solve a real problem I currently have at work with a zero-touch data pipeline leveraging FiveTran, Dagster, dbt and Databricks. This is a data pipeline that serves multiple agencies and data producers who work with data from more than 300 clients and multiple connectors.
Rocky was essentially built out of all the time spent awake at night thinking about these problems and how they could be addressed differently, given that dbt doesn't suit this particular use case well.
I decided to open Rocky to public for free because of two simple reasons:
1st is that it might help others, and I fulfill my ego of having built something other people like and use.
2nd is that I'm the solo maintainer, and a project can only get proper traction if more people contribute to it.
PeterWhittaker 53 minutes ago [-]
Congrats on the work, but have you considered another name? Naming is hard and always will be: When I first scanned the headline, my initial thought was "that's an interesting area for the Rocky Linux team to explore". After a moment, "wait, no, that's confusing, it's some other Rocky".
hugocorreia90 27 minutes ago [-]
Thanks Peter. All my side-projects are named after my pets. I had a dog named Rocky, and given this project is also an underdog competing with well-established tools such as dbt and sqlmesh, I decided to keep Rocky when opening it to the public. But I'm happy to get suggestions for a better name for this tool :)
PeterWhittaker 6 minutes ago [-]
I love that! I am inspired to create Terry, Tizzie, Topé, Bubba, and Roxy (the three Ts are in my office right now), the last two are no longer with us but for the hole in my heart.
I have no idea what these projects would be, but based on personalities, Roxy would chew through CPU and memory like a beaver (she loved turning large branches into small chunks), Bubba would inspire calm and peacefulness but walk into things (he was one-eyed and a little clumsy), Terry would stick like glue (an eBPF program, maybe?), Tizzie would work well most of the time then destroy your stuff (an AI agent?), and Topé would always be there, but never quite willing to participate (a bad Windows driver?).
I don't know the area well enough to suggest an alternate name, but maybe Wiley, which is an indirect reference to Dag from Barnyard via Wile E. Coyote?
mollerhoj 4 hours ago [-]
It's a bit confusing to claim that "The things your current stack can't give you because it doesn't own the DAG" and use DataBricks as your example: DataBricks includes jobs and pipelines, so it very much owns the DAG, no?
hugocorreia90 3 hours ago [-]
Fair point. Databricks owns a scheduling DAG (Workflows, DLT). What I meant by "owns the DAG" is the semantic DAG: model-to-model dependencies with column-level types that the compiler builds.
Workflows knows task A runs before task B. Rocky knows `dim_customer.email` flows from `raw_users.email_address` through three CTEs in `stg_customers`. Different layer, same word.
I'll be more careful with that framing.
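A toy illustration of the distinction (hypothetical edge data and function, not Rocky's internal representation): a scheduling DAG stores task ordering, while the semantic DAG lets you walk a column back to its ultimate source.

```rust
use std::collections::HashMap;

// Walk a column back through its direct upstream edges to the raw source.
// Each entry maps a column to the single column it is derived from
// (made-up shape for illustration; real lineage can fan out per column).
fn upstream_chain<'a>(edges: &HashMap<&'a str, &'a str>, mut col: &'a str) -> Vec<String> {
    let mut chain = vec![col.to_string()];
    while let Some(&up) = edges.get(col) {
        chain.push(up.to_string());
        col = up;
    }
    chain
}

fn main() {
    let edges = HashMap::from([
        ("dim_customer.email", "stg_customers.email"),
        ("stg_customers.email", "raw_users.email_address"),
    ]);
    let chain = upstream_chain(&edges, "dim_customer.email");
    // The compiler can answer "where does this column come from?" before any run.
    assert_eq!(
        chain.last().map(String::as_str),
        Some("raw_users.email_address")
    );
}
```

A workflow scheduler has no equivalent query: it only knows task edges, not column edges.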
hasyimibhar 5 hours ago [-]
Looks cool, I've been waiting for someone to build this since the dbt and SQLMesh acquisitions. It would be great to have model versioning and support for ClickHouse SQL.
hugocorreia90 3 hours ago [-]
Thanks. On model versioning — what's the use case you have in mind? A few options that map to different designs:
- dbt-style semantic-layer versions (v1/v2 of a model)
- schema migration history
- branch-based (Rocky already has branches + replay)
Different design choice for each, so it helps to know which problem you're trying to solve.
ClickHouse is tractable through the Adapter SDK without engine patching. If you can share roughly your model count and workload shape, I can put a real timeline on it. Open to community PRs too.
kjuulh 2 hours ago [-]
fyi, llm written comments are discouraged on hackernews.
Not saying yours are, but them -- dashes certainly look like it ;)
hugocorreia90 2 hours ago [-]
Fair. I just use it for tidying up my replies as I'm not a native English speaker.
mergisi 5 hours ago [-]
* * *
hugocorreia90 3 hours ago [-]
Thanks for the careful read. The "what breaks if I rename this column" question is exactly what column lineage from the compiler is meant to answer, and you said it better than I did in the post.
On the schema-grounded AI angle: agreed. The failure mode you describe — structurally valid SQL that joins on the wrong key or aggregates at the wrong grain because the model hallucinated a relationship — is exactly what the compiler is positioned to catch. AI-generated SQL runs through the type checker before it can land, so suggestions that don't validate against the actual DAG never reach the user. NL-to-SQL tools that integrate a compile step would close exactly the gap you're pointing at.
On your two questions:
1. Branch isolation for stateful models — mixed answer, and worth being honest about:
- Incremental: isolated. The watermark `state_key` includes the resolved schema, and `rocky branch create` swaps the schema prefix. So a branch run reads/writes a different redb key than main and they don't advance each other.
- Snapshot: not yet. Today `rocky branch create` only writes a branch record; it doesn't copy warehouse tables. A snapshot model on a branch starts with an empty table (CREATE TABLE IF NOT EXISTS in the branch schema) and accumulates from the first branch run, with no inherited history from main. That's the gap. The fix is the next wave: native Delta SHALLOW CLONE / Snowflake zero-copy at branch creation, which gives point-in-time snapshot semantics without copy-on-write overhead.
2. Cost attribution. Both bytes scanned and duration are captured per-model in the run record (`bytes_scanned` and duration on `RunRecord`). Budget gating today is on cost (USD) and duration — `max_usd` and `max_duration_ms` in `[budget]` blocks in `rocky.toml`, as independent thresholds. A direct bytes-scanned budget threshold isn't gateable today; the bytes are in the run record for analysis but you can't currently fail CI on "this run scanned more than N TB". Reasonable extension if there's demand.
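To picture the isolation in point 1 (illustrative only; the real key layout is not what's shown here): because the resolved schema is part of the watermark key, a branch run simply addresses different state than main.

```rust
// Illustrative sketch: compose a watermark state key from the resolved schema
// and model name. `rocky branch create` swaps the schema prefix, so branch
// runs read/write a different key and never advance main's watermark.
fn state_key(schema: &str, model: &str) -> String {
    format!("watermark/{schema}/{model}")
}

fn main() {
    let main_key = state_key("analytics", "dim_customer");
    let branch_key = state_key("feature_x_analytics", "dim_customer");
    assert_ne!(main_key, branch_key); // isolated incremental state per branch
}
```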
To your Snowflake point: the warehouse-size × duration credit model and the scan volume tell genuinely different stories, so they're tracked separately rather than rolled into a single number.
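For reference, a hedged sketch of what the budget gating described above looks like, using the `max_usd` and `max_duration_ms` thresholds from `[budget]` blocks as stated (exact `rocky.toml` syntax may differ from this):

```toml
[budget]
max_usd = 25.0           # fail the run if estimated cost exceeds $25
max_duration_ms = 600000 # fail if total runtime exceeds 10 minutes
# bytes_scanned is recorded per-model on RunRecord for analysis,
# but is not a gateable threshold today
```

The two thresholds are independent: either one tripping fails the run.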
[1]: https://github.com/dbt-labs/dbt-fusion