46 points by forza_user 2 days ago | 18 comments

sandreas 1 hours ago [-]

Well, first of all: I'm not trying to bash BTRFS at all, it probably is just not meant for me. However, I'm trying to find out whether it is really considered stable (like rock solid) or whether it might have been a hardware problem on my system.

I used cryptsetup with BTRFS because I encrypt all of my stuff. One day, the system froze and after reboot the partition was unrecoverably gone (the whole story[1]). Not a real problem because I had a recent backup, but somehow I lost trust in BTRFS that day. Anyone experienced something like that?

Since then I switched to ZFS (on the same hardware) and never had problems, although it was a real pain to set up until I finished my script [2], which still is kind of a collection of dirty hacks :-)

1: https://forum.cgsecurity.org/phpBB3/viewtopic.php?t=13013

2: https://github.com/sandreas/zarch

ghostly_s 15 minutes ago [-]

Yes, my story with btrfs is quite similar: I used it for a couple of years, then it suddenly threw some undocumented error and refused to mount. I asked about it on the dev IRC channel and was told it was apparently a known issue with no solution, have fun rebuilding from backups. There was no suggestion that anyone was interested in documenting this issue, let alone fixing it.

These same people are the only ones in the world suggesting btrfs is "basically" stable. I'll never touch this project again with a ten foot pole, afaic it's run by children. I'll trust adults with my data.

riku_iki 13 minutes ago [-]

> One day, the system froze and after reboot the partition was unrecoverably gone (the whole story[1]).

It looks like you didn't use RAID, so any FS could fail in case of disk corruption.

forza_user 2 days ago [-]

I was surprised by the new attempt at performance profiles/device roles/hints when we already have a very good patch set maintained by kakra.

- https://github.com/kakra/linux/pull/36

- https://wiki.tnonline.net/w/Btrfs/Allocator_Hints

What do you think?

dontdoxxme 9 hours ago [-]

> One of the reasons why these patches are not included in the kernel is that the free space calculations do not work properly.

It seems these patches possibly fix that.

bjoli 8 hours ago [-]

I wonder if I can use a smaller SSD for this and make it avoid HDD wakeups due to some process reading metadata. That alone would make me love this feature.

the8472 7 hours ago [-]

I think you'd rather want a cache device (or some more complicated storage tiering) for that so that both metadata and frequently accessed files get moved to that dynamically based on access patterns. Afaik btrfs doesn't support that. LVM, bcache, device mapper, bcachefs and zfs support that (though zfs would require separate caches for reading and synchronous write). And idk which of these let you control the writeback interval.

viraptor 4 hours ago [-]

Bcache allows lots of writeback configuration, including intervals https://www.kernel.org/doc/html/latest/admin-guide/bcache.ht...
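For illustration, a minimal sketch of how those knobs could be inspected. It assumes a bcache backing device exposed as `bcache0` and the backing-device sysfs attributes `cache_mode`, `writeback_delay` and `writeback_percent` described in the kernel's bcache admin guide; the device name and the helper function are hypothetical, and attributes missing on a given system are simply skipped.

```python
from pathlib import Path

# Attribute names as listed for backing devices in the kernel bcache admin guide.
WRITEBACK_ATTRS = ("cache_mode", "writeback_delay", "writeback_percent")


def read_writeback_settings(device="bcache0", sysfs_root="/sys/block"):
    """Return {attribute: value} for the writeback attributes that exist.

    `device` and `sysfs_root` are assumptions for this sketch; adjust them
    to match the actual system.
    """
    base = Path(sysfs_root) / device / "bcache"
    settings = {}
    for name in WRITEBACK_ATTRS:
        attr = base / name
        if attr.is_file():
            settings[name] = attr.read_text().strip()
    return settings


if __name__ == "__main__":
    # Prints an empty dict on machines without a bcache0 device.
    print(read_writeback_settings())
```

Changing a value would mean writing to the same file as root, which this read-only sketch deliberately leaves out.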

bionade24 7 hours ago [-]

Most likely yes, but the also-envisioned periodic repacking of multiple small data extents into one big one that gets written to the HDD would wake up the HDD. And if you'd make the SSD "metadata only", browser cache and logging will keep the HDD spinning.

This feature is for performance, not the case you described.

ajross 6 hours ago [-]

Just buy more RAM and you get that for free. Really I guess that's my sense of patches like this in general: while sure, filesystem research has a long and storied history and it's a very hard problem in general that attracts some of the smartest people in the field to do genius-tier work...

Does it really matter in the modern world where a vanilla two-socket rack unit has a terabyte of DRAM? Everything at scale happens in RAM these days. Everything. Replicating across datacenters gets you all the reliability you need, with none of the fussing about storage latency and block device I/O strategy.

bayindirh 4 hours ago [-]

Actually, it doesn't work like that.

Sun's ZFS7420 had a terabyte of RAM per controller, and the controllers work in tandem. Beyond a certain load the thing can't keep up, even though it also uses specialized SSDs to reduce HDD array access during requests. And these were blazingly fast boxes for their time.

When you drive a couple thousand physical nodes with a some-petabytes sized volumes, no amount of RAM can save you. This is why Lustre separates metadata servers and volumes from file ones. You can keep very small files in the metadata area (à la Apple's 0-sized, data-in-resource-fork implementation), but for bigger data, you need to have good filesystems. There are no workarounds for this.

If you want to go faster, take a look at Weka and GPUDirect. Again, when you are pumping tons of data to your GPUs to keep them training/inferring, no amount of RAM can keep that data (or sustain the throughput) during that chaotic access for you.

When we talked about performance, we used to say GB/sec. Now a single SSD provides the IOPS and throughput that storage clusters used to provide. Instead, we talk about TB/sec in some cases. You can casually connect terabit Ethernet (or Infiniband if you prefer that) to a server with a couple of cables.

ajross 3 hours ago [-]

> When you drive a couple thousand physical nodes with a some-petabytes sized volumes

You aren't doing that with ZFS or btrfs, though. Datacenter-scale storage solutions (cf. Lustre, which you mention) have long since abandoned traditional filesystem techniques like the one in the linked article. And they rely almost exclusively on RAM behavior for their performance characteristics, not the underlying storage (which usually ends up being something analogous to a pickled transaction log; it's not the format you're expected to manage per-operation).

bayindirh 3 hours ago [-]

> You aren't doing that with ZFS or btrfs, though.

ZFS can, and is actually designed to, handle that kind of workload, though. At full configuration, the ZFS7420 is an 84U configuration. Every disk box has its own set of "log" SSDs and 10 additional HDDs. Plus it was one of the rare systems which supported Infiniband access natively, and was able to saturate all of its Infiniband links under immense load.

Lustre's performance is not RAM bound when driving that kind of load; this is why MDT arrays are smaller and generally full-flash while OSTs can be selected from a mix of technologies. As I said, when driving that number of clients from a relatively small number of servers, it's not possible to keep all the metadata in RAM and query it from there. Yes, Lustre recommends high RAM and core counts for servers driving OSTs, but that's for file content throughput when many clients are requesting files, and we're discussing file metadata access primarily.

ajross 3 hours ago [-]

Again I think we're talking past each other. I'm saying "traditional filesystem-based storage management is not performance-limited at scale where everything is in RAM, so I don't see value to optimizations like that". You seem to be taking as a prior that at scale everything doesn't fit in RAM, so traditional filesystem-based storage management is still needed.

But... everything does fit in RAM at scale. I mean, Cloudflare basically runs a billion dollar business whose product is essentially "We store the internet in RAM in every city". The whole tech world is aflutter right now over a technology base that amounts to "We put the whole of human experience into GPU RAM so we can train our new overlords". It's RAM. Everything is RAM.

I'm not saying there is "no" home for excessively tuned genius-tier filesystem-over-persistent-storage code. I'm just saying that it's not a very big home, that the market has mostly passed the technology over, and that frankly patches like the linked article seem like a waste of effort to me vs. going to Amazon and buying more RAM.

bayindirh 11 minutes ago [-]

No, it doesn't. You think in a very static manner. Yes, you can fit websites in RAM, but you can't fit the databases powering them. Yes, you can fit some part of the videos or images you're working on or serving in RAM, but you can't store whole catalogs in RAM.

Moreover, you again give examples from the end product. Finished sites, compacted JS files, compressed videos, compiled models...

There's much more than that. The model is in RAM, but you need to rake tons of data over that GPU. Sometimes terabytes of data. You have raw images to process, raw video to color-grade, unfiltered scientific data to sift through. These files are huge.

A well processed JPG from my camera is around 5MB, but the RAW version I process is 25MB per frame, and it's a 24MP image, puny by today's standards. Your run-of-the-mill 2K video takes a couple of GBs after final render at movie length. RAWs take tens of terabytes, at minimum. Unfiltered scientific data again comes in the terabytes-to-petabytes range, depending on your project and the instruments you work with, and multiple such groups pull their own big datasets to process in real time.

In my world, nothing fits in RAM except the runtime data, and that's your application plus some intermediate data structures. The rest is read from small to gigantic files and written to files of unknown sizes, by multiple groups, simultaneously. These systems experience the real meaning of "saturation", and they would really swear at us in some cases.

Sorry, but you can't solve this problem by buying more RAM, because these workloads can't be moved to clouds. They need to be local, transparent and fast. IOW, you need disk systems which feel like RAM. Again, look at what Weka (https://www.weka.io/) does. It's one of the most visible companies making systems that behave like one huge RAM, but with multiple machines and tons of cutting edge SSDs, because what they process doesn't fit in RAM.

Lastly, oh, there's a law whose name I forget every time, which tells you that if you cache the 10 most used files, you can serve up to 90% of your requests from that cache, if your request pattern is static. In the cases I cite, there's no "popular" file. Everybody wants their own popular files, which makes access "truly random".
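To make the caching point concrete, here is a rough sketch of the arithmetic. It is an illustration under stated assumptions, not a model of any real system: a Zipf-like popularity curve stands in for a "static" request pattern, a uniform curve stands in for the "no popular file" case, and the function names and the 1000-file, 100-slot numbers are made up for the example.

```python
def top_k_hit_rate(weights, k):
    """Fraction of requests served if the k most requested files are cached."""
    ranked = sorted(weights, reverse=True)
    return sum(ranked[:k]) / sum(ranked)


def zipf_weights(n_files, exponent=1.0):
    """Skewed popularity: the file at rank r is requested proportionally to 1/r^exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_files + 1)]


def uniform_weights(n_files):
    """No popular files: every file is requested equally often."""
    return [1.0] * n_files


if __name__ == "__main__":
    n_files, cache_slots = 1000, 100  # cache holds the top 10% of files
    skewed = top_k_hit_rate(zipf_weights(n_files), cache_slots)
    flat = top_k_hit_rate(uniform_weights(n_files), cache_slots)
    print(f"skewed access:  {skewed:.2f} of requests hit the cache")
    print(f"uniform access: {flat:.2f} of requests hit the cache")
```

With skewed access the cache absorbs most requests; with uniform access the hit rate is just the fraction of files that fit in the cache, which is the situation described above.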

j16sdiz 1 hours ago [-]

These patches came from Oracle. Pretty sure they have a client somewhere that needs this.

guenthert 4 hours ago [-]

Some time ago (back when we were using spinning rust) I was wondering whether one could bypass the latency of disk access by replicating to multiple hosts. I mean, how likely is it that two hosts crash at the same time? Well, it turns out that there are some causes which take out multiple hosts simultaneously (a way too common occurrence seems to be diesel generators which fail to start after a power failure). I think the good fellas at Amazon, Meta and Google even have stories to tell about a whole data center failing. So you need replication across data centers, but then network latency bites ya. Current NVMe storage devices are then faster (and for some access patterns nearly as fast as RAM).

And that's just at the largest scale. I'm pretty sure banks still insist that the data is written to (multiple) disks (aka "stable storage") before completing a transaction.

homebrewer 6 hours ago [-]

> Does it really matter in the modern world

Considering that multiple ZFS developers get paid to make ZFS work well on petabyte-sized disk arrays with SSD caching, and one of them often reports on progress in this area in his podcasts (2.5admins.com and bsdnow if you're interested) .. then yes?
