Search logs:

channel logs for 2004 - 2010 are archived at http://tunes.org/~nef/logs/old/ (can't be searched)

#osdev2 = #osdev @ Libera from 23may2021 to present

#osdev @ OPN/FreeNode from 3apr2001 to 23may2021

all other channels are on OPN/FreeNode from 2004 to present


http://bespin.org/~qz/search/?view=1&c=osdev2&y=22&m=5&d=17

Tuesday, 17 May 2022

00:00:00 <clever> kingoffrance: ive also seen bugs, where the length of the shebang must fit within a certain number of bytes
00:00:00 <heat> that's not a bug
00:00:00 <heat> linux does that for instance
00:00:00 <clever> has it changed recently?
00:00:00 <klange> it's a security feature™
00:00:00 <heat> no
00:01:00 <heat> always has been like that
00:01:00 <clever> i remember something about it changing
00:01:00 <heat> the limit has changed over time, but that was like 10 years ago
00:01:00 <klange> https://www.in-ulm.de/~mascheck/various/shebang/ has some collected details on different platforms
00:01:00 <bslsk05> ​www.in-ulm.de: The #! magic, details about the shebang/hash-bang mechanism
00:02:00 <clever> https://github.com/NixOS/nixpkgs/pull/55786
00:02:00 <bslsk05> ​github.com: improve perl shebang lines by switching to `use lib ...;` by cleverca22 · Pull Request #55786 · NixOS/nixpkgs · GitHub
00:02:00 <clever> apparently, the shebang in these perl scripts was reaching 30kb
00:02:00 <clever> and it worked for a while, and then broke
00:03:00 <clever> i cant remember what changed to make it stop working
00:05:00 <heat> apparently it's 256 now, used to be 128
00:05:00 <heat> mine is 100
00:06:00 <heat> https://github.com/heatd/Onyx/blob/master/kernel/kernel/binfmt/shebang.cpp <-- horrible, horrible code that I wrote while looking at linux's shebang code
00:06:00 <bslsk05> ​github.com: Onyx/shebang.cpp at master · heatd/Onyx · GitHub
00:06:00 <heat> it's pretty similar but a bit more readable IMO
00:08:00 <clever> heat: from what i can remember, i think the error might have been that linux started enforcing that it found a \n within the first N bytes?
00:08:00 <clever> when previously, it would truncate, and execute whatever was left
00:09:00 <heat> maybe
00:09:00 <heat> 30kb shebangs were never a thing
00:09:00 <clever> for nixpkgs, the perl include path was in the shebang
00:09:00 <clever> which is why it turned out to be 30kb
00:25:00 <heat> clever, did perl just inject arguments when reading the shebang then?
00:37:00 <clever> heat: something like that
00:40:00 <Clockface> does anyone here have a way to check if DOS is loaded and functioning
00:40:00 <Clockface> the program initially does not know if it was booted by the BIOS or loaded as a .COM file by DOS
00:43:00 <clever> Clockface: i think bios only loads the first sector, to 7c00 i think it was, while dos loads the entire .com file
00:43:00 <clever> so you could check the addr and if a magic# is present in the 2nd sector of the binary
00:46:00 <heat> try a dos interrupt and see if it works
00:47:00 <klange> clever: true for an MBR, not true for the channel's favorite meme: EL TORITO!
00:48:00 <geist> EL TORITO!
00:51:00 <heat> el tor
00:51:00 <heat> ito
00:51:00 <clever> klange: ah, i wasnt thinking about a bios with cdrom support
00:54:00 <heat> the best way to know you weren't booted by the BIOS is to get booted by UEFI
00:54:00 <heat> thank me later
00:55:00 <wxwisiasdf> ah well just threw autotools outta the window and now i am using make :D
00:55:00 <geist> huh the #! lore page was interesting
00:57:00 <heat> wxwisiasdf, tip: don't actually switch until you're sure that's what you want
00:57:00 <heat> rewriting build files is horrible
00:57:00 <heat> since I don't know what build system i'm actually going for, i'm keeping the current one for now
00:57:00 <heat> maybe i'll rewrite it bit by bit
00:57:00 <Clockface> i like the magic number
00:57:00 <Clockface> ill do that
00:58:00 <heat> but it doesn't work
00:58:00 <wxwisiasdf> no i already did it
00:58:00 <heat> that fast?
00:58:00 <wxwisiasdf> i just did it in one shot and it broke everything and my os now has like 2 ubsan bugs and stuff
00:58:00 <wxwisiasdf> but hey i am not using make
00:59:00 <wxwisiasdf> now*
01:01:00 <wxwisiasdf> fortunately i threw like a thousand assertions everywhere so hopefully i can get it back to working
01:02:00 <geist> well, thats still a bug for you to fix
01:02:00 <wxwisiasdf> yes :)
01:02:00 <klange> reason #532 to not use autoconf, it's too automagical and you have no idea what it's actually doing, so you have no control over your compiler
01:02:00 <geist> probably something to do with the order things got linked maybe (assuming the compile switches are the same)
01:02:00 <wxwisiasdf> worst part is that i can't use gdb because s390 is kinda sus when it comes to proper emulators
01:04:00 <wxwisiasdf> geist: it can be anything really - i've been told the z/arch compiler is pretty buggy
01:05:00 <geist> klange: oh you might be interested in this, someone submitted a patch to LK that does something i thought about but never actually tried. hypothetically it would maybe work on x86 too
01:05:00 <wxwisiasdf> i've already crashed the xtensa ld once :^)
01:05:00 <geist> basically when enabling the mmu and branching to the high kernel address
01:05:00 <geist> instead of having to have a unity mapped low mmu thing
01:06:00 <geist> set the VBAR (in the case of arm64) or the IDT to the high address, enable the mmu and then wait for the cpu to fault
01:06:00 <geist> then catch it
01:06:00 <klange> that sounds evil
01:06:00 <geist> i thought about it on arm64 before, but figured it'd be too risky to try
01:06:00 <geist> but they tried it and say it works fine on real hardware and emulator
01:07:00 <geist> x86 maybe would work too, though would have to think about it
01:07:00 <geist> it's probably technically UB though
01:07:00 <clever> what about prefetch and a couple opcodes having come from the phys space?
01:07:00 <heat> that's horrific and brilliant
01:07:00 <clever> maybe a `b .` to trap it in the physical domain, until it faults?
01:07:00 <geist> yah
01:08:00 <geist> https://github.com/littlekernel/lk/pull/327 for inspiration
01:08:00 <bslsk05> ​github.com: [arch][arm64] replace the trampoline translation table with a trampol… by pcc · Pull Request #327 · littlekernel/lk · GitHub
01:08:00 <geist> could also have a br to the virtual address just to be safe i guess
01:13:00 <geist> so you could i guess do the same thing on x86 by arranging for the IDT to point such that it's #PF handler points to the next instruction in VA space
01:13:00 <geist> would have to point at a IDT entry that points to the next thing i guess
01:13:00 <geist> but wouldn't have to reserve a huge table, just enough to get to the 14th
01:13:00 <clever> yeah, this basically deletes the need for an identity mapping
01:14:00 <geist> right
01:14:00 <clever> and on the spectre/meltdown front, can the IDT change the paging tables upon fault?
01:14:00 <geist> i'm sure it's not the first time anyone has thought of it, but it does kinda simplify things if its safe to use
01:14:00 <geist> on x86-64 though you'd be basically simultaneously turning on the mmu, entering long mode, and faulting
01:14:00 <geist> so that's a real push
01:15:00 <clever> oh, nice
01:15:00 <clever> thats exactly what i originally joined #osdev for, lol
01:15:00 <clever> and i was cheating, by having qemu pre-create the paging tables for me
01:16:00 <clever> so with this trick, my asm just has to set the IDT addr, and then turn on mmu+long, and fault into the real _start
01:16:00 <clever> but that project has long been retired
01:16:00 <geist> yah though also. you probably want a temporary IDT etc
01:16:00 <heat> how did that qemu page table thing work?
01:16:00 <geist> in this case this PR above basically burns 1K of text for a temporary arm64 exception table which is grody
01:16:00 <clever> heat: i was modifying qemu, so it could run xen unikernels, with the xen hypercall api
01:17:00 <geist> but can probably arrange for it to point at something offset such that the VBAR's Nth entry goes where you want
01:17:00 <clever> heat: so i just populated the guest ram with a paging table, before the cpu came out of reset, and had a custom bios blob
01:17:00 <geist> but there are also alignment constraints for the arm vbar, etc
01:17:00 <clever> the idea being to spend as little time in real-mode as possible
01:49:00 <gamozo> Spending as little time in real-mode is probably one of my favorite hobbies
03:34:00 <klys> yipe a new unikernel
03:41:00 <energizer> did anything end up happening with unikernel linux? i thought that was a good idea
03:42:00 <energizer> this one https://www.bu.edu/rhcollab/files/2019/04/unikernel.pdf
04:42:00 <heat> how are you supposed to pick the number of queues and queue depth for an nvme device?
04:43:00 <heat> like what's the heuristic
04:43:00 <geist> number of queues may be more based on number of cpus than anything else
04:43:00 <geist> since i think it's common to at least run some number in parallel, per cpu, up to some point
04:45:00 <heat> and the queue depth?
04:45:00 <geist> i dunno that's a good question
04:45:00 <geist> like a lot of these things, probably just a good guess, with some ability for the sysadmin to adjust it possibly
04:45:00 <heat> do you just allocate a page by default? do you allocate the whole thing (the whole thing may be too much, 4MB for a single queue)
04:46:00 <geist> probably something more reasonable, like say 256 or 512 or so entries
04:46:00 <geist> or a single page yeah
04:46:00 <heat> when I'm done with mine I should rework fuchsia's driver
04:46:00 <heat> it's very limited
04:47:00 <geist> funny you say that, yes yes it is
04:47:00 <geist> and i think we need someone to work on it
04:47:00 <heat> single queue, single page for each queue
04:47:00 <geist> i literally have a machine coming in on fedex tomorrow because it is known to not work with our nvme driver
04:47:00 <geist> i said i'd take a look at it
04:48:00 <heat> ooh
04:48:00 <heat> do you have logs?
04:49:00 <heat> the driver looks fine from the spec's POV
04:50:00 <heat> the only totally wrong thing I found is that the timeouts may technically be too short, but 5s should still be enough
04:50:00 <geist> it fails some transaction early on and then falls over
04:50:00 <geist> i dont have them handy
04:50:00 <geist> it gets an unhandled error from the device i think
04:50:00 <geist> the device is some sort of cheapo hynix thing i think
04:50:00 <gamozo> Weird, it works with Linux or some other environment?
04:58:00 <heat> well if you need help you know where to find me on the interwebz
04:59:00 <geist> heat: yah it's just on my work computer which is there and not here
05:01:00 <heat> it is indeed not in two locations at once
05:01:00 <wxwisiasdf> how do i tell ubsan that a NULLPTR i am writing to is okay?
05:01:00 <heat> you don't, you can't do that
05:01:00 <heat> you remap the page and try to write there
05:01:00 <wxwisiasdf> rip
05:02:00 <heat> you could technically try to fool the compiler but the compiler is smart
05:02:00 <heat> so maybe do it in assembly I guess
05:02:00 <wxwisiasdf> oh of course gcc is likely x1000 super smarter - i guess having an asm glue won't hurt too much
05:16:00 <No_File> Good Morning!
05:17:00 <geist> okay!
05:17:00 <geist> morning!
06:07:00 <mrvn> Never expect hardware to actually follow the specs
06:15:00 <gamozo> Morning @No_File!
06:17:00 <Mutabah> <<No such user `No_File`>>
06:18:00 <gamozo> RIP
06:18:00 <gamozo> Too used to discord at this point I guess
06:18:00 <gamozo> Haven't been on IRC in years
06:22:00 <geist> okay, stuffed in an old first gen ryzen in the server
06:22:00 <geist> see if it is stable now
06:22:00 <geist> if it is, that doesn't mean much, because this cpu draws less power
06:22:00 <geist> so its possible it'll not stress out the VREGs as hard
06:22:00 <geist> and thus is stable
07:03:00 <sikkiladho> How can one implement PSCI_CPU_ON at hypervisor for secondary cpus, code in Trusted Firmware-A is lot complex to replicate. Any examples and docs would be great.
07:06:00 <clever> sikkiladho: to start with, you need to gain control of the other cores, via whatever mechanism the platform supports, dont even bother looking at PSCI until you have your code running on all 4 cores
07:07:00 <clever> all in hypervisor mode, with the mmu configured the same way
07:08:00 <clever> if you choose to run under the ATF, then you send it a normal PSCI, if you choose to run with the official arm stub then you poke the spintables and sev
07:08:00 <sikkiladho> what if I've just booted up and other cores are in reset(or any platform-specific mode for RPi4)? Can I implement PSCI at hyp level to bring up secondary cores?
07:09:00 <clever> the job of PSCI is to convert the platform specific stuff into a standard api
07:09:00 <sikkiladho> So it's possible with spin-tables and not PSCI. I think they're different?
07:09:00 <clever> in the case of the rpi4, coming out of reset, all 4 cores just execute whatever is at PC=0
07:09:00 <clever> and you have no way to wake a core up after it has died
07:10:00 <clever> for the pi4, the job of ATF or a hypervisor, is to ensure a core never actually dies, and just sits in an idle loop, waiting for an inter-core message
07:10:00 <clever> when using the official arm stub, 3 of the cores will park themselves, and wait for an addr in the spintables
07:11:00 <sikkiladho> and with ATF-A?
07:11:00 <clever> ATF will gain control of the cores (probably by living at addr 0) on startup, and then it will park 3 of them in its own idle loop
07:12:00 <clever> and wait for a message from itself (sent by core0, in reaction to a PSCI cmd)
07:13:00 <clever> so when your hypervisor on core0 sends a PSCI command to wake core1, that just acts as a function call into ATF, forcing a switch into EL3
07:13:00 <clever> ATF then sends an IPI interrupt to core1, to wake the ATF thread on core1
07:13:00 <clever> core1 then reads the message, and executes your code in EL2 on core1
07:17:00 <clever> and you need to do the same when implementing a hypervisor
07:17:00 <sikkiladho> Thank you, I got it. Secondary cores are in control of ATF so the SMC must be forwarded to EL3(ATF).
07:18:00 <clever> but you cant just blindly forward the SMC
07:18:00 <clever> you must first gain control of those cores in hypervisor mode
07:18:00 <clever> and then setup the guest, the same way you did on core0
07:19:00 <sikkiladho> Yeah, I would trap the smc and replace the entry-point addres with my own, so that core1 jumps to my address.
07:20:00 <sikkiladho> and preserve the one sent by linux ofcourse.
07:21:00 <sikkiladho> So I should gain control of the CPUS before loading the linux.
07:22:00 <geist> i think in general it's assumed that if you're building a hypervisor, it's a full SMP system
07:23:00 <geist> so basically the first thing the hypervisor needs to do is bring up the secondary cores and make them part of the hypervisor itself
07:23:00 <geist> you say ATF is hard to replicate, well a hypervisor is much more sophisticated
07:24:00 <geist> since usually they're more or less a full kernel
07:24:00 <geist> so really, i ask, what are you trying to do here?
07:25:00 <sikkiladho> @geist thank you. I will try to get control of secondary cpus, before setting up the guest.
07:26:00 <sikkiladho> I think ATF was hard to replicate because it's for multiple platforms and my hypervisor right now is very simple , but I don't think I have to replicate it in this case. thank you.
07:27:00 <sikkiladho> @geist I'm building a simple hobby hypervisor for rpi4 which just loads a single linux kernel and sits underneath. At first, that's it.
07:27:00 <geist> ah
07:28:00 <geist> well in that case you'll have to be prepared for hypervisor traps from each of the cpus, so though you may not be implementing a complex hypervisor you'll probably need to implement some amount of locking or whatnot internally
07:28:00 <geist> so in that respect you'll have to handle effectively a SMP hypervisor, even if it's very simple
07:30:00 <clever> personally, i would just use LK as a base
07:30:00 <clever> modify the mmu code to support running in EL2 instead of EL1
07:30:00 <clever> and then use a core-pinned thread for each guest core
07:31:00 <clever> whenever the LK scheduler thinks it can, it will run that thread, which will then drop down to EL1 and run the guest
07:31:00 <clever> and when the guest throws an exception/smc, control returns back to that thread in EL2/lk
07:31:00 <clever> if you want a second guest, just spin up more threads, and let the LK scheduler deal with it
07:32:00 <clever> pre-empting a guest? ensure timers can force a switch back to EL2!
07:32:00 <clever> geist: does that all seem sound?
07:33:00 <clever> hardest part i can see, is just having an LK thread "resume" after it dropped to EL1, like the drop had simply returned
07:34:00 <geist> yah i think that'd be pretty doable
07:36:00 <clever> it also loosely reminds me of the linux kvm api
07:37:00 <clever> where you just have a "run the guest" ioctl
07:37:00 <clever> and when anything goes wrong and the kernel cant deal with it (hypercalls, faults), the ioctl returns, and your code is left to deal with it
07:38:00 <geist> yah i have always thought that'd be a fun project
07:39:00 <geist> just build a pure type 1 hypervisor and run other stuff in it
07:39:00 <clever> you could similarly implement ATF, by just modifying LK to run in EL3
07:39:00 <geist> that's my way to assert dominance here: run everyone's hobby OS under mine
07:39:00 <clever> but i think EL3 is mmu-less?
07:39:00 <geist> no it has its own, it just doesn't nest
07:39:00 <clever> ah
07:40:00 <clever> so you would just have to modify the mmu code to support running under EL3/EL2/EL1, and to not drop to EL1 immediately
07:40:00 <clever> and then compile-time configure what EL you want it to drop to and run under
07:40:00 <geist> yah the hard part is there is a bunch of code that accesses _EL1 fairly explicitly, so would have to at least macroize that stuff
07:40:00 <geist> right
07:40:00 <clever> and then normal thread/app stuff can deal with running guests at lower levels
07:40:00 <clever> and setting up secure vs non-secure guests
07:41:00 <geist> also IIRC EL3 and EL2 MMUs are funny: they only map the bottom part (ie, one of the two TTBRs) *Except* if you have a core that supports the EL2 extensions
07:41:00 <geist> so implicitly if you're EL3 or EL2 only on v8.0 you're limited to bottom half mmu
07:41:00 <clever> so you would have to change the kernel base
07:41:00 <geist> right
07:42:00 <clever> and change it into using TTBR0 for the kernel
07:42:00 <geist> right
07:42:00 <clever> i think lk always uses TTBR1
07:42:00 <clever> because it assumes its in the high half, and leaves 0 free for a userland
07:43:00 <geist> right
07:44:00 <clever> i think access to _EL1 regs will also work from EL2/EL3?
07:44:00 <clever> because the hypervisor/tf may want to modify EL1 state
07:44:00 <clever> so you cant rely on faults to tell you when you're using the wrong regs
07:44:00 <mrvn> One more reason to run a lower half kernel / higher half user :)
07:45:00 <clever> and would have to audit the output asm
07:45:00 <mrvn> Where user in this case would be the linux kernel
07:45:00 <geist> sure
07:45:00 <clever> mrvn: oh, random thought, a high half userland, means that null pointers are "safer", even with a +3gig offset, lol
07:45:00 <geist> clever: that's right (re: _EL1 access)
07:46:00 <clever> you would need a massive positive offset, for it to clear over the kernel, and hit userland
07:47:00 <geist> but yeah user space being at the bottom is a pretty standard scheme now
07:47:00 <geist> usual reasons
07:47:00 <geist> and then some arches codify it
07:48:00 <geist> but not the general modern ones
07:49:00 <clever> and thinking about it a bit, more from a malicious angle
07:49:00 <clever> if i wanted the hypervisor to hide itself from a linux guest
07:49:00 <mrvn> AArch64 seems to codify: hypervisor = lower, kernel = higher, user = lower
07:50:00 <clever> i would need to block access to a region of memory where the hypervisor lives, and maybe mess with dma controller commands, to stop you from using dma to peek behind the hypervisor mmu
07:50:00 <mrvn> clever: you can just swap address spaces when enteriong the hypervisor and put it anywhere
07:50:00 <mrvn> short of that little change-address-spaces stub
07:50:00 <mrvn> I think you pretty much have to do it that way on 32bit.
07:50:00 <clever> using nested paging tables, i should be able to ban linux from reading a 1mb chunk of ram
07:51:00 <clever> but i could map that to some other address, to make it less obvious
07:51:00 <mrvn> no nested tables in hpyervisor mode
07:51:00 <clever> isnt that the whole point of hypervisor mode, so you can run the kernel under a second set of tables?
07:51:00 <mrvn> ahh, sorry, yes, linux would be nested
07:51:00 <clever> EL2 sets up the nested tables, EL1 sets up its own tables, and now all translations go thru both EL1 and EL2's tables
07:52:00 <clever> and EL2 can use that to hide the hypervisor from linux
07:52:00 <mrvn> nod
07:52:00 <clever> at which point, how can linux detect the hypervisor?
07:52:00 <mrvn> only by trying to use some address space and it not working
07:53:00 <clever> what if i map the hypervisor's address to some other part of ram
07:53:00 <clever> so that 1mb block shows up at 2 addresses
07:53:00 <mrvn> what you would see is that you have an odd ram size.
07:53:00 <clever> and both are within the "no touchy" zone declared by the rpi firmware
07:53:00 <clever> which is already stealing 24mb of ram
07:53:00 <clever> i can just boot with gpu_mem=23, and now the firmware only steals 23mb
07:53:00 <clever> then take the extra 1mb for my hypervisor
07:54:00 <mrvn> yep, that hides it well
07:54:00 <clever> and the ram size is just as odd as without the hypervisor
07:54:00 <clever> the only sign it happened, is that a 1mb chunk of that 24mb "dont look here" is duplicated
07:54:00 <mrvn> can you ask the VC for its ramsize?
07:54:00 <clever> on pi4, that is permanently pegged at 1024
07:54:00 <clever> the VC is only aware of the lower 1gig
07:54:00 <mrvn> I mean the gpu_mem
07:55:00 <clever> you can, but i could just hook those routines...
07:55:00 <clever> but, you just gave me a crazy idea
07:55:00 <clever> i could live inside the gpu_mem's heap!
07:55:00 <mrvn> there are probably some follow up problems if you mess with that
07:55:00 <clever> using mailbox functions, i can allocate say a 1mb object on the VC's heap
07:55:00 <clever> and then i can copy my hypervisor into that
07:56:00 <clever> now it really is "in use" by the firmware!
07:56:00 <mrvn> yeah, maybe better. And you could display the hypervisor memory as graphics output for fun
07:56:00 <clever> already tried that in another crazy idea, i wanted to dump the bootrom on the framebuffer, without bringing ram online :P
07:56:00 <clever> but i think the framebuffer cant be too close to 0 in ram
07:56:00 <clever> "ram"
07:57:00 <clever> so i have to bring ram online, to address further away from 0 and have it function
07:57:00 <mrvn> isn't the bootrom some secure memory that the graphics chip wouldn't be able to access?
07:58:00 <clever> much like the gameboy and xbox, its just a normal axi slave, until you set a magic flag, then it drops off the bus and that addr becomes ram
07:58:00 <mrvn> does the RPi4 have the secure extension?
07:58:00 <clever> the secure extensions in the ram controller, are wired into the VC, not the arm
07:58:00 <clever> so only the VC in secure mode, can access protected pages
07:58:00 <clever> the official firmware runs in non-secure mode by default, and has an array of trusted functions that can be ran in secure mode
07:59:00 <clever> and a syscall like api, to run a function by index
08:00:00 <clever> secure_fn_0 is used as an index lookup, you give it a function pointer, and it returns the index into that array
08:00:00 <clever> that index is then stored under this->fn_foo_index, and later used to call it
08:02:00 <clever> the VC has a 128 slot vector table, 32 slots for cpu exceptions, 32 slots for software interrupts (like int 0x80), and 64 slots for hw interrupts
08:03:00 <clever> each slot is just a PC to jump to, but bit0 of the value signals if the vector should be serviced in the current mode or secure mode
08:03:00 <clever> so storing `&irq_uart | 1` into a slot, causes the irq handler to be ran in secure mode
08:03:00 <clever> and the same for software interrupts
08:04:00 <clever> each core (there are 2) also has a register for the base addr of that vector table (much like arm's VBAR)
08:05:00 <clever> alignment is enforced by the register simply not storing the lower bits, so if you read it back, its been rounded down to the nearest alignment
08:07:00 <clever> mrvn: nested paging tables are also taken to another level on the rpi, there is an extra mmu between "arm physical" and real ram, 64 pages of 16mb each
08:07:00 <clever> so you can potentially be going thru 3 paging tables, EL1, EL2, broadcom
08:08:00 <clever> the broadcom mmu is applied outside of the arm l1/l2 caches, so a cache-hit wont have any perf cost
11:36:00 <ddevault> can someone explain what the %gs register is for
11:36:00 <ddevault> I am utterly failing to understand its (apparently important) purpose
11:36:00 <GeDaMo> Thread local storage?
11:39:00 <gog> yes, typically %gs contains the base address for the thread's local data
11:39:00 <gog> %fs and %gs
11:39:00 <ddevault> hm
11:39:00 <gog> this was the convention before and since a few CPU generations ago is supported by CPU instructions
11:41:00 <gog> before amd64 thread-local storage was managed with the GDT, now it's managed with a pair of MSRs
11:42:00 <ddevault> I see
12:12:00 <klys> global segment
13:37:00 <mrvn> can one apply __attribute__((__packed__)) to a template<typename T>? gcc always says it will ignore it.
14:33:00 <bauen1> mrvn: some code i have here says you can, at least gcc (10, 11, 12) isn't complaining
14:34:00 <bauen1> code in question is roughly: `template <typename T> struct [[gnu::packed]] Timed { T value; }`
14:34:00 <mrvn> bauen1: it doesn't complain, it just ignores it. Check sizeof()
14:34:00 <mrvn> or it complains that it will ignore it
14:34:00 <mrvn> bauen1: your Timed is packed but T is not packed. So overall you just changed the alignment to 1 and broke T.
14:35:00 <mrvn> Try struct T { char c; int i; }; the value is not packed.
14:36:00 <bauen1> oh i hate c++
14:37:00 <mrvn> The problem might be that the [[gnu::packed]] needs to be between "struct" and "Name" in the T.
14:38:00 <bauen1> wtf
14:39:00 <bauen1> no, what, https://godbolt.org/z/Pecr3hsE3 seems to work
14:39:00 <bslsk05> ​godbolt.org: Compiler Explorer
14:39:00 <mrvn> bauen1: don't forget that packed isn't recursive. A struct in a packed struct is not itself packed. You have to apply the attribute to every sub struct too.
14:40:00 <mrvn> yes, packing S works, but packing Timed doesn't pack the inside T.
14:41:00 <bauen1> oh, i don't think that will be a problem, the code that cares about packed static_asserts that alignof(T<...>) == 1
14:41:00 <mrvn> I even tried this: <source>:4:34: warning: attributes ignored on elaborated-type-specifier that is not a forward declaration [-Wattributes] 4 | template <struct [[gnu::packed]] T>
14:42:00 <mrvn> ahh, I didn't think of you asserting it's packed. thanks.
14:42:00 <mrvn> well, packed or only contains chars
14:43:00 <mrvn> s/you/your/
14:44:00 <bauen1> mrvn: i have written a header that asserts all kind of weird things to ensure a struct can be passed between 2 platforms without issues, except you can't entirely ensure that as someone can always 1. forget to add a static_assert(sizeof() = x) on their struct and use types that actually have a different size, e.g. `long int`
14:44:00 <mrvn> Anyone know what the state of introspection is for c++? Could one use that to recursivley generate a "struct [[gnu::packed]] PackedT" from any given T?
14:44:00 <bauen1> it starts with:
14:44:00 <bauen1> static_assert(CHAR_BIT == 8, "Please use a reasonable platform");
14:45:00 <mrvn> hehe
14:45:00 <mrvn> You must be happy that int is now two's complement.
14:45:00 <mrvn> till recently you could only share unsigned types and intX_t.
14:46:00 <bauen1> lol we're sharing floats and doubles here ...
14:46:00 <heat> c++ is a prime example of stockholm syndrome
14:46:00 <mrvn> uhoh. what about archs without denormalized doubles?
14:46:00 <heat> fuck em
14:47:00 <bauen1> there's also some really shitty stub headers so i can compile the microcontroller firmware for linux, and get all the offsets of struct members and some other information exported into JSON
14:47:00 <mrvn> On alpha doubles aren't even ieee unless you add a gcc flag that makes it run half speed.
14:47:00 <bauen1> writing a program to use libclang was also considered, but as far as i could see libclang works on the AST and not on e.g. the final struct layout / values
14:47:00 <mrvn> In the future you can do that with introspection.
14:47:00 <bauen1> mrvn: in the future there will be rust ...
14:48:00 <mrvn> can rust already do introspection?
14:48:00 <bauen1> mrvn: not sure, but it has macros / the derived-thingy that would make this exact thing a lot easier to build i think
14:49:00 <heat> bauen1, you could definitely use clang libraries to do that
14:49:00 <heat> clangd already knows sizes and alignments and whatnot
14:51:00 <bauen1> heat: libclang seems to only operate on the AST, or at least I couldn't figure out how to find a list of all struct types in the entire project that fulfil a certain criteria (e.g. passed to template, passed to function)
14:51:00 <bauen1> heat: problem is that all of this probably involves a few too many layers of templates :(
14:54:00 <mrvn> There is no problem that can't be made more magic by the use of more templates.
18:24:00 <geist> ddevault: are you using x86-64 or x86-32?
18:24:00 <ddevault> the former
18:24:00 <geist> also ugh, was responding to something 8 hours ago
18:24:00 <geist> oh looks like no more discussion was on it
18:24:00 <ddevault> I still don't fully understand %gs, but I don't really need to right now
18:24:00 <geist> so yeah as gog was saying gs: is largely vestigial
18:24:00 <ddevault> well, I understand what it *was* for
18:24:00 <geist> basically the *value* in gs (and fs) is irrelevant now
18:25:00 <ddevault> but I don't really understand what kernels still do with it
18:25:00 <ddevault> in any case, my code works so I'm happy enough
18:25:00 <geist> but you can use an override prefix to dereference something off it
18:25:00 <geist> ie
18:25:00 <geist> mov gs:4, rax or something like that
18:25:00 <ddevault> hm
18:25:00 <geist> basically take the address that is 4 off of what gs 'points to' and move into rax
18:25:00 <geist> and that's accomplished in the assembler via a segment override prefix byte
18:26:00 <geist> the way gs (and fs) 'point to' something in x86-64 is not via the GDT like it used to, but via a set of MSRs you can set
18:26:00 <ddevault> I see
18:26:00 <geist> GS_BASE FS_BASE and GS_KERNEL_BASE
18:27:00 <geist> *basically* it's used for thread local storage in user space. traditionally fs points to the thread local structure
18:27:00 <geist> and in the kernel GS usually points to somethig similar. a cpu specific data structure
18:27:00 <mrvn> iirc on x86 the use of fs/gs is reversed
18:27:00 <geist> in an SMP system you always want to have at least one per-cpu structure that you can anchor things off of
18:28:00 <geist> so it's traditional (and kinda baked into the arch in 64bit) that gs points to that inside the kernel
18:28:00 <geist> on non SMP it isn't really mandatory
18:28:00 <ddevault> I understand, that makes more sense now
18:28:00 <ddevault> I was not grokking that the use by convention differed from the use per the CPU manual
18:28:00 <ddevault> thanks :)
18:29:00 <geist> yah the manual won't really describe what it's for, just the mechanism
18:29:00 <geist> this is also where GS_KERNEL_BASE and GS_BASE and swapgs will start to make sense
18:30:00 <ddevault> what confused me is that it had a much more important purpose before
18:30:00 <ddevault> so all of the docs cover it in great detail regarding its legacy use
18:30:00 <geist> which initially is head scratching, but if you have both the kernel and user space use GS, those features start to make sense
18:30:00 <mrvn> ddevault: You mean as an actual segment descriptor?
18:30:00 <geist> wasn't so much important as fs and gs were just another one of the regular segment registers then (ds, es, fs, gs, ss, cs)
18:30:00 <ddevault> yeah
18:30:00 <geist> and protected mode segment stuff was somewhat more powerful
18:31:00 <mrvn> it's been repurposed since all segment start/limit is ignored in 64bit.
18:31:00 <ddevault> yeah
18:31:00 <geist> exactly, so in 64bit the other 4 registers are basically entirely vestigial (except cs signalling what mode you're in)
18:31:00 <ddevault> but I still saw kernels in the wild messing with it
18:31:00 <ddevault> so I was a bit unsure as to why they were bothering and if it was important
18:31:00 <mrvn> What is surprising is: what's up with "es"? Why isn't that used?
18:31:00 <geist> but they left some functionality in fs/gs, but indirectly (via the MSRS) or via the new instructions to let you set them directly (fsgsbase instructions)
18:32:00 <geist> mrvn: anymore or at some point?
18:32:00 <mrvn> geist: in 64bit mode
18:32:00 <geist> oh i guess AMD basically left in the bare minimum
18:32:00 <mrvn> Is the "es" prefix worse than fs/gs?
18:32:00 <geist> also es has some hard coded uses in some instructions, so i'm guessing they left it alone for that reason
18:32:00 <geist> otherwise you'd have to also modify those instructions to not use it, etc
18:33:00 <mrvn> ahh, that would explain it. Stripping it out of the instructions for 64bit mode would be complex.
18:33:00 <geist> yah iirc movs implicitly uses es for one of the sources? (amirite there?)
18:33:00 <GeDaMo> Destination, I think
18:33:00 <geist> or destination
18:34:00 <geist> yah
18:34:00 <geist> i dunno how the segment override prefixes work with movs. which side does/can it modify?
18:34:00 <GeDaMo> "For legacy mode, Move byte from address DS:(E)SI to ES:(E)DI. For 64-bit mode move byte from address (R|E)SI to (R|E)DI."
18:34:00 <GeDaMo> Doesn't seem to apply in long mode
18:35:00 <geist> GeDaMo: yeah or it does implicitly use ds/es except those have no offset/length so effectively it disables it
18:35:00 <geist> also interesting question: can you use fs or gs override prefix for it in 64bit mode
18:36:00 <GeDaMo> "The DS segment may be overridden with a segment override prefix, but the ES segment cannot be overridden."
18:36:00 <GeDaMo> https://www.felixcloutier.com/x86/movs:movsb:movsw:movsd:movsq
18:36:00 <bslsk05> ​www.felixcloutier.com: MOVS/MOVSB/MOVSW/MOVSD/MOVSQ — Move Data from String to String
18:37:00 <geist> there ya go. makes sense
18:38:00 <geist> or at least doesn't really make sense, but thats the answer!
18:38:00 <geist> as is lots of x86isms
18:39:00 <mrvn> If an opcode uses two segment registers then you can't override both of them. There is no "override the other segment" prefix byte.
18:40:00 <geist> right
18:40:00 <geist> that's the at least internally consistent part of it
18:41:00 <geist> and outside of movs i dont think too many other instructions access two pointers at the same time
18:41:00 <geist> i'm sure there's some other one somewhere (there always is) but i don't know of it offhand
18:45:00 <zid> does push [] count
18:45:00 <zid> also uses two selectors
18:47:00 <zid> (does that even exist?)
18:48:00 <geist> push indirectly? i dont think so
18:48:00 <mrvn> https://www.felixcloutier.com/x86/push
18:48:00 <bslsk05> ​www.felixcloutier.com: PUSH — Push Word, Doubleword or Quadword Onto the Stack
18:48:00 <geist> and yeah indirects or double indirects may reference more than one thing but i don't remember if x86 has a bunch of those
18:48:00 <mrvn> memory, register or immediate. The first would use 2 segments.
18:49:00 <geist> ie, indirect this and then use that word to then indirect something else
19:03:00 <mrvn> geist: mov (#1, ds:r2*4, es:r3), (#3, fs:r4*8, gs:r5) to the rescue.
19:04:00 <geist> hmm?
19:05:00 <mrvn> a hypothetical 4* indirect addressing opcode
19:05:00 <geist> ah
19:05:00 <geist> i was expecting you to show off a 68k opcode that does this no sweat :)
19:06:00 <mrvn> no braindead segments in m68k :)
19:06:00 <mrvn> any idea how m68k does TLS?
19:06:00 <geist> though i gotta say x86 limiting themselves to one memory deref per instruction in most cases really does make the microcode simpler
19:06:00 <geist> 68k and vax have fairly complex internal states to make sure that page faults or whatnot on the Nth operation can be unwound and restarted
19:07:00 <geist> good question re 68k TLS
19:08:00 <mrvn> They actually screwed that up in the 68020. can't recover from a bus error so they run 2 68020 (iirc) in parallel with a clock offset. If the first throws a bus error the second gets stopped before it becomes unrecoverable.
19:09:00 <mrvn> What a way to unwind an opcode on error
19:09:00 <geist> 68000 IIRC. 010 fixed that among other things
19:10:00 <mrvn> How is your m68k board?
19:10:00 <geist> from poking around the web i've seen a few references to sysv 68k abi just not having thread local storage. have to make a syscall in linux
19:10:00 <geist> it's doing fine, need to futz with it some more
20:12:00 <wxwisiasdf> hello
20:13:00 <wxwisiasdf> how do i tell gcc to interpret printf formats with the -fexec encoding
20:14:00 <wxwisiasdf> i get lots of spurious warnings because i am using -fexec-charset=ibm-930 and it's very annoying because i have to basically rely on me not messing up formatting things on the kernel
20:15:00 <mrvn> no c++?
20:15:00 <wxwisiasdf> no it's c
20:16:00 <mrvn> maybe you should start there :)
20:16:00 <wxwisiasdf> ???
20:17:00 <wxwisiasdf> oh i see using automatic type deduction for formatting from c++
20:17:00 <mrvn> std::format
20:17:00 <wxwisiasdf> yeah
20:17:00 <wxwisiasdf> but this a kernel :)
20:17:00 <mrvn> even more reason to have it type safe
20:18:00 <Griwes> Idk what this being a kernel has to do with anything, my kernel formats stuff with std::format :P
20:18:00 <mrvn> Griwes: with or without type erasure?
20:19:00 <Griwes> It's per std::format spec
20:19:00 <wxwisiasdf> okay if not the kernel then my libc also uses printf for the various *nix utilities
20:19:00 <mrvn> type erasure is a implementation improvement
20:19:00 <Griwes> Which type erasure
20:20:00 <Griwes> As the spec stands, you need to erase some argument types and you need to erase the iterator
20:20:00 <wxwisiasdf> and those are -fexec-charset ibm930 too
20:20:00 <mrvn> can't remember exactly but it reduces the code bloat
20:20:00 <Griwes> The iterator erasure, then
20:21:00 <Griwes> It was DR'd to be effectively required
20:25:00 <Griwes> The thing that reduced most code bloat for me was a very careful dance of force inlining just the correct things
20:26:00 <mrvn> basically everything before the type erasure and nothing after
20:29:00 <Griwes> not... *quite*
20:29:00 <Griwes> it was a bit more involved
20:35:00 <heat> sup noobs
20:40:00 <jimbzy> Messing around with a schematic. Und du?
20:42:00 <heat> nothing, just got home
20:42:00 <heat> i'll probably try to finish my nvme driver tonight
20:42:00 <jimbzy> Schweet
20:44:00 <mrvn> tell me more
20:44:00 <mrvn> ups
20:45:00 <heat> tell me less
20:47:00 <mrvn> https://youtu.be/ZW0DfsCzfq4?t=60
20:47:00 <bslsk05> ​'Grease - Summer Nights HD' by Kurt Harmsworth (00:04:01)
21:34:00 <geist> heat: looks like someone debugged the nvme fuchsia driver problem
21:35:00 <heat> oh man they ruined the fun :/
21:36:00 <heat> what was it?
21:37:00 <geist> some assumption the driver had about something. will check in a sec
21:37:00 <geist> not at work computer
21:37:00 <geist> iirc it was something like the driver assumes you can build a queue this long but the device nacked it
21:43:00 <heat> aha
21:43:00 <heat> IO queue right?
21:43:00 <heat> (not admin)
21:45:00 <geist> i'll have to check
21:45:00 <geist> at work right this sec
21:51:00 <heat> yeah i think it must be
21:51:00 <heat> unless the nvme is buggy
21:52:00 <heat> the queue limit they give you only applies to the io queue
21:52:00 <heat> btw I found out why PRPs and SGLs both exist
21:52:00 <heat> SGLs weren't a thing on spec 1.0
21:53:00 <heat> it can also explicitly not support SGLs
21:53:00 <heat> so you do need to support both PRPs and SGLs in your driver (yay complexity!)
21:57:00 <geist> ah that makes sense. i was expecting that SGLs are optional
21:57:00 <geist> so then the question is what subset of consumer hardware supports it
21:58:00 <geist> i was thinking this was similar to the complex descriptors in SDHCI which has a similar thing (in spirit)
21:58:00 <geist> ie a simple scheme that everything supports and the complex one
21:58:00 <geist> which effectively means the simple one is the one you worry about
21:58:00 <geist> and the other one is gravy that maybe you can use
22:00:00 <heat> linux does seem to use SGLs by default since they're probably faster
22:01:00 <geist> oh sure, favor the fancy thing but fall back
22:01:00 <heat> see, this is where I wonder if a buddy allocator is really the best choice for a page allocator
22:01:00 <heat> more contiguous memory = better
22:02:00 <heat> if you try to do SGLs on really fragmented memory you'll end up with basically a larger PRP
22:02:00 <heat> like how much are you actually paying at page alloc time vs all the speed ups you can go for
22:03:00 <heat> hugepages too
22:03:00 <geist> yah agreed re buddy allocator
22:03:00 <geist> i'm not a huge fan, but i think to be honest it's because i'm not a fan of doing whatever linux does because they are
22:03:00 <gamozo> Mornin everyone!
22:03:00 <heat> sup gamozo
22:04:00 <gamozo> I do 5 hours of yard work and now apparently I need to sleep until 2pm
22:04:00 <gamozo> Ahaha
22:04:00 <heat> geist, what did your other projects use?
22:04:00 <geist> for pmm? the queue
22:04:00 <geist> just a queue of pages in whatever order
22:04:00 <heat> ah just the simple list?
22:04:00 <geist> yep. zircon does too
22:04:00 <geist> hard to argue with O(1)
22:05:00 <heat> idea: buddy allocator as the backend, cache of memory regions as a percpu thing
22:05:00 <j`ey> a single queue of PAGE_SIZE's?
22:05:00 <heat> yes
22:06:00 <geist> maybe multiple queues for different numa nodes, etc but the idea is the same
22:06:00 <geist> carve off a struct per page and toss it in a list
22:06:00 <geist> works quite well, lots of large systems have survived on that
22:06:00 <heat> if you percpu cache it, you eliminate lock contention and stop any possible yo-yo of regions when alloc/freeing
22:06:00 <geist> it just hurts to allocate more than one contig page
22:07:00 <geist> yep. we have a per cpu cache in front of the pmm now. helped a lot
22:08:00 <heat> this is food for thought
22:08:00 <j`ey> nomnom
22:08:00 <heat> what does nt use?
22:08:00 <heat> or freebsd?
22:09:00 <heat> maybe it's in the windows internals books, can't find any information about it online
22:12:00 <geist> i think NT is queue based
22:28:00 <kazinsal> believe so
22:29:00 <kazinsal> I'd have to crack open Windows Internals to be sure
22:29:00 <kazinsal> and that's a lot of dead trees that's all the way over at the other side of my apartment, a whole 10 steps away
22:30:00 <geist> fairly certain queues of pages is pretty much the defacto implementation for more or less everything that was conceived of <2000 or so
22:32:00 <heat> here's a cute detail: windows maps IO ranges with large pages if it sees it can
22:33:00 <heat> i'm unsure of how this plays along with that scary x86 UB for large pages with multiple PAT attributes or whatever that was that doug16k once mentioned
22:39:00 <heat> haha they got struct page'd too
22:39:00 <heat> it's also a horrible thing with fields that are overloaded 4 times
22:42:00 <zid> Getting struct page'd is a terrible affliction :'(
22:57:00 <heat> page tables are swappable wtf
22:58:00 <mrvn> if you have nested tables nothing stops you from swapping one of them
23:00:00 <kazinsal> just bank switch the switched bank
23:01:00 <heat> they mention lists of PFNs quite a lot
23:01:00 <heat> but I don't know if this is the actual format of the lists
23:01:00 <heat> they don't really mention large pages
23:02:00 <heat> except the "we try to use large pages transparently" part
23:07:00 <geist> yah i never completely grokked what they were talking about with 'prototype page tables' or whatnot
23:07:00 <geist> PFN is probably just a way of saying 'page address shifted over'
23:07:00 <geist> lots of systems do that
23:08:00 <geist> re: swappable page tables, that may have fallen out of earlier experience with VAX, which actually lets you swap page tables
23:08:00 <heat> PFN is also what they call their struct page
23:08:00 <geist> remember most of the early devs for it were ex vax folks, so they applied a lot of the same design patterns
23:08:00 <geist> IRQL and whatnot is 100% a vax hardware feature they brought forward and emulated in software because they were used to the model
23:09:00 <heat> allegedly, they get PFNs on like 6 or 7 lists (dirty, clean, zero'd, unused, etc) and then allocate from those
23:10:00 <geist> yep. and theres a priority scheme there
23:10:00 <geist> when allocating a page it walks down the lists and finds the first one in the right list
23:10:00 <heat> but what constitutes a "list" is unclear to me. they mention a simple linked list
23:10:00 <geist> and then there's machinery that tries to keep the lists balanced and whatnot
23:10:00 <geist> i think it's basically an array of lists, in an allocation priority order. i like that model, even if it's only conceptual
23:11:00 <heat> but they surely mustn't be using a simple linked list if they want to allocate large pages
23:11:00 <heat> this section seems... vague
23:11:00 <heat> maybe microsoft will drop the nt kernel sauce next month
23:11:00 <geist> yah dunno how large pages would work in that model
23:11:00 <geist> but you said before it was for io pages
23:12:00 <geist> dunno the extent of large page support for non io pages
23:12:00 <geist> or maybe it's transparent in certain ways, like specific contiguous page allocation paths and then in that scenario it'll do a large map if it works out that way
23:12:00 <geist> zircon has that
23:12:00 <heat> virtualalloc has a MEM_LARGE_PAGES
23:14:00 <heat> they say they could possibly break down a 1040MB allocation into 1 huge page and 4 large pages
23:16:00 <heat> geist: re prototype PTEs, it seems they're like vm objects
23:16:00 <geist> possible. maybe they track the page assignments of pages to vmos using some sort of N level thing that they call prototype PTEs
23:17:00 <heat> like a shadow page table, real PTEs point to it (with the P bit 0'd)
23:17:00 <geist> yah
23:18:00 <geist> how that would precisely work i dunno, but i guess the gist is to use some sort of similar thing
23:18:00 <heat> linux also uses a page table-ish structure for their vmos
23:19:00 <heat> the so called radix tree, now renamed xarray
23:19:00 <geist> right
23:20:00 <geist> zircon uses a wavl tree of runs of 16 pages
23:20:00 <geist> basically arbitrarily picked to be a reasonable compromise
23:20:00 <heat> oh yeah for sure, the prototype page tables are like a little page table for the "vmo"
23:20:00 <heat> why did you go for a wavl tree?
23:21:00 <geist> yah whether or not you can actually use it as a real page table i dunno
23:21:00 <geist> because we already have a wavl tree implementation
23:21:00 <geist> or you mean why did i use a tree in the vmo or why a wavl tree vs some other tree?
23:21:00 <heat> why a tree vs the radix-tree/prototype page table thing
23:21:00 <geist> ah simply the former
23:21:00 <heat> s/tree/binary tree/
23:22:00 <geist> we already had it and it was expedient and it has pretty reasonable performance and size characteristics
23:22:00 <geist> especially given that most vmos are pretty small, and thus really only end up with a single run of pages
23:22:00 <heat> the most ingenious use of the page table stuff I've seen in linux is that you can kinda figure out what's dirty right from the top level of the tree
23:23:00 <heat> just like a page table
23:23:00 <geist> oh yeah?
23:23:00 <geist> ah
23:23:00 <wxwisiasdf> finally i got rid of the debug diag 8 cmd and now i use a proper ic console :D
23:24:00 <heat> when dirtying, they queue the dirty inodes; on writeback, they look at the radix tree and go down the branches that are dirty (literally a D bit)
23:24:00 <heat> then you can easily writeback large runs of pages at once
23:24:00 <geist> ah makes sense
23:24:00 <geist> yah that's a thing we wouldn't be able to do in the wavl tree because the order of the tree is not constant
23:27:00 <heat> yup
23:28:00 <heat> i was thinking about going down the radix tree route and making a dynamically growable tree (I don't know if that's how linux does it, but probably)
23:28:00 <heat> essentially add levels to it once they're required
23:28:00 <heat> small files would be trivial to look up, huge files would still be very fast
23:30:00 <heat> would probably keep each table PAGE_SIZE size'd for compactness' sake although getting it larger wouldn't be too bad either
23:31:00 <heat> i think it's theoretically way better than my binary tree in every characteristic except memory usage
23:31:00 <wxwisiasdf> radix trees seems interesting
23:32:00 <geist> yah i guess picking the right radix is interesting too
23:32:00 <geist> since that affects how much internal fragmentation you get
23:35:00 * heat nods
23:36:00 <heat> also if you use whole pages you can skip malloc and or its internal fragmentation
23:37:00 <geist> yah but thats probably pretty bad for internal fragmentation in the sense that you probably have a large set of unused page pointers
23:37:00 <geist> may be good to generally pick a radix or a set of radices that are relative to the size of the object or whatnot
23:37:00 <geist> no idea what linux does
23:38:00 <geist> but yeah the obvious one is one page radix
23:39:00 <heat> theoretically you could change the radix and restructure the tree when the levels get too deep
23:39:00 <heat> not that they would get too deep
23:40:00 <heat> at most you get 6 levels for a huge 64-bit vmo
23:40:00 <heat> ... at that point your main concern probably isn't the radix tree :D
23:41:00 <geist> also this is a fun thing where using larger page sizes affects your radix and the number of pages, etc
23:44:00 <heat> i wonder if changing page sizes does have a measurable effect on system performance
23:44:00 <heat> IO even
23:45:00 <heat> it's very common for kernels to just size things based on pages
23:46:00 <geist> right. it's an interesting question