Search logs: #osdev2 - 13 May 2022

#osdev2 = #osdev @ Libera from 23may2021 to present

#osdev @ OPN/FreeNode from 3apr2001 to 23may2021

http://bespin.org/~qz/search/?view=1&c=osdev2&y=22&m=5&d=13

Friday, 13 May 2022

00:44:00 <geist> kazinsal: figured out the weird segment thing from yesterday
00:44:00 <kazinsal> oh nice
00:45:00 <geist> see linux CONFIG_VMD
00:45:00 <geist> it's not really nice. it's a total intel hack
00:45:00 <geist> basically it's a fake virtual segment used by intel 'volume management device'
00:45:00 <geist> that basically puts devices on some other virtual segment so it can get access to more busses and resources or some nonsense like that
00:46:00 <kazinsal> oh nasty
00:46:00 <geist> that's why it starts at segment 0x1000. that's out of the range of ACPI or something
00:46:00 <geist> it's functionally a root bridge to another segment
00:46:00 <kazinsal> I think the Intel VMD thing is for NVMe hotplug and stuff
00:46:00 <geist> yah makes sense. gives them a new set of 256 busses to play with
00:46:00 <heat> i just found out why my connection was sometimes resetting for certain connections: my router has a bug where it resets the ivp6 flow label
00:47:00 <heat> s/ivp6/ipv6/
00:47:00 <heat> turned off flow labels in linux and windows and things work
00:47:00 <heat> this router is total crap i'm telling ya
00:48:00 <heat> doesn't even speak ARP properly
00:48:00 <geist> oh ugh
00:48:00 <geist> how is it doing v6? dhcpv6 or just stateless?
00:49:00 <heat> i think stateless
00:49:00 <heat> yup
05:55:00 <sikkiladho> While going through AArch64 mmu guides, I came across this: https://developer.arm.com/documentation/101811/0102/Translation-granule, which defines size per entry based on the granule size. i.e for a 4KB granule, size per entry for Level 0 page table would be 512GB. How is this calculated?
05:55:00 <bslsk05> developer.arm.com: Documentation – Arm Developer
05:56:00 <clever> sikkiladho: an entry in the L0 table, points to a single page (4096 bytes i assume) of L1 entries
05:56:00 <clever> and each entry in L1 points to a single page worth of L2 entries
05:57:00 <clever> until you hit the deepest point (i forget)
05:57:00 <sikkiladho> you can have 4-level tables, l0,l1,l2,l3
05:58:00 <clever> so the size would then be ((pagesize / entrysize) ^ depth) * granule? i think
05:58:00 <clever> let me check my notes
05:59:00 <clever> https://github.com/librerpi/rpi-open-firmware/blob/master/docs/arm-mmu.txt#L1-L12
05:59:00 <bslsk05> github.com: rpi-open-firmware/arm-mmu.txt at master · librerpi/rpi-open-firmware · GitHub
06:00:00 <clever> so for arm32, the first level of the table is just an uint32_t[4096] (16kb), and each slot represents a 1mb chunk of the virtual space, covering the full 4gig of virtual space
06:01:00 <clever> and strangely, an L2 is only 1kb long, a uint32_t[256], with each slot representing a 4kb page
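
As a rough illustration of the arm32 layout clever describes (a 4096-slot first-level table with 1 MB per slot, and a 256-slot second-level table with 4 KB per slot), here is a minimal sketch of how a 32-bit virtual address would split into indexes. The names `Arm32Split` and `split_arm32` are made up for this example and are not taken from the linked notes.

```cpp
#include <cassert>
#include <cstdint>

// Table sizes as stated in the chat: uint32_t[4096] and uint32_t[256].
constexpr std::uint32_t kL1Entries = 4096;  // one slot per 1 MiB of virtual space
constexpr std::uint32_t kL2Entries = 256;   // one slot per 4 KiB page

static_assert(kL1Entries * sizeof(std::uint32_t) == 16 * 1024, "first level is 16 KiB");
static_assert(kL2Entries * sizeof(std::uint32_t) == 1024, "second level is 1 KiB");

// Hypothetical helper type: the three pieces of a 32-bit virtual address.
struct Arm32Split {
    std::uint32_t l1_index;     // bits 31:20, selects a 1 MiB slot
    std::uint32_t l2_index;     // bits 19:12, selects a 4 KiB page in that slot
    std::uint32_t page_offset;  // bits 11:0
};

constexpr Arm32Split split_arm32(std::uint32_t va) {
    return { va >> 20, (va >> 12) & 0xFFu, va & 0xFFFu };
}
```

For example, `split_arm32(0x12345678)` yields first-level index 0x123, second-level index 0x45, and page offset 0x678.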
06:01:00 <clever> and did i name L1 and L2 right in these notes??
06:01:00 <clever> comparing that to the link you pasted...
06:04:00 <clever> scrolling down a bit, they have an example of how a 48bit address is cut up into 5 parts, and each part is an index into a table
06:04:00 <clever> the first part is bits 47:39, a 9bit int, so 512 slots in the L0, at 4k granules
06:05:00 <clever> and with 64bit slots, uint64_t[512], thats 4096 bytes for the entire L0 table
06:06:00 <clever> oh, wait
06:06:00 <clever> > Math.pow(2,39)/1024/1024/1024
06:06:00 <clever> 512
06:06:00 <geist> the way i think about it is you take the log2 of each size
06:06:00 <clever> sikkiladho: the difference between slot0 and slot1 in the L0 table, is just +1 in bit 39 of the addr
06:06:00 <geist> ie, 12 bits for 4K pages
06:06:00 <geist> so the page tables then cover 12 + 9 + 9 + 9 + 9 bits of address space
06:07:00 <geist> reason for 9 is each page table entry uses 8 bytes, so that shifts 3 off the log2
06:07:00 <geist> okay this isn't clear, but once you grok it the math is simple
06:07:00 <geist> [9][9][9][9][12] kinda
06:07:00 <geist> so for 16k pages it's [11][11][11][11][14] and so on
06:08:00 <geist> but that's basically where they get those 'bits used to index' from
06:09:00 <geist> each level adds 9 more bits (for 4k base page granule)
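
The [9][9][9][9][12] split geist describes can be sketched as follows for a 48-bit address with a 4K granule. This is only an illustration of the bit arithmetic; `VaSplit` and `split_va_4k` are hypothetical names, and real translation also depends on how the translation registers are configured.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: four table indexes (L0..L3) plus the page offset.
struct VaSplit {
    std::uint64_t index[4];
    std::uint64_t offset;
};

// Splits a 48-bit virtual address, assuming a 4 KiB granule:
// L0 = bits 47:39, L1 = bits 38:30, L2 = bits 29:21, L3 = bits 20:12.
constexpr VaSplit split_va_4k(std::uint64_t va) {
    VaSplit s{};
    for (int level = 0; level < 4; ++level) {
        const int shift = 12 + 9 * (3 - level);   // 39, 30, 21, 12
        s.index[level] = (va >> shift) & 0x1FFu;  // 9 bits -> 512 slots
    }
    s.offset = va & 0xFFFu;                        // 12 bits -> 4 KiB
    return s;
}
```

Adding 1 to bit 39 of the address moves to the next L0 slot, which is why each L0 entry spans 2^39 bytes (512 GB).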
06:09:00 <clever> are my notes right, about arm32 only having L1 and L2?
06:09:00 <clever> it feels weird now, that it doesnt start at L0
06:09:00 <geist> unless you enable PSE
06:10:00 <geist> i dont like the way they number things, but so it goes
06:10:00 <geist> i generally prefer to number the root L0 and count down, and if you only have two levels you only get to L1, etc
06:10:00 <geist> but they number them as if the terminal layer is always L3, i guess
06:10:00 <geist> at least in that doc
06:10:00 <clever> ah
06:10:00 <clever> and yeah, i see similar in the aarch64 doc sikkiladho linked, with 64k granules, the L0 doesnt exist
06:11:00 <geist> i think that's actually codified in the arch, because the ESR has a field that actually tells you what level a permission check failed at
06:11:00 <geist> so they had to define some numbering scheme
06:12:00 <clever> is PSE like LPAE? supporting 64bit phys on 32bit virt?
06:13:00 <Mutabah> Kinda iirc
06:13:00 <Mutabah> I think it forces big pages and overloads bits 12-16 as extra address bits?
06:14:00 <clever> oh, that would also be why i cant find it in the v7 docs
06:14:00 <clever> because armv7 is pure 32bit
06:14:00 <geist> oh LPAE, that's right. sorry
06:14:00 <geist> got x86 mixed in there
06:14:00 <geist> LPAE in arm == PAE in x86
06:14:00 <clever> ahh
06:15:00 <Mutabah> Is PSE also an ARM thing? I was describing x86's "PSE"
06:15:00 <geist> armv7 has LPAE extensions, much like how x86-32 has PAE extensions
06:15:00 <geist> ie, you only get 32bit of address space but a larger physical space, by increasing each entry to 8 bytes and thus you now need 3 levels of page tables to fit it in, etc
06:15:00 <clever> and LPAE is how raspi-os has gotten away with shipping a 32bit everything on devices with 8gig of ram
06:15:00 <geist> right
06:16:00 <geist> Mutabah: yah sorry, keep screwing up the names. PSE is iirc rarely used. was a temporary hack until PAE came along
06:16:00 <clever> the pi4 also has 2 modes for peripheral io
06:17:00 <clever> "high peripherals" mode puts the MMIO up at a 64bit only access, so you dont get a hole in your ram
06:17:00 <clever> but now an aarch32 kernel cant touch MMIO until it enables LPAE
06:17:00 <clever> so the default is "low peripherals" mode, which puts MMIO at the top of the 32bit addr space, creating a hole nearly dead-center in your 8gig of ram
06:18:00 <clever> but now a 32bit kernel can touch MMIO before the MMU is on, and isnt forced to use LPAE
06:18:00 * geist 's head hurts with even more stupid rpi4 shit
06:18:00 <geist> you're almost proud of how stupid that thing is aren't you?
06:18:00 <clever> its probably the stockholm, lol
06:18:00 <geist> like 'hey look at this dumpster fire i keep warming me hands to! if you put truck tires in it vs regular car tires the smoke is pretty!'
06:19:00 <clever> how would you have designed that?
06:19:00 <clever> put a hole in ram? put all ram after mmio? screw 32bit?
06:19:00 <geist> easy: put the peripherals at or around 0, start RAM at a higher address and cross right over 4GB
06:19:00 <geist> that's how basically all modern SOCs do it now
06:20:00 <clever> ah, yeah, thats simple enough
06:20:00 <clever> i think the problem is that the rpi reset vector is 0, and they didnt want to fix that
06:20:00 <clever> so ram must start at 0
06:20:00 <geist> limits your mmio space to say a GB or so, but if you have a really fancy thing you probably have PCI or whatnot, and you can put a second aperture > 4GB if you want
06:20:00 <geist> easy: put a rom at 0, turn it off when you're done
06:20:00 <geist> say 64MB of rom then peripherals, and you can start RAM at say 0x4000.0000 (1GB) and you have all that space
06:20:00 <clever> pretty sure thats almost exactly what the amiga is doing
06:20:00 <geist> that's precisely what the virt machine does
06:21:00 <clever> the rom is aliased to 0 during reset, and once the bootrom is in control, it knows where the true copy lives in the addr space
06:22:00 <geist> also iirc 68k has a high starting vector, i think
06:22:00 <geist> so you could put your rom in the high part of the address. or maybe the other way? actually now that i think about it maybe not
06:22:00 <clever> from what i heard, the 68k loads a pair of uint32_t from addr 0 and 4
06:22:00 <clever> and those become the initial SP and PC
06:22:00 <clever> kinda sounds like cortex-m?
06:23:00 <clever> the rom exists at some fixed higher addr, and is aliased to 0 temporarily, for that reset vector to find it
06:23:00 <geist> yes cortex-m has the exact same thing
06:23:00 <clever> but the initial PC is aware of that higher addr, and jumps directly to the non-aliased copy
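
The reset scheme discussed here (two 32-bit words at addresses 0 and 4 that become the initial SP and PC) can be sketched as below. This is an illustrative reader over a byte image, not firmware code; `ResetVector` and the function names are invented for the example. The 68k is big-endian, while Cortex-M parts are normally little-endian, so the sketch offers both byte orders.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the two words found at the start of the image.
struct ResetVector {
    std::uint32_t initial_sp;  // word at offset 0
    std::uint32_t initial_pc;  // word at offset 4
};

inline std::uint32_t read_be32(const std::uint8_t* p) {
    return (std::uint32_t{p[0]} << 24) | (std::uint32_t{p[1]} << 16) |
           (std::uint32_t{p[2]} << 8)  |  std::uint32_t{p[3]};
}

inline std::uint32_t read_le32(const std::uint8_t* p) {
    return (std::uint32_t{p[3]} << 24) | (std::uint32_t{p[2]} << 16) |
           (std::uint32_t{p[1]} << 8)  |  std::uint32_t{p[0]};
}

// 68k-style: both words stored big-endian.
inline ResetVector read_reset_vector_be(const std::uint8_t* image) {
    return { read_be32(image), read_be32(image + 4) };
}

// Cortex-M-style: same layout, little-endian words.
inline ResetVector read_reset_vector_le(const std::uint8_t* image) {
    return { read_le32(image), read_le32(image + 4) };
}
```

Because the PC word can hold any address, a ROM that is only temporarily aliased at 0 can point it straight at its permanent, higher location, which is the trick clever describes.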
06:23:00 <geist> and i think those both came from VAX. iirc
06:23:00 <geist> the double entry thing
06:24:00 <clever> but ive also heard that the amiga bootstrap rom doesnt use the initial SP at all
06:24:00 <clever> _start just loads sp like you would on any other platform, and that 32bit slot is instead used as some kind of version number
06:24:00 <geist> yah cortex-m explicitly doesn't push anything on the reset vector since there's nothing to save
06:24:00 <geist> and thus the SP doesn't really need to be valid. it's just a nice to have so your reset vector can be written in C
06:24:00 <clever> so you're free to ignore the initial SP
06:25:00 <clever> yep
06:26:00 <clever> i'm also thinking, about how i could modify the rpi, to behave less like a dumpster fire, lol
06:26:00 <clever> on VC4 era, there is a dedicated mmu with 64 x 16mb pages, that translates "arm physical" to real ram
06:27:00 <clever> so i could put some ram in the 1st 16mb slot, then 32mb of mmio, then 976mb of ram
06:27:00 <geist> actually no VAX doesn't do the reset that way, so really the cortex-m in this case is basically copying 68k
06:27:00 <clever> (that model range only supports 1gig max)
06:27:00 <clever> and then treat that first 16mb of ram as a boot rom
06:28:00 <clever> so the main block of ram begins at +48mb
06:28:00 <clever> but with no ability to map things above the 1gig point in the physical space, the higher you move ram, the more you lose
06:29:00 <clever> i need to figure out how the pi4 handles 8gig, and how i could coerce it into being more normal
06:31:00 <geist> arm64 got no problem with that
06:31:00 <geist> 64bit and move on with things
06:31:00 <clever> more, about how high/low peripheral works, and can i move it even lower, all the way to 0?
06:32:00 <geist> get a better SOC?
06:32:00 <clever> never! lol
06:33:00 <geist> moving physical stuff around is not really in your list of things you can do
06:34:00 <clever> given that past models had a dedicated mmu and i could change the phys addr of mmio anywhere i want....
06:34:00 <clever> it may still exist on the bcm2711
06:37:00 <clever> i see signs that it does
06:38:00 <clever> but it only has room for 1gig, same as before
06:50:00 <sikkiladho> I think I finally got it. There are 4 tables in AArch64: L0, L1, L2 and L3. If we use a 4KB granule, it means each entry in L3 covers 4KB. There are 9 bits (20:12) for the L3 table index. Therefore, there are 2^9=512 slots. A whole L3 table covers 512*4KB=2MB. Each L2 entry points to one L3 table. Therefore each entry in L2 covers 2MB, and so on up to L1 and L0. That's how it is calculated! Thank you.
06:51:00 <geist> yep!
06:51:00 <geist> and then the math follows from there if you use non 4k page granules
06:51:00 <geist> everything is just shifted over
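
The per-entry sizes sikkiladho works out can be reproduced with a small formula, sketched here under the assumption of 8-byte entries and levels numbered so that L3 is always the last one. `entry_coverage` is a hypothetical name. The sketch treats every level as a full table, so it says nothing about granules where the top level is smaller or absent (such as L0 with 64K granules, as noted earlier in the chat).

```cpp
#include <cassert>
#include <cstdint>

// Bytes of address space covered by ONE entry at `level` (0..3),
// for a granule of 2^granule_bits bytes with 8-byte table entries.
constexpr std::uint64_t entry_coverage(unsigned granule_bits, unsigned level) {
    // A table fills one granule, so it holds 2^(granule_bits - 3) entries.
    const unsigned bits_per_level = granule_bits - 3;
    // An L3 entry covers one granule; each level above multiplies by the table size.
    return std::uint64_t{1} << (granule_bits + (3 - level) * bits_per_level);
}

// 4 KiB granule: 4 KiB, 2 MiB, 1 GiB, 512 GiB for L3, L2, L1, L0.
static_assert(entry_coverage(12, 3) == (1ULL << 12), "L3 entry covers 4 KiB");
static_assert(entry_coverage(12, 2) == (1ULL << 21), "L2 entry covers 2 MiB");
static_assert(entry_coverage(12, 1) == (1ULL << 30), "L1 entry covers 1 GiB");
static_assert(entry_coverage(12, 0) == (1ULL << 39), "L0 entry covers 512 GiB");
```

Passing 14 or 16 as `granule_bits` shifts everything over, as geist says: for example an L2 entry covers 32 MiB with 16K granules and 512 MiB with 64K granules.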
06:53:00 <clever> -rwxr-xr-x 1 root root 1.1K Feb 8 09:09 /boot/overlays/highperi.dtbo
06:53:00 <clever> oh, thats just cheating, lol
06:54:00 <clever> when you tell the firmware to move the peripherals, in addition to moving them, it just applies this DT overlay
06:54:00 <clever> and all its doing is patching the ranges= and dma-ranges= in a few spots
07:24:00 <moon-child> Q: what's the point of 5-level paging, or of arm's 52-bit address space extension?
07:25:00 <Mutabah> More address space?
07:26:00 <moon-child> yeah but why? What's it for?
07:26:00 <moon-child> I can't imagine people are mmapping files at that scale...
07:26:00 <Mutabah> Who doesn't want more AS? :)
07:27:00 <moon-child> no but seriously what's the demand?
07:27:00 <Mutabah> Probably future-proofing
07:27:00 <Mutabah> so people can do massive memory-maps if needed
07:28:00 <moon-child> yeah but again: why would you need that?
07:28:00 * kingoffrance .oO( https://johnhartstudios.com/bc/2017/05/28/sunday-may-28-2017/ )
07:28:00 <bslsk05> johnhartstudios.com: Sunday May 28, 2017 - B.C. Comic Strip
07:32:00 <geist> moon-child: arm does not do 5 level paging
07:32:00 <geist> x86 has a new 5 level paging extension, and it extends the aspace out to 57 bits
07:32:00 <geist> the point of that is fairly obvious: 57 > 48 bits
07:33:00 <moon-child> geist: yeah didn't mean to imply arm had 5lp. Hence the 'or' separating the two
07:33:00 <moon-child> I mean 128 > 64 but no one is making 128-bit cpus...
07:33:00 <geist> but yeah what use cases stuff has for that? I dunno, but i'm sure some folks do
07:49:00 <Griwes> 48 bits is only 256 TiB, and with machines actually having TiBs of RAM per box and stuff like RDMA, that is... getting crammed, at least for HPC use cases
07:52:00 <moon-child> oh yeah rdma
07:52:00 <moon-child> mesh stuff
07:52:00 <moon-child> makes sense
12:52:00 <mrvn> We have customers with TiB of ram at work. With 48 bit == 256 TiB half of that goes to user space, half to kernel. If you want a phys map for easy page table manipulations then that's half again. So 64TiB of ram before you run into problems.
12:53:00 <mrvn> Suddenly it's not so big an address space.
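
mrvn's arithmetic can be written out as a short sketch. It assumes exactly the split he describes (half the address space for the kernel, and half of that for a direct physical map); `direct_map_budget` is a hypothetical name and real kernels divide their half differently.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kTiB = 1ULL << 40;

// RAM that fits in a direct physical map, assuming the split from the chat:
// total / 2 for the kernel half, then / 2 again for the phys map.
constexpr std::uint64_t direct_map_budget(unsigned va_bits) {
    const std::uint64_t total = 1ULL << va_bits;
    const std::uint64_t kernel_half = total / 2;
    return kernel_half / 2;
}

static_assert(direct_map_budget(48) == 64 * kTiB, "48-bit: 64 TiB");
static_assert(direct_map_budget(57) == 32768 * kTiB, "57-bit: 32 PiB");
```

Under the same assumption, the 57-bit address space of x86 5-level paging raises that budget from 64 TiB to 32 PiB.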
14:29:00 <FireFly> funky
16:22:00 <mrvn> grrr, why is there no std::span<T>::at(size_t index) that does range checking?
17:01:00 <heat> because c++ is, above all things, consistent
17:02:00 <mrvn> heat: std::vector::[] has no check, std::vector::at() has check. How is that consistent with std::span?
17:03:00 <heat> i r o n y
17:03:00 <heat> ;)
17:03:00 <GeDaMo> "it's like bronzy or goldy but it's made of iron" :P
17:04:00 <mrvn> Is class A { int x; } class B : A { int y; } still an aggregate class?
17:05:00 <bauen1> mrvn: probably, you can check with https://en.cppreference.com/w/cpp/types/is_aggregate
17:05:00 <bslsk05> en.cppreference.com: std::is_aggregate - cppreference.com
17:07:00 <bauen1> mrvn: it is if you make all members and inheritance public
17:07:00 <mrvn> godbolt confirms that
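
The check bauen1 suggests can be reproduced with `std::is_aggregate_v` (C++17). In this sketch `PrivA` and `PrivB` are made-up names standing in for mrvn's original `class` version, where members and the base are private by default. The getters exist only so the private fields are used.

```cpp
#include <cassert>
#include <type_traits>

// All-public version: public members and public inheritance.
struct A { int x; };
struct B : A { int y; };

// Version written with `class`: the member and the base default to private.
class PrivA { int x; public: int get_x() const { return x; } };
class PrivB : PrivA { int y; public: int get_y() const { return y; } };

static_assert(std::is_aggregate_v<B>, "public base + public members: aggregate");
static_assert(!std::is_aggregate_v<PrivB>, "private base/members: not an aggregate");

// Aggregate initialisation of B: inner braces initialise the A base.
constexpr int sum_of_b() {
    B b{{1}, 2};
    return b.x + b.y;
}
```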
17:51:00 <mrvn> hehe, now isn't that a brilliant use of multithreading? std::thread t(func);
17:51:00 <mrvn> t.join();
18:41:00 <heat> it's just letting the main thread's cpu rest
18:41:00 <heat> good programming if you ask me
18:50:00 <mrvn> Assuming the thread isn't running on any core then t.join() could mark the thread to destruct itself and run the code in the main thread.
19:20:00 <Griwes> that'd be observable if you use tls
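
Griwes's point about TLS can be demonstrated with a small sketch: running `func` on a real thread and running it inline on the caller give different results once `func` touches a `thread_local`. The variable `tls_value` and both wrapper functions are hypothetical names for this illustration.

```cpp
#include <cassert>
#include <thread>

// Hypothetical per-thread state; every thread gets its own copy.
thread_local int tls_value = 0;

// Worker: writes to the TLS copy of whichever thread executes it.
void func() { tls_value = 42; }

// The pattern from the chat: spawn, then join immediately.
int via_thread() {
    tls_value = 0;
    std::thread t(func);
    t.join();
    return tls_value;  // caller's copy is untouched by the other thread
}

// The proposed shortcut, done naively: run the body on the caller.
int via_inline_call() {
    tls_value = 0;
    func();
    return tls_value;  // caller's own copy was overwritten
}
```

Here `via_thread()` returns 0 and `via_inline_call()` returns 42, so an implementation taking the shortcut would have to switch the TLS pointer for the duration of the call to stay unobservable.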
19:20:00 <mrvn> Griwes: you would have to set the tls reg to the thread till the thread exits.
19:20:00 <geist> true but why would you run it in the main thread? that would defeat the purpose
19:20:00 <Griwes> mrvn, what if the main thread has already shared a pointer to, say, an atomic in its tls with other threads that would want to access it concurrently with the call to join?
19:20:00 <mrvn> geist: save 2 context switches
19:20:00 <geist> well sure, but the point is to have a thread. if you want some sort of coroutine thing then build it that way
19:20:00 <mrvn> Griwes: then nothing. that keeps working
19:20:00 <Griwes> wdym by "then nothing"?
19:20:00 <Griwes> for the record, you cannot tell if that is the case by doing anything short of full program analysis
19:20:00 <mrvn> Griwes: why should the address become invalid?
19:20:00 <Griwes> ah, you mean swap the *tls* itself
19:20:00 <mrvn> push and pop it
19:20:00 <Griwes> I feel like that still has problems
19:20:00 <heat> meaningless optimizations
19:20:00 <heat> a totally non-trivial amount of work for a meaningless optimization
19:21:00 <mrvn> but those are always such fun
19:23:00 <mrvn> at least in the kernel the join() syscall can switch to the thread the code is waiting on for the remainder of the original thread's timeslice.
19:24:00 <mrvn> (or the thread's if that is bigger)
19:25:00 <heat> there's no join syscall
19:25:00 <heat> it's a futex
19:25:00 <heat> it just blocks
19:26:00 <heat> on thread exit the futex gets woken up (it's set on thread spawning time)
19:27:00 <heat> on linux that is, dunno about other OSes
19:28:00 <mrvn> same thing there. If a futex blocks wake up the thread holding the futex with the remainder of the timeslice.