Search logs: #osdev - 9 February 2019

#osdev @ OPN/FreeNode from 3apr2001 to 23may2021

http://bespin.org/~qz/search/?view=1&c=osdev&y=19&m=2&d=9

Saturday, 9 February 2019

12:00:27 <doug16k> it's the one thing about x86 that isn't full auto-magical-ness
12:00:30 <geist> from looking at what freebsd and linux do, to unmap a page they do an atomic swap with 0 and then harvest the M/A bits from what they go
12:00:51 <geist> got. so stands to reason that if at that instant in time another cpu modified the page before the TLB flush was done, the M update would be lost
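The swap-and-harvest unmap geist describes can be sketched as a toy Python model. This is not kernel code: the bit positions follow the x86 PTE layout, but `PageTable`, `xchg`, and `unmap_and_harvest` are hypothetical names, and a lock stands in for the atomicity a real xchg instruction provides.

```python
import threading

PTE_P = 1 << 0   # present
PTE_W = 1 << 1   # writable
PTE_A = 1 << 5   # accessed
PTE_D = 1 << 6   # dirty (the "M" bit in the discussion)


class PageTable:
    """Toy page table; one lock stands in for the atomicity of xchg."""

    def __init__(self):
        self._entries = {}
        self._lock = threading.Lock()

    def store(self, vaddr, pte):
        with self._lock:
            self._entries[vaddr] = pte

    def xchg(self, vaddr, new_pte):
        """Atomically replace the entry and return the previous value."""
        with self._lock:
            old = self._entries.get(vaddr, 0)
            self._entries[vaddr] = new_pte
            return old


def unmap_and_harvest(table, vaddr):
    """Clear the entry, then report (accessed, dirty) from the old value."""
    old = table.xchg(vaddr, 0)
    return bool(old & PTE_A), bool(old & PTE_D)


table = PageTable()
table.store(0x1000, PTE_P | PTE_W | PTE_A | PTE_D)
accessed, dirty = unmap_and_harvest(table, 0x1000)
print(accessed, dirty)
```

The model only covers the swap itself. The race under discussion sits in the window between this swap and the TLB shootdown, which the sketch does not model.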
12:01:18 <doug16k> it doesn't IPI at all?
12:01:39 <geist> sure, but there's a gap in time after you swapped 0 with it and the IPI went off
12:01:53 <geist> if another cpu in that instant had an active TLB for it it'd want to write back a M or A
12:02:03 <geist> well, M. A doesn't happen
12:02:07 <doug16k> how can it work asynchronously? it cant
12:02:21 <geist> because for an A writeback to happen the other cpu would have to read the entry, and you've just written zeros to it
12:02:26 <doug16k> it must have to synchronously wait for the other cpus to invalidate their tlbs for that
12:02:34 <doug16k> when clearing D
12:02:37 <doug16k> M
12:02:45 <geist> that would mean *all* unmaps too
12:02:52 <geist> since that could also be clearing M
12:03:08 <geist> clearly it doesnt so i think it ends up being a higher level problem: what happens if you 'miss' a M bit?
12:03:17 <geist> worst case you dont notice a page got modified
12:03:25 <doug16k> yes, but you can elide that by keeping your hands off that physical page and not care for the short term if a write leaks to an unmapped page in the next few ms
12:03:26 <geist> how fatal is that?
12:03:46 <geist> ah but what if it's shared?
12:03:56 <doug16k> as long as you don't actually free it until after you are sure everyone's tlb is up to date enough
12:04:00 <nyc> Shared pagetables is the brutal part.
12:04:01 <geist> shared pages are the main problem: you have a mmapped() file, it's mapped in multiple address spaces
12:04:11 <geist> you're not losing the page, you're simply removing a mapping of it with RW set
12:04:37 <geist> the specific situation i'm talking about is more subtle
12:04:55 <geist> yes what you're talking about, you cannot free a page to the PMM until you are absolutely positive there are no TLB entries pointing at it on all cpus
12:04:59 <geist> that's a given, in *all* situations
12:05:23 <nyc> Virtual mapping is considered a reference.
12:05:41 <geist> the more subtle case is if it's a shared RW page. in that situation you're trying to harvest M bits from all the entries that may be present
12:05:54 <geist> but there's a race when unmapping
12:06:09 <geist> that's what i'm pointing out
12:06:42 <doug16k> ah. now I follow. it's a case where you really care even at teardown time
12:07:07 <doug16k> can't just table flip the address space
12:07:34 <geist> exactly!
12:08:14 <nyc> The question is whether the pagetables have to be modified to reap reference counts and accessed and modified bits.
12:08:40 <geist> this is where arches like ARM without a M bit actually start to look better
12:08:57 <geist> because now you're forced to do the R -> RW mapping soft fault in order to harvest M
12:09:37 <geist> which itself is a bit tricky
12:09:44 <geist> well, RW -> R is kind of tricky
12:14:47 <nyc> Page cleaning is painful.
12:17:41 <nyc> So the serial port is somewhere else and I think it's using something oddly in the middle of kseg1 in https://wiki.osdev.org/CFE_Bare_Bones
12:27:00 <nyc> I'm attempting to use the legacy 32-bit ranges so wired kernel mappings will fly.
12:27:11 <nyc> Or otherwise so I don't have to fill the TLB.
12:27:21 <mrvn> geist: do you allow one cpu to write to the page table of another?
12:30:01 <geist> of course
12:30:40 <mrvn> geist: The race at unmapping is kind of not my problem. If one thread unmaps an address while another writes to it the result is undefined.
12:31:13 <geist> that's most likely the answer. the question is can it be fatal to a 3rd process that didn't want to be part of this
12:31:14 <mrvn> Can work, can segfault or can hit the M bit race and lose the written data
12:31:20 <geist> and i think it can in a fairly degenerate mmap() case
12:31:40 <mrvn> geist: if it writes to the address then it's part of it.
12:31:44 <geist> ie, process A maps a file RW. process B then maps it too and spawns 2 threads
12:32:02 <geist> then one of the threads in B simultaneously writes to the file while another thread unmaps it from B and the M bit is lost
12:32:07 <geist> fine. undefined as far as B is concerned
12:32:19 <geist> but now A sees a page that got modified, but the kernel missed the notification
12:32:30 <geist> and then sometime later decides to reclaim that page because it didn't think it was modified
12:32:41 <geist> then A suddenly sees the page go back to some previous state
12:32:41 <mrvn> geist: so A sees the modified data till the page gets evicted and re-read.
12:32:58 <geist> right. that seems incredibly unfair that B did something undefined and A got punished for it
12:33:13 <geist> it violates one or more contracts that the system provides, potentially
12:33:25 <geist> most likely Thats Simply The Answer, but still.
12:33:44 <mrvn> So when B unmaps something you have to shoot it down first. Then check the M bit and unmap it locally
12:34:10 <geist> but there's a race there where you've shot it down and then it gets faulted again
12:34:23 <geist> if you shoot down first theres one race
12:34:28 <geist> if you shoot down after there's another
12:34:40 <mrvn> geist: can't. all CPUs have to block in the shoot down till the original CPU clears the entry.
12:35:04 <geist> lemme think about that...
12:35:32 <geist> yeah you could do it that way, but it disallows you from doing fire and forget unmappings
12:35:38 <geist> which is what you do most of the time to be efficient
12:35:54 <geist> ie, unmap a bunch of pages, then do a large tlb shootdown of the batch of pages, then synchronize
12:36:16 <mrvn> so unmmap(addr) -> IPI to all CPUs, all: { flushTLB; reply to IPI; wait_for_clear; }, orig: wait_for_reply; clear_entry; signal clear
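mrvn's handshake can be sketched in Python, with threads standing in for the other CPUs. This is a toy model under stated assumptions: `synchronous_unmap` is a hypothetical name, starting a thread stands in for sending the IPI, and a barrier plus an event stand in for the reply and the "signal clear" step.

```python
import threading


def synchronous_unmap(page_table, tlbs, vaddr):
    """Clear one entry using the stop-and-wait handshake; return the old PTE.

    page_table is a shared dict (vaddr -> PTE); tlbs is a list of per-CPU
    dicts modelling the remote TLBs.
    """
    flushed = threading.Barrier(len(tlbs) + 1)  # reply to IPI / wait_for_reply
    cleared = threading.Event()                 # signal clear / wait_for_clear

    def ipi_handler(tlb):
        tlb.pop(vaddr, None)   # flushTLB
        flushed.wait()         # reply to IPI
        cleared.wait()         # wait_for_clear: blocked until the PTE is gone

    remotes = [threading.Thread(target=ipi_handler, args=(tlb,)) for tlb in tlbs]
    for thread in remotes:
        thread.start()         # stands in for sending the IPI
    flushed.wait()             # wait_for_reply: every remote TLB is flushed
    old = page_table.get(vaddr, 0)
    page_table[vaddr] = 0      # clear_entry; D can be harvested from old
    cleared.set()              # signal clear: release the other CPUs
    for thread in remotes:
        thread.join()
    return old


page_table = {0x1000: 0x43}                     # P | W | D in x86 bit positions
tlbs = [{0x1000: 0x03}, {0x1000: 0x03}, {}]
old = synchronous_unmap(page_table, tlbs, 0x1000)
print(hex(old), page_table, tlbs)
```

Every remote thread is parked between its flush and the clear, so none of them can reload the entry in between. That is the property mrvn is after, and also the "freezing all the other cpus" cost geist objects to.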
12:36:40 <doug16k> yeah how else would you be sure that a subsequent write on another cpu will set D again
12:36:43 <geist> yah that could work (on x86), but it'd be terrible performance wise. essentially you're freezing all the other cpus
12:37:06 <mrvn> only way to avoid races is to synchronize.
12:37:19 <geist> right, so seeing that existing systems dont do this
12:37:24 <geist> my question is how do they get out of it
12:37:45 <geist> my suspicion is it's simply UNDEFINED, and it's one of those gnarly edges that you rarely see
12:37:54 <geist> and really A isn't innocent here
12:38:03 <geist> it RW mapped a thing that it has no exclusive control over
12:38:04 <mrvn> maybe with lock prefix when writing to the table?
12:38:12 <nyc> Pagetable sharing happens.
12:38:17 <doug16k> mrvn, that won't change the other cpu's tlb
12:38:20 <geist> so that's where it essentially lost all ability to be protected of B's shenanigans
12:39:35 <mrvn> geist: actually what happens when this happens: B has the entry in the TLB for read, A clears the entry, B writes. The B core would update the entry that is no longer valid?
12:39:48 <mrvn> s/for read/from a read/
12:39:52 <geist> in x86 B faults
12:40:04 <mrvn> because TLB entry != page table entry?
12:40:05 <geist> because it'll reread the PTE and discover that it's not mapped
12:40:20 <mrvn> geist: why re-read?
12:40:36 <geist> well, perhaps it'll page fault
12:40:44 <geist> either way someone reads the entry and notices that it's no longer mapped
12:40:55 <mrvn> geist: the TLB entry for B would say: allowed to write, not modified.
12:41:08 <geist> you just said 'B has the entry in the TLB for read'
12:41:11 <mrvn> Does it always re-read when setting the M bit?
12:41:14 <geist> ie, it has a RO TLB entry
12:41:20 <mrvn> geist: from a read. wasn't clear
12:41:32 <geist> so you really mean it has a TLB entry with RW but no M bit?
12:41:36 <mrvn> yep
12:41:50 <geist> what you just described is *precisely* the race i've been talking about for the last few hours
12:42:03 <geist> it's unclear exactly what happens there. i think it loses the writeback of the M bit
12:42:18 <geist> AMD manual says it wont writeback the M bit
12:42:24 <geist> the intel manual doesn't say anything
12:42:42 <mrvn> geist: Ok. So it's not just when one CPU clears the entry WHILE another wants to write the M bit.
12:43:14 <geist> yah it's not the atomicity of the update per se, it's the window after the entry has been cleared but the other cpu hasn't gotten its TLB shot down
12:44:20 <geist> i dont know where it's written but i think it's assumed that M and A bit updates happen atomically when they happen, which is why you use atomics to modify the entry, even if you have a spinlock around it or whatnot
12:44:44 <geist> lemme see if i can find the section in the AMD manual
12:45:14 <mrvn> Can you change the page directory and the M bit gets written to the old page table because it's in the walk cache?
12:45:38 <geist> well, that's where the section in the AMD manual comes into play that i'm trying to find
12:46:03 <mrvn> I'm so glad I decided to not have threads.
12:46:57 <geist> section 5.4.2 in the AMD system programming manual has a little section on the A and D bits
12:48:10 <geist> https://pastebin.com/HhLkjAp6
12:48:50 <geist> the first sentence is the important one: "The processor never sets the Accessed bit or the Dirty bit for a not present page (P = 0)."
12:49:08 <geist> that implies that it does a cmpxchg when writing back to the PTE, and tests the present bit
12:49:24 <geist> it doesn't really imply that it validates that it's the same PTE that matches the TLB
12:50:01 <geist> as far as i can tell the intel manual has no such verbiage
12:50:28 <mrvn> If PTE[D] is cleared to 0, software can rely on the fact that the page has not been written.
12:50:43 <mrvn> cleared as in has not been set by the cpu, right?
12:51:06 <geist> that's how i'd read it
12:51:19 <geist> it says somewhere here that the cpu will never write 0 to the D bit
12:51:23 <geist> that's softwares job
12:52:02 <mrvn> doesn't mention what happens when I clear the bit without flushing the page.
12:52:31 <geist> right, it does somewhat mention or at least strongly imply that it keeps a D bit in the TLB entry
12:52:45 <geist> so if it thinks its dirty it wont try to write back to the D bit in the PTE
12:53:13 <geist> so basically it tries to write back the D bit to the PTE whenever it first sets the D bit in its local TLB
12:53:25 <mrvn> That's what I assume. Otherwise it would have to check&write the bit every time.
12:53:29 <geist> including when it first loads the TLB entry from a PTE without the D set
12:54:11 <mrvn> Now one question is: Will it reread the entry when it is cached with M=0 and written to?
12:54:30 <mrvn> The line about P implies it does.
12:54:59 <geist> yah
12:55:21 <mrvn> So you can xchg the entry and then test if the D bit was set.
12:55:22 <geist> presumably it does an atomic load; check P bit; or M bit; store
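The presumed writeback sequence (load; check P; or in M; store) and the window geist keeps pointing at can be laid out in a small single-threaded Python model. The function names are hypothetical, and the model deliberately does not say whether the racing store itself lands or faults, since the log leaves that open.

```python
PTE_P = 1 << 0   # present
PTE_W = 1 << 1   # writable
PTE_D = 1 << 6   # dirty / modified


def hw_set_dirty(page_table, vaddr):
    """Model of the D-bit writeback: load, check P, OR in D, store.

    Returns True if D ended up set in memory, False if the entry was not
    present and was left untouched. Real hardware would do this as one
    atomic read-modify-write; a single-threaded model gets that for free.
    """
    pte = page_table.get(vaddr, 0)
    if not pte & PTE_P:
        return False
    page_table[vaddr] = pte | PTE_D
    return True


def os_unmap(page_table, vaddr):
    """Model of the unmapping CPU: swap in 0 and return the old entry."""
    old = page_table.get(vaddr, 0)
    page_table[vaddr] = 0
    return old


page_table = {0x1000: PTE_P | PTE_W}
stale_tlb_entry = page_table[0x1000]            # another CPU cached RW, D=0
harvested = os_unmap(page_table, 0x1000)        # unmapping CPU swaps in 0
wrote_back = hw_set_dirty(page_table, 0x1000)   # other CPU writes before shootdown
print(bool(harvested & PTE_D), wrote_back, page_table[0x1000])
```

In this ordering the kernel harvests a clean entry and the later writeback is dropped because P is 0, so nothing ever records the page as modified.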
12:55:41 <geist> that's what you do to unmap, yes. that's what linux and freebsd does, at least
12:56:11 <mrvn> So switching from RW to R you first unmap it, then shoot down all cores and then remap it RO on the next access.
12:56:37 <geist> yah i think you have to do that,yes
12:56:50 <geist> otherwise you can get a R PTE with the D bit set
12:56:54 <mrvn> Just don't remap it before all cores have flushed their TLB.
12:57:12 <geist> right. or arrange for it to not be possible for all the cores to have it, etc
12:57:37 <geist> ARM has some sort of sequence like this they recommend deep in the armv8 manual
12:57:40 <mrvn> geist: Hmpf. That would work too. Map it R, shoot down all cores, when they are done check the D bit again.
12:57:51 <geist> there's a multi step process for transferring pages between certain states
12:58:07 <nyc`> I don't think a dirty readable pte is fatal. Just reap the dirty bit after read protecting it.
12:58:25 <geist> yes. this is where i'd like to know precisely if that's possible on all intel and amd cores
12:58:30 <geist> but it's not documented clearly
12:58:51 <mrvn> I would have thought the D bit can't be set on a RO entry. But don't see why not.
12:59:08 <mrvn> If the CPU only checks the present bit then D would get set.
12:59:22 <geist> right, that's how i read it from AMDs manual
12:59:38 <geist> it also means if you change the paddr of the page you really do need to go through a multi step
01:00:09 <mrvn> otherwise the new page is dirty while someone wrote to the old page.
01:00:11 <geist> which is fairly possible if you're doing a COW page, though in that case you wouldn't be faulting on a RW pte
01:00:23 <geist> you'd be going from RO paddr1 -> RO/RW paddr2
01:00:33 <geist> so D writebacks on that edge aren't possible
01:01:38 <mrvn> I would love to have a model of the CPU in one of those tools that evaluate race conditions between threads.
01:02:08 <mrvn> Give it your unmap() function and let the tool tell you how it will horribly break every blue moon
01:02:40 <geist> so some ARM cores can implement the A bit, but then of course in true ARM style everything is highly weakly ordered
01:03:17 <mrvn> I gave up on the A bit when I read that it may or may not be present.
01:03:19 <geist> though actually the AMD manual there says that the A bit writeback is fairly weakly ordered too. kind of makes sense
01:03:26 <geist> it's usually not that important that it be written Right Then
01:03:56 <geist> the trick is even if it's not present, atleast on armv8 they give you a lot of help to try to help you implement the A bit in software
01:04:08 <mrvn> geist: I read the AMD stuff so that it will write it early. Sets it on speculative code.
01:04:15 <geist> ie, the bit is still allocated in the PTE and you get a special exception with a partial decode of where in the page table tree the A bit failure was
01:04:47 <geist> so hypothetically you can write a fast A bit setting exception routine
01:05:02 <mrvn> Well, in ARMv6 when you turn off the A bit you get a usable bit in the page table for whatever.
01:05:18 <doug16k> mrvn, intel says pretty explicitly that you must invalidate when transitioning D from 1 to 0, or else you'll miss D updates
01:05:18 <geist> armv6... armv6... is that like a vax?
01:05:23 <geist> wait, people still use armv6?
01:05:39 <mrvn> geist: only the very late one. RPi
01:05:50 <geist> well, that's their dang fault!
01:06:03 <mrvn> I'm pretty sure it's the same for all 32bit arm
01:06:16 <geist> i'm actively trying to forget
01:06:26 <mrvn> When you clear the control bit for the A bit the cpu simply doesn't care about it.
01:06:39 <knebulae> Just my luck. Step away for an hour, miss one of the best discussions in weeks. FML.
01:07:23 <geist> yah in v8 (and maybe v7 LPAE) it's the opposite: if you set the A bit the cpu does nothing
01:07:50 <geist> but if it's unset then it either writes back like an x86 or it triggers an access fault exception with some decode assist to point at the page table (sort of)
01:07:50 <mrvn> geist: oh. I have no idea if the control bit was an enable or disable bit.
01:08:19 <mrvn> I just remember that when it was configured to not use the A bit then you could use it.
01:08:27 * geist nods
01:09:04 <mrvn> I bet it's the same in v6, v7 and v8. Would be insane to invert the bit in 64bit mode.
01:09:11 <geist> v8 is different
01:09:20 <geist> they broke lots of compatibility there
01:10:19 <nyc> With shared pagetables a pte may be used by another address space to map things at exit, but one just wouldn't free the pagetables there.
01:10:20 <mrvn> Well, my code runs on an RPi3 so can't be that borken.
01:10:51 <geist> mrvn: no, v8 specs that 64bit and 32bit EL1 are *different*
01:11:04 <geist> as in when you're in 64bit EL1 the page tables look like this, and exceptions look like this, etc
01:11:49 <mrvn> geist: but in 32bit mode the A control bit works as previous. Would really surprise me if they inverted the control bit for v8 there. Makes no sense.
01:12:16 <geist> well, i'm looking in the manual now, despite not really giving a crap what armv32 does any more
01:12:59 <geist> yep, just confirmed it. the access flag works that way in armv8-64
01:13:21 <geist> its always there. you cannot disable it
01:14:01 <geist> you have to enable the optional hardware access bit update
01:14:16 <geist> but the behavior of the bit is always defined, and it's non optional
01:15:39 <geist> ah yes i see the verbiage here: if you're using the 'short descriptor format' (ie, the 32bit page table entries) then you have to set SCTLR.AFE to enable it
01:16:10 <geist> if you're using the 'long descriptor format' in 32bit mode then the page tables look more or less like armv8-64. and the access bit is always present
01:16:35 <geist> so it's much like how x86 PAE mode extended the PTEs, except in the case of ARM they also really rearranged the bits and cleaned up some old cruft
01:18:45 <geist> oh huh. ARMv8-64 *does* support an optional D bit
01:20:20 <doug16k> geist, intel documents stop-the-world invalidation in section 4.10.5, then proceeds to be all hand-wavy about being "prepared to deal with the consequences of the affected linear-address range being used during that period"
01:20:31 <doug16k> ...if you don't stop the world
01:20:46 <geist> doug16k: yeah
01:22:40 <geist> god my head is exploding from reading the 10 page incredibly dense ARM docs on D bits
01:25:05 <geist> huh. the D bit works in a weird way on armv8
01:25:26 <geist> basically for perms there's a R W and E bit
01:25:52 <geist> the D bit in ARMv8, if implemented, is called the DBM bit (dirty bit modifier)
01:26:16 <geist> what it does is if you W fault on a page without the W bit set but you have the DBM set, it'll atomically update the page with the W bit set
01:26:37 <geist> essentially instead of marking it dirty, it in hardware marks it writable based on the DBM bit saying it's okay to do that
01:26:45 <doug16k> neat
01:27:08 <geist> DBM is sort of like saying 'could be writable'
01:27:25 <doug16k> writable when I write it
01:27:36 <doug16k> that's awesome
01:27:56 <doug16k> hardware accelerates getting dirty info doesn't it?
01:28:03 <doug16k> M
01:28:15 <geist> maybe. trying to really grok it in terms of what it would let you do
01:28:52 <doug16k> I thought you made the page read only with that set. then later when you look at the PTE you see that it is writable so it is inferred to be dirty
01:29:02 <geist> right
01:29:23 <nyc> It sounds like ARM preemptively dealt with dirty bit update races in its hardware pagetable walker.
01:29:25 <geist> and subsequent PTE loads by other cpus or the same cpu just see a writable page and dont need to do any additional logic
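The DBM behaviour as geist and doug16k describe it can be sketched as a toy Python model. The bit positions and function names here are made up for illustration, and a real ARMv8 descriptor encodes write permission differently than a single W bit, so treat this as a model of the idea rather than of the descriptor format.

```python
PTE_VALID = 1 << 0   # hypothetical bit positions, not the ARMv8 layout
PTE_W = 1 << 1       # writable
PTE_DBM = 1 << 2     # "could be writable": hardware may grant W on a write


def hw_write_access(pte):
    """Return (new_pte, faulted) for a store through this entry."""
    if not pte & PTE_VALID:
        return pte, True             # translation fault
    if pte & PTE_W:
        return pte, False            # already writable, nothing to update
    if pte & PTE_DBM:
        return pte | PTE_W, False    # hardware sets W instead of faulting
    return pte, True                 # genuine permission fault


def is_dirty(pte):
    """A DBM page that has become writable is inferred to be dirty."""
    return bool(pte & PTE_DBM) and bool(pte & PTE_W)


def make_clean(pte):
    """Software cleaning step: drop W again so the next write is noticed."""
    return pte & ~PTE_W


clean = PTE_VALID | PTE_DBM
after_write, faulted = hw_write_access(clean)
print(faulted, is_dirty(clean), is_dirty(after_write))
```

As in the x86 case, `make_clean` would still need a TLB invalidation on real hardware before the cleared state can be trusted; the model leaves that out.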
01:29:37 <geist> yah
01:29:53 <geist> coming along and designing things later has an advantage
01:30:01 <nyc> That is, from shared pagetables.
01:43:18 <doug16k> it's not better than x86 though, it's equivalent. when x86 tlb misses and loads a PTE with dirty bit set, it does nothing too
01:43:34 <doug16k> if it was a write
01:44:45 <doug16k> ah, I guess you mean it needs less gates in the implementation. for that yeah arm is probably simpler
01:48:23 <nyc`> Sharing, swapping, etc. pagetables is all in the mix.
01:50:15 <nyc`> Basically be able to run TPC-C with thousands of clients on Oracle.
01:57:15 <nyc> https://pdos.csail.mit.edu/papers/linux:osdi10.pdf
02:01:53 <nyc> I don't see that much material on scalability anymore.
02:08:59 <radens`> Hello, are there any recommendations for setting up the fs and gs base registers? GCC is generating code using the fs register, and I'm wondering if I should go with the flow and set up TLS like linux userland, or tell GCC to do something else.
02:11:03 <nyc> I see this: https://openbenchmarking.org/result/1811266-SK-LINUX420B84
02:17:38 <zhiayang> what's the "recommended" way to deal with the iopb in the tss?
02:18:05 <zhiayang> mapping all the ports takes up 8k of space
02:18:36 <geist> dont use it unless you have to
02:18:43 <geist> otherwise, suck it up and use 8K per tss
02:18:51 <zhiayang> and the bitmap potentially needs to be changed when switching between processes -- eg. this driver says i only need to access ports 00 - FF, this other driver says they only need 100 - 180
02:18:53 <geist> most systems dont need it
02:19:20 <geist> yep, what we do in zircon is keep per process a little runlist of required io ports
02:19:28 <zhiayang> so now each process either needs 8k extra per-proc storage, or have an optimised storage but will need to copy it out to the tss on context switch anyway
02:19:34 <geist> then 'patch' the io bitmap on context switch
02:19:37 <geist> yep
02:19:51 <geist> but the vast majority of processes dont need it, so the common case is to do nothing
02:19:59 <geist> it's only when switching between processes that need io port access
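The per-process runlist plus patch-on-switch scheme can be sketched in Python. This is only a model of the idea, not Zircon's code: the names are hypothetical, the bitmap is a plain bytearray standing in for the one in the TSS, and a set bit means access is denied, as in the x86 I/O permission bitmap.

```python
IO_BITMAP_BYTES = 65536 // 8   # one bit per port: 8 KiB, 1 = deny


def set_ports(bitmap, first, last, allow):
    """Allow or deny the inclusive port range first..last."""
    for port in range(first, last + 1):
        if allow:
            bitmap[port >> 3] &= ~(1 << (port & 7)) & 0xFF
        else:
            bitmap[port >> 3] |= 1 << (port & 7)


def switch_io_bitmap(bitmap, prev_ranges, next_ranges):
    """Patch the shared bitmap on context switch using the two runlists."""
    for first, last in prev_ranges:
        set_ports(bitmap, first, last, allow=False)
    for first, last in next_ranges:
        set_ports(bitmap, first, last, allow=True)


def port_allowed(bitmap, port):
    return not bitmap[port >> 3] & (1 << (port & 7))


bitmap = bytearray(b"\xff" * IO_BITMAP_BYTES)   # everything denied
driver_a = [(0x00, 0xFF)]                       # ranges from zhiayang's example
driver_b = [(0x100, 0x180)]
no_io = []                                      # the common case: nothing to patch

switch_io_bitmap(bitmap, no_io, driver_a)
switch_io_bitmap(bitmap, driver_a, driver_b)
print(port_allowed(bitmap, 0x80), port_allowed(bitmap, 0x100))
```

Switching between two processes with empty runlists touches nothing, which matches geist's point that the common case is to do nothing.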
02:20:12 <zhiayang> hm
02:25:16 <zhiayang> wow this irc client is pretty trash
02:25:41 <zhiayang> typed a long message while in the lift and it deleted it cos it couldn’t send
02:26:29 <Mutabah> up+enter?
02:27:17 <zhiayang> anyway, for most user procs that don’t need io ports i can just set the iopb offset to 0 right
02:27:22 <zhiayang> since access denied is the default
02:27:32 <geist> right
02:27:42 <zhiayang> instead of writing 8k worth of 0xffs
02:27:46 <geist> right
02:27:48 <zhiayang> i see
02:28:01 <zhiayang> ok thanks for the insight
02:28:10 <zhiayang> Mutabah: mobile client, not sure how to do “up”
02:28:38 <Mutabah> Ah
02:31:45 <froggey> not 0. it has to point at the end of the TSS, whatever you have the limit set to
02:37:07 <froggey> actually 0 might be ok, but there's something funny you have to watch out for?
02:46:22 <doug16k> don't let TSS cross a 4KB boundary IIRC, due to errata
02:47:04 <doug16k> the beginning part I mean
02:47:19 <doug16k> not IO bitmap
02:49:02 <doug16k> zhiayang, install hackers keyboard if it is android, it has full PC keyboard there
02:49:24 <radens`> geist: on zircon on arm64 you don't have IOPLs, so do you just have drivers live in el0 with memory mapped registers etc. mapped in to their address space?
02:54:50 <geist> correct
02:56:01 <geist> as a result, we bumped into a fun AMD errata the other day in KVM where we were doing a CPL=3 MMIO trap that vmexited but caused the cpu to use SMAP to fail inside KVM
02:56:21 <geist> you only see this when you map hardware regs to a CPL=3 process in the guest, which hardly anyone does
02:57:37 <doug16k> wasn't amd's fault though was it?
02:57:48 <geist> it was. found the actual errata
02:58:03 <doug16k> ah, it wasn't supposed to exit?
02:58:16 <geist> their fault was someone @amd.com added a patch to KVM for another reason that causes it to not deal with this errata. worse, it'd just go into an infinite fault loop
02:58:31 <doug16k> ah I see, amd broke kvm
02:58:51 <geist> so the failure precisely is that when a modern AMD machine faults on a EPT translation/protection fault
02:58:58 <geist> like you would if your guest was accessing MMIO regs
02:59:13 <geist> it'll fetch up to 15 bytes around the instruction in the guest that causes it, for decode assist
03:00:06 <geist> neat. that's nice. but.... if the guest has SMAP enabled, then when the cpu fetches the bytes around the instruction, mapped with user permissions in the guest, it actually fails due to a SMAP violation
03:00:22 <doug16k> this is the bug where they did that to handle encrypted guest memory?
03:00:26 <geist> essentially the cpu temporarily acts as if it were in ring0 when it does the fetch
03:00:30 <geist> yep, precisely
03:02:03 <geist> so you can imagine the complex microcode the cpu is going through when it's vmexiting
03:02:26 <geist> so it's probably an order of operations. it needs to disable SMAP and/or fetch the opcode pretending that it doesn't exist
03:03:20 <geist> https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf
03:03:23 <doug16k> ah, so the hack was restart the instruction and hope for the best?
03:03:24 <geist> i think it's a variant of 1053
03:03:27 <geist> yep
03:03:51 <geist> 1096 is the actual one
03:03:56 <geist> though you can see how it's related to 1053
03:04:50 <geist> ah 1053 is the actual opposite: the implicit access of GDT, etc does not honor SMAP
03:05:50 <doug16k> you are unable to use encrypted guest then?
03:06:06 <geist> no. we failed because a MMIO trapping instruction happened in CPL=3
03:06:32 <geist> thus KVM fails because it didn't get a decoded instruction when it vmexited
03:07:11 <geist> and the current logic in there, trying to handle the encrypted case, causes it to loop forever
03:07:18 <doug16k> ok but without encrypted guest, kvm can detect that scenario and manually read the code bytes, right?
03:07:23 <geist> correct
03:07:29 <geist> and the code used to do it
03:07:30 <doug16k> but can't with encrypted guest, even if it wanted to
03:07:45 <geist> right, so sounds like encrypted guest + CPL=3 SMAP violations make that impossible
03:07:57 <geist> a larger fix is probably to disable SMAP guest support if it's encrypted
03:08:12 <doug16k> ya
03:08:23 <geist> we reported it to amd, they're working on a fix
03:09:10 <geist> but it only really happens with CPL=3 mmio traps, which basically hardly anyone does except microkernels
03:09:32 <froggey> how do encrypted guests store their keys? surely the hypervisor can get to them somehow & decrypt the memory?
03:10:00 <geist> only via entering the guest. the hardware itself protects the hypervisor from directly accessing it
03:10:16 <geist> ie, part of doing a vm enter/exit is it loads the keys into the decryption unit
03:10:41 <geist> i think the actual key is randomly generated on guest start
03:10:52 <nyc> I feel like a complete idiot. At the moment I'm still trying to theoretically determine if the kernel text is in the right segment (legacy kseg0) and the MMIO is in the right segment (legacy kseg1) and am not even to the point where I can seriously consider whether I've gotten the actual precise addresses of the device IO memory right yet and it's been something like a week, though, granted, not where I could sit down and work for more than a couple of hours at a time (usually I'm bogged down in either language study or communications with people I need to do for various reasons for most of the time I'm at a computer).
03:11:03 <geist> i dunno if there's some provision for saving/restoring the keys and/or moving them between machines
03:11:03 <froggey> ah, neat
03:11:48 <geist> there's also some provision in the 2nd level page tables to have unencrypted pages, but i'm guessing that even though the hypervisor controls that, it can't switch a previously encrypted page to unencrypted without essentially turning it into garbage
03:12:31 <geist> nyc: sit down and debug it. i get the sense that you keep trying to reason through it
03:12:47 <geist> when you're doing initial bringup on something you haven't worked with before, realize that you have tons of blind spots
03:12:50 <geist> things you dont know you dont know
03:12:58 <geist> so thats why you have to stop and verify everything
03:13:24 <geist> is the code loaded where you think it is. are the right constants getting into the registers, etc
03:13:38 <geist> maybe you're forgetting there's a branch delay slot and it's messing you up, etc
03:14:45 <geist> doug16k: it's fun, a lot of the errata here in the ryzen errata doc just tell you to set some chicken bits in random MSRs
03:15:12 <doug16k> ya glad you shared that link, I haven't seen amd's errata yet
03:15:44 <geist> 1034 is pretty fun too
03:15:46 <mischief> wheres the ryzen errata?
03:15:52 <geist> https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf
03:15:53 <mischief> *types on a ryzen laptop*
03:15:54 <froggey> I'm curious what those bits are actually disabling, but we'll never know...
03:16:33 <geist> i usually search for 'amd tech docs' which take you to https://www.amd.com/en/support/tech-docs
03:16:46 <geist> which doesn't get you a lot, but it does get you a date sorted list of stuff they released
03:17:14 <geist> froggey: incidentally i think the 'secure encrypted virtualization technical preview' has infos about the key bits in SVM
03:17:37 <geist> the crypto stuff is pretty hard core
03:17:38 * froggey looks
03:17:48 <doug16k> I love chicken bits. turn em all on, I want total stability
03:18:06 <geist> yah disable cache too. almost all of your problems start there
03:18:07 <doug16k> computers are fast enough. seriously
03:18:42 <mischief> at work i now get to learn about all the qualcomm network offload stuff
03:18:46 <mischief> kinda neat
03:19:26 <geist> ah looks like there is a way to migrate a guest to a remote node with a different key
03:19:42 <mischief> this magical thing called the NSS https://people.netfilter.org/pablo/netdev0.1/slides/IPQ806x-Hardware-acceleration_v2.pdf
03:19:50 <mischief> this is the only public doc i could easily find
03:21:18 <nyc> It seems stuck on the first instruction in gdb.
03:21:34 <geist> dont use gdb, step in it with the built in debugger, see where the cpu is
03:21:39 <geist> it's likely it's not even starting
03:21:47 <geist> PC is in a weird spot, code isn't loaded right, etc
03:22:23 <geist> usually the very very first thing i do is put an infinite loop at instruction 1, then load it, verify it got loaded, verify the emulator is stuck on that instruction
03:22:37 <geist> then start moving the infinite loop further down so you can see where it gets stuck
03:22:44 <geist> also fun thing you can do with qemu
03:22:49 <geist> -d int,cpu,exec
03:23:02 <geist> it dumps a total firehose of the log of pretty much everything the cpu is doing
03:25:24 <nyc> The log looks helpful.
03:30:31 <geist> but first thing is get used to the qemu console
03:30:35 <geist> 'info registers' is your friend
03:43:59 <ybyourmom> Jury Nullification
03:51:37 <nyc> load from an illegal address on the first instruction
03:54:46 <geist> there ya go
03:55:05 <nyc`> It doesn't seem to like what I'm using for kseg0.
03:59:15 <nyc`> The qemu console is a lot easier than gdb'ing qemu.
04:24:55 <froggey> geist: so that SEV doc was pretty interesting. the SEV looks like some kind of ME-like-ish coprocessor thing that actually understands certificates & chains of trust
04:25:06 <geist> yeah exactly
04:25:13 <froggey> I originally assumed it was some dumb extension that the hypervisor could throw any old key into
04:27:45 <nyc> Argh, the mips64 qemu monitor doesn't do info tlb or info mem.
04:27:59 <geist> yeah, i think only really x86 implements that
04:28:24 <geist> one of those is interesting if you want to know the memory map.... something like info qtree or something tree
04:28:34 <geist> it dumps the internal device config of the emulator
04:31:29 <nyc> info mtree?
04:32:46 <geist> maybe?
04:59:26 <zhiayang> froggey: hm. i think the last time i meddled with this i never used the iopb, so the limit was 108
04:59:43 <zhiayang> so what i should do for procs without io perms is to set the limit to 108
04:59:49 <zhiayang> so the cpu will just ignore the bitmap
04:59:57 <zhiayang> correct?
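The rule zhiayang and froggey are circling can be modelled in Python. As far as the Intel SDM describes it, when the I/O map base is at or beyond the TSS limit there is no permission map and port access from a task with CPL above IOPL faults; the CPU also reads two bitmap bytes per check, which is why the bitmap is followed by a byte of all ones. The function name and the TSS layout below are assumptions made for this sketch.

```python
TSS_BASE_SIZE = 104   # size of the fixed part of the TSS assumed in this sketch


def io_access_allowed(tss, limit, iomap_base, port, width=1):
    """Model the permission check applied when CPL > IOPL.

    tss is a bytes-like image of the TSS segment, limit is the offset of
    its last valid byte, width is the access size in bytes (1, 2 or 4).
    """
    if iomap_base >= limit:
        return False                    # no I/O permission map at all
    byte = iomap_base + (port >> 3)
    if byte + 1 > limit:
        return False                    # both bytes read must be inside the limit
    bits = tss[byte] | (tss[byte + 1] << 8)
    mask = ((1 << width) - 1) << (port & 7)
    return (bits & mask) == 0           # every covered port must have bit 0


# A TSS with a bitmap covering ports 0..255 plus the terminating 0xFF byte.
tss = bytearray(TSS_BASE_SIZE) + bytearray(b"\xff" * 32) + b"\xff"
tss[TSS_BASE_SIZE + (0x60 >> 3)] &= ~(1 << (0x60 & 7)) & 0xFF   # allow port 0x60
limit = len(tss) - 1

print(io_access_allowed(tss, limit, TSS_BASE_SIZE, 0x60))
# A TSS with no bitmap: the base points at or past the limit, everything is denied.
print(io_access_allowed(tss[:TSS_BASE_SIZE], TSS_BASE_SIZE - 1, TSS_BASE_SIZE, 0x60))
```

Under this model a process without port access needs no 8K of 0xff bytes, only an I/O map base that points at or past the limit.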
05:00:16 <zhiayang> doug16k: not android, unfortunately
05:00:38 <nyc> Linux loads at the same place I do.
05:01:06 <nyc> Well, plus 64MB.
05:01:13 <nyc> But it's in the compat KSEG0.
05:06:34 <nyc> I just tried the same exact address and got a different-looking but still crashing result.
05:14:19 <zxq2> hello universe
05:14:55 <geist> hey there
05:15:15 <geist> nyc: how are you getting this code into the emulator?
05:15:27 <nyc> -kernel
05:15:44 <geist> ah. are you trying to link it as a kernel such that it puts it where you want it?
05:15:47 <nyc> $ qemu-system-mips64 -M mips -nographic -S -kernel mips/hello -d int,cpu,exec |& tee -a qemu.log.006
05:15:50 <nyc> geist: Yes.
05:16:06 <geist> are you sure you have the right address, where memory is? lemme see where i linked it on LK
05:17:19 <geist> looks like it's starting at 0x80000000
05:18:15 <geist> https://pastebin.com/gxWWVPVu
05:18:18 <nyc> I'm linking it to compat kseg0 at 0xffffffff80000000 plus change.
05:19:03 <geist> oh this is 64bit mips?
05:19:14 <nyc> Yeah.
05:19:19 <geist> they implement negative address space? isn't that way out of physical?
05:19:39 <geist> oh oh they have the remap of physical up there?
05:20:31 <geist> may be better to start off with the raw physical address
05:20:38 <nyc> I think the pure 64-bit things are mapped via the TLB instead of hardwired.
05:20:41 <geist> could be that confuses the loader and it's trying to follow the physical address
05:20:52 <geist> sure, but the -kernel logic may not 'see through' the way the cpu does
05:22:12 <geist> you may need to arrange for the paddr in the program header to be 0x8000.0000
05:22:48 <geist> re the TLB thing that may depend on if the mmu is enabled
05:22:58 <geist> i suspect it starts off disabled and thus running 1:1 physical
05:23:31 <nyc> AT(0x4000000) may be needed.
05:24:32 <geist> indeed. another thing to do is start qemu with i think -S or -s
05:24:42 <geist> i forget which one, but it starts the emulator halted before the first instruction
05:24:47 <nyc> -S
05:24:50 <geist> then you can inspect the cpu, see where the PC is set to
05:24:57 <geist> and poke around and see if your shit got loaded
05:25:03 <nyc> -s is for -gdb localhost:1234
05:44:57 <nyc> I mysteriously have an EPC in the data section now.
05:50:49 <nyc> 0xa reserved instruction exception.
05:51:13 <geist> interesting. is that a supported instruction?
05:51:36 <nyc> The EPC is in the data section somehow.
05:51:57 <geist> look at the elf file, what is the entry point set to?
05:52:47 <nyc> The start of the text section, a fair bit away.
05:55:02 <geist> i can't figure out why it would put the PC at anything but the entry point of the elf file
05:56:35 <nyc> The entrypoint of the ELF file looks good.
05:57:02 <geist> maybe it doesn't honor it. what address was the EPC set to?
05:57:09 <geist> numerically? could it be an entry vector of the cpu?
05:57:17 <geist> like, is it a really obvious boundary
05:57:29 <nyc> It was the start of the data section. Not an obvious boundary.
05:58:00 <geist> yah but what is it? like 0x1000? are you sure it's not an architecturally defined entry point?
05:58:21 <nyc> ! 3 .data 00000010 ffffffff84010190 0000000004010070 00010160 2**4
05:58:21 <nyc> CONTENTS, ALLOC, LOAD, DATA
05:58:36 <geist> weird.
05:58:41 <nyc> mips_cpu_do_interrupt enter: PC ffffffff84010070 EPC 0000000000000000 reserved instruction exception
05:58:41 <nyc> mips_cpu_do_interrupt: PC ffffffffbfc00380 EPC ffffffff84010070 cause 10
05:58:41 <nyc> S 00400006 C 00000028 A 0000000000000000 D 0000000000000000
05:58:58 <geist> is that the first instruction?
05:59:49 <geist> i thought you were trying to start the emulator stopped, and verify the stuff was loaded properly and the PC was set to something sane?
05:59:55 <nyc> It doesn't look like it. It looks like the PC is skipping at even intervals.
06:00:11 <geist> yes, but what about precisely the first instruction?
06:00:13 <nyc> I got to it being set to something sane with the LMA set.
06:00:46 <geist> that's still not the answer i'm looking for
06:00:53 <geist> is it or is it not starting at the right address?
06:01:00 <geist> and is there code there you expect to be when it does?
06:01:10 <geist> everything after that is irrelevant
06:01:46 <geist> once it goes off in the weeds, the way it goes crazy is generally unimportant, you have to trace back to where it goes off the rails and dont get sidetracked by the mangled wreck
06:02:08 <nyc> The starting address looks like all zeros i.e. a nop.
06:02:32 <geist> so now it's starting to look like something
06:03:06 <geist> my suspicion is it's running a nop sled and garbage until it happens to trip over an undefined opcode, which happens to be the first word in your data segment
06:06:09 <nyc> (qemu) x 0xffffffff84010070
06:06:09 <nyc> ffffffff84010070: 0x48656c6c
06:09:54 <geist> yah a string
06:10:03 <geist> that looks like ascii
06:10:44 <geist> can you pastebin a disassembly?
06:12:43 <nyc> https://pastebin.com/FVTuvQ64
06:14:37 <nyc> I'm assuming the LSR isn't meaningful on an emulated serial port, but that might be a bad assumption. I'm not really getting far enough to worry about that.
06:15:25 <nyc> There aren't really any comments in the asm, so I'm not sure the asm source is any more informative.
06:17:15 <doug16k> you can disassemble in qemu monitor with the x command using the i modifier, x /10i 0x1345
06:17:55 <doug16k> x /10i 0xffffffff84010070
06:18:16 <nyc> https://pastebin.com/EvhD7CRg
06:19:17 <nyc> All nops in the .text section.
06:21:06 <doug16k> that linker script says AT(0x4000000)... is qemu setting up paging like you expect?
06:22:18 <nyc> The hope is the loader will load things there.
06:23:07 <doug16k> I wouldn't expect a non-identity-mapped entry point to work with qemu's kernel loader
06:23:09 <nyc> It's 64MB from the start of compat kseg0.
06:26:05 <nyc> It didn't land at compat kseg0 or the 64MB physical.
06:26:19 <doug16k> use the xp command to dump physical addresses
06:26:25 <doug16k> x dumps virtual addresses
06:26:42 <nyc> I got to xp, yes.
06:28:42 <nyc> It's nops at both xkuseg and ckseg0.
06:30:11 <nyc> I should try xkphys.
06:30:37 <nyc> It's all nops at xkphys, too.
06:45:21 <nyc> 0xB000000000000000 looks like the start of cached xkphys.
06:46:10 <geist> instead of 0xffff.ffff.000...
06:46:12 <geist> ?
06:46:38 <geist> that does seem sort of right. if the 64bit mips is going to have the whole physical-map-into-kernel thing, seems like it'd start a lot lower than -2GB
06:46:40 <nyc> 0xffffffff80000000 is the start of compat kseg0.
06:46:47 <geist> oh, 'compat' i guess
06:46:57 <geist> as in if it were a 32bit machine it could just pretend the 0xfff wasn't there
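The "compat" relationship described above can be sketched as plain arithmetic: the 64-bit compat kseg0 base is the 32-bit kseg0 base sign-extended, and kseg0 is an unmapped window onto the low 512MB of physical memory. This is a minimal sketch, not anything from nyc's or geist's code; the helper names are made up for illustration, and the 64MB offset is the one nyc mentions for Linux.

```python
MASK64 = (1 << 64) - 1

def sign_extend_32_to_64(value32: int) -> int:
    """Sign-extend a 32-bit value to 64 bits (hypothetical helper)."""
    value32 &= 0xFFFFFFFF
    if value32 & 0x80000000:
        return (value32 | 0xFFFFFFFF00000000) & MASK64
    return value32

def kseg0_to_phys(vaddr: int) -> int:
    """kseg0 covers physical 0..512MB, so only the low 29 bits survive."""
    return vaddr & 0x1FFFFFFF

ckseg0_base = sign_extend_32_to_64(0x80000000)
kernel_vaddr = ckseg0_base + 64 * 1024 * 1024  # "plus 64MB"

print(hex(ckseg0_base))                  # 0xffffffff80000000
print(hex(kernel_vaddr))                 # 0xffffffff84000000
print(hex(kseg0_to_phys(kernel_vaddr)))  # 0x4000000
```

Under these assumptions a kernel linked at 0xffffffff84000000 would be expected at physical 0x4000000, which matches the AT(0x4000000) value discussed in the log.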
06:47:05 <geist> possible qemu doesn't implement it right
06:47:58 <nyc> I'm not finding the section of xkphys at 0xb000000000000000.
06:48:29 <geist> also in the program header, in the LMA are you sure that's the proper physical address?
06:48:32 <geist> 0x0400.0000?
06:48:40 <geist> that seems like an odd offset
06:48:56 <nyc> 64MB is where Linux loads by default.
06:49:27 <nyc> It's CONFIG_PHYSICAL_START (grep the Kconfig files for PHYSICAL_START).
06:50:05 <geist> okay
06:50:25 <nyc> .data is there in compat kseg0, but .text is MIA.
06:50:31 <geist> you're probably right. i told my (32bit) thing to load at 0x8000.0000 which is i'm sure physical address 0 using the kseg0 for 32bit
06:50:49 <geist> so you confirmed at start that it's simply not loaded into memory?
06:52:42 <geist> oh hang on, i think maybe this is it?
06:52:54 <geist> i noticed in the LOAD header:
06:53:07 <geist> LOAD off 0x00000000000000f0 vaddr 0xffffffff84000120 paddr 0x0000000004000000 align 2**4
06:53:10 <geist> filesz 0x0000000000010098 memsz 0x0000000000010098 flags rwx
06:53:21 <geist> seems like paddr should be offset into the page the same way vaddr is
06:53:25 <geist> ie, 0x0400.0120
06:53:47 <geist> if it followed this precisely seems like it would generally load everything shifted over 0x120 bytes to the left
06:53:52 <nyc> That does sound suspicious indeed.
06:54:04 <geist> i think your AT(.... needs a + .) or something in it
06:54:24 <geist> something to make sure the AT is within the same alignment into the page as the vaddr
06:55:13 <nyc> I just added SIZEOF_HEADERS to the first AT().
06:55:51 <geist> you did now that i mentioned it or you did before?
06:56:04 <nyc> You mentioned it.
06:56:16 <geist> https://github.com/littlekernel/lk/blob/master/arch/microblaze/linker.ld#L18 is basically what i do
06:56:41 <nyc> Now it's in the loop at 0xFFFFFFFF840001BC.
06:57:01 <geist> ah. so it's working as you wanted to now?
06:57:32 <nyc> The control flow is following what I had in mind and it's actually reaching the executable code.
06:57:42 <geist> woot, yeah i see the logic
06:58:00 <geist> yah i think what was happening is it was just shifting it over 0x120, so that it ran through a nop sled and hit the front of the data segment
06:59:11 <geist> and that's exactly what it is. the EPC was 0....070, which is precisely -0x120 from the ...190 it should have been
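The 0x120 skew can be checked with the numbers already pasted in the log: the LOAD header's vaddr and paddr, the .data VMA from the objdump line, and the EPC from the qemu trace. A minimal sketch of that arithmetic (variable names are only illustrative):

```python
# Values copied from the LOAD header, the objdump .data line and the qemu trace.
load_vaddr = 0xFFFFFFFF84000120
load_paddr = 0x0000000004000000
data_vma = 0xFFFFFFFF84010190
observed_epc = 0xFFFFFFFF84010070

# Offset within a 4KB page for each address in the LOAD header.
page_off_v = load_vaddr & 0xFFF
page_off_p = load_paddr & 0xFFF
skew = page_off_v - page_off_p

print(hex(skew))                     # 0x120
print(hex(data_vma - observed_epc))  # 0x120
```

Both differences come out to 0x120, which is consistent with the image landing 0x120 bytes lower than the virtual addresses assume.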
07:01:27 <nyc> It's easier when memory doesn't come up wiped so illegal instructions are hit ASAP.
07:02:06 <doug16k> it's possible to give qemu a file full of random data as the initial ram content
07:02:12 <geist> yah there's a way to arrange that fairly easily but i forget the syntax
07:02:29 <geist> one way is you can tell it to use an existing file as the backing store for memory, and then just preinitialize it to garbage
07:03:07 <doug16k> -object memory-backend-file,id=ram-node0,mem-path="random-mem.img",size=4G,prealloc=no <-- if your file of garbage is a 4GB file named random-mem.img
07:03:29 <doug16k> then add: -numa node,nodeid=0,memdev=ram-node0
07:03:45 <geist> with a cow file system like btrfs you can even easily snapshot it
07:03:47 <nyc> I don't have enough spare RAM to fire up a 4GB VM.
07:04:07 <geist> sure but obviously you can change the size
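For the "file of garbage" that the memory-backend-file option points at, one way to produce it is a short script like the following. This is only a sketch: the function name is made up, and the file name and size are whatever you then pass as mem-path= and size= on the qemu command line.

```python
import os

def write_random_file(path: str, size_bytes: int, chunk: int = 1 << 20) -> None:
    """Fill `path` with `size_bytes` of random data, written in chunks."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))
            remaining -= n
```

For example, write_random_file("random-mem.img", 256 * 1024 * 1024) would give a 256MB image for a correspondingly smaller VM than the 4G one in the example.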
07:04:29 <geist> whether or not that exact syntax that doug16k posted will work for !x86 remains to be seen though
07:04:57 <doug16k> yeah not sure either. I'd be slightly surprised if not
07:06:05 <geist> well the numa node stuff is problematic i suspect
07:06:31 <nyc> Something is off.
07:07:22 <doug16k> numa isn't x86 specific
07:07:47 <geist> of course, but not every arch/target has a concept of it
07:08:05 <geist> there are lots of switches to qemu that are unimplemented in various arch/target combos
07:10:51 <nyc> I think the first couple of instructions are missed.
07:11:14 <geist> thats what i was hinting at with the AT() thing
07:11:27 <geist> i think your SIZEOF_HEADERS isn't precise enough
07:11:39 <geist> it needs something that says 'right here', see the math i did in the example i linked?
07:12:23 <geist> trouble is there is more than one section in front of the .text segment there, so it's not getting accounted for in the SIZEOF_HEADERS
07:14:05 <nyc> I only put SIZEOF_HEADERS in the AT() for .MIPS.abiflags i.e. the very first one.
07:14:14 <doug16k> geist, yeah, I tried a few architectures, qemu doesn't like numa on several
07:16:00 <geist> nyc: well then double check those LOAD headers and make sure it makes sense
07:16:10 <geist> also suggest validating against a hexdump of the file
07:21:08 <nyc> Yeah, it's turned into a struggle.
07:30:03 <doug16k> can't you run it with -S to make it not start CPU automatically then look at the disassembly at the instruction pointer?
07:30:34 <nyc> Yes. It's off by 4 or 8 bytes or so.
07:31:02 <nyc> Trying to adjust things in the linker script makes it jump around at various thresholds.
07:31:12 <doug16k> are you generating a link map?
07:31:29 <doug16k> -Wl,-Map,whatever-name-here.map
07:32:03 <doug16k> it will tell you where it places everything virtually and physically
07:32:03 <nyc> I'm not using gcc as a driver, but I guess I can do that.
07:32:24 <doug16k> drop -Wl, then and use a space for that comma
07:32:49 <geist> if it's off by 4 or 8 bytes, it may be because the text has higher alignment requirements
07:33:08 <geist> like 2**4 or something, so based on how it decides to align the text segment it picks up some skew
07:34:09 <geist> if you rewind all of this. have you determined that you actually *have* to use the AT() thing?
07:34:21 <geist> does the loader not handle physical addresses in the kseg0-compat?
07:34:35 <geist> it's possible it'll just deal with it without needing that
07:34:39 <nyc> I have not really determined whether I do or not.
07:35:03 <geist> i used the kseg0 in the mips32 stuff and it seems to work
07:35:16 <geist> i had just set it to load at virtual/physical 0x8000.0000
07:35:17 <nyc> https://pastebin.com/bfAUvcnV
07:37:43 <geist> anyway, the AT thing is annoying, but what you absolutely have to do is arrange for the math inside it to always compute the correct physical:virtual address, even if there is a random set of things in front of that section
07:38:58 <geist> one way to do it is to make sure you do it on exactly the first section in the file, that way you can do a simple SIZEOF_HEADERS
07:40:21 <doug16k> yes, it should be done at the beginning. I wouldn't even try to do it elsewhere
07:41:40 <nyc> I did it for the first section.
07:42:36 <nyc> .MIPS.abiflags : AT(0x4000000 + SIZEOF_HEADERS) { *(.MIPS.abiflags) }
07:43:20 <doug16k> and objdump -p shows vaddr and paddr you expect?
07:44:05 <nyc> 2 .text 00000030 ffffffff840001a0 0000000004000198 000001a0 2**4
07:44:05 <nyc> CONTENTS, ALLOC, LOAD, READONLY, CODE
07:44:24 <doug16k> 1A0 vs 198?
07:44:48 <nyc> That seems off.
07:45:10 <doug16k> generally you want those last 3 digits to be the same on v and p addr
07:46:58 <doug16k> which puts them such that 4KB (re)mapping can handle it
07:48:14 <doug16k> 4KB granularity I mean
07:49:53 <doug16k> say it remapped from address bit 16 up, then you'd need the low 4 digits to be the same, etc
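The "same low digits" rule is a congruence check: the virtual and physical addresses must agree modulo the mapping granularity. A small sketch using the .text VMA/LMA from the objdump line above (the function name is made up for illustration):

```python
def same_offset(vaddr: int, paddr: int, granule: int = 4096) -> bool:
    """True if both addresses sit at the same offset inside a granule."""
    return (vaddr & (granule - 1)) == (paddr & (granule - 1))

text_vma = 0xFFFFFFFF840001A0
text_lma = 0x0000000004000198

print(same_offset(text_vma, text_lma))                     # False
print(same_offset(text_vma, 0x40001A0))                    # True
print(same_offset(text_vma, 0x40001A0, granule=1 << 16))   # True
```

With a 4KB granule that is the last three hex digits; with a 64KB granule (remapping from address bit 16 up) it is the last four.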
07:50:44 <geist> you see why though
07:50:49 <geist> the alignment: 2**4
07:51:13 <geist> the address you compute (0x400... + SIZEOF_HEADERS) does not take into account that the .MIPS.abiflags section is going to be aligned to the next 16 byte boundary
07:51:26 <geist> so you compute in this case ...198 but it ends up being ...1a0
07:51:57 <nyc> I had to use ALIGN() statements for both virtual and physical addresses to fix it up.
07:52:09 <geist> right. it's a little fragile, but at least it's explicit
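The ...198 versus ...1a0 discrepancy is what rounding up to a 16-byte boundary produces. A minimal sketch of that rounding, analogous in spirit to what an ALIGN(16) does in a linker script (the helper is hypothetical and assumes a power-of-two alignment):

```python
def align_up(addr: int, alignment: int) -> int:
    """Round addr up to the next multiple of a power-of-two alignment."""
    return (addr + alignment - 1) & ~(alignment - 1)

unaligned_lma = 0x4000198  # the physical address seen in the objdump output

print(hex(align_up(unaligned_lma, 16)))  # 0x40001a0
print(hex(align_up(0x40001A0, 16)))      # 0x40001a0 (already aligned)
```

Applying the same rounding to both the virtual and the physical side keeps the two in step, which is the effect nyc describes getting from the ALIGN() statements.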
07:52:31 <geist> all this aside, you can probably toss this section in DISCARD anyway and/or arrange for .text to be first
07:52:48 <geist> trouble is this will all break if for some reason the compiler decides not to emit this section, etc
07:53:02 <geist> but if you had one, say .text.boot, that you called out specially, put it at the start
07:53:15 <nyc> The first instruction is a load from the GOT, which faults.
07:53:16 <geist> then you can ensure that it's the first one and you *know* its always there because you write it
07:53:30 <nyc> Or otherwise $gp.
07:53:46 <geist> eh? did you write that?
07:53:51 <geist> plus you probably dont want a GOT
07:54:04 <geist> as in you should link it no pic, no external refs. probably wont emit a got
07:54:24 <geist> this is precisely the sort of fiddly crap that makes mips annoying to deal with, btw
07:54:42 <geist> last week i was saying it was straightforward except for tooling and whatnot? it's precisely this
07:55:01 <geist> the ABI is annoying and subtle to setup, with reserved registers that are supposed to be set up early
07:55:33 <geist> that being said, i'm not sure i ended up using $gp
07:55:45 <doug16k> if you link with --no-dynamic-linker, it will encourage the linker to relax those references to not use the got
07:56:12 <geist> also i dunno why i added it, but i added -mno-gpopt to my mips linker flags
07:56:38 <geist> https://gcc.gnu.org/onlinedocs/gcc/MIPS-Options.html ah, it does precisely what you want
07:56:46 <geist> basically says 'dont use $gp for small data references'
07:57:20 <geist> also see -G
07:57:52 <geist> but, small data refs are annoying and look like they're dubious value, easier to not deal with them for a while
07:58:08 <zhiayang> is it a big limitation if i force processes to be pinned to a specific cpu from start to finish?
07:58:17 <geist> zhiayang: generally, yes
07:58:20 <zhiayang> bleh
07:58:31 <nyc> It's still generating a $gp-relative load.
07:58:42 <geist> nyc: did you write it in assembly? pastebin it please?
08:00:16 <nyc> The actual asm source: https://pastebin.com/YnVXnPJW
08:00:41 <zhiayang> next question then, how is the per-cpu data (eg. gdt, tss) mapped within a process's addrspace? i can think of two approaches to allocating the per-cpu data: every cpu has the data at the same virtual address, mapped to different physical pages, or every cpu gets a unique virtual address for its data
08:00:52 <geist> zhiayang: the latter
08:01:05 <geist> also gdt/tss is not mapped in the process's address space
08:01:08 <geist> it's mapped in the kernel
08:01:19 <geist> you do *not* want the process to be able to access those data structures
08:01:24 <geist> you only need one per cpu, not one per process
08:02:00 <geist> nyc: okay, so which instruction is generating the $gp reference? the dla $t1, str?
08:02:09 <nyc> dla $t1, str yes.
08:02:17 <immibis> zhiayang: iirc they use physical addresses too
08:02:18 <geist> if so, might want to look into what dla is. that's a pseudo instruction
08:02:32 <geist> immibis: incorrect. gdt/idt/tss are all virtual
08:02:40 <immibis> okay
08:02:42 <nyc> It is a pseudo instruction, yes.
08:02:45 <geist> the only real cpu data structures that are physical are the page tables
08:02:57 <immibis> i'm probably thinking of page tables
08:03:08 <geist> nyc: and so... why does it decide to generate a $gp reference. what section does str end up in?
08:03:15 <nyc> I can probably hammer it out in terms of 16-bit slices.
08:03:39 <nyc> str ends up in .data
08:03:55 <zhiayang> if the gdt/tss is not mapped the addrspace, then how do i syscall/interrupt without faulting
08:04:17 <geist> nyc: so what is the disassembly? does it seem to imply that $gp is pointing at the base of data or something?
08:04:34 <geist> zhiayang: it's in the kernel, and the kernel is mapped in every process
08:04:45 <zhiayang> ok right, pardon my terminology
08:04:48 <nyc> ffffffff84001000: df8d8020 ld t1,-32736(gp)
08:05:06 <zhiayang> when i say "mapped in a process's addrspace" i meant "mapped in the cpu's current address space somewhere"
08:05:19 <zhiayang> where current address space == cr3 of the current process
08:05:19 <geist> nyc: so based on that where does it seem to think the $gp should be? most likely this is a mips32 vs mips64 abi thing, and there's something that says '$gp shall be set to XXX'
08:05:29 <geist> and so probably your first instruction is to get $gp pointing at something legit
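The instruction word nyc pasted (df8d8020) can be pulled apart by hand to confirm it really is a $gp-relative 64-bit load. A small sketch of decoding the standard MIPS I-type fields; the function name is made up, and the register naming in the comments assumes the n64 ABI numbering:

```python
def decode_itype(word: int):
    """Split a 32-bit MIPS I-type word into (opcode, rs, rt, signed imm)."""
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:          # immediate is a signed 16-bit quantity
        imm -= 0x10000
    return opcode, rs, rt, imm

opcode, rs, rt, imm = decode_itype(0xDF8D8020)
print(hex(opcode), rs, rt, imm)  # 0x37 28 13 -32736
```

Opcode 0x37 is LD, register 28 is $gp, register 13 is $t1 under n64 naming, and the offset is -32736, matching the disassembly "ld t1,-32736(gp)".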
08:05:44 <geist> zhiayang: then yes. it's mapped all the time, in the supervisor portion
08:05:50 <geist> most people dont call that the same address space
08:06:01 <geist> most people say there's a user adress space, and a kernel one. the kernel one is always mapped (usually)
08:06:10 <zhiayang> next question: if each cpu has its own base address for its cpu-local storage, should each process contain the mapping for every cpu, so it's easier to move between cpus?
08:06:15 <geist> the fact that on some architectures that means they use a single page table tree is an architectural detail
08:06:27 <geist> yes
08:06:29 <zhiayang> or should it only contain the mapping for the current cpu, then map/unmap as necessary when moving
08:06:34 <geist> single kernel image
08:06:41 <zhiayang> ok, i see
08:07:00 <zhiayang> slightly unrelated question: is the idt usually done per-cpu? i don't really see a reason for it atm
08:07:16 <geist> the standard thing to do (there are exceptions but i dont want to lead you too astray) is to have the kernel identically mapped on all cpus in all processes
08:07:26 <geist> usually not per cpu
08:07:37 <geist> same with GDT. there's usually not a reason to copy it
08:07:41 <geist> TSS is per cpu
08:08:05 <zhiayang> ah ok i was thinking i needed a gdt per cpu, but i realise now i can just use one gdt with multiple tss entries
08:09:12 <zhiayang> . o O (if i have more than 4096 cpus, then there won't be enough space in the gdt for a tss per cpu)
08:09:15 <geist> nyc: can you post an objdump -x <your elf file>
08:09:18 <geist> zhiayang: right.
08:09:33 <geist> and that's a reason why you'd likely need more than one GDT
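The 4096 figure follows from the GDT limit being a 16-bit field and a long-mode TSS descriptor taking 16 bytes (two 8-byte slots). A rough sketch of the arithmetic; the count of reserved slots is an assumption for illustration, since it depends on how many null/code/data descriptors a given kernel keeps:

```python
GDT_MAX_BYTES = 1 << 16   # 16-bit limit field, so at most 64KB of table
SLOT_BYTES = 8            # ordinary descriptor size
TSS_DESC_BYTES = 16       # 64-bit TSS descriptor occupies two slots

def max_tss_descriptors(reserved_slots: int) -> int:
    """How many TSS descriptors fit after `reserved_slots` 8-byte entries."""
    return (GDT_MAX_BYTES - reserved_slots * SLOT_BYTES) // TSS_DESC_BYTES

print(max_tss_descriptors(0))  # 4096
print(max_tss_descriptors(5))  # 4093
```

So 4096 is the ceiling with nothing else in the table; any null, code or data descriptors push the usable count slightly below that.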
08:09:42 <zhiayang> but otherwise no, correct?
08:09:45 <geist> right
08:09:50 <zhiayang> alright
08:10:02 <zhiayang> i'm curious though what are the exceptions where the kernel is not identically mapped on all cpus in all procs?
08:10:04 <geist> and dont bother with ldt. nothing really needs that anymore
08:10:23 <geist> the exceptions are some more advanced systems (linux, etc) let you have temporary per-cpu mappings
08:10:44 <zhiayang> oh, interesting
08:10:53 <geist> usually some spot in the kernel that is reserved for short sequences of 'i need to map this thing right now and as long as i dont move off the cpu i can avoid TLB flushing the rest of them'
08:11:03 <geist> it's an optimization for short map/use/unmap, essentially
08:11:17 <nyc> https://pastebin.com/2RtgeY3X
08:11:26 <geist> usual restriction is the code that's accessing per-cpu mappings cannot be preempted and must be 'pinned' to the current cpu
08:11:47 <zhiayang> ah, i see
08:11:53 <nyc> Linux has fixmapspace and kmap_atomic slots and such.
08:12:20 <geist> nyc: i'm sure the mips64 ABI will tell you how to load up $gp, but i'm guessing you need to arrange for it to be pointing to the VMA that is in the _gp address
08:12:42 <geist> so the question is, is there a way to li the address of _gp into $gp without using $gp itself
08:13:12 <geist> i was thinking it might have a special PHEADER or something that described how to set up the $gp
08:13:27 <nyc> HIDDEN (_gp = ALIGN (16) + 0x7ff0);
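The 0x7ff0 bias in that _gp definition lines up with the -32736 offset in the faulting load: loads take a signed 16-bit offset, so parking $gp near the middle of a 64KB window makes the whole window reachable. A sketch of the arithmetic, where the base address is purely hypothetical (the real one is wherever the linker's location counter is when _gp is defined):

```python
MASK64 = (1 << 64) - 1
GP_BIAS = 0x7FF0

base = 0xFFFFFFFF84010000      # hypothetical 16-byte-aligned location
gp = (base + GP_BIAS) & MASK64

target = (gp - 32736) & MASK64     # the "ld t1,-32736(gp)" access
lowest = (gp - 0x8000) & MASK64    # most negative 16-bit offset
highest = (gp + 0x7FFF) & MASK64   # most positive 16-bit offset

print(hex(target - base))   # 0x10
print(hex(base - lowest))   # 0x10
print(hex(highest - base))  # 0xffef
```

Under this assumption the load reads from 16 bytes past the base, which only works if $gp has actually been set to _gp before the first such access.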
08:14:34 <geist> from grepping around the internet, it seems that the GOT is more important on mips64 than mips32. presumably because it's more common to synthesize a large 64bit constant out of a relative load
08:14:38 <geist> vs a sequence of instructions
08:17:34 <nyc> I'll hand-write the sequence of instructions.
08:17:53 <geist> seems like it should be able to compute it pc relative or something
08:18:10 <geist> otherwise i'm sure there's a way to force a hard li with a fixed address
08:18:21 <geist> using the raw li instructions most likely
08:18:47 <geist> looks like dli may do it
08:18:48 <geist> vs dla
08:19:21 <zhiayang> (i wonder why amd made the rsps and the ists not 8-byte aligned in the 64-bit tss...)
08:23:05 <geist> nyc: yeah that makes sense. dla generally loads a large address by doing path against $gp
08:23:18 <geist> s/path/math
08:23:51 <geist> i guess because PC is not generally available it uses $gp as the anchor
08:31:56 <nyc> It won't let me use the address as a variable.
08:32:39 <nyc> li $t1, ((str >> 0) & 0xFFFF)
08:32:39 <nyc> daui $t1, ((str >> 16) & 0xFFFF)
08:32:39 <nyc> dahi $t1, ((str >> 32) & 0xFFFF)
08:32:39 <nyc> dati $t1, ((str >> 48) & 0xFFFF)
08:32:43 <nyc> :q
08:32:58 <nyc> /home/nyc/src/nmcp/mips/hello.S:15: Error: invalid operands `daui $t1,((str>>16)&0xFFFF)'
08:34:59 <nyc> I know it's the only thing in .data
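The "16-bit slices" idea can be checked numerically before fighting the assembler over it. This sketch only splits and rejoins a 64-bit address; it does not model how any particular instruction combines the pieces (instructions that add a sign-extended immediate would need carry adjustments), nor the assembler's own relocation syntax. The address used is the .data VMA from the earlier objdump output, taken as an illustrative value for str:

```python
def split16(addr: int) -> list:
    """Return the four 16-bit slices of a 64-bit value, lowest first."""
    return [(addr >> shift) & 0xFFFF for shift in (0, 16, 32, 48)]

def join16(slices: list) -> int:
    """Inverse of split16: reassemble the 64-bit value."""
    value = 0
    for i, s in enumerate(slices):
        value |= (s & 0xFFFF) << (16 * i)
    return value

str_addr = 0xFFFFFFFF84010190
slices = split16(str_addr)

print([hex(s) for s in slices])  # ['0x190', '0x8401', '0xffff', '0xffff']
print(hex(join16(slices)))       # 0xffffffff84010190
```

Since the value is fixed once the layout is known, the slices could also be written as literal constants instead of expressions on the symbol, at the cost of having to update them when the link address changes.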
08:54:24 <zhiayang> does swapgs function if %gs is loaded with a null selector?
08:56:32 <zhiayang> or do fs and gs have 64-bit descriptors in long mode
08:57:42 <nyc> I hope SPARC is less painful than MIPS has been. Then again, it's probably a lot of getting into qemu.
08:58:15 <nyc> I haven't really done a significant amount of x86-64.
08:58:46 <zhiayang> hm, they do not have 64-bit descriptors
08:59:15 <zhiayang> wait no, the value of %gs is irrelevant on entry to the kernel
08:59:22 <zhiayang> so it probably doesn't matter whether it's 32-bit or not
08:59:40 <zhiayang> unless somebody's writing strange asm in user mode??
09:02:20 <zhiayang> there's no swapfs, and from what i am reading the cpu will just always use the base in the fsbase msr
09:02:32 <zhiayang> so i don't need a per-cpu fs descriptor, but i do need a per-cpu gs descriptor
09:02:43 <zhiayang> true/false?
09:03:26 <nyc`> I think that would mirror userspace thread local storage.
09:04:35 <zhiayang> tls uses fs, but each cpu has its own set of msrs
09:04:54 <zhiayang> so i should be able to essentially have one fs base per cpu without needing extra gdt entries...?
09:05:08 <geist> zhiayang: gs and fs in 64bit mode work differently
09:05:37 <zhiayang> right, i’m aware
09:05:44 <zhiayang> i think
09:05:51 <geist> for the most part (*) the value in fs/gs do not matter, and there are a set of 3 MSR registers that provide the base offset of anything that uses a fs: gs: offset
09:06:01 <geist> so yeah you can and should just load 0 into fs and gs
09:06:42 <geist> but, the fs/gsbase is usually owned by the process. since they use it for TLS
09:07:14 <geist> typically the kernel uses gsbase for its own purposes (usually pointing at a per cpu data structure or the current thread)
09:07:23 <nyc`> Everybody knows ARM, so once I get to that, there'll be plenty of coverage.
09:07:32 <geist> but once you switch to user space a process typically has its own fs and gs base that the kernel basically doesn't care about
09:07:48 <zhiayang> i mean
09:07:54 <zhiayang> how is it gonna load those bases
09:08:01 <geist> can you be more specific?
09:08:15 <geist> you just write to the MSRs
09:08:18 <zhiayang> like it’s not going to be able to write to the gdt or anything
09:08:22 <geist> (or you use the newer fsgsbase instructions)
09:08:31 <zhiayang> so aren’t the bases determined at program load time
09:08:51 <geist> up to you. but typically the strategy is that's a user space problem
09:09:02 <zhiayang> like %fs:0 is supposed to point to some thread structure according to sysv abi
09:09:06 <geist> ie, the kernel only arranges to keep whatever user space wants fs or gs base to point at
09:09:35 <geist> its responsibility ends there. user space yes probably arranges for fs:0 to point at some thread structure, that's correct
09:09:52 <zhiayang> right
09:10:00 <zhiayang> so %fs can be 0
09:10:01 <geist> but we need to disambiguate problems. ring0 its more straightforward
09:10:07 <geist> you simply write to the MSR
09:10:21 <geist> usually fs: is unused, gs: is used by the kernel to point at a per cpu or per thread structure
09:10:26 <zhiayang> but my question is, if %gs is 0, does the cpu have anywhere to “swap” to when doing swapgs?
09:10:29 <geist> hence why the swapgs instruction exists
09:10:44 <geist> again. the value of fs and gs is completely irrelevant
09:10:55 <geist> swapgs literally does a swap of two MSRs
09:11:01 <geist> GS_BASE and GS_BASE_KERNEL
09:11:04 <zhiayang> oh wait what
09:11:10 <zhiayang> oh dude
09:11:16 <geist> yes. it's not what you think. it's subtle and you have to really grok it
09:11:27 <zhiayang> i was misreading gs_base as the value in the gdt entry
09:11:41 <zhiayang> not a separate msr
09:11:43 <zhiayang> i see now
09:11:43 <geist> yah. it *is* that gs_base. in a roundabout way
09:11:46 <nyc`> User processes in Linux get some LDT access with certain system calls and can use segment selectors set up in their LDT's that way. I found out about that when Oracle forkbombed in some benchmark run and LDT's were a significant part of its per-process overhead.
09:12:19 <geist> here's how it works: in 32bit x86 processors since day one, when you load a GDT entry into a segment register the cpu reads a bunch of fields out of the GDT/LDT and loads a bunch of hidden registers
09:12:32 <geist> in 64bit mode most of the hidden registers dont matter, but they still exist
09:12:39 <geist> the MSRs are direct windows into the hidden registers
09:12:50 <geist> it's like the GDT is the page table, and the MSR is pointing at the TLB
09:13:12 <zhiayang> ah, right
09:13:43 <zhiayang> thanks for clearing that up
09:13:55 <zhiayang> both the amd and intel manuals weren’t very clear on that
09:14:07 <geist> yah and there's some subtle detail that i'm actually going to have to go re-read
09:14:23 <zhiayang> they both had two paragraphs justifying the design of syscall/sysret and how the swapgs instruction was specifically made for that
09:14:26 <geist> in one of intel or AMD there's an implicit clearing of the fs.base or gs.base register when you load a 0 into fs or gs
09:14:36 <zhiayang> hm
09:14:39 <geist> yah exactly. swapgs is a hack
09:14:52 <zhiayang> the amd one says that the hidden registers are not cleared when null is loaded to fs or gs
09:15:09 <geist> it's used precisely because you typically use it as the first or second instruction on exception/interrupt entry so you can recover the state of the cpu
09:15:17 <geist> which gs: nominally points to when in the kernel
09:16:17 <geist> what's crappy about swapgs is since it's just a toggle, if you ever screw up and get an unbalanced number of calls to it you have a great sploit
09:16:27 <geist> (with the absence of SMAP which helps a lot)
09:16:35 <geist> you can arrange for the kernel to write to user space, etc
09:16:41 <olsner> I guess swapgs is the minimum necessary, but it would be very nice if you could explicitly "load kernel gs" or "load user gs"... my little kernel probably has 15 or more cases where it can end up with a user-controlled gs value
09:17:12 <geist> so for zircon we just explicitly zero out fs and gs on context switch
09:17:34 <geist> basically user space is free to dick with those regs, but it is not guaranteed to be saved across context switch
09:18:19 <geist> but yes, the swap is dangerous. what you ask for makes total sense
09:18:44 <geist> 'copy user MSR to active MSR' vs 'copy kernel MSR to active MSR' would have been a much nicer and less error prone instruction
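The toggle behaviour and why an unbalanced swapgs is dangerous can be shown with a toy model. This is only a sketch of the two-MSR exchange described above, not of real hardware; the class and the two addresses are hypothetical. The fields correspond to what geist calls GS_BASE and GS_BASE_KERNEL (IA32_GS_BASE and IA32_KERNEL_GS_BASE in Intel's naming):

```python
from dataclasses import dataclass

@dataclass
class Cpu:
    gs_base: int          # active base used for gs: accesses
    kernel_gs_base: int   # the "other" value that swapgs exchanges with

    def swapgs(self) -> None:
        """Exchange the two values; the gs selector itself is not involved."""
        self.gs_base, self.kernel_gs_base = self.kernel_gs_base, self.gs_base

USER_TLS = 0x00007F0000001000   # hypothetical user-chosen gs base
PERCPU = 0xFFFFFFFF80100000     # hypothetical kernel per-cpu structure

cpu = Cpu(gs_base=USER_TLS, kernel_gs_base=PERCPU)

cpu.swapgs()                    # kernel entry
print(cpu.gs_base == PERCPU)    # True
cpu.swapgs()                    # kernel exit
print(cpu.gs_base == USER_TLS)  # True

cpu.swapgs()                    # kernel entry...
cpu.swapgs()                    # ...and one swap too many while still in the kernel
print(cpu.gs_base == USER_TLS)  # True: kernel gs: now uses a user-controlled base
```

An explicit "load kernel gs" operation, as olsner and geist wish for, would be idempotent, so an extra call could not flip the state back the way a toggle does.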
09:20:14 <immibis> "not guaranteed to be saved across context switch" makes anything pretty useless for userspace
09:20:54 <geist> right. theres a reason you can't even guarantee it
09:21:14 <geist> in one case , i forget precisely, in 64bit the cpu will blat a 0 over one or more of the segment registers
09:21:27 <geist> ss i think on exception entry or exit, i forget
09:21:58 <nyc> That's stunningly train wreck -esque.
09:22:00 <geist> so even if you go out to try to let user space treat its segment registers as a 'free' 16 bit register, you can't completely guarantee safety
09:22:24 <geist> so the simpler answer is 'user space cannot rely on any segment register actually sticking'
09:23:27 <geist> they should have just turned it all off in CPL=3, made it so that you can't load or read a seg register
09:24:10 <geist> CPL=0 though it's still needed because you still need to be able to load seg registers and have the mechanism behind the scenes (accessing GDT, loading hidden registers) to work, prior to context switching into and out of a 32bit user space
09:24:17 <zhiayang> lmao
09:24:33 <zhiayang> yea iirc ss gets set to 0 when entering a higher-priv interrupt stack frame
09:24:37 <zhiayang> to "support nested interrupts"
09:24:42 <geist> so thats why they can't just completely turn it all off and declare it unused. the mechanism needs to exist, it just doesn't enforce most of it when in 64bit
09:26:40 <geist> this is where the armv8 4 privilege level model is so much cleaner. it logically has the same 4 rings but there's a pretty good symmetry between switching between EL (exception levels)
09:26:49 <geist> riscv has more or less the same model
09:27:31 <nyc> I'm foggy on why they need more than 2.
09:28:03 <geist> in ARM it's EL0 = user space, EL1 = supervisor (kernel), EL2 = hypervisor, EL3 = secure monitor
09:28:26 <geist> ARM specifies that EL2 and 3 are both optional, though almost all implementations implement all 4
09:29:05 <geist> EL2 has its own nested page table, so EL1 (an os instance like linux or windows) runs on top of an EL2 page table
09:29:12 <geist> EL3 always runs in pure physical addressing mode
09:29:32 <geist> so EL3 is kind of like SMM mode in x86. you put core security and rom like firmware in it
09:30:14 <geist> EL2 is where you'd run a type 1 hypervisor, or the little shim that something type 2 like KVM would install to context switch between guests
09:31:34 <nyc> My first thought is of what happens when they nest.
09:32:08 <immibis> i guess they still haven't figured out nested hypervisors then
09:32:22 <geist> yeah a good question there
09:32:38 <geist> you could of course run an EL2 under an EL2 and simply trap the instructions
09:33:11 <geist> one fun nice thing about trapping instructions in ARM is since it's RISC there are only a few types that can trap (memory accesses, control register accesses, a few misc other ones)
09:33:19 <geist> and those can be essentially fully decoded by hardware for you
09:33:40 <geist> so i think you simply trap a nested EL2 under EL2
09:34:34 <geist> so as an architecture it scores very high on the virtualizability index
09:34:42 <geist> or whatever that's called. POWER is well known for this too
09:35:28 <nyc> POWER is crazy about it.
09:35:35 <geist> yah
09:38:19 <nyc> I did some kind of hashed pagetable thing for some ppc64 hypervisor. I think it ended up being used with an emulator.
09:39:48 <nyc> They called it Hype of all things.
09:44:06 <immibis> did it go into futuristic hud glasses that cover your face? Visor hype!
09:46:41 <nyc> I mostly saw that it used large chunks of memory to map and allocate to guests.
09:47:48 <immibis> isn't that generically what hypervisors do?
09:47:54 <immibis> seems sensible enough
09:49:04 <nyc> No, it's just that the memory management it did used a very coarse granularity. 256MB? I don't remember. Nothing sounds right.
09:56:32 <nyc> My first thought is what happens when you have a fully populated physical address space and actually need to use it.
09:59:02 <nyc> I think 2**52 was IA64's design limit and I think it was on the high end.
09:59:56 <bcos_> 2**52 ought to be enough for anybody!
10:00:24 <nyc> You might want a few cores to speed up zeroing it all before userspace can be allowed to use it.
10:01:10 <bcos_> Some of that will be ROM and memory mapped devices though, so...
10:02:03 <nyc> Maybe 2**16 cores can take the edge off it all. Then it's 64GB/core.
10:03:17 <nyc> bcos_: Sure, no problem.
10:05:44 <nyc> 4PB probably won't work well with page-by-page iteration over 4KB pages.
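The figures in this exchange can be checked with a few lines of arithmetic. This is only a sketch: it takes the 2**52-byte limit and the 2**16 cores from the discussion above, and the constant names are made up for illustration.

```python
# Sanity check of the sizes quoted in the chat (binary units throughout).
PHYS_ADDR_BITS = 52      # design limit mentioned for the physical address space
CORE_COUNT_BITS = 16     # 2**16 cores, as suggested in the chat
PAGE_SIZE = 4096         # 4KB pages

GIB = 1 << 30
PIB = 1 << 50

total_bytes = 1 << PHYS_ADDR_BITS                  # whole physical address space
per_core_bytes = total_bytes >> CORE_COUNT_BITS    # share each core would zero
page_count = total_bytes // PAGE_SIZE              # 4KB pages to iterate over
pages_per_core = per_core_bytes // PAGE_SIZE

print(total_bytes // PIB, "PiB total")             # 4 PiB total
print(per_core_bytes // GIB, "GiB per core")       # 64 GiB per core
print(page_count == 1 << 40)                       # True: 2**40 pages
print(pages_per_core)                              # 16777216 pages per core
```

So even split across 2**16 cores, a page-by-page pass still touches 2**24 pages per core, which is the motivation for working on contiguous ranges instead.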
10:10:46 <nyc> So my thought is to establish contiguity and exploit it to get O(fragments) or better, preferably O(lg(fragments)), and that bearing in mind parallelism for 2**16+ CPU's.
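One way to read the O(lg(fragments)) goal is a sorted list of contiguous extents searched by bisection. The sketch below is an illustration only, not anything nyc described: the `ExtentMap` class, its methods, and the (start_pfn, page_count) layout are all assumptions, and it ignores the parallelism question entirely.

```python
from bisect import bisect_right


class ExtentMap:
    """Hypothetical map of non-overlapping runs of page frames."""

    def __init__(self, extents):
        # extents: iterable of (start_pfn, page_count), assumed non-overlapping
        self._extents = sorted(extents)
        self._starts = [start for start, _ in self._extents]

    def find(self, pfn):
        """Return the (start_pfn, page_count) extent holding pfn, or None.

        Cost is O(lg(fragments)) no matter how many pages each extent covers.
        """
        i = bisect_right(self._starts, pfn) - 1
        if i >= 0:
            start, count = self._extents[i]
            if pfn < start + count:
                return (start, count)
        return None


# Two fragments: 256 pages at pfn 0x100, 16384 pages at pfn 0x1000.
extent_map = ExtentMap([(0x1000, 0x4000), (0x100, 0x100)])
print(extent_map.find(0x1800))   # (4096, 16384)
print(extent_map.find(0x200))    # None: falls in the gap between fragments
```

The lookup cost depends on the number of fragments rather than the number of pages, which is the property the 4KB page-by-page walk lacks.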
10:19:23 <nyc> There'll probably be cache hierarchy for threads of a core, cores of a socket, sockets of a board, boards of a case, cases of a rack, racks on a floor, etc.
10:23:02 <immibis> are there any kernels that treat inter-CPU communication the same as over a network? not relying on much (if any) shared memory, and mostly pinning tasks to CPUs
10:24:06 <nyc> Multikernels.
10:25:08 <nyc> Amoeba got nigh network transparency that way and so clustered very well.
10:28:05 <nyc> I'm not sure that's going to help with large-scale SMP.
10:30:59 <nyc> I think everything obvious has been tried, and there isn't an obvious way to summarize the issues.
10:32:40 <nyc> I think people were trying to deal with things like contention on per-directory and per-address-space locks 10 years ago.
10:33:50 <nyc> There are still huge issues with kernel data structures proliferating.
10:34:06 <nyc> (This is in Linux.)
10:37:54 <nyc> I don't know how much or what's gone on for the past 10 years.
10:48:30 <nyc> I really hope I do better with SPARC.
11:00:02 <nyc> I can't believe I didn't RTFM for qemu better. Or that I didn't remember MIPS better.
11:22:05 <nyc> It's been a few years since I've done any kernel hacking. I hope it's just being a little rusty.
01:11:05 <nyc> When I get back from my appointment: 1. Set up $gp or load immediate. 2. Get serial IO going. 3. Set up C stack etc. 4. Move on to SPARC.
02:03:23 <program> hello
02:03:50 <program> is there anywhere a GDB protocol description?
02:05:27 <nyc> There is, but I am on a very limited cellphone at the moment.
02:05:58 <program> i can't find the specification anywhere
02:06:29 <program> the client specification
02:08:24 <Mutabah> The serial/tcp protocol used for remote debugging?
02:08:30 <program> yeah
02:08:36 <Mutabah> I'd expect that to be documented in the gdb documentation
02:08:55 <Mutabah> an implementation of that is usually referred to as a "gdb stub"
02:09:01 <program> like when you are running a gdb server inside the qemu and you use the gdb client to connect to it ; then where is the specification for the client side?
02:09:24 <program> i thought that gdb stub is what i would need to have inside the kernel that is being debugged
02:09:42 <program> but what about the client side ; the application that connects to the gdb server
02:10:41 <Mutabah> they're talking the same protocol, documentation for one covers the other
02:11:20 <program> oh
02:13:31 <program> https://cdn.kernel.org/pub/linux/kernel/people/marcelo/linux-2.4/arch/mips/kernel/gdb-stub.c
02:13:34 <program> this would work?
02:20:10 <nyc> To the extent you can port that to your kernel, sure.
02:21:40 <Mutabah> program: https://sourceware.org/gdb/onlinedocs/gdb/Remote-Protocol.html
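Since both ends speak the same protocol, the packet framing is a quick thing to try out. In the GDB remote serial protocol a packet is `$`, the payload, `#`, then a two-digit hex checksum that is the sum of the payload bytes modulo 256. The sketch below shows only that framing; the function names are made up, and it leaves out the escaping of special bytes, run-length encoding, and the `+`/`-` acknowledgements that a real stub or client needs.

```python
def checksum(payload: bytes) -> int:
    """Sum of the payload bytes, modulo 256."""
    return sum(payload) % 256


def frame(payload: bytes) -> bytes:
    """Wrap a payload as $<payload>#<two hex digits>."""
    return b"$" + payload + b"#" + b"%02x" % checksum(payload)


def unframe(packet: bytes) -> bytes:
    """Check the framing and checksum, then return the bare payload."""
    if len(packet) < 4 or packet[:1] != b"$" or packet[-3:-2] != b"#":
        raise ValueError("malformed packet")
    payload = packet[1:-3]
    if int(packet[-2:], 16) != checksum(payload):
        raise ValueError("checksum mismatch")
    return payload


print(frame(b"g"))           # b'$g#67'  (read registers request)
print(frame(b"OK"))          # b'$OK#9a' (common success reply)
print(unframe(b"$OK#9a"))    # b'OK'
```

The same two functions serve a stub inside the kernel and a client talking to qemu's gdb server, which is why one document covers both sides.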
02:22:49 <nyc> Works great over serial or network.
02:24:58 <nyc> gdb stubs are probably worth doing first.
02:27:27 <nyc> I think the first thing I'll do after hello world is a gdb stub.
02:29:09 <klys> perhaps there is a frontend to gdb, as it seems the gdb prompt requires a lot of typing.
02:29:52 <nyc> They probably exist, but I know very little about them.
02:30:20 <program> it would be nice to have some gdb front end
02:30:23 <program> for windows
02:30:24 <nyc> I think there is an xgdb.
02:31:31 <klys> would you have to write a frontend into your interface if you were using the gdb stub?
02:32:02 <nyc> No, of course not.
02:32:34 <klys> also I've heard there is a gdb problem when transitioning between register widths (cpu modes).
02:32:55 <nyc> I can't imagine what a gdb GUI would do, but I'm sure I've seen one.
02:33:22 <nyc> x86-32/64?
02:33:29 <klys> yes
02:34:28 <klys> nyc, perhaps you saw an IDE debugging a program, such as kdevelop
02:34:36 <program> yeah
02:34:43 <program> but i'm writing in assembly without any IDE
02:34:44 <nyc> I'm unclear how MIPS or SPARC address it
02:35:06 <program> i would like something like gdb -tui
02:35:15 <program> too bad i don't have this on windows
02:35:22 <nyc> klys: I think it was in the 90's.
02:36:17 <klys> nyc, in the 90s there were some fun debuggers, turbo debugger stands out; was part of turbo assembler, probably available at winworldpc.com
02:37:44 <nyc> klys: I have no idea what you're on about. I saw xgdb in the 90's.
02:37:52 <klys> oh all right!
02:38:58 <klys> and it seems ddd is also a thing
02:42:47 <nyc> I don't know anything about the Windows ecosystem. I'm sure something exists, but I don't know anything about it.
02:44:19 <nyc> I'll get disconnected when my train leaves the station, but...
02:52:55 <nyc`> I sort of want to drop a relatively well-developed product on the world in a way. I'll probably hold back from GitHub and public releases for a good long while. Maybe I'll change my mind about what my goals are again.
03:02:57 <nyc`> I figure some sort of shim for POSIX and I have no idea about GUI stuff might be worthwhile. Running some database benchmarks and otherwise a big set of benchmarks would be a good goal.
03:06:41 <nyc`> Patches for whatever AIO API I cook up etc. to get used by some open source da
03:07:24 <nyc`> database would be good selling points.
03:11:07 <nyc`> Zerocopy IO is scary given all the TLB issues out there.
03:11:30 <nyc`> Zerocopy RX esp.
04:25:50 <mrvn> nyc`: look at ddd (dddd?)
04:28:14 <mrvn> Apropos zerocopy. Has anyone implemented lazy IO so that when the user reads a block and then writes it to a socket then the SATA driver would send that data directly to the NIC using DMA without touching main memory at all?
04:29:51 <bcos_> I'd expect unsolvable flow control issues with that..
04:33:51 <nyc`> I have no idea apart from it being trendy a while back. I'm network clueless as of yet.
04:40:02 <nyc`> Whole Foods is K: lined.
04:42:00 <mrvn> bcos_: how so? IO would be a (*buzz word alert*) promise and whatever needs it at the end starts the IO. An access to the memory faults and pages the promise in for users, and drivers can do more clever things.
05:14:13 <nyc> It doesn't seem any harder than any other direct device access of userspace to me.
05:14:50 <nyc> There's probably something I don't know about networking, though.
05:20:11 <nyc> I don't know how much packet unraveling NIC's can realistically do on their own as a prerequisite for avoiding outer protocol wrappings from lower network layers polluting things.
05:23:05 <nyc`> I think the NIC's would have to do a lot of packet assembly and disassembly on their own to do zerocopy RX.
05:24:50 <nyc``> SCTP wouldn't be able to do it with reliable out of order delivery.
05:26:29 <mrvn> I'm not sure what you mean. The driver would set up the frames with IP header and all and then pass an iovec array to the SATA driver with the data parts of each frame and it fills it using DMA.
05:26:46 <mrvn> And then the NIC driver tells the NIC to send.
05:27:18 <nyc``> It sort of has to be TCP or something like it in order to buffer enough to dodge long and short reads.
05:28:26 <mrvn> nyc``: use case would be iSCSI or AOE where you have nice block aligned requests.
05:30:22 <nyc``> mrvn: RX not TX.
05:31:40 <nyc``> mrvn: TX is sort of obvious even to someone as ignorant of networking as me.
05:31:53 <mrvn> reverse on RX. The NIC tells you when frames are there. You parse the header and pass around iovecs to the data parts.
05:32:41 <mrvn> Not sure I would want the same there though. The memory on the NIC is usually limited and you want to clear it fast so it can receive more frames.
05:32:50 <nyc``> mrvn: For zerocopy RX I think the NIC has to do a lot more.
05:33:46 <nyc``> Otherwise you're copying just to assemble the byte stream out of the packets.
05:34:30 <mrvn> nyc``: Why? each frame ends up in one buffer and you simply pass around the buffers.
05:34:42 <nyc``> And to get the outer layers / lower protocols out of the way.
05:35:34 <mrvn> The outer layers are stripped by returning a pointer+length to inside the frame.
05:35:44 <nyc``> mrvn: IIRC RX usually keeps contiguity of headers etc. for protocols lower than TCP/IP.
05:36:56 <nyc``> mrvn: Well, yes, it would be nice if NIC's did that or the drivers could do that without copying.
05:37:23 <mrvn> nyc``: the TCP/IP stack usually already does that. In linux it passes around sk_buff structures.
05:38:54 <nyc``> mrvn: Sure, but is it done by copying or by DMA'ing the headers in a different place from the payload?
05:41:08 <nyc> I think there was a guy at Oracle obsessed with zerocopy RX.
05:41:32 <nyc> I don't know anything about networking.
05:42:00 <nyc> I think it might've been the Lustre guy.
05:42:24 <mrvn> nyc: The NIC has memory mapped in the PCI address space. I would read the header right there
05:43:27 <mrvn> otherwise it would be hard. Probably better to copy the whole frame to memory otherwise.
05:45:20 <nyc> mrvn: That sounds like I'm not understanding how NIC's do RX. I thought they had rings of max packet size chunks of memory they drew upon to do RX.
05:46:00 <mrvn> nyc: That's my understanding. And that memory should be directly accessible to the CPU
05:46:06 <nyc> In current practice that is.
05:46:35 <mrvn> so you parse the header and pass around pointers to inside that chunk of memory
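The "pointer+length to inside the frame" idea can be tried out in user space. The sketch below is an illustration under assumptions, not how any particular driver or stack does it: it handles only untagged Ethernet carrying IPv4 and TCP, the helper names are made up, and a Python `memoryview` stands in for the pointer+length pair. It does not address nyc's point about reassembling a byte stream from several frames.

```python
import struct

ETH_HEADER_LEN = 14        # dst MAC (6) + src MAC (6) + ethertype (2)
ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6


def tcp_payload_view(frame):
    """Return a view of the TCP payload inside frame without copying it."""
    view = memoryview(frame)
    (ethertype,) = struct.unpack_from("!H", view, 12)
    if ethertype != ETHERTYPE_IPV4:
        raise ValueError("not IPv4")
    ip = view[ETH_HEADER_LEN:]
    ip_header_len = (ip[0] & 0x0F) * 4           # IHL counts 32-bit words
    (total_len,) = struct.unpack_from("!H", ip, 2)
    if ip[9] != IPPROTO_TCP:
        raise ValueError("not TCP")
    tcp = ip[ip_header_len:total_len]
    tcp_header_len = (tcp[12] >> 4) * 4          # data offset, 32-bit words
    return tcp[tcp_header_len:]


def build_demo_frame(payload):
    """Build a minimal frame for the demo (addresses and checksums left zero)."""
    eth = bytes(12) + struct.pack("!H", ETHERTYPE_IPV4)
    ip = bytearray(20)
    ip[0] = 0x45                                 # version 4, IHL 5
    struct.pack_into("!H", ip, 2, 20 + 20 + len(payload))
    ip[9] = IPPROTO_TCP
    tcp = bytearray(20)
    tcp[12] = 5 << 4                             # 20-byte TCP header
    return bytearray(eth + ip + tcp + payload)


rx_buffer = build_demo_frame(b"hello")
payload = tcp_payload_view(rx_buffer)
print(bytes(payload))        # b'hello'
rx_buffer[-1] = ord("!")     # change the underlying buffer...
print(bytes(payload))        # b'hell!' ...and the view sees it: no copy was made
```

The view also keeps the receive buffer pinned while it is alive, which is the catch mrvn raises about NIC memory being scarce.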
05:46:55 <nyc> mrvn: Well, they're a sort of pool of preset DMA destinations for RX of unprocessed packets AIUI.
05:47:55 <mrvn> nyc: yeah, some cards you can only tell where to DMA data to. So on RX you would have to already know to send it to the SATA controller.
05:49:02 <nyc> I thought the memory was mostly recycled back into the NIC's ring of buffers right after packet disassembly copied headers and payload to separate locations and translated structures etc.
05:49:49 <mrvn> obviously, if you copied them then you reuse the buffers
05:50:30 <nyc> Zerocopy RX would need the DMA to pre-disassemble it all.
05:51:43 <nyc> And for buffering on the NIC to DMA the expected amount of payload data.
05:53:53 <nyc> The NIC is an embedded *nix anyway, so most of what it needs is enough onboard RAM and a fast enough embedded system.
05:57:27 <nyc> It's probably worse than that because of things like IP forwarding.
06:00:27 <nyc> It may be better to focus on things that can work on existing hardware like anything anyone who actually knows about networking cares about.
06:14:44 <nyc> The real issues are probably surprise lock contention around broadly shared per-something data structures and standing on one's head to reduce context switches and such.
06:20:01 <nyc> I'll bet port number allocation trips over lock contention or something.
06:24:36 <nyc> I don't see meaningful lockless approaches with such a small port IO space and IIRC the required cycling semantics.
06:24:57 <hxclurk> Do you guys like SICP?
06:25:32 <nyc> hxclurk: Yes, though it's focused on only userspace.
06:26:05 <hxclurk> I thought the ideas it teaches applies equally well to the kernel space?
06:26:08 <mobile_c> how do i correctly initialize a library
06:26:21 <nyc> hxclurk: TAOCP is nice because it breaks all the algorithms down to asm even if for an imaginary arch.
06:26:29 <hxclurk> in winblows you call dllmain
06:26:35 <mobile_c> during the relocation and linking process
06:26:55 <hxclurk> taocp is too hard to read
06:27:22 <hxclurk> nyc: have you read TAOCP?
06:27:30 <hxclurk> which parts?
06:29:11 <hxclurk> i'd have to quit my job if i wanted to actually read the taocp
06:29:19 <mobile_c> 0.0
06:29:52 <hxclurk> im pretty behind on mathematics
06:30:04 <hxclurk> do you guys know good amount of mathematics?
06:30:16 <nyc> hxclurk: Scheme is a safe language with automatic memory management. So it leaves out a lot of what's going on in the kernel.
06:30:40 <nyc> hxclurk: I was a math major.
06:30:41 <hxclurk> my high school conditioned me to feel physical pain when seeing mathematical notation
06:30:55 <hxclurk> im trying to break that
06:31:02 <hxclurk> more like planning to
06:31:19 <nyc> hxclurk: You'll need math.
06:31:31 <hxclurk> i know
06:32:14 <nyc> Real math is more like philosophy.
06:32:59 <hxclurk> yeah but it leaves me twitchy when i come into contact with it
06:33:23 <hxclurk> uneasy feeling when into philosophical math land and ego rearing in its ugly head as well
06:33:28 <hxclurk> feeling too smart about oneself
06:33:30 <hxclurk> etc
06:33:54 <hxclurk> happens to you too?
06:34:21 <nyc> Ego isn't compatible with math.
06:35:25 <nyc> There are unsolved mysteries everywhere defeating the best.
06:38:50 <nyc`> It's a veritable danse macabre.
06:39:54 <nyc`> The idea there (the dance of the Maccabees) was that death was the great equalizer of humanity.
06:40:21 <nyc`> Math has ages-old unanswered questions to do that.
06:42:01 <hxclurk> wew
06:42:26 <nyc`> Fibonacci's last theorem may have been taken on, but many more such things remain unanswered.
06:42:49 <nyc`> And in reality are likely never to be answered.
06:43:46 <hxclurk> fermat's last theorem?
06:44:57 <hxclurk> have you read GED?
06:45:44 <nyc`> Sure, I might do integrals faster than you, but none of us can even approach Goldbach's conjecture. So we're all laid flat out by it and ultimately in the same position.
06:46:43 <hxclurk> i want to integrate maths into my computing
06:46:55 <hxclurk> to be able to reason about some problems more mathematically
06:47:20 <hxclurk> i dont really give a fuck mysteries hidden within the properties of numbers shit
06:48:18 <hxclurk> universe is gay anyway
06:48:54 <nyc`> I have read TAOCP. I recommend Concrete Mathematics and Generatingfunctionology to learn the math involved with algorithmic analysis. There might be some other good sources once I get to where I can google.
06:49:37 <nyc`> Kleinrock is good for queueing theory.
06:50:17 <hxclurk> sometimes i think about as to how and whether some knowledge about computing like even organizational things can be formulated mathematically
06:50:59 <hxclurk> is there like mathematics for formulating knowledge and organization and such
06:52:13 <hxclurk> i dont care much about algorithmic analysis and other such cs maths
06:52:26 <hxclurk> i dont know if im expressing myself right but
06:52:49 <hxclurk> i was wondering about maths that is more about say software organization
06:55:42 <nyc```> Not all algorithms are to be measured in the same terms. Others may be concerned about disk blocks, cachelines, or TLB pages touched and/or dirtied. Others might be concerned about arrival rates vs. hold times of particular locks. Cache and icache footprints, space footprints, and more also come up.
06:56:09 <hxclurk> right
06:56:20 <hxclurk> let me try to explain myself better
06:57:11 <hxclurk> i had this idea in mind
06:57:26 <hxclurk> i thought that there might exist a maths field for it
06:57:38 <nyc```> hxclurk: For a new OS?
06:57:53 <hxclurk> it was about my own mental processes, how human mind reasons in hierarchies and abstractions and such
06:58:04 <nyc```> What's the math idea?
06:58:36 <hxclurk> its not maths idea, i just thought that there might already exist a math subfield dealing with such objects that come close to reflecting human thought processes
06:58:43 <nyc```> Networking loves queuing theory BTW.
07:00:37 <nyc```> I don't think that's really math, but it might apply math. ISTR Patricia Churchland advocating linear algebra models of neurons or some such.
07:00:59 <hxclurk> Not really physical mental processes
07:01:19 <hxclurk> More about hierarchies of reasoning about things
07:02:39 <nyc```> That might get into philosophy more than psychology. AI has some ontologies it's been working on for a while.
07:03:06 <hxclurk> Do you need a good background in conventional maths to be able to swin in the AI field?
07:03:14 <hxclurk> s/swin/swim
07:05:22 <nyc`> Yes.
07:05:40 <hxclurk> well fuck
07:06:10 <hxclurk> How much time do you think it takes a determined person to reach that level of maths?
07:06:16 <hxclurk> self-study
07:07:34 <nyc`> It depends a lot.
07:07:54 <hxclurk> is a year realistic target?
07:07:56 <rain1> you basically just need to learn calculus and linear algebra right? you could probably do that in 2 years
07:12:20 <nyc`> I would say that one needs some analysis, some abstract algebra, some number theory, some asymptotics, some diffeq's, some vector calc, some numerical analysis, some prob/stat, and probably miscellaneous other things.
07:13:29 <hxclurk> when i learn that stuff it all feels mechanical and there are rarely anything unifying and i leave feeling like i just learned some tricks and nothing else
07:13:47 <nyc`> Complex analysis would be very helpful for the residue calculus and affairs like elliptic functions and preliminaries to asymptotics.
07:14:20 <nyc`> hxclurk: It's different when it's proofs instead of calculations.
07:14:23 <geist> hxclurk: that's not unique
07:14:54 <hxclurk> proofs feel even worse
07:15:29 <nyc`> hxclurk: Math majors do all proofs from the get-go. They're like philosophy classes.
07:18:46 <nyc`> hxclurk: Insight replaces nose-to-the-grindstone brute force calculation.
07:19:16 <nyc`> hxclurk: Gauss was asked to add up all the numbers from 1 to 100.
07:20:42 <hxclurk> whatd he do
07:21:05 <nyc`> hxclurk: He noticed that if you take all the numbers twice, reverse one of them, and then add the pairs, every pair adds up to 101, and there are 100 pairs.
07:22:20 <hxclurk> what is reverse of number
07:22:21 <nyc`> hxclurk: Since that's two copies added together, take half, and that's 50*101.
07:24:02 <hxclurk> i dont understand what you wrote
07:24:31 <nyc`> Two copies of the list of numbers from 1 to 100.
07:24:56 <nyc`> Reverse the second copy and pair the elements of the lists.
07:25:19 <hxclurk> oh
07:25:59 <hxclurk> makes sense
07:26:59 <nyc`> That made the answer a lot faster and easier than carrying out addition by hand on all the numbers from 1 to 100.
07:27:20 <nyc`> Insight spared effort.
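The pairing argument can be spelled out directly: write the list forwards and backwards, add the pairs, and halve the total. A small sketch (the function name is made up):

```python
def gauss_sum(n):
    """Sum 1..n using the pairing trick described above."""
    forward = list(range(1, n + 1))        # 1, 2, ..., n
    backward = forward[::-1]               # n, ..., 2, 1
    pair_sums = [a + b for a, b in zip(forward, backward)]
    # Every pair adds up to n + 1, and there are n pairs.
    assert all(s == n + 1 for s in pair_sums)
    # The pairs count each number twice, so take half.
    return sum(pair_sums) // 2


print(gauss_sum(100))            # 5050
print(100 * 101 // 2)            # 5050, the closed form n*(n+1)/2
print(sum(range(1, 101)))        # 5050, brute force for comparison
```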
07:27:22 <hxclurk> right
07:27:30 <hxclurk> how did these olden time mathematicians sustain themselves
07:28:38 <nyc`> Computer only came to mean a machine after Turing etc.
07:28:41 <jmp9> Hello guys. I have another stupid question for you because I didn't read intel docs :-). Here the question
07:28:56 <jmp9> Which address i should pass to LGDT,LIDT,etc
07:28:57 <nyc`> Before that, computer was a job title.
07:29:11 <jmp9> physical address or virtual (after i set paging enabled)?
07:29:40 <nyc`> Virtual.
07:29:44 <jmp9> Rly?
07:29:50 <jmp9> huh. thanks
07:29:57 <jmp9> it's much easier
07:30:27 <jmp9> so after I set paging enabled, every address that I should pass to CPU is virtual?
07:30:56 <nyc`> I'm not sure the Intel docs are easy to interpret regarding that issue.
07:32:00 <jmp9> The only issue is that there are 5 thousand pages of intel docs
07:32:17 <nyc`> jmp9: No, not every address.
07:32:33 <jmp9> oh
07:32:45 <mahackamo> gotta get the 4 separate volumes
07:32:48 <jmp9> for paging I should put physical address in cr3
07:32:56 <mahackamo> so it's actually searchable in a timely manner
07:34:09 <nyc`> jmp9: cr3 is the usual counterexample. I'm foggy on other cases where physical addresses are used.
07:40:27 <nyc`> The Whole Foods coffee was not enough to keep me awake. My $gp and serial IO woes will have to wait until I've rested a bit.
07:40:55 <hxclurk> get IV caffeine like real men
07:41:41 <nyc`> I probably no longer qualify, if I ever did.
07:42:03 <hxclurk> dont tell me you sucked a dick
07:42:20 <hxclurk> without saying no homo afterwards
07:42:40 <nyc`> hxclurk: /whois and check the name.
07:43:02 <hxclurk> why whois returns your real name
07:43:10 <hxclurk> oh
07:43:27 <hxclurk> you are cool
07:44:10 <nyc`> That's my real name.
07:44:22 <geist> re: physical addresses. remember virtual -> physical is because there's a paging unit (the MMU) on the cpu
07:44:31 <geist> everything else sits on top of paging, pre-translation
07:45:04 <hxclurk> wait it actually returns the username
07:45:08 <geist> essentially the paging unit is almost a separate thing, and translates all the addressing coming and going from the rest of the cpu core
07:45:21 <hxclurk> didnt know that
07:45:21 <nyc`> It's messy, as segmentation is another form of address translation.
07:45:32 <geist> yes, but like i said, it sits on top of paging
07:45:43 <geist> as in you need to remember that paging is at the 'bottom' of all the translations
07:45:56 <geist> it's essentially implemented as an external chip on the bus between the cpu and everything else
07:46:05 <geist> in early 80s design it literally was
07:46:37 <geist> by implementing a paging unit in 386 intel essentially integrated an external chip into the cpu, though i dont think intel had an official external mmu chip, like motorola did for 68000 series
07:46:45 <geist> 68882 i think was their external mmu?
07:47:30 <nyc`> It could be messier e.g. IBM's segmentation scheme.
07:48:09 <geist> ah no 68882 is the fpu, 68851 is the external mmu. https://en.wikipedia.org/wiki/Motorola_68851
07:48:40 <nyc`> :o
07:49:23 <nyc`> The BBN Butterfly probably had those.
07:49:51 <geist> sure. and if you were some big machine the MMU would be some set of cards in a rack or something
07:50:06 <geist> logically MMUs are usually almost completely external to the core
11:41:12 <jmp9> Can I change page directory address in cr3 multiple times?
11:42:18 <mrvn> What do you think? You can only do it once per cold boot or what?
11:42:55 <jmp9> Maybe my laptop will explode if I do this xD
11:43:19 <mrvn> It's flash, you can only set it about 10000 times before the cpu burns out. :)
11:43:58 <mrvn> but seriously: you can set it any time you want
11:44:50 <mrvn> Usually you set it for every task switch.
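A toy model shows why reloading cr3 on a task switch is routine: cr3 holds the physical address of the page directory, and the same virtual address resolves to a different frame under each directory. This is a software sketch of classic 32-bit two-level paging only. It assumes no PAE and no 4MB pages, ignores permission bits and the TLB, and uses a dict as stand-in physical memory; all names and addresses are made up.

```python
PRESENT = 0x1
FRAME_MASK = 0xFFFFF000


def translate(phys_mem, cr3, vaddr):
    """Walk page directory -> page table -> frame, as the MMU would."""
    pd_index = (vaddr >> 22) & 0x3FF       # bits 31..22
    pt_index = (vaddr >> 12) & 0x3FF       # bits 21..12
    offset = vaddr & 0xFFF                 # bits 11..0
    pde = phys_mem.get((cr3 & FRAME_MASK) + 4 * pd_index, 0)
    if not pde & PRESENT:
        raise LookupError("page fault: directory entry not present")
    pte = phys_mem.get((pde & FRAME_MASK) + 4 * pt_index, 0)
    if not pte & PRESENT:
        raise LookupError("page fault: table entry not present")
    return (pte & FRAME_MASK) | offset


def map_page(phys_mem, cr3, page_table_phys, vaddr, frame_phys):
    """Install one 4KB mapping; every address stored here is physical."""
    pd_index = (vaddr >> 22) & 0x3FF
    pt_index = (vaddr >> 12) & 0x3FF
    phys_mem[(cr3 & FRAME_MASK) + 4 * pd_index] = page_table_phys | PRESENT
    phys_mem[page_table_phys + 4 * pt_index] = frame_phys | PRESENT


phys_mem = {}
TASK_A_CR3 = 0x1000      # physical address of task A's page directory
TASK_B_CR3 = 0x3000      # physical address of task B's page directory
map_page(phys_mem, TASK_A_CR3, 0x2000, 0x08048000, 0x00200000)
map_page(phys_mem, TASK_B_CR3, 0x4000, 0x08048000, 0x00300000)

# "Switching tasks" is just walking from a different cr3 value.
print(hex(translate(phys_mem, TASK_A_CR3, 0x08048123)))   # 0x200123
print(hex(translate(phys_mem, TASK_B_CR3, 0x08048123)))   # 0x300123
```

The directory and table entries hold physical addresses for the same reason cr3 does: the walk itself cannot depend on the translation it is performing.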
11:58:50 <ronsor> generally, if you aren't allowed to do something, the CPU will fault (and likely reset)