http://bespin.org/~qz/search/?view=1&c=osdev&y=19&m=2&d=19

Tuesday, 19 February 2019

12:08:59 <jmp9> https://www.youtube.com/watch?v=Fj0KtKalbO8
12:09:01 <jmp9> good music for coding
12:19:49 <jmp9> finally
12:19:58 <jmp9> "info mem" shows correct mapping
12:32:18 <jmp9> okay
12:32:23 <jmp9> how to setup 64 bit code segment in gdt
12:32:34 <jmp9> i tried this one:
12:32:35 <jmp9> lmem->gdt_entry_code = (u64)((1<<21)|(1<<15))<<32;
12:39:46 <jmp9> okay i'm getting GPF
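
(A minimal sketch of what that descriptor is missing: the snippet above sets only L, bit 53, and P, bit 47, but a loadable 64-bit code descriptor also needs S, bit 44, and the executable type bit, 43. Constant names here are illustrative.)

```
#include <stdint.h>

#define GDT_PRESENT   (1ULL << 47)  /* P: segment present */
#define GDT_NONSYSTEM (1ULL << 44)  /* S: code/data segment, not system */
#define GDT_EXEC      (1ULL << 43)  /* type: executable */
#define GDT_READ      (1ULL << 41)  /* type: readable code */
#define GDT_LONG      (1ULL << 53)  /* L: 64-bit code segment */

/* the classic flat 64-bit code descriptor == 0x00209A0000000000 */
static const uint64_t gdt_code64 =
    GDT_PRESENT | GDT_NONSYSTEM | GDT_EXEC | GDT_READ | GDT_LONG;
```
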
12:42:26 <nyc> Interesting... http://linuxfinances.info/info/microkernel.html says ```If you want to make a good microkernel, choose a different syscall paradigm. Syscalls of a message based system must be asynchronous (e.g. asynchronous IO), and event-driven (you get events as answers to various questions, the events are in the order of completion, not in the order of requests). You can map Unix calls on top of this on the user side, but it won't necessarily perform well.``` attributing it to Bernd Paysan
12:46:58 <nyc> foobiebletch: East Harlem here.
12:52:42 <bcos_> nyc: I'd describe that as "right idea, wrong words" - it's not the micro-kernel's syscalls that need to be async, it's the "syscalls that aren't syscalls anymore for micro-kernel" that do
01:00:42 <foobiebletch> nyc: ?
01:05:10 <nyc`> foobiebletch: Your hostname suggested you might also be in the city
01:06:34 <nyc`> bcos_: I understood him, but not what your answer to him was.
01:08:59 <bcos_> nyc`: Imagine something like "sleep()" is implemented in the kernel, the same for both monolithic and micro-kernel - it's no different and doesn't need to be async. Next imagine something like "read()" is implemented as a syscall in a monolithic kernel but is not implemented as a syscall in a micro-kernel - it would want to be async for micro-kernel because it's not a syscall anymore
01:48:01 <nyc`> I'm out of steam for solving real problems for tonight, so time to bikeshed about microkernel design or whatever.
01:50:18 * geist yawns
01:55:26 <nyc`> At least I got some RISC-V asm written today.
01:58:12 <mischief> ddd
01:58:54 <mischief> *cough*
01:58:58 <uelen> okay, still running into issues. I can actually get into the userspace thread now via a context switch, but when I do a syscall it tries to jump to 0x0000 and triple-faults. The CS and SS segment selectors get set correctly but I never actually make it into the syscall handler. Interestingly I can do a syscall if I'm in ring 0, but as soon as I make the new thread ring 3 it breaks
02:01:09 <uelen> funnily enough my double-fault handler and page-fault handler never even get called either
02:02:50 <nyc`> Is it running in a simulator?
02:03:54 <uelen> in qemu
02:04:20 <uelen> I've got gdb set up as well, I can see it hit the syscall instruction and then just jump to null
02:05:29 <yrp> you didn't mention, but i presume you have lstar and cstar set up properly?
02:06:04 <uelen> I believe so, syscalls work fine when I'm in ring 0
02:06:37 <uelen> so I know LSTAR is definitely working right
02:07:56 <uelen> this is how I'm setting up STAR:
02:07:57 <uelen> ` let sysret_cs = 0x18;
02:07:57 <uelen> let syscall_cs = 0x08;
02:07:57 <uelen> let star: u64 = sysret_cs << 48 | syscall_cs << 32;`
02:14:32 <doug16k> uelen, you enabled syscall instruction?
02:14:46 <doug16k> EFER.SCE
02:15:06 <uelen> yeah, syscalls are definitely working when CS and SS are 0x8 and 0x10
02:15:22 <doug16k> I couldn't care less what syscall does in kernel mode. that is nonsense usage
02:15:38 <uelen> but when they're 0x1B and 0x23 it ends up breaking
02:16:19 <uelen> they're definitely enabled
02:17:03 <uelen> I just have to change what segments my new thread uses and it ends up breaking system calls
02:17:51 <doug16k> LSTAR must be 0 if syscall loads rip with 0
02:18:18 <doug16k> LSTAR is 0xC0000082U if you want to double check your constant
02:20:26 <doug16k> you are saying it is fine with one set of selectors to load into cs and ss (STAR), but it somehow puts 0 in rip with other selector values?
02:20:34 <doug16k> hey wait, why do you have RPL != 0 in those?
02:20:58 <doug16k> I think you want 0x18 and 0x20
02:21:03 <uelen> is there anything that can cause the LSTAR to get reset? I'm definitely setting it and I can syscall to it if I'm in ring 0
02:21:03 <uelen> relevant code: `msr::wrmsr(msr::IA32_LSTAR,syscall_entry as u64 );`
02:21:39 <doug16k> hang on, what order is your GDT?
02:21:41 <uelen> those are my selectors for userspace code + data, I thought I had to add 3 to them
02:22:18 <uelen> 0x0: Null, 0x8: Kernel CS, 0x10: Kernel Data, 0x18: User CS, 0x20: User Data, 0x28: TSS
02:23:22 <doug16k> hey, is it a sysretq (q at the end?)
02:23:44 <doug16k> wait that is too late for it to matter
02:25:10 <doug16k> I have these asserts to verify that the GDT is in the expected order: https://github.com/doug65536/dgos/blob/master/kernel/arch/x86_64/cpu/cpu.cc#L187
02:27:05 <doug16k> you fail this assertion: GDT_SEL_USER_CODE64 == GDT_SEL_USER_DATA + 8
02:28:07 <doug16k> let me find SDM section
02:34:00 <doug16k> but ya for syscall to work, the STAR[47:32] must be the kernel code selector and STAR[47:32]+8 must be kernel data selector. the rest of the restrictions apply to sysretq
02:34:08 <doug16k> intel says they mask off the RPLs in there
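
(Putting doug16k's constraints together — a hedged C sketch, not uelen's actual code; selector values assume the reordered GDT suggested above, with user data at 0x18 and user code64 at 0x20:)

```
#include <stdint.h>

#define IA32_STAR  0xC0000081
#define IA32_LSTAR 0xC0000082

#define KERNEL_CS   0x08        /* kernel data must then sit at 0x10 */
#define SYSRET_BASE (0x10 | 3)  /* sysretq: SS = base+8 = 0x1B, CS = base+16 = 0x23 */

static inline void wrmsr(uint32_t msr, uint64_t v) {
    asm volatile("wrmsr" :: "c"(msr), "a"((uint32_t)v), "d"((uint32_t)(v >> 32)));
}

extern void syscall_entry(void);  /* the handler LSTAR points at */

void star_init(void) {
    /* syscall: CS = STAR[47:32], SS = STAR[47:32] + 8 (RPLs masked off) */
    wrmsr(IA32_STAR, ((uint64_t)SYSRET_BASE << 48) | ((uint64_t)KERNEL_CS << 32));
    wrmsr(IA32_LSTAR, (uint64_t)syscall_entry);
}
```
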
02:39:28 <geist> oh cripes. the timer rate on this sifive board is 32k. yay crappy precision
02:47:30 <uelen> The segment switch part is definitely working, I can see the selectors getting set to the right values from qemu's crash log. There's definitely something super funky going on (especially because none of my interrupts work)
02:47:37 <uelen> thanks for trying to debug my super weird errors :P
03:36:57 <geist> ooooh i see what the problem is
03:37:34 <geist> when riscv takes an exception it saves a few bits (notably the interrupt disable bit) not into a separate register like a lot of systems do, but into another field in the mstatus register (which is the main control thing, like cpsr or eflags)
03:38:07 <geist> so. if you're not dumping that on the stack or saving it in context switch, you will trash the saved irq-restore-on-eret behavior
03:38:36 <geist> in practice this means things run for a while and then once you get an unbalanced number of irq entry/exits it wedges up because you end up with irqs disabled
03:39:30 <doug16k> interrupt handlers run with interrupts enabled? or am I missing something?
03:39:51 <doug16k> how can it be unbalanced?
03:40:28 <geist> in this case it's because thread A enters with irqs enabled, it gets disabled, then preempted and it switches to another thread that was not previously preempted
03:41:01 <geist> i haven't been saving mstatus on the stack, so you end up now in another thread that's holding the 'saved irq state at last exception entry' bits
03:41:13 <doug16k> and never loaded the mstatus context in that cswitch
03:41:17 <doug16k> ya ok that makes sense
03:41:22 <geist> so then when it switches back to thread A and then it goes to exit the irq, it doesn't restore the old irq state
03:41:42 <doug16k> how much do you have to save ? roughly
03:41:49 <geist> right, it's because mstatus holds both the current IRQ disable state *and* the saved irq state from the last exception
03:42:02 <geist> oh probably just blat down the entire mstatus register as part of the iframe
03:42:06 <geist> no sweat
03:42:40 <geist> arm64 is very similar to this except it simply copies the relevant bits of the cpsr to a second copy, which you then blat on the stack (if you want)
03:43:00 <geist> and eret there simply atomically copies saved cpsr and pc back to the main ones
03:43:35 <geist> in the case of riscv it simply moves the saved irq bits in the mstatus register to the live ones and restores from the epc register, which had previously saved the interrupted/faulting address
03:44:00 <geist> it's one of these cases where the two arches are almost identical in behavior except for one little detail, which actually is harder to find sometimes because you gloss over it
03:44:35 <doug16k> hard to resist feeling like you already know and can assume eh?
03:44:54 <doug16k> ya I can see that being distracting
03:45:34 <geist> yah
03:45:46 <doug16k> yeah I guess once you know a certain amount you have to fight, more and more, the assumption that you already know the details
03:45:50 <geist> i was seeing this because after a little bit LK just sort of locks up
03:46:00 <geist> turns out it's sitting in the idle loop with interrupts disabled, no timers or irqs firing
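
(A C sketch of the fix geist describes, with hypothetical iframe/trap-stub names: the point is that mstatus — which holds both the live interrupt-enable bit MIE and the saved-at-trap copy MPIE — gets saved and restored with the interrupted thread's frame, so a context switch inside the handler can't leak one thread's saved irq state into another's mret.)

```
#include <stdint.h>

struct iframe {
    uint64_t regs[31];  /* x1..x31 */
    uint64_t epc;       /* mepc: interrupted pc */
    uint64_t status;    /* mstatus, including MIE/MPIE */
};

void trap_save(struct iframe *f) {
    asm volatile("csrr %0, mstatus" : "=r"(f->status));
    asm volatile("csrr %0, mepc"    : "=r"(f->epc));
}

void trap_restore(const struct iframe *f) {
    asm volatile("csrw mstatus, %0" :: "r"(f->status));
    asm volatile("csrw mepc, %0"    :: "r"(f->epc));
    /* the trap stub then executes mret, which copies MPIE back into MIE
       and jumps to mepc, restoring the pre-trap irq state atomically */
}
```
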
03:46:38 <geist> anyway, bbiab. need to go to the store. my pantry is bare
03:48:09 <doug16k> ya I did that too a little while ago. I made it so when you wake up a thread whose home cpu is another cpu, then it sends an IPI to ask the other CPU to reschedule (the one I'm waking up almost certainly is going to preempt the one that is running, since it just woke up from sleeping for time)
03:48:54 <doug16k> and in there I forgot EOI. so eventually it made those CPUs idle forever without EOI-ing and getting nothing because the IPI is higher priority than the timer and stuff
03:50:48 <doug16k> so it worked well then dwindled down to a couple of cpus
03:51:21 <doug16k> more and more hopelessly waiting for a reschedule timer or ipi to ever happen again
03:52:26 <doug16k> thankfully I looked at `info lapic` in qemu monitor and it said plain as day what was happening
03:52:42 <doug16k> IPI sitting there unacknowledged in IRR
03:52:48 <doug16k> er ISR I mean
03:55:27 <froggey> hah, I fixed that *exact* bug about a week ago
03:55:39 <froggey> even down to using 'info lapic' to diagnose it
03:55:47 <doug16k> neat
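
(The shape of the fix for the bug doug16k and froggey both hit: the reschedule-IPI handler must EOI before returning, or that vector's ISR bit masks lower-priority interrupts — timer included — forever. The EOI offset is the architectural xAPIC one; set_need_resched is a hypothetical hook.)

```
#include <stdint.h>

#define APIC_BASE 0xFEE00000UL  /* default xAPIC MMIO base */
#define APIC_EOI  (*(volatile uint32_t *)(APIC_BASE + 0xB0))

extern void set_need_resched(void);  /* hypothetical: reschedule on exit path */

void reschedule_ipi_handler(void) {
    set_need_resched();
    APIC_EOI = 0;  /* the missing line: clears the highest ISR bit so the
                      timer and other lower-priority vectors can fire again */
}
```
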
04:12:55 <doug16k> the most used words are interesting ones -> http://bespin.org/~qz/irc/2019-02-February.html
04:13:08 <doug16k> thjnk being 2nd has to be its use as meaning uncertainty
04:24:23 <geist> froggey: haha, yay!
04:38:21 <ashkitten> i'm on the leaderboard :O
05:02:43 <geist> huh. that's kinda neat. this riscv board actually encourages using atomic operations to modify a hw register
05:03:08 <geist> in this case read/writing a output reg for gpios
05:03:27 <geist> as in instead of doing a read/modify/write, just use an atomic instruction to set or clear
05:14:22 <doug16k> it has RMW instructions that aren't xchg or CAS or LL/SC?
05:14:43 <geist> yah it's got a fairly clean set of single instruction atomics
05:14:58 <doug16k> nice
05:15:28 <geist> the control register instructions are also similar. they always do an atomic swap
05:15:54 <geist> so reading a control reg is like oring 0 with it, etc
05:16:00 <geist> it's a little loopy, but kind of clever
05:16:14 <geist> i just didn't expect it to also work against hardware regs
05:16:57 <doug16k> isn't it PCI that makes that not work? devices don't implement the bus cycles for it?
05:17:11 <geist> presumably. i have no idea if this is widespread
05:17:36 <geist> as in the hw peripheral would need to be able to support the bus transaction for this, and presumably not all of them do
05:18:43 <geist> but it's in the chapter for the gpio block. casually mentions that if your riscv core has the 'A' extension (atomics) you can use it to do RMW on the output value register
05:19:27 <geist> i guess since this particular machine is single cpu they can just say it works, but if it were SMP it would all fall apart
05:20:31 <doug16k> you're making me want to read about that in the pci spec some more to realize why nobody would ever implement locked cycles it on MMIO on PCI :)
05:21:01 <doug16k> s/it //
05:21:49 <geist> yah ARM says it's Right Out to use atomics on anything but fully cached memory
05:22:03 <geist> presumably because the mechanism relies on the cache coherency protocol
05:22:21 <geist> and i bet x86 might not like it either. kind of curious there
05:23:10 <doug16k> my guess: it would work on video memory but nothing else
05:23:21 <geist> i suspect what's going on here is they're saying 'hey this is a single cpu SoC, and there's nothing to race against since this is an output register, so go ahead and use the atomic instructions'
05:23:24 <doug16k> it already needs almost arbitrary alignment and stuff on its interface
05:23:52 <doug16k> not would work, I mean would have a chance of working
05:24:33 <doug16k> ya the internal MMIOs are probably uber fast and well implemented
05:25:35 <doug16k> ah, no other agents (DMA, etc) can possibly reach those eh
05:27:09 <doug16k> it would be fun to go do a bunch of inadvisable locked I/O and MMIO on some real hardware and watch for magic smoke :P
05:29:47 <geist> yah. hypothetically this would work on a cortex-m as well, i've just never seen it called out like this
05:30:09 <geist> and probably only works against registers that only modify when you write to them
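
(What that GPIO chapter amounts to, sketched in C with inline asm — the register address is illustrative rather than from a datasheet, and this needs the 'A' extension:)

```
#include <stdint.h>

#define GPIO_OUTPUT_VAL ((volatile uint32_t *)0x1001200CUL)  /* illustrative */

static inline void gpio_set(uint32_t bits) {
    uint32_t old;
    /* amoor.w: one atomic OR into the output register, no read/modify/write */
    asm volatile("amoor.w %0, %1, (%2)"
                 : "=r"(old) : "r"(bits), "r"(GPIO_OUTPUT_VAL) : "memory");
}

static inline void gpio_clear(uint32_t bits) {
    uint32_t old;
    asm volatile("amoand.w %0, %1, (%2)"
                 : "=r"(old) : "r"(~bits), "r"(GPIO_OUTPUT_VAL) : "memory");
}
```
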
05:36:07 <geist> anyway, yeah the control reg read/write is nifty: csrrw/csrrs/csrrc [dest], [control reg], [source]
05:36:33 <geist> so to read a control reg you can atomically or in a 0 to it
05:37:08 <geist> or to write it, atomically swap in a register with the value, storing the result in the zero register, thus discarding it
05:37:32 <geist> the 's' and 'c' versions of it set and clear bits in the source register or immediate
05:37:43 <geist> pretty clever, very handy
05:38:21 <geist> also means you can do stuff like clear the interrupt enable bit and save the old value in a single instruction
05:59:30 <doug16k> that's handy!
05:59:55 <geist> yah i just replaced my disable-and-save and restore-previous routines with em
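
(Those two routines plausibly reduce to a csrrci/csrrs pair, sketched here — MIE is bit 3 of mstatus, so it fits the 5-bit CSR immediate form:)

```
#include <stdint.h>

#define MSTATUS_MIE (1UL << 3)

static inline uint64_t irq_disable_save(void) {
    uint64_t old;
    /* csrrci: old = mstatus, then mstatus &= ~MIE — one instruction */
    asm volatile("csrrci %0, mstatus, 8" : "=r"(old));
    return old & MSTATUS_MIE;
}

static inline void irq_restore(uint64_t state) {
    /* csrrs: OR the saved MIE bit back in */
    asm volatile("csrrs zero, mstatus, %0" :: "r"(state));
}
```
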
06:15:07 <zhiayang> hm, how do i design ipc messages such that they don't end up being a stupid onion of layers like tcpip
06:15:49 <geist> make them smarter
06:15:57 <zhiayang> smart onions?
06:16:04 <geist> using potatoes
06:16:14 <zhiayang> :o
06:16:14 <doug16k> less tears?
06:16:27 <bcos_> What makes you think they end up with layers like tcpip in the first place?
06:16:39 <geist> with onion dip
06:17:23 <zhiayang> currently each message has a header and a payload right
06:17:42 <zhiayang> and so if i want to send stuff to say a driver that expects a certain format of stuff in the payload
06:17:46 <geist> currently? beats me. depends on what you've done
06:17:56 <zhiayang> well yea that's how it is for me currently
06:18:07 <zhiayang> so the actual data needs to be wrapped twice
06:18:10 <zhiayang> which is kinda lame
06:18:36 <geist> generally speaking most heavy ipc systems start building up a form of IDL that can generate code to pack/unpack/access the data
06:18:51 <geist> and maybe even build a message loop to dispatch stuff
06:19:25 <geist> but yes, generally you end up with at least one layer. but i dont think most IPC based systems go that deep
06:19:30 <geist> maybe only one layer or two
06:21:01 <bcos_> I normally have "payload" (the message containing whatever the sender felt like), then "metadata" (sender ID, message size) that isn't part of the message and isn't a header (but is tracked by kernel and passed along with the message - e.g. "getMessage()" returns multiple things)
06:21:25 <geist> that is a way to do it
06:23:10 <geist> depends a lot on how the low level ipc works. ie, is it many to one? does it have to be established up front, etc
06:23:33 <immibis> zhiayang: if your problem is having to shuffle things in memory, you can always reserve space before the message for the headers
06:23:36 <geist> there are lots of ways to cut that, and depending on how you do it, that can determine how important kernel metadata is
06:23:51 <geist> or implement readv/writev style iovec ipc
06:23:59 <geist> i've found that to be generally worth the trouble
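
(A hedged sketch of what iovec-style IPC looks like at the call site — the gather list keeps the header and the driver-format payload separate, so nothing gets wrapped twice; ipc_sendv and its port argument are hypothetical:)

```
#include <stddef.h>

struct ipc_iovec {
    const void *base;
    size_t      len;
};

/* hypothetical syscall: gathers iov[0..n) into the destination's queue */
extern long ipc_sendv(int port, const struct ipc_iovec *iov, size_t n);

long send_to_driver(int port, const void *hdr, size_t hlen,
                    const void *payload, size_t plen) {
    struct ipc_iovec iov[2] = {
        { hdr,     hlen },  /* protocol header */
        { payload, plen },  /* driver-format payload, untouched */
    };
    return ipc_sendv(port, iov, 2);
}
```
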
06:26:21 <geist> hah the power management window on this linux box is broken. it has a slider for 'after this much time, go to sleep'
06:26:31 <geist> it varies between 2 and 5. no units. just 2
06:36:33 <zhiayang> hm, i see
06:37:08 <zhiayang> bcos_: so what’s the difference between the header and the metadata?
06:38:09 <bcos_> Imagine you have a buffer or something, where kernel puts the received data. Metadata isn't put in that buffer
06:39:03 <bcos_> E.g. my "getMessage()" might return senderID in EBX, message size in ECX and message type in EDX; and the message payload gets stored in caller's RAM
06:39:44 <bcos_> For header, it'd be part of the payload
06:40:09 <geist> yah, or a timestamp, or even message length
06:40:25 <geist> that's part of the information about the message that the kernel may provide
06:43:12 <doug16k> what do I do about surprise removal of drives?
06:43:35 <bcos_> For my case; "sender ID" needs to be unforgeable for security purposes (so receiver can do "if(sender == someone_I_trust) { .."
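
(In C terms, bcos_'s split might look like this — metadata comes back out-of-band from the call, with sender_id filled in by the kernel so it can't be forged, while a header would just be the first bytes inside the payload buffer, written by the sender; names are hypothetical:)

```
#include <stddef.h>
#include <stdint.h>

struct msg_metadata {
    uint32_t sender_id;  /* kernel-supplied: unforgeable */
    uint32_t size;
    uint32_t type;
};

/* hypothetical call: payload lands in buf, metadata is returned separately
   (the register-based getMessage() described above is the same idea) */
extern int get_message(void *buf, size_t buf_len, struct msg_metadata *md);
```
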
06:44:27 <doug16k> a mechanism to keep the drive object alive for all handles still using it, but transitioned to a state that errors ENODEV for everything?
06:45:33 <bcos_> doug16k: I'd start by classifying drives as "intended to be removable, never used for memory mapped files, never used for swap" and "not intended to be removable, panic if removed"
06:46:20 <doug16k> usb storage is the one where I expect frequent surprise removals
06:46:32 <zhiayang> oh, hm
06:46:42 <zhiayang> right now my “metadata” as you describe lives in the header
06:47:21 <zhiayang> which would explain why i’m afraid of onions
06:49:20 <doug16k> perhaps just lock the handle table and walk it and stick a dummy drive implementation that just ENODEV's everything for every handle that is on the surprise removed drive
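
(A sketch of that idea: swap the drive's ops pointer to stubs that fail with ENODEV, and let the existing handle references keep the object alive; the ops-table shape is hypothetical:)

```
#include <errno.h>
#include <stddef.h>

struct device;
struct dev_ops {
    int (*read)(struct device *, void *, size_t);
    /* write, ioctl, ... all get equivalent ENODEV stubs */
};
struct device { const struct dev_ops *ops; /* plus refcount, etc. */ };

static int gone_read(struct device *d, void *buf, size_t len) {
    (void)d; (void)buf; (void)len;
    return -ENODEV;
}
static const struct dev_ops gone_ops = { .read = gone_read };

void surprise_remove(struct device *d) {
    d->ops = &gone_ops;  /* handles stay valid; every op now errors out */
}
```
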
06:50:35 <bcos_> zhiayang: I'm not sure how your messaging works (synch/asynch, max. size, delivery guaranteed/not guaranteed, ordering rules, ..). For mine; internally the kernel may move (not copy) pages, so for things like swap space and memory mapped files I need the data to be "undecorated" (e.g. so the page fault handler can get a "reply for block from swap space" and shove the message's underlying physical page of RAM directly into the page tables).
07:01:05 <geist> ballsy, going with the direct move there
07:08:15 <zhiayang> oh, interesting
07:08:44 <zhiayang> that’s another thing, like are messages just gonna be inefficient if you have to keep copying to/from kernel buffers
07:08:49 <zhiayang> without shared memory that is
07:08:54 <geist> it depends a lot
07:09:19 <geist> size of the message matters greatly, and the size of the crossover point changes over time as machines get faster
07:09:41 <doug16k> wouldn't I be able to do a bunch of temporal stores and have delayed non-coherent stores bound to there, which could get missed by loads done during validation? I hope there is a fence or serializing operation between the user code's writes and the validation
07:09:47 <doug16k> non-temporal*
07:10:10 <geist> if you have two processes that have a pre-established relationship, and can tolerate whatever security aspects may come, then pre-establishing a shared memory system is pretty important
07:10:20 <geist> like, say a file system and a block device. or a net stack and an ethernet driver
07:10:51 <geist> or perhaps between a gui process and a server, provided you can ensure the security aspects of that
07:11:27 <geist> but never underestimate the expense of screwing around with page tables, and conversely never underestimate the ability of a cpu to copy lots of data really fast
07:11:47 <geist> at the moment in zircon the crossover point is much higher than most people think. in the MBs on modern hardware
07:11:59 <geist> usually a few multiples of the size of L2 cache, but within L3 cache size
07:12:14 <geist> on modern hardware
07:12:43 <zhiayang> oh, interesting
07:12:54 <geist> also, "shared memory at all costs" has serious security concerns
07:13:05 <zhiayang> right the eventual intent is to have some kind of mechanism to get two processes to agree on a shared memory buffer if they need to transfer large bits of data
07:13:06 <geist> since the sender can now manipulate data while the receiver is looking at it, etc
07:13:17 <zhiayang> eg. fs driver and sata driver
07:13:22 <geist> yah i think that's quite important for sure
07:13:42 <zhiayang> oh, didn’t think about the security side
07:14:02 <geist> yah i didn't either until working on this project. security folks punch holes in all sorts of things
07:14:12 <geist> doubleplus so with meltdown/spectre/etc
07:14:55 <geist> even with shared memory buffers, it's fairly common to make a local memcpy of the original IPC message (maybe minus large payload) so you can safely process it on the side
07:15:25 <geist> or, say, send copies of the control plane of your message through the kernel, but refer to bulk data in a shared buffer
07:15:45 <zhiayang> ah right
07:16:11 <geist> that's one of the reasons we designed the zircon VM to treat memory buffers as a first class citizen
07:16:21 <geist> can map, can send, can read/write directly, etc
07:16:33 <geist> very easy to do, shared memory is a natural thing, since it's not a special case
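
(For concreteness, the Fuchsia-style pattern geist is describing, using public Zircon syscalls — error handling trimmed, and the cmd word is a stand-in for a real control-plane message:)

```
#include <stdint.h>
#include <zircon/syscalls.h>

zx_status_t send_bulk(zx_handle_t channel, const void *data, size_t len) {
    zx_handle_t vmo;
    zx_status_t st = zx_vmo_create(len, 0, &vmo);  /* first-class memory object */
    if (st != ZX_OK)
        return st;
    st = zx_vmo_write(vmo, data, 0, len);
    if (st != ZX_OK) {
        zx_handle_close(vmo);
        return st;
    }
    /* the small control plane is copied through the kernel; the bulk data
       travels as a handle the receiver can map, read, or pass along */
    uint32_t cmd = 1;
    return zx_channel_write(channel, 0, &cmd, sizeof(cmd), &vmo, 1);
}
```
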
07:18:42 <geist> okay, let's see if that helps. finally pulled the 'r' key on the keyboard and blew some air in it
07:18:49 <geist> maybe it won't keep repeating like it has
07:19:06 <geist> so far so good
07:19:11 <Mutabah> Oh, I thought you were just cold :)
07:19:29 <geist> nah, the key has been repeating for a while, but i've been too lazy to pull it
07:19:36 <geist> but.... shoulda done this forever ago
07:19:52 <geist> and yeah my hands are a little cold, about to head upstairs
07:20:28 <klys> i'm so done rearranging furniture tonight
07:22:00 <dminuoso> geist: You know what would be wonderful?
07:22:01 <dminuoso> First-class support for cross process STM.
07:22:13 <geist> hmm, what is STM in this case?
07:22:54 <klys> signal to message?
07:22:58 <bcos_> Sexually Transmitted Messages
07:23:08 <Mutabah> Software Transational Memory?
07:23:21 <klange> Surface Transportable Missile
07:23:45 <geist> silly transformer movie
07:24:32 <klys> stump the man
07:24:52 <dminuoso> Software transactional memory
07:24:54 * bcos_ votes "yes" for first-class support for surface transportable missiles - could be an elegant solution to the "PEBKAC" problem
07:25:22 <geist> damn, now i feel really silly. i've been bitching about this r key for months now, and all i had to do was pop it off
07:26:01 <bcos_> geist: We all postpone getting off our R's sometimes.. ;-)
07:26:05 <klys> I needed an F on this tablet-pc notebook, and so I spent like 12.00 on one
07:27:22 <dminuoso> klys: Do you have a Cisco notebook? Or why does your vendor make you pay dearly for each key?
07:27:25 <Mutabah> bcos_: Glorious
07:27:44 <klys> dminuoso, I got it on eBay. fujitsu lifebook t901.
07:28:06 <dminuoso> geist: But seriously, I'd love to see OS support for STM. Maybe it's something I should just write.
07:28:31 <geist> i honestly dont know precisely what that means
07:28:32 <klange> Fujitsu builds those things like tanks. Terrible specs for their production dates, though.
07:28:48 <dminuoso> geist: Are you familiar with what STM is?
07:28:54 <geist> only vaguely
07:29:07 <geist> it's some intel thing that no one uses, which is about the extent of it
07:29:13 <dminuoso> STM is basically like database transactions, except for memory.
07:29:19 <geist> ah, that
07:29:37 <dminuoso> So any process/thread participating in it gets to always see a consistent view of memory. You can commit transactions, roll them back and most importantly retry them.
07:29:57 * geist nods
07:31:13 <bcos_> dminuoso: There was a bunch of research (around 10 years ago maybe) into STM, mostly showing that it sucked under high contention
07:31:16 <klys> geist are you still working on a new scheduler?
07:31:50 <bcos_> dminuoso: (primarily, page size lacks the granularity needed to make it less sucky)
07:32:52 <geist> klys: yah
07:32:57 <klys> coo
07:33:01 <geist> it's in a branch now
07:34:16 <klys> I did a twigs_alloc function earlier for mine
07:34:50 <bcos_> doug16k: I'm a little worried now - found that things like "syscall" instruction aren't fences, so maybe there's plenty of scope for "non-temporal store, followed by normal load" problems lurking
07:35:13 <dminuoso> bcos_: Perhaps, I've been writing programs using STM for a while now and it's the silver bullet against race conditions. Being able to reason about concurrent programs without writing extensive formal proof is priceless.
07:37:00 <bcos_> dminuoso: Part of me is "actor model, shared nothing", the other part of me is "heh, Intel wrote HTM patches for Linux and everyone found out that they didn't help performance at all"
07:38:07 <dminuoso> bcos_: Well STM comes at a deep price, it's definitely not for free.
07:40:47 <bcos_> I see it as mostly a "developer time vs. performance (under contention)" trade-off; where (in theory) developer time should be amortised (by code re-use, etc) into insignificance but in practice nothing works like that..
09:07:24 <doug16k> swapgs is serializing. that will probably cover everything
09:08:28 <doug16k> nice sandpile page I hope I mentioned: http://www.sandpile.org/x86/coherent.htm
09:13:15 * bcos_ isn't planning to use swapgs - sfence might be the best option
09:13:36 <doug16k> I had a feeling you might not be using swapgs
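
(The hazard being discussed, and the sfence that closes it, in a small sketch — SSE2 intrinsics, x86-64 only:)

```
#include <stddef.h>
#include <emmintrin.h>

void copy_nt(long long *dst, const long long *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        _mm_stream_si64(&dst[i], src[i]);  /* non-temporal: bypasses cache */
    _mm_sfence();  /* drain write-combining buffers: without this, another
                      observer (e.g. kernel-side validation reached via a
                      non-serializing entry path) can see the stores late
                      or out of order */
}
```
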
09:13:49 <nyc`> I'm not sure anything doesn't suck under high contention.
09:14:38 <nyc`> I wonder if that really meant large core/thread count.
09:14:38 <bcos_> nyc`: The goal would be to reduce contention (fine grained locks vs. coarse grained whatever)
09:19:11 <bcos_> Hrm.. while also trying to avoid "busy work" (the "do work, then discard it if it can't be committed" thing you get with TM and lockless)
09:19:41 <nyc`> I think the synchronization method isn't going to be to blame for contention per se. Maybe there's another way to phrase this that would make more sense to me.
09:21:39 <mrvn> nyc`: wait queues don't suck under contention
09:22:12 <nyc`> That sounds like a combination of wait-reducing / wait-freedom and otherwise reducing synchronization overhead.
09:23:24 <mrvn> nyc`: the thread tries to get the lock, if not you add it to a queue of threads waiting for the lock and put it to sleep. When the lock clears you wake up the first thread in the list.
09:24:17 <nyc`> (One of my more notable technical achievements was actually reducing the overhead of waitqueues under contention.)
09:25:22 <mrvn> contention in the wait queue itself? You have that many cores?
09:27:50 <nyc`> It was more of a thundering herd sort of issue than lock contention.
09:28:32 <mrvn> the wait queue prevents the thundering herd when you only wake up one thread at the front.
09:29:03 <nyc`> I was primarily dealing with a 64 core system back in 2004.
09:30:14 <nyc`> Wake-one semantics needed my code to be recovered in the situation in play.
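
(A sketch of the wait-queue mutex mrvn describes, with wake-one semantics so unlock releases exactly one contender instead of a herd; the spinlock/thread primitives are assumed, not any real API:)

```
#include <stdbool.h>

struct thread;
struct spinlock;
struct thread_queue;

extern void spin_lock(struct spinlock *);
extern void spin_unlock(struct spinlock *);
extern struct thread *current_thread(void);
extern void queue_push(struct thread_queue *, struct thread *);
extern struct thread *queue_pop(struct thread_queue *);  /* NULL if empty */
extern void thread_block(struct spinlock *);  /* sleep; drops and retakes lock */
extern void thread_wake(struct thread *);

struct wait_mutex {
    bool locked;
    struct spinlock *q_lock;       /* protects 'locked' and the queue */
    struct thread_queue *waiters;  /* FIFO of sleeping contenders */
};

void wait_mutex_lock(struct wait_mutex *m) {
    spin_lock(m->q_lock);
    while (m->locked) {                        /* taken: queue up and sleep */
        queue_push(m->waiters, current_thread());
        thread_block(m->q_lock);               /* woken holding q_lock again */
    }
    m->locked = true;
    spin_unlock(m->q_lock);
}

void wait_mutex_unlock(struct wait_mutex *m) {
    spin_lock(m->q_lock);
    m->locked = false;
    struct thread *t = queue_pop(m->waiters);  /* wake-one: no thundering herd */
    spin_unlock(m->q_lock);
    if (t)
        thread_wake(t);
}
```
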
09:53:05 <geist> but but but everyone uses swapgs!
09:59:50 <nyc`> swapgs is a headache I'm not looking forward to. Hence putting off x86 for a later additional ports phase.
10:31:48 <nyc`> I probably should have looked at Redox and Genode.
10:34:08 <lkurusa> Redox is apparently a bit of a mess
10:34:14 <lkurusa> at least according to some people here
10:34:26 <lkurusa> i have not looked at it properly thus far
10:36:08 <nyc`> It looked like it was going somewhere from the Wikipedia page.
10:36:31 <lkurusa> oh, yes, people were not disputing that
10:36:40 <lkurusa> it seems the internals aren't very well designed
10:37:43 <klange> It's a mess because it rather suddenly picked up a large group of people.
10:37:52 <nyc`> The good news for my current project is that it hasn't been designed yet.
10:37:53 <klange> And no one involved knew anything about how to manage it.
10:38:18 <klange> And then they went and split it up into dozens of different repositories and it's now impossible to figure out where anything is in it.
10:38:55 <klange> I envy Redox's rise in developers, but it's also been a curse for them.
10:39:54 <nyc`> It's Rust partisans boosting the contributor count, no?
10:46:01 <nyc`> I wonder how they ended up not being able to park bodies on things.
10:47:00 <nyc`> Large teams of coders are useful.
10:48:29 <nyc`> Ports to new architectures, sending them after toolchain issues, userspace, you name it. Bodies help a lot.
10:52:05 <klange> Their last binary release is from a year ago.
10:53:06 <klange> Their self-hosted gitlab has no actual binary downloads, despite their docs pointing to them for that.
10:53:17 <klange> I'm not really sure what they're going for anymore.
11:01:35 <lkurusa> Why are they self-hosting gitlab?
11:01:56 <lkurusa> Sure sure GitLab is okay because hurr durr microsoft yada yada
11:02:08 <lkurusa> but self-hosting seems like overkill imo
02:05:11 * renopt manually prefixes and renames some ~100 functions and structs
02:05:30 <renopt> man, it'd be super rad if C had namespaces
02:06:59 <mrvn> it would be C plus plus rad
02:07:01 <ryoshu> geist: hi
02:07:14 <ryoshu> geist: I think I know why we need emulator for legacy OSes
02:07:21 <ryoshu> on x86
02:07:43 <ryoshu> https://patchwork.kernel.org/patch/28685/
02:07:56 <ryoshu> KVM: VMX: Support Unrestricted Guest feature
02:08:07 <ryoshu> without UG we need emulator for real mode
02:16:04 <ryoshu> probably not a big deal as this is mostly booting phase for modern OS
02:36:10 <nyc`> I'm surprised to hear of so much chaos in Redox. To me, operating systems are projects that could relatively naturally use tons of bodies.
02:40:09 <nyc`> Park a body on each architecture port, device driver, major subsystem, and even branch out into parking bodies on things like major userspace apps and toolchain components and such.
03:43:07 <knebulae> @nyc: if you follow the rust community, you'd likely not be at all surprised at the chaos in Redox.
03:43:27 <knebulae> That community is going to cannibalize the whole thing. It will be an interesting case study in a decade.
03:43:49 <knebulae> Not even the Mozilla $$ will be able to save it.
04:31:42 <nyc`> They're at least closing the loop with a language and compiler.
04:33:13 <nyc`> I don't know of too many other projects that did or are doing likewise.
04:35:23 <nyc`> Oberon I guess.
05:25:55 <bwb_> > That community is going to cannibalize the whole thing.
05:26:02 <bwb_> knebulae: would you like to expand upon that?
05:29:57 <nyc`> If I were going to do a loop-closing project, I would probably have multiple languages. It would make sense to have a low-level language for the kernel and runtime affairs in userspace and then to use super high-level languages for major userspace components. I'm not a believer in one size fits all for languages.
05:29:58 <knebulae> well, they just don't prioritize development. I'm kind of old school. They are worried about everything under the Sun, except for a core team of competent folks, and there's just too many distractions.
05:30:48 <knebulae> It also doesn't help that they talk to people like they're stupid. I could show you 8 to 10 different ways to accomplish memory safety in a modern programming language without introducing features that are completely antithetical to how people have programmed for 40+ years.
05:31:48 <knebulae> Plus there's zero OO, even where appropriate, and the way they did composition and turning enums into an ADT, I mean, I could go on and on.
05:32:31 <knebulae> It's really smart kids, who are very talented programmers, who need more top down experience. More pencil and paper, less node.js crowd.
05:33:09 <knebulae> Now get off my lawn!
05:35:26 <nyc`> My interactions with userspace people have usually been talking at cross purposes.
05:36:53 <nyc`> Sort of like what always happens with me and mrvn, but in a different direction.
05:38:52 <knebulae> @nyc: loop-closing project; I'm not sure exactly what your intended meaning of that was, but I like the ring of it for my own purposes.
05:39:53 <nyc`> knebulae: Loop-closing meaning self-hosting with compiler and other supporting userspace.
05:40:28 <knebulae> single lang all the way down (except the dirty bits) then... ok.
05:41:57 <nyc`> knebulae: I think I just said that the single language part of it wasn't such a good idea.
05:42:59 <knebulae> Realistically, there's only 3 languages that could pull that off right now: C/C++/C# (with the native compiler). zig/rust/nim/crystal are probably in the next tier to pull it off. Apologies if I've forgotten anyone.
05:43:53 <knebulae> @nyc: I missed that, and I agree. Some of the old UNIX guys are hell-bent though on userspace + kernel being single language (or at least C/C++)
05:47:58 <nyc`> knebulae: Languages aren't that far out to wing. Most of the ugliness is C missing modules etc. I guess C++ has things that could help, but they're a little weak. I don't know enough about Zig, Rust, Nim, or Crystal.
05:49:13 <andrewrk> speaking of zig, look what I just got working: https://pbs.twimg.com/media/DzyTGk3XgAAfTOh.png
05:49:37 <andrewrk> not relevant for freestanding target, but could be useful for unit tests or userland
05:51:58 <nyc`> knebulae: I think things like userspace or the compiler would be good to go all the way to crazy AI research stuff like Lisp, Prolog, Haskell, etc.
05:52:34 <nyc`> andrewrk: Excellent !
06:03:49 <nyc`> knebulae: I guess I would write front ends for C, C++, Ada, Fortran, Zig, Rust, Nim, Crystal, etc., but I'd probably write the compiler in Haskell or Mercury or similar.
06:04:32 <knebulae> @nyc: right
06:05:38 <knebulae> @andrewrk: very nice man!
06:07:58 <nyc`> knebulae: llvm's design bothers me because alternative calling conventions and GC hooks seem to need to have holes punched in the system to be fit into it.
06:10:07 <andrewrk> nyc`, I know you're right with regards to calling conventions, but it seems like LLVM has a bunch of GC stuff in it: http://llvm.org/docs/GarbageCollection.html
06:10:15 <andrewrk> I haven't personally tried it though
06:10:38 <nyc`> I still haven't entirely decided whether to go macro or micro or what language to write my current project in.
06:10:41 <andrewrk> nyc`, what do you do in C for alternative calling conventions?
06:12:01 <andrewrk> or maybe a better question is what do you do if you want alternative calling conventions?
06:13:07 <nyc`> andrewrk: C is wed to its stack discipline and the calling conventions are hardwired. It's different when you're doing a codegen back end from IR's.
06:17:00 <nyc`> andrewrk: Ideally a relatively general back end working from a cross-language IR wouldn't basically say that its code model is C except for a bunch of language-specific calling convention hooks.
06:18:59 <andrewrk> nyc`, I think llvm gives that with the `fastcc` calling convention: "This calling convention attempts to make calls as fast as possible (e.g. by passing things in registers). This calling convention allows the target to use whatever tricks it wants to produce fast code for the target, without having to conform to an externally specified ABI (Application Binary Interface)."
06:20:03 <andrewrk> (I will note that zig does not guarantee an ABI on anything unless explicitly specified, and this is currently the calling convention chosen by default)
06:20:50 <nyc`> andrewrk: The trick is when languages don't use C-like stack disciplines.
06:21:22 <andrewrk> nyc`, what do you mean by that?
06:21:34 <andrewrk> what's an example of not doing that?
06:22:33 <knebulae> it would have to be something functional
06:24:15 <nyc`> andrewrk: Higher-order languages heap allocate activation records. Standard ML of New Jersey uses that with continuation-passing style so furthermore no call ever returns to its calling context.
06:29:38 <andrewrk> I see
06:29:45 <nyc`> andrewrk: GHC (Glasgow Haskell Compiler) does something relatively wild where it uses a graph manipulation (reduction) execution model that is very far from how humans understand what their programs represent so infinite data structures etc. work.
06:30:59 <andrewrk> interesting
06:31:31 <andrewrk> this is relevant to my goal of achieving safe recursion / eliminating the possibility of stack overflow
06:31:48 <andrewrk> one question that comes to mind with using the heap for activation frames is - what happens if the heap allocation fails?
06:32:09 <andrewrk> what about Out Of Memory?
06:32:25 <knebulae> @andrewrk: and you just discovered the problem with functional programming :)
06:33:03 <nyc`> The runtime system throws an exception that can theoretically be handled or caught, though it's not likely to be recoverable.
06:33:27 <nyc`> knebulae: All GC, really.
06:34:05 <knebulae> @nyc: true dat
06:35:45 <nyc`> Most C/C++ programs just blindly dereference after allocations so it's not much of a difference for userspace.
06:38:19 <andrewrk> nyc`, in this case you will find cognitive dissonance with zig programming language and I would not recommend it to you. the idea of zig is to handle everything perfectly, no exceptions
06:39:02 <andrewrk> hmm sorry if that sounded condescending. I didn't mean it that way
06:39:25 <nyc`> Most C/C++ userspace programs have no business being written in C/C++.
06:39:44 <knebulae> @nyc: that's why I'm kind of excited about C# native.
06:40:01 <knebulae> If people can get over the "from M$" thing, it's getting pretty damn good.
06:40:15 <knebulae> much better for userspace
06:40:35 <knebulae> *general userspace (i.e. not scientific or high-performance)
06:40:43 <nyc`> andrewrk: Not sure what you mean there. I know about other things, but am almost entirely kernel.
06:41:41 <nyc`> andrewrk: Zig sounds like it would be good for kernels.
06:42:54 <knebulae> @nyc: zig is excellent for kernels
06:44:13 <nyc`> I think the trick is with targets that llvm doesn't cover.
06:57:17 <andrewrk> nyc`, I think it would be easier to implement more LLVM backends than to implement the optimizations they have
07:10:41 <nyc`> Possible, though I'm less worried about optimization than just getting from whatever language to native code.
07:12:27 <nyc`> Or whatever will boot on the system.
07:16:06 <nyc`> My current project I'm keeping to a relatively limited scope of a kernel that can probably run Linux-like userspace on some preexisting set of targets.
07:17:42 <nyc`> I'm not that good and don't have the bodies or materials to do like Wirth did for Oberon.
07:19:23 <nyc`> IIRC he designed hardware and wrote kernel and userspace for it.
07:20:39 <nyc`> And, of course, the compiler.
07:27:05 <knebulae> @nyc: did he do that commercially, or with academic "help?"
07:27:12 <knebulae> @nyc: Wirth/Oberon
07:28:25 <nyc`> knebulae: Academically.
07:30:52 <nyc`> I'd probably want to do cache directories and crossbar IO and RDMA etc.
07:31:16 <knebulae> @nyc: I was going to say, that's pretty ballsy (hardware and all)
07:31:42 <knebulae> Only Apple has been able to pull it off (or I should say are the last ones standing)
07:32:30 <nyc`> knebulae: Make things modular and efficient and make the kernel run full SSI on it all.
07:32:45 <nyc`> Yeah, it was awesome.
07:33:11 <knebulae> @nyc: I'll be waiting for my github invite :)
07:33:45 <knebulae> @nyc: well, if you ever get the emulators going
07:34:11 <knebulae> s/emulators/emulators & toolchains/
07:34:19 <knebulae> probably a bad regex
07:41:55 <nyc`> knebulae: Basically the only thing stopping people from making systems modular from probably the 70's or earlier is wanting to bleed people for more money.
07:50:28 <knebulae> @nyc: well, everybody's still gotta eat. If you give people perfect software, then you're out of business.
07:50:44 <knebulae> @nyc: so it's a good thing perfect software is unobtainable currently.
07:54:46 <nyc`> knebulae: There's plenty to do in the world.
07:55:24 <nyc`> knebulae: Plus a big chunk of that was hardware.
07:56:33 <knebulae> @nyc: right and right.
08:00:30 <nyc`> Calculating protein foldings or relativistic n-body problems can absorb endless programmer time.
08:01:03 <nyc`> And have real-world benefits!
08:04:59 <nyc`> Medical statistics to find the latest carcinogens etc. are also handy and would easily have dealt with Big Tobacco in the 60's if not for corrupt interference. There are computations to do with computers besides just making computers work.
08:08:47 <doug16k> knebulae, yep. some regex engines interpret & to mean the whole match :P
08:09:27 <doug16k> thankfully in here we can use super duper extended regex syntax with common sense heuristics
08:10:24 <nyc`> A lot of the number crunching problems can basically use more capacity for more detailed models and more accuracy and such.
08:13:20 <bcos_> For a lot of the number crunching problems; the operating system's main goal is to get out of the way
08:13:28 <nyc`> If you can speed them up they can do more. There are TLB, cache, CPU scheduling and more issues to squeeze more out of it all.
08:14:25 <nyc`> Getting out of the way is an odd idea.
08:14:56 <immibis> getting out of the way is what OSes do when they're not doing stuff for you
08:15:58 <nyc`> They certainly don't want the kernel to divert significant CPU bandwidth or memory, but they demand services from it.
08:18:50 <immibis> in fact everything should get out of the way when you're not using it - what's the alternative, prompting you to buy the full version every 30 minutes?
08:19:51 <bcos_> Hardest = general purpose OS for desktop/gaming. For general purpose servers, you take a desktop/gaming OS and rip out all the hard parts (GUI, sound, 3D graphics, ..). Then, for HPC/number crunching you take the server OS and rip out more (scheduler, security stuff, file system, ...) - mostly just need a raw physical memory manager with a network card driver
08:20:32 <bcos_> :-)
08:21:13 <nyc`> What percentage of CPU or RAM goes to the kernel and whether that's a linear fraction of it all or what.
08:22:09 <nyc`> Are you even aware of parallel slowdowns?
08:22:55 <doug16k> are you aware of parallel speedups?
08:23:03 <doug16k> :)
08:23:27 <doug16k> processors nowadays try harder to have good QoS
08:25:25 <nyc`> Multiplying the number of CPU's by some factor doesn't speed things up by that factor.
08:25:35 <immibis> there's an april fools' idea actually. have the linux kernel send SIGBUYFULLVERSION to a random process every few minutes
08:25:51 <doug16k> it's not a first-come-first-served free for all anymore for the cores, things like memory controllers really care about fairness as far as it doesn't hurt bandwidth _too_ much
08:26:58 <nyc`> There is in general some number of CPU's where it gets slower than uniprocessor.
08:29:14 <nyc`> The question then becomes not just how far off you can make that point but how many you can handle while adding more CPU's still increases performance.
08:34:51 <nyc`> I'm actually foggy on the analogous issue for RAM, but a fair amount of the practice seems to consider only a certain amount of RAM per CPU to be usable very effectively.
08:35:44 <doug16k> ya well there's wind resistance too, doesn't stop cars from going 240km/h
08:37:46 <nyc`> TLB reach is a big part of that as is cache locality. Things want superpages so TLB miss overheads are cut down and TLB reach is increased. The kernel is less involved in the cache issues.
08:38:09 <doug16k> nyc`, got measurements showing TLB miss having a significant % of cycles? I gotta see that
08:39:14 <nyc`> doug16k: They actually do, though I'm upstairs on my phone so I can't chase them down this instant.
08:39:17 <doug16k> speculation is so aggressive now it's walking TLB misses long before it reaches them
08:41:00 <chrisf> doug16k: something like a 64M hashtable gets you into TLB-bound territory pretty easily
08:43:48 <doug16k> belly-aching about 4+ GHz 16+ CPU machines with 2+ channels at 2000+ MHz memory clock, srsly? I guess I'm old now
08:44:55 <chrisf> im not sure if belly-aching.. but just dont leave all the perf on the table ;)
08:50:09 <knebulae> @doug16k: I treat regex like nitroglycerin :)
08:53:07 <knebulae> @doug16k: with great power, yadda, yadda...
08:55:13 <doug16k> here's my problem with it: people count the number of say, tlb miss events, but they don't realize that a ton of those are occurring speculatively, on ops far down the instruction stream, so the cpu is very often hiding a ton of tlb latency
08:56:42 <doug16k> that will dilute attempts to make tlb optimizations improve things
08:58:33 <knebulae> @doug16k: right, I noticed that too. People are trying to optimize the software for what they "think" the cpu is doing, while the cpu guys are optimizing for the code the cpus are executing in the wild. One side needs to stop for a bit. Lol.
08:59:37 <doug16k> I actually understated it quite a bit, it's actually much much better than what I said. the cpu is so aggressive about it, it will just predict what you will access next and start walking the tlb without even seeing a memory access instruction
09:01:14 <doug16k> if someone talks about some bottleneck and has some performance analyzer output proving it, I'm all ears
09:01:52 <bcos_> doug16k: Sometimes predicting the future is hard..
09:02:25 <nyc> doug16k: There are papers everywhere.
09:02:36 <doug16k> yes, but when you guess correctly that much...
09:02:43 <nyc> doug16k: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.120.4733&rep=rep1&type=pdf for example.
09:03:45 <doug16k> yes of course, I'm not saying the TLB is never an issue. I'm saying that in the general case the TLB is _really_ good
09:04:03 <doug16k> nowadays (<= 8 yrs old)
09:04:36 <bcos_> doug16k: Simple example is iterating a boring old singly linked list with a "while(element) { element = element->next; }" - you scatter those elements "randomly" over an area of many GiB of virtual space (because "malloc()" is crap) and then...
09:04:43 <nyc> It doesn't matter how good it is.
09:05:19 <nyc> TLB reach matters.
09:06:08 <doug16k> I think people are overly worried about the TLB because they see a TLB miss as a stall, as if the CPU sat there and did nothing the whole time it was doing the walk. it isn't, it's very very aggressively overlapped with your code
09:06:31 <doug16k> if you create a tight dependency chain then it will be latency bound, it can't help that. for the almost always occurring case it is excellent
09:08:09 <bcos_> Worst case = (number_of_levels_of_page_tables + 1) * cache_miss_cost = ~1000 cycles for modern 80x86
09:10:32 <nyc> Tight dependency chains are common, e.g. linked lists.
09:12:15 <nyc> Things aren't stupid; they try to use whole cachelines and pages at a time.
09:12:28 <nyc> e.g. the Intel/Oracle FAST tree.
09:13:26 * bcos_ disagrees - most things are very stupid ;-)
09:13:26 <nyc> What it comes down to is that that's not that much data.
09:14:35 <nyc> bcos_: Okay, well, those who do linear searches need to have their licenses to program revoked.
09:14:45 <bcos_> (I've never seen a programming language that gives the programmer any control over locality - malloc, new, GC, ... = all crap)
09:15:30 <bcos_> (the "placement new" that got added to C++ might be an exception, maybe)
09:16:14 <doug16k> locality is what makes vector<> good
09:16:14 <nyc> Userspace programmers actually allocate groups of things at a time for locality's sake when they're not stupid.
09:17:04 <bcos_> nyc: If I had $1 for every time a userspace programmer wasn't stupid I wouldn't be able to afford a cup of coffee.. ;-)
09:18:00 <bcos_> Heck, now that "web app" has become standard practice stupidity is a minimum requirement
09:19:26 <nyc> Where are you finding these people? In my career, the compiler hackers had Ph.D.'s and the various userspace types all had their own specialties like that.
09:21:18 <bcos_> Where are you finding software that isn't crap?
09:22:03 <bcos_> (translation: Where are you finding software that isn't improvable - same meaning with different words)
09:22:41 <doug16k> you could make a 3d renderer fill fast enough with put_pixel calls now. seriously. you have to do it pretty wrong now to matter. as time goes on, doing it correctly will only become more and more outrageously faster than the naive way, because the naive way can be worse and worse and still work now
09:22:49 <nyc> The userspace people who weren't compiler hackers at IBM were hacking on some system simulator among other things.
09:23:30 <nyc> The userspace people at Oracle were obviously hacking on various aspects of the database.
09:24:13 <uelen> okay, I figured out the issue I was having yesterday. I didn't set my userspace descriptors up right so as soon as I entered userspace the cpu went into compatibility mode
09:24:49 <uelen> which broke syscall
09:26:01 <knebulae> @uelen: that'll do it
09:26:35 <bcos_> nyc: I'm talking about software, not people - an extremely talented developer with multiple PhDs and 20 years of experience, who works on a "generic database that can't be optimised for any one specific use case" is working on crap
09:27:29 <bcos_> ..and that does NOT mean the person is crap
09:27:37 <knebulae> I'm with doug. Too many whitepapers, not enough "close to the metal" with pencil and paper. I mean, I enjoy the whitepapers, but the rubber has to meet the road. I like Rob Pike's advice: for almost all n, the standard accepted way to do things is fast enough, and when it isn't you'll know it (paraphrasing)
09:29:08 <bcos_> knebulae: I very much dislike the "fast enough" mentality - in theory, most operating systems are "multi-process" and therefore you should assume that any piece of software is competing for CPU time with many other pieces of software
09:29:35 <bcos_> ..your software might be fast enough when it's the only thing running, but..
09:29:51 <knebulae> @bcos_: when I said "fast enough," that was in reference to super fancy theoretical algorithms that are really only faster with very large datasets (very large n).
09:30:13 <knebulae> Not that we shouldn't code for perf using common sense.
09:30:59 <knebulae> I like to go fast too!
09:31:02 <nyc> Also, if we're going to go about saying that the semiconductor chemists are going to be entirely relied upon for system performance then we might as well find another line of work.
09:32:14 <nyc> The upshot of that attitude is to shut down all the computer science departments because intelligent programming is rendered completely unnecessary by speedups from semiconductor technology etc.
09:32:44 <knebulae> @nyc: how much more do we need to specialize when the state of computing is still kind of primitive?
09:32:55 <knebulae> Those people need to write new systems that don't suck.
09:33:11 <knebulae> Not write textbooks (I guess I should exit stage left now)...
09:35:12 <doug16k> I can't help but try to help the compiler generate branch free code. maybe it hurts readability a bit, but my mispredict rate is so low on my ryzen it is at the limit of belief. is it because of my style of helping the compiler? I guess I'll never know for sure
09:35:40 <eryjus> nyc, I'm late to the conversation here but I want to recount a recent operating system upgrade we did.. first, I am a business programmer that hacks on oracle software running on DB2/400 databases... we had a recent upgrade to the OS on our IBMi that changed the way that the query analyzer worked.. the result was that the query analyzer dropped into full table scans far earlier than it did in the previous OS version. on a table with about 50M records, this killed our performance -- begging the question I deal with all the time: how long does it take to do a day's worth of work?
09:36:23 <eryjus> i hate to go backwards like that, but thank you IBM for not making that change flash in red letters on the net change memo
09:36:52 <nyc> eryjus: I hope IBM is helping arrange a backout of the upgrade.
09:37:29 <eryjus> nope -- support said "if you have crap indexes, we can't make it perform like a ferrari"
09:38:16 <eryjus> now, they had a point -- the job in question went from 4 minutes to 15 hours to 3 seconds with a well built index
09:39:03 <eryjus> the point is -- you cannot ignore performance even as a business application developer, I am constantly evaluating the code I write for how it will perform
09:40:35 <eryjus> oh, and I was fixing oracle's software...
09:45:23 <nyc> Spectre/Meltdown really shaved performance percentages off of x86.
09:46:34 <doug16k> it's ok. the performance we got in exchange, before those were noticed, was from cheats anyway
09:46:58 <doug16k> user code controlling kernel branch prediction? come on man
09:47:22 <mahaxemu> >sharing cpu across multiple users
09:49:26 <mahaxemu> originally i was mad because i thought it really was a hardware bug, but now i lean towards software bug.
09:49:41 <mahaxemu> on x86 at least
09:50:07 <zhiayang> > running code that you didn't 100% write yourself
09:50:08 <nyc> They basically had to do XKVA / 4/4 for security reasons anyway after permavetoing it for 32-bit concerns.
09:50:20 <doug16k> it is a hardware bug though, the cpu should be tagging caches with a bit that controls whether it is eligible for hit in the current context
09:50:40 <doug16k> caches like the branch predictor
09:50:58 <doug16k> should miss if it is a user mode history entry
09:51:51 <nyc> Virtual indexing seems like it would have helped Meltdown since it wouldn't be a cache hit if it's from the wrong address space.
09:52:26 <doug16k> only intel has meltdown though
09:52:50 <doug16k> I meant in general x86 suffers from spectre on all
09:53:53 <nyc> Um, SPARC is clean, ARM is a mixed bag, POWER is affected, and there's probably more.
09:54:18 <nyc> This is with respect to Meltdown, not more general Spectre.
09:55:19 <nyc> I think Itanium is actually clean.
09:58:47 <doug16k> did someone look at ia64?
09:59:52 <nyc> Yes.
09:59:54 <nyc> https://secure64.com/not-vulnerable-intel-itanium-secure64-sourcet/
09:59:55 <doug16k> people still hemorrhaging power into those today? they double as space heaters
10:00:59 <doug16k> perf/W I mean
10:01:08 <nyc> They were interesting machines that would have been rather worthwhile if the fabs hadn't been diverted to x86-64.
10:01:19 <doug16k> oh definitely interesting design
10:01:49 <doug16k> amd64 trounced ia64 when I was at MS
10:01:50 <nyc> Basically everything hinges on the top-line fabrication facilities.
10:02:30 <zhiayang> that website is fucking with my scrolling
10:02:32 <zhiayang> i don't like it
10:02:43 <nyc> Computer science still hasn't become relevant because the fabrication processes haven't hit the wall yet.
10:03:07 <doug16k> and in exchange for their power hunger, the devs had a special class of bugs that can't happen on amd64, which was a bit annoying for a bunch of x86 people I bet
10:03:15 <nyc> Performance is all about the fabrication process and price is all about the volume.
10:04:02 <nyc> Neither price nor performance say anything about the virtues of the architecture.
10:05:17 <nyc> Wattage is pretty much fabrication process again.
10:05:40 <qookie> hi, sorry for coming on just to ask a question, but has anyone else experienced freezes/crashes on real hardware when there are port 0xE9 writes happening?
10:05:56 <qookie> i use port 0xE9 for debugging output and noticed that on real hardware the kernel freezes
10:06:09 <qookie> and commenting out the outb that writes to that port fixes the freeze
10:06:24 <qookie> this happens on both computers i tested it on
10:07:22 <nyc> doug16k: So sure, you can fixate on who and what monopolized the top-of-the-line fabrication facilities and what the top-of-the-line fabrication processes are this year, but that makes you interested in metallurgy or some such, not a computers.
10:08:07 <nyc> qookie: Bochs' 0xe9 port is not good to use on real hardware, sure enough.
10:08:47 <qookie> okay, i'll just make it a debug build only option
10:10:38 <doug16k> I don't think more and more sketchy behaviour and unpredictable order is the direction we want to go on this planet. I think the total store ordering of x86 is a killer feature. correctness trumps performance, even when you might have been able to hint it to do it in a recklessly aggressive order
10:13:04 <doug16k> nyc, I'm totally happy with current fabrication. they can just stop here and I'm good
10:14:19 <doug16k> the EE's can only go down to so many atoms before reality starts going all wonky
10:17:15 <doug16k> if moore's law has much hope of continuing, the volume of the chip will have to start going up, by spreading more and more complex chips across the motherboard into ram and such
10:20:25 <nyc> 5 nm processes are on the roadmaps and 1nm processes have been demo'd in labs etc.
10:21:43 <nyc> Going by previous lab-demo-to-product-ship times, that suggests there are 20 years of fabrication processes in the pipeline.
10:22:07 <doug16k> qookie, at the very least you are supposed to read 0xe9 once and if it isn't reading the value 0xe9 then it isn't there
10:22:17 <doug16k> port 0xe9*
10:23:09 <doug16k> or have a kernel parameter or boot menu turn it on somehow
10:23:18 <doug16k> ...in dev build
10:23:43 <qookie> i currently have a ifndef block that surrounds the write to port 0xe9
10:24:09 <qookie> and the define is added if the current build isn't a debug build
10:24:19 <doug16k> ok, but you need a runtime check
10:24:30 <qookie> alright, i'll add it now
10:24:44 <doug16k> even when the ifdef is on you are supposed to, at least once before using it, input from port 0xe9, and verify that the value you got is 0xe9
10:25:03 <nyc> (I probably don't have 20 years left in me, so I probably won't ever see computer science begin to be the dominant factor.)
10:26:20 <doug16k> reading 0xe9 when it isn't there isn't very cool, but spamming port 0xe9 with transactions that go nowhere is less cool
10:27:23 <doug16k> what you really should do if you really care, is read CPUID leaf 0x40000000 and see if it is qemu or kvm (or other(s))
10:27:27 <qookie> yeah, makes sense
10:27:34 <doug16k> and only then, when you are 99.999% sure, use it
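(A minimal sketch of the runtime check doug16k describes, assuming GCC-style inline asm in a freestanding kernel; the inb/outb wrappers and the debugcon_* names are illustrative, not from the channel. The KVM and QEMU TCG signature strings are the documented values for CPUID leaf 0x40000000.)

```
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static inline uint8_t inb(uint16_t port)
{
    uint8_t v;
    __asm__ __volatile__("inb %1, %0" : "=a"(v) : "Nd"(port));
    return v;
}

static inline void outb(uint16_t port, uint8_t v)
{
    __asm__ __volatile__("outb %0, %1" : : "a"(v), "Nd"(port));
}

static bool running_under_qemu_or_kvm(void)
{
    uint32_t eax, ebx, ecx, edx;

    /* CPUID.1:ECX bit 31 is the "hypervisor present" flag. */
    __asm__ __volatile__("cpuid"
                         : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         : "a"(1), "c"(0));
    if (!(ecx & (1u << 31)))
        return false;

    /* Leaf 0x40000000 carries the vendor signature in EBX:ECX:EDX,
     * e.g. "KVMKVMKVM\0\0\0" for KVM, "TCGTCGTCGTCG" for QEMU TCG. */
    char sig[12];
    __asm__ __volatile__("cpuid"
                         : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         : "a"(0x40000000), "c"(0));
    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);

    return memcmp(sig, "KVMKVMKVM", 9) == 0 ||
           memcmp(sig, "TCGTCGTCGTCG", 12) == 0;
}

static bool e9_present;

void debugcon_init(void)
{
    /* Only probe once we're fairly sure this is qemu/kvm; the Bochs/QEMU
     * debugcon device answers a read of port 0xE9 with the value 0xE9. */
    e9_present = running_under_qemu_or_kvm() && inb(0xE9) == 0xE9;
}

void debugcon_putc(char c)
{
    if (e9_present)
        outb(0xE9, (uint8_t)c);
}
```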
10:27:55 <qookie> btw, aren't port reads that go nowhere supposed to return 0xFF?
10:28:03 <doug16k> they will typically yeah
10:28:19 <doug16k> aliasing can make the answer to that "it depends"
10:28:58 <doug16k> it's not uncommon for a device's registers to echo several times over an unnecessarily large range, so something that looks past a device is still in there
10:29:35 <qookie> okay
10:29:47 <qookie> i implemented the check and it seems to still be working fine, so that's good
10:31:17 <qookie> thanks for the answers
10:42:52 <doug16k> port 0xe9 is a great debug output stream because it is so utterly trivial it is almost inconceivable for something to be wrong with it
10:43:14 <doug16k> no matter how hardcore the crash/corruption, it should work assuming you didn't overwrite the code itself
10:44:02 <doug16k> the code itself being, load 3 registers and execute one instruction, it's pretty foolproof
10:47:29 <doug16k> under tcg it's practically a direct write() call to the chardev, but it goes through the address translation vectoring thing
10:48:13 <doug16k> under kvm at least the whole string instruction is one vmexit regardless of string length
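(The "load 3 registers and execute one instruction" path, as a minimal GCC inline-asm sketch; debugcon_write is an illustrative name, and it assumes the e9_present check from the earlier sketch has already passed.)

```
#include <stdint.h>

static void debugcon_write(const char *buf, unsigned long len)
{
    /* DX = port, RSI = buffer, RCX = count, then one instruction;
     * under KVM the whole string is a single vmexit. */
    __asm__ __volatile__("rep outsb"
                         : "+S"(buf), "+c"(len)
                         : "d"((uint16_t)0xE9)
                         : "memory");
}
```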
10:48:45 <qookie> in the earlier stages of development my kernel crashes were so severe that all i got were single nonsensical characters after the point it broke at
10:49:16 <doug16k> ya, before you get memory under control, it's pretty disastrous :)
10:49:38 <doug16k> I've had a couple of extremely bad memory corruption bugs. made me wonder how it worked at all
10:49:59 <qookie> i mean, the latest crash like that occurred not too long ago, my memory allocator was returning a block in the middle of the process table and other data
10:50:15 <doug16k> same pages were mapped in more than one place and it somehow worked enough to make it hard to diagnose
10:50:30 <qookie> and somehow it broke so much that it caused the code to overwrite the kernel binary
10:50:48 <qookie> since the strings it printed came out corrupted
10:50:54 <doug16k> do you have a stack guard page?
10:50:58 <qookie> no
10:51:02 <doug16k> you should have a page it runs into if the stack runs away
10:51:08 <doug16k> if not you will keep going until you overwrite code
10:51:56 <doug16k> it will hit it, then it will try to push a page fault frame, bam, double fault, and at least stops before it trashes the entire universe and breaks into a disastrous mess that is impossible to diagnose
10:52:13 <qookie> hmm, sounds like a good idea
10:53:09 <qookie> i'll align the stack to a page boundary and unmap the first page of the stack
10:56:45 <doug16k> for that to truly work, your #DF needs to switch to another emergency stack. on x86_64 it'd be an IST slot, on i386 it'd be a task gate
10:57:38 <doug16k> but it will at least partially work in concert with qemu "-d int" if you wire that up and look for #PF followed by #DF
10:58:03 <doug16k> when that happens, it is almost certainly a kernel stack overrun
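(A sketch of the x86_64 side of what doug16k describes: give #DF its own known-good stack via an IST slot. The struct layouts are the standard ones with trailing fields omitted; the setup_df_ist name and the single-page stack size are illustrative.)

```
#include <stdint.h>

/* Minimal x86_64 TSS and IDT entry layouts (trailing fields omitted). */
struct tss {
    uint32_t reserved0;
    uint64_t rsp[3];     /* ring 0-2 stack pointers */
    uint64_t reserved1;
    uint64_t ist[7];     /* IST1..IST7 */
} __attribute__((packed));

struct idt_entry {
    uint16_t offset_low;
    uint16_t selector;
    uint8_t  ist;        /* low 3 bits select the IST slot, 0 = none */
    uint8_t  type_attr;
    uint16_t offset_mid;
    uint32_t offset_high;
    uint32_t reserved;
} __attribute__((packed));

static uint8_t df_stack[4096] __attribute__((aligned(16)));

void setup_df_ist(struct tss *tss, struct idt_entry idt[])
{
    /* IST entries hold the stack TOP, since stacks grow down. */
    tss->ist[0] = (uint64_t)(df_stack + sizeof(df_stack));

    /* Vector 8 is #DF; pointing it at IST1 means delivery always lands
     * on a known-good stack even when the faulting RSP is garbage. */
    idt[8].ist = 1;
}
```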
10:58:47 <qookie> passing "-no-shutdown -no-reboot" allows inspection of registers and memory at the point of the triple fault since it pauses qemu when one occurs
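(Putting qookie's flags together with doug16k's "-d int", the invocation looks something like this; the kernel image name is hypothetical:)

```
qemu-system-x86_64 -kernel mykernel.elf -d int -no-shutdown -no-reboot
```

"-d int" logs every exception and interrupt, so a #PF immediately followed by a #DF stands out in the log; "-no-shutdown -no-reboot" then leaves the machine paused at the triple fault so you can run "info registers" in the QEMU monitor.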
10:59:35 <doug16k> in my kernel I have a constant `stack_guard_size` and I reserve that much guard pages of address space at both ends of every stack to catch underrun and overrun. I use 64KB for now. it helped a lot
11:00:17 <doug16k> it's just a bunch of not-present pages that are setup so #PF handler can tell it should not have accessed there
11:00:29 <doug16k> don't consume memory other than the PTE's they cover
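(A sketch of that guard-zone layout; vmm_reserve and vmm_map_pages are hypothetical helpers standing in for whatever the kernel's VMM provides, not doug16k's actual API.)

```
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VMM helpers: reserve address space without backing it,
 * and map present pages into part of a reservation. */
extern uintptr_t vmm_reserve(size_t len);
extern void vmm_map_pages(uintptr_t va, size_t len);

#define STACK_GUARD_SIZE (64 * 1024UL)  /* costs PTE slots, not RAM */

void *alloc_kernel_stack(size_t stack_size)
{
    /* guard | usable stack | guard -- all one contiguous reservation */
    uintptr_t base = vmm_reserve(STACK_GUARD_SIZE + stack_size +
                                 STACK_GUARD_SIZE);

    /* Back only the middle; both guard zones stay not-present, so an
     * overrun or underrun faults on the first touch instead of silently
     * trashing whatever happens to live next door. */
    vmm_map_pages(base + STACK_GUARD_SIZE, stack_size);

    /* Hand back the TOP of the usable region: the stack pushes down. */
    return (void *)(base + STACK_GUARD_SIZE + stack_size);
}
```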
11:00:51 <qookie> i feel 64KB is a bit of overkill (my kernel and irq stacks are both 16KB)
11:01:03 <doug16k> it isn't using 64KB of memory
11:01:19 <qookie> i know, but i feel that big of a hole is overkill
11:01:23 <doug16k> it is reserving a no-go zone of 64KB where pages would be, but aren't, because they are not present guard pages
11:01:30 <qookie> yeah i understand
11:01:34 <doug16k> I have 128TB to play with
11:01:55 <qookie> i'm in 32bit mode without PSE for now
11:01:55 <w1d3m0d3> uh is that of one physical page or
11:01:56 <doug16k> think I'm worried about spending 64KB to get correctness and fewer memory corruptions? sign me up!
11:02:19 <doug16k> 64KB = 16 PTEs = 128 bytes of overhead
11:02:47 <doug16k> 256 bytes per kernel stack of PTEs
11:03:38 <doug16k> use less then; a 4KB page at the lowest address is the minimum
11:04:07 <doug16k> I assume you know, but just in case, don't forget that you need to make the 1st page of the stack not present, don't do the last page
11:04:14 <qookie> yeah i know
11:04:20 <w1d3m0d3> wait why
11:04:23 <doug16k> onlookers :)
11:04:39 <doug16k> push down stack. when the stack overruns it is because the stack pointer pushed down to too LOW of an address
11:04:42 <qookie> this conversation made me hastily check that i actually set esp to stack+stacksize
11:05:13 <w1d3m0d3> I remember that one time I put my stack in the middle of my code, fun times...
11:05:33 <w1d3m0d3> had me debugging for hours
11:05:39 <qookie> and i just realised that my irq stack actually points at the end of the designated irq stack area and possibly tramples the normal stack
11:05:42 <w1d3m0d3> before i noticed one of the instruction encodes to a return address
11:06:21 <doug16k> it's fun when people accidentally put the stack pointer at the beginning of the stack memory block, so it never even touches the stack memory, it just starts trashing whatever is before the stack block :)
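(The bug doug16k mentions, reduced to two lines; variable names are illustrative:)

```
#include <stdint.h>

static uint8_t stack[16384] __attribute__((aligned(16)));

/* wrong: the very first push writes *below* the block */
uintptr_t esp_wrong = (uintptr_t)stack;

/* right: start at the top; pushes then move down through the block */
uintptr_t esp_right = (uintptr_t)stack + sizeof(stack);
```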
11:06:56 <qookie> i'm surprised it actually worked
11:07:13 <qookie> probably because everything happens in userspace apart from interrupt handling, so the irq stack is always used
11:07:42 <w1d3m0d3> hm I should start drawing memory maps in reverse so that they align with code
11:12:12 <qookie> okay, implemented the guard pages and it seems to work
11:12:31 <qookie> writing into the guard pages yields a special message before the regular kernel panic
11:13:29 <qookie> looking at my paging code makes me really want to refactor it
11:13:40 <doug16k> nice
11:13:55 <qookie> i'll leave it for later
11:13:55 <doug16k> it helps
11:14:01 <qookie> i don't want to do it right now
11:14:12 <qookie> it's 00:13 and i'm kinda tired
11:14:23 <doug16k> ya I think everyone wants to refactor their paging code once a bunch of stuff works :)
11:14:37 <doug16k> at first it's "oh please work"
11:15:39 <doug16k> diagnosing bad paging can be very difficult, depending on the bug and how rare it occurs
11:15:40 <klange> Either you're dying to refactor, or you never want to touch any of it ever again.
11:16:08 <doug16k> or that
11:18:31 <doug16k> I'm in a weird predicament: I did the bootloader's page table manipulation nicer than my kernel's, because I had learned so much from doing kernel paging that by the time I added uefi and stuff to my bootloader, my approach was way better
11:18:55 <doug16k> now I have to migrate that design into the kernel which has a much wider API interface
11:19:43 <doug16k> the bootloader's old page table manipulation code was so concise it was easy to replace it with a drastically better implementation
11:20:44 <qookie> I generally want to rewrite the paging code because I wrote it when I knew less about paging than I do now
11:21:26 <qookie> and I constantly use hacks like: "set_cr3(different_page_directory); map_page(...); set_cr3(kernel_page_directory);" to map into a different page directory
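(One common alternative to that set_cr3() round-trip is to edit the foreign tables through a temporary mapping in the current address space. A sketch for the two-level non-PAE 32-bit paging qookie mentioned; temp_map/temp_unmap are hypothetical scratch-PTE helpers, and it assumes the target page table already exists.)

```
#include <stdint.h>

/* Hypothetical helpers: map one physical page at a fixed scratch virtual
 * address in the CURRENT address space, and unmap it again. */
extern void *temp_map(uintptr_t phys);
extern void temp_unmap(void *va);

/* Map virt -> phys in another page directory without touching CR3. */
void map_page_in(uintptr_t pd_phys, uint32_t virt, uint32_t phys,
                 uint32_t flags)
{
    uint32_t *pd = temp_map(pd_phys);
    uint32_t pde = pd[virt >> 22];          /* top 10 bits: PD index */
    temp_unmap(pd);

    uint32_t *pt = temp_map(pde & ~0xFFFu); /* PDE holds the PT phys addr */
    pt[(virt >> 12) & 0x3FF] = (phys & ~0xFFFu) | flags;
    temp_unmap(pt);
}
```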
11:22:15 <doug16k> also, my bootloader does large block I/O's better, lol. when you do a large read there the fs driver makes an I/O plan that makes a list of partial sector aligned block ranges to transfer into memory
11:22:43 <doug16k> partial sectors or aligned block ranges*
11:24:01 <doug16k> I have to migrate that concept up into the kernel fs and block driver interfaces too
11:24:57 <doug16k> but, the bootloader is calling int 13h garbage and kernel is doing ahci/nvme/xhci/whatever proper dma into a cache, so...
11:25:26 <qookie> I want to start working on a device manager and VFS server but I keep finding bugs with what I currently have implemented
11:25:49 <qookie> which sometimes take days and multiple bugfixes fixing bugs in bugfixes to completely fix
11:27:43 <qookie> like recently I found a bug where the kernel was freeing buffers that were in use, causing them to be allocated and corrupted somewhere else
11:28:30 <qookie> and I introduced that bug as a fix to a memory leak when an inter-process message failed to send
11:29:41 <doug16k> sounds about right. that's good progress
11:30:02 <qookie> anyway, I'm gonna go now
11:30:06 <qookie> it was nice talking to you
11:30:14 <qookie> see ya
11:30:14 <doug16k> you too