channel logs for 2004 - 2010 are archived at http://tunes.org/~nef/logs/old/ ·· can't be searched
#osdev2 = #osdev @ Libera from 23may2021 to present
#osdev @ OPN/FreeNode from 3apr2001 to 23may2021
all other channels are on OPN/FreeNode from 2004 to present
http://bespin.org/~qz/search/?view=1&c=osdev2&y=22&m=6&d=13
00:00:00 <doug16k> it suddenly has essentially infinite decode bandwidth in loops
00:00:00 <mrvn> if it can fit it in the cache
00:00:00 <doug16k> yeah but that is like 1000+ insns now
00:01:00 <doug16k> it's amazing
00:01:00 <mrvn> why is the simple "loop" slower than if it would just stream the decoded opcodes?
00:02:00 <doug16k> loop is deliberately bad. I have been avoiding loop since pentium
00:03:00 <doug16k> the story I heard was, it is used in delay loops, so intel deliberately kept it bad
00:04:00 <doug16k> since pentium, the simpler instructions are faster
00:04:00 <doug16k> that is a microcoded instruction
00:05:00 <mrvn> count-down loops are such a common thing the cpu really should have a fast opcode for them.
00:05:00 <doug16k> well they kind of do - there is macroop fusion that fuses dec jnz to one op
00:06:00 <mrvn> doug16k: dec + jnz == loop, makes no sense
00:06:00 <doug16k> it's just super flexible so you just use the existing test inc dec add sub whatever followed by branch to encode one fused op
00:06:00 <mrvn> it's not like a delay loop would work with all the MHz changes the cpu went through
00:07:00 <moon-child> what I heard is that loop is bad because it has to deal with being interrupted
00:07:00 <moon-child> you decrement rcx, then potentially jump somewhere; if that somewhere isn't mapped, you #pf, but you also need to restore the previous value of rcx
00:07:00 <doug16k> you are right, by now it doesn't even slow it down that bad if it was serializing per iteration. it gradually became almost irrelevant how fast it is
00:07:00 <mrvn> like an IRQ between dec and jnz?
00:07:00 <moon-child> mrvn: no, you can be atomic wrt that
00:08:00 <mrvn> ahh, #pf. yeah, that might be tricky.
00:08:00 <mrvn> With dec + jnz the #pf would have an IP pointing at the jump.
00:08:00 <moon-child> yep
00:09:00 <doug16k> interrupts are serializing, so it will wait until everything is retired, then start the interrupt
00:09:00 <mrvn> It's just so annoyingly complex to use dec+jnz. Adds 6 opcodes to the 6 the loop itself is.
00:10:00 <mrvn> or 4 actually, dec only changes OF, not CF
00:10:00 <moon-child> yeah, no problem with interrupts. Just exceptions from the loop insn itself
00:12:00 <mrvn> it's stupid that x86 always changes the flags. Other archs have a bit for that.
00:12:00 <moon-child> agreed
00:13:00 <doug16k> because x86 instructions are small
00:13:00 <doug16k> other arch have wasted bits they want to use
00:13:00 <mrvn> although dec+jnz needs to change the flags. :(
00:14:00 <moon-child> well. Ideally you could ask dec to only set the zero flag
00:14:00 <moon-child> doug16k: what if it were a prefix you could apply, optionally, only where you could profit?
00:14:00 <doug16k> you can use lea to inc dec
00:14:00 <doug16k> no flag change at all
00:14:00 <doug16k> then what :D
00:15:00 <moon-child> then it wouldn't waste bits
00:15:00 <mrvn> moon-child: dec is so nice and keeps the CF flag. But it overwrites OF.
00:15:00 <doug16k> no flag inc rax -> lea 1(%rax),%rax
00:15:00 <moon-child> yes but we want only-zero-flag inc rax
00:16:00 <doug16k> I know - then what. how do you branch or not. you could jmp *%xxx and cmov to that
00:16:00 <mrvn> doug16k: I need something that jcc can use that leaves CF/OF alone.
00:16:00 <doug16k> then you need to wreck flags lol
00:17:00 <mrvn> hence the wish to set only the zero flag
00:17:00 <doug16k> I think you should try setc then cmp that reg at the top to get it back
00:17:00 <doug16k> it should overlap your loop inc jcc for free
00:17:00 <mrvn> doug16k: I need OF, dec doesn't change CF
00:18:00 <doug16k> oh
00:19:00 <mrvn> Or maybe the whole idea of using adcx/adox for additions is crazy and it's only useful for multiplications.
00:19:00 <doug16k> if you did seto then neg that, then at top, cmp that with something that makes it OF or not?
00:21:00 <moon-child> mrvn: I think the original impetus was for crypto. Not gonna have branches in crypto code anyway (not constant time), and you know the sizes, so you can just fully unroll
00:21:00 <doug16k> ah crap - trashes CF. I see the problem
00:21:00 <mrvn> moon-child: you aren't going to unroll a 4096-bit key.
00:22:00 <doug16k> can't use lahf sahf ?
00:22:00 <moon-child> use neg to set carry?
00:22:00 <mrvn> "It is valid in 64-bit mode only if CPUID.80000001H:ECX.LAHF-SAHF[bit 0] = 1." Not sure.
00:23:00 <mrvn> moon-child: carry is unaffected by "dec"
00:23:00 <doug16k> moon-child, the idea was to make "1" from "seto" to become all 1's so it is negative, then you can cmp that with something that makes OF be set
00:24:00 <doug16k> I didn't sit down and figure out the polarity, but it's useless because it would ruin CF we are trying to carry around the loop
00:24:00 <mrvn> doug16k: adox -1, reg should do
00:26:00 <mrvn> lahf works here
00:27:00 <doug16k> 3950x supports it
00:27:00 <doug16k> yours is intel?
00:28:00 <mrvn> that would make it: sahf; N times 6 opcodes; lahf; 4x lea; dec; jnz
00:28:00 <mrvn> amd
00:28:00 <mrvn> and whatever compiler explorer has
00:33:00 <mrvn> I should start measuring. It might all be moot because the limiting factor is ram.
00:36:00 <moon-child> most people have no use for numbers so large they don't fit in cache :P
00:36:00 <doug16k> you should unroll it enough that the iteration count is low enough for the predictor to learn the pattern
00:36:00 <doug16k> for the final branch
00:36:00 <doug16k> if you can
00:37:00 <doug16k> if you can get it to speculate right into the return, can't beat that
00:40:00 <doug16k> then it won't speculate into excessive iterations and not realize until the last dependency chain completely retires
00:41:00 <moon-child> isn't 'learning the pattern' mainly a function of whether your bigints actually have predictable sizes, which is an application concern?
00:44:00 <doug16k> if a loop branch is taken a predictable number of times, and it's not too many, it can correctly predict the final not taken and speculate correctly and return and start speculating correctly there, instead of it speculating into the weeds and not realizing until the very last iteration retires
00:44:00 <doug16k> then flushing pipeline and starting all over
00:45:00 <doug16k> it can learn taken taken taken taken ... not taken and be perfect
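The "taken taken taken ... not taken" behaviour doug16k describes can be illustrated with a toy model. The sketch below is not how any real CPU predicts branches; it assumes a single table of 2-bit saturating counters indexed by the last `history_bits` outcomes of one loop branch, and the function name and parameters are made up for this illustration.

```python
def loop_exit_mispredicts(trip_count, history_bits, calls=200):
    """Count mispredicts of a loop-closing branch during the second half of `calls` runs."""
    counters = {}                      # history pattern -> 2-bit saturating counter (0..3)
    history = 0                        # most recent outcomes, newest in bit 0
    mask = (1 << history_bits) - 1
    mispredicts = 0
    for call in range(calls):
        for i in range(trip_count):
            taken = i < trip_count - 1             # the last iteration falls through
            counter = counters.get(history, 2)     # unseen patterns start at "weakly taken"
            predicted_taken = counter >= 2
            if call >= calls // 2 and predicted_taken != taken:
                mispredicts += 1                   # only count after warm-up
            counters[history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
            history = ((history << 1) | int(taken)) & mask
    return mispredicts


short_loop = loop_exit_mispredicts(trip_count=4, history_bits=8)
long_loop = loop_exit_mispredicts(trip_count=16, history_bits=8)
print(short_loop, long_loop)
```

In this model the 4-iteration loop is predicted perfectly after warm-up, while the 16-iteration loop mispredicts its exit once per call: with only 8 bits of history, the last eight outcomes are all "taken" in the second half of the loop, so the exit looks the same as the iterations before it. That is the toy version of the argument for unrolling until the trip count fits the history.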
00:46:00 <mrvn> doug16k: the problem is that this will often be called from "mul", which recursively divides. So you get size 2, 4, 8, 16, 32, ...
00:46:00 <doug16k> if you are lucky, the branch history that leads to that can make it have another separate history remembered for that size
00:47:00 <mrvn> Sizes 2, 4, 8 should probably be special and fully unrolled. And then 16, 32, ... as loops
00:48:00 <doug16k> there can be one branch that mispredicts, but it's right from then on, because that is a different branch history, and the following stuff is using different history values
00:48:00 <doug16k> and it's right from then on for a while
00:48:00 <doug16k> the sequence of taken/not taken that have recently occurred cause it to select a different set of history memories
00:49:00 <doug16k> you hope
00:49:00 <doug16k> there is aliasing that can cancel it
00:49:00 <doug16k> usually it works
00:52:00 <doug16k> for example, if you had if (debug_enabled) print("stuff"); repeatedly, then when that is taken or not, it uses different history memory for the following branches, and learns the if is always false or if is always true pattern
00:52:00 <mrvn> One hope with the tiny loop is that if the loop runs 1024 times the one mispredict at the end is irrelevant.
00:52:00 <doug16k> and all the following ifs that go the same way predict correctly
00:53:00 <mrvn> If you unroll then it's taken less often and might be worse in predicting
00:53:00 <doug16k> I mean, even if you flicked debug_enabled on and off, the first if's mispredict would cause it to select the other branch histories and predict the rest correctly
00:54:00 <doug16k> the mispredict at the end isn't irrelevant if it would have speculated into a load in the caller, and got started on it sooner
00:54:00 <doug16k> way sooner
00:57:00 <doug16k> if you keep it speculating well, it doesn't matter what the instructions do very much, it will decompose everything into a dataflow graph and start everything asap
01:00:00 <doug16k> I would think of it as the pipeline having excessive integer execution pipelines, and one thread can't even saturate them with realistic instructions
01:02:00 <doug16k> I picture the carry dependency chain to be the determining factor, and everything else goes through for free in other execution units
01:02:00 <doug16k> and they have no effect
01:09:00 <mrvn> doug16k: now rethink that again with all cores running the same loop
01:12:00 <doug16k> it must be an epic amount of adding to have two dependency chains back to back on a modern amd
01:12:00 <doug16k> why not avx2?
01:12:00 <mrvn> doug16k: because it has no addx. Getting the carry is complex.
01:13:00 <doug16k> yeah but you can probably get so many carries at once that it isn't bad
01:13:00 <mrvn> only 4
01:15:00 <doug16k> ok, now do 2 dependency chains of that like you did with adc/adx
01:15:00 <doug16k> that would be insane
01:18:00 <doug16k> carry is just compare less than, then subtract that from destination and it will subtract -1 or 0, adding one or not
01:19:00 <doug16k> unsigned
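The compare-then-subtract trick for recovering a carry without a flags register can be sketched in scalar form. This is only a model of the idea, assuming 64-bit limbs stored least significant first; it walks the limbs sequentially and does not attempt the cross-lane carry propagation that makes the AVX2 version hard. `add_limbs` and `to_int` are names invented for this example.

```python
MASK64 = (1 << 64) - 1


def add_limbs(a, b):
    """Add two equal-length little-endian lists of 64-bit limbs; return (limbs, carry_out)."""
    out = []
    carry_mask = 0                        # 0 or all-ones (-1), like a SIMD compare result
    for x, y in zip(a, b):
        s = (x + y) & MASK64
        wrapped = s < x                   # unsigned "sum < operand" means the add wrapped
        t = (s - carry_mask) & MASK64     # subtracting -1 adds one, subtracting 0 adds nothing
        wrapped = wrapped or t < s        # the incoming carry itself may wrap the limb
        out.append(t)
        carry_mask = MASK64 if wrapped else 0
    return out, int(carry_mask != 0)


def to_int(limbs):
    """Convert little-endian 64-bit limbs to a Python integer for checking."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


limbs, carry = add_limbs([MASK64, MASK64], [1, 0])
print(limbs, carry)
```

Keeping the carry as an all-ones mask instead of masking it down to 1 avoids the extra AND that mrvn mentions compilers emitting.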
01:19:00 <mrvn> gcc/clang are a bit stupid there. They compare and then mask it with 1 so they can later add it.
01:20:00 <moon-child> can probably do as many as 4 at once
01:22:00 <mrvn> moon-child: but will it be faster than doing them in sequence?
01:22:00 <moon-child> try it and see
01:22:00 <doug16k> I think you will get the adc and adx going through in the same cycle and everything else is nothing
01:22:00 <doug16k> assuming cache hits
01:22:00 <mrvn> mixing adc + avx is another idea.
01:23:00 <moon-child> I think adc will steal execution ports from avx
01:23:00 <doug16k> why?
01:23:00 <moon-child> so if avx is faster, there'd be no point
01:24:00 <doug16k> I don't think it using the avx opcode space means it uses fpu pipelines
01:24:00 <doug16k> if it did then there would be bypass delays that aren't mentioned in bmi
01:24:00 <mrvn> but if I only need to add 5 numbers I can do avx for 4 and adc for the last.
01:24:00 <moon-child> doug16k: yeah but it's int ops
01:25:00 <moon-child> don't scalar and vector int ops use the same ports?
01:25:00 <doug16k> amd has completely decoupled integer and float
01:25:00 <moon-child> there's no float here though
01:25:00 <doug16k> avx would be
01:26:00 <doug16k> I think we agree anyway
01:26:00 <mrvn> is avx with integer a float or int operation?
01:26:00 <moon-child> if you use avx to do bigint addition, then you're doing int ops
01:26:00 <doug16k> adx is a 3-operand instruction isn't it?
01:26:00 <moon-child> 'I think we agree anyway' maybe :P
01:26:00 <doug16k> it's floating point pipelines if integer avx yeah
01:27:00 <doug16k> they obviously don't have 256 bit alus in the integer ones
01:27:00 <doug16k> and no gigantic registers
01:27:00 <moon-child> oh hmmm
01:27:00 <moon-child> mrvn: what if you just repeat the adox?
01:27:00 <doug16k> if one thread was avx and other was integer, it would be glorious
01:28:00 <moon-child> wait no that doesn't make sense
01:28:00 <mrvn> moon-child: then I can just adc, it's a shorter opcode
01:28:00 <moon-child> no I meant from the previous iteration
01:28:00 <moon-child> but that doesn't work
01:29:00 <doug16k> avx can do a hell of a lot fewer loads/stores
01:29:00 <doug16k> that alone is huge
01:29:00 <mrvn> doug16k: fewer opcodes, same volume.
01:29:00 <mrvn> if you are waiting for memory it doesn't matter
01:29:00 <doug16k> you can fit way more bandwidth into the same amount of reorder buffer slots
01:30:00 <doug16k> speculate further
01:30:00 <moon-child> mrvn: a simd load and a scalar load have the same throughput
01:30:00 <doug16k> one load can be a byte or 256 bits. which one is faster
01:30:00 <moon-child> in terms of # loads/cycle
01:30:00 <moon-child> but the former does a lot more work
01:30:00 <moon-child> (ditto store)
01:30:00 <mrvn> both take 200 cycles to fetch a cache line from memory
01:31:00 <moon-child> if you're waiting for memory then nothing else matters anyway
01:31:00 <mrvn> that's what I said.
01:31:00 <moon-child> but you want to optimise
01:31:00 <moon-child> optimisations only matter when you don't hit memory. So focus on that case
01:32:00 <doug16k> you said it was AMD so that means it is going to hit the cache all the time, unless you have more than the huge L3 of bigints
01:32:00 <doug16k> you have gigantic caches
01:33:00 <mrvn> That totally depends on the size of the Bignums you have. If the numbers have a million bits then cache becomes a bit limited.
01:34:00 <moon-child> then you're bandwidth limited
01:35:00 <mrvn> If you do 4 AVX streams in parallel that's 32MBit or 4MB of data for a million bit numbers.
01:35:00 <moon-child> memory is p fast
01:35:00 <doug16k> one CCX L3 is 16MB
01:35:00 <mrvn> 6MB for a + b instead of a += b
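The 4MB and 6MB figures can be reproduced with a little arithmetic. The breakdown below is one reading of mrvn's numbers, not something stated in the log: it assumes four independent streams, four 64-bit lanes per 256-bit register (so 16 additions in flight), decimal megabytes, and two operands for `a += b` versus three for `a + b`.

```python
def working_set_bytes(bits_per_number, streams, lanes, operands):
    """Bytes touched when every lane of every stream works on its own bignum addition."""
    return bits_per_number * streams * lanes * operands // 8


in_place = working_set_bytes(1_000_000, streams=4, lanes=4, operands=2)   # a += b
three_op = working_set_bytes(1_000_000, streams=4, lanes=4, operands=3)   # c = a + b
l3_per_ccx = 16 * 1024 * 1024                                             # doug16k's 16MB figure

print(in_place, three_op, three_op <= l3_per_ccx)
```

Under these assumptions both cases still fit in a 16MB L3 slice, which is the point doug16k is making about the cache sizes involved.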
01:35:00 <gamozo> what? memory is so slow!
01:36:00 <doug16k> guessing which gen though
01:36:00 <heat> i know some of these words
01:36:00 <heat> computer go brrr
01:37:00 <moon-child> gamozo: bandwidth
01:37:00 <gamozo> that's fair!
01:37:00 <doug16k> if you know for a fact that it won't fit in the cache, then you should be using non-temporal loads
01:37:00 <doug16k> and stores
01:37:00 <moon-child> ^ that too
01:37:00 <gamozo> memory bandwidth got so much better
01:37:00 <gamozo> tbh, non-temporal is kinda spotty? I've yet to find many good situations for it, even with streaming writes
01:37:00 <gamozo> I don't understand computers
01:38:00 <zid> NT's just very likely to make things worse unless it DEFINITELY makes them better
01:38:00 <zid> because of prediction and caches and stuff
01:38:00 <zid> it's just hard to use in real programs
01:38:00 <moon-child> a colleague recently worked out how to use nt stores in matrix multiplication
01:38:00 <gamozo> the main issue is that for _most_ compute you can batch results and keep them in cache rather than going to non-temporal memory
01:39:00 <zid> Like, when was the last time you did a prefetchw
01:39:00 <heat> is nt defined if you use nt and non-nt accesses?
01:39:00 <moon-child> haven't implemented it yet, but I made a model of it. Didn't seem to help. But there are second order effects
01:39:00 <doug16k> gamozo, yeah, it has to be a perfect use case for it to win. everyone uses the data too soon after and that makes it look awful
01:39:00 <zid> and NT is arguably harder
01:39:00 <moon-child> heat: I think you can get either the stored value or the previous value
01:39:00 <heat> zid, most prefetchws are wrong anyway
01:39:00 <zid> yup
01:39:00 <moon-child> and if you do a write, not specified which one gets written
01:39:00 <moon-child> until the next sfence
01:41:00 <doug16k> gamozo, if you know that your multiword adc chain is over 16MB though
01:41:00 <gamozo> :gasp:
01:41:00 <gamozo> That fits in l3!
01:41:00 <zid> Not in my l3 :(
01:41:00 <gamozo> :(
01:41:00 <zid> I have 10M or 12M available
01:42:00 <zid> unless I figure out how to do dual socket with a pair of 1xxx xeons
01:42:00 <zid> of different skus
01:42:00 <doug16k> my gen is 16MB per 4 core CCX, so 64MB total
01:42:00 <gamozo> I just got new procs with 48 KiB of l1 and it's HOT
01:43:00 <zid> did you get the one with 768MB of L3 yet
01:43:00 <zid> Imagine paying £7000 for a 3.8GHz turbo cpu
01:44:00 <doug16k> what cpu is that? some epyc variant?
01:44:00 <zid> 7373x
01:44:00 <gamozo> I only have epycs in my storage server, I need avx-512 :(
01:44:00 <zid> no you don't
01:44:00 <gamozo> YES I DO!
01:44:00 <doug16k> gamozo, zen4 will have it
01:44:00 <doug16k> soon
01:44:00 <zid> unless you happen to be doing *exactly* an avx-512 load on that cpu, all day, you don't :P
01:45:00 <gamozo> zen4 won't have it right?
01:45:00 <gamozo> they only will have the 16-bit float stuff, but not even AVX-512F
01:45:00 <gamozo> it's gonna be a scuffed implementation I bet
01:45:00 <gamozo> (at least, that's how I read it)
01:45:00 <gamozo> They've been really slippery on answering questions
01:46:00 <doug16k> yeah I am just going by rumor bs
01:46:00 <zid> avx-512 is a whole family of shit
01:46:00 <zid> so god knows what you'd get even if they did add it
01:46:00 <gamozo> yeah
01:46:00 <doug16k> I am half expecting it to be 2 256-bit ops, but the mask regs would help, if it had avx512f
01:46:00 <gamozo> I mainly want avx-512f and avx-512bw
01:47:00 <gamozo> the mask regs are largely what I want, but I really would like all 512 bits
01:47:00 <mrvn> If your Bignum is 16MB then the adc chain will read 32MB and write 16MB.
01:47:00 <doug16k> it's a 3-operand add?
01:48:00 <mrvn> doug16k: frequently.
01:48:00 <doug16k> write allocate could make one load free
01:49:00 <mrvn> If you multiply then some of the sub-terms you need multiple times. So you need a non-destructive add
01:49:00 <mrvn> But with a+b = b+a you can probably shuffle stuff around a lot to use 2-operand add a lot.
01:52:00 <doug16k> I think CPUs are unnecessarily fast already
01:52:00 <doug16k> to the extreme
01:53:00 <zid> play dwarf fortress and say that
01:53:00 * mrvn throws a 6502 at doug16k
01:53:00 <doug16k> try portal on 2K@165Hz. it's so smooth and perfect, it's distracting
01:54:00 <doug16k> every time I do a 180 I am like "whoa that was sooo smooth... geez"
01:54:00 <zid> framerate is a hell of a drug
01:56:00 <gamozo> 165hz? where do you get 165hz monitors?
01:56:00 <gamozo> that's new to me~
01:56:00 <heat> stores
01:56:00 <heat> you even have 240hz ones lol
01:56:00 <heat> also 360hz as well I think
01:57:00 <doug16k> https://www.amazon.ca/gp/product/B08LZPXD4C/ref=ppx_yo_dt_b_asin_image_o08_s00?ie=UTF8&psc=1
01:57:00 <bslsk05> www.amazon.ca: LG UltraGear 32GN600-B 32 Inch(31.5) QHD VA 5ms with 1ms MBR 144Hz 165Hz Gaming Monitor AMD FreeSync, Black : Amazon.ca: Electronics
01:57:00 <doug16k> almost cheap now
01:58:00 <doug16k> it's about 4GB/s just to do the scanout for dual monitor
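The scanout figure can be sanity-checked with a back-of-the-envelope calculation. The inputs here are assumptions rather than facts from the log: both monitors are taken to be 2560x1440 at 165Hz with 4 bytes per pixel, and blanking intervals are ignored.

```python
def scanout_bytes_per_second(width, height, refresh_hz, bytes_per_pixel=4):
    """Raw framebuffer bytes read per second to refresh one display."""
    return width * height * bytes_per_pixel * refresh_hz


one_monitor = scanout_bytes_per_second(2560, 1440, 165)
two_monitors = 2 * one_monitor
print(one_monitor / 1e9, two_monitors / 1e9)
```

With these inputs a single display needs a bit under 2.5 GB/s and two need a bit under 5 GB/s, the same ballpark as the "about 4GB/s" quoted.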
01:59:00 <zid> Give me my ramdacs back, dammit
01:59:00 <zid> I don't care about 4k I want 1080 and a ramdac
02:03:00 <doug16k> I realized something funny yesterday. if I put a word processor on zoom to whole page, then the page on the screen is more than 8.5x11 lol
02:03:00 <zid> What's that in ISO
02:03:00 <doug16k> A4
02:03:00 <doug16k> I think
02:03:00 <zid> I think my monitor's about A4 tall
02:04:00 <zid> but i'm not sure it has amazing dpi
02:04:00 <zid> 72?
02:04:00 <zid> 112? I forget and I'm lazy
02:04:00 <zid> I'd have to do trig
02:04:00 <doug16k> sounds right
02:04:00 <doug16k> mine is 96 or near that IIRC
02:04:00 <doug16k> 2K
02:05:00 <zid> H = 23", whatever angle 16/9 makes.. err.. something something, dpi.
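The trig zid is avoiding comes down to one hypotenuse. A minimal sketch, assuming zid's panel is 1920x1080 on a 23" diagonal (the resolution is a guess) and doug16k's is 2560x1440 on the 31.5" diagonal from the product listing; the function names are illustrative.

```python
import math


def dpi(h_pixels, v_pixels, diagonal_inches):
    """Pixels per inch along the diagonal, which equals the horizontal and vertical DPI for square pixels."""
    return math.hypot(h_pixels, v_pixels) / diagonal_inches


def panel_height_inches(h_pixels, v_pixels, diagonal_inches):
    """Physical height of the visible panel."""
    return diagonal_inches * v_pixels / math.hypot(h_pixels, v_pixels)


print(round(dpi(1920, 1080, 23)), round(dpi(2560, 1440, 31.5)))
print(round(panel_height_inches(1920, 1080, 23), 2))
```

Under those assumptions both panels land in the mid-90s DPI range, and the 23" panel is a little over 11 inches tall, slightly shorter than an A4 sheet (11.69 inches).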
02:07:00 <doug16k> 2K is too much for 31", so I had to cancel out some of the extra room anyway, just slightly better font rendering, mostly
02:08:00 <doug16k> cancel out with increasing font size I mean
02:08:00 <zid> yea, I'd like that, given I can't turn off truetype
02:08:00 <zid> I had an anti-aliased font hook program at one point but it was pretty unreliable
05:35:00 <Jari--> hi all
05:36:00 <klys> hi jari--
07:17:00 <Jari--> klys: so hows OS business
07:18:00 <Jari--> All kernels seem to have this file system, even drivers access the root file system with open close read write lseek.
07:18:00 <Jari--> I am still manually poking with read/write block, getsize, etc.
07:19:00 <Jari--> vfs
07:24:00 <Jari--> I sometimes wonder what parts of the kernel should be using the internal LIBC and what parts should have direct access.
07:24:00 <Jari--> Drivers for example would probably be better with using internal device API.
08:33:00 <mrvn> Jari--: Join us in the microkernel world. None of them should have direct access.
08:36:00 <heat> D:
08:39:00 <mrvn> Jari--: If you are talking about firmware loading then maybe rethink the approach. Supply the firmware blob from userspace like Linux does. If you are talking about FSes then they kind of need block read/write but that usually should go through the block cache and have some protection against writing outside the partition the FS is on.
08:39:00 <mrvn> or the FS is on raid or lvm and needs to access a virtual device.
08:39:00 <heat> no it's defo not firmware loading
08:40:00 <heat> you're overthinking this :P
08:41:00 <mrvn> heat: what other than firmware loading would access files?
08:41:00 <heat> who said anything about accessing files
08:41:00 <mrvn> open close read write lseek.
08:41:00 <heat> he's talking about the vfs
08:41:00 <heat> also seems confused
08:41:00 <heat> very unclear question
08:45:00 <Jari--> I want to run my file system driver on Linux text console, thats why I am thinking of adding some features to my VFS.
08:45:00 <heat> what's the linux text console, to you
08:45:00 <heat> what features are you lacking
08:45:00 <Jari--> a terminal
08:46:00 <heat> the terminal is just a pipe of text
08:46:00 <Jari--> lots of non-POSIX dependencies
08:46:00 <heat> user process reads, user process writes
08:46:00 <heat> that's how the terminal works
08:46:00 <heat> the kernel just displays it
08:46:00 <Jari--> heat: console instead of virtual machine
08:47:00 <heat> oh so you want to run a driver as a userspace program?
08:47:00 <Jari--> heat: yes
08:47:00 <heat> ok, that's doable
08:47:00 <heat> wrap your internal API into libc functions
08:48:00 <Jari--> heat: my kernel is MS-DOS like, more than a microkernel
08:48:00 <Jari--> although it is linear memory space, non-segmented
08:49:00 <heat> how is it MS-DOS like?
08:49:00 <Jari--> heat: well I wrote API to be MS-DOS compliant
08:49:00 <Jari--> MS-DOS and C applications
08:49:00 <heat> you might be screwed
08:49:00 <Jari--> DJGPP really
08:51:00 <mrvn> You can port your kernel to posix as "hardware", using signals, mmap, mprotect, setitimer, ... to emulate all the hardware stuff. But it's a major undertaking. Or add a qemu-user-your-kernel backend to qemu.
08:52:00 <mrvn> having drivers access the hardware directly will make it basically impossible to do any of it though. You want to go through the API.
09:01:00 <Jari--> Sorry guys, I get migraine attacks so my talking is probably not the most consistent ever right now.
09:02:00 <mrvn> coding with a migraine is a bad idea. makes it worse and produces crap. better sleep it off.
09:05:00 <Jari--> mrvn: I keep rewriting the same functionalities, so it is sort of spaghetti code at worst.
09:05:00 <Jari--> Especially writing interpreters is difficult.
09:06:00 <Jari--> mrvn: I want my OS able to run Commodore Basic token binary programs.
09:06:00 <Jari--> Basically what I am now writing on kernel is it to be Linux like as much as possible.
09:07:00 <Jari--> UN*X OS does not have to be enormous to function, like 386BSD kernel f.e.
09:07:00 <heat> 386BSD was already pretty complex
09:07:00 <heat> same with all the previous BSDs
09:10:00 <Jari--> heat: if I drink coffee, my migraine vaporizes instantly
09:10:00 <Jari--> must be lack of caffeine
11:50:00 <dostoyevsky2> isn't linux just like 250 syscalls?
11:59:00 <heat> 400 and something but yeah
12:00:00 <Mutabah> plus ioctl/etc
12:00:00 <heat> plus ioctl, plus pseudo fses, setsockopts, etc
12:01:00 <heat> and probably more that I can't think of right now :P
12:01:00 <heat> glorified eBPF interpreter? :P
12:07:00 <mrvn> L4 has 6 syscalls
12:07:00 <mrvn> just for comparison :)
12:10:00 <dostoyevsky2> if you have a C program that implements a couple of syscalls, how difficult is it to get that C program boot up in qemu? Do you need to write your own boot code in asm, or could you just reuse something?
12:12:00 <mrvn> use the multiboot format and you can use qemu --kernel mykernel.elf
12:41:00 <heat> dostoyevsky2, there's significant code behind loading a program
12:41:00 <heat> even more significant if you're doing it properly with the vfs and all that
13:10:00 <dostoyevsky2> heat: couldn't you just compile a -fPIE/PIC program and thereby be able to simply load that blob into your memory and just jump to the code without any fancy loading?
13:11:00 <mrvn> with or without -fPIE/PIC makes no difference
13:11:00 <heat> those still need to be loaded
13:11:00 <zid> problem imo is cpu modes
13:11:00 <heat> a PIC program isn't just a blob you can run directly
13:12:00 <mrvn> And you need to setup the C runtime environment, meaning you need a stack.
13:12:00 <dostoyevsky2> if you don't have position independent code you'd need to setup proper virtual memory addresses, no?
13:12:00 <mrvn> dostoyevsky2: -fPIC is not position independent code
13:12:00 <zid> It's just as easy with or without, if you can specify the load address
13:13:00 <zid> I do wonder what you intend to provide the syscalls for though if you're not expecting to be loading stuff properly
13:14:00 <mrvn> and what will the syscalls do without malloc or printf or anything else
13:15:00 <mrvn> dostoyevsky2: you might want to read https://wiki.osdev.org/Barebones
13:15:00 <bslsk05> wiki.osdev.org: Bare Bones - OSDev Wiki
13:34:00 <heat> syscalls are another question
13:34:00 <heat> depends on the syscall, of course ;)
13:36:00 <dostoyevsky2> does tcc -run/libtcc actually create an executable or does it just directly generate the executable in memory and jumps to it?
13:36:00 <heat> idk
13:36:00 <mrvn> dostoyevsky2: depends on what you accept as executable.
13:36:00 <mrvn> is a bash script an executable?
13:37:00 <heat> why are you being obtuse
13:37:00 <heat> it's clearly an ELF
13:37:00 <mrvn> heat: I think he is talking about "#!/usr/bin/tcc -run/libtcc"
13:39:00 <dostoyevsky2> The simplest loadable binary format could be like: https://dginasa.blogspot.com/2012/10/brainfuck-jit-compiler-in-around-155.html
13:39:00 <bslsk05> dginasa.blogspot.com: A Dumb Guy in a Smart Age: Brainfuck JIT Compiler in Around 155 Lines with GNU C
13:41:00 <zid> The simplest is just .com
13:41:00 <zid> but requires.. a loader
13:41:00 <heat> a.out?
13:41:00 <mrvn> coff coff
13:41:00 <heat> yesterday I saw a hobby OS that used a.out
13:42:00 <heat> that was weird
13:42:00 <zid> that is, in fact, weird
13:42:00 <mrvn> in name or for real?
13:42:00 <heat> for real
13:42:00 <heat> a.out loader
13:42:00 <heat> it's very 80s of them
13:47:00 <mrvn> I have the best format of them all: bin
13:47:00 <dostoyevsky2> > #define TCC_OUTPUT_MEMORY 0 /* output will be ran in memory (no output file) (default) */
13:48:00 <dostoyevsky2> tcc -run does not bother with a binary format
13:48:00 <heat> you sure?
13:48:00 <heat> that just means you don't write to a file, you write to memory
13:48:00 <mrvn> and then mprotect it to make it executable and call it
13:51:00 <dostoyevsky2> heat: https://github.com/TinyCC/tinycc/search?q=TCC_OUTPUT_MEMORY
13:51:00 <bslsk05> github.com: Search · TCC_OUTPUT_MEMORY · GitHub
14:00:00 <dostoyevsky2> heat: here for a use case, so no fancy loading necessary as tcc already arranged everything in memory for you to just jump to the code: https://github.com/TinyCC/tinycc/blob/82b0af74501bf46b16bc2a4a9bd54239aa7b7127/tests/libtcc_test.c#L104
14:00:00 <bslsk05> github.com: tinycc/libtcc_test.c at 82b0af74501bf46b16bc2a4a9bd54239aa7b7127 · TinyCC/tinycc · GitHub
14:42:00 <Jari--> I managed to crash lynx browser
14:45:00 <ddevault> this is bizarre
14:45:00 <ddevault> the page at 2000 gets overwritten when I zero out 120000
14:47:00 <j`ey> mapped it twice?
14:48:00 <ddevault> I can't imagine so
14:48:00 <ddevault> it's part of the first 64G I have memory mapped
14:48:00 <ddevault> identity mapped*
14:48:00 <heat> is 0x120000 there?
14:48:00 <ddevault> yep
14:48:00 <heat> info tlb pls
14:49:00 <ddevault> wut
14:49:00 <heat> info tlb
14:49:00 <j`ey> in qemu
14:49:00 <heat> the qemu command
14:49:00 * ddevault tries to remember how to summon the qemu console
14:49:00 <heat> tip: use -monitor stdio so you can easily copy stuff
14:49:00 <ddevault> I use stdio for serial
14:50:00 <j`ey> ctrl-a-c
14:50:00 <j`ey> iirc?
14:50:00 <heat> maybe
14:52:00 <ddevault> massive dump
14:52:00 <heat> pastebin it
14:52:00 <ddevault> overflows my terminal buffer
14:52:00 <j`ey> post it directly into irc
14:52:00 <ddevault> perfect
14:52:00 <heat> lol
14:53:00 <heat> tee it to a file and pastebin that?
14:53:00 <ddevault> trying to figure out how to capture it
14:53:00 <heat> ^^
14:55:00 <ddevault> https://paste.sr.ht/~sircmpwn/04c00875d1844e823bedc338f0b8cdc2c5aec30d
14:55:00 <bslsk05> paste.sr.ht: paste.txt — paste.sr.ht
14:56:00 <ddevault> looks fine to me :<
14:57:00 <heat> where is any of that mapped?
14:57:00 <ddevault> ffffff8000000000: 0000000000000000 --PDA---W
14:57:00 <ddevault> the region with the address being overwritten
14:57:00 <ddevault> ffffff8001200000: 0000000001200000 --P-----W
14:57:00 <ddevault> the region I'm supposed to be writing to
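To see where the two addresses from the `info tlb` output sit in the paging structures, a virtual address can be split into its table indices. This sketch assumes standard 4-level x86-64 paging with 9 index bits per level; the function name is made up for the example.

```python
def split_virtual_address(va):
    """Return (pml4, pdpt, pd, pt, offset) indices for a 4-level x86-64 virtual address."""
    return (
        (va >> 39) & 0x1FF,   # PML4 index
        (va >> 30) & 0x1FF,   # PDPT index
        (va >> 21) & 0x1FF,   # page directory index (each entry spans 2 MiB)
        (va >> 12) & 0x1FF,   # page table index (each entry spans 4 KiB)
        va & 0xFFF,           # byte offset within a 4 KiB page
    )


print(split_virtual_address(0xFFFFFF8000000000))
print(split_virtual_address(0xFFFFFF8001200000))
```

Both addresses share the same PML4 and PDPT entries and differ only in the page directory index (0 versus 9), so they are reached through the same page directory.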
14:58:00 <heat> why and when are you zeroing memory
14:58:00 <ddevault> it's a page table
14:58:00 <heat> are you sure you're not overwriting something you're using by accident
14:59:00 <ddevault> not yet
14:59:00 <ddevault> ruling something else out now
14:59:00 <mrvn> don't map ffffff8000000000 so nullptr mapped to virtual crashes
14:59:00 <ddevault> good call, will follow up on that later
15:00:00 <mrvn> and map text and rodata read-only
15:00:00 <ddevault> yeah, our elf loader is very basic
15:00:00 <ddevault> will improve that later
15:00:00 <heat> huh?
15:00:00 <heat> 0xffffff8000000000 has nothing to do with nullptr
15:01:00 <mrvn> why is anything in lower half still mapped?
15:01:00 <mrvn> heat: nullptr mapped to virtual
15:01:00 <ddevault> got it
15:01:00 <heat> mrvn, what's the problem?
15:01:00 <ddevault> my userspace page allocator was clearing pages regardless of if they are device memory or not
15:02:00 <heat> userspace what
15:02:00 <ddevault> err
15:02:00 <mrvn> heat: it's just one of those addresses you can end up by accident
15:02:00 <ddevault> the code which gives pages to userspace
15:02:00 <heat> :whew:
15:02:00 <heat> mrvn, nothing wrong happens if you touch the null page
15:03:00 <heat> it's a page like any other
15:03:00 <ddevault> there is nothing useful there, though
15:03:00 <mrvn> heat: that's not what this is about.
15:03:00 <ddevault> so might as well take the extra defense against errors
15:03:00 <heat> a real worry would be to have the 0x0 page mapped
15:03:00 <mrvn> although mapping stuff below 1MB is risky
15:03:00 <heat> the page frame? no, not a problem
15:06:00 <heat> risky how?
15:06:00 <mrvn> heat: there is reserved ram and mmio there you shouldn't mess with accidentally.
15:07:00 <heat> if you have a read/write primitive in the kernel you've already won
15:08:00 <heat> and then you're not looking to crash the machine, but to take it over
15:08:00 <mrvn> ddevault: do you have anything that parses the memory map and maps just the available parts of memory?
15:09:00 <heat> also important to note that you'll get a nice speed advantage if you map everything
15:09:00 <mrvn> heat: if you are taking over the machine then you write your own page tables and none of this matters anyway.
15:09:00 <heat> huge pages are fast
15:09:00 <ddevault> mrvn: yes
15:10:00 <mrvn> that depends on your cpu. Some have only a few TLB entries for huge pages.
15:11:00 <mrvn> ddevault: then why is ffffff8000000000 a 2MB page? That shouldn't be available in its entirety.
15:12:00 <ddevault> hm
15:12:00 <ddevault> no clue
15:12:00 <ddevault> https://todo.sr.ht/~sircmpwn/helios/30
15:12:00 <bslsk05> todo.sr.ht: ~sircmpwn/helios#30: Why is ffffff8000000000 mapped as a 2MB page? — sourcehut todo
15:13:00 <heat> because you did it probably
15:13:00 * ddevault shrugs
15:13:00 <ddevault> will investigate later
15:14:00 <heat> in fact, i bet your page table mapping code is interpreting the huge page as a page table and writing to it
15:14:00 <heat> that makes so much sense and explains so much
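A minimal sketch of the failure mode heat is guessing at, not the Helios code. On x86-64, bit 7 (PS) of a PDPT or page-directory entry marks a huge-page mapping, so a walker has to stop there instead of treating the entry's address as a pointer to the next table. The helper names `split_vaddr` and `next_table` are hypothetical.

```python
# x86-64 4-level paging: 9 index bits per level, 4 KiB base pages.
PRESENT = 1 << 0
PAGE_SIZE_BIT = 1 << 7            # PS: entry maps a huge page, not a table
ADDR_MASK = 0x000F_FFFF_FFFF_F000


def split_vaddr(vaddr):
    """Return the (PML4, PDPT, PD, PT) indices of a virtual address."""
    return tuple((vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12))


def next_table(entry):
    """Return the physical address of the next-level table, or None.

    Meant for PDPT and PD entries. None means "do not descend": the entry
    is either not present, or it is a huge-page mapping whose address
    points at data rather than at a page table.
    """
    if not entry & PRESENT:
        return None
    if entry & PAGE_SIZE_BIT:
        return None
    return entry & ADDR_MASK


print(split_vaddr(0xFFFFFF8000000000))  # (511, 0, 0, 0)
```

A mapper that skips the PS check would take the 2MB frame's address as a page table and write entries into whatever that frame holds.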
15:14:00 <ddevault> feel free to dig into the code if you want
15:14:00 <ddevault> busy
15:27:00 <Jari--> ohh the simplicity of 32-bit memory management
15:27:00 <mrvn> how is that simpler?
15:27:00 <Jari--> it's also an easier test sandbox for a kmalloc
15:27:00 <Jari--> mrvn: well I am accustomed to it, no idea about 64-bit
15:28:00 <heat> 32-bit memory management is hell
15:28:00 <heat> it's the opposite of simple lol
15:28:00 <mrvn> gets bad when you have more than 1GB
15:28:00 <Jari--> yeah, I/O mapped memory crashes on more than 1 gig RAM
15:28:00 <Jari--> mrvn: heat: yeah
15:29:00 <Jari--> Ask UEFI for I/O memory mapped memory or PCI bus?
15:30:00 <heat> huh?
15:30:00 <Jari--> Where do you usually get the memory map on your system.
15:30:00 <Jari--> PCI gives I/O and memory addresses on my kernel.
15:30:00 <mrvn> from the bootloader
15:30:00 <heat> yes
15:31:00 <heat> what's the problem
15:31:00 <Jari--> multiple bridges might have issues
15:31:00 <heat> how so
15:31:00 <Jari--> heat, e.g. does AGP register up on PCI bridge?
15:32:00 <heat> you're talking to me about an old ass technology but yes afaik
15:32:00 <Jari--> Okay so PCI is really big business.
15:32:00 <Jari--> heat: on VMware I have run the kernel with 3 gigs of RAM, with luck.
15:33:00 <mrvn> how do you even fit the PCI devices into 32bit address space? How do you handle a GPU with 8GB ram?
15:33:00 <Jari--> mrvn: memory extension might support up to 8 gigs of RAM, supported in standard 32-bit Linux kernels
15:33:00 <heat> usually, it's not all mapped as a BAR
15:33:00 <heat> see: resizable BAR extensions
15:34:00 <mrvn> Jari--: with PAE? Now you made bad even worse.
15:34:00 <heat> no, PAE is 64GB
15:34:00 <Jari--> really thats awesome
15:34:00 <Jari--> I havent seen many 128 gig systems so far
15:35:00 <heat> this is probably the opposite of awesome
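For reference, heat's 64GB figure follows from the 36 physical address bits of classic PAE. A quick check (the function name is illustrative):

```python
GIB = 1 << 30


def addressable_gib(phys_bits):
    """Physical address space, in GiB, reachable with phys_bits address bits."""
    return (1 << phys_bits) // GIB


print(addressable_gib(32))  # 4, plain 32-bit paging
print(addressable_gib(36))  # 64, PAE with 36 physical address bits
```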
15:38:00 <mrvn> Does anyone still produce systems with PAE and no long mode?
15:39:00 <heat> the answer must be yes
15:39:00 <heat> :)
15:41:00 <Jari--> 64K limited DMA?
15:41:00 <Jari--> Lol should upgrade my drivers soon.
15:42:00 <heat> hm?
15:42:00 <heat> that's some old ass DMA
15:42:00 <heat> ISA DMA? something like that
15:46:00 <mrvn> Jari--: with PAE and 64GB ram you already have a problem because you need some space for PCI:
15:46:00 <mrvn> total used free shared buff/cache available
15:46:00 <mrvn> Mem: 64851252 45326304 16804872 1968532 2720076 16868532
15:46:00 <mrvn> Swap: 67108860 602112 66506748
15:48:00 <heat> free -h pls
15:49:00 <mrvn> total used free shared buff/cache available
15:49:00 <mrvn> Mem: 61Gi 43Gi 15Gi 1.9Gi 2.6Gi 15Gi
15:49:00 <mrvn> Swap: 63Gi 588Mi 63Gi
15:49:00 <heat> thank
15:49:00 <heat> is that a 32-bit machine you're running 64GB of ram on?
15:50:00 <mrvn> no. That would be insane.
15:50:00 <heat> why 61 then?
15:50:00 <mrvn> stupid bios, pci memory hole, shared memory with the gpu
15:51:00 <heat> the PCI memory hole won't take away your ram though
15:51:00 <mrvn> heat: with a stupid bios it does
15:51:00 <heat> /unless/ you're running a 32-bit kernel
15:51:00 <heat> huh
15:51:00 <heat> total used free shared buff/cache available
15:51:00 <heat> Mem: 7.7Gi 5.1Gi 293Mi 1.1Gi 2.3Gi 1.2Gi
15:51:00 <heat> Swap: 8.4Gi 2.7Gi 5.7Gi
15:51:00 <heat> all good here
15:51:00 <qookie> with a stupid bios everything is possible :^)
15:52:00 <mrvn> why 7.7 and not 8?
15:52:00 <heat> fuck do I know
15:52:00 <heat> pre-used memory probably
15:52:00 <mrvn> probably the same, a hole below 4GB for 32bit pci
15:52:00 <heat> hmm no
15:52:00 <heat> I don't think so
15:53:00 <heat> lets see
15:53:00 <mrvn> do you still have the memory map in the dmesg output?
15:53:00 <heat> let me journalctl
15:53:00 <qookie> on my system i have 15.5G available to the OS according to lsmem, and 14G usable according to free
15:53:00 <qookie> and that matches up with 512M of stolen memory for the igpu
15:54:00 <qookie> RANGE SIZE STATE REMOVABLE BLOCK
15:54:00 <qookie> 0x0000000000000000-0x00000000cfffffff 3.3G online yes 0-25
15:54:00 <qookie> 0x0000000100000000-0x000000040fffffff 12.3G online yes 32-129
15:54:00 <heat> oh yessss
15:54:00 <heat> probably stolen memory
15:54:00 <heat> https://gist.github.com/heatd/11e38ddc8df738bf058406f00f221512
15:54:00 <bslsk05> gist.github.com: gist:11e38ddc8df738bf058406f00f221512 · GitHub
15:57:00 <mrvn> it's odd, 0x400000000 is 16GB. So it didn't just punch a hole where the PCI region is but remapped the ram. But then it should go up to 0x43fffffff. So something is stealing from that.
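The sizes behind this can be checked directly against qookie's lsmem ranges. A small sketch, assuming 16 GiB is installed (implied by the 15.5G plus 512M figures above, but not stated outright in the log):

```python
MIB = 1 << 20
GIB = 1 << 30

# (first, last) inclusive byte addresses, copied from the lsmem output above.
ranges = [
    (0x0000000000000000, 0x00000000CFFFFFFF),
    (0x0000000100000000, 0x000000040FFFFFFF),
]


def size(r):
    """Length in bytes of an inclusive (first, last) range."""
    return r[1] - r[0] + 1


online = sum(size(r) for r in ranges)
hole = ranges[1][0] - (ranges[0][1] + 1)   # gap below 4 GiB
installed = 16 * GIB                       # assumption, see lead-in
missing = installed - online

print(online / GIB)    # 15.5
print(hole // MIB)     # 768
print(missing // MIB)  # 512
```

The 512 MiB shortfall lines up with the stolen iGPU memory qookie mentions.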
15:59:00 <mrvn> heat: you have a huge hole there in the lower 4GB
16:31:00 <ddevault> I bet I can port doom fairly soon
16:31:00 <ddevault> without audio, that is
16:48:00 <mrvn> ddevault: do you have pong? Snake? frogger?
16:48:00 <ddevault> no, but why walk when you can run
17:11:00 <doug16k> neat, I didn't know my system had RAM all the way up to 0xdfffffff. I wonder if the hole would be bigger if I booted with CSM
19:00:00 <geist> also if you're using a discrete GPU it tends to be bigger, since 'stolen' graphics ram tends to be just off the top of what appears to be the end of ram
19:00:00 <geist> sometimes you can probe past it and actually find the framebuffer
19:02:00 <geist> er smaller. . okay to rephrase, integrated graphics tend to steal ram off the top of where ram appears to stop
19:02:00 <geist> ie, it'll say ram goes up to 0xb000.0000 but actually there's a chunk at b... to 0xc000.0000 that's just not accounted for
19:03:00 <geist> but it actually ends up being a chunk of ram that is given to the graphics card
19:09:00 <qookie> do dgpus actually steal any CPU memory?
19:10:00 <geist> not that i know of, aside from whatever stuff the driver might allocate locally
19:10:00 <geist> dgpus have their own little address space in their universe, and their own mmu to see their own stuff
19:11:00 <qookie> ah I misunderstood what you said
19:11:00 <geist> yah that's cause i wrote it backwards on the first line. igpus are the one that steals cpu memory
19:11:00 <qookie> yeah
19:12:00 <qookie> but these days afaik they don't steal much, most memory is mapped in via the GTT
19:12:00 <qookie> on my system (integrated AMD Vega 6 or 7) only 512M is stolen, but games can use way more
19:12:00 <geist> yah, though usually something like 64-256MB in my experience
19:13:00 <geist> yah exactly. enough that it causes your end of lower ram to appear to stop shorter than it should
19:13:00 <geist> considering where PCI space starts up, etc
19:13:00 <qookie> yeah
19:13:00 <geist> i found a lot of this by fiddling with the TOLUD MSRs and whatnot on AMD and the intel equivalent
19:13:00 <geist> it's how some of that sausage is made basically
19:14:00 <geist> the registers that control where the cpu stops trying to decode DRAM and starts trying to decode mmio space
19:36:00 <geist> awww crap. a week after installing the new motherboard: server locked up, exactly the same way
19:37:00 <zid> Turns out once a week, the cleaner comes past the server, plugs a vacuum cleaner into the same outlet, and vacuums the floor
19:37:00 <geist> pretty much
19:37:00 <geist> this mobo is kinda neat though: it has the built in aspeed thing so i can log into a web page and see the console and reset it and whatnot
19:37:00 <geist> nothing interesting on the event log though
19:38:00 <zid> oh yea I've seen those controllers
19:38:00 <zid> It's a bit like ME but vendor specific I guess
19:38:00 <zid> https://www.aspeedtech.com/server_ast2500/#:~:text=AST2500%20is%20ASPEED's%206th%20generation,best%20performance%20server%20management%20solution
19:38:00 <zid> I happened to have looked at this one last week
19:38:00 <geist> yah i think they're fairly ubiquitous. has a little OS on it that is running some thing. yep exactly that one
19:39:00 <zid> I think the main reason is that if you're adding a 2MB VGA framebuffer from matrox, you might as well get this thing instead
19:39:00 <geist> if you have a DGPU on board it doesn't show up on PCI, but somehow the bios has some sort of knowledge to enable its VGA feature if it doesn't see a DGPU
19:39:00 <geist> 100% and you can then be actually headless with it and still get a console
19:40:00 <geist> though the little web page it serves leaves a bit to be desired, and i would not expose it to anything from a security point of view
19:40:00 <zid> yea nor ME, as it turns out
19:40:00 <zid> it's had exploits before in its various stacks, which always ends up making the news
19:40:00 <geist> so now i'm starting to believe that zen 2s are simply not stable as a long term system. i've now seen a 3900x and 3950x fail the same way. they're running kinda hot but not super hot
19:41:00 <zid> I wonder how it gets its firmware and stuff, I guess the bios is just adapted to knowing its there and blats it in at device discovery time or something
19:41:00 <geist> i suppose it could be the PSU or memory though, so i guess i can start popping pairs of ram and see. need to establish what the new MTBF is in the new regime. ran for 8 days before locking up this time
19:41:00 <zid> You could still blame your PSU if you like
19:41:00 <zid> or ram yea
19:41:00 <geist> it just seems so unlikely
19:42:00 <zid> eh my machine is perfectly stable until you load enough of the cores for long enough
19:42:00 <zid> which on a "random server setup" is actually very unlikely
19:42:00 <zid> (the 3.3V rail issue I discussed before)
19:42:00 <geist> in this case it doesn't seem to be load related, or even related to being a VM host or not. seems to fail equally fast if i am running a bunch of qemu instances or not. usually fails when it's not loaded at all
19:43:00 <zid> Could be failing to come out of a sleep state then?
19:43:00 <zid> asks the VRMs for more juice, they try draw it from the psu, psu ramps it too slow
19:43:00 <geist> i suppose it's possible it could simply be a linux bug, but that seems highly unlikely
19:43:00 <zid> and everything is undervolted for a bit
19:43:00 <geist> maybe?
19:43:00 <zid> I can't rule it out, at least
19:44:00 <zid> so a psu is a thing you could certainly try
19:44:00 <geist> yah, actually plan on moving to another case tomorrow anyway, so i'll switch to another equal but different PSU at the same time
19:44:00 <geist> have basically the same case for a test machine that has more vents, so going to move it there and install some more fans so it can hopefully run cooler
19:45:00 <geist> it's a nice cheapo case. corsair 100r case. nice solid cheap case
19:45:00 <geist> lots of drive bays and can hold a full atx
19:45:00 <zid> My case is a bit of cheap whatever that I took the window out of to make the cooler fit :P
19:45:00 <zid> the ssds are hanging down from their psu cables
19:51:00 <geist> yeaaaaaaah
20:39:00 <heat> geist, TOLUD is a thing on chipsets for intel
20:39:00 <heat> you'll see it in your chipset docs and there are a bunch of references to it in i915 docs
20:39:00 <geist> yeah there's some AMD equivalent
20:39:00 <geist> like many things i think the AMD one is more straightforward, but its called something vaguely similar
20:40:00 <geist> ah TOPMEM
20:40:00 <heat> i've seen those "hidden" devices for my chipset
20:41:00 <heat> mine has a device that gets hidden after booting (somewhere in the SEC phase IIRC)
20:41:00 <geist> https://fuchsia.googlesource.com/fuchsia/+/refs/heads/main/zircon/kernel/platform/pc/pcie_quirks.cc#155 etc
20:41:00 <bslsk05> fuchsia.googlesource.com: zircon/kernel/platform/pc/pcie_quirks.cc - fuchsia - Git at Google
20:41:00 <heat> completely stops responding to PCI accesses
20:41:00 <heat> it's fascinating
20:41:00 <geist> that's the AMD equivalent. basically trying to read where the PCI allocation space starts, for both the 32 and 64bit regions
20:42:00 <heat> what for?
20:42:00 <geist> i think the idea at the time is if we have to allocate space for PCI devices we need to compute what the aperture is available to us
20:42:00 <geist> i dont think it's really used, but we went ahead and wrote the code anyway, in case it was needed
20:43:00 <geist> but looking at the proper TOLUD/TOPMEM was needed because of the stolen graphics memory thing i was talking about before
20:44:00 <geist> RAM may appear to stop at some address, but it may actually extend past it in stolen graphics memory that appears as an unused chunk in the e820 stuff
20:44:00 <geist> so if you check TOLUD/TOPMEM you can find where the proper end of DRAM is
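A hedged sketch of the comparison geist describes: take the highest usable-RAM end below 4 GiB from the firmware memory map and subtract it from the top-of-low-DRAM value. The map and the register value below are made up to match his earlier 0xb000.0000 / 0xc000.0000 example, and `gap_below_top_of_low_dram` is a hypothetical helper. Reading the real register is platform specific and not shown.

```python
MIB = 1 << 20
FOUR_GIB = 1 << 32


def gap_below_top_of_low_dram(usable_ram, top_of_low_dram):
    """Bytes between the end of reported low RAM and the real end of low DRAM.

    usable_ram: list of (base, length) usable-RAM entries from the memory map.
    top_of_low_dram: value read from TOLUD/TOPMEM (obtained elsewhere).
    """
    low_end = max(
        (base + length for base, length in usable_ram
         if base + length <= FOUR_GIB),
        default=0,
    )
    return top_of_low_dram - low_end


# Hypothetical map: RAM appears to stop at 0xb0000000.
example_map = [
    (0x00000000, 0x0009FC00),
    (0x00100000, 0xB0000000 - 0x00100000),
    (0x100000000, 0x40000000),
]

print(gap_below_top_of_low_dram(example_map, 0xC0000000) // MIB)  # 256
```

The gap only says DRAM is decoded there. Whether it is graphics memory or something else the firmware reserved is not determined by this subtraction alone.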
20:44:00 <heat> TOPMEM just the lower part?
20:44:00 <heat> is TOPMEM*
20:45:00 <geist> yah and there's a TOPMEM2 that is the end of the 64bit mapping
20:45:00 <geist> i dont know how you find that second spot on intel hardware. this is where AMDs is much more straightforward
20:45:00 <geist> just a pair of MSRs that tell you precisely what you want to know
20:45:00 <heat> btw I've seen AMD is adding some stuff to the EFI memory map on their new supercomputer platform
20:46:00 <heat> basically the gpu mem gets put in the memory map as well
20:46:00 <geist> yah, wonder if that generally starts >4GB above TOPMEM2?
20:46:00 <geist> that would be fair game, since that's not decoded as ram
20:47:00 <heat> does TOLUD/TOPMEM include SMRAM?
20:47:00 <geist> dunno what SMRAM is
20:47:00 <heat> system management ram
20:47:00 <geist> dunno
20:48:00 <geist> does SMRAM even show up in the cpu's address space at all?
20:48:00 <heat> the fw steals a good chunk of memory to have as smm state and smm code
20:48:00 <heat> I believe so, you just can't touch it
20:48:00 * geist nods
20:49:00 <geist> in that case it's probably contained within TOLUD since the idea is that's where the cpu stops trying to decode these as memory controller addresses
20:49:00 <heat> fun fact: it grows based on the number of cpus
20:49:00 <geist> (640k hole notwithstanding)
20:49:00 <geist> i remember on AMD at least there's a set of MSRs that control the 640k hole. iirc a bitmap of 64k chunks. you can configure it such that the 640k hole doesn't exist if you want
20:50:00 <heat> how are those MSRs?
20:50:00 <heat> those should be chipset details afaik
20:50:00 <geist> yes, but all modern x86s are chipsets as well
20:50:00 <geist> this is the SOC side of the world
20:50:00 <heat> intel exposes everything on the PCI bus
20:50:00 <geist> AMD has fully embraced this and simply created a bunch of MSRs to configure stuff like this, root pci bus stuff, even the memory controller itself
20:50:00 <heat> just pci device registers everywhere
20:51:00 <geist> intel sticks to the old model and exposes it as pci stuff
20:51:00 <geist> yeah this is where AMDs solution is far more straightforward
20:51:00 <heat> actually shouldn't you have to catch any exceptions when reading those?
20:51:00 <heat> how do you know they're there?
20:51:00 <geist> when you look at it through that lens an AMD SOC looks a hell of a lot like a standard ARM SOC. a cpu with a bunch of system control registers to set up the world
20:52:00 <geist> intel at least pretends that there's some chip on the other side of the bus that configures everything
20:52:00 <geist> the pci bus that is
20:52:00 <heat> you're just checking if its an AMD cpu. how do you know if that specific platform has it?
20:52:00 <geist> you are the bios, you probably simply know
20:52:00 <geist> this is bios level stuff
20:53:00 <geist> you can read the cpuid and see what cpu it is
20:53:00 <heat> I mean, this particular fuchsia code
20:53:00 <geist> checks for vendor AMD
20:53:00 <heat> and all AMDs have it?
20:53:00 <geist> all AMDs we care about, but we also have a safe msr routine that catches the trap
20:54:00 <geist> it's not foolproof for sure. but i also am not sure this code is even called anymore since we moved PCI into user space
20:54:00 <geist> this particular routine may be vestigial
20:54:00 <geist> the general concern in my mind is when you boot on an virtual machine that exposes AMD but acts like something else
21:06:00 <geist> but in general not having a safe msr routine is annoying. i think down in the exception code we have some sort of mechanism for that
21:06:00 <geist> like if the #GP address is a particular non inlined msr instruction set an error code and return
21:06:00 <geist> annoying but basically necessary at some point
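A toy model of the mechanism geist outlines, not the Zircon implementation: the #GP handler looks up the faulting instruction address in a small fixup table, and if it matches the single non-inlined rdmsr it resumes at an error path instead of panicking. The addresses and names here are hypothetical placeholders.

```python
# Hypothetical addresses: the one rdmsr instruction inside the "safe"
# routine, and the error path that sets an error code and returns.
RDMSR_SAFE_INSN = 0xFFFFFFFF80001000
RDMSR_SAFE_FIXUP = 0xFFFFFFFF80001010

# Faulting instruction address -> address to resume at.
FIXUP_TABLE = {RDMSR_SAFE_INSN: RDMSR_SAFE_FIXUP}


def gp_fault_handler(fault_ip):
    """Return the address to resume at after a #GP at fault_ip.

    A fault at a registered instruction is recoverable. Anything else is
    treated as fatal, modeled here by raising RuntimeError.
    """
    fixup = FIXUP_TABLE.get(fault_ip)
    if fixup is None:
        raise RuntimeError(f"unhandled #GP at {fault_ip:#x}")
    return fixup


print(hex(gp_fault_handler(RDMSR_SAFE_INSN)))  # 0xffffffff80001010
```

Keeping the MSR access in one non-inlined routine is what makes a single table entry sufficient.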
21:13:00 <qookie> geist: instead of poking at platform regs, i think you can just ask acpi about the root bus resources if you need to allocate bar space for devices etc
21:13:00 <geist> yah i think that's the actual correct way, i think. problem is of course i think that involves the complex, bytecode parsing parts of ACPI
21:13:00 <geist> which we dont want in the kernel
21:14:00 <geist> but now that we moved pci driver into user space it's possible to get that stuff
21:14:00 <qookie> and besides without driver support for having bars move somewhere else, you're at the mercy of the fw for how much space it assigned to the bridge if you're allocating to a device behind one
21:15:00 <qookie> linux just bails if it's not enough
21:17:00 <gorgonical> Porting this OS is a lot of work
21:17:00 <gorgonical> When do I get my certificate of genuine hackerman from geist
21:18:00 <geist> whatcha porting?
21:18:00 <gorgonical> A linux-ish kernel used in hpc stuff
21:18:00 <gorgonical> The ARM64 port that I half-did is also wildly incomplete in some areas. E.g. I'm 80% sure processes cannot receive signals on the arm64 port lol
21:20:00 <geist> ah
21:21:00 <gorgonical> We're doing some in-house risc-v design and having a kernel we can modify simply would be nice
21:21:00 <gorgonical> Linux works on a lot of the boards but good luck modifying any major subsystem
21:22:00 <geist> yah makes sense
21:25:00 <gorgonical> It's obvious to any of you I'm sure but there's just a lot of small details to resolve -- where does the TLS pointer go? How is the kernel stack arranged? What does context save/restore look like? What about traps/exceptions? All the hand-asm for things like atomics, etc. And they all vary by architecture
21:28:00 <qookie> speaking of TLS, is there any concrete docs on how it's supposed to work on aarch64? i only found a fuchsia.dev page about it and got the rest of the info i needed from looking at musl, guessing, and looking at our existing code for x86
21:29:00 <qookie> (the last part to figure out how TLS works in general :P)
21:29:00 <gorgonical> my understanding is that the TLS ptr is stored in tpidr_el0
21:29:00 <gorgonical> I don't know of any concrete docs. Is it just agreed-on convention between libc and the kernel? "Let's use tpidr_el0 and both of us agree not to clobber it?"
21:30:00 <geist> yah that's to me the fun part. figuring out all the arch specific details
21:30:00 <geist> seeing how one thing maps to another thing
21:30:00 <qookie> yeah that much i figured, i mean the docs on how userspace expects it to be laid out
21:30:00 <geist> qookie: pretty sure it's well documented in the arm docs github
21:30:00 <geist> thats where the official ELF specs and whatnot exist
21:30:00 <qookie> i haven't found anything there about that in the elf abi supplement
21:30:00 <gorgonical> geist: it's definitely very exciting and I love the feeling of programming the machine itself. But each file I have to adapt and don't get to test makes me more afraid to run it
21:30:00 <qookie> for example, userspace expects that tls blocks start at TPIDR_EL0+0x10
21:30:00 <geist> i think it may be a similar one
21:31:00 <geist> ah yes. that may be where the ELF spec stops and the OS specific spec begins
21:31:00 <geist> or at least libc specific spec begins
21:31:00 <qookie> and the linker will sometimes hardcode an offset based on that assumption, for example in the local-exec model
21:31:00 <geist> yep
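A sketch of the layout assumption qookie is describing, as commonly implemented for AArch64 (TLS "variant I"): 16 bytes are reserved at the thread pointer, and the executable's TLS block follows, aligned up to the segment alignment. This assumes a power-of-two alignment and a TLS segment whose start address is itself aligned. The function names are illustrative, not taken from any libc or linker.

```python
TCB_SIZE = 16  # bytes reserved at TPIDR_EL0 before the first TLS block


def align_up(value, alignment):
    """Round value up to a multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def local_exec_offset(sym_offset, tls_align):
    """Offset from TPIDR_EL0 of a symbol in the executable's TLS segment.

    sym_offset: the symbol's offset within the TLS segment.
    tls_align: the TLS segment's alignment (p_align).
    """
    return align_up(TCB_SIZE, tls_align) + sym_offset


print(hex(local_exec_offset(0, 8)))   # 0x10
print(hex(local_exec_offset(8, 64)))  # 0x48
```

With the default small alignment the first variable lands at TPIDR_EL0+0x10, which is the constant the linker bakes into local-exec code.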
21:31:00 <gorgonical> and there's just a certain amount of work until the kernel will build at all, much less run
21:32:00 <qookie> but i have not found an official document from arm (nor GNU or anyone who makes toolchains) about that layout requirement
21:33:00 <gorgonical> Very rude network
21:34:00 <geist> but yes, FWIW TPIDR_EL0 is the user space TLS root
21:34:00 <geist> there's a TPIDRRO_EL0 which as far as i know has no real use anywhere. not even sure linux uses it for anything
21:34:00 <geist> may do something like put the cpu # in it or whatnot
21:34:00 <qookie> linux uses it to stash a register in the interrupt handler in some code path
21:35:00 <geist> TPIDR_EL1 is of course up to the kernel to use, but generally it holds a pointer to either the current cpu structure or the current thread structure
21:35:00 <qookie> but yeah, we have tls more or less working in our libc, but i'm just annoyed i couldn't find any documentation about it
21:35:00 <geist> also note that x18 is free for the platform to use, in both kernel and user space. the abi says either it's temporary or platform use
21:36:00 <j`ey> qookie: its in the stack overflow path
21:36:00 <gorgonical> klange was saying you have to pass a compiler flag to make sure the compiler doesn't use it, right?
21:36:00 <gorgonical> about x18
21:36:00 <geist> -ffixed-x18 yes. otherwise it's up to the triple to default it to whatever use it has
21:37:00 <gorgonical> Ah which may be just a gpr
21:37:00 <geist> ya if the platform doesn't use it then it's another temporary
21:37:00 <geist> since x16, x17 are otherwise just interprocedural temporaries
21:37:00 <geist> x18 is too if it has no other use. it's basically the highest temporary, since x19 has some use
21:38:00 <geist> it's the first of the callee saved ones
21:44:00 <heat> I also want a certificate of genuine hackerman from geist
21:45:00 <heat> what's the final test
21:55:00 <kingoffrance> run on vax :D
21:56:00 <heat> "get a vax" sounds pay2win to me
21:57:00 <kazinsal> relax, send a fax to a vax; certified hax
22:01:00 <geist> noice
22:02:00 <mjg_> vaxcine
22:02:00 <mjg_> get it
22:02:00 <heat> hahahahahaha
22:02:00 <heat> omg
22:02:00 <heat> so funny
22:02:00 <mjg_> ikr
22:03:00 <heat> 😂😂😂😂😂😂
22:03:00 <heat> i should use emojis more
22:03:00 <heat> see if your shitty systems break
22:06:00 <mjg_> i'm on ubuntu so you are not far off
22:07:00 <heat> 🚫Ubuntu, 👍👌Arch linux, which I do use
22:09:00 <j`ey> btw
22:10:00 <psykose> btw
22:12:00 <Ermine> btw
22:44:00 <klange> < geist> there's a TPIDRRO_EL0 which as far as i know has no real use anywhere. not even sure linux uses it for anything ← it's the thread pointer on macOS, contrary to everyone else :)
23:51:00 <graphitemaster> geist, see the new floppotron?
23:51:00 <graphitemaster> https://www.youtube.com/watch?v=kCCXRerqaJI
23:51:00 <bslsk05> 'The Floppotron 3.0 - Computer Hardware Orchestra' by Paweł Zadrożniak (00:03:25)
23:53:00 <mjg_> solid t-shirt