channel logs for 2004 - 2010 are archived at http://tunes.org/~nef/logs/old/ ·· can't be searched
#osdev2 = #osdev @ Libera from 23may2021 to present
#osdev @ OPN/FreeNode from 3apr2001 to 23may2021
all other channels are on OPN/FreeNode from 2004 to present
http://bespin.org/~qz/search/?view=1&c=osdev&y=19&m=2&d=21
12:30:33 <jmp9> now i'm in long mode
12:30:54 <jmp9> and trying to jump to address 0xFFFF003FC0000000
12:30:59 <jmp9> and i'm getting GPF
12:32:11 <ybyourmom> Then don't jump there lel
12:32:31 <jmp9> but there's my kernel
12:33:07 <ybyourmom> heh, that's a real conundrum then :')
12:33:14 <immibis> doesn't long mode have a rule where the top N bits of the address had to be equal (for some value of N that I thought was more than 16)?
12:33:48 <jmp9> okay, paging works
12:33:54 <jmp9> but at address 0xFFFF003FC0000000 it's something weird
12:33:59 <jmp9> not my code absolutely
12:34:19 <jmp9> can i see with qemu which address a virtual addr is mapped to?
12:34:37 <ybyourmom> Yea, info tlb
12:34:54 <ybyourmom> And I think `page 0xXXXXXXXX`
12:37:32 <jmp9> wtf
12:37:35 <jmp9> very strange
12:37:46 <jmp9> there's only first 1 MB identity mapping
12:38:27 <eryjus> jmp9 your address is not canonical
12:38:32 <jmp9> uh
12:38:34 <jmp9> why
12:38:43 <jmp9> 0b1111111111111111 000000000 011111111 000000000 000000000 000000000000
12:39:13 <eryjus> bit 47 MUST be repeated to bits 48-63
12:39:26 <jmp9> rly?
12:39:34 <eryjus> really really
12:40:04 <jmp9> so they're zeros
12:41:35 <eryjus> well, they should be, but yes. every address must have that pattern
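A minimal sketch of the rule eryjus describes, assuming 48-bit virtual addresses (4-level paging). The names `is_canonical` and `make_canonical` are illustrative only and do not come from any library.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits va_bits..63 are all copies of bit va_bits-1."""
    top = addr >> (va_bits - 1)                 # bit 47 plus everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1    # 17 one-bits for va_bits=48
    return top == 0 or top == all_ones


def make_canonical(addr: int, va_bits: int = 48) -> int:
    """Keep the low va_bits and sign-extend bit va_bits-1 into the rest."""
    low = addr & ((1 << va_bits) - 1)
    if low >> (va_bits - 1):                    # bit 47 set: upper bits must be ones
        low |= ((1 << 64) - 1) ^ ((1 << va_bits) - 1)
    return low


# jmp9's address has bit 47 clear but bits 48-63 set, so it is rejected.
print(is_canonical(0xFFFF003FC0000000))
print(hex(make_canonical(0xFFFF003FC0000000)))
```

With bit 47 clear, the only canonical form of that address has zeros in bits 48-63, which matches the "so they're zeros" exchange.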
12:43:19 <jmp9> nvm
12:48:08 <jmp9> ok problem was in address
12:48:15 <jmp9> i implemented only 128 elements of PDP
12:48:23 <jmp9> and 011111111 means 255
12:48:24 <jmp9> not 127
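The binary breakdown jmp9 pasted can be reproduced with a few shifts. This is a rough sketch assuming 4-level paging with 4 KiB pages and 9-bit indices per level; `split_indices` is a made-up helper name.

```python
def split_indices(addr: int) -> dict:
    """Split a virtual address into its 4-level paging indices."""
    return {
        "pml4": (addr >> 39) & 0x1FF,   # bits 39-47
        "pdpt": (addr >> 30) & 0x1FF,   # bits 30-38 (the "PDP" index)
        "pd": (addr >> 21) & 0x1FF,     # bits 21-29
        "pt": (addr >> 12) & 0x1FF,     # bits 12-20
        "offset": addr & 0xFFF,         # bits 0-11
    }


idx = split_indices(0xFFFF003FC0000000)
print(idx)
```

For this address the PDPT index comes out as 255, so a table with only 128 populated entries has nothing mapped there.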
12:50:47 <jmp9> how i get translation of virtual address
12:50:48 <jmp9> in qemu
12:51:03 <jmp9> oh
12:51:05 <jmp9> info tlb yes
12:54:54 <jmp9> holy shit
12:54:56 <jmp9> it works!!!!!
12:57:26 <ybyourmom> ur the man
12:57:50 <ybyourmom> https://www.youtube.com/watch?v=w3xcybdis1k
01:16:01 <jmp9> oh man
01:16:06 <jmp9> it works on real hardware too
03:23:14 <doug16k> knebulae, why use that when you can use `perf`
03:23:36 <doug16k> you can even measure your actual OS by telling perf to measure kvm --guest
03:23:55 <doug16k> ...from the host
03:23:55 <knebulae> @doug16k: I've never really benchmarked, so that tool looked really cool.
03:24:07 <knebulae> @doug16k: I have some very surprising gaps in my knowledge.
03:24:10 <doug16k> perf does that and a lot more I think
03:24:19 <doug16k> oh perf is a tad obscure, no worries about that
03:24:35 <knebulae> I just liked the simplicity of being able to throw some asm on the command line and have it run in kernel mode with a benchmark report.
03:24:39 <doug16k> really powerful perf analysis tool that lets you sample perf counters
03:24:54 <doug16k> knebulae, you on linux?
03:25:08 <knebulae> @doug16k: I have heard of perf, but I know nothing else more than you've said.
03:25:13 <knebulae> @doug16k: win10
03:25:19 <knebulae> But Linux abounds all around me.
03:25:23 <knebulae> Just not on my main box
03:25:40 <knebulae> *All the time :)
03:25:52 <doug16k> ah. well on a linux box you'd do this to just get a realtime perf analysis to see an example of it: sudo perf top
03:26:26 <doug16k> you can tell it to measure branch mispredicts, cache hits, and anything that a perf counter can measure
03:26:42 <doug16k> tells you what lines of code are doing it the most
03:27:36 <knebulae> @doug16k: my threadripper w/ubuntu 18.04 says perf: command not found. Fixed easily enough, but this is a Zoneminder box, and anyone that runs Zoneminder knows, every update breaks *something*. And if it's not that, it'll be my RAID driver that I have to rebuild from source on every kernel upgrade. But I love it. Lol. :/
03:28:07 <doug16k> ah you probably need to apt install it and probably linux-tools-...
03:28:09 <knebulae> @doug16k: so I will give perf a good workout shortly
03:28:17 <doug16k> ya it's amazing really
03:28:25 <knebulae> @doug16k: thanks for the heads up
03:29:23 <doug16k> to measure a vm guest (your OS maybe), do: sudo perf kvm --guest stat
03:29:47 <doug16k> or record instead of stat, wait a while to record samples, then ctrl-c it and: perf report
03:29:48 <ybyourmom> 10 ft tall
03:29:57 <doug16k> oops, sudo perf report (root owns the output file)
03:31:06 <knebulae> @doug16k: of course I have to figure out what package perf is in. I'll look at it here in an hour or so.
03:33:56 <doug16k> knebulae, install package named: linux-tools-$(uname -r)
03:34:41 <doug16k> expands to something like: linux-tools-4.15.0-45-generic
03:34:45 <knebulae> @doug16k: ok
03:34:58 <doug16k> no pressure though :D
03:37:07 <knebulae> @doug16k: very cool. No, I've admin'ed linux boxes for roughly 18 years. This was no problem. I'm just doing something else at the moment. It's installed. It's very cool. I just don't have time to mess further right this second.
03:37:47 <doug16k> oh then you probably know more about linux and posixy stuff than I do
03:38:19 <knebulae> @doug16k: I ran sudo perf top, and got a great readout of all of the active functions, but it failed to record, I'm guessing because I didn't type "sudo perf record" first. :/ But I will dink with it.
03:38:20 <doug16k> I know enough to fix install stuff and get drivers working and use it. I don't know all the gnarly stuff
03:39:05 <doug16k> knebulae, top is just a realtime thing like `top`. record writes to a file for later perusal in the same ui as top but with summary of whole run
03:39:21 <knebulae> @doug16k: a crappy mishmash of both. more linux, because that was what was available to me, other than the Sun's in the labs. But then more POSIXy because of my work on my OS over the years. So I'm "competent," not great.
03:39:41 <knebulae> @doug16k: ok, then I'm missing record.
03:40:00 <doug16k> say you had it on mispredicts. if you had a bad binary search tree comparison mispredicting all the time, it'd show that as the hottest thing and hitting enter on it takes you right to the offending branches next to source fragment
03:40:28 <knebulae> @doug16k: sorry, I'm responding to you so as not to be rude, but I'm not giving you my full attention. I'll be back with you in 10 minutes or so. Sorry.
03:40:36 <doug16k> then you forehead-slap and make it branch free
03:40:45 <doug16k> no worries. sorry I'll leave you to work
03:51:19 <knebulae> hey doug, sorry about that man. I was just finishing some actual work (excel reports).
03:51:43 <knebulae> It's kind of hard to focus on 2 excel sheets, my irc client, and putty with my video server going.
03:51:44 <knebulae> lol
03:53:10 <knebulae> anyway, so I am just misreading this thing. perf top runs top within perf. But when I exit top, and try to run perf report, it complains that it can't find perf.data (No such file or directory), and helpfully hints to (try 'perf record' first). And that's when I stopped and got back to work.
03:53:53 <doug16k> does it really run top?!
03:54:11 <knebulae> well, it looks like top-like functions within a list
03:54:12 <doug16k> that shouldn't happen, must be old or some configuration of it I haven't seen
03:54:18 <doug16k> oh yes it is top like
03:54:21 <doug16k> not actually top
03:54:26 <doug16k> whew
03:54:51 <knebulae> yeah, it's definitely perf. And exhibit A of why I had to stop for a minute ;)
03:54:52 <doug16k> ok sorry I thought you said it runs top, I took you way too literally
03:55:11 <knebulae> I wasn't paying close attention, and out of the corner of my eye, it looked similar to top
03:55:55 <knebulae> Although, it *should* be running top somewhere, shouldn't it?
03:56:07 <knebulae> Isn't that what I'm watching perf dance to?
03:56:20 <doug16k> no top
03:56:26 <knebulae> damn bro
03:56:49 <doug16k> perf is communicating with a kernel module and it is using the hardware counters and various kernel counts to get that
03:57:04 <doug16k> see the top? it's cycles
03:57:13 <doug16k> :ppp means max precision
03:57:25 <knebulae> I don't see ppp
03:57:35 <doug16k> I see cycles:ppp
03:57:47 <doug16k> maybe different arch where not needed or something
03:57:57 <doug16k> all this slightly varies with architecture
03:58:07 <doug16k> even within x86. I mean generations
03:58:46 <doug16k> anyway, what you mainly do with this thing, is do targeted measurements of different aspects of performance
03:59:10 <doug16k> mispredicts is a good one, telling you where some branch free code would help a lot (it's often easy to optimize)
03:59:27 <doug16k> you can measure tlb misses, or cache misses, lots of things
03:59:42 <doug16k> run this to see what you can measure: perf list
04:00:00 <doug16k> then you just -e that_thing and it measures that
04:00:18 <doug16k> perf record -e that_thing
04:00:27 <doug16k> or top, or whatever, in place of record
04:01:34 <doug16k> sorry, run `sudo perf list` to get a decent list. only user available ones show without
04:02:35 <knebulae> @doug16k: https://nebulae.online/perf4doug16k.png
04:02:59 <doug16k> ya, no :ppp. neat. what generation is the cpu?
04:03:03 <knebulae> @doug16k: I understand. I will definitely spend some time with it.
04:03:15 <knebulae> It's a Threadripper (new gen) 16/32
04:03:22 <knebulae> 29 whatever
04:03:38 <doug16k> ah, I'm comparing it to my laptop haswell (I'm away from home)
04:03:49 <doug16k> probably no ppp on my 2700x either
04:03:57 <knebulae> It's the first AMD I've put together since Athlon
04:04:22 <doug16k> intel has a thing where you can let event counts "skid" more (blame not quite the right instruction) but run faster
04:04:31 <knebulae> it's a 2950X
04:04:55 <knebulae> @doug16k: ok
04:05:02 <doug16k> ppp means "minimize skid"
04:05:13 <doug16k> precise precise precise
04:05:31 <knebulae> oh, ok. So probably no option on AMD (or a different cpuid flag or some such)
04:05:52 <doug16k> amd probably won't skid, or always minimizes skid, or something
04:06:34 <doug16k> I think the skidding is an artifact of intel's exact implementation
04:07:09 <doug16k> so perf lets you ask for more precision on intel
04:07:11 <knebulae> @doug16k: sounds like intel is playing fast and loose with the HPCs
04:07:22 <doug16k> oh the perf counters are really buggy
04:07:28 <doug16k> approximate at best
04:07:28 <knebulae> AMD too?
04:07:47 <doug16k> idk about theirs being buggy. probably less buggy if I had to guess
04:08:01 <doug16k> either that or less bugs are documented or known :D
04:08:20 <knebulae> @doug16k: for as much shit as the Intel guys give AMD guys, the AMD guys aren't all that bad. They just lack the resources of an 800lb gorilla.
04:08:43 <doug16k> overall they are about right but there are many spec updates with silicon bugs in perf counters (missed counts, multiple counts for one event can happen, etc)
04:09:21 <doug16k> ya amd has done well given its market share
04:09:22 <knebulae> @doug16k: it has to be challenging though with OO execution
04:09:46 <knebulae> @doug16k: a user wouldn't expect a counter for an instruction that is executed sooner to be later than another
04:10:06 <knebulae> @doug16k: or am I barking up the wrong tree (knowledge gaps)
04:10:29 <knebulae> Just trying to brainstorm reasons for the skew
04:10:32 <doug16k> OO will mess up counts somewhat if it counts in not quite the right places
04:10:54 <knebulae> sorry, skid
04:10:57 <doug16k> I doubt they count at retirement. so if they count at functional units you'll get mispredicts that flush and machine clear and you get multiple counts
04:11:14 <knebulae> gotcha.
04:11:17 <doug16k> missed counts, not sure how to explain that
04:12:31 <doug16k> I don't think it is exactly that either. it is definitely done such that they get away with a lot less complexity and they are not necessarily right on
04:14:25 <knebulae> @doug16k: I would guess they cheat with offsets to a base time at retirement, which may not always be accurate for internal reasons.
04:14:55 <knebulae> @doug16k: just slap the offset on it when it goes in flight
04:15:22 <knebulae> wait, offset at in flight, then re-offset again at retirement; that'd be the only way to stay accurate.
04:15:37 <knebulae> I don't know. I'm tired.
04:15:44 <doug16k> ya different events occur at different points in the pipeline, so they would need to either mux in an address from all those places, or cheat a bit
04:15:57 <doug16k> instruction pointer address*
04:16:45 <doug16k> or they simply rate limit how often it can read the register file, so if it spam reads it gets the wrong value a few times
04:17:03 <doug16k> just throw last value if spam asserted
04:17:12 <knebulae> @doug16k: right. I know, in a very broad sense, about modern cpus, but I am by no means an expert when it gets to the intricacies of HPCs, execution pipelines, branch (mis)prediction, etc. My understanding is functional at best; i.e. I know some of the words.
04:17:48 <knebulae> Well, maybe most of the words :)
04:19:02 <doug16k> oh man imagine if intel and AMD and everyone else opened the source code for their processors?
04:19:16 <doug16k> so you could just go see what it does? :P
04:19:30 <doug16k> that'll be the day
04:19:37 <klange> If You're So Goddamn Smart How About You Write the CPU Microcode?
04:19:42 <knebulae> @doug16k: some guy from Omaha would figure out some massive fundamental fuckup, fix it, and make the cpus twice as fast.
04:19:58 <doug16k> probably
04:20:01 <knebulae> A million monkeys
04:20:13 <knebulae> All programming languages did was narrow the focus :)
04:20:17 <doug16k> klange, ya!
04:20:46 <knebulae> And now we fling our poo to github to be judged!
04:20:54 <knebulae> Lol. I'm really kidding guys.
04:21:17 <klange> I'll wait for the sequel that adds GPUs and is available for the newest ~~console~~ architecture.
04:22:38 <knebulae> @klange: do you mean you want a console-like box at home to run GPU workloads on?
04:23:05 <klange> knebulae: I am making a joke about Super Mario Maker 2
04:25:17 <Telyra> Interestingly, someone I follow on twitter is writing her own microcode for the N64 RSP
04:25:22 <klange> When the first Mario Maker game was announced, there was a meme calling it "YOU ****IN' MAKE IT THEN IF YOU'RE SO GOD DAMN SMART".
04:25:55 <klange> And then they announced 2, and added *slopes* which people really wanted (and it's for the Switch rather than 3DS)
04:26:24 <klange> I'm sorry, my jokes are rarely funny if they require explanation.
04:26:34 <clever> ive seen a video about how the rules for the level to be "completeable" are different from when its actually played, and its possible to make levels that are physically impossible to complete
04:27:50 <ybyourmom> i saw a video too but im not telling you what my video was about
04:28:02 <clever> lol
04:28:45 <aalm> .theo
04:28:45 <glenda> Have you ever heard of the concept of helping yourself?
04:28:52 <bluezinc> klange: I get it now.
04:29:03 <immibis> Telyra: ... is that documented?
04:29:11 <knebulae> .theo
04:29:11 <glenda> And noone cares what you like or dislike.
04:29:18 <knebulae> Took me a couple days. I'm slow.
04:30:28 <knebulae> @klange: yes; damn I'm moving slow tonight. I missed your first post and only saw the GPU post, and my head was wrapped up talking with doug.
04:30:35 <bluezinc> .theo
04:30:35 <glenda> Things change.
04:32:15 <aalm> .ken
04:32:15 <glenda> That brings me to Dennis Ritchie. Our collaboration has been a thing of beauty.
04:33:12 <Telyra> immibis: Not officially.
04:34:22 <klange> I'm sure there's a binder in a disused cabinet in a dark storage room in Kyoto full of details.
04:34:56 <nyc`> Taipei seems more likely.
04:34:57 <immibis> working on real hardware, presumably, because the emulators just recognize common microcodes and use a C translation of them
04:54:44 <nyc`> I had real hardware once. It was nice while it lasted.
05:02:21 <klys> speaking of microcode
05:02:33 <klys> does the cpu load microcode at boot time?
05:02:45 <klys> presuming it's x86_64
05:05:10 <nyc`> x86, VAX, it's all microcode.
05:06:39 <nyc`> (Only VAX was cleaner with memory-memory ops and had actual useful microcode instructions like polynomial evaluation.)
05:12:08 <nyc`> (I don't see a reason to bother suffixing it. Another bag of opcodes and registers they can get at isn't really any different from the 386 vs. 286 or other extensions.)
05:13:01 <Mutabah> klys: Afaik, no... well, there may be some internal ROM/registers for the microcode
05:14:37 <klys> I haven't really met anyone who was daring enough to consider modifying x86_64 microcode. presumably there's a problem once it gets modified.
05:15:56 <nyc`> Someone who can afford to make mistakes and brick the chip would be rare.
05:17:39 <klys> yeah anything with the intel microcode chip is going to have so many pins in the socket that it would be infeasible to run in-circuit emulation on it, because it would require so many copper layers to get at all the pins.
05:19:01 <immibis> klys: someone tried that, but on an obsolete AMD chip that doesn't have digitally signed microcode. because most chips have digitally signed microcode
05:19:17 <immibis> also most of the microcode is in ROM
05:19:34 <immibis> N64 RSP microcode isn't really microcode, it's a full CPU with a special vector instruction set
05:19:50 <immibis> basically another main CPU
05:20:24 <nyc`> Main anything's are unique.
05:20:43 <immibis> not true, my computer has 4 main CPUs (they happen to be on the same chip though)
05:20:54 <geist> klys: there's a default set, but you can upload microcode after the fact
05:21:28 <klys> geist, would you know anyone who has uploaded info to /dev/microcode without problems?
05:21:48 <geist> the cpu does a hard core crypto check on the block of bits you send it
05:21:53 <immibis> on the processor that someone fuzzed (AMD K6?), you could only upload a small number of patched instructions
05:21:58 <geist> and unless someone has cracked that, it will simply reject it
05:22:13 <geist> and yeah, i'm talking about fairly modern stuff. I think K8 was the last one that was uncryptoed
05:22:23 <immibis> https://hackaday.com/2017/12/28/34c3-hacking-into-a-cpus-microcode/
05:22:49 <geist> yah
05:23:12 <immibis> I think it consisted of a bunch of address/instruction pairs and each one would override the same address in ROM. there were only a handful of those pairs in RAM
05:23:26 <immibis> so you can't rewrite the whole thing, only patch it
05:23:30 <geist> yah, wouldn't be surprised. yeah
05:23:56 <geist> some sort of content addressable ram so as it's fetching from the microcode rom it can be overridden
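A rough sketch of the patch mechanism immibis and geist are describing: a small table of address/instruction pairs consulted before the ROM on every fetch. The class name, slot count, and string micro-ops are all hypothetical; real patch RAM formats are not documented in this log.

```python
class MicrocodeStore:
    """ROM plus a small patch table that overrides individual addresses."""

    def __init__(self, rom, patch_slots=8):
        self.rom = list(rom)
        self.patch_slots = patch_slots   # only a handful of pairs fit
        self.patches = {}                # address -> replacement micro-op

    def load_patch(self, pairs):
        pairs = list(pairs)
        if len(pairs) > self.patch_slots:
            raise ValueError("patch does not fit in patch RAM")
        self.patches = dict(pairs)

    def fetch(self, addr):
        # Behaves like a content-addressable lookup: a hit in the patch
        # table wins, otherwise the ROM contents are used unchanged.
        return self.patches.get(addr, self.rom[addr])


store = MicrocodeStore(["uop_a", "uop_b", "uop_c"], patch_slots=1)
store.load_patch([(1, "uop_b_fixed")])
print([store.fetch(i) for i in range(3)])
```

The slot limit is what makes this a patch facility and not a rewritable store: most of the ROM stays as shipped.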
05:24:00 <klys> presumably if anyone had legitimate access, they might want to do a fairly major overhaul to get at all the caching and block transfer capabilities that it would imbue.
05:24:32 <geist> anyway, the mechanism is pretty similar on modern amd and intel machines
05:24:36 <nyc`> Microcode patches to make them run as ARM or MIPS or RISC-V or SPARC CPU's might be too much to hope for.
05:24:55 <immibis> nyc`: definitely.
05:25:05 <immibis> nyc`: if you could do that, it would be called an FPGA. And you CAN make a CPU on an FPGA!
05:25:08 <geist> you basically boot a cpu, and before too long you (in supervisor mode) figure out which blob of microcode you want to load, put it in ram, and point to it with a set of msrs and whatnot
05:25:18 <geist> you have to do it for every core you bring up
05:25:53 <immibis> (load different microcode on every CPU to make the programmer think his machine has ghosts)
05:26:01 <geist> the cpu loads its own microcode then serializes everything and just falls through. it's pretty simple
05:26:22 <geist> but since the modern stuff is super encrypted and signed, you can't load the wrong firmware on the wrong cpu
05:43:12 <nyc`> immibis: A lot of VAX instructions were entirely in microcode, though I'm skeptical anyone's bothered to make an entire instruction set configurable in microcode.
05:43:40 <clever> nyc`: bigmessowires sorta did
05:43:54 <immibis> nyc`: allegedly the Xerox Alto had a partially writable microcode store. but not all of it
05:43:57 <clever> https://www.bigmessowires.com/bmow1/
05:44:14 <geist> yep. alto is a weird one. it runs basically microcode all the time, has a little multi-level OS in it
05:44:14 <immibis> the problem with a "fully configurable instruction set" is eventually you break some hardware assumptions. like, if you want two-word instructions on a RISC machine, tough shit
05:44:28 <geist> also dont forget the IBM5100. it ran a sort of microcode
05:44:34 <nyc`> clever: Interesting.
05:44:34 <clever> basically, the current opcode, plus a 4 bit microcode counter, and a 1 bit status flag, were "abused" as an address into a microcode rom
05:44:48 <geist> reprogrammable microcode is not such a weird thing. lots of systems played with it
05:44:51 <geist> just not microprocessors
05:44:58 <clever> nyc`: and then every data bit coming out of the microcode rom, was wired to the enable of a tri-state buffer
05:45:16 <clever> nyc`: so, the exact value in the microcode rom, controlled the routing of data between registers, external buses, and the ALU
05:45:38 <immibis> i was thinking about writable microcode for my CPU (to save wiring by making things take more steps and putting more stuff in RAM, hopefully) and that's how I ended up back at Forth
05:45:48 <immibis> where every instruction can be considered microcode, or not
05:46:03 <clever> nyc`: for non-conditional opcodes, it was just implemented twice, so the variant with flag=0 and flag=1 were identical
05:46:22 <clever> nyc`: for conditional opcodes, it was implemented differently, so that bit of the address, affected what microcode it runs
05:46:52 <clever> one of the output bits, was an end-of-opcode flag, that resets the microcode counter to 0, and increments PC
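The scheme clever outlines (opcode, 4-bit step counter, and 1-bit flag concatenated into a ROM address, with each output bit acting as an enable line) can be sketched as a toy simulator. Everything concrete here is an assumption made for illustration: the three opcodes, the control-bit layout, the two registers, and the flag meaning "A is zero" are not taken from the BMOW design.

```python
# Hypothetical control lines, one ROM output bit each.
LOAD_A = 1 << 0   # A <- operand
A_TO_B = 1 << 1   # B <- A
ADD = 1 << 2      # A <- A + B
END = 1 << 4      # end of opcode: step <- 0, PC <- PC + 1

LDA, DOUBLE, ADD_IF_ZERO = 0, 1, 2   # made-up opcodes


def rom_address(opcode, step, flag):
    # opcode bits, then the 4-bit step counter, then the 1-bit flag
    return (opcode << 5) | (step << 1) | flag


def build_rom():
    rom = {}

    def both(opcode, steps):
        # Non-conditional opcode: identical microcode for flag=0 and flag=1.
        for flag in (0, 1):
            for step, word in enumerate(steps):
                rom[rom_address(opcode, step, flag)] = word

    both(LDA, [LOAD_A | END])
    both(DOUBLE, [A_TO_B, ADD | END])                # two micro-steps
    # Conditional opcode: the flag bit selects different microcode.
    rom[rom_address(ADD_IF_ZERO, 0, 0)] = END        # flag clear: do nothing
    rom[rom_address(ADD_IF_ZERO, 0, 1)] = ADD | END  # flag set: A <- A + B
    return rom


def run(program, rom):
    a = b = pc = step = 0
    while pc < len(program):
        opcode, operand = program[pc]
        flag = 1 if a == 0 else 0      # simplification: flag is not latched
        word = rom[rom_address(opcode, step, flag)]
        if word & LOAD_A:
            a = operand
        if word & A_TO_B:
            b = a
        if word & ADD:
            a = (a + b) & 0xFF
        if word & END:
            step, pc = 0, pc + 1
        else:
            step += 1
    return a, b


print(run([(LDA, 5), (DOUBLE, 0), (ADD_IF_ZERO, 0)], build_rom()))
```

Swapping in a different ROM image changes the instruction set without touching `run`, which is the property the later "microcode select register" remark relies on.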
05:47:23 <clever> in theory, that could emulate any cpu that fits within the register space
05:47:44 <clever> you could even have another special "microcode select" register, that just sets some more bits in the microcode rom address input
05:47:59 <clever> and then just context-switch to an entirely different arch
05:48:19 <geist> sure, this is not particularly novel. lots of systems did that
05:48:49 <geist> alto is a good example, the microcode ran a N layer OS essentially, where it was switching between a few different hardware tasks (draw the screen, deal with the hard drive, etc) in microcode
05:49:01 <clever> only thing i can see being difficult, is pipelines, and the icache
05:49:12 <geist> the lowest priority task it did when it had nothing else to do is interpret DG Nova instructions
05:49:57 <geist> this Gigatron TTL computer i built a while back does something like that too
05:50:07 <clever> geist: that reminds me, i think the c64 did a weird clocking system, where the cpu and "gpu" take turns accessing ram, one on the rising edge of the clock, the other on the falling edge
05:50:17 <geist> the default built in rom is a kind of microcode, with a fixed sized instruction that directly drives a lot of stuff
05:50:37 <geist> it runs a loop to draw the screen, do audio, and then interpret some 16 bit instructions that have a basic interpreter, etc
05:51:39 <geist> so as a user you can't 'see' the native instruction set. you only deal with this interpreted 16 bit virtual machine
05:52:01 <immibis> clever: have you seen forth?
05:52:43 <clever> immibis: ive helped somebody build a forth thing for a ti80 calculator, complete with emulator, yet i dont know a thing about forth, lol
05:53:28 <immibis> clever: basically it runs on a stack machine where most instructions are calls and functions manipulate data on the stack. CALL push_1; CALL push_1; CALL add; CALL print
05:53:49 <clever> weird
05:53:50 <immibis> (more realistically: CALL push; DW 1; CALL push; DW 1; CALL add; CALL print)
05:54:04 <clever> lua is also stack based, but not really to that extreme
05:54:11 <clever> push wasnt a function call, lol
05:54:15 <nyc`> My skepticism may have been unwarranted.
05:54:19 <clever> (not a lua level function)
05:54:25 <immibis> so on my design, instructions like "subtract" don't exist. Subtract is implemented as: subtract: CALL negate; CALL add; RET
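A small sketch of the Forth-style model immibis describes, where a word is either a primitive or a list of calls to other words, and `subtract` is built from `negate` and `add`. The dictionary layout and the use of Python recursion in place of a return stack are assumptions for illustration, not details of immibis's design.

```python
def prim_add(stack, out):
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)


def prim_negate(stack, out):
    stack.append(-stack.pop())


def prim_print(stack, out):
    out.append(stack.pop())


def run(word, dictionary, stack, out):
    body = dictionary[word]
    if callable(body):              # primitive: executes directly
        body(stack, out)
        return
    i = 0
    while i < len(body):            # compound word: a list of calls
        item = body[i]
        if item == "push":          # CALL push; DW n
            stack.append(body[i + 1])
            i += 2
        else:                       # CALL some_word
            run(item, dictionary, stack, out)
            i += 1
    # falling off the end of the list plays the role of RET


dictionary = {
    "add": prim_add,
    "negate": prim_negate,
    "print": prim_print,
    "subtract": ["negate", "add"],  # subtract: CALL negate; CALL add; RET
    "main": ["push", 5, "push", 3, "subtract", "print"],
}

out = []
run("main", dictionary, [], out)
print(out)
```

From the caller's side, `subtract` and `add` are invoked identically, which is what makes the "is this microcode?" question hard to answer.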
05:55:03 <clever> immibis: that reminds me, the parser for nix (a functional language) will translate 10 - 5, into __sub 10 5
05:55:14 <immibis> would you call this implementation microcode? given that the "CALL subtract" instruction looks identical to "CALL add"
05:55:20 <clever> and __sub is one of the primitive functions, implemented in c++
05:55:37 <clever> but, you can overwrite __sub with your own function, lol
05:57:18 <geist> http://www.righto.com/2016/09/xerox-alto-restoration-day-5-smoke-and.html is a little blurb about it, fascinating
05:57:19 <immibis> actually I have a special format for primitive instructions, and ADD is something like: add: *SP->TEMP; SP--; *SP->ALU_B; TEMP+=ALU_B; TEMP->*SP; RET.
05:57:20 <immibis> is that microcode?
06:07:14 <geist> i think a fairly key property of most microcode is it usually doesn't assume a linear flow through increasing addresses
06:07:25 <geist> most microcode has a next instruction field in the microcode
06:07:57 <geist> and usually microcode has a fairly wide instruction word, is fixed size, and tends to drive signals on the bus directly
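geist's point about control flow can be illustrated with a tiny sequencer in which every micro-op names its successor explicitly, so there is no incrementing micro-PC. The `MicroOp` layout and the signal names are hypothetical.

```python
from typing import NamedTuple, Optional


class MicroOp(NamedTuple):
    signals: tuple          # control lines driven during this cycle
    next: Optional[int]     # address of the next micro-op, None = done


def trace(store, entry):
    """Follow next-address fields from entry and collect the signals."""
    seen = []
    addr = entry
    while addr is not None:
        op = store[addr]
        seen.append(op.signals)
        addr = op.next
    return seen


# The micro-ops are deliberately scattered: order comes from the
# next field, not from increasing addresses.
store = {
    0: MicroOp(("fetch",), 7),
    7: MicroOp(("alu_add", "latch_temp"), 3),
    3: MicroOp(("writeback",), None),
}
print(trace(store, 0))
```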
06:07:59 <clever> immibis: how would you compare forth and lua? and would a native lua cpu be possible?
06:08:02 <immibis> right, my design has an actual microcode ROM (with that property)
06:08:10 <geist> yah
06:08:39 <immibis> i can imagine making one without a microcode ROM, but it would take more components
06:08:50 <immibis> e.g. the ALU wouldn't be reused to increment the program counter
06:08:57 <geist> yah
06:09:34 <immibis> clever: when you say "native lua" you don't mean lua source code, right? Just an ISA optimized to run compiled Lua (with a custom compiler, not necessarily normal Lua bytecode). I think that is entirely possible - there were some attempts to make "native java" CPUs
06:10:23 <clever> immibis: yeah, you could have something to translate the lua bytecode (or lua source) into more of a lua opcode
06:10:40 <immibis> I think you have to be careful about that sort of question because no sane CPU runs a source language directly
06:10:52 <immibis> it's always machine code, but it can be machine code optimized to make a certain source language easy to implement
06:11:00 <clever> yeah
06:11:08 <clever> a related field, is dalvik
06:11:26 <clever> which is mostly just java bytecode, translated to a variant called dalvik, that is then run under android
06:11:41 <geist> a key difference is dalvik was a register based bytecode
06:11:43 <geist> instead of stack based
06:11:45 <clever> but ive heard that modern android versions, will further translate the dalvik to native, at install time
06:12:14 <clever> yeah, dalvik is register based, and the register file is of variable size, so you define the size in the function prologue
06:12:57 <clever> the lua stack behaves in a similar manner, but is of dynamic size, and you must push function args onto the stack before calling a function
06:13:13 <clever> while dalvik lets you just directly pass any registers to a function call
06:13:19 <geist> yah the idea is it's much easier to jit with a register based design like dalvik
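To make the stack-versus-register distinction concrete, here is a sketch of the same expression, 2 + 3 * 4, on two toy interpreters. The instruction encodings are invented for the example and are not Dalvik or Lua bytecode; `nregs` stands in for the register-file size that clever says is declared per function.

```python
def run_stack(code):
    """Stack VM: operands are implicit, taken from the top of the stack."""
    stack = []
    for op, *args in code:
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
    return stack.pop()


def run_register(code, nregs):
    """Register VM: every instruction names its destination and sources."""
    regs = [0] * nregs
    for op, dst, x, y in code:
        if op == "const":
            regs[dst] = x
        elif op == "add":
            regs[dst] = regs[x] + regs[y]
        elif op == "mul":
            regs[dst] = regs[x] * regs[y]
        elif op == "ret":
            return regs[dst]
    return None


stack_code = [("push", 2), ("push", 3), ("push", 4), ("mul",), ("add",)]
reg_code = [
    ("const", 0, 2, None),
    ("const", 1, 3, None),
    ("const", 2, 4, None),
    ("mul", 1, 1, 2),       # r1 <- r1 * r2
    ("add", 0, 0, 1),       # r0 <- r0 + r1
    ("ret", 0, None, None),
]
print(run_stack(stack_code), run_register(reg_code, nregs=3))
```

The register form already names where each value lives, which is one plausible reading of why geist calls it easier to JIT: those names can be handed to a register allocator directly.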
06:14:07 <immibis> you can certainly imagine ISA features to make Lua easier to implement, like tagged values. and the (say) "less than" opcode would either directly compare the operands (if both are numbers) or else it would call a handler function defined in the runtime, based on the types of the arguments
06:14:15 <clever> yeah, you can just pass it thru something like llvm, and let it remap the fixed set of registers to hardware, which llvm already does in llvm IR
06:14:31 <immibis> then numeric comparisons will be free from dynamic typing overhead - there won't be an explicit type check involved, but non-numeric comparisons will also work
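A sketch of the tagged "less than" opcode immibis proposes: numbers take a direct comparison, and anything else is routed to a handler supplied by the runtime. The tag values, the `(tag, payload)` tuple encoding, and the handler table are assumptions for illustration only.

```python
TAG_NUMBER, TAG_STRING = 0, 1   # hypothetical type tags


def op_lt(a, b, handlers):
    """Compare two tagged values given as (tag, payload) tuples."""
    tag_a, val_a = a
    tag_b, val_b = b
    if tag_a == TAG_NUMBER and tag_b == TAG_NUMBER:
        # Fast path: what a hardware comparator would do directly.
        return val_a < val_b
    # Slow path: look up a runtime-defined handler by operand types.
    handler = handlers.get((tag_a, tag_b))
    if handler is None:
        raise TypeError("attempt to compare incompatible values")
    return handler(val_a, val_b)


handlers = {(TAG_STRING, TAG_STRING): lambda x, y: x < y}
print(op_lt((TAG_NUMBER, 1), (TAG_NUMBER, 2), handlers))
print(op_lt((TAG_STRING, "a"), (TAG_STRING, "b"), handlers))
```

In hardware the tag test would happen in parallel with the comparison, so the numeric case pays nothing extra; in this Python model it is just an `if`.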
06:14:57 <clever> hmmm, isnt llvm IR basically a stack machine as well?
06:15:31 <immibis> i'm not sure if there's a clear dividing line between "CPU designed to speed up lua" and "lua interpreter written in microcode"
06:16:48 <immibis> you can move as much of the complexity into the microcode as you want
06:16:54 <immibis> and out of the runtime system
06:17:38 <clever> i think it was in open-computers (a MC mod), they re-implemented lua, in lua
06:17:56 <yrp> you guys ever come back to a channel after not reading a while
06:18:14 <yrp> and immediately want whatever drugs the current conversational participants are on?
06:18:30 <immibis> i wrote a lua interpreter in lua for ComputerCraft (a minecraft mod) but it's pretty pointless
06:19:21 <clever> immibis: open-computers is much more of a retro-computer emulator, complete with a bios rom, floppy drives, hard-drives, networking, and a unix-like OS
06:19:37 <immibis> that already comes with lua doesn't it?
06:19:49 <clever> immibis: the floppy and hard-disk even make the right noises, as the pc does activity on them
06:19:59 <clever> immibis: it has 2 CPU's you can install, one runs lua
06:20:44 <clever> immibis: a random page in its docs: https://ocdoc.cil.li/component:computer
06:20:51 <geist> yrp: hah
06:21:34 <clever> immibis: the basic setup when building your first pc, is to install one of the template bios roms, and then boot from the openos install floppy (which makes proper floppy drive noises), then install the OS to a hard-disk
06:22:01 <immibis> sounds like they implemented lua on a virtual PC then
06:22:11 <clever> unlike computercraft, the lua VM is limited in how much ram it can use, based on what ram sticks you installed into the computer case
06:22:34 <clever> the main lua engine is in java, and it is able to serialize the state of the lua engine, and save it to NBT data
06:22:46 <clever> so when chunks unload, your cpu just suspends
06:23:02 <clever> unlike computercraft, where the whole pc is just shutdown, and you must restart it when reloading
06:27:31 <bluezinc> immibis: computercraft was a lot of fun.
06:28:12 <clever> bluezinc: id say opencomputers is better, more functions, more realistic limitations, more computer-like, not just a single magic block
06:37:29 <bluezinc> clever: I can see that you have strong preferences on the subject.
06:37:40 <bluezinc> Pretty sure opencomputers was after my time.
06:39:22 <nyc`> clever: I thought it was more SSA.
06:40:01 <clever> nyc`: SSA?
06:40:24 <nyc`> clever: Static single assignment
06:40:34 <clever> ah
06:40:45 <clever> i might be mis-remembering llvm ir
07:47:44 <sginsberg> hello
07:50:58 <sginsberg> Im studying x86 PIC BIOS X87 etc
07:52:28 <sginsberg> am trying to get to a BIOS interfacing quickly
07:52:40 <sginsberg> on a so called bare-machine
07:55:49 <sginsberg> in emulation/simulation/virtualisation that is
07:57:02 <mahaxmo> a theoretical apparatus!?
07:59:01 <klys> sginsberg, are you interested in any particular sort of device?
08:02:26 <klys> sginsberg, supposing you were to get a new device, what vendor would you get it from?
08:02:50 <bcos_> sginsberg: Hello. I'm shoving pieces of dead animals into a hole in my head while thinking about doing some shopping later
08:04:21 <geist> oy
08:04:34 <sginsberg> I mean just interfacing (coding) together-with the BIOS to "get back to the code" as quickly as possible
08:04:43 <geist> dead face animals is for good your body
08:05:02 <klys> "Tortoise thought to be extinct for 113 years has been rediscovered on the Galapagos" - Faux News
08:05:15 <sginsberg> that is a project of what is the bare minimum for a self-sustaining system of bootloader/operative/compiler
08:05:16 <geist> turtle face face now
08:06:15 <klys> sginsberg, perhaps you would like the 32-bit multiboot1 interface for your new i386 program?
08:06:36 <sginsberg> multiboot
08:07:09 <immibis> so you just want to know how to call bios functions?
08:07:36 <sginsberg> more like boot.iso -> x86 -> BIOS -> Debugging -> Complete
08:07:54 <sginsberg> have BIOS reference standard already
08:08:08 <sginsberg> standard/manual
08:08:17 <klys> E L T O R I T O
08:08:17 <geist> bios -> debugging -> complete. that's a summary of everything right there
08:08:31 <geist> klys: EL TORITO!
08:09:12 <sginsberg> I mean my own boot.iso, all from scratch
08:09:31 <sginsberg> bare minimum for a system of x86
08:09:50 <mischief> try jmp
08:09:52 <klys> sginsberg, do you have xorriso from the msys2 package or a gnu/linux distribution?
08:11:40 <bcos_> sginsberg: If you have any questions about anything, don't be afraid to ask
08:33:21 <sginsberg> I am using Bochs right now, keeping it simple
08:44:54 <knebulae> @sginsberg: what stage in the process are you at?
08:48:04 <sginsberg> knebulae: documenting, researching, structuring
08:48:18 <knebulae> @sginsberg: anything you want to share?
08:49:20 <aalm> sharing is not part of the homework
08:49:29 <klys> .theo
08:49:29 <glenda> So glad to have the expert speak.
08:49:37 <sginsberg> knebulae: sure
08:50:15 <aalm> .roa
08:50:15 <glenda> 16 A deal is a deal.
08:50:48 <knebulae> .theo
08:50:48 <glenda> You haven't justified it.
08:50:52 <sginsberg> https://pastebin.com/0UDqTVPt @knebulae
08:52:09 <klys> looks like nasm code
08:52:48 <knebulae> well, zeroing all the registers doesn't do much, that's for sure.
08:53:03 <klys> then again the comment looks like cpp
08:53:07 <knebulae> ia32 registers that is
08:53:31 <sginsberg> it does mostly sanitize the context
08:53:43 <sginsberg> like context switching
08:53:59 <aalm> sure
08:54:24 <klys> sginsberg, you'll need mm routines before you can do sched
08:54:25 <knebulae> @sginsberg: well, I'm not exactly sure why you'd want to do that, especially in that manner.
08:54:47 <sginsberg> just a clear context here
08:55:10 <knebulae> and I don't think you can zero the instruction pointer
08:55:34 <aalm> does good for the pipeline
08:55:45 <klys> well you could, though that would trash the real-mode ivt.
08:56:28 <aalm> better keep the pipes in line
08:56:45 <knebulae> @klys: well, in 2019, I just couldn't think of any reason most of that code would be necessary. BUT- that being said, it is either *very* late where I am, or *very* early. And I haven't slept.
08:56:47 <sginsberg> it does sort of Refresh the pipeline
08:56:58 <sginsberg> xor eip, eip together-with ret
08:57:18 <aalm> are the pipes loaded?
08:57:38 <knebulae> oh, the pipes are definitely loaded somewhere
08:58:13 <sginsberg> more specifically it blanks the pipeline
09:03:56 <sginsberg> https://pastebin.com/bvFuZgfx does same to cache
09:05:34 <klys> sginsberg, you're on a runaway cs descriptor if you can do that to cr0 without knowing physical memory
09:05:41 <aalm> you should indent
09:07:39 <sginsberg> cs does not matter like that if claiming the storage first
09:08:28 <klys> the descriptors are in the gdt, and you have to jump to real-mode with a known location. you should also set up idt.
09:10:36 <sginsberg> I am claiming the entire TSS too
09:11:33 <immibis> what is claiming?
09:11:44 <klys> yeah, if all you ask for is bits and pieces then that's all you get
09:12:03 <sginsberg> https://pastebin.com/jvB2xnpY klange
09:12:05 <sginsberg> klys*
09:12:34 <aalm> o ok
09:12:55 <klys> sginsberg, I don't think those are real x86 instructions.
09:13:04 <immibis> what is "claiming"?
09:13:59 <klys> I am particularly puzzled by this xor eip, eip
09:14:10 <aalm> .theo
09:14:10 <glenda> The rest of what you are yapping about does not matter.
09:15:15 <sginsberg> immibis: as in taking control entirely
09:15:33 <sginsberg> I do have this verified by a co
09:15:42 <immibis> oh so you want to make sure you know all the register values
09:16:35 <immibis> you have to make sure they have valid values... do you know what EIP does?
09:16:47 <sginsberg> just think sanitizing entire context
09:16:52 <klys> sginsberg, https://ideone.com/ select nasm and test your codes.
09:17:42 <sginsberg> does it fail with nasm?
09:18:36 <klys> that is, Assembler (32-bit) nasm from the drop box that says "Java" in the lower left corner.
09:18:38 <immibis> if I remember correctly, you cannot access EIP
09:18:48 <immibis> except by jump instructions, calls, etc
09:18:58 <immibis> you can't use it in an xor instruction
09:19:05 <immibis> I could be wrong though
09:19:51 <sginsberg> im assured the coding works as intended
09:28:55 <bcos_> sginsberg: I assure you otherwise - it's impossible to do anything even vaguely similar to "xor eip,eip"
09:29:22 <bcos_> (or "lea idt, 0xFFFFFFFF", or....)
09:30:37 <nyc`> One can jmp $0 or whatever the syntax for immediates is.
09:30:57 <bcos_> "jmp 0" is fine but doesn't set flags like an "xor eip,eip" would..
09:32:16 <immibis> sginsberg: have you tried it?
09:32:20 <nyc`> Flags must be a terrible bottleneck for the pipeline and OOOE.
09:32:38 <bcos_> nyc`: Tricky, scams and hacks, all the way down!
09:33:01 <immibis> I think they are
09:33:25 <bcos_> (why calculate flags when you can pretend that you did and just store details of the previous "flags-affecting" op in case you actually want flags later)
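The trick bcos_ describes (remember the last flag-affecting operation and only work out the flags if something asks for them) is also a common approach in software emulators. A minimal sketch follows, assuming a toy 32-bit ADD as the only flag-affecting operation; the class and method names are hypothetical and not taken from any real emulator or CPU design.

```python
MASK32 = 0xFFFFFFFF
SIGN32 = 0x80000000


class LazyFlags:
    """Toy model: record the last ADD instead of computing flags eagerly."""

    def __init__(self):
        # (a, b, result) of the most recent flag-affecting operation.
        self.last = (0, 0, 0)

    def add(self, a, b):
        a &= MASK32
        b &= MASK32
        result = (a + b) & MASK32
        self.last = (a, b, result)  # store the operands, compute no flags yet
        return result

    # Each flag is derived on demand from the recorded operation.
    def zero_flag(self):
        return self.last[2] == 0

    def sign_flag(self):
        return bool(self.last[2] & SIGN32)

    def carry_flag(self):
        a, b, _ = self.last
        return (a + b) > MASK32

    def overflow_flag(self):
        a, b, result = self.last
        # Signed overflow: operands share a sign, result has the other sign.
        return bool(~(a ^ b) & (a ^ result) & SIGN32)
```

If a later instruction overwrites the flags before anything reads them, the recorded operation is simply replaced and no flag computation ever happens.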
09:34:43 <klys> sginsberg, actually ideone is only good for userspace stuff like this: https://ideone.com/ji9wrU so get you a copy of nasm for your own use.
09:36:19 <bcos_> Hrm - don't forget CPU has "power on default variable values", so you know what will be in any register without doing anything
09:37:47 <bcos_> ..and for real firmware the main annoyance is that RAM doesn't work (and you can't use stack, or call or ret or iret) until the firmware configures the memory controller, etc
09:38:43 <bcos_> (or at least does a horrible "cache as RAM" hack)
09:42:18 <sginsberg> the whole thing https://pastebin.com/dK6dVik6
09:43:27 <bcos_> Over 50% of those instructions are nonsense
09:44:03 <bcos_> ..and maybe 100% of the comments
09:44:11 <klys> sginsberg, folks are right to look at that pseudocode and think of it as wishful thinking. you need to apply yourself. click here https://www.msys2.org/
09:48:13 <immibis> ** Whole: Interrupt-Ento-Itself <- what does this mean?
09:48:36 <immibis> this reminds me of free-energy-conspiracy-theory-speak
09:48:52 <immibis> where it doesn't mean anything but they're trying to create the impression of not being stupid
09:55:28 <klys> curiously, that ideone program worked with a simple nasm -f elf -o 0002.o ideone-0001.asm; i686-linux-gnu-ld -o 0002 0002.o; ./0002
09:57:54 <bcos_> klys: Show me a disassembly from "objdump".. ;-)
09:58:52 <klys> https://paste.debian.net/1069301/
09:59:31 <immibis> why is it curious?
09:59:39 <klys> first times
10:00:45 <klys> it comes to mind that one might make a userspace program that emulates a basic machine with simple kernel calls like this.
10:01:06 <immibis> what do you mean?
10:01:50 <klys> the curiosity of making a simple machine (turing-complete) emulator struck me when I was able to run it.
10:03:26 <immibis> you can make an emulator, they aren't that hard, especially if it's for a simple made-up cpu architecture
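To give a sense of how small such an emulator can be, here is a minimal sketch of an interpreter for a made-up four-register machine. The instruction set (LOADI, ADD, SUBI, JNZ, HALT) is entirely hypothetical and exists only for this illustration.

```python
def run(program, max_steps=10_000):
    """Interpret a made-up 4-register machine and return its registers."""
    regs = [0, 0, 0, 0]
    pc = 0
    for _ in range(max_steps):
        op, a, b = program[pc]
        pc += 1
        if op == "LOADI":    # regs[a] = immediate b
            regs[a] = b & 0xFFFFFFFF
        elif op == "ADD":    # regs[a] += regs[b]
            regs[a] = (regs[a] + regs[b]) & 0xFFFFFFFF
        elif op == "SUBI":   # regs[a] -= immediate b
            regs[a] = (regs[a] - b) & 0xFFFFFFFF
        elif op == "JNZ":    # jump to instruction index b if regs[a] != 0
            if regs[a] != 0:
                pc = b
        elif op == "HALT":
            return regs
        else:
            raise ValueError(f"unknown opcode {op!r}")
    raise RuntimeError("step limit reached without HALT")


# Sum 5 + 4 + 3 + 2 + 1 into r0, using r1 as the loop counter.
PROGRAM = [
    ("LOADI", 0, 0),
    ("LOADI", 1, 5),
    ("ADD", 0, 1),    # index 2: loop body
    ("SUBI", 1, 1),
    ("JNZ", 1, 2),
    ("HALT", 0, 0),
]
```

Calling `run(PROGRAM)` leaves 15 in the first register. A real CPU emulator adds memory, instruction decoding from bytes, and interrupts, but the fetch/execute loop keeps this shape.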
10:04:27 <klys> it ought to be attached to the project of designing an alu in vhdl
12:23:10 <nyc`> immibis: They need to read about free energy to understand that it is liber, not gratis.
12:24:37 <lkurusa> Can someone ban GsBush23?
12:32:01 <klange> lkurusa: sorry, was a bit occupied :)
12:33:10 <lkurusa> klange: ah no worries :)
12:33:14 <lkurusa> thanks!
03:17:17 <knebulae> .theo
03:17:17 <glenda> I really don't give a shit what you admire or not.
03:17:23 <knebulae> perfect.
03:17:33 <knebulae> I got tired of playing with the bot quickly last night.
03:17:49 <knebulae> *this morning
03:19:01 <None86x> :)
03:24:45 <knebulae> I was so tired, at first, I thought he was trying to do something clever in the context switch code, but I rubbed my eyes and I'm like those are all xor instructions. And dude, wtf are you doing with eip? Then I was just like f*ck it. This is either a bot or someone f*cking with us...
03:26:32 <lkurusa> Better.theo
03:26:34 <lkurusa> .theo
03:26:34 <glenda> Your attitude stinks. Good luck with life.
03:26:39 <lkurusa> .bullshit
03:26:39 <glenda> exit status 127
03:26:40 <lkurusa> .ken
03:26:40 <glenda> It's always good to take an orthogonal view of something. It develops ideas.
03:26:57 <lkurusa> .linus
03:27:00 <lkurusa> :(
03:33:07 <None86x> How long did it take you guys to switch from VGA to something else?
03:34:00 <lkurusa> I switched to the Bochs video thing after a while
03:34:18 <lkurusa> not sure how long exactly, but i wanted to show QR codes and BGA could do high res enough, whereas VGA could not
03:35:36 <None86x> ok i see
04:04:03 <knebulae> @None86x: I went uefi and get excellent video. I mean, Intel is taking away bios. Why run in quicksand?
04:05:38 <rakesh4545> hello, is there any channel for compiler design?
04:06:50 <lkurusa> rakesh4545: #proglangdesign
04:07:27 <rakesh4545> thanks.
04:08:44 <lkurusa> rakesh4545: np
05:35:04 <doug16k> my android phone ran text-to-speech of the MTP spec last night (a lullaby for me). 8 hours straight of that and it only used 34% of battery. pretty impressive
05:35:39 <doug16k> I want to hug that UK text-to-speech lady. so clear
07:24:06 * geist yawns
08:00:32 <geist> mrvn: not to necro an old discussion, but i was just reading the ARM manuals for particular popular cores (cortex-a53, -a76, etc) and they say which TLB sizes are directly implemented
08:01:10 <geist> predictably the TLB directly seems to always support the common sizes (4K, 64K, 2MB) and then usually at least one extra size, like 512MB (1GB entries are split)
08:01:36 <geist> the higher end core (a76) gets 16K and 32MB as well
08:01:59 <geist> so basically except for the really large sizes (1GB+) there's basically full TLB support for the different sizes
08:34:23 <mrvn> geist: and how much of that is used at all?
08:37:02 <geist> what do you mean?
08:37:40 <geist> you mean on real systems? good question. since most big oses that run on arm are i'm going to bet almost exclusively linux, it's whatever linux does there
08:37:49 <geist> it'll use big pages for the kernel and whatnot, for sure
08:37:59 <geist> but question is whether or not linux actually uses 64k combined pages or not
08:38:15 <geist> i know that iOS uses 16K page granules too, i guess
08:40:31 <mrvn> geist: exactly. Other than the kernel mapping all the physical ram as large as possible anything but 4k is hugely underused if at all.
08:41:10 <mrvn> will linux combine pages at all? What sizes will it use?
08:41:37 <mrvn> programs that explicitly use hugetlb support are pretty much zilch.
08:42:14 <mrvn> I know linux has some work done on using 64k page granularity. So maybe that at least is used on arm.
08:43:47 <mrvn> anyway, so far I'm a long way away from optimizing things like page sizes.
08:45:17 <nyc``> The thing that murdered me with 64K pages was read() on small files.
09:02:09 <nyc> Things reading small files would get massive page zeroing overhead from larger base page sizes (granules in ARM parlance).
09:12:41 <mrvn> nyc: you don't read 64k into cache and then memcpy to userspace?
09:13:32 <nyc> The pagecache page would be fully zeroed. The memcpy would be some small number of bytes because the file wasn't 64KB long.
09:13:54 <mrvn> nyc: ahh, so you didn't remember partially filled pages
09:14:04 <nyc> So it would blow the cache and burn a noticeable number of cycles.
09:14:19 <mrvn> I'm cleaning pages on free.
09:15:06 <mrvn> I'm thinking of DMAing them though.
09:17:26 <nyc> Store a zero page on disk somewhere and do disk IO to zero out pages without blowing the cache?
09:17:54 <mrvn> nyc: use the memory to memory DMA engine of the cpu
09:18:06 <nyc> They have those?
09:18:11 <aalm> sure
09:18:16 <mrvn> RPi does. Not sure if all ARM do.
09:18:28 <aalm> most do
09:19:02 <nyc> It seems architecture-dependent.
09:19:04 <aalm> most cortex-A's at least
09:19:10 <aalm> no
09:19:24 <aalm> it is SoC dependent :p
09:24:16 <doug16k> nyc, block storage devices and filesystems work on blocks. why would you ever zero page cache pages unless it is allocating a free cluster and preparing to write-back cache it
09:25:05 <nyc> doug16k: They have to be zeroed because in theory they could be mmap()'d.
09:25:06 <mrvn> doug16k: because a block is 512-4096 byte and a page 64k.
09:25:32 <nyc> The thought behind DMA'ing a zero page from disk is basically to zero out pages without blowing the cache.
09:25:39 <mrvn> doug16k: and if you have tail compression you would have individual bytes to copy into the cache page
09:25:39 <doug16k> ah I didn't account for the cluster being smaller than the page. nevermind :)
09:27:34 <doug16k> a chunk of what I meant still stands though - you'd be doing DMA to read actual whole clusters. how could data already in those pages ever matter?
09:28:09 <doug16k> you mean only in the case of a 10 byte write to a 4KB file?
09:29:11 <doug16k> s/file/cluster/
09:30:06 <mrvn> doug16k: not all the data in the cluster might belong to the file. The user should have no access to that
09:31:29 <doug16k> user? I thought we were talking about caching at block device and filesystem level
09:31:51 <mrvn> doug16k: and the user mmaps them for example
09:32:14 <eryjus> mrvn are you referring to the old stacker algorithms?
09:32:15 <doug16k> oh user mmaps 4KB with 64KB pages? illegal in posix
09:32:37 <nyc> Someone is very confused.
09:32:53 <mrvn> doug16k: no, 64kb. But you just loaded 62k of non related data into that page.
09:32:55 <doug16k> you can't map 4KB if the pages are 64KB. sorry
09:33:34 <mrvn> doug16k: or you have data left over from the last user if you don't DMA the full 64k from disk.
09:33:51 <nyc> That's not the case where there was waste.
09:34:01 <nyc> The case where there was waste was just short files.
09:34:31 <mrvn> that's what I'm talking about
09:34:37 <nyc> $ ls -ls /etc/debian_version
09:34:37 <nyc> 4 -rw-r--r-- 1 root root 11 Jun 25 2017 /etc/debian_version
09:34:41 <nyc> There's an example.
09:35:11 <mrvn> that would copy 11 bytes or DMA 512 bytes into the 64k page. The rest you have to zero.
09:35:45 <nyc> Apparently so many of those things are accessed that when you jack up the page sizes e.g. on IA64 which has all the powers of 4 up from 4KB, you get large increases in profile time for page zeroing.
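The cost being discussed can be put into numbers with a short calculation. This is only a sketch of the arithmetic, using the 11-byte /etc/debian_version from the listing above; the function name is hypothetical.

```python
def tail_bytes_to_zero(file_size, page_size):
    """Bytes of the last page-cache page that lie past end-of-file."""
    if file_size == 0:
        return 0  # no page is needed for an empty file
    used = file_size % page_size
    return 0 if used == 0 else page_size - used


# The 11-byte file from the listing, under two base page sizes.
for page_size in (4096, 65536):
    wasted = tail_bytes_to_zero(11, page_size)
    print(f"{page_size:>6}-byte page: zero {wasted} bytes to serve 11")
```

With a 4KB page the kernel zeroes 4085 bytes for this file; with a 64KB page it zeroes 65525, which is roughly sixteen times the work for the same 11 bytes of data.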
09:36:31 <mrvn> nyc: are they mmaped though? For read/write you don't have to zero the rest of a page.
09:36:54 <nyc> They are not mmap()'d, they're almost all accessed by read().
09:38:31 <doug16k> so your scenario is, the filesystem code is buggy and lets you read past the end of a file into data you didn't DMA in, so you have to zero the rest of the cache page. is that right?
09:38:43 <nyc> No.
09:39:17 <doug16k> I guess I am confused then, I'll get out of the way
09:39:20 <mrvn> doug16k: no. just the case where you mmap a file
09:39:51 <nyc> You could, in principle, tail pack in memory and wait for mmap() of tiny files to zero fill.
09:39:55 <mrvn> doug16k: most small files aren't mmaped but you have to handle that case.
09:40:38 <nyc> I was never going to get that past Linus, so I didn't bother.
09:46:03 <doug16k> do you have to handle that? isn't it a bus error if you try to write back dirty mmap pages past the end of the underlying file (past because of page size granularity)?
09:47:00 <sandlotasn> Hey guys, having a weird problem that someone might be able to help me with. I'm trying to enable timer interrupts for a RISC-V32 arch (SiFive HiFive board). I am able to read the timer_{hi,lo} values, and add an increment to them that's a higher value than the current timer, but then the timer just passes the comparison value and doesn't trigger a trap which I think is really weird
09:47:14 <sandlotasn> Any ideas?
09:47:50 <doug16k> sandlotasn, something as silly as interrupts masked?
09:48:19 <doug16k> easy to miss something like that when you are up to your ears in new stuff
09:49:05 <mrvn> doug16k: the problem is that last page that is still valid
09:49:35 <mrvn> doug16k: you can't leak infos in the unused parts of the last page of a file
09:50:36 <sandlotasn> Hmm, I don't think that's the case. At the moment, my trap handler just saves register context, prints out the cause of the trap, and restores context and returns
09:51:05 <sandlotasn> So unless the masking is happening on its own, I wouldn't expect that to be true
09:52:09 <doug16k> mrvn, ok I got you. yes you have to clear the partial page in the last page if the underlying file doesn't go to the end of the cluster
09:52:31 <doug16k> s/cluster$/page/
09:59:29 <mrvn> doug16k: With 64k pages it probably makes sense to optimize for that case and only do it for mmap.
10:00:52 <nyc> With powers of 4 like various arches have, I think there are 256KB, 1MB, 4MB, 16MB, etc. upward from that.
10:01:17 <nyc> But anyway, tail packing in memory helps a lot.
10:04:47 <doug16k> mrvn, ya
10:05:20 <nyc> The parts you really can't avoid are ZFOD pages. malloc() has a minimum it can request from the kernel.
10:06:00 <mrvn> nyc: but all those parts are getting used.
10:06:11 <nyc> No, they're not.
10:06:24 <mrvn> you only fault in the pages that get accessed. so they are used.
10:06:34 <nyc> Also likewise with COW affairs for the GOT or whatever.
10:07:08 <nyc> Bad definition of used. Use that and you deem the whole first 64KB page of a file with only 9B in it as used.
10:07:26 <mrvn> nyc: malloc() has no file
10:08:28 <mrvn> What sucks is if you have a write buffer. First byte written has to zero a page and fault it in. Then the app overwrites it all without ever reading a byte.
10:08:33 <nyc> It doesn't matter. The actually accessed parts of the 64KB page are often small.
10:09:08 <doug16k> the question is, do you memset right then or do you try to use idle time to fill dirty free pages? if you do the latter, there is already a fountain of zeroed pages right there
10:09:42 <mrvn> Or even smarter. Keep a stack of freed pages linked in each process. Then when the process needs memory give it back its own pages dirty.
10:10:23 <nyc> I have my own answers.
10:10:33 <mrvn> Free too much and you return some to the generic cleaner for general reuse. Allocate too much and you grab some zeroed pages from the general pool. Otherwise no overhead.
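mrvn's scheme can be sketched roughly as follows. This is an illustration under stated assumptions, not anyone's actual kernel code: pages are modelled as bytearrays, and the class names, the `max_cached` threshold, and the `to_clean` list are all hypothetical.

```python
class ZeroedPool:
    """Stand-in for the system-wide pool of pre-zeroed pages."""

    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.to_clean = []  # dirty pages waiting for the background cleaner

    def get_zeroed(self):
        return bytearray(self.page_size)  # a fresh, all-zero page

    def give_back(self, page):
        self.to_clean.append(page)


class ProcessPages:
    """Per-process stack of freed pages, reused dirty by the same process."""

    def __init__(self, pool, max_cached=2):
        self.pool = pool
        self.max_cached = max_cached
        self.free_stack = []

    def alloc(self):
        if self.free_stack:
            # The process only sees its own old data, so no zeroing is needed.
            return self.free_stack.pop()
        # Allocated more than was cached: take a zeroed page from the pool.
        return self.pool.get_zeroed()

    def free(self, page):
        if len(self.free_stack) < self.max_cached:
            self.free_stack.append(page)
        else:
            # Freed too much: hand the surplus to the general cleaner.
            self.pool.give_back(page)
```

The point of the design is that a page never crosses a process boundary without passing through the cleaner, so reuse inside one process can skip zeroing entirely.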
10:12:19 <doug16k> afaik, old computers liked zeroing in idle time, new computers like warming up the cache on first touch (from the 0 fill). it'll write-allocate cache lines for the page that has just been touched 150ns ago when it demand faulted
10:13:11 <doug16k> for mmap backed by a file, the cost of a memset is going to be 0.00 compared to that I/O
10:14:45 <nyc> It isn't just the pagecache page.
10:14:52 <nyc> Also the write buffer in userspace.
10:17:04 <nyc> Er, sorry, read/write buffer (usually read).
10:17:44 <doug16k> if you are posix at all you have to give zeroed pages for mmap, so their libc heap will be backed by zeroed pages anyway
10:18:19 <nyc> doug16k: The issue is granularity or internal fragmentation.
10:18:36 <doug16k> and I'd add that if you aren't posix you are crazy to not zero the pages :)
10:18:48 <nyc> doug16k: https://en.wikipedia.org/wiki/Fragmentation_(computing)#Internal_fragmentation
10:19:31 <doug16k> what, now? I want to see enough fragmentation to blow 100TB+
10:19:33 <mrvn> doug16k: posix doesn't have to zero pages at all
10:20:23 <mrvn> doug16k: as for "compared to that I/O". My 200E intel SSD does 2GiB/s.
10:20:49 <doug16k> "The system shall always zero-fill any partial page at the end of an object"
10:21:24 <doug16k> therefore anonymous mapping has to be zeroed altogether because the "object" is zero sized
10:21:30 <mrvn> My old desktop does <900MiB/s out of cache.
10:22:03 <mrvn> doug16k: accessing past the size of the file is undefined.
10:22:04 <doug16k> my 1TB samsung 960 Pro nvme hits 3.2GB/s read
10:22:23 <mrvn> doug16k: I said mine was cheap. :)
10:22:33 <doug16k> mine wasn't. you win there lol
10:22:46 <mrvn> 2TB for 208E
10:23:34 <mrvn> Funny thing is specs say it only does 1800MB/s.
10:23:46 <mrvn> Maybe reading the zero page is faster
10:24:03 <mrvn> (haven't written to the disk yet)
10:24:38 <doug16k> it would extend latency horribly wouldn't it?
10:25:01 <nyc> DMA zeros off a known location on disk to do background zeroing without polluting the cache.
10:25:04 <doug16k> what would you do? enqueue an I/O on a block storage interface and have that thread wait for I/O for a memory access?
10:25:11 <mrvn> 220k iops/s according to specs
10:25:42 <doug16k> ok. now how many pages can you memset per second
10:25:45 <mrvn> doug16k: by zero page I meant that nothing on the disk has been allocated yet. It's totally blank.
10:26:12 <mrvn> doug16k: the SSD knows the data block is zero so it doesn't have to actually read anything
10:26:48 <doug16k> should be in the tens of GB/sec to memset zeros
10:27:36 <doug16k> in L2 it's ludicrous GB/sec
10:27:44 <mrvn> doug16k: the cache filling takes forever.
10:28:32 <mrvn> it's also a lot about what you push out of the cache by zeroing
10:28:45 <doug16k> out where? to L3?
10:29:04 <doug16k> 4KB won't even come close to evicting L2
10:29:11 <mrvn> doug16k: 64k
10:29:20 <doug16k> even that is about 1/4 of it
10:29:32 <mrvn> it's usually all of L1 cache
10:29:43 <doug16k> everything is all of L1
10:29:49 <mrvn> 4k isn't
10:29:56 <doug16k> true
10:30:25 <doug16k> there's a reason everyone is using 4KB pages
10:30:30 <mrvn> So reading 4k pages stays in L1 cache, 64k pages drop to L2 cache, 2M pages drop to L3 cache. It all costs something.
10:30:33 <doug16k> and it isn't because they suck
10:30:55 <doug16k> the pages or the everyone!
10:32:14 <mrvn> Maybe instead of some algorithm of combining pages to 64k or 2M the system should have something to break up pages into 4k chunks for tails of files instead.
10:32:46 <mrvn> pretend to use 64k pages but for those tails actually map in 4k granularity.
10:33:28 <doug16k> I keep thinking, where's the perf analysis showing that the TLB is the bottleneck
10:33:38 <doug16k> where's the workload
10:33:47 <doug16k> and where's the implementation? :D
10:34:36 <doug16k> planning is good and all but isn't this just premature optimization?
10:35:04 <mrvn> doug16k: just run tests on systems with different TLB sizes and you will see.
10:36:31 <mrvn> Where I saw a problem for x86 was that most CPUs have one TLB for combined pages and one TLB for 4k pages. Only using 4k pages wastes some cache. But not using any 4k pages is worth because the 2M/1G TLB has way fewer entries.
10:36:49 <mrvn> No idea how ARM organizes the TLB for different sizes
10:36:59 <mrvn> s/worth/worse/
10:37:38 <mrvn> doug16k: nyc mentioned he profiled his kernel and saw page zeroing rise.
10:37:58 <nyc> mrvn: /whois nyc --- I'm not a he
10:38:16 <mrvn> /whois nyc
10:38:32 <mrvn> so you aren't.
10:38:50 <nyc> Nick: nyc
10:38:50 <nyc> Real name: Nadia Yvette Chambers
10:38:57 <mrvn> nyc: now you outed yourself. all the boys will fall at your feet and grab at them.
10:39:21 <mrvn> .oO(There goes your anonymity :)
10:39:21 <nyc> mrvn: I'm fat and ugly. I don't have to worry about that.
10:39:22 <FreeFull> Don't listen to her
10:39:27 <FreeFull> She's actually New York City
10:39:36 <mrvn> fat and ugly, right.
10:39:51 <mrvn> (the city, not nyc)
10:40:17 <nyc> If you don't think I'm a real person, you can find credits to me in Linux.
10:40:32 <doug16k> who wants anonymity?
10:40:44 <mrvn> nyc: I have no opinion on that and truly don't care.
10:41:06 <FreeFull> Honestly beauty is overrated
10:41:24 <mrvn> FreeFull: beauty is skin care product deep?
10:41:31 <nyc> mrvn: It is public evidence that at least the person I am claiming to be exists.
10:41:39 <FreeFull> mrvn: Or good genetics
10:41:52 <mrvn> FreeFull: well, maybe about that one should care
10:42:00 <mrvn> Darwin and all
10:42:19 <doug16k> nyc, no OS repos there, yet!
10:42:44 <nyc> doug16k: I should do an upload to a private repo real quick.
10:43:13 <nyc> https://techcrunch.com/2019/01/07/github-free-users-now-get-unlimited-private-repositories/
10:43:33 <mrvn> Oh, I should use that
10:43:46 <FreeFull> I keep all my repos public
10:43:51 <FreeFull> That way people can be thoroughly unimpressed
10:43:59 <doug16k> ya public ftw
10:44:22 <doug16k> give your planet a few more research points
10:45:07 <aalm> .theo
10:45:08 <glenda> I love this conversation.
10:45:35 <nyc> I probably need to rename it to yeos or some such.
10:48:13 <nyc> Right now I've been using nmcp.
10:49:15 <nyc> I'll probably s/nmcp/krn/g everywhere in there so it doesn't matter what the project name is.
10:54:41 <nyc> $ find . -iname '*krn*' | xargs echo
10:54:41 <nyc> ./sys/sparc/krn.lds ./sys/power/krn.lds ./sys/arm/krn.lds ./sys/riscv/krn.lds ./sys/mips/krn.lds ./bld/sparc/sys/sparc/krn.map ./bld/sparc/sys/sparc/krn ./bld/power/sys/power/krn.map ./bld/power/sys/power/krn ./bld/arm/sys/arm/krn.map ./bld/arm/sys/arm/krn ./bld/riscv/sys/riscv/krn.map ./bld/riscv/sys/riscv/krn ./bld/mips/sys/mips/krn.map ./bld/mips/sys/mips/krn
10:55:20 <nyc> $ find . -name '*nmcp*'
10:55:20 <nyc> $ grep -lr nmcp .
10:55:20 <nyc> $
11:14:06 <nyc> Okay, it's checked in.
11:20:51 <jmp9> Hi guys
11:20:58 <jmp9> i have question about instruction MOVSQ
11:21:09 <jmp9> does it increment RDI and RSI both or only RSI?
11:21:53 <isaacwoods> jmp9: all your questions can be answered by the manual
11:22:04 <jmp9> https://www.felixcloutier.com/x86/movs:movsb:movsw:movsd:movsq
11:23:16 <jmp9> (R|E)SI ← (R|E)SI + 8; (R|E)DI ← (R|E)DI + 8;
11:23:18 <jmp9> wtf is this
11:24:40 <isaacwoods> "take the value of RSI, add 8, move into RSI" and the same for RDI
11:25:30 <eryjus> that's telling you it is incrementing both the *SI and *DI registers by the size of the data you are moving (quadwords)
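The manual's pseudocode for a single MOVSQ can be modelled in a few lines. This is a sketch only: memory is a bytearray, the registers are plain integers, and the function name is hypothetical. The manual line jmp9 quoted is the DF = 0 branch; with the direction flag set, both registers are decremented by 8 instead.

```python
def movsq(mem, rsi, rdi, df=0):
    """Model one MOVSQ: copy a quadword, then step BOTH pointers by 8."""
    if min(rsi, rdi) < 0 or max(rsi, rdi) + 8 > len(mem):
        raise IndexError("quadword access outside modelled memory")
    mem[rdi:rdi + 8] = mem[rsi:rsi + 8]  # [RDI] <- [RSI], 8 bytes
    step = -8 if df else 8               # DF=0 increments, DF=1 decrements
    return rsi + step, rdi + step        # new RSI, new RDI


mem = bytearray(range(32))
rsi, rdi = movsq(mem, 0, 16)
print(rsi, rdi)  # both registers advanced by 8
```

So the answer to the question is that RSI and RDI both move, in the same direction and by the same amount.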
11:34:04 <jmp9> kmain.c:(.text+0x5): relocation truncated to fit: R_X86_64_32 against `.rodata' kobj/stdlib.o: In function `ksscanf':
11:34:06 <jmp9> what's this
11:35:09 <eryjus> it looks like you are mixing 32-bit and 64-bit code in the same binary
11:36:10 <eryjus> as a result the linker cannot find the functions you are looking for in the right bitness
11:38:44 <jmp9> every compiled file is ELF64
11:38:47 <jmp9> object file
11:45:00 <jmp9> -mcmodel=large fixed problem
11:47:59 <eryjus> gotta remember that one
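For context on why -mcmodel=large helps: an R_X86_64_32 relocation stores only 32 bits, and the linker has to verify that the stored value zero-extends back to the full 64-bit address (R_X86_64_32S is the sign-extending variant). A kernel linked at a high virtual address fails that check. The sketch below mirrors the two checks; the function names are hypothetical, and 0xFFFF803FC0000000 is an assumed example of a canonical high-half address, not the address jmp9 actually used.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
MASK32 = 0xFFFFFFFF


def fits_r_x86_64_32(value):
    """True if the low 32 bits zero-extend back to the 64-bit value."""
    value &= MASK64
    return value == (value & MASK32)


def fits_r_x86_64_32s(value):
    """True if the low 32 bits sign-extend back to the 64-bit value."""
    value &= MASK64
    low = value & MASK32
    if low & 0x80000000:
        low |= MASK64 ^ MASK32  # replicate the sign bit into bits 32..63
    return value == low


for addr in (0x100000, 0xFFFFFFFF80100000, 0xFFFF803FC0000000):
    print(hex(addr), fits_r_x86_64_32(addr), fits_r_x86_64_32s(addr))
```

A low address such as 0x100000 passes both checks, an address in the top 2GB passes only the sign-extended one (the range -mcmodel=kernel is meant for), and the assumed high-half address passes neither. For that last case the large model's full 64-bit absolute addresses are one way out.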
11:50:23 <nyc> Is there a channel to bikeshed about licenses?